UK National HPC Service

Newton (SGI Altix 3700)

Floating Point Performance Issues on Itanium 2 Processors

The Itanium 2 processor can deliver 2 floating point multiply-adds per clock cycle, giving a peak performance of 6 GFlops on the 1.5GHz processors. Many codes cannot reach this figure because data cannot be delivered to the processor quickly enough. A further complication, operations on denormal numbers, can slow performance by a far larger amount if left untreated: a factor of about 25 in the example timed further down this page.
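The peak figure follows directly from the numbers quoted above. A minimal sketch of the arithmetic, assuming each multiply-add counts as two floating point operations (the constant and function names are illustrative only):

```python
# Peak floating point rate of one Itanium 2, using the figures quoted above.
CLOCK_HZ = 1.5e9       # 1.5GHz processor
FMA_PER_CYCLE = 2      # multiply-adds completed per clock cycle
FLOPS_PER_FMA = 2      # assumption: one multiply plus one add


def peak_gflops(clock_hz=CLOCK_HZ, fma_per_cycle=FMA_PER_CYCLE):
    """Return the theoretical peak rate in GFlops."""
    return clock_hz * fma_per_cycle * FLOPS_PER_FMA / 1e9


print(peak_gflops())  # 6.0
```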

Denormal numbers, also called subnormal or underflow numbers, are defined in the IEEE standard for floating point as nonzero numbers whose magnitude lies below the normal range (below the value returned by the Fortran intrinsic TINY). Operations involving these numbers cannot be performed on the processor and have to be handed to the operating system, which carries a huge penalty.
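The definition can be tried out in a few lines. This is only a sketch in Python rather than Fortran, and it assumes Python floats are IEEE 64 bit numbers (true on mainstream platforms), so sys.float_info.min plays the role of TINY for a 64 bit real. The helper name is_denormal is hypothetical.

```python
import sys

# Smallest positive normal 64 bit number; the analogue of Fortran's TINY.
TINY64 = sys.float_info.min


def is_denormal(x):
    """True if x is nonzero and smaller in magnitude than the smallest normal number."""
    return x != 0.0 and abs(x) < TINY64


print(is_denormal(TINY64))      # False: still in the normal range
print(is_denormal(TINY64 / 2))  # True: below the normal range but not zero
print(is_denormal(0.0))         # False: zero is not a denormal
```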

Denormal Numbers

A floating point number is represented in binary using 32 or 64 bits, which consist of a sign bit, an exponent and a mantissa. The range and accuracy are governed by the number of bits that make up the exponent and the mantissa respectively.

A "normal" number takes the form:

mantissa x 2^exponent

where the mantissa is the string of bits in (-)1.xxxxxxxxxxx. The leading 1 is implied rather than stored, and the number of binary digits including it is 24 for 32 bit numbers and 53 for 64 bit numbers.

Thus the smallest positive number that can be represented by this model has all the stored mantissa bits set to zero (a mantissa of 1.0) and the most negative exponent allowed (G):

1.0 x 2^G
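For reference, the IEEE formats fix G at -126 for 32 bit numbers and -1022 for 64 bit numbers. A short sketch that evaluates 1.0 x 2^G for both (again assuming Python floats are IEEE 64 bit numbers; the variable names are illustrative):

```python
import sys

# Most negative exponent G of a normal number in each IEEE format.
G_SINGLE = -126
G_DOUBLE = -1022

smallest_normal_single = 2.0 ** G_SINGLE
smallest_normal_double = 2.0 ** G_DOUBLE

print(f"{smallest_normal_single:.7e}")               # 1.1754944e-38
print(smallest_normal_double == sys.float_info.min)  # True
```

The 32 bit value is the same figure that TINY returns for a 32 bit real in the example on this page.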

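It is, however, possible to store numbers smaller than this by setting the exponent field to zero and allowing the leading bit of the mantissa to be 0. These are called denormal numbers.

These numbers are extremely small and, because their leading mantissa bits are zero, they do not have the precision that the normal numbers do.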

Example

Take the value returned by the Fortran intrinsic TINY for a 32 bit real. In binary it is represented by

00000000 10000000 00000000 00000000

(a sign bit, followed by an exponent of 8 bits, followed by the 23 stored bits of the mantissa), which in decimal is 1.1754944E-38. Dividing this by 2 gives a denormal number with the binary representation

00000000 01000000 00000000 00000000

which in decimal is 5.8774718E-39, as we expect, but the exponent field is now zero and the leading bit of the mantissa is no longer a 1.
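The two bit patterns can be reproduced with a short script. This is a sketch using Python's struct module to round a value to a 32 bit real and print its bits. The helper name float32_bits is hypothetical.

```python
import struct


def float32_bits(x):
    """Return the 32 bit IEEE pattern of x as four space-separated bytes."""
    (raw,) = struct.unpack(">I", struct.pack(">f", x))
    bits = f"{raw:032b}"
    return " ".join(bits[i:i + 8] for i in range(0, 32, 8))


tiny = 2.0 ** -126  # value of TINY for a 32 bit real

print(float32_bits(tiny))      # 00000000 10000000 00000000 00000000
print(float32_bits(tiny / 2))  # 00000000 01000000 00000000 00000000
```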

The calculation of this denormal value has to be done by the operating system, as timings clearly demonstrate. 10 million floating point divides take 7.89 seconds when the result is a denormalized number and 0.31 seconds when it is not, a difference of a factor of about 25.
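To put these timings in perspective, the sketch below converts them into a cost per divide. The timings are the ones quoted above. The 1.5GHz clock rate is an assumption, since the page does not say which processor the test ran on.

```python
DIVIDES = 10_000_000   # number of divides timed
T_DENORMAL = 7.89      # seconds when the results are denormal
T_NORMAL = 0.31        # seconds when the results are normal
CLOCK_HZ = 1.5e9       # assumption: test ran on a 1.5GHz processor

slowdown = T_DENORMAL / T_NORMAL
extra_ns = (T_DENORMAL - T_NORMAL) / DIVIDES * 1e9
extra_cycles = (T_DENORMAL - T_NORMAL) / DIVIDES * CLOCK_HZ

print(f"slowdown factor:       {slowdown:.1f}")      # 25.5
print(f"extra time per divide: {extra_ns:.0f} ns")   # 758 ns
print(f"extra cycles:          {extra_cycles:.0f}")  # 1137
```

Under that assumption, each denormal result costs on the order of a thousand extra clock cycles.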

Where do denormal numbers come from?

Denormalized numbers are created all the time on the system and are part of people's codes. They can arise in a number of ways:

As the result of a prior computation in the program
By operating on uninitialized data
As input supplied to the program (i.e. from an external file or data source)
When the compiler speculatively executes code and the speculated code reads memory that contains bogus data (e.g. the wrong branch of an if-then-else condition)
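The first case in the list above is easy to reproduce: a value that is repeatedly scaled down eventually drops out of the normal range without ever becoming exactly zero. A minimal sketch, assuming IEEE 64 bit arithmetic (the function name is hypothetical):

```python
import sys


def steps_until_denormal(x=1.0, factor=0.5):
    """Count multiplications by factor until x drops below the normal range."""
    steps = 0
    while abs(x) >= sys.float_info.min:
        x *= factor
        steps += 1
    return steps, x


steps, value = steps_until_denormal()
print(steps)        # 1023
print(value > 0.0)  # True: the result is a denormal, not zero
```

A decaying quantity in a long-running iteration behaves in the same way, which is why such codes can slow down part way through a run.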

How can I detect them and fix them?

There are a number of ways to fix this issue, and which one you choose depends on why the denormal numbers are arising:

Increase the precision of the arithmetic from 32 bit values to 64 bit values.
Correct the code that operates on uninitialized data.
Scale the values so that they lie in the range of the normal numbers.
Flush denormal results to zero using compiler flags; this is done automatically with -O3.
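The effect of three of these remedies can be illustrated on a single value. The sketch below takes 1.0E-40, which is denormal as a 32 bit real, and shows that it is normal in 64 bits, normal after scaling, and zero after a flush. The helper names are hypothetical, and flush_to_zero32 is only a software imitation of what the compiler flag arranges in hardware.

```python
import struct
import sys


def is_denormal32(x):
    """True if x, stored as a 32 bit real, has a zero exponent field and a nonzero mantissa."""
    (raw,) = struct.unpack(">I", struct.pack(">f", x))
    exponent = (raw >> 23) & 0xFF
    fraction = raw & 0x7FFFFF
    return exponent == 0 and fraction != 0


def flush_to_zero32(x):
    """Software imitation of flush-to-zero for a 32 bit result."""
    return 0.0 if is_denormal32(x) else x


x = 1.0e-40
print(is_denormal32(x))                          # True: denormal in 32 bits
print(x != 0.0 and abs(x) < sys.float_info.min)  # False: normal in 64 bits
print(is_denormal32(x * 1.0e10))                 # False: normal after scaling
print(flush_to_zero32(x))                        # 0.0
```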

Detecting them is also a fairly simple process. It is possible to count the number of calculations involving denormalized numbers using the pfmon profiler and the performance metric FP_TRUE_SIRSTALL.

For example

newtd> export LD_ASSUME_KERNEL=2.4.19
newtd> pfmon --system-wide --events=FP_TRUE_SIRSTALL,FP_OPS_RETIRED -t 10 &
<session to end in 10 seconds>
newtd> mpirun -np 1 ./a.out
CPU0 9327535 FP_TRUE_SIRSTALL
CPU0 9430008 FP_OPS_RETIRED
CPU1 0 FP_TRUE_SIRSTALL
CPU1 157493 FP_OPS_RETIRED
CPU2 0 FP_TRUE_SIRSTALL
CPU2 16934 FP_OPS_RETIRED
CPU3 0 FP_TRUE_SIRSTALL
CPU3 0 FP_OPS_RETIRED

Here my serial process ran on CPU 0, and most of its floating point calculations were operating on denormal numbers.
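When there are many CPUs to check, the two counters can be compared automatically. The following sketch parses the per-CPU lines shown above and reports the ratio of FP_TRUE_SIRSTALL to FP_OPS_RETIRED. It assumes the three-column layout printed on this page, which may differ between pfmon versions, and the ratio is only a rough indicator. The function name stall_ratios is hypothetical.

```python
# Sample counts copied from the pfmon session shown on this page.
PFMON_OUTPUT = """\
CPU0 9327535 FP_TRUE_SIRSTALL
CPU0 9430008 FP_OPS_RETIRED
CPU1 0 FP_TRUE_SIRSTALL
CPU1 157493 FP_OPS_RETIRED
CPU2 0 FP_TRUE_SIRSTALL
CPU2 16934 FP_OPS_RETIRED
CPU3 0 FP_TRUE_SIRSTALL
CPU3 0 FP_OPS_RETIRED
"""


def stall_ratios(text):
    """Return {cpu: FP_TRUE_SIRSTALL / FP_OPS_RETIRED} for each CPU in text."""
    counts = {}
    for line in text.splitlines():
        fields = line.split()
        # Skip anything that is not "<cpu> <count> <event>".
        if len(fields) != 3 or not fields[1].isdigit():
            continue
        cpu, value, event = fields
        counts.setdefault(cpu, {})[event] = int(value)
    ratios = {}
    for cpu, events in counts.items():
        ops = events.get("FP_OPS_RETIRED", 0)
        stalls = events.get("FP_TRUE_SIRSTALL", 0)
        ratios[cpu] = stalls / ops if ops else 0.0
    return ratios


for cpu, ratio in stall_ratios(PFMON_OUTPUT).items():
    print(f"{cpu}: {ratio:.3f}")
```

For the session above this reports a ratio of about 0.989 for CPU0 and 0.000 for the other CPUs.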

Having found that my program is creating or using denormalized numbers, the next step is to discover where. This is done by running the application under the control of the prctl command, which causes the code to crash at the first place it encounters one. To do this, prefix the command with prctl --fpemu=signal:

prctl --fpemu=signal mpirun -np 2 ./a.out

Or run using totalview:

prctl --fpemu=signal

totalview mpirun -a -np 2 ./a.out

Page maintained by csar-advice@cfs.ac.uk This page last updated: Monday, 28-Nov-2005 11:13:03 GMT