Newton (SGI Altix 3700)
Floating Point Performance Issues on Itanium 2 Processors
The Itanium 2 processor can execute two floating point multiply-add operations per clock cycle, giving a peak performance of 6 GFlops on the 1.5GHz processors. Many codes cannot reach this figure because data cannot be fed to the processor quickly enough. A further complication can slow performance by a far more significant amount (several orders of magnitude) if left untreated: operations on denormal numbers.
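The peak figure follows directly from the clock rate; a quick arithmetic check:

```python
# Peak floating point rate of a 1.5 GHz Itanium 2:
# 2 fused multiply-adds per cycle, each counting as 2 flops.
fma_per_cycle = 2
flops_per_fma = 2             # one multiply + one add
clock_hz = 1.5e9

peak_gflops = fma_per_cycle * flops_per_fma * clock_hz / 1e9
print(peak_gflops)            # 6.0
```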
Denormal numbers, also called subnormal or underflow numbers, are defined in the IEEE floating point standard as those below the normal range (below the value returned by the Fortran intrinsic TINY). Operations involving these numbers cannot be performed by the processor itself; they must be handled by the operating system, and there is a huge penalty in doing so.
A floating point number is represented in binary using 32 or 64 bits, consisting of a sign bit, an exponent and a mantissa. The range and accuracy are governed by the number of bits that make up each of these components.
A "normal" number takes the form:
mantissa x 2^exponent
where the mantissa is a bit string of the form (-)1.xxxxxxxxxxx, with an implicit leading 1 (the number of significant binary digits is 24 for 32 bit numbers and 53 for 64 bit numbers).
Thus the smallest normal number that can be represented by this model occurs when the mantissa is 1.0 (all the x digits are zero) and the exponent takes its largest negative value (G):
1. x 2^(G)
It is, however, possible to store numbers smaller than this; these are called denormal numbers. They are incredibly small and do not have the precision that normal numbers do.
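The loss of precision can be demonstrated directly. A minimal sketch in Python (using 64-bit doubles, where the smallest normal number is 2^-1022): a low-order bit that survives a scaling within the normal range is discarded when the value is scaled deep into the denormal range.

```python
import sys

x = sys.float_info.min           # smallest normal 64-bit float, 2**-1022
y = x * (1 + 2**-40)             # a normal number with a low-order bit set

# Scaling y deep into the denormal range discards that low-order bit,
# because denormals carry fewer significant bits than normal numbers.
print((y / 2**45) * 2**45 == y)  # False: precision was lost
print((y / 2**4) * 2**4 == y)    # True: the bit still fits
```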
Take the value returned by the Fortran intrinsic TINY, which in binary is represented by 00000000 10000000 00000000 00000000 (a sign bit and an 8 bit exponent followed by a 23 bit mantissa) and in decimal is 1.1754944E-38. Dividing by 2 to obtain a denormal number gives the binary number 00000000 01000000 00000000 00000000, which in decimal is 5.8774718E-39 as we expect, but the implicit leading bit of the mantissa is no longer a 1.
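These bit patterns can be reproduced with a short Python sketch that prints the 32-bit IEEE representation of a value:

```python
import struct

def bits32(x: float) -> str:
    """Return the 32-bit IEEE-754 pattern of x, grouped into bytes."""
    (n,) = struct.unpack('>I', struct.pack('>f', x))
    b = format(n, '032b')
    return ' '.join(b[i:i+8] for i in range(0, 32, 8))

tiny = 1.1754944e-38           # smallest normal 32-bit float (Fortran TINY)
print(bits32(tiny))            # 00000000 10000000 00000000 00000000
print(bits32(tiny / 2))        # 00000000 01000000 00000000 00000000
```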
The calculation of this denormal value must be done by the operating system, as timing results clearly demonstrate: 10 million floating point divides take 7.89 seconds when the result is a denormalized number and 0.31 seconds when it is not, a slowdown factor of about 25.
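A similar microbenchmark can be sketched in Python. Note this is only illustrative: interpreter overhead dominates, and on processors other than the Itanium the denormal penalty is handled in hardware or microcode, so the measured gap may be far smaller than the factor of 25 reported above.

```python
import sys
import timeit

tiny = sys.float_info.min     # smallest normal 64-bit float

# Dividing tiny by 2 underflows into the denormal range; dividing 1.0
# by 2 stays normal. Compare the two loops.
n = 10_000_000
t_denorm = timeit.timeit('x / 2.0', globals={'x': tiny}, number=n)
t_normal = timeit.timeit('x / 2.0', globals={'x': 1.0}, number=n)
print(f'denormal result: {t_denorm:.2f}s   normal result: {t_normal:.2f}s')
```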
Where do denormal numbers come from?
Denormalized numbers are being created all the time on the system and arise in users' codes in a number of ways, typically when a calculation underflows the normal range.
How can I detect them and fix them?
There are a number of ways to fix this issue; which one you choose depends on why the denormal numbers are arising.
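One common remedy, when the underflowed values are not needed, is to enable flush-to-zero at compile time. This is a sketch, not the page's own recipe: the `-ftz` option shown here is from the Intel compilers, and the source file name is hypothetical; check your compiler's documentation for the equivalent switch.

```shell
# Flush-to-zero: denormal results are replaced by zero in hardware,
# avoiding the operating-system trap entirely (at the cost of accuracy
# for values in the underflow range).
ifort -O2 -ftz myprog.f90 -o myprog
```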
Detecting them is also a fairly simple process. It is possible to count the number of calculations involving denormalized numbers using the pfmon profiler and the appropriate performance metric.
In this example, my serial process ran on CPU 0 and most of the floating point calculations were operating on denormal numbers.
Having found that my program is creating or using these denormalized numbers, the next step is to discover where. This is done by running the application under the control of the prctl command, which will cause the code to crash at the first place it discovers one. For example, prefix the command with
Or run using totalview:
Page maintained by email@example.com This page last updated: Monday, 28-Nov-2005 11:13:03 GMT