Floating-point arithmetic is arithmetic on floating-point numbers, which are represented by a fixed number of significant digits (the significand) scaled by an exponent in some base. This is most useful for arithmetic on very large or very small numbers. Representable numbers have the form $\text{significand} \times \text{base}^{\text{exponent}}$, where significand and exponent are integers and base is an integer greater than or equal to two.
The most common representation is the one defined by the IEEE 754 standard.
A floating-point unit is specially designed to carry out operations on floating-point numbers.
The IEEE 754 standard defines:
- Arithmetic formats
- Interchange formats
- Rounding rules
- Exception handling
|Format||Name||Base||Significand bits||Exponent bits||Exponent bias|
A floating-point number is stored internally as [Sign|Exponent|Significand], with the number of bits in each field as shown in the table above. A binary format is converted to a readable value as follows:

$$(-1)^{\text{sign}} \times (1.\text{significand})_2 \times 2^{\text{exp} - \text{bias}}$$
For example, the binary32 number 0x41880000:

0 | 10000011 | (1)00010000000000000000000

$$
\begin{aligned}
&(-1)^{0} \times (1.00010000000000000000000)_2 \times 2^{10000011_2 - 127} \\
&= (1.00010000000000000000000)_2 \times 2^{131 - 127} \\
&= (1.00010000000000000000000)_2 \times 2^{4} \\
&= (2^{0} + 2^{-4}) \times 2^{4} = 17
\end{aligned}
$$
All normal binary IEEE 754 floating-point numbers have an implied (hidden) leading 1, which is not stored, as shown by the (1) in the example above. The exceptions are zero and the subnormal (denormalized) numbers, whose exponent field is all zeros (00000000 in binary32) and which have no implied 1.
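The decoding steps above can be sketched in Python. This is a minimal illustration, not a reference implementation: the helper name `decode_binary32` is hypothetical, and it assumes a binary32 bit pattern given as an integer and deliberately rejects infinities and NaNs.

```python
import struct

def decode_binary32(bits: int):
    """Split a 32-bit pattern into binary32 fields and compute its value.

    Hypothetical helper for illustration; handles zero, subnormal and
    normal numbers, but not infinities or NaNs.
    """
    sign = bits >> 31                 # 1 bit
    exponent = (bits >> 23) & 0xFF    # 8 bits, biased
    fraction = bits & 0x7FFFFF        # 23 stored significand bits
    bias = 127
    if exponent == 0xFF:
        raise ValueError("infinity or NaN is not handled in this sketch")
    if exponent == 0:
        # Zero or subnormal: no implied leading 1, exponent fixed at 1 - bias.
        significand = fraction / 2**23
        value = (-1) ** sign * significand * 2.0 ** (1 - bias)
    else:
        # Normal number: implied leading 1 in front of the stored bits.
        significand = 1 + fraction / 2**23
        value = (-1) ** sign * significand * 2.0 ** (exponent - bias)
    return sign, exponent, fraction, value

sign, exponent, fraction, value = decode_binary32(0x41880000)
print(sign, f"{exponent:08b}", f"{fraction:023b}", value)
# prints: 0 10000011 00010000000000000000000 17.0

# Cross-check against the standard library's own interpretation of the bytes.
(reference,) = struct.unpack(">f", (0x41880000).to_bytes(4, "big"))
```

Here `reference` is the value that `struct` reads from the same four bytes, so comparing it with `value` is a quick sanity check on the manual decoding.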
The standard defines five rounding rules.
- Round to nearest, ties to even - rounds to the nearest value; if the number falls midway it is rounded to the nearest value with an even (zero) least significant bit
- Round to nearest, ties away from zero - rounds to the nearest value; if the number falls midway it is rounded to the value with the larger magnitude, up for positive numbers and down for negative numbers
- Round toward 0 - directed rounding towards zero (truncation)
- Round toward +∞ - directed rounding towards positive infinity (ceiling)
- Round toward −∞ - directed rounding towards negative infinity (floor)
The standard also defines several special values.
- Not a Number (NaN) - returned by undefined operations such as 0/0 or sqrt(-1)
  - Quiet NaN - the default NaN; it is propagated by most operations (e.g. if any input is NaN, the result will be NaN)
  - Signaling NaN - causes an invalid-operation exception to be signaled if it is encountered in any arithmetic operation. There are some unspecified bits in the NaN format which may be used to encode the source of the error.
- Infinity - represents positive or negative infinity. Infinities are often produced as overflow results, though they are not error values in any way. Dividing a nonzero finite number by zero returns a signed infinity
- Signed zero - in the IEEE 754 standard, zero is always signed. Some arithmetic operations will behave differently for +0 and -0, for example, the identity 1/(1/±∞) = ±∞ is maintained
- Subnormal numbers - subnormal values fill the underflow gap with values where the absolute distance between them is the same as for adjacent values just outside the underflow gap
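Some of these special values can be observed from Python, whose `float` is typically an IEEE 754 binary64 number (an assumption that holds on common platforms). The snippet below is only a sketch. Note that Python itself raises `ZeroDivisionError` for `1.0 / 0.0` and `ValueError` for `math.sqrt(-1)` instead of returning an infinity or a NaN, so the infinity and NaN are produced in other ways here.

```python
import math
import sys

inf = float("inf")
nan = inf - inf                      # an undefined operation yields NaN

nan_equals_itself = (nan == nan)     # False: NaN compares unequal to everything
nan_propagates = math.isnan(nan + 1.0)

# Overflow produces an infinity rather than an error value.
overflowed = sys.float_info.max * 2

# Zero is signed: the two zeros compare equal, but the sign is observable.
neg_zero = -0.0
zeros_equal = (neg_zero == 0.0)
sign_of_neg_zero = math.copysign(1.0, neg_zero)
sign_of_one_over_neg_inf = math.copysign(1.0, 1.0 / -inf)

# Subnormal numbers fill the gap between zero and the smallest normal number.
smallest_normal = sys.float_info.min
smallest_subnormal = smallest_normal * sys.float_info.epsilon
halved = smallest_subnormal / 2      # underflows to zero

print(nan_equals_itself, nan_propagates, overflowed)
print(zeros_equal, sign_of_neg_zero, sign_of_one_over_neg_inf)
print(smallest_subnormal < smallest_normal, halved)
```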
Floating-point Arithmetic Operations
Adding floating point numbers requires a few more steps than adding integers.
$$
\begin{aligned}
&1.10010000000000000000000_2 \times 2^{3} \quad (= 12.5) \\
&1.01110000000000000000000_2 \times 2^{4} \quad (= 23)
\end{aligned}
$$
First, the exponents need to be made equal. Then the significands can be added as if they were integers.
$$
\begin{array}{r}
0.11001000000000000000000\,0_2 \times 2^{4} \\
+\; 1.01110000000000000000000\,0_2 \times 2^{4} \\
\hline
10.00111000000000000000000\,0_2 \times 2^{4}
\end{array}
$$
The sum must then be normalized: it is shifted right if there is a carry, or left by as many places as there are leading zeroes (not counting the implicit one). The exponent is incremented (when shifting right) or decremented (when shifting left) by the number of places shifted.

$$1.00011100000000000000000\,00_2 \times 2^{5} \quad (= 35.5)$$
The last step is to round the result according to the chosen rounding rule. The example above does not need to be rounded, as the two extra low bits are both zero.
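The align, add and normalize steps can be sketched with Python integers. This is a simplified illustration under stated assumptions: the function name `add_significands` is hypothetical, both operands are positive normal numbers, and each significand is a 24-bit integer (implicit 1 included) so that the value is `sig * 2**(exp - 23)`. Instead of shifting the smaller operand right, the sketch scales the larger operand left. That is equivalent in exact integer arithmetic and keeps every bit for the later rounding step.

```python
def add_significands(sig_a, exp_a, sig_b, exp_b, precision=24):
    """Add two positive normal numbers given as (integer significand, exponent).

    Returns (kept, exp, discarded, extra): the top `precision` bits of the sum,
    its exponent, the bits that fell off the end, and how many there were.
    Rounding is intentionally left out of this sketch.
    """
    # Make operand a the one with the larger exponent.
    if exp_a < exp_b:
        sig_a, exp_a, sig_b, exp_b = sig_b, exp_b, sig_a, exp_a
    # Align: express both operands with the smaller exponent.
    shift = exp_a - exp_b
    total = (sig_a << shift) + sig_b            # exact integer sum
    # Normalize: keep the top `precision` bits and adjust the exponent.
    extra = total.bit_length() - precision
    kept = total >> extra
    discarded = total & ((1 << extra) - 1)
    return kept, exp_b + extra, discarded, extra

# 12.5 = 1.1001_2 * 2^3 and 23 = 1.0111_2 * 2^4, as 24-bit significands.
kept, exp, discarded, extra = add_significands(0xC80000, 3, 0xB80000, 4)
result = kept * 2.0 ** (exp - 23)
print(f"{kept:024b}", exp, discarded, result)   # result is 35.5
```

A `discarded` value of zero means the sum is exactly representable, which matches the observation that this example needs no rounding.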
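Take for example the numbers 4.0 and $2.5 \times 10^{-7}$. Adding them leaves non-zero bits beyond the 23 stored significand bits.

$$
\begin{aligned}
&1.00000000000000000000000_2 \times 2^{2} \quad (= 4.0) \\
&1.00001100011011110111101_2 \times 2^{-22} \quad (\approx 0.00000025)
\end{aligned}
$$

After aligning the exponents (only the first eight bits beyond the stored significand are shown):

$$
\begin{array}{r}
1.00000000000000000000000\,00000000_2 \times 2^{2} \\
+\; 0.00000000000000000000000\,10000110_2 \times 2^{2} \\
\hline
1.00000000000000000000000\,10000110_2 \times 2^{2}
\end{array}
$$

The exact sum is approximately 4.00000025. How it is rounded depends on the chosen rounding rule. The discarded bits amount to slightly more than half a unit in the last place, so this is not a tie and both round-to-nearest rules round up.
- Round to nearest, ties to even: $1.00000000000000000000001_2 \times 2^{2}$ (≈ 4.0000005)
- Round to nearest, ties away from zero: $1.00000000000000000000001_2 \times 2^{2}$ (≈ 4.0000005)
- Round toward zero: $1.00000000000000000000000_2 \times 2^{2}$ (= 4.0)
- Round toward +∞: $1.00000000000000000000001_2 \times 2^{2}$ (≈ 4.0000005)
- Round toward −∞: $1.00000000000000000000000_2 \times 2^{2}$ (= 4.0)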
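The five rounding rules can be explored with exact rational arithmetic. The sketch below is an illustration rather than how hardware does it: the function `round_to_precision` and its mode names are hypothetical, it rounds a nonzero `Fraction` to 24 significant bits (the binary32 precision), and it ignores exponent range, overflow and subnormals.

```python
from fractions import Fraction
import math

def round_to_precision(x: Fraction, mode: str, precision: int = 24) -> Fraction:
    """Round a nonzero rational to `precision` significant bits (sketch only)."""
    # Find e with 2**e <= |x| < 2**(e + 1).
    ax = abs(x)
    e = ax.numerator.bit_length() - ax.denominator.bit_length()
    if Fraction(2) ** e > ax:
        e -= 1
    ulp = Fraction(2) ** (e - precision + 1)   # spacing of representable values
    q = x / ulp                                # x measured in units of ulp
    if mode == "nearest_even":
        n = round(q)                           # Fraction rounds ties to even
    elif mode == "nearest_away":
        n = math.floor(abs(q) + Fraction(1, 2))
        n = n if q > 0 else -n
    elif mode == "toward_zero":
        n = math.trunc(q)
    elif mode == "toward_pos_inf":
        n = math.ceil(q)
    elif mode == "toward_neg_inf":
        n = math.floor(q)
    else:
        raise ValueError(f"unknown rounding mode: {mode}")
    return n * ulp

modes = ["nearest_even", "nearest_away", "toward_zero",
         "toward_pos_inf", "toward_neg_inf"]

# Exact sum of 4 and 2.5e-7: slightly more than half an ulp above 4.
x = Fraction(4) + Fraction(1, 4_000_000)
results = {m: round_to_precision(x, m) for m in modes}

# An exact tie (4 plus half an ulp) separates the two round-to-nearest rules.
tie = Fraction(4) + Fraction(1, 2**22)
tie_results = {m: round_to_precision(tie, m) for m in modes}

for m in modes:
    print(m, float(results[m]), float(tie_results[m]))
```

For the non-tie input, both round-to-nearest rules return the same value; only the exact tie separates them, with ties-to-even returning 4 and ties-away-from-zero returning the next value above 4.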