Floating Point#
A floating point is a binary representation to include non-integer numbers along with \(\pm \infty\) and NaN
.
The format of the floating point in binary is always,
| Sign | Exponent | Manitssa |
Floating Point Single Precision#
The single precision floating point is a 32-bit representation of the floating point.
Where we have,
Symbol |
Size (bits) |
Description |
|
---|---|---|---|
Sign |
\(S\) |
1 |
the sign where |
Exponent |
\(E\) |
8 |
|
Mantissa |
\(M\) |
23 |
Floating Point to Decimal#
Giving the floating point \(F_2\) one may disect it from the definition that \(F_2 \equiv S_2||E_2||M_2\). The following equation is used to construct the decimal representation \(D\) given the floating point.
\(\{S, E, M\}\) is the decimal form of \(\{S_2, E_2, M_2\}\).
Note: any number without a subscript is represented in decimal form.
Decimal to Floating Point#
Given a decimal representation \(D\) we may convert the floating point by determining each part of the floating point (sign, exponent and mantissa).
Three useful equations is given as so,
Sign |
$\((-1)^S = \frac{D}{|D|}\)$ |
Exponent |
$\(E = \lfloor{\log_2{D} + 127}\rfloor\)$ |
Mantissa |
$\(M = 0.M \cdot 2^{23} = \frac{D-2^{E-127}}{2^{E-127}}\cdot 2^{23}\)$ |
The first line is redundant math; just take the sign of \(D\) and consider what \(S\) is
\(\lfloor x \rfloor\) is the floor (round down) of \(x\)