BFloat16 represents 16-bit floating-point values.
This type does not actually support arithmetic directly. The expected use case is to convert to Float to perform any actual arithmetic, then convert back to a BFloat16 if needed.
Binary representation:
    sign (1 bit)
    |
    | exponent (8 bits)
    | |
    | |        mantissa (7 bits)
    | |        |
    x xxxxxxxx xxxxxxx
Value interpretation (in order of precedence, with _ wild):
    0 00000000 0000000  (positive) zero
    1 00000000 0000000  negative zero
    _ 00000000 _______  subnormal number
    _ 11111111 0000000  +/- infinity
    _ 11111111 _______  not-a-number
    _ ________ _______  normal number
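The precedence rules in the table can be sketched as a small classifier over the raw 16-bit pattern. This is an illustrative sketch only: the `BFloat16Classify` object, the `Kind` hierarchy, and the `classify` method are hypothetical names and are not part of the documented API.

```scala
// Hypothetical helper for illustration; not part of the documented BFloat16 API.
object BFloat16Classify {
  sealed trait Kind
  case object Zero extends Kind
  case object Subnormal extends Kind
  case object Infinity extends Kind
  case object NaN extends Kind
  case object Normal extends Kind

  /** Classify a raw bit pattern, testing the rules in the table's order of precedence. */
  def classify(raw: Short): Kind = {
    val bits = raw & 0xFFFF             // widen to an unsigned 16-bit value
    val exponent = (bits >>> 7) & 0xFF  // 8 exponent bits
    val mantissa = bits & 0x7F          // 7 mantissa bits
    if (exponent == 0 && mantissa == 0) Zero  // either sign
    else if (exponent == 0) Subnormal
    else if (exponent == 0xFF && mantissa == 0) Infinity
    else if (exponent == 0xFF) NaN
    else Normal
  }
}
```

For example, `BFloat16Classify.classify(0x3F80.toShort)` yields `Normal`, since the exponent field is neither all 0s nor all 1s.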
An exponent field of all 1s signals a sentinel (NaN or infinity), and all 0s signals zero or a subnormal number. The stored exponent is biased by 127, so the working "real" range of exponents we can express is [-126, +127].
For non-zero exponents, the mantissa has an implied leading 1 bit, so 7 bits of data provide 8 bits of precision for normal numbers.
For normal numbers:
x = (1 - sign*2) * 2^exponent * (1 + mantissa/128)
For subnormal numbers, the implied leading 1 bit is absent. Thus, subnormal numbers have the same exponent as the smallest normal numbers, but without an implied 1 bit.
So for subnormal numbers:
x = (1 - sign*2) * 2^(-126) * (mantissa/128)
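As a worked illustration of the two formulas, the following sketch decodes a raw bit pattern into a Double. It assumes the stored exponent is biased by 127 (which gives the [-126, +127] range stated above) and uses -126, the exponent of the smallest normal numbers, for subnormals. The `BFloat16Decode` object and its `decode` method are hypothetical names, not part of the documented API.

```scala
// Hypothetical helper for illustration; not part of the documented BFloat16 API.
object BFloat16Decode {
  /** Evaluate the value formulas for a raw 16-bit pattern. */
  def decode(raw: Short): Double = {
    val bits = raw & 0xFFFF
    val sign = bits >>> 15              // 0 or 1
    val biased = (bits >>> 7) & 0xFF    // stored exponent field
    val mantissa = bits & 0x7F
    val s = 1.0 - sign * 2.0            // +1.0 or -1.0
    if (biased == 0xFF) {
      // Sentinels: infinity when the mantissa is zero, NaN otherwise.
      if (mantissa == 0) s * Double.PositiveInfinity else Double.NaN
    } else if (biased == 0) {
      // Zero and subnormals: no implied leading 1 bit, exponent fixed at -126.
      s * java.lang.Math.scalb(1.0, -126) * (mantissa / 128.0)
    } else {
      // Normal numbers: implied leading 1 bit, exponent = stored field - 127.
      s * java.lang.Math.scalb(1.0, biased - 127) * (1.0 + mantissa / 128.0)
    }
  }
}
```

With these assumptions, the bit pattern `0x3F80` (sign 0, exponent field 127, mantissa 0) decodes to 1.0, and `0x0001` decodes to the smallest positive subnormal, 2^(-133).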
Value members
Concrete methods
Whether this BFloat16 value is finite or not.
For the purposes of this method, infinities and NaNs are considered non-finite. For those values it returns false and for all other values it returns true.
Returns whether this is a zero value (positive or negative).
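Both predicates above reduce to simple mask tests on the raw bits. The sketch below shows one way they could be written; the `BFloat16Predicates` object and its method signatures (taking a raw `Short`) are assumptions made for illustration and may differ from the actual implementation.

```scala
// Hypothetical helper for illustration; not part of the documented BFloat16 API.
object BFloat16Predicates {
  /** Finite means the exponent field is not all 1s (so neither infinity nor NaN). */
  def isFinite(raw: Short): Boolean = (raw & 0x7F80) != 0x7F80

  /** Zero means every bit except the sign bit is clear. */
  def isZero(raw: Short): Boolean = (raw & 0x7FFF) == 0
}
```

Masking off the sign bit in `isZero` is what makes positive and negative zero compare the same.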
Return the sign of a BFloat16 value as a Float.
There are five possible return values:
* NaN: the value is BFloat16.NaN (and has no sign)
* -1F: the value is a non-zero negative number
* -0F: the value is BFloat16.NegativeZero
* 0F: the value is BFloat16.Zero
* 1F: the value is a non-zero positive number
PositiveInfinity and NegativeInfinity return their expected signs.
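The five cases can be sketched directly from the bit pattern. This is a minimal illustration, assuming the method works on the raw 16 bits; the `BFloat16Signum` object is a hypothetical name and the real implementation may be structured differently.

```scala
// Hypothetical helper for illustration; not part of the documented BFloat16 API.
object BFloat16Signum {
  def signum(raw: Short): Float = {
    val bits = raw & 0xFFFF
    val magnitude = bits & 0x7FFF             // everything except the sign bit
    val sign = if ((bits & 0x8000) != 0) -1f else 1f
    if (magnitude > 0x7F80) Float.NaN         // exponent all 1s, mantissa non-zero
    else if (magnitude == 0) java.lang.Math.copySign(0f, sign) // keeps -0F vs 0F
    else sign                                 // includes +/- infinity
  }
}
```

Using `copySign` for the zero case preserves the distinction between -0F and 0F, which a plain `==` comparison on floats would hide.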
Convert this BFloat16 value to the nearest Float.
Unlike Float16, BFloat16 has the same exponent width as Float32, so the conversion only needs to append extra zero bits to the mantissa.
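Because the sign and exponent fields line up with Float32, the widening amounts to placing the 16 bits in the upper half of a 32-bit pattern. The sketch below illustrates this under the assumption that the value is held as a raw `Short`; the `BFloat16ToFloat` object is a hypothetical name, not the documented API.

```scala
// Hypothetical helper for illustration; not part of the documented BFloat16 API.
object BFloat16ToFloat {
  /** Shift the 16 bits into the top half of an Int; the low 16 mantissa bits become zero. */
  def toFloat(raw: Short): Float =
    java.lang.Float.intBitsToFloat((raw & 0xFFFF) << 16)
}
```

The same shift also handles zeros, subnormals, infinities, and NaNs, since those share their encodings with Float32.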