$NetBSD: softfloat.txt,v 1.2 2006/11/24 19:46:58 christos Exp $

SoftFloat Release 2a General Documentation

John R. Hauser
1998 December 13


-------------------------------------------------------------------------------
Introduction

SoftFloat is a software implementation of floating-point that conforms to
the IEC/IEEE Standard for Binary Floating-Point Arithmetic. As many as four
formats are supported: single precision, double precision, extended double
precision, and quadruple precision. All operations required by the standard
are implemented, except for conversions to and from decimal.

This document gives information about the types defined and the routines
implemented by SoftFloat. It does not attempt to define or explain the
IEC/IEEE Floating-Point Standard. Details about the standard are available
elsewhere.


-------------------------------------------------------------------------------
Limitations

SoftFloat is written in C and is designed to work with other C code. The
SoftFloat header files assume an ISO/ANSI-style C compiler. No attempt
has been made to accommodate compilers that are not ISO-conformant. In
particular, the distributed header files will not be acceptable to any
compiler that does not recognize function prototypes.

Support for the extended double-precision and quadruple-precision formats
depends on a C compiler that implements 64-bit integer arithmetic. If the
largest integer format supported by the C compiler is 32 bits, SoftFloat is
limited to only single and double precisions. When that is the case, all
references in this document to the extended double precision, quadruple
precision, and 64-bit integers should be ignored.


-------------------------------------------------------------------------------
Contents

    Introduction
    Limitations
    Contents
    Legal Notice
    Types and Functions
    Rounding Modes
    Extended Double-Precision Rounding Precision
    Exceptions and Exception Flags
    Function Details
        Conversion Functions
        Standard Arithmetic Functions
        Remainder Functions
        Round-to-Integer Functions
        Comparison Functions
        Signaling NaN Test Functions
        Raise-Exception Function
    Contact Information



-------------------------------------------------------------------------------
Legal Notice

SoftFloat was written by John R. Hauser. This work was made possible in
part by the International Computer Science Institute, located at Suite 600,
1947 Center Street, Berkeley, California 94704. Funding was partially
provided by the National Science Foundation under grant MIP-9311980. The
original version of this code was written as part of a project to build
a fixed-point vector processor in collaboration with the University of
California at Berkeley, overseen by Profs. Nelson Morgan and John Wawrzynek.

THIS SOFTWARE IS DISTRIBUTED AS IS, FOR FREE. Although reasonable effort
has been made to avoid it, THIS SOFTWARE MAY CONTAIN FAULTS THAT WILL AT
TIMES RESULT IN INCORRECT BEHAVIOR. USE OF THIS SOFTWARE IS RESTRICTED TO
PERSONS AND ORGANIZATIONS WHO CAN AND WILL TAKE FULL RESPONSIBILITY FOR ANY
AND ALL LOSSES, COSTS, OR OTHER PROBLEMS ARISING FROM ITS USE.


-------------------------------------------------------------------------------
Types and Functions

When 64-bit integers are supported by the compiler, the `softfloat.h' header
file defines four types: `float32' (single precision), `float64' (double
precision), `floatx80' (extended double precision), and `float128'
(quadruple precision).
The `float32' and `float64' types are defined in
terms of 32-bit and 64-bit integer types, respectively, while the `float128'
type is defined as a structure of two 64-bit integers, taking into account
the byte order of the particular machine being used. The `floatx80' type
is defined as a structure containing one 16-bit and one 64-bit integer, with
the machine's byte order again determining the order of the `high' and `low'
fields.

When 64-bit integers are _not_ supported by the compiler, the `softfloat.h'
header file defines only two types: `float32' and `float64'. Because
ISO/ANSI C guarantees at least one built-in integer type of 32 bits,
the `float32' type is identified with an appropriate integer type. The
`float64' type is defined as a structure of two 32-bit integers, with the
machine's byte order determining the order of the fields.

In either case, the types in `softfloat.h' are defined such that if a system
implements the usual C `float' and `double' types according to the IEC/IEEE
Standard, then the `float32' and `float64' types should be indistinguishable
in memory from the native `float' and `double' types. (On the other hand,
when `float32' or `float64' values are placed in processor registers by
the compiler, the type of registers used may differ from those used for the
native `float' and `double' types.)

SoftFloat implements the following arithmetic operations:

-- Conversions among all the floating-point formats, and also between
   integers (32-bit and 64-bit) and any of the floating-point formats.

-- The usual add, subtract, multiply, divide, and square root operations
   for all floating-point formats.

-- For each format, the floating-point remainder operation defined by the
   IEC/IEEE Standard.

-- For each floating-point format, a ``round to integer'' operation that
   rounds to the nearest integer value in the same format. (The
   floating-point formats can hold integer values, of course.)

-- Comparisons between two values in the same floating-point format.

The only functions required by the IEC/IEEE Standard that are not provided
are conversions to and from decimal.


-------------------------------------------------------------------------------
Rounding Modes

All four rounding modes prescribed by the IEC/IEEE Standard are implemented
for all operations that require rounding. The rounding mode is selected
by the global variable `float_rounding_mode'. This variable may be set
to one of the values `float_round_nearest_even', `float_round_to_zero',
`float_round_down', or `float_round_up'. The rounding mode is initialized
to nearest/even.


-------------------------------------------------------------------------------
Extended Double-Precision Rounding Precision

For extended double precision (`floatx80') only, the rounding precision
of the standard arithmetic operations is controlled by the global variable
`floatx80_rounding_precision'. The operations affected are:

    floatx80_add   floatx80_sub   floatx80_mul   floatx80_div   floatx80_sqrt

When `floatx80_rounding_precision' is set to its default value of 80, these
operations are rounded (as usual) to the full precision of the extended
double-precision format. Setting `floatx80_rounding_precision' to 32
or to 64 causes the operations listed to be rounded to reduced precision
equivalent to single precision (`float32') or to double precision
(`float64'), respectively. When rounding to reduced precision, additional
bits in the result significand beyond the rounding point are set to zero.
The consequences of setting `floatx80_rounding_precision' to a value other
than 32, 64, or 80 are not specified. Operations other than the ones listed
above are not affected by `floatx80_rounding_precision'.
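
As an illustration of what rounding to reduced precision means for a 64-bit
significand, the following is a minimal sketch and not SoftFloat's own code.
The helper `round_significand' and its parameters are hypothetical. It
assumes the nearest/even rounding mode, a normalized significand, and that
single and double precision keep 24 and 53 significant bits, respectively.

```c
#include <stdint.h>

/* Hypothetical helper, not part of SoftFloat: round a normalized 64-bit
   significand (most significant bit set) to `keep_bits' significant bits
   using round-to-nearest-even, clearing every bit below the rounding point.
   `keep_bits' must be in the range 1..63.  If rounding carries out of the
   top bit, the result is renormalized to 0x8000000000000000 and
   *exp_adjust is set to 1 so that the caller can increment the exponent;
   otherwise *exp_adjust is set to 0. */
static uint64_t round_significand(uint64_t sig, int keep_bits, int *exp_adjust)
{
    int drop_bits = 64 - keep_bits;
    uint64_t unit = (uint64_t)1 << drop_bits;   /* weight of last kept bit */
    uint64_t mask = unit - 1;                   /* bits to be discarded */
    uint64_t half = unit >> 1;                  /* exact halfway point */
    uint64_t rest = sig & mask;                 /* discarded part */
    uint64_t kept = sig & ~mask;                /* low bits set to zero */

    *exp_adjust = 0;
    /* Round up when above halfway, or exactly halfway with an odd last bit. */
    if (rest > half || (rest == half && (kept & unit) != 0)) {
        kept += unit;
        if (kept == 0) {                        /* carry out of bit 63 */
            kept = UINT64_C(0x8000000000000000);
            *exp_adjust = 1;
        }
    }
    return kept;
}
```

Under these assumptions, calling the helper with `keep_bits' equal to 24 or
53 models the effect described for `floatx80_rounding_precision' values of
32 and 64. The default value of 80 corresponds to keeping all 64 bits, in
which case no reduction is applied and the helper is not needed.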


-------------------------------------------------------------------------------
Exceptions and Exception Flags

All five exception flags required by the IEC/IEEE Standard are
implemented. Each flag is stored as a unique bit in the global variable
`float_exception_flags'. The positions of the exception flag bits within
this variable are determined by the bit masks `float_flag_inexact',
`float_flag_underflow', `float_flag_overflow', `float_flag_divbyzero', and
`float_flag_invalid'. The exception flags variable is initialized to all 0,
meaning no exceptions.

An individual exception flag can be cleared with the statement

    float_exception_flags &= ~ float_flag_<exception>;

where `<exception>' is the appropriate name. To raise a floating-point
exception, the SoftFloat function `float_raise' should be used (see below).

In the terminology of the IEC/IEEE Standard, SoftFloat can detect tininess
for underflow either before or after rounding. The choice is made by
the global variable `float_detect_tininess', which can be set to either
`float_tininess_before_rounding' or `float_tininess_after_rounding'.
Detecting tininess after rounding is better because it results in fewer
spurious underflow signals. The other option is provided for compatibility
with some systems. Like most systems, SoftFloat always detects loss of
accuracy for underflow as an inexact result.


-------------------------------------------------------------------------------
Function Details

- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Conversion Functions

All conversions among the floating-point formats are supported, as are all
conversions between a floating-point format and 32-bit and 64-bit signed
integers.
The complete set of conversion functions is:

    int32_to_float32      int64_to_float32
    int32_to_float64      int64_to_float64
    int32_to_floatx80     int64_to_floatx80
    int32_to_float128     int64_to_float128

    float32_to_int32      float32_to_int64
    float64_to_int32      float64_to_int64
    floatx80_to_int32     floatx80_to_int64
    float128_to_int32     float128_to_int64

    float32_to_float64    float32_to_floatx80   float32_to_float128
    float64_to_float32    float64_to_floatx80   float64_to_float128
    floatx80_to_float32   floatx80_to_float64   floatx80_to_float128
    float128_to_float32   float128_to_float64   float128_to_floatx80

Each conversion function takes one operand of the appropriate type and
returns one result. Conversions from a smaller to a larger floating-point
format are always exact and so require no rounding. Conversions from 32-bit
integers to double precision and larger formats are also exact, and likewise
for conversions from 64-bit integers to extended double and quadruple
precisions.

Conversions from floating-point to integer raise the invalid exception if
the source value cannot be rounded to a representable integer of the desired
size (32 or 64 bits). If the floating-point operand is a NaN, the largest
positive integer is returned. Otherwise, if the conversion overflows, the
largest integer with the same sign as the operand is returned.

On conversions to integer, if the floating-point operand is not already an
integer value, the operand is rounded according to the current rounding
mode as specified by `float_rounding_mode'.
Because C (and perhaps other
languages) requires that conversions to integers be rounded toward zero, the
following functions are provided for improved speed and convenience:

    float32_to_int32_round_to_zero    float32_to_int64_round_to_zero
    float64_to_int32_round_to_zero    float64_to_int64_round_to_zero
    floatx80_to_int32_round_to_zero   floatx80_to_int64_round_to_zero
    float128_to_int32_round_to_zero   float128_to_int64_round_to_zero

These variant functions ignore `float_rounding_mode' and always round toward
zero.

- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Standard Arithmetic Functions

The following standard arithmetic functions are provided:

    float32_add    float32_sub    float32_mul    float32_div    float32_sqrt
    float64_add    float64_sub    float64_mul    float64_div    float64_sqrt
    floatx80_add   floatx80_sub   floatx80_mul   floatx80_div   floatx80_sqrt
    float128_add   float128_sub   float128_mul   float128_div   float128_sqrt

Each function takes two operands, except for `sqrt' which takes only one.
The operands and result are all of the same type.

Rounding of the extended double-precision (`floatx80') functions is affected
by the `floatx80_rounding_precision' variable, as explained above in the
section _Extended_Double-Precision_Rounding_Precision_.

- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Remainder Functions

For each format, SoftFloat implements the remainder function according to
the IEC/IEEE Standard. The remainder functions are:

    float32_rem
    float64_rem
    floatx80_rem
    float128_rem

Each remainder function takes two operands. The operands and result are all
of the same type. Given operands x and y, the remainder functions return
the value x - n*y, where n is the integer closest to x/y. If x/y is exactly
halfway between two integers, n is the even integer closest to x/y.
The remainder functions are always exact and so require no rounding.

Depending on the relative magnitudes of the operands, the remainder
functions can take considerably longer to execute than the other SoftFloat
functions. This is inherent in the remainder operation itself and is not a
flaw in the SoftFloat implementation.

- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Round-to-Integer Functions

For each format, SoftFloat implements the round-to-integer function
specified by the IEC/IEEE Standard. The functions are:

    float32_round_to_int
    float64_round_to_int
    floatx80_round_to_int
    float128_round_to_int

Each function takes a single floating-point operand and returns a result of
the same type. (Note that the result is not an integer type.) The operand
is rounded to an exact integer according to the current rounding mode, and
the resulting integer value is returned in the same floating-point format.

- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Comparison Functions

The following floating-point comparison functions are provided:

    float32_eq    float32_le    float32_lt
    float64_eq    float64_le    float64_lt
    floatx80_eq   floatx80_le   floatx80_lt
    float128_eq   float128_le   float128_lt

Each function takes two operands of the same type and returns a 1 or 0
representing either _true_ or _false_. The abbreviation `eq' stands for
``equal'' (=); `le' stands for ``less than or equal'' (<=); and `lt' stands
for ``less than'' (<).

The standard greater-than (>), greater-than-or-equal (>=), and not-equal
(!=) functions are easily obtained using the functions provided. The
not-equal function is just the logical complement of the equal function.
The greater-than-or-equal function is identical to the less-than-or-equal
function with the operands reversed; and the greater-than function can be
obtained from the less-than function in the same way.

The IEC/IEEE Standard specifies that the less-than-or-equal and less-than
functions raise the invalid exception if either input is any kind of NaN.
The equal functions, on the other hand, are defined not to raise the invalid
exception on quiet NaNs. For completeness, SoftFloat provides the following
additional functions:

    float32_eq_signaling    float32_le_quiet    float32_lt_quiet
    float64_eq_signaling    float64_le_quiet    float64_lt_quiet
    floatx80_eq_signaling   floatx80_le_quiet   floatx80_lt_quiet
    float128_eq_signaling   float128_le_quiet   float128_lt_quiet

The `signaling' equal functions are identical to the standard functions
except that the invalid exception is raised for any NaN input. Likewise,
the `quiet' comparison functions are identical to their counterparts except
that the invalid exception is not raised for quiet NaNs.

- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Signaling NaN Test Functions

The following functions test whether a floating-point value is a signaling
NaN:

    float32_is_signaling_nan
    float64_is_signaling_nan
    floatx80_is_signaling_nan
    float128_is_signaling_nan

The functions take one operand and return 1 if the operand is a signaling
NaN and 0 otherwise.

- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Raise-Exception Function

SoftFloat provides a function for raising floating-point exceptions:

    float_raise

The function takes a mask indicating the set of exceptions to raise. No
result is returned. In addition to setting the specified exception flags,
this function may cause a trap or abort appropriate for the current system.
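
The way exception flags are raised, tested, and cleared can be sketched
with stand-in definitions. Everything named `demo_...' below is hypothetical
and merely mirrors `float_exception_flags', the `float_flag_<exception>'
masks, and `float_raise'. The bit values and the `unsigned int' type are
assumptions made for illustration; the real ones come from `softfloat.h'
and may differ.

```c
/* Placeholder bit masks for illustration only.  The real masks are named
   float_flag_<exception>, and their positions are fixed by `softfloat.h'. */
enum {
    demo_flag_inexact   = 0x01,
    demo_flag_underflow = 0x02,
    demo_flag_overflow  = 0x04,
    demo_flag_divbyzero = 0x08,
    demo_flag_invalid   = 0x10
};

/* Stand-in for float_exception_flags; all 0 means no exceptions. */
static unsigned int demo_exception_flags = 0;

/* Stand-in for float_raise: sets every flag named in `mask'.  The real
   function may additionally cause a trap or abort. */
static void demo_raise(unsigned int mask)
{
    demo_exception_flags |= mask;
}

/* Returns 1 if any flag named in `mask' is currently set, 0 otherwise. */
static int demo_flag_is_set(unsigned int mask)
{
    return (demo_exception_flags & mask) != 0;
}

/* Clears the flags named in `mask', in the same manner as the statement
   `float_exception_flags &= ~ float_flag_<exception>;'. */
static void demo_clear(unsigned int mask)
{
    demo_exception_flags &= ~mask;
}
```

Because each flag occupies its own bit, several exceptions can be raised
with a single call by combining masks with `|', and clearing one flag leaves
the others untouched.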

- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -


-------------------------------------------------------------------------------
Contact Information

At the time of this writing, the most up-to-date information about
SoftFloat and the latest release can be found at the Web page
`http://HTTP.CS.Berkeley.EDU/~jhauser/arithmetic/SoftFloat.html'.