$NetBSD: softfloat.txt,v 1.2 2006/11/24 19:46:58 christos Exp $

SoftFloat Release 2a General Documentation

John R. Hauser
1998 December 13


-------------------------------------------------------------------------------
Introduction

SoftFloat is a software implementation of floating-point that conforms to
the IEC/IEEE Standard for Binary Floating-Point Arithmetic.  As many as four
formats are supported:  single precision, double precision, extended double
precision, and quadruple precision.  All operations required by the standard
are implemented, except for conversions to and from decimal.

This document gives information about the types defined and the routines
implemented by SoftFloat.  It does not attempt to define or explain the
IEC/IEEE Floating-Point Standard.  Details about the standard are available
elsewhere.


-------------------------------------------------------------------------------
Limitations

SoftFloat is written in C and is designed to work with other C code.  The
SoftFloat header files assume an ISO/ANSI-style C compiler.  No attempt
has been made to accommodate compilers that are not ISO-conformant.  In
particular, the distributed header files will not be acceptable to any
compiler that does not recognize function prototypes.

Support for the extended double-precision and quadruple-precision formats
depends on a C compiler that implements 64-bit integer arithmetic.  If the
largest integer format supported by the C compiler is 32 bits, SoftFloat is
limited to only single and double precisions.  When that is the case, all
references in this document to the extended double precision, quadruple
precision, and 64-bit integers should be ignored.


-------------------------------------------------------------------------------
Contents

    Introduction
    Limitations
    Contents
    Legal Notice
    Types and Functions
    Rounding Modes
    Extended Double-Precision Rounding Precision
    Exceptions and Exception Flags
    Function Details
        Conversion Functions
        Standard Arithmetic Functions
        Remainder Functions
        Round-to-Integer Functions
        Comparison Functions
        Signaling NaN Test Functions
        Raise-Exception Function
    Contact Information



-------------------------------------------------------------------------------
Legal Notice

SoftFloat was written by John R. Hauser.  This work was made possible in
part by the International Computer Science Institute, located at Suite 600,
1947 Center Street, Berkeley, California 94704.  Funding was partially
provided by the National Science Foundation under grant MIP-9311980.  The
original version of this code was written as part of a project to build
a fixed-point vector processor in collaboration with the University of
California at Berkeley, overseen by Profs. Nelson Morgan and John Wawrzynek.

THIS SOFTWARE IS DISTRIBUTED AS IS, FOR FREE.  Although reasonable effort
has been made to avoid it, THIS SOFTWARE MAY CONTAIN FAULTS THAT WILL AT
TIMES RESULT IN INCORRECT BEHAVIOR.  USE OF THIS SOFTWARE IS RESTRICTED TO
PERSONS AND ORGANIZATIONS WHO CAN AND WILL TAKE FULL RESPONSIBILITY FOR ANY
AND ALL LOSSES, COSTS, OR OTHER PROBLEMS ARISING FROM ITS USE.


-------------------------------------------------------------------------------
Types and Functions

When 64-bit integers are supported by the compiler, the `softfloat.h' header
file defines four types:  `float32' (single precision), `float64' (double
precision), `floatx80' (extended double precision), and `float128'
(quadruple precision).  The `float32' and `float64' types are defined in
terms of 32-bit and 64-bit integer types, respectively, while the `float128'
type is defined as a structure of two 64-bit integers, taking into account
the byte order of the particular machine being used.  The `floatx80' type
is defined as a structure containing one 16-bit and one 64-bit integer, with
the machine's byte order again determining the order of the `high' and `low'
fields.

When 64-bit integers are _not_ supported by the compiler, the `softfloat.h'
header file defines only two types:  `float32' and `float64'.  Because
ISO/ANSI C guarantees at least one built-in integer type of 32 bits,
the `float32' type is identified with an appropriate integer type.  The
`float64' type is defined as a structure of two 32-bit integers, with the
machine's byte order determining the order of the fields.

In either case, the types in `softfloat.h' are defined such that if a system
implements the usual C `float' and `double' types according to the IEC/IEEE
Standard, then the `float32' and `float64' types should be indistinguishable
in memory from the native `float' and `double' types.  (On the other hand,
when `float32' or `float64' values are placed in processor registers by
the compiler, the type of registers used may differ from those used for the
native `float' and `double' types.)
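
The memory-layout property described above can be illustrated with a minimal
sketch.  It assumes a host whose native `float' follows the IEC/IEEE
single-precision format.  The typedef `sketch_float32' and both helper
functions are hypothetical stand-ins and are not part of SoftFloat; the real
`float32' comes from `softfloat.h'.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for SoftFloat's `float32'.  The real definition
   lives in `softfloat.h' and may use a different 32-bit integer type. */
typedef uint32_t sketch_float32;

/* Copy the bit pattern of a native float into the integer-based type.
   memcpy is used so that no C aliasing rules are violated. */
static sketch_float32 sketch_from_native(float f)
{
    sketch_float32 a;
    memcpy(&a, &f, sizeof a);
    return a;
}

/* Copy the bit pattern back into a native float. */
static float sketch_to_native(sketch_float32 a)
{
    float f;
    memcpy(&f, &a, sizeof f);
    return f;
}
```

Under the stated assumption, the native value 1.0f maps to the bit pattern
0x3F800000, and the pattern 0xC0000000 maps back to -2.0f.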

SoftFloat implements the following arithmetic operations:

-- Conversions among all the floating-point formats, and also between
   integers (32-bit and 64-bit) and any of the floating-point formats.

-- The usual add, subtract, multiply, divide, and square root operations
   for all floating-point formats.

-- For each format, the floating-point remainder operation defined by the
   IEC/IEEE Standard.

-- For each floating-point format, a ``round to integer'' operation that
   rounds to the nearest integer value in the same format.  (The
   floating-point formats can hold integer values, of course.)

-- Comparisons between two values in the same floating-point format.

The only functions required by the IEC/IEEE Standard that are not provided
are conversions to and from decimal.


-------------------------------------------------------------------------------
Rounding Modes

All four rounding modes prescribed by the IEC/IEEE Standard are implemented
for all operations that require rounding.  The rounding mode is selected
by the global variable `float_rounding_mode'.  This variable may be set
to one of the values `float_round_nearest_even', `float_round_to_zero',
`float_round_down', or `float_round_up'.  The rounding mode is initialized
to nearest/even.
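
As a rough illustration of how a global mode variable can steer rounding,
the following sketch rounds a sign-and-magnitude value carrying 7 fraction
bits to an integer magnitude.  All names beginning with `sketch_' are
hypothetical stand-ins; the real variable and mode values are defined by
`softfloat.h', and this is not claimed to be SoftFloat's own code.

```c
#include <stdint.h>

/* Hypothetical stand-ins for the four rounding-mode values. */
enum {
    sketch_round_nearest_even,
    sketch_round_to_zero,
    sketch_round_down,
    sketch_round_up
};

/* Hypothetical stand-in for `float_rounding_mode'. */
static int sketch_rounding_mode = sketch_round_nearest_even;

/* Round the value (sign ? -1 : +1) * (mag / 128) to an integer and return
   the magnitude of the result.  `mag' has 7 fraction bits. */
static uint32_t sketch_round_magnitude(int sign, uint32_t mag)
{
    uint32_t round_bits = mag & 0x7F;   /* fraction bits to be discarded */
    uint32_t result = mag >> 7;         /* magnitude truncated toward zero */

    switch (sketch_rounding_mode) {
    case sketch_round_nearest_even:
        /* Round up above the halfway point, or at it when result is odd. */
        if (round_bits > 0x40 || (round_bits == 0x40 && (result & 1)))
            result++;
        break;
    case sketch_round_to_zero:
        break;
    case sketch_round_down:             /* toward minus infinity */
        if (sign && round_bits)
            result++;
        break;
    case sketch_round_up:               /* toward plus infinity */
        if (!sign && round_bits)
            result++;
        break;
    }
    return result;
}
```

With `mag' set to 320 (that is, 2.5), nearest/even yields 2, while
`sketch_round_up' yields 3 for a positive sign and 2 for a negative sign.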


-------------------------------------------------------------------------------
Extended Double-Precision Rounding Precision

For extended double precision (`floatx80') only, the rounding precision
of the standard arithmetic operations is controlled by the global variable
`floatx80_rounding_precision'.  The operations affected are:

   floatx80_add   floatx80_sub   floatx80_mul   floatx80_div   floatx80_sqrt

When `floatx80_rounding_precision' is set to its default value of 80, these
operations are rounded (as usual) to the full precision of the extended
double-precision format.  Setting `floatx80_rounding_precision' to 32
or to 64 causes the operations listed to be rounded to reduced precision
equivalent to single precision (`float32') or to double precision
(`float64'), respectively.  When rounding to reduced precision, additional
bits in the result significand beyond the rounding point are set to zero.
The consequences of setting `floatx80_rounding_precision' to a value other
than 32, 64, or 80 are not specified.  Operations other than the ones listed
above are not affected by `floatx80_rounding_precision'.
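
The effect of reduced rounding precision on the 64-bit `floatx80'
significand can be sketched as follows.  The helper `sketch_round_sig' is
hypothetical and handles only the nearest/even mode; it assumes that
single precision keeps 24 significand bits and double precision keeps 53,
as in the IEC/IEEE formats.  It is an illustration of the description above,
not SoftFloat's actual routine.

```c
#include <stdint.h>

/* Hypothetical helper: round a 64-bit significand to the width selected by
   `precision' (32, 64, or 80), nearest/even only.  *carry is set when the
   rounding overflows the significand, in which case a caller would also
   have to increment the exponent. */
static uint64_t sketch_round_sig(uint64_t sig, int precision, int *carry)
{
    uint64_t mask, half, rounded;

    *carry = 0;
    if (precision == 32)
        mask = UINT64_C(0x000000FFFFFFFFFF);   /* keep the top 24 bits */
    else if (precision == 64)
        mask = UINT64_C(0x00000000000007FF);   /* keep the top 53 bits */
    else
        return sig;                            /* 80: keep all 64 bits */

    half = (mask >> 1) + 1;                    /* half of the lowest kept bit */
    rounded = sig + half;
    if (rounded < sig) {                       /* addition wrapped around */
        *carry = 1;
        return UINT64_C(0x8000000000000000);
    }
    if ((sig & mask) == half)
        rounded &= ~(mask + 1);                /* exact tie: force even */
    return rounded & ~mask;                    /* zero bits past rounding point */
}
```

In every reduced-precision case the returned significand has zeros in all
bit positions below the rounding point, matching the behavior described
above.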


-------------------------------------------------------------------------------
Exceptions and Exception Flags

All five exception flags required by the IEC/IEEE Standard are
implemented.  Each flag is stored as a unique bit in the global variable
`float_exception_flags'.  The positions of the exception flag bits within
this variable are determined by the bit masks `float_flag_inexact',
`float_flag_underflow', `float_flag_overflow', `float_flag_divbyzero', and
`float_flag_invalid'.  The exception flags variable is initialized to all 0,
meaning no exceptions.

An individual exception flag can be cleared with the statement

    float_exception_flags &= ~ float_flag_<exception>;

where `<exception>' is the appropriate name.  To raise a floating-point
exception, the SoftFloat function `float_raise' should be used (see below).
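
The flag handling just described can be sketched in a self-contained form.
Everything named `sketch_...' below is a hypothetical stand-in: the real
variable, masks, and `float_raise' are declared by `softfloat.h', the actual
bit values depend on the target, and the real `float_raise' may do more than
set flags.

```c
/* Hypothetical stand-ins for the five flag masks.  The real bit positions
   are target-dependent and are given by `softfloat.h'. */
enum {
    sketch_flag_inexact   = 1,
    sketch_flag_underflow = 2,
    sketch_flag_overflow  = 4,
    sketch_flag_divbyzero = 8,
    sketch_flag_invalid   = 16
};

/* Hypothetical stand-in for `float_exception_flags', initially all 0. */
static int sketch_exception_flags = 0;

/* Stand-in for `float_raise': this version only sets the flag bits. */
static void sketch_raise(int flags)
{
    sketch_exception_flags |= flags;
}

/* Report whether one flag is set, then clear it using the same statement
   form as shown in the text. */
static int sketch_test_and_clear(int flag)
{
    int was_set = (sketch_exception_flags & flag) != 0;
    sketch_exception_flags &= ~flag;
    return was_set;
}
```

A typical pattern is to clear the flags of interest before a computation and
test them afterward, so that flags left over from earlier operations are not
misattributed.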

In the terminology of the IEC/IEEE Standard, SoftFloat can detect tininess
for underflow either before or after rounding.  The choice is made by
the global variable `float_detect_tininess', which can be set to either
`float_tininess_before_rounding' or `float_tininess_after_rounding'.
Detecting tininess after rounding is better because it results in fewer
spurious underflow signals.  The other option is provided for compatibility
with some systems.  Like most systems, SoftFloat always detects loss of
accuracy for underflow as an inexact result.


-------------------------------------------------------------------------------
Function Details

- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Conversion Functions

All conversions among the floating-point formats are supported, as are all
conversions between a floating-point format and 32-bit and 64-bit signed
integers.  The complete set of conversion functions is:

   int32_to_float32      int64_to_float32
   int32_to_float64      int64_to_float64
   int32_to_floatx80     int64_to_floatx80
   int32_to_float128     int64_to_float128

   float32_to_int32      float32_to_int64
   float64_to_int32      float64_to_int64
   floatx80_to_int32     floatx80_to_int64
   float128_to_int32     float128_to_int64

   float32_to_float64    float32_to_floatx80   float32_to_float128
   float64_to_float32    float64_to_floatx80   float64_to_float128
   floatx80_to_float32   floatx80_to_float64   floatx80_to_float128
   float128_to_float32   float128_to_float64   float128_to_floatx80

Each conversion function takes one operand of the appropriate type and
returns one result.  Conversions from a smaller to a larger floating-point
format are always exact and so require no rounding.  Conversions from 32-bit
integers to double precision and larger formats are also exact, and likewise
for conversions from 64-bit integers to extended double and quadruple
precisions.

Conversions from floating-point to integer raise the invalid exception if
the source value cannot be rounded to a representable integer of the desired
size (32 or 64 bits).  If the floating-point operand is a NaN, the largest
positive integer is returned.  Otherwise, if the conversion overflows, the
largest integer with the same sign as the operand is returned.

On conversions to integer, if the floating-point operand is not already an
integer value, the operand is rounded according to the current rounding
mode as specified by `float_rounding_mode'.  Because C (and perhaps other
languages) requires that conversions to integers be rounded toward zero, the
following functions are provided for improved speed and convenience:

   float32_to_int32_round_to_zero    float32_to_int64_round_to_zero
   float64_to_int32_round_to_zero    float64_to_int64_round_to_zero
   floatx80_to_int32_round_to_zero   floatx80_to_int64_round_to_zero
   float128_to_int32_round_to_zero   float128_to_int64_round_to_zero

These variant functions ignore `float_rounding_mode' and always round toward
zero.
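
The result policy described above for out-of-range and NaN operands can be
mimicked with native arithmetic.  The sketch below assumes a host whose
`double' is the IEC/IEEE double-precision format; `sketch_invalid' and
`sketch_to_int32_round_to_zero' are hypothetical names, and the code imitates
the documented results of a round-toward-zero conversion to 32 bits rather
than reproducing SoftFloat's implementation.

```c
#include <stdint.h>

/* Hypothetical stand-in for the invalid exception flag. */
static int sketch_invalid = 0;

/* Convert a native double to a 32-bit integer, rounding toward zero and
   following the documented policy for NaNs and overflow. */
static int32_t sketch_to_int32_round_to_zero(double x)
{
    if (x != x) {                      /* a NaN compares unequal to itself */
        sketch_invalid = 1;
        return INT32_MAX;              /* NaN: largest positive integer */
    }
    if (x >= 2147483648.0) {           /* too large even after truncation */
        sketch_invalid = 1;
        return INT32_MAX;
    }
    if (x <= -2147483649.0) {          /* too negative even after truncation */
        sketch_invalid = 1;
        return INT32_MIN;              /* largest magnitude, same sign */
    }
    return (int32_t)x;                 /* C conversion truncates toward zero */
}
```

The explicit range checks matter because a plain C cast of an out-of-range
or NaN value has undefined behavior, whereas the SoftFloat functions return
a defined result and raise the invalid exception.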

- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Standard Arithmetic Functions

The following standard arithmetic functions are provided:

   float32_add    float32_sub    float32_mul    float32_div    float32_sqrt
   float64_add    float64_sub    float64_mul    float64_div    float64_sqrt
   floatx80_add   floatx80_sub   floatx80_mul   floatx80_div   floatx80_sqrt
   float128_add   float128_sub   float128_mul   float128_div   float128_sqrt

Each function takes two operands, except for `sqrt' which takes only one.
The operands and result are all of the same type.

Rounding of the extended double-precision (`floatx80') functions is affected
by the `floatx80_rounding_precision' variable, as explained above in the
section _Extended_Double-Precision_Rounding_Precision_.

- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Remainder Functions

For each format, SoftFloat implements the remainder function according to
the IEC/IEEE Standard.  The remainder functions are:

   float32_rem
   float64_rem
   floatx80_rem
   float128_rem

Each remainder function takes two operands.  The operands and result are all
of the same type.  Given operands x and y, the remainder functions return
the value x - n*y, where n is the integer closest to x/y.  If x/y is exactly
halfway between two integers, n is the even integer closest to x/y.  The
remainder functions are always exact and so require no rounding.
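
The definition x - n*y can be made concrete with a small sketch on native
`double' values.  The function `sketch_rem' is hypothetical and deliberately
naive: it is only meaningful for operands of modest magnitude for which the
quotient is computed exactly and fits in a `long long', and y must be
nonzero.  Unlike the SoftFloat functions, it is not exact in general.

```c
/* Naive illustration of the IEC/IEEE remainder: x - n*y, where n is the
   integer nearest to x/y, with ties going to the even integer. */
static double sketch_rem(double x, double y)
{
    double q = x / y;
    long long n = (long long)q;        /* truncate toward zero */
    double frac = q - (double)n;       /* in (-1, 1), same sign as q */

    /* Move n away from zero when the fraction is past the halfway point,
       or exactly at it with n odd (so that the final n is even). */
    if (frac > 0.5 || (frac == 0.5 && n % 2 != 0))
        n++;
    else if (frac < -0.5 || (frac == -0.5 && n % 2 != 0))
        n--;
    return x - (double)n * y;
}
```

For x = 5 and y = 2 the quotient 2.5 is a tie, so n = 2 and the result is 1.
For x = 7 and y = 2 the quotient 3.5 is also a tie, so n = 4 and the result
is -1, which shows that the remainder can be negative even for positive
operands.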

Depending on the relative magnitudes of the operands, the remainder
functions can take considerably longer to execute than the other SoftFloat
functions.  This is inherent in the remainder operation itself and is not a
flaw in the SoftFloat implementation.

- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Round-to-Integer Functions

For each format, SoftFloat implements the round-to-integer function
specified by the IEC/IEEE Standard.  The functions are:

   float32_round_to_int
   float64_round_to_int
   floatx80_round_to_int
   float128_round_to_int

Each function takes a single floating-point operand and returns a result of
the same type.  (Note that the result is not an integer type.)  The operand
is rounded to an exact integer according to the current rounding mode, and
the resulting integer value is returned in the same floating-point format.

- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Comparison Functions

The following floating-point comparison functions are provided:

   float32_eq    float32_le    float32_lt
   float64_eq    float64_le    float64_lt
   floatx80_eq   floatx80_le   floatx80_lt
   float128_eq   float128_le   float128_lt

Each function takes two operands of the same type and returns a 1 or 0
representing either _true_ or _false_.  The abbreviation `eq' stands for
``equal'' (=); `le' stands for ``less than or equal'' (<=); and `lt' stands
for ``less than'' (<).

The standard greater-than (>), greater-than-or-equal (>=), and not-equal
(!=) functions are easily obtained using the functions provided.  The
not-equal function is just the logical complement of the equal function.
The greater-than-or-equal function is identical to the less-than-or-equal
function with the operands reversed; and the greater-than function can be
obtained from the less-than function in the same way.
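
The derivations just described can be sketched as thin wrappers.  So that
the example is self-contained, `sketch_eq', `sketch_le', and `sketch_lt' are
hypothetical stand-ins for `float32_eq', `float32_le', and `float32_lt',
built on native comparisons under the assumption that the host `float' is
IEC/IEEE single precision; they ignore exception flags entirely.  With
SoftFloat itself, the three wrappers at the end would call the real
functions instead.

```c
#include <stdint.h>
#include <string.h>

typedef uint32_t sketch_float32;   /* hypothetical stand-in for `float32' */

/* Reinterpret the 32-bit pattern as a native float. */
static float sketch_native(sketch_float32 a)
{
    float f;
    memcpy(&f, &a, sizeof f);
    return f;
}

/* Stand-ins for the three provided comparisons (no exception flags). */
static int sketch_eq(sketch_float32 a, sketch_float32 b)
{
    return sketch_native(a) == sketch_native(b);
}

static int sketch_le(sketch_float32 a, sketch_float32 b)
{
    return sketch_native(a) <= sketch_native(b);
}

static int sketch_lt(sketch_float32 a, sketch_float32 b)
{
    return sketch_native(a) < sketch_native(b);
}

/* Not-equal is the logical complement of equal. */
static int sketch_ne(sketch_float32 a, sketch_float32 b)
{
    return !sketch_eq(a, b);
}

/* Greater-than-or-equal is less-than-or-equal with operands reversed. */
static int sketch_ge(sketch_float32 a, sketch_float32 b)
{
    return sketch_le(b, a);
}

/* Greater-than is less-than with operands reversed. */
static int sketch_gt(sketch_float32 a, sketch_float32 b)
{
    return sketch_lt(b, a);
}
```

Note that greater-than is obtained by reversing operands rather than by
complementing less-than-or-equal; the two differ when an operand is a NaN,
since every ordered comparison involving a NaN is false.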

The IEC/IEEE Standard specifies that the less-than-or-equal and less-than
functions raise the invalid exception if either input is any kind of NaN.
The equal functions, on the other hand, are defined not to raise the invalid
exception on quiet NaNs.  For completeness, SoftFloat provides the following
additional functions:

   float32_eq_signaling    float32_le_quiet    float32_lt_quiet
   float64_eq_signaling    float64_le_quiet    float64_lt_quiet
   floatx80_eq_signaling   floatx80_le_quiet   floatx80_lt_quiet
   float128_eq_signaling   float128_le_quiet   float128_lt_quiet

The `signaling' equal functions are identical to the standard functions
except that the invalid exception is raised for any NaN input.  Likewise,
the `quiet' comparison functions are identical to their counterparts except
that the invalid exception is not raised for quiet NaNs.

- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Signaling NaN Test Functions

The following functions test whether a floating-point value is a signaling
NaN:

   float32_is_signaling_nan
   float64_is_signaling_nan
   floatx80_is_signaling_nan
   float128_is_signaling_nan

The functions take one operand and return 1 if the operand is a signaling
NaN and 0 otherwise.
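
A test of this kind can be sketched directly on the single-precision bit
pattern.  The function below is hypothetical and assumes the common
convention that a NaN is signaling when the most significant bit of its
fraction field is 0.  The distinction between quiet and signaling NaNs is
target-dependent, so the real `float32_is_signaling_nan' may use a different
rule on some systems.

```c
#include <stdint.h>

/* Hypothetical sketch: return 1 if the 32-bit pattern `a' encodes a
   signaling NaN under the convention "top fraction bit clear = signaling". */
static int sketch_float32_is_signaling_nan(uint32_t a)
{
    uint32_t exponent = (a >> 23) & 0xFF;      /* 8-bit exponent field */
    uint32_t fraction = a & 0x007FFFFF;        /* 23-bit fraction field */

    return exponent == 0xFF                    /* NaN or infinity */
        && fraction != 0                       /* not infinity */
        && (fraction & 0x00400000) == 0;       /* quiet bit is clear */
}
```

Under this convention the pattern 0x7FA00000 is a signaling NaN, while
0x7FC00000 is a quiet NaN and 0x7F800000 is positive infinity.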

- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Raise-Exception Function

SoftFloat provides a function for raising floating-point exceptions:

    float_raise

The function takes a mask indicating the set of exceptions to raise.  No
result is returned.  In addition to setting the specified exception flags,
this function may cause a trap or abort appropriate for the current system.

- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -


-------------------------------------------------------------------------------
Contact Information

At the time of this writing, the most up-to-date information about
SoftFloat and the latest release can be found at the Web page
`http://HTTP.CS.Berkeley.EDU/~jhauser/arithmetic/SoftFloat.html'.