SwiftDtoa v2: Better, Smaller, Faster floating-point formatting #35299

tbkka · 2021-01-07T22:01:35Z

SwiftDtoa is the C/C++ code used in the Swift runtime to produce the textual representations used by the description and debugDescription properties of the standard Swift floating-point types. This update includes a number of algorithmic improvements to SwiftDtoa to improve portability, reduce code size, and improve performance but does not change the actual output.

About SwiftDtoa

In early versions of Swift, the description properties used the C library sprintf functionality with a fixed number of digits. In April 2018, PR #15474 replaced that logic with the first version of SwiftDtoa, which used a fast, adaptive algorithm to automatically choose the correct number of digits for a particular value. The resulting decimal output is always:

Accurate. Parsing the decimal form will yield exactly the same binary floating-point value again. This guarantee holds for any parser that accurately implements IEEE 754. In particular, the Swift standard library can guarantee that for any Double d that is not a NaN, Double(d.description) == d.
Short. Among all accurate forms, this form has the fewest significant digits. (Caution: Surprisingly, this is not the same as minimizing the number of characters. In some cases, minimizing the number of characters requires producing additional significant digits.)
Close. If there are multiple accurate, short forms, this code chooses the decimal form that is closest to the exact binary value. If there are two exactly the same distance, the one with an even final digit will be used.
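To make the "Accurate" and "Short" properties concrete, here is a minimal brute-force sketch for binary64. It is not how SwiftDtoa works: it simply asks the C library for 1, 2, ... 17 significant digits and stops at the first string that parses back to the same value. The function name `shortest_round_trip` is hypothetical, and the sketch assumes a C library whose `snprintf` and `strtod` are correctly rounded. Because the round-trip interval is asymmetric just above a power of two, this approach can occasionally report one digit more than the true minimum.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

/* Brute-force illustration only (hypothetical helper, not SwiftDtoa code).
 * Formats d with an increasing number of significant digits and returns the
 * first digit count whose text parses back to exactly d. The text is left
 * in buf in "%e" style. Returns -1 if no count up to 17 round-trips, which
 * happens for NaN because NaN never compares equal to itself. */
static int shortest_round_trip(double d, char *buf, size_t bufsize) {
    assert(bufsize >= 32);
    for (int digits = 1; digits <= 17; digits++) {
        /* "%.*e" with precision digits-1 yields `digits` significant digits. */
        snprintf(buf, bufsize, "%.*e", digits - 1, d);
        if (strtod(buf, NULL) == d) {
            return digits;
        }
    }
    return -1;
}
```

For example, `shortest_round_trip(0.1, buf, sizeof buf)` needs only one digit, while the value `0.30000000000000004` needs all 17.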

Algorithms that can produce this "optimal" output have been known since at least 1990, when Steele and White published their Dragon4 algorithm. However, Dragon4 and other algorithms from that period relied on high-precision integer arithmetic, which made them slow. More recently, a surge of interest in this problem has produced dramatically better algorithms that can produce the same results using only fast fixed-precision arithmetic.

This format is ideal for JSON and other textual interchange: accuracy ensures that the value will be correctly decoded, shortness minimizes network traffic, and the existence of high-performance algorithms allows this form to be generated more quickly than many printf-based implementations.

This format is also ideal for logging, debugging, and other general display. In particular, the shortness guarantee avoids the confusion of unnecessary additional digits, so that the result of 1.0 / 10.0 consistently displays as 0.1 instead of 0.100000000000000000001.

About SwiftDtoa v2

Compared to the original SwiftDtoa code, this update is:

Better: The core logic is implemented using only C99 features with 64-bit and smaller integer arithmetic. If available, 128-bit integers are used for better performance. The core routines do not require any floating-point support from the C/C++ standard library and with only minor modifications should be usable on systems with no hardware or software floating-point support at all. This version also has experimental support for IEEE 754 binary128 format, though this support is obviously not included when compiling for the Swift standard library.

Smaller: Code size reduction compared to the earlier versions was a primary goal for this effort. In particular, float80 and binary128 share most of their core code, avoiding a full additional copy of the primary algorithm.

Faster: Even with the code size reductions, all formats are noticeably faster than before. The primary performance gains come from three major changes: ASCII digits are now emitted directly in the core routine in a form that requires only minimal adjustment to produce the final text. The main digit generation produces 2, 4, or 8 digits at a time when possible. Finally, the double logic optimistically produces 7 digits in the initial scaling with a Ryu-inspired backtracking when fewer digits suffice.
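As a rough illustration of producing several digits at a time, the sketch below writes eight ASCII digits using four table lookups instead of eight divide-by-ten steps. The table name `DIGIT_PAIRS` and the function `write_8_digits` are hypothetical; the actual SwiftDtoa routines may be organized quite differently.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* "00", "01", ... "99" packed back to back: 200 bytes of ASCII digits. */
static const char DIGIT_PAIRS[] =
    "0001020304050607080910111213141516171819"
    "2021222324252627282930313233343536373839"
    "4041424344454647484950515253545556575859"
    "6061626364656667686970717273747576777879"
    "8081828384858687888990919293949596979899";

/* Writes exactly 8 ASCII digits (zero-padded on the left) for a value below
 * 100000000. No terminating NUL is written. */
static void write_8_digits(uint32_t value, char *dest) {
    assert(value < 100000000u);
    for (int i = 3; i >= 0; i--) {
        uint32_t pair = value % 100;   /* next two decimal digits */
        value /= 100;
        memcpy(dest + 2 * i, DIGIT_PAIRS + 2 * pair, 2);
    }
}
```

Calling `write_8_digits(42, out)` fills `out` with `00000042`; the caller is responsible for trimming or placing the decimal point afterwards.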

SwiftDtoa's algorithms

SwiftDtoa started out as a variation of Florian Loitsch's Grisu2 that addressed the shortness failures of that algorithm. Specifically, it uses wider arithmetic (128-bit for binary64) and a novel initial interval scaling that together address all of the weaknesses of Grisu2. Subsequent work has incorporated ideas from Errol3, Ryu, and other sources to yield a production-quality implementation that is performance- and size-competitive with current research code.
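The wider arithmetic mentioned above has to work both with and without compiler support for 128-bit integers. A minimal sketch of that idea, assuming the GCC/Clang `__SIZEOF_INT128__` macro as the feature test, computes the upper 64 bits of a 64x64 product either directly or from four 32x32 partial products. The function names are hypothetical and are not taken from the SwiftDtoa source.

```c
#include <assert.h>
#include <stdint.h>

/* Upper 64 bits of a * b using only 64-bit arithmetic. */
static uint64_t multiply_high_64_portable(uint64_t a, uint64_t b) {
    uint64_t a_lo = a & 0xFFFFFFFFu, a_hi = a >> 32;
    uint64_t b_lo = b & 0xFFFFFFFFu, b_hi = b >> 32;
    uint64_t lo_lo = a_lo * b_lo;
    uint64_t hi_lo = a_hi * b_lo;
    uint64_t lo_hi = a_lo * b_hi;
    uint64_t hi_hi = a_hi * b_hi;
    /* Sum the pieces that land in bits 32..63 to find the carry into bit 64. */
    uint64_t mid = (lo_lo >> 32) + (hi_lo & 0xFFFFFFFFu) + (lo_hi & 0xFFFFFFFFu);
    return hi_hi + (hi_lo >> 32) + (lo_hi >> 32) + (mid >> 32);
}

/* Upper 64 bits of a * b, using a native 128-bit type when one exists. */
static uint64_t multiply_high_64(uint64_t a, uint64_t b) {
#if defined(__SIZEOF_INT128__)
    uint64_t high = (uint64_t)(((unsigned __int128)a * b) >> 64);
    /* Cross-check against the portable path; compiled out with NDEBUG. */
    assert(high == multiply_high_64_portable(a, b));
    return high;
#else
    return multiply_high_64_portable(a, b);
#endif
}
```

Both paths return the same result; the native path simply lets the compiler use a single widening multiply where the hardware has one.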

Those who wish to understand the details can read the extensive comments included in the code. Note that float16 actually uses a different algorithm than the other formats, as the extremely limited range can be handled with much simpler techniques. The float80/binary128 logic sacrifices some performance optimizations in order to minimize the code size for these less-used formats; the goal for SwiftDtoa v2 has been to match the float80 performance of earlier implementations while reducing code size and widening the arithmetic routines sufficiently to support binary128.

SwiftDtoa Testing

A newly-developed test harness generates several large files of test data that include known-correct results computed with high-precision arithmetic routines. The test files include:

Critical values generated by the algorithm presented in the Errol paper (about 48 million cases for binary128)
Values for which the optimal decimal form is exactly midway between two binary floating-point values.
All exact powers of two representable in this format.
Floating-point values that are close to exact powers of ten.
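For instance, the "all exact powers of two" category can be enumerated directly from bit patterns. The following is a sketch for binary64 only, with hypothetical function names, and is not the actual test harness: subnormal powers of two have a zero exponent field and a single significand bit set, while normal ones have a zero significand field.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Reinterprets a 64-bit pattern as a double (assumes double is binary64). */
static double double_from_bits(uint64_t bits) {
    double d;
    assert(sizeof d == sizeof bits);
    memcpy(&d, &bits, sizeof d);
    return d;
}

/* Calls visit (if non-NULL) on every positive power of two representable in
 * binary64, from 2^-1074 up to 2^1023, and returns how many there are. */
static int for_each_power_of_two(void (*visit)(double)) {
    int count = 0;
    /* Subnormals: exponent field 0, exactly one of the 52 fraction bits set. */
    for (int bit = 0; bit < 52; bit++) {
        if (visit != NULL) visit(double_from_bits(UINT64_C(1) << bit));
        count++;
    }
    /* Normals: exponent field 1...2046, fraction field zero. */
    for (uint64_t exponent = 1; exponent <= 2046; exponent++) {
        if (visit != NULL) visit(double_from_bits(exponent << 52));
        count++;
    }
    return count;
}
```

For binary64 this visits 2098 values, which is small enough to check every one against a high-precision reference.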

In addition, several billion random values for each format were compared to the results from other implementations. For binary16 and binary32 this provided exhaustive validation of every possible input value.

Code Size and Performance

The tables below summarize the code size and performance for the SwiftDtoa C library module on several different processor architectures, with comparisons to several other implementations. When used from Swift, the .description and .debugDescription implementations incur additional overhead for creating and returning Swift strings that are not captured here.

The code size tables show the total size in bytes of the compiled and stripped .o object files for a particular version of that code. The headings indicate the floating-point formats supported by that particular build (e.g., "16,32" for a version that supports binary16 and binary32 but no other formats).

The performance numbers below were obtained from a custom test harness that generates random bit patterns, interprets them as the corresponding floating-point value, and averages the overall time. For float80, the random bit patterns were generated in a way that avoids generating invalid values.
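The general shape of such a measurement can be sketched as follows. This is not the harness used for the numbers below: the xorshift generator, the `clock()` timer, and the `snprintf`-based stand-in formatter are all assumptions chosen to keep the example self-contained, and the measured time here includes the cost of generating the random bits.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <time.h>

/* xorshift64: a small deterministic PRNG. The state must be nonzero. */
static uint64_t xorshift64(uint64_t *state) {
    uint64_t x = *state;
    x ^= x << 13;
    x ^= x >> 7;
    x ^= x << 17;
    return *state = x;
}

/* Stand-in formatter for the sketch; a real comparison would plug in the
 * routine under test here. */
static int format_with_printf(double value, char *buf, size_t bufsize) {
    return snprintf(buf, bufsize, "%.17g", value);
}

/* Formats `iterations` doubles built from random bit patterns and returns
 * the average time per value in nanoseconds. */
static double average_ns_per_value(int (*format)(double, char *, size_t),
                                   uint64_t seed, long iterations) {
    assert(seed != 0 && iterations > 0);
    char buf[64];
    uint64_t state = seed;
    volatile long sink = 0;   /* keeps the calls from being optimized away */
    clock_t start = clock();
    for (long i = 0; i < iterations; i++) {
        uint64_t bits = xorshift64(&state);
        double value;
        memcpy(&value, &bits, sizeof value);   /* reinterpret bits as binary64 */
        sink += format(value, buf, sizeof buf);
    }
    clock_t end = clock();
    return (double)(end - start) / CLOCKS_PER_SEC * 1e9 / (double)iterations;
}
```

Uniformly random bit patterns cover the whole exponent range, including NaNs and subnormals, so the average reflects a different mix than values typically seen in application data.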

All code was compiled with the system C/C++ compiler using -O2 optimization.

A few notes about particular implementations mentioned below:

SwiftDtoa v1 is the original SwiftDtoa implementation as committed to the Swift runtime in April 2018.
SwiftDtoa v1a is the same as SwiftDtoa v1 with added binary16 support.
SwiftDtoa v2 can be configured with preprocessor macros to support any subset of the supported formats. I've provided sizes here for several different build configurations.
Ryu (Ulf Adams) implements binary32 and binary64 as completely independent source files. The size here is the total size of the two .o object files. The Ryu(size) version was built with the RYU_OPTIMIZE_SIZE option.
Dragonbox (Junekey Jeon) is provided as a header-only C++ template implementation. The size here is the compiled size of a simple .cpp file that instantiates the template for the specified formats. The Dragonbox(size) version used the cache::compressed policy to reduce the size of the lookup tables.
gdtoa (David M. Gay) has a very large feature set. For this reason, I excluded it from the code size comparison since I didn't consider the numbers to be comparable to the others.

x86_64

These were built using Apple clang 12.0.5 on a 2019 16" MacBook Pro (2.4GHz 8-core Intel Core i9) running macOS 11.1.

Code Size

Bold numbers here indicate the configurations that have shipped as part of the Swift runtime.

	16,32,64,80	32,64,80	32,64
SwiftDtoa v1		**15128**
SwiftDtoa v1a	**16888**
SwiftDtoa v2	**20220**	18628	8248
Ryu			40408
Ryu(size)			23836
Dragonbox			23176
Dragonbox(size)			15132

Performance

	binary16	binary32	binary64	float80	binary128
SwiftDtoa v1		25ns	46ns	82ns
SwiftDtoa v1a	37ns	26ns	47ns	83ns
SwiftDtoa v2	22ns	19ns	31ns	72ns	90ns
Ryu		19ns	26ns
Ryu(size)		17ns	24ns
Dragonbox		19ns	24ns
Dragonbox(size)		19ns	29ns
gdtoa	220ns	381ns	1184ns	16044ns	22800ns

ARM64

These were built using Apple clang 12.0.0 on a 2020 M1 Mac Mini running macOS 11.1.

Code Size

	16,32,64	32,64
SwiftDtoa v1		7436
SwiftDtoa v1a	9124
SwiftDtoa v2	9964	8228
Ryu		35764
Ryu(size)		17872
Dragonbox		27108
Dragonbox(size)		19172

Performance

	binary16	binary32	binary64	float80	binary128
SwiftDtoa v1		21ns	39ns
SwiftDtoa v1a	17ns	21ns	39ns
SwiftDtoa v2	15ns	17ns	29ns	54ns	71ns
Ryu		15ns	19ns
Ryu(size)		29ns	24ns
Dragonbox		16ns	24ns
Dragonbox(size)		15ns	34ns
gdtoa	143ns	242ns	858ns	25129ns	36195ns

ARM32

These were built using clang 8.0.1 on a BeagleBone Black (500MHz ARMv7) running FreeBSD 12.1-RELEASE.

Code Size

	16,32,64	32,64
SwiftDtoa v1		8668
SwiftDtoa v1a	10356
SwiftDtoa v2	9796	8340
Ryu		32292
Ryu(size)		14592
Dragonbox		29000
Dragonbox(size)		21980

Performance

	binary16	binary32	binary64	float80	binary128
SwiftDtoa v1		459ns	1152ns
SwiftDtoa v1a	383ns	451ns	1148ns
SwiftDtoa v2	202ns	357ns	715ns	2720ns	3379ns
Ryu		345ns	5450ns
Ryu(size)		786ns	5577ns
Dragonbox		300ns	904ns
Dragonbox(size)		294ns	1021ns
gdtoa	2180ns	4749ns	18742ns	293000ns	440000ns

EDITED 1/8/2021: Thanks to Junekey Jeon for pointing out that I was not building Ryu or Dragonbox with the correct options to get a smaller table size. I added new rows to some of the tables above with numbers for x86_64 and ARM32.

EDITED 1/8/2021: Turns out I was building some code with -g and some without. I've rebuilt everything without -g and updated the numbers above.

EDITED 1/20/2021: Updated the benchmark results to track some additional code changes: Binary32 now has a separate core implementation, bringing perf in line with Ryu and Dragonbox, with a corresponding increase in code size. Minor perf/size improvements to binary64, float80, binary128.

tbkka · 2021-01-07T22:02:56Z

@swift-ci Please test

tbkka · 2021-01-07T22:03:06Z

@swift-ci Please benchmark

compnerd · 2021-01-07T22:07:48Z

@swift-ci please test Windows platform

tbkka · 2021-01-07T22:11:21Z

Note: I still need to edit the header to ensure that binary128 is truly disabled for all Swift builds. Currently, it would get built (but not used) on a platform where long double happens to be binary128.

compnerd · 2021-01-07T22:47:28Z

@tbkka - the only platforms we need to worry about for that are PPC and AArch64 (non-Darwin). Windows AArch64 does not support FP128 and IIRC, neither does Darwin AArch64.

swift-ci · 2021-01-07T23:17:25Z

Performance: -O

Regression	OLD	NEW	DELTA	RATIO
LessSubstringSubstringGenericComparable	35	39	+11.4%	0.90x
ObjectiveCBridgeFromNSSetAnyObjectToString	61000	67500	+10.7%	0.90x (?)
FloatingPointPrinting_Float_description_uniform	5000	5500	+10.0%	0.91x (?)
LessSubstringSubstring	35	38	+8.6%	0.92x (?)
EqualSubstringSubstring	35	38	+8.6%	0.92x (?)
EqualSubstringString	35	38	+8.6%	0.92x

Improvement	OLD	NEW	DELTA	RATIO
CharIteration_utf16_unicodeScalars	4600	3920	-14.8%	1.17x (?)
FloatingPointPrinting_Double_description_small	18200	16000	-12.1%	1.14x
FloatingPointPrinting_Double_description_uniform	19200	17200	-10.4%	1.12x
RandomShuffleLCG2	432	400	-7.4%	1.08x

Code size: -O

Performance: -Osize

Regression	OLD	NEW	DELTA	RATIO
LessSubstringSubstring	35	39	+11.4%	0.90x
LessSubstringSubstringGenericComparable	35	39	+11.4%	0.90x
ObjectiveCBridgeFromNSStringForced	2220	2465	+11.0%	0.90x (?)
FloatingPointPrinting_Float_description_small	4644	5076	+9.3%	0.91x (?)
UTF8Decode_InitFromData_ascii_as_ascii	635	690	+8.7%	0.92x (?)
EqualSubstringSubstring	35	38	+8.6%	0.92x (?)
EqualStringSubstring	35	38	+8.6%	0.92x (?)
EqualSubstringSubstringGenericEquatable	35	38	+8.6%	0.92x (?)
EqualSubstringString	35	38	+8.6%	0.92x (?)
FloatingPointPrinting_Float_description_uniform	5000	5400	+8.0%	0.93x (?)

Improvement	OLD	NEW	DELTA	RATIO
FloatingPointPrinting_Double_description_small	18200	16000	-12.1%	1.14x (?)
FloatingPointPrinting_Double_description_uniform	19200	17200	-10.4%	1.12x
Array2D	6736	6224	-7.6%	1.08x
FloatingPointPrinting_Double_interpolated	47800	44400	-7.1%	1.08x (?)
StrComplexWalk	4620	4300	-6.9%	1.07x

Code size: -Osize

Performance: -Onone

Regression	OLD	NEW	DELTA	RATIO
DataToStringSmall	4600	5300	+15.2%	0.87x (?)
DropFirstAnyCollectionLazy	141769	154185	+8.8%	0.92x (?)
ArrayOfPOD	921	997	+8.3%	0.92x (?)

Code size: -swiftlibs

How to read the data

The tables contain differences in performance which are larger than 8% and differences in code size which are larger than 1%.

If you see any unexpected regressions, you should consider fixing the
regressions before you merge the PR.

Noise: Sometimes the performance results (not code size!) contain false
alarms. Unexpected regressions which are marked with '(?)' are probably noise.
If you see regressions which you cannot explain you can try to run the
benchmarks again. If regressions still show up, please consult with the
performance team (@eeckstein).

Hardware Overview

  Model Name: Mac Pro
  Model Identifier: MacPro6,1
  Processor Name: 8-Core Intel Xeon E5
  Processor Speed: 3 GHz
  Number of Processors: 1
  Total Number of Cores: 8
  L2 Cache (per Core): 256 KB
  L3 Cache: 25 MB
  Memory: 64 GB

tbkka · 2021-01-07T23:48:16Z

@compnerd What about Android? I thought I heard that Android's long double was binary128?

I also believe that:

Windows has long double as IEEE 754 binary64 on all architectures
Darwin has long double as IEEE 754 binary64 on ARM64 and as Intel x87 float80 on x86 and x86_64.
PPC hardware supports double-double for quad precision, which is different than IEEE 754 binary128 format

The new logic in this header starts by examining the floating-point properties of the C types float, double, and long double as described in the C library's float.h header, which should accurately discriminate all of these cases. (This should be both more portable and more correct than previous versions of this file that hard-coded particular combinations of platforms and processors.)

Based on this inspection, the current SwiftDtoa implementation enables support for particular floating-point formats. The core implementations accept void * pointers to floating-point data in memory and have no reliance on C floating-point types. The implementation separately enables wrappers that accept C floating-point types and dispatch to the appropriate format support. So on Windows, this code would create a long double formatter that dispatches to the binary64 backend, and on Darwin x86_64, the long double formatter would dispatch to the float80 backend.
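A minimal sketch of that kind of inspection, using only the standard `float.h` macros, might look like this. The enum and macro names beginning with `EXAMPLE_` are hypothetical and do not match the names used in the real header.

```c
#include <assert.h>
#include <float.h>

/* Hypothetical labels for the formats long double might map to. */
enum example_format {
    EXAMPLE_BINARY64,
    EXAMPLE_FLOAT80,
    EXAMPLE_BINARY128,
    EXAMPLE_DOUBLE_DOUBLE,
    EXAMPLE_UNKNOWN
};

/* Discriminate by significand width and exponent range instead of
 * hard-coding platform and processor names. */
#if LDBL_MANT_DIG == 53 && LDBL_MAX_EXP == 1024
#  define EXAMPLE_LONG_DOUBLE_FORMAT EXAMPLE_BINARY64      /* e.g. Windows */
#elif LDBL_MANT_DIG == 64 && LDBL_MAX_EXP == 16384
#  define EXAMPLE_LONG_DOUBLE_FORMAT EXAMPLE_FLOAT80       /* x87 extended */
#elif LDBL_MANT_DIG == 113 && LDBL_MAX_EXP == 16384
#  define EXAMPLE_LONG_DOUBLE_FORMAT EXAMPLE_BINARY128     /* IEEE 754 quad */
#elif LDBL_MANT_DIG == 106
#  define EXAMPLE_LONG_DOUBLE_FORMAT EXAMPLE_DOUBLE_DOUBLE /* PPC double-double */
#else
#  define EXAMPLE_LONG_DOUBLE_FORMAT EXAMPLE_UNKNOWN
#endif

/* Returns the format a long double wrapper would dispatch to. */
static enum example_format long_double_format(void) {
    assert(FLT_RADIX == 2);   /* the checks above assume binary floating point */
    return EXAMPLE_LONG_DOUBLE_FORMAT;
}

/* Human-readable name, e.g. for a diagnostic message. */
static const char *long_double_format_name(void) {
    switch (long_double_format()) {
    case EXAMPLE_BINARY64:      return "binary64";
    case EXAMPLE_FLOAT80:       return "float80";
    case EXAMPLE_BINARY128:     return "binary128";
    case EXAMPLE_DOUBLE_DOUBLE: return "double-double";
    default:                    return "unknown";
    }
}
```

A wrapper taking a `long double` argument could then pass its address to whichever core routine matches the selected format.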

Some of this is a little muddled in the current Swift runtime code. Swift Float80 is always float80, which is not necessarily the same as C/C++ long double. For clarity, I think it would be good to be very cautious with the term "long double" since that may or may not be the same as float80.

compnerd · 2021-01-08T00:09:43Z

Correct, I categorise Android as Linux, which will do FP128. However, I did forget that Android x86_64 also does FP128. Both of these are IEEE 754 binary128.

Windows does FP64 across the board for consistency, and is IEEE 754 binary64.

Hmm, I thought that there was some support for IEEE 754 FP128 on PPC64. But, yes, the hardware's native support for 128-bit floating point is not IEEE 754 conformant.

Agreed with you on the use of long double - and why I always explicitly spell out FP width.

swift-ci · 2021-01-08T01:23:10Z

Build failed
Swift Test OS X Platform
Git Sha - 409c3d695b57a4b5e37510a2ee57b492279b9ae7

tbkka · 2021-01-08T01:28:11Z

Benchmark results are about as expected. Apart from the noise (substring, data, etc), there are two significant benchmark differences here:

Swift Float.description is benchmarking about 9% slower, consistent with the results from benchmarking the underlying C code. This is fallout from the code size reductions: The primary code size reduction was to switch float to use the same back-end code as Double. This means float is now using more precision than is technically necessary for IEEE 754 binary32, which does reduce performance. Other performance improvements offset this, but not entirely.

Swift Double.description is showing about 14% faster on average. That's about 1/2 of the improvement that shows in the C implementation, which is consistent with the fact that around 1/2 of Doubles need a heap-allocated String to hold the result.

For what it's worth, I consider these to be modest impacts, especially compared to the 10x perf improvement that started this project (see PR #15474).

tbkka · 2021-01-08T01:31:09Z

Tests failed on macOS with 14006 tests completed (out of 14007):

Build timed out (after 83 minutes). Marking the build as failed.

It seems that some test is taking longer than expected, but it's not clear which one.

tbkka · 2021-01-08T01:42:05Z

@swift-ci Please test

swift-ci · 2021-01-08T03:17:49Z

Build failed
Swift Test Linux Platform
Git Sha - f709b025837d7ba44bdc89c7d5f546ffbc093e61

swift-ci · 2021-01-08T05:10:43Z

Build failed
Swift Test OS X Platform
Git Sha - f709b025837d7ba44bdc89c7d5f546ffbc093e61

tbkka · 2021-01-08T23:52:56Z

Ugh. Just figured out that the SwiftDtoa numbers are all for .o files with debug information. So the code size comparisons aren't particularly valid. I'll try to get those updated.

tbkka · 2021-01-09T01:36:14Z

@swift-ci Please test Linux platform

tbkka · 2021-01-09T20:41:16Z

@swift-ci Please benchmark

swift-ci · 2021-01-09T21:18:16Z

Performance: -O

Regression	OLD	NEW	DELTA	RATIO
StringFromLongWholeSubstring	4	5	+25.0%	0.80x
LessSubstringSubstringGenericComparable	39	43	+10.3%	0.91x
ObjectiveCBridgeFromNSDictionaryAnyObjectForced	8900	9700	+9.0%	0.92x (?)
LessSubstringSubstring	39	42	+7.7%	0.93x (?)
EqualSubstringSubstring	39	42	+7.7%	0.93x (?)
EqualSubstringString	39	42	+7.7%	0.93x
DataToStringSmall	2650	2850	+7.5%	0.93x (?)

Improvement	OLD	NEW	DELTA	RATIO
FloatingPointPrinting_Double_description_small	20300	17800	-12.3%	1.14x (?)
NSStringConversion.MutableCopy.UTF8	983	901	-8.3%	1.09x (?)
RandomShuffleLCG2	480	448	-6.7%	1.07x (?)

Code size: -O

Performance: -Osize

Regression	OLD	NEW	DELTA	RATIO
FlattenListFlatMap	4370	6848	+56.7%	0.64x (?)
StringFromLongWholeSubstring	4	5	+25.0%	0.80x
DataToStringSmall	2700	3000	+11.1%	0.90x (?)
LessSubstringSubstring	39	43	+10.3%	0.91x (?)
LessSubstringSubstringGenericComparable	39	43	+10.3%	0.91x
Data.init.Sequence.809B.Count.RE.I	79	86	+8.9%	0.92x (?)
EqualSubstringSubstring	39	42	+7.7%	0.93x (?)
EqualSubstringSubstringGenericEquatable	39	42	+7.7%	0.93x (?)
EqualSubstringString	39	42	+7.7%	0.93x (?)

Improvement	OLD	NEW	DELTA	RATIO
FloatingPointPrinting_Double_description_small	20300	17800	-12.3%	1.14x
FloatingPointPrinting_Double_description_uniform	21400	19100	-10.7%	1.12x (?)
Array2D	7504	6928	-7.7%	1.08x (?)
RandomShuffleLCG2	448	416	-7.1%	1.08x
FloatingPointPrinting_Double_interpolated	53400	49600	-7.1%	1.08x (?)
StrComplexWalk	5140	4800	-6.6%	1.07x

Code size: -Osize

Performance: -Onone

Regression	OLD	NEW	DELTA	RATIO
SevenBoom	1773	2246	+26.7%	0.79x (?)
NSStringConversion.MutableCopy.Rebridge.UTF8	1061	1185	+11.7%	0.90x (?)
NSStringConversion.MutableCopy.Rebridge.LongUTF8	868	946	+9.0%	0.92x (?)
Dictionary4OfObjects	1477	1602	+8.5%	0.92x (?)
ArrayOfPOD	1050	1132	+7.8%	0.93x (?)

Improvement	OLD	NEW	DELTA	RATIO
FloatingPointPrinting_Double_interpolated	80000	73400	-8.2%	1.09x (?)
String.data.Empty	97	90	-7.2%	1.08x (?)

Code size: -swiftlibs

Hardware Overview

  Model Name: Mac Pro
  Model Identifier: MacPro6,1
  Processor Name: 12-Core Intel Xeon E5
  Processor Speed: 2.7 GHz
  Number of Processors: 1
  Total Number of Cores: 12
  L2 Cache (per Core): 256 KB
  L3 Cache: 30 MB
  Memory: 64 GB

tbkka · 2021-01-10T17:58:49Z

Inlining the 128-bit shifts on 64-bit processors seems to have addressed the benchmark regression for Float. Let's see if the test suites are happier today...
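For readers unfamiliar with the operation in question, a 128-bit right shift built from two 64-bit halves can be sketched as below. The struct and function names are hypothetical; this only illustrates the kind of small helper that benefits from inlining and is not copied from the SwiftDtoa source.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical 128-bit unsigned value stored as two 64-bit halves. */
typedef struct {
    uint64_t high;
    uint64_t low;
} example_uint128;

/* Logical right shift by 0..127 bits. Marked inline so a call with a
 * constant count collapses to a couple of shift and OR instructions. */
static inline example_uint128 shift_right_128(example_uint128 value, unsigned count) {
    assert(count < 128);
    example_uint128 result;
    if (count == 0) {
        result = value;   /* avoid an undefined 64-bit shift by 64 below */
    } else if (count < 64) {
        result.low = (value.low >> count) | (value.high << (64 - count));
        result.high = value.high >> count;
    } else {
        result.low = value.high >> (count - 64);
        result.high = 0;
    }
    return result;
}
```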

tbkka · 2021-01-10T17:58:59Z

@swift-ci Please test

swift-ci · 2021-01-10T19:25:31Z

Build failed
Swift Test Linux Platform
Git Sha - 868b07c4ce7d4b43d025131bf6ceb2adbc789387

tbkka · 2021-01-11T23:42:26Z

@swift-ci Please test Linux platform

stephentyrone · 2021-01-12T20:29:43Z

include/swift/Runtime/SwiftDtoa.h


 #ifndef SWIFT_DTOA_H
 #define SWIFT_DTOA_H

+// <<<<< BEGIN Local configuration overrides for Swift runtime build
+// Forcibly DISABLE binary128 support for Swift runtime
+#define SWIFT_DTOA_BINARY128_SUPPORT 0


I'm tempted to define this on (or remove the condition) for the purposes of being able to back deploy a hypothetical future Float128 type. Thoughts?

I don't think there's any point right now. I see three issues:

Issue: These functions are not ABI. Prepping for back deployment would require building out the actual ABI, including wrestling with argument-passing conventions, etc.

Issue: We would need agreement on where to support Float128. Should it depend on long double being binary128? Will it depend on HW support?

Issue: The float80/binary128 arithmetic is the biggest part of this formatting logic. Turning on either one is almost 10k of code and tables; adding the second one is about 1.5k more. So adding binary128 to platforms that already have float80 support is pretty cheap here. But I'm wary of speculatively enabling this on platforms that don't already have float80 turned on.

SGTM.

For the record, I can answer this:

Issue: We would need agreement on where to support Float128. Should it depend on long double being binary128? Will it depend on HW support?

If/when we add support for Float128, we'll make it available everywhere, independent of how long double is defined. C provides the _Float128 name for this purpose, so we don't need to depend on long double (frankly, it was a silly mistake for platforms to bind long double to binary128, but well, they did).

SwiftDtoa is the C/C++ code used in the Swift runtime to produce the textual representations used by the `description` and `debugDescription` properties of the standard Swift floating-point types. This update includes a number of algorithmic improvements to SwiftDtoa to improve portability, reduce code size, and improve performance but does not change the actual output. About SwiftDtoa =============== In early versions of Swift, the `description` properties used the C library `sprintf` functionality with a fixed number of digits. In 2018, that logic was replaced with the first version of SwiftDtoa which used used a fast, adaptive algorithm to automatically choose the correct number of digits for a particular value. The resulting decimal output is always: * Accurate. Parsing the decimal form will yield exactly the same binary floating-point value again. This guarantee holds for any parser that accurately implements IEEE 754. In particular, the Swift standard library can guarantee that for any Double `d` that is not a NaN, `Double(d.description) == d`. * Short. Among all accurate forms, this form has the fewest significant digits. (Caution: Surprisingly, this is not the same as minimizing the number of characters. In some cases, minimizing the number of characters requires producing additional significant digits.) * Close. If there are multiple accurate, short forms, this code chooses the decimal form that is closest to the exact binary value. If there are two exactly the same distance, the one with an even final digit will be used. Algorithms that can produce this "optimal" output have been known since at least 1990, when Steele and White published their Dragon4 algorithm. However, Dragon4 and other algorithms from that period relied on high-precision integer arithmetic, which made them slow. More recently, a surge of interest in this problem has produced dramatically better algorithms that can produce the same results using only fast fixed-precision arithmetic. 
This format is ideal for JSON and other textual interchange: accuracy ensures that the value will be correctly decoded, shortness minimizes network traffic, and the existence of high-performance algorithms allows this form to be generated more quickly than many `printf`-based implementations. This format is also ideal for logging, debugging, and other general display. In particular, the shortness guarantee avoids the confusion of unnecessary additional digits, so that the result of `1.0 / 10.0` consistently displays as `0.1` instead of `0.100000000000000000001`. About SwiftDtoa v2 ================== Compared to the original SwiftDtoa code, this update is: **Better**: The core logic is implemented using only C99 features with 64-bit and smaller integer arithmetic. If available, 128-bit integers are used for better performance. The core routines do not require any floating-point support from the C/C++ standard library and with only minor modifications should be usable on systems with no hardware or software floating-point support at all. This version also has experimental support for IEEE 754 binary128 format, though this support is obviously not included when compiling for the Swift standard library. **Smaller**: Code size reduction compared to the earlier versions was a primary goal for this effort. In particular, the new binary128 support shares essentially all of its code with the float80 implementation. **Faster**: Even with the code size reductions, all formats are noticeably faster. The primary performance gains come from three major changes: Text digits are now emitted directly in the core routines in a form that requires only minimal adjustment to produce the final text. Digit generation produces 2, 4, or even 8 digits at a time, depending on the format. The double logic optimistically produces 7 digits in the initial scaling with a Ryu-inspired backtracking when fewer digits suffice. 
SwiftDtoa's algorithms ====================== SwiftDtoa started out as a variation of Florian Loitsch' Grisu2 that addressed the shortness failures of that algorithm. Subsequent work has incorporated ideas from Errol3, Ryu, and other sources to yield a production-quality implementation that is performance- and size-competitive with current research code. Those who wish to understand the details can read the extensive comments included in the code. Note that float16 actually uses a different algorithm than the other formats, as the extremely limited range can be handled with much simpler techniques. The float80/binary128 logic sacrifices some performance optimizations in order to minimize the code size for these less-used formats; the goal for SwiftDtoa v2 has been to match the float80 performance of earlier implementations while reducing code size and widening the arithmetic routines sufficiently to support binary128. SwiftDtoa Testing ================= A newly-developed test harness generates several large files of test data that include known-correct results computed with high-precision arithmetic routines. The test files include: * Critical values generated by the algorithm presented in the Errol paper (about 48 million cases for binary128) * Values for which the optimal decimal form is exactly midway between two binary floating-point values. * All exact powers of two representable in this format. * Floating-point values that are close to exact powers of ten. In addition, several billion random values for each format were compared to the results from other implementations. For binary16 and binary32 this provided exhaustive validation of every possible input value. Code Size and Performance ========================= The tables below summarize the code size and performance for the SwiftDtoa C library module by itself on several different processor architectures. 
When used from Swift, the `.description` and `.debugDescription` implementations incur additional overhead for creating and returning Swift strings that is not captured here.

The code size tables show the total size in bytes of the compiled `.o` object files for a particular version of that code. The headings indicate the floating-point formats supported by that particular build (e.g., "16,32" for a version that supports binary16 and binary32 but no other formats).

The performance numbers below were obtained from a custom test harness that generates random bit patterns, interprets them as the corresponding floating-point value, and averages the overall time. For float80, the random bit patterns were generated in a way that avoids generating invalid values.

All code was compiled with the system C/C++ compiler using `-O2` optimization.

A few notes about particular implementations:

* **SwiftDtoa v1** is the original SwiftDtoa implementation as committed to the Swift runtime in April 2018.
* **SwiftDtoa v1a** is the same as SwiftDtoa v1 with added binary16 support.
* **SwiftDtoa v2** can be configured with preprocessor macros to support any subset of the supported formats. I've provided sizes here for several different build configurations.
* **Ryu** (Ulf Adams) implements binary32 and binary64 as completely independent source files. The size here is the total size of the two `.o` object files.
* **Ryu(size)** is Ryu compiled with the `RYU_OPTIMIZE_SIZE` option.
* **Dragonbox** (Junekey Jeon). The size here is the compiled size of a simple `.cpp` file that instantiates the template for the specified formats, plus the size of the associated text output logic.
* **Dragonbox(size)** is Dragonbox compiled to minimize size by using a compressed power-of-10 table.
* **gdtoa** has a very large feature set. For this reason, I excluded it from the code size comparison since I didn't consider the numbers to be comparable to the others.

x86_64
----------------

These were built using Apple clang 12.0.5 on a 2019 16" MacBook Pro (2.4GHz 8-core Intel Core i9) running macOS 11.1.

**Code Size**

Bold numbers here indicate the configurations that have shipped as part of the Swift runtime.

|                 | 16,32,64,80 | 32,64,80    | 32,64       |
|-----------------|------------:|------------:|------------:|
| SwiftDtoa v1    |             | **15128**   |             |
| SwiftDtoa v1a   | **16888**   |             |             |
| SwiftDtoa v2    | **20220**   | 18628       | 8248        |
| Ryu             |             |             | 40408       |
| Ryu(size)       |             |             | 23836       |
| Dragonbox       |             |             | 23176       |
| Dragonbox(size) |             |             | 15132       |

**Performance**

|                 | binary16 | binary32 | binary64 | float80 | binary128 |
|-----------------|---------:|---------:|---------:|--------:|----------:|
| SwiftDtoa v1    |          | 25ns     | 46ns     | 82ns    |           |
| SwiftDtoa v1a   | 37ns     | 26ns     | 47ns     | 83ns    |           |
| SwiftDtoa v2    | 22ns     | 19ns     | 31ns     | 72ns    | 90ns      |
| Ryu             |          | 19ns     | 26ns     |         |           |
| Ryu(size)       |          | 17ns     | 24ns     |         |           |
| Dragonbox       |          | 19ns     | 24ns     |         |           |
| Dragonbox(size) |          | 19ns     | 29ns     |         |           |
| gdtoa           | 220ns    | 381ns    | 1184ns   | 16044ns | 22800ns   |

ARM64
----------------

These were built using Apple clang 12.0.0 on a 2020 M1 Mac Mini running macOS 11.1.

**Code Size**

|                 | 16,32,64 | 32,64 |
|-----------------|---------:|------:|
| SwiftDtoa v1    |          | 7436  |
| SwiftDtoa v1a   | 9124     |       |
| SwiftDtoa v2    | 9964     | 8228  |
| Ryu             |          | 35764 |
| Ryu(size)       |          | 16708 |
| Dragonbox       |          | 27108 |
| Dragonbox(size) |          | 19172 |

**Performance**

|                 | binary16 | binary32 | binary64 | float80 | binary128 |
|-----------------|---------:|---------:|---------:|--------:|----------:|
| SwiftDtoa v1    |          | 21ns     | 39ns     |         |           |
| SwiftDtoa v1a   | 17ns     | 21ns     | 39ns     |         |           |
| SwiftDtoa v2    | 15ns     | 17ns     | 29ns     | 54ns    | 71ns      |
| Ryu             |          | 15ns     | 19ns     |         |           |
| Ryu(size)       |          | 29ns     | 24ns     |         |           |
| Dragonbox       |          | 16ns     | 24ns     |         |           |
| Dragonbox(size) |          | 15ns     | 34ns     |         |           |
| gdtoa           | 143ns    | 242ns    | 858ns    | 25129ns | 36195ns   |

ARM32
----------------

These were built using clang 8.0.1 on a BeagleBone Black (500MHz ARMv7) running FreeBSD 12.1-RELEASE.

**Code Size**

|                 | 16,32,64 | 32,64 |
|-----------------|---------:|------:|
| SwiftDtoa v1    |          | 8668  |
| SwiftDtoa v1a   | 10356    |       |
| SwiftDtoa v2    | 9796     | 8340  |
| Ryu             |          | 32292 |
| Ryu(size)       |          | 14592 |
| Dragonbox       |          | 29000 |
| Dragonbox(size) |          | 21980 |

**Performance**

|                 | binary16 | binary32 | binary64 | float80  | binary128 |
|-----------------|---------:|---------:|---------:|---------:|----------:|
| SwiftDtoa v1    |          | 459ns    | 1152ns   |          |           |
| SwiftDtoa v1a   | 383ns    | 451ns    | 1148ns   |          |           |
| SwiftDtoa v2    | 202ns    | 357ns    | 715ns    | 2720ns   | 3379ns    |
| Ryu             |          | 345ns    | 5450ns   |          |           |
| Ryu(size)       |          | 786ns    | 5577ns   |          |           |
| Dragonbox       |          | 300ns    | 904ns    |          |           |
| Dragonbox(size) |          | 294ns    | 1021ns   |          |           |
| gdtoa           | 2180ns   | 4749ns   | 18742ns  | 293000ns | 440000ns  |
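The measurement approach described in this section (random bit patterns, reinterpreted as floating-point values, with the total time averaged) can be sketched roughly as follows for binary64. This is not the harness used for the numbers above: every name here (`next_bits`, `double_from_bits`, `format_fn`, `average_ns_per_call`) is hypothetical, and `snprintf` with `%.17g` is only a stand-in for the formatter under test.

```c
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

/* xorshift64: a small deterministic PRNG; the state must be nonzero. */
static uint64_t next_bits(uint64_t *state) {
    uint64_t x = *state;
    x ^= x << 13;
    x ^= x >> 7;
    x ^= x << 17;
    *state = x;
    return x;
}

/* Reinterpret 64 bits as a binary64 value (may be NaN or infinity). */
static double double_from_bits(uint64_t bits) {
    double d;
    memcpy(&d, &bits, sizeof d);
    return d;
}

/* Hypothetical formatter signature: writes text, returns its length. */
typedef size_t (*format_fn)(double value, char *dest, size_t length);

/* Stand-in formatter; a real comparison would plug in each library here. */
static size_t format_with_snprintf(double value, char *dest, size_t length) {
    int n = snprintf(dest, length, "%.17g", value);
    return n < 0 ? 0 : (size_t)n;
}

/* Average CPU time per call in nanoseconds, or -1.0 on failure. */
static double average_ns_per_call(format_fn format, size_t count, uint64_t seed) {
    if (count == 0) return -1.0;
    double *inputs = malloc(count * sizeof *inputs);
    if (inputs == NULL) return -1.0;

    /* Generate the inputs up front so the PRNG cost is not timed. */
    uint64_t state = seed ? seed : 1;
    for (size_t i = 0; i < count; i++) {
        inputs[i] = double_from_bits(next_bits(&state));
    }

    char buffer[64];
    volatile size_t sink = 0; /* keeps the calls from being optimized away */
    clock_t start = clock();
    for (size_t i = 0; i < count; i++) {
        sink += format(inputs[i], buffer, sizeof buffer);
    }
    clock_t end = clock();

    free(inputs);
    return (double)(end - start) / CLOCKS_PER_SEC * 1e9 / (double)count;
}
```

Calling `average_ns_per_call(format_with_snprintf, 1000000, 42)` returns one average for the stand-in formatter. Because the inputs are uniformly random bit patterns, most of them have very large or very small exponents, which is worth keeping in mind when comparing against workloads dominated by "ordinary" values.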

tbkka · 2021-01-20T20:23:38Z

Testing has gone so well that I decided to push a few more improvements:

* Restored the separate binary32 formatter. This brings binary32 roughly on par with Ryu for performance. (perf)
* Eliminated an expensive multiply from the interval computation for binary64 and float80/binary128. (size & perf)
* Factored out some of the common textual adjustments from binary32, 64, 80, and 128. (size)

tbkka · 2021-01-20T20:26:16Z

Also, I changed the float80 infinity/NaN parsing to more strictly follow 80387 conventions as documented on Wikipedia. In particular, the 8087/80287 infinity is now handled as a signaling NaN.
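As a rough illustration of what classifying these patterns involves, here is a minimal sketch based on the x87 extended-precision encoding table on Wikipedia. It is not the SwiftDtoa code: the names (`f80_class`, `classify_float80`) are hypothetical, and reporting the remaining pseudo-NaN patterns as "invalid" is an assumption taken from that table rather than from this PR. Only the treatment of the 8087/80287 infinity as a signaling NaN comes from the comment above.

```c
#include <stdint.h>

typedef enum {
    F80_FINITE,        /* exponent is not all ones (zero, denormal, normal, ...) */
    F80_INFINITY,
    F80_SIGNALING_NAN,
    F80_QUIET_NAN,
    F80_INVALID        /* pseudo-NaN patterns that the 80387 rejects */
} f80_class;

/* sign_exponent: sign bit plus 15-bit exponent (the top 16 bits of the value).
 * significand:   low 64 bits, with the explicit integer bit in bit 63. */
static f80_class classify_float80(uint16_t sign_exponent, uint64_t significand) {
    if ((sign_exponent & 0x7FFF) != 0x7FFF) {
        return F80_FINITE;
    }
    unsigned selector = (unsigned)(significand >> 62);          /* top two bits */
    uint64_t payload = significand & ((UINT64_C(1) << 62) - 1); /* low 62 bits */
    switch (selector) {
    case 0:
        /* Zero payload is the 8087/80287 infinity; following the comment
         * above, it is treated as a signaling NaN. A nonzero payload is a
         * pseudo-NaN, reported as invalid here (assumption). */
        return payload == 0 ? F80_SIGNALING_NAN : F80_INVALID;
    case 1:
        return F80_INVALID; /* pseudo-NaN (assumption) */
    case 2:
        /* Integer bit set, quiet bit clear: infinity or signaling NaN. */
        return payload == 0 ? F80_INFINITY : F80_SIGNALING_NAN;
    default:
        /* Integer bit and quiet bit both set: quiet NaN. */
        return F80_QUIET_NAN;
    }
}
```

For example, `classify_float80(0x7FFF, 0x8000000000000000)` is the 80387 infinity, while `classify_float80(0x7FFF, 0)` is the older 8087/80287 infinity encoding.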

tbkka · 2021-01-20T20:26:31Z

@swift-ci Please test

tbkka · 2021-01-20T21:37:40Z

I've also gone ahead and changed the float80 stub interface to be conditionally compiled only on platforms where Swift Stdlib actually supports Float80. That eliminated a chunk of pointless boilerplate that was trying to provide a makeshift float80 formatter for other platforms.

swift-ci · 2021-01-20T21:45:06Z

Build failed
Swift Test Linux Platform
Git Sha - 79c3a42

swift-ci · 2021-01-20T22:06:15Z

Build failed
Swift Test OS X Platform
Git Sha - 79c3a42

tbkka · 2021-01-21T02:08:40Z

Fixed a bug in the new float80 NaN/infinity parsing.

tbkka · 2021-01-21T02:09:50Z

I measured the time for the PrintFloat.swift.gyb test: it takes 14s on my laptop with an optimized Stdlib and 15s with an unoptimized one, so there's no reason to disable this test in the latter case.

tbkka · 2021-01-21T02:10:02Z

@swift-ci Please test

tbkka · 2021-01-21T02:12:16Z

I'm still a little unhappy that binary16 is now slower than binary32 on x86_64, but that can wait for a future update.

tbkka · 2021-01-21T02:34:32Z

@swift-ci Please test Windows platform

tbkka requested a review from stephentyrone January 7, 2021 22:01

jk-jeon mentioned this pull request Jan 8, 2021

Better support for ARM and other architectures jk-jeon/dragonbox#11

Open

stephentyrone reviewed Jan 12, 2021

tbkka force-pushed the tbkka/SwiftDtoav2 branch from 98bf7c5 to 6e83f2a Compare January 20, 2021 20:13

tbkka force-pushed the tbkka/SwiftDtoav2 branch from 6e83f2a to 79c3a42 Compare January 20, 2021 20:20

tbkka added 2 commits January 20, 2021 17:49

This is fast enough now even for non-optimized test runs

ce43d15

Fix float80 Nan/Inf parsing, comment more thoroughly

63723ab

tbkka merged commit a32dacb into swiftlang:main Jan 27, 2021

tbkka deleted the tbkka/SwiftDtoav2 branch January 27, 2021 22:36

Kristine1975 mentioned this pull request Nov 19, 2021

Rounding error when printing the number 460.0 micropython/micropython#4212

Closed
