SwiftDtoa v2: Better, Smaller, Faster floating-point formatting #35299

Merged
merged 3 commits into from
Jan 27, 2021

@tbkka tbkka commented Jan 7, 2021

SwiftDtoa is the C/C++ code used in the Swift runtime to produce the textual representations used by the description and debugDescription properties of the standard Swift floating-point types. This update includes a number of algorithmic improvements to SwiftDtoa to improve portability, reduce code size, and improve performance but does not change the actual output.

About SwiftDtoa

In early versions of Swift, the description properties used the C library sprintf functionality with a fixed number of digits. In April 2018, PR #15474 replaced that logic with the first version of SwiftDtoa, which used a fast, adaptive algorithm to automatically choose the correct number of digits for a particular value. The resulting decimal output is always:

  • Accurate. Parsing the decimal form will yield exactly the same binary floating-point value again. This guarantee holds for any parser that accurately implements IEEE 754. In particular, the Swift standard library can guarantee that for any Double d that is not a NaN, Double(d.description) == d.

  • Short. Among all accurate forms, this form has the fewest significant digits. (Caution: Surprisingly, this is not the same as minimizing the number of characters. In some cases, minimizing the number of characters requires producing additional significant digits.)

  • Close. If there are multiple accurate, short forms, this code chooses the decimal form that is closest to the exact binary value. If there are two exactly the same distance, the one with an even final digit will be used.
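The "accurate" and "short" properties can be illustrated with a brute-force search. The sketch below is not SwiftDtoa's algorithm: it assumes a C library whose `snprintf` and `strtod` are correctly rounded, and the helper name `shortest_roundtrip` is hypothetical. It may also pick a different final digit than an optimal formatter in rare edge cases (for example near power-of-two boundaries), so treat it only as a way to see what the guarantees mean.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical helper: try 1..17 significant digits and keep the first
 * rendering that parses back to exactly the same double.
 * Returns the digit count used, or -1 on failure (NaN never compares
 * equal, so NaN inputs end up here). */
static int shortest_roundtrip(double d, char *buf, size_t buflen) {
    assert(buf != NULL && buflen > 0);
    for (int digits = 1; digits <= 17; digits++) {
        int n = snprintf(buf, buflen, "%.*g", digits, d);
        if (n < 0 || (size_t)n >= buflen) {
            return -1;              /* buffer too small */
        }
        if (strtod(buf, NULL) == d) {
            return digits;          /* accurate, and no shorter form was */
        }
    }
    return -1;
}
```

With this helper, `1.0 / 10.0` needs one significant digit and `1.0 / 3.0` needs sixteen. Production formatters avoid the repeated format-and-parse loop entirely.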

Algorithms that can produce this "optimal" output have been known since at least 1990, when Steele and White published their Dragon4 algorithm. However, Dragon4 and other algorithms from that period relied on high-precision integer arithmetic, which made them slow. More recently, a surge of interest in this problem has produced dramatically better algorithms that can produce the same results using only fast fixed-precision arithmetic.

This format is ideal for JSON and other textual interchange: accuracy ensures that the value will be correctly decoded, shortness minimizes network traffic, and the existence of high-performance algorithms allows this form to be generated more quickly than many printf-based implementations.

This format is also ideal for logging, debugging, and other general display. In particular, the shortness guarantee avoids the confusion of unnecessary additional digits, so that the result of 1.0 / 10.0 consistently displays as 0.1 instead of a longer form such as 0.10000000000000001 (the same value printed with 17 significant digits).

About SwiftDtoa v2

Compared to the original SwiftDtoa code, this update is:

Better: The core logic is implemented using only C99 features with 64-bit and smaller integer arithmetic. If available, 128-bit integers are used for better performance. The core routines do not require any floating-point support from the C/C++ standard library and with only minor modifications should be usable on systems with no hardware or software floating-point support at all. This version also has experimental support for IEEE 754 binary128 format, though this support is obviously not included when compiling for the Swift standard library.
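As an illustration of the "64-bit arithmetic with an optional 128-bit fast path" idea, here is a minimal sketch of a 64x64 multiply that returns the high 64 bits of the product. The function names are hypothetical and this is not code from SwiftDtoa; the `__SIZEOF_INT128__` check is the GCC/Clang way to detect `__int128` and is an assumption about the compiler.

```c
#include <stdint.h>

/* Portable path: high 64 bits of a 64x64 product using only
 * 64-bit arithmetic on 32-bit halves. */
static uint64_t mul_high_portable(uint64_t a, uint64_t b) {
    uint64_t a_lo = (uint32_t)a, a_hi = a >> 32;
    uint64_t b_lo = (uint32_t)b, b_hi = b >> 32;
    uint64_t lo_lo = a_lo * b_lo;
    uint64_t hi_lo = a_hi * b_lo;
    uint64_t lo_hi = a_lo * b_hi;
    uint64_t hi_hi = a_hi * b_hi;
    /* Middle column: three values below 2^32 each, so no overflow. */
    uint64_t mid = (lo_lo >> 32) + (uint32_t)hi_lo + (uint32_t)lo_hi;
    return hi_hi + (hi_lo >> 32) + (lo_hi >> 32) + (mid >> 32);
}

/* Uses a native 128-bit integer when the compiler provides one. */
static uint64_t mul_high(uint64_t a, uint64_t b) {
#if defined(__SIZEOF_INT128__)
    return (uint64_t)(((unsigned __int128)a * b) >> 64);
#else
    return mul_high_portable(a, b);
#endif
}
```

Both paths return the same result; only the generated code differs.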

Smaller: Code size reduction compared to the earlier versions was a primary goal for this effort. In particular, float80 and binary128 share most of their core code, avoiding a full additional copy of the primary algorithm.

Faster: Even with the code size reductions, all formats are noticeably faster than before. The primary performance gains come from three major changes: ASCII digits are now emitted directly in the core routine in a form that requires only minimal adjustment to produce the final text. The main digit generation produces 2, 4, or 8 digits at a time when possible. Finally, the double logic optimistically produces 7 digits in the initial scaling with a Ryu-inspired backtracking when fewer digits suffice.
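Producing several digits per step is commonly done with a lookup table of two-character pairs. The following is a minimal sketch of that general technique, not the SwiftDtoa code itself; `DIGIT_PAIRS` and `emit_8_digits` are hypothetical names.

```c
#include <stdint.h>
#include <string.h>

/* "00", "01", ..., "99" stored back to back and indexed by value. */
static const char DIGIT_PAIRS[201] =
    "0001020304050607080910111213141516171819"
    "2021222324252627282930313233343536373839"
    "4041424344454647484950515253545556575859"
    "6061626364656667686970717273747576777879"
    "8081828384858687888990919293949596979899";

/* Writes exactly 8 zero-padded ASCII digits for value < 100000000.
 * Each iteration divides by 100 and copies two characters at once,
 * so the loop runs 4 times instead of 8. Does not NUL-terminate. */
static void emit_8_digits(uint32_t value, char *out) {
    for (int i = 3; i >= 0; i--) {
        memcpy(out + 2 * i, DIGIT_PAIRS + 2 * (value % 100), 2);
        value /= 100;
    }
}
```

Halving the number of divisions is where most of the saving comes from; the table costs 200 bytes.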

SwiftDtoa's algorithms

SwiftDtoa started out as a variation of Florian Loitsch's Grisu2 that addressed the shortness failures of that algorithm. Specifically, it uses wider arithmetic (128-bit for binary64) and a novel initial interval scaling that together address all of the weaknesses of Grisu2. Subsequent work has incorporated ideas from Errol3, Ryu, and other sources to yield a production-quality implementation that is performance- and size-competitive with current research code.

Those who wish to understand the details can read the extensive comments included in the code. Note that float16 actually uses a different algorithm than the other formats, as the extremely limited range can be handled with much simpler techniques. The float80/binary128 logic sacrifices some performance optimizations in order to minimize the code size for these less-used formats; the goal for SwiftDtoa v2 has been to match the float80 performance of earlier implementations while reducing code size and widening the arithmetic routines sufficiently to support binary128.

SwiftDtoa Testing

A newly-developed test harness generates several large files of test data that include known-correct results computed with high-precision arithmetic routines. The test files include:

  • Critical values generated by the algorithm presented in the Errol paper (about 48 million cases for binary128)
  • Values for which the optimal decimal form is exactly midway between two binary floating-point values.
  • All exact powers of two representable in this format.
  • Floating-point values that are close to exact powers of ten.

In addition, several billion random values for each format were compared to the results from other implementations. For binary16 and binary32 this provided exhaustive validation of every possible input value.
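Exhaustive validation of binary32 amounts to walking every 32-bit pattern. Below is a minimal sketch of such a loop that checks only the round-trip property. It is not the actual test harness: `format_float_reference` is a hypothetical stand-in that uses `snprintf` with 9 significant digits, and a real run would call the formatter under test and also compare against known-correct digits.

```c
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for the formatter under test. */
static void format_float_reference(float f, char *buf, size_t buflen) {
    snprintf(buf, buflen, "%.9g", (double)f);
}

/* Counts bit patterns in [first, last] whose text form does not parse
 * back to the identical bits. NaNs are skipped. Passing
 * (0, UINT32_MAX) covers every binary32 value. */
static uint64_t count_roundtrip_failures(uint32_t first, uint32_t last) {
    uint64_t failures = 0;
    char buf[64];
    /* 64-bit counter so the loop terminates when last == UINT32_MAX. */
    for (uint64_t bits = first; bits <= last; bits++) {
        uint32_t in = (uint32_t)bits, out;
        float f, parsed;
        memcpy(&f, &in, sizeof f);          /* reinterpret bits as float */
        if (f != f) {
            continue;                        /* NaN: no round-trip promise */
        }
        format_float_reference(f, buf, sizeof buf);
        parsed = strtof(buf, NULL);
        memcpy(&out, &parsed, sizeof out);
        if (out != in) {
            failures++;
        }
    }
    return failures;
}
```

Comparing bit patterns rather than values also catches a lost sign on negative zero.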

Code Size and Performance

The tables below summarize the code size and performance for the SwiftDtoa C library module on several different processor architectures, with comparisons to several other implementations. When used from Swift, the .description and .debugDescription implementations incur additional overhead for creating and returning Swift strings that are not captured here.

The code size tables show the total size in bytes of the compiled and stripped .o object files for a particular version of that code. The headings indicate the floating-point formats supported by that particular build (e.g., "16,32" for a version that supports binary16 and binary32 but no other formats).

The performance numbers below were obtained from a custom test harness that generates random bit patterns, interprets them as the corresponding floating-point value, and averages the overall time. For float80, the random bit patterns were generated in a way that avoids generating invalid values.
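The measurement approach described above can be sketched roughly as follows. This is not the harness used for these numbers: the xorshift generator, the use of `clock()`, and the `snprintf("%.17g")` call standing in for the formatter being measured are all assumptions made for illustration.

```c
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <time.h>

/* xorshift64: a small PRNG; the state must be nonzero. */
static uint64_t xorshift64(uint64_t *state) {
    uint64_t x = *state;
    x ^= x << 13;
    x ^= x >> 7;
    x ^= x << 17;
    *state = x;
    return x;
}

/* Interprets a random 64-bit pattern as a binary64 value. */
static double double_from_bits(uint64_t bits) {
    double d;
    memcpy(&d, &bits, sizeof d);
    return d;
}

/* Average nanoseconds per formatting call over random bit patterns.
 * The PRNG cost is included in the measurement. */
static double average_ns_per_format(uint64_t seed, int iterations) {
    char buf[64];
    uint64_t state = seed ? seed : 1;
    volatile unsigned sink = 0;   /* keeps the calls from being elided */
    clock_t start = clock();
    for (int i = 0; i < iterations; i++) {
        double d = double_from_bits(xorshift64(&state));
        sink += (unsigned)snprintf(buf, sizeof buf, "%.17g", d);
    }
    clock_t end = clock();
    return (double)(end - start) / CLOCKS_PER_SEC * 1e9 / iterations;
}
```

Uniformly random bit patterns spread values across the whole exponent range, which is a different distribution from uniformly random values in an interval.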

All code was compiled with the system C/C++ compiler using -O2 optimization.

A few notes about particular implementations mentioned below:

  • SwiftDtoa v1 is the original SwiftDtoa implementation as committed to the Swift runtime in April 2018.
  • SwiftDtoa v1a is the same as SwiftDtoa v1 with added binary16 support.
  • SwiftDtoa v2 can be configured with preprocessor macros to support any subset of the supported formats. I've provided sizes here for several different build configurations.
  • Ryu (Ulf Adams) implements binary32 and binary64 as completely independent source files. The size here is the total size of the two .o object files. The Ryu(size) version was built with the RYU_OPTIMIZE_SIZE option.
  • Dragonbox (Junekey Jeon) is provided as a header-only C++ template implementation. The size here is the compiled size of a simple .cpp file that instantiates the template for the specified formats. The Dragonbox(size) version used the cache::compressed policy to reduce the size of the lookup tables.
  • gdtoa (David M. Gay) has a very large feature set. For this reason, I excluded it from the code size comparison since I didn't consider the numbers to be comparable to the others.

x86_64

These were built using Apple clang 12.0.5 on a 2019 16" MacBook Pro (2.4GHz 8-core Intel Core i9) running macOS 11.1.

Code Size

Bold numbers here indicate the configurations that have shipped as part of the Swift runtime.

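|               | 16,32,64,80 | 32,64,80    | 32,64       |
|---------------|------------:|------------:|------------:|
|SwiftDtoa v1   |             |   **15128** |             |
|SwiftDtoa v1a  |   **16888** |             |             |
|SwiftDtoa v2   |   **20220** |     18628   |        8248 |
|Ryu            |             |             |       40408 |
|Ryu(size)      |             |             |       23836 |
|Dragonbox      |             |             |       23176 |
|Dragonbox(size)|             |             |       15132 |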

Performance

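|               | binary16 | binary32 | binary64 | float80 | binary128 |
|---------------|---------:|---------:|---------:|--------:|----------:|
|SwiftDtoa v1   |          |     25ns |     46ns |    82ns |           |
|SwiftDtoa v1a  |     37ns |     26ns |     47ns |    83ns |           |
|SwiftDtoa v2   |     22ns |     19ns |     31ns |    72ns |      90ns |
|Ryu            |          |     19ns |     26ns |         |           |
|Ryu(size)      |          |     17ns |     24ns |         |           |
|Dragonbox      |          |     19ns |     24ns |         |           |
|Dragonbox(size)|          |     19ns |     29ns |         |           |
|gdtoa          |    220ns |    381ns |   1184ns | 16044ns |   22800ns |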

ARM64

These were built using Apple clang 12.0.0 on a 2020 M1 Mac Mini running macOS 11.1.

Code Size

|               | 16,32,64 | 32,64 |
|---------------|---------:|------:|
|SwiftDtoa v1   |          |  7436 |
|SwiftDtoa v1a  |     9124 |       |
|SwiftDtoa v2   |     9964 |  8228 |
|Ryu            |          | 35764 |
|Ryu(size)      |          | 17872 |
|Dragonbox      |          | 27108 |
|Dragonbox(size)|          | 19172 |

Performance

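|               | binary16 | binary32 | binary64 | float80 | binary128 |
|---------------|---------:|---------:|---------:|--------:|----------:|
|SwiftDtoa v1   |          |     21ns |     39ns |         |           |
|SwiftDtoa v1a  |     17ns |     21ns |     39ns |         |           |
|SwiftDtoa v2   |     15ns |     17ns |     29ns |    54ns |      71ns |
|Ryu            |          |     15ns |     19ns |         |           |
|Ryu(size)      |          |     29ns |     24ns |         |           |
|Dragonbox      |          |     16ns |     24ns |         |           |
|Dragonbox(size)|          |     15ns |     34ns |         |           |
|gdtoa          |    143ns |    242ns |    858ns | 25129ns |   36195ns |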

ARM32

These were built using clang 8.0.1 on a BeagleBone Black (500MHz ARMv7) running FreeBSD 12.1-RELEASE.

Code Size

|               | 16,32,64 | 32,64 |
|---------------|---------:|------:|
|SwiftDtoa v1   |          |  8668 |
|SwiftDtoa v1a  |    10356 |       |
|SwiftDtoa v2   |     9796 |  8340 |
|Ryu            |          | 32292 |
|Ryu(size)      |          | 14592 |
|Dragonbox      |          | 29000 |
|Dragonbox(size)|          | 21980 |

Performance

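|               | binary16 | binary32 | binary64 |  float80 | binary128 |
|---------------|---------:|---------:|---------:|---------:|----------:|
|SwiftDtoa v1   |          |    459ns |   1152ns |          |           |
|SwiftDtoa v1a  |    383ns |    451ns |   1148ns |          |           |
|SwiftDtoa v2   |    202ns |    357ns |    715ns |   2720ns |    3379ns |
|Ryu            |          |    345ns |   5450ns |          |           |
|Ryu(size)      |          |    786ns |   5577ns |          |           |
|Dragonbox      |          |    300ns |    904ns |          |           |
|Dragonbox(size)|          |    294ns |   1021ns |          |           |
|gdtoa          |   2180ns |   4749ns |  18742ns | 293000ns |  440000ns |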

EDITED 1/8/2021: Thanks to Junekey Jeon for pointing out that I was not building Ryu or Dragonbox with the correct options to get a smaller table size. I added new rows to some of the tables above with numbers for x86_64 and ARM32.

EDITED 1/8/2021: Turns out I was building some code with -g and some without. I've rebuilt everything without -g and updated the numbers above.

EDITED 1/20/2021: Updated the benchmark results to track some additional code changes: Binary32 now has a separate core implementation, bringing perf in line with Ryu and Dragonbox, with a corresponding increase in code size. Minor perf/size improvements to binary64, float80, binary128.

@tbkka tbkka requested a review from stephentyrone January 7, 2021 22:01

tbkka commented Jan 7, 2021

@swift-ci Please test


tbkka commented Jan 7, 2021

@swift-ci Please benchmark


compnerd commented Jan 7, 2021

@swift-ci please test Windows platform


tbkka commented Jan 7, 2021

Note: I still need to edit the header to ensure that binary128 is truly disabled for all Swift builds. Currently, it would get built (but not used) on a platform where long double happens to be binary128.


compnerd commented Jan 7, 2021

@tbkka - the only platforms we need to worry about for that are PPC and AArch64 (non-Darwin). Windows AArch64 does not support FP128 and, IIRC, neither does Darwin AArch64.


swift-ci commented Jan 7, 2021

Performance: -O

Regression OLD NEW DELTA RATIO
LessSubstringSubstringGenericComparable 35 39 +11.4% 0.90x
ObjectiveCBridgeFromNSSetAnyObjectToString 61000 67500 +10.7% 0.90x (?)
FloatingPointPrinting_Float_description_uniform 5000 5500 +10.0% 0.91x (?)
LessSubstringSubstring 35 38 +8.6% 0.92x (?)
EqualSubstringSubstring 35 38 +8.6% 0.92x (?)
EqualSubstringString 35 38 +8.6% 0.92x
 
Improvement OLD NEW DELTA RATIO
CharIteration_utf16_unicodeScalars 4600 3920 -14.8% 1.17x (?)
FloatingPointPrinting_Double_description_small 18200 16000 -12.1% 1.14x
FloatingPointPrinting_Double_description_uniform 19200 17200 -10.4% 1.12x
RandomShuffleLCG2 432 400 -7.4% 1.08x

Code size: -O

Performance: -Osize

Regression OLD NEW DELTA RATIO
LessSubstringSubstring 35 39 +11.4% 0.90x
LessSubstringSubstringGenericComparable 35 39 +11.4% 0.90x
ObjectiveCBridgeFromNSStringForced 2220 2465 +11.0% 0.90x (?)
FloatingPointPrinting_Float_description_small 4644 5076 +9.3% 0.91x (?)
UTF8Decode_InitFromData_ascii_as_ascii 635 690 +8.7% 0.92x (?)
EqualSubstringSubstring 35 38 +8.6% 0.92x (?)
EqualStringSubstring 35 38 +8.6% 0.92x (?)
EqualSubstringSubstringGenericEquatable 35 38 +8.6% 0.92x (?)
EqualSubstringString 35 38 +8.6% 0.92x (?)
FloatingPointPrinting_Float_description_uniform 5000 5400 +8.0% 0.93x (?)
 
Improvement OLD NEW DELTA RATIO
FloatingPointPrinting_Double_description_small 18200 16000 -12.1% 1.14x (?)
FloatingPointPrinting_Double_description_uniform 19200 17200 -10.4% 1.12x
Array2D 6736 6224 -7.6% 1.08x
FloatingPointPrinting_Double_interpolated 47800 44400 -7.1% 1.08x (?)
StrComplexWalk 4620 4300 -6.9% 1.07x

Code size: -Osize

Performance: -Onone

Regression OLD NEW DELTA RATIO
DataToStringSmall 4600 5300 +15.2% 0.87x (?)
DropFirstAnyCollectionLazy 141769 154185 +8.8% 0.92x (?)
ArrayOfPOD 921 997 +8.3% 0.92x (?)

Code size: -swiftlibs

How to read the data

The tables contain differences in performance which are larger than 8% and differences in code size which are larger than 1%.

If you see any unexpected regressions, you should consider fixing the
regressions before you merge the PR.

Noise: Sometimes the performance results (not code size!) contain false
alarms. Unexpected regressions which are marked with '(?)' are probably noise.
If you see regressions which you cannot explain you can try to run the
benchmarks again. If regressions still show up, please consult with the
performance team (@eeckstein).

Hardware Overview
  Model Name: Mac Pro
  Model Identifier: MacPro6,1
  Processor Name: 8-Core Intel Xeon E5
  Processor Speed: 3 GHz
  Number of Processors: 1
  Total Number of Cores: 8
  L2 Cache (per Core): 256 KB
  L3 Cache: 25 MB
  Memory: 64 GB


tbkka commented Jan 7, 2021

@compnerd What about Android? I thought I heard that Android's long double was binary128?

I also believe that:

  • Windows has long double as IEEE 754 binary64 on all architectures
  • Darwin has long double as IEEE 754 binary64 on ARM64 and as Intel x87 float80 on x86 and x86_64.
  • PPC hardware supports double-double for quad precision, which is different than IEEE 754 binary128 format

The new logic in this header starts by examining the floating-point properties of the C types float, double, and long double as described in the C library's float.h header, which should accurately discriminate all of these cases. (This should be both more portable and more correct than previous versions of this file that hard-coded particular combinations of platforms and processors.)
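To make the float.h inspection concrete, here is a minimal sketch that classifies a C floating-point type from its `FLT_RADIX`, `*_MANT_DIG`, and `*_MAX_EXP` values. The enum and function names are hypothetical, and the real header makes this decision in the preprocessor rather than at run time; the numeric constants are the standard parameters of each format.

```c
#include <float.h>

typedef enum {
    FORMAT_UNKNOWN,
    FORMAT_BINARY32,
    FORMAT_BINARY64,
    FORMAT_FLOAT80,        /* Intel x87 extended precision */
    FORMAT_BINARY128,
    FORMAT_DOUBLE_DOUBLE   /* PPC-style pair of doubles */
} FloatFormat;

/* Maps (radix, significand bits, max exponent) to a known format. */
static FloatFormat classify_format(int radix, int mant_dig, int max_exp) {
    if (radix != 2) return FORMAT_UNKNOWN;
    if (mant_dig == 24  && max_exp == 128)   return FORMAT_BINARY32;
    if (mant_dig == 53  && max_exp == 1024)  return FORMAT_BINARY64;
    if (mant_dig == 64  && max_exp == 16384) return FORMAT_FLOAT80;
    if (mant_dig == 113 && max_exp == 16384) return FORMAT_BINARY128;
    if (mant_dig == 106 && max_exp == 1024)  return FORMAT_DOUBLE_DOUBLE;
    return FORMAT_UNKNOWN;
}

/* What this platform's long double actually is. */
static FloatFormat long_double_format(void) {
    return classify_format(FLT_RADIX, LDBL_MANT_DIG, LDBL_MAX_EXP);
}
```

Under this scheme, a Windows build would report binary64 for long double and a Darwin x86_64 build would report float80, without naming either platform in the code.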

Based on this inspection, the current SwiftDtoa implementation enables support for particular floating-point formats. The core implementations accept void * pointers to floating-point data in memory and have no reliance on C floating-point types. The implementation separately enables wrappers that accept C floating-point types and dispatch to the appropriate format support. So on Windows, this code would create a long double formatter that dispatches to the binary64 backend, and on Darwin x86_64, the long double formatter would dispatch to the float80 backend.
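The split between a format-specific core and a thin C-type wrapper might look roughly like the sketch below. It only decodes the fields of a binary64 value rather than formatting it, the names `Binary64Fields`, `decode_binary64`, and `decode_double` are hypothetical, and it assumes the bytes in memory use the host's integer byte order.

```c
#include <stdint.h>
#include <string.h>

typedef struct {
    int      negative;         /* sign bit */
    int      biased_exponent;  /* 11-bit exponent field */
    uint64_t fraction;         /* 52-bit fraction field */
} Binary64Fields;

/* Core: accepts a pointer to 8 bytes of binary64 data and uses only
 * integer operations, so it needs no floating-point support. */
static Binary64Fields decode_binary64(const void *p) {
    uint64_t bits;
    Binary64Fields f;
    memcpy(&bits, p, sizeof bits);
    f.negative = (int)(bits >> 63);
    f.biased_exponent = (int)((bits >> 52) & 0x7FF);
    f.fraction = bits & ((UINT64_C(1) << 52) - 1);
    return f;
}

/* Wrapper: accepts the C type and dispatches to the binary64 core.
 * It would be compiled only where double is known to be binary64. */
static Binary64Fields decode_double(double d) {
    return decode_binary64(&d);
}
```

A long double wrapper would follow the same pattern and pick whichever core matches the detected format.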

Some of this is a little muddled in the current Swift runtime code. Swift Float80 is always float80, which is not necessarily the same as C/C++ long double. For clarity, I think it would be good to be very cautious with the term "long double" since that may or may not be the same as float80.


compnerd commented Jan 8, 2021

Correct, I categorise Android as Linux, which will do FP128. However, I did forget that Android x86_64 also does FP128. Both of these are IEEE 754 binary128.

Windows does FP64 across the board for consistency, and is IEEE 754 binary64.

Hmm, I thought that there was some support for IEEE 754 FP128 on PPC64. But, yes, the hardware's native support for 128-bit floating point is not IEEE 754 conformant.

Agreed with you on the use of long double - and why I always explicitly spell out FP width.


swift-ci commented Jan 8, 2021

Build failed
Swift Test OS X Platform
Git Sha - 409c3d695b57a4b5e37510a2ee57b492279b9ae7


tbkka commented Jan 8, 2021

Benchmark results are about as expected. Apart from the noise (substring, data, etc), there are two significant benchmark differences here:

Swift Float.description is benchmarking about 9% slower, consistent with the results from benchmarking the underlying C code. This is fallout from the code size reductions: The primary code size reduction was to switch float to use the same back-end code as Double. This means float is now using more precision than is technically necessary for IEEE 754 binary32, which does reduce performance. Other performance improvements offset this, but not entirely.

Swift Double.description is showing about 14% faster on average. That's about 1/2 of the improvement that shows in the C implementation, which is consistent with the fact that around 1/2 of Doubles need a heap-allocated String to hold the result.

For what it's worth, I consider these to be modest impacts, especially compared to the 10x perf improvement that started this project (see PR #15474).


tbkka commented Jan 8, 2021

Tests failed on macOS with 14006 tests completed (out of 14007):

Build timed out (after 83 minutes). Marking the build as failed.

It seems that some test is taking longer than expected, but it's not clear which one.


tbkka commented Jan 8, 2021

@swift-ci Please test


swift-ci commented Jan 8, 2021

Build failed
Swift Test Linux Platform
Git Sha - f709b025837d7ba44bdc89c7d5f546ffbc093e61


swift-ci commented Jan 8, 2021

Build failed
Swift Test OS X Platform
Git Sha - f709b025837d7ba44bdc89c7d5f546ffbc093e61


tbkka commented Jan 8, 2021

Ugh. Just figured out that the SwiftDtoa numbers are all for .o files with debug information. So the code size comparisons aren't particularly valid. I'll try to get those updated.


tbkka commented Jan 9, 2021

@swift-ci Please test Linux platform


tbkka commented Jan 9, 2021

@swift-ci Please benchmark


swift-ci commented Jan 9, 2021

Performance: -O

Regression OLD NEW DELTA RATIO
StringFromLongWholeSubstring 4 5 +25.0% 0.80x
LessSubstringSubstringGenericComparable 39 43 +10.3% 0.91x
ObjectiveCBridgeFromNSDictionaryAnyObjectForced 8900 9700 +9.0% 0.92x (?)
LessSubstringSubstring 39 42 +7.7% 0.93x (?)
EqualSubstringSubstring 39 42 +7.7% 0.93x (?)
EqualSubstringString 39 42 +7.7% 0.93x
DataToStringSmall 2650 2850 +7.5% 0.93x (?)
 
Improvement OLD NEW DELTA RATIO
FloatingPointPrinting_Double_description_small 20300 17800 -12.3% 1.14x (?)
NSStringConversion.MutableCopy.UTF8 983 901 -8.3% 1.09x (?)
RandomShuffleLCG2 480 448 -6.7% 1.07x (?)

Code size: -O

Performance: -Osize

Regression OLD NEW DELTA RATIO
FlattenListFlatMap 4370 6848 +56.7% 0.64x (?)
StringFromLongWholeSubstring 4 5 +25.0% 0.80x
DataToStringSmall 2700 3000 +11.1% 0.90x (?)
LessSubstringSubstring 39 43 +10.3% 0.91x (?)
LessSubstringSubstringGenericComparable 39 43 +10.3% 0.91x
Data.init.Sequence.809B.Count.RE.I 79 86 +8.9% 0.92x (?)
EqualSubstringSubstring 39 42 +7.7% 0.93x (?)
EqualSubstringSubstringGenericEquatable 39 42 +7.7% 0.93x (?)
EqualSubstringString 39 42 +7.7% 0.93x (?)
 
Improvement OLD NEW DELTA RATIO
FloatingPointPrinting_Double_description_small 20300 17800 -12.3% 1.14x
FloatingPointPrinting_Double_description_uniform 21400 19100 -10.7% 1.12x (?)
Array2D 7504 6928 -7.7% 1.08x (?)
RandomShuffleLCG2 448 416 -7.1% 1.08x
FloatingPointPrinting_Double_interpolated 53400 49600 -7.1% 1.08x (?)
StrComplexWalk 5140 4800 -6.6% 1.07x

Code size: -Osize

Performance: -Onone

Regression OLD NEW DELTA RATIO
SevenBoom 1773 2246 +26.7% 0.79x (?)
NSStringConversion.MutableCopy.Rebridge.UTF8 1061 1185 +11.7% 0.90x (?)
NSStringConversion.MutableCopy.Rebridge.LongUTF8 868 946 +9.0% 0.92x (?)
Dictionary4OfObjects 1477 1602 +8.5% 0.92x (?)
ArrayOfPOD 1050 1132 +7.8% 0.93x (?)
 
Improvement OLD NEW DELTA RATIO
FloatingPointPrinting_Double_interpolated 80000 73400 -8.2% 1.09x (?)
String.data.Empty 97 90 -7.2% 1.08x (?)

Code size: -swiftlibs

Hardware Overview
  Model Name: Mac Pro
  Model Identifier: MacPro6,1
  Processor Name: 12-Core Intel Xeon E5
  Processor Speed: 2.7 GHz
  Number of Processors: 1
  Total Number of Cores: 12
  L2 Cache (per Core): 256 KB
  L3 Cache: 30 MB
  Memory: 64 GB


tbkka commented Jan 10, 2021

Inlining the 128-bit shifts on 64-bit processors seems to have addressed the benchmark regression for Float. Let's see if the test suites are happier today...
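For readers unfamiliar with the technique, a 128-bit shift built from two 64-bit halves can be written as a small inline function like the sketch below. The `U128` type and function name are hypothetical and this is not the code from the PR; it only shows the shape of the operation being inlined.

```c
#include <stdint.h>

/* A 128-bit unsigned value stored as two 64-bit halves. */
typedef struct {
    uint64_t high;
    uint64_t low;
} U128;

/* Shifts right by n bits, where 0 < n < 64. Bits leaving the high
 * half are carried into the top of the low half. Marking the function
 * static inline lets the compiler fold it into the caller. */
static inline U128 shift_right_128(U128 v, unsigned n) {
    U128 r;
    r.low = (v.low >> n) | (v.high << (64 - n));
    r.high = v.high >> n;
    return r;
}
```

The restriction on n matters: shifting a 64-bit value by 64 is undefined behavior in C, so n equal to 0 or 64 would need separate handling.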


tbkka commented Jan 10, 2021

@swift-ci Please test

@swift-ci

Build failed
Swift Test Linux Platform
Git Sha - 868b07c4ce7d4b43d025131bf6ceb2adbc789387


tbkka commented Jan 11, 2021

@swift-ci Please test Linux platform


#ifndef SWIFT_DTOA_H
#define SWIFT_DTOA_H

// <<<<< BEGIN Local configuration overrides for Swift runtime build
// Forcibly DISABLE binary128 support for Swift runtime
#define SWIFT_DTOA_BINARY128_SUPPORT 0

I'm tempted to define this on (or remove the condition) for the purposes of being able to back deploy a hypothetical future Float128 type. Thoughts?


I don't think there's any point right now. I see three issues:

Issue: These functions are not ABI. Prepping for back deployment would require building out the actual ABI, including wrestling with argument-passing conventions, etc.

Issue: We would need agreement on where to support Float128. Should it depend on long double being binary128? Will it depend on HW support?

Issue: The float80/binary128 arithmetic is the biggest part of this formatting logic. Turning on either one is almost 10k of code and tables, adding the second one is about 1.5k more. So adding binary128 to platforms that already have float80 support is pretty cheap here. But I'm wary of speculatively enabling this on platforms that don't already have float80 turned on.


SGTM.

For the record, I can answer this:

> Issue: We would need agreement on where to support Float128. Should it depend on long double being binary128? Will it depend on HW support?

If/when we add support for Float128, we'll make it available everywhere, independent of how long double is defined. C provides the _Float128 name for this purpose, so we don't need to depend on long double (frankly, it was a silly mistake for platforms to bind long double to binary128, but well, they did).

@tbkka tbkka force-pushed the tbkka/SwiftDtoav2 branch from 98bf7c5 to 6e83f2a Compare January 20, 2021 20:13
SwiftDtoa is the C/C++ code used in the Swift runtime to produce the textual representations used by the `description` and `debugDescription` properties of the standard Swift floating-point types.
This update includes a number of algorithmic improvements to SwiftDtoa to improve portability, reduce code size, and improve performance but does not change the actual output.

About SwiftDtoa
===============

In early versions of Swift, the `description` properties used the C library `sprintf` functionality with a fixed number of digits.
In 2018, that logic was replaced with the first version of SwiftDtoa, which used a fast, adaptive algorithm to automatically choose the correct number of digits for a particular value.
The resulting decimal output is always:

* Accurate.  Parsing the decimal form will yield exactly the same binary floating-point value again. This guarantee holds for any parser that accurately implements IEEE 754. In particular, the Swift standard library can guarantee that for any Double `d` that is not a NaN, `Double(d.description) == d`.

* Short. Among all accurate forms, this form has the fewest significant digits. (Caution: Surprisingly, this is not the same as minimizing the number of characters. In some cases, minimizing the number of characters requires producing additional significant digits.)

* Close. If there are multiple accurate, short forms, this code chooses the decimal form that is closest to the exact binary value.  If there are two exactly the same distance, the one with an even final digit will be used.

Algorithms that can produce this "optimal" output have been known since at least 1990, when Steele and White published their Dragon4 algorithm.
However, Dragon4 and other algorithms from that period relied on high-precision integer arithmetic, which made them slow.
More recently, a surge of interest in this problem has produced dramatically better algorithms that can produce the same results using only fast fixed-precision arithmetic.

This format is ideal for JSON and other textual interchange: accuracy ensures that the value will be correctly decoded, shortness minimizes network traffic, and the existence of high-performance algorithms allows this form to be generated more quickly than many `printf`-based implementations.

This format is also ideal for logging, debugging, and other general display. In particular, the shortness guarantee avoids the confusion of unnecessary additional digits, so that the result of `1.0 / 10.0` consistently displays as `0.1` instead of a longer form such as `0.10000000000000001` (the same value printed with 17 significant digits).

About SwiftDtoa v2
==================

Compared to the original SwiftDtoa code, this update is:

**Better**:
The core logic is implemented using only C99 features with 64-bit and smaller integer arithmetic.
If available, 128-bit integers are used for better performance.
The core routines do not require any floating-point support from the C/C++ standard library and with only minor modifications should be usable on systems with no hardware or software floating-point support at all.
This version also has experimental support for IEEE 754 binary128 format, though this support is obviously not included when compiling for the Swift standard library.

**Smaller**:
Code size reduction compared to the earlier versions was a primary goal for this effort.
In particular, the new binary128 support shares essentially all of its code with the float80 implementation.

**Faster**:
Even with the code size reductions, all formats are noticeably faster.
The primary performance gains come from three major changes:
Text digits are now emitted directly in the core routines in a form that requires only minimal adjustment to produce the final text.
Digit generation produces 2, 4, or even 8 digits at a time, depending on the format.
The double logic optimistically produces 7 digits in the initial scaling with a Ryu-inspired backtracking when fewer digits suffice.

SwiftDtoa's algorithms
======================

SwiftDtoa started out as a variation of Florian Loitsch's Grisu2 that addressed the shortness failures of that algorithm.
Subsequent work has incorporated ideas from Errol3, Ryu, and other sources to yield a production-quality implementation that is performance- and size-competitive with current research code.

Those who wish to understand the details can read the extensive comments included in the code.
Note that float16 actually uses a different algorithm than the other formats, as the extremely limited range can be handled with much simpler techniques.
The float80/binary128 logic sacrifices some performance optimizations in order to minimize the code size for these less-used formats; the goal for SwiftDtoa v2 has been to match the float80 performance of earlier implementations while reducing code size and widening the arithmetic routines sufficiently to support binary128.

SwiftDtoa Testing
=================

A newly-developed test harness generates several large files of test data that include known-correct results computed with high-precision arithmetic routines.
The test files include:
* Critical values generated by the algorithm presented in the Errol paper (about 48 million cases for binary128)
* Values for which the optimal decimal form is exactly midway between two binary floating-point values.
* All exact powers of two representable in this format.
* Floating-point values that are close to exact powers of ten.

In addition, several billion random values for each format were compared to the results from other implementations.
For binary16 and binary32 this provided exhaustive validation of every possible input value.

Code Size and Performance
=========================

The tables below summarize the code size and performance for the SwiftDtoa C library module by itself on several different processor architectures.
When used from Swift, the `.description` and `.debugDescription` implementations incur additional overhead for creating and returning Swift strings that are not captured here.

The code size tables show the total size in bytes of the compiled `.o` object files for a particular version of that code.
The headings indicate the floating-point formats supported by that particular build (e.g., "16,32" for a version that supports binary16 and binary32 but no other formats).

The performance numbers below were obtained from a custom test harness that generates random bit patterns, interprets them as the corresponding floating-point value, and averages the overall time.
For float80, the random bit patterns were generated in a way that avoids generating invalid values.

All code was compiled with the system C/C++ compiler using `-O2` optimization.

A few notes about particular implementations:
* **SwiftDtoa v1** is the original SwiftDtoa implementation as committed to the Swift runtime in April 2018.
* **SwiftDtoa v1a** is the same as SwiftDtoa v1 with added binary16 support.
* **SwiftDtoa v2** can be configured with preprocessor macros to support any subset of the supported formats.  I've provided sizes here for several different build configurations.
* **Ryu** (Ulf Adams) implements binary32 and binary64 as completely independent source files.  The size here is the total size of the two .o object files.
* **Ryu(size)** is Ryu compiled with the `RYU_OPTIMIZE_SIZE` option.
* **Dragonbox** (Junekey Jeon).  The size here is the compiled size of a simple `.cpp` file that instantiates the template for the specified formats, plus the size of the associated text output logic.
* **Dragonbox(size)** is Dragonbox compiled to minimize size by using a compressed power-of-10 table.
* **gdtoa** has a very large feature set, so its code size is not comparable to the others.  I excluded it from the code size tables.

x86_64
----------------

These were built using Apple clang 12.0.5 on a 2019 16" MacBook Pro (2.4GHz 8-core Intel Core i9) running macOS 11.1.

**Code Size**

Bold numbers here indicate the configurations that have shipped as part of the Swift runtime.

|               | 16,32,64,80 | 32,64,80    | 32,64       |
|---------------|------------:|------------:|------------:|
|SwiftDtoa v1   |             |   **15128** |             |
|SwiftDtoa v1a  |   **16888** |             |             |
|SwiftDtoa v2   |   **20220** |     18628   |        8248 |
|Ryu            |             |             |       40408 |
|Ryu(size)      |             |             |       23836 |
|Dragonbox      |             |             |       23176 |
|Dragonbox(size)|             |             |       15132 |

**Performance**

|              | binary16 | binary32 | binary64 | float80 | binary128 |
|--------------|---------:|---------:|---------:|--------:|----------:|
|SwiftDtoa v1  |          |     25ns |     46ns |    82ns |           |
|SwiftDtoa v1a |     37ns |     26ns |     47ns |    83ns |           |
|SwiftDtoa v2  |     22ns |     19ns |     31ns |    72ns |      90ns |
|Ryu           |          |     19ns |     26ns |         |           |
|Ryu(size)     |          |     17ns |     24ns |         |           |
|Dragonbox     |          |     19ns |     24ns |         |           |
|Dragonbox(size) |        |     19ns |     29ns |         |           |
|gdtoa         |    220ns |    381ns |   1184ns | 16044ns |   22800ns |

ARM64
----------------

These were built using Apple clang 12.0.0 on a 2020 M1 Mac Mini running macOS 11.1.

**Code Size**

|               | 16,32,64 | 32,64 |
|---------------|---------:|------:|
|SwiftDtoa v1   |          |  7436 |
|SwiftDtoa v1a  |     9124 |       |
|SwiftDtoa v2   |     9964 |  8228 |
|Ryu            |          | 35764 |
|Ryu(size)      |          | 16708 |
|Dragonbox      |          | 27108 |
|Dragonbox(size)|          | 19172 |

**Performance**

|              | binary16 | binary32 | binary64 | float80 | binary128 |
|--------------|---------:|---------:|---------:|--------:|----------:|
|SwiftDtoa v1  |          |     21ns |     39ns |         |           |
|SwiftDtoa v1a |     17ns |     21ns |     39ns |         |           |
|SwiftDtoa v2  |     15ns |     17ns |     29ns |    54ns |      71ns |
|Ryu           |          |     15ns |     19ns |         |           |
|Ryu(size)     |          |     29ns |     24ns |         |           |
|Dragonbox     |          |     16ns |     24ns |         |           |
|Dragonbox(size) |        |     15ns |     34ns |         |           |
|gdtoa         |    143ns |    242ns |    858ns | 25129ns |   36195ns |

ARM32
----------------

These were built using clang 8.0.1 on a BeagleBone Black (500MHz ARMv7) running FreeBSD 12.1-RELEASE.

**Code Size**

|               | 16,32,64 | 32,64 |
|---------------|---------:|------:|
|SwiftDtoa v1   |          |  8668 |
|SwiftDtoa v1a  |    10356 |       |
|SwiftDtoa v2   |     9796 |  8340 |
|Ryu            |          | 32292 |
|Ryu(size)      |          | 14592 |
|Dragonbox      |          | 29000 |
|Dragonbox(size)|          | 21980 |

**Performance**

|              | binary16 | binary32 | binary64 | float80 | binary128 |
|--------------|---------:|---------:|---------:|--------:|----------:|
|SwiftDtoa v1  |          |    459ns |   1152ns |         |           |
|SwiftDtoa v1a |    383ns |    451ns |   1148ns |         |           |
|SwiftDtoa v2  |    202ns |    357ns |    715ns |  2720ns |    3379ns |
|Ryu           |          |    345ns |   5450ns |         |           |
|Ryu(size)     |          |    786ns |   5577ns |         |           |
|Dragonbox     |          |    300ns |    904ns |         |           |
|Dragonbox(size) |        |    294ns |   1021ns |         |           |
|gdtoa         |   2180ns |   4749ns |  18742ns |293000ns |  440000ns |

@tbkka tbkka force-pushed the tbkka/SwiftDtoav2 branch from 6e83f2a to 79c3a42 Compare January 20, 2021 20:20

tbkka commented Jan 20, 2021

Testing has gone so well that I decided to push a few more improvements:

  • Restored the separate binary32 formatter. This brings binary32 roughly to par with Ryu for performance. (perf)
  • Eliminated an expensive multiply from the interval computation for binary64 and float80/binary128 (size & perf)
  • Factored out some of the common textual adjustments from binary32,64,80,128 (size)

tbkka commented Jan 20, 2021

Also, I changed the float80 infinity/NaN parsing to more strictly follow 80387 conventions as documented on Wikipedia. In particular, the 8087/80287 infinity is now handled as a signaling NaN.
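
For readers unfamiliar with the x87 encodings, here is a hedged sketch of how the exponent-all-ones cases can be told apart from the raw fields. It is not the code from this PR. The type and function names are hypothetical, and mapping the pseudo-NaN encodings to signaling NaN is an assumption; only the treatment of the 8087/80287 infinity as a signaling NaN comes from the comment above.

```c
#include <assert.h>
#include <stdint.h>

typedef enum {
    F80_FINITE,
    F80_INFINITY,
    F80_QUIET_NAN,
    F80_SIGNALING_NAN
} f80_class;

/* Classify an x87 80-bit value from its raw fields:
 *   sign_exponent: 1 sign bit followed by a 15-bit exponent
 *   significand:   64 bits, with an explicit integer bit at bit 63 */
static f80_class classify_float80(uint16_t sign_exponent, uint64_t significand) {
    uint16_t exponent = sign_exponent & 0x7FFF;
    if (exponent != 0x7FFF) {
        return F80_FINITE;  /* zero, subnormal, normal, or unnormal */
    }
    unsigned top2 = (unsigned)(significand >> 62);             /* bits 63 and 62 */
    uint64_t fraction = significand & 0x3FFFFFFFFFFFFFFFULL;   /* low 62 bits */
    switch (top2) {
    case 2:  /* integer bit set, quiet bit clear */
        return fraction == 0 ? F80_INFINITY : F80_SIGNALING_NAN;
    case 3:  /* integer bit set, quiet bit set */
        return F80_QUIET_NAN;
    default: /* integer bit clear: 8087/80287 infinity or pseudo-NaN.
              * Treating all of these as signaling NaN is an assumption. */
        return F80_SIGNALING_NAN;
    }
}
```

Under this scheme only the pattern with the integer bit set and every lower significand bit clear counts as infinity. The all-zero significand with an all-ones exponent, which the 8087 and 80287 accepted as infinity, falls into the default branch.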

tbkka commented Jan 20, 2021

@swift-ci Please test

tbkka commented Jan 20, 2021

I've also gone ahead and changed the float80 stub interface to be conditionally compiled only on platforms where Swift Stdlib actually supports Float80. That eliminated a chunk of pointless boilerplate that was trying to provide a makeshift float80 formatter for other platforms.

@swift-ci
Contributor

Build failed
Swift Test Linux Platform
Git Sha - 79c3a42

@swift-ci
Contributor

Build failed
Swift Test OS X Platform
Git Sha - 79c3a42

tbkka commented Jan 21, 2021

Fixed a bug in the new float80 nan/inf parsing.

tbkka commented Jan 21, 2021

I measured the time for the PrintFloat.swift.gyb test: it takes 14s on my laptop with an optimized Stdlib and 15s with an unoptimized one, so there's no reason to disable this test in the latter case.

tbkka commented Jan 21, 2021

@swift-ci Please test

tbkka commented Jan 21, 2021

I'm still a little unhappy that binary16 is now slower than binary32 on x86_64, but that can wait for a future update.

tbkka commented Jan 21, 2021

@swift-ci Please test Windows platform

@tbkka tbkka merged commit a32dacb into swiftlang:main Jan 27, 2021
@tbkka tbkka deleted the tbkka/SwiftDtoav2 branch January 27, 2021 22:36