SwiftDtoa v2: Better, Smaller, Faster floating-point formatting #35299
@swift-ci Please test

@swift-ci Please benchmark

@swift-ci please test Windows platform

Note: I still need to edit the header to ensure that binary128 is truly disabled for all Swift builds. Currently, it would get built (but not used) on a platform where

@tbkka - the only platforms we need to worry about for that are PPC and AArch64 (non-Darwin). Windows AArch64 does not support FP128 and IIRC, neither does Darwin AArch64.

Performance: -O
Code size: -O
Performance: -Osize
Code size: -Osize
Performance: -Onone
Code size: -swiftlibs
How to read the data: The tables contain differences in performance which are larger than 8% and differences in code size which are larger than 1%.

@compnerd What about Android? I thought I heard that Android's I also believe that:
The new logic in this header starts by examining the floating-point properties of the C types Based on this inspection, the current SwiftDtoa implementation enables support for particular floating-point formats. The core implementations accept Some of this is a little muddled in the current Swift runtime code. Swift

Correct, I categorise Android as Linux, which will do FP128. However, I did forget that Android x86_64 also does FP128. Both of these are IEEE 754 binary128. Windows does FP64 across the board for consistency, and is IEEE 754 binary64. Hmm, I thought that there was some support for IEEE 754 FP128 on PPC64. But, yes, the hardware's native support for 128-bit floating point is not IEEE 754 conformant. Agreed with you on the use of

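The kind of header inspection discussed in this exchange can be sketched with the standard `<float.h>` macros. This is only an illustration under stated assumptions: the macro `EXAMPLE_LONG_DOUBLE_FORMAT` and the function `example_long_double_format` are hypothetical names, and the actual checks in `SwiftDtoa.h` may differ.

```c
#include <float.h>

/* Hypothetical classification of the platform's `long double`.
 * These names are invented for this sketch and are not the macros
 * used by SwiftDtoa.h. */
#if (FLT_RADIX == 2) && (LDBL_MANT_DIG == 53) && (LDBL_MAX_EXP == 1024)
  /* long double is the same format as double (e.g. Windows) */
  #define EXAMPLE_LONG_DOUBLE_FORMAT "binary64"
#elif (FLT_RADIX == 2) && (LDBL_MANT_DIG == 64) && (LDBL_MAX_EXP == 16384)
  /* x87 extended precision */
  #define EXAMPLE_LONG_DOUBLE_FORMAT "float80"
#elif (FLT_RADIX == 2) && (LDBL_MANT_DIG == 113) && (LDBL_MAX_EXP == 16384)
  /* IEEE 754 binary128 */
  #define EXAMPLE_LONG_DOUBLE_FORMAT "binary128"
#else
  /* Anything else, such as a double-double long double */
  #define EXAMPLE_LONG_DOUBLE_FORMAT "unknown"
#endif

/* Returns the format name chosen at compile time. */
const char *example_long_double_format(void) {
    return EXAMPLE_LONG_DOUBLE_FORMAT;
}
```

A build could use a classification like this to decide which formatters to compile, falling back to no `long double` support when the format is not recognized.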
Build failed

Benchmark results are about as expected. Apart from the noise (substring, data, etc), there are two significant benchmark differences here:
* Swift Float.description is benchmarking about 9% slower, consistent with the results from benchmarking the underlying C code. This is fallout from the code size reductions: The primary code size reduction was to switch float to use the same back-end code as Double. This means float is now using more precision than is technically necessary for IEEE 754 binary32, which does reduce performance. Other performance improvements offset this, but not entirely.
* Swift Double.description is showing about 14% faster on average. That's about 1/2 of the improvement that shows in the C implementation, which is consistent with the fact that around 1/2 of Doubles need a heap-allocated String to hold the result.
For what it's worth, I consider these to be modest impacts, especially compared to the 10x perf improvement that started this project (see PR #15474).

Tests failed on macOS with 14006 tests completed (out of 14007):
It seems that some test is taking longer than expected, but it's not clear which one.

@swift-ci Please test

Build failed

Build failed

Ugh. Just figured out that the SwiftDtoa numbers are all for .o files with debug information. So the code size comparisons aren't particularly valid. I'll try to get those updated.

@swift-ci Please test Linux platform

@swift-ci Please benchmark

Performance: -O
Code size: -O
Performance: -Osize
Code size: -Osize
Performance: -Onone
Code size: -swiftlibs
How to read the data: The tables contain differences in performance which are larger than 8% and differences in code size which are larger than 1%.

Inlining the 128-bit shifts on 64-bit processors seems to have addressed the benchmark regression for Float. Let's see if the test suites are happier today...

@swift-ci Please test

Build failed

@swift-ci Please test Linux platform

include/swift/Runtime/SwiftDtoa.h (Outdated)

#ifndef SWIFT_DTOA_H
#define SWIFT_DTOA_H

// <<<<< BEGIN Local configuration overrides for Swift runtime build
// Forcibly DISABLE binary128 support for Swift runtime
#define SWIFT_DTOA_BINARY128_SUPPORT 0

I'm tempted to define this on (or remove the condition) for the purposes of being able to back deploy a hypothetical future `Float128` type. Thoughts?
I don't think there's any point right now. I see three issues:
Issue: These functions are not ABI. Prepping for back deployment would require building out the actual ABI, including wrestling with argument-passing conventions, etc.
Issue: We would need agreement on where to support Float128. Should it depend on `long double` being binary128? Will it depend on HW support?
Issue: The float80/binary128 arithmetic is the biggest part of this formatting logic. Turning on either one is almost 10k of code and tables, adding the second one is about 1.5k more. So adding binary128 to platforms that already have float80 support is pretty cheap here. But I'm wary of speculatively enabling this on platforms that don't already have float80 turned on.
SGTM.
For the record, I can answer this:
Issue: We would need agreement on where to support Float128. Should it depend on long double being binary128? Will it depend on HW support?
If/when we add support for Float128, we'll make it available everywhere, independent of how `long double` is defined. C provides the `_Float128` name for this purpose, so we don't need to depend on `long double` (frankly, it was a silly mistake for platforms to bind `long double` to binary128, but well, they did).
98bf7c5 to 6e83f2a
SwiftDtoa is the C/C++ code used in the Swift runtime to produce the textual representations used by the `description` and `debugDescription` properties of the standard Swift floating-point types. This update includes a number of algorithmic improvements to SwiftDtoa to improve portability, reduce code size, and improve performance but does not change the actual output.

About SwiftDtoa
===============

In early versions of Swift, the `description` properties used the C library `sprintf` functionality with a fixed number of digits. In 2018, that logic was replaced with the first version of SwiftDtoa which used a fast, adaptive algorithm to automatically choose the correct number of digits for a particular value. The resulting decimal output is always:

* Accurate. Parsing the decimal form will yield exactly the same binary floating-point value again. This guarantee holds for any parser that accurately implements IEEE 754. In particular, the Swift standard library can guarantee that for any Double `d` that is not a NaN, `Double(d.description) == d`.

* Short. Among all accurate forms, this form has the fewest significant digits. (Caution: Surprisingly, this is not the same as minimizing the number of characters. In some cases, minimizing the number of characters requires producing additional significant digits.)

* Close. If there are multiple accurate, short forms, this code chooses the decimal form that is closest to the exact binary value. If there are two exactly the same distance, the one with an even final digit will be used.

Algorithms that can produce this "optimal" output have been known since at least 1990, when Steele and White published their Dragon4 algorithm. However, Dragon4 and other algorithms from that period relied on high-precision integer arithmetic, which made them slow. More recently, a surge of interest in this problem has produced dramatically better algorithms that can produce the same results using only fast fixed-precision arithmetic.

This format is ideal for JSON and other textual interchange: accuracy ensures that the value will be correctly decoded, shortness minimizes network traffic, and the existence of high-performance algorithms allows this form to be generated more quickly than many `printf`-based implementations.

This format is also ideal for logging, debugging, and other general display. In particular, the shortness guarantee avoids the confusion of unnecessary additional digits, so that the result of `1.0 / 10.0` consistently displays as `0.1` instead of `0.100000000000000000001`.

About SwiftDtoa v2
==================

Compared to the original SwiftDtoa code, this update is:

**Better**: The core logic is implemented using only C99 features with 64-bit and smaller integer arithmetic. If available, 128-bit integers are used for better performance. The core routines do not require any floating-point support from the C/C++ standard library and with only minor modifications should be usable on systems with no hardware or software floating-point support at all. This version also has experimental support for IEEE 754 binary128 format, though this support is obviously not included when compiling for the Swift standard library.

**Smaller**: Code size reduction compared to the earlier versions was a primary goal for this effort. In particular, the new binary128 support shares essentially all of its code with the float80 implementation.

**Faster**: Even with the code size reductions, all formats are noticeably faster. The primary performance gains come from three major changes:

* Text digits are now emitted directly in the core routines in a form that requires only minimal adjustment to produce the final text.
* Digit generation produces 2, 4, or even 8 digits at a time, depending on the format.
* The double logic optimistically produces 7 digits in the initial scaling with a Ryu-inspired backtracking when fewer digits suffice.

SwiftDtoa's algorithms
======================

SwiftDtoa started out as a variation of Florian Loitsch's Grisu2 that addressed the shortness failures of that algorithm. Subsequent work has incorporated ideas from Errol3, Ryu, and other sources to yield a production-quality implementation that is performance- and size-competitive with current research code.

Those who wish to understand the details can read the extensive comments included in the code. Note that float16 actually uses a different algorithm than the other formats, as the extremely limited range can be handled with much simpler techniques. The float80/binary128 logic sacrifices some performance optimizations in order to minimize the code size for these less-used formats; the goal for SwiftDtoa v2 has been to match the float80 performance of earlier implementations while reducing code size and widening the arithmetic routines sufficiently to support binary128.

SwiftDtoa Testing
=================

A newly developed test harness generates several large files of test data that include known-correct results computed with high-precision arithmetic routines. The test files include:

* Critical values generated by the algorithm presented in the Errol paper (about 48 million cases for binary128)
* Values for which the optimal decimal form is exactly midway between two binary floating-point values.
* All exact powers of two representable in this format.
* Floating-point values that are close to exact powers of ten.

In addition, several billion random values for each format were compared to the results from other implementations. For binary16 and binary32 this provided exhaustive validation of every possible input value.

Code Size and Performance
=========================

The tables below summarize the code size and performance for the SwiftDtoa C library module by itself on several different processor architectures. When used from Swift, the `.description` and `.debugDescription` implementations incur additional overhead for creating and returning Swift strings that are not captured here.

The code size tables show the total size in bytes of the compiled `.o` object files for a particular version of that code. The headings indicate the floating-point formats supported by that particular build (e.g., "16,32" for a version that supports binary16 and binary32 but no other formats).

The performance numbers below were obtained from a custom test harness that generates random bit patterns, interprets them as the corresponding floating-point value, and averages the overall time. For float80, the random bit patterns were generated in a way that avoids generating invalid values.

All code was compiled with the system C/C++ compiler using `-O2` optimization.

A few notes about particular implementations:

* **SwiftDtoa v1** is the original SwiftDtoa implementation as committed to the Swift runtime in April 2018.
* **SwiftDtoa v1a** is the same as SwiftDtoa v1 with added binary16 support.
* **SwiftDtoa v2** can be configured with preprocessor macros to support any subset of the supported formats. I've provided sizes here for several different build configurations.
* **Ryu** (Ulf Adams) implements binary32 and binary64 as completely independent source files. The size here is the total size of the two .o object files.
* **Ryu(size)** is Ryu compiled with the `RYU_OPTIMIZE_SIZE` option.
* **Dragonbox** (Junekey Jeon). The size here is the compiled size of a simple `.cpp` file that instantiates the template for the specified formats, plus the size of the associated text output logic.
* **Dragonbox(size)** is Dragonbox compiled to minimize size by using a compressed power-of-10 table.
* **gdtoa** has a very large feature set. For this reason, I excluded it from the code size comparison since I didn't consider the numbers to be comparable to the others.

x86_64
----------------

These were built using Apple clang 12.0.5 on a 2019 16" MacBook Pro (2.4GHz 8-core Intel Core i9) running macOS 11.1.

**Code Size**

Bold numbers here indicate the configurations that have shipped as part of the Swift runtime.

|               | 16,32,64,80 | 32,64,80    | 32,64       |
|---------------|------------:|------------:|------------:|
|SwiftDtoa v1   |             | **15128**   |             |
|SwiftDtoa v1a  | **16888**   |             |             |
|SwiftDtoa v2   | **20220**   | 18628       | 8248        |
|Ryu            |             |             | 40408       |
|Ryu(size)      |             |             | 23836       |
|Dragonbox      |             |             | 23176       |
|Dragonbox(size)|             |             | 15132       |

**Performance**

|                | binary16 | binary32 | binary64 | float80 | binary128 |
|----------------|---------:|---------:|---------:|--------:|----------:|
|SwiftDtoa v1    |          | 25ns     | 46ns     | 82ns    |           |
|SwiftDtoa v1a   | 37ns     | 26ns     | 47ns     | 83ns    |           |
|SwiftDtoa v2    | 22ns     | 19ns     | 31ns     | 72ns    | 90ns      |
|Ryu             |          | 19ns     | 26ns     |         |           |
|Ryu(size)       |          | 17ns     | 24ns     |         |           |
|Dragonbox       |          | 19ns     | 24ns     |         |           |
|Dragonbox(size) |          | 19ns     | 29ns     |         |           |
|gdtoa           | 220ns    | 381ns    | 1184ns   | 16044ns | 22800ns   |

ARM64
----------------

These were built using Apple clang 12.0.0 on a 2020 M1 Mac Mini running macOS 11.1.

**Code Size**

|               | 16,32,64 | 32,64 |
|---------------|---------:|------:|
|SwiftDtoa v1   |          | 7436  |
|SwiftDtoa v1a  | 9124     |       |
|SwiftDtoa v2   | 9964     | 8228  |
|Ryu            |          | 35764 |
|Ryu(size)      |          | 16708 |
|Dragonbox      |          | 27108 |
|Dragonbox(size)|          | 19172 |

**Performance**

|                | binary16 | binary32 | binary64 | float80 | binary128 |
|----------------|---------:|---------:|---------:|--------:|----------:|
|SwiftDtoa v1    |          | 21ns     | 39ns     |         |           |
|SwiftDtoa v1a   | 17ns     | 21ns     | 39ns     |         |           |
|SwiftDtoa v2    | 15ns     | 17ns     | 29ns     | 54ns    | 71ns      |
|Ryu             |          | 15ns     | 19ns     |         |           |
|Ryu(size)       |          | 29ns     | 24ns     |         |           |
|Dragonbox       |          | 16ns     | 24ns     |         |           |
|Dragonbox(size) |          | 15ns     | 34ns     |         |           |
|gdtoa           | 143ns    | 242ns    | 858ns    | 25129ns | 36195ns   |

ARM32
----------------

These were built using clang 8.0.1 on a BeagleBone Black (500MHz ARMv7) running FreeBSD 12.1-RELEASE.

**Code Size**

|               | 16,32,64 | 32,64 |
|---------------|---------:|------:|
|SwiftDtoa v1   |          | 8668  |
|SwiftDtoa v1a  | 10356    |       |
|SwiftDtoa v2   | 9796     | 8340  |
|Ryu            |          | 32292 |
|Ryu(size)      |          | 14592 |
|Dragonbox      |          | 29000 |
|Dragonbox(size)|          | 21980 |

**Performance**

|                | binary16 | binary32 | binary64 | float80  | binary128 |
|----------------|---------:|---------:|---------:|---------:|----------:|
|SwiftDtoa v1    |          | 459ns    | 1152ns   |          |           |
|SwiftDtoa v1a   | 383ns    | 451ns    | 1148ns   |          |           |
|SwiftDtoa v2    | 202ns    | 357ns    | 715ns    | 2720ns   | 3379ns    |
|Ryu             |          | 345ns    | 5450ns   |          |           |
|Ryu(size)       |          | 786ns    | 5577ns   |          |           |
|Dragonbox       |          | 300ns    | 904ns    |          |           |
|Dragonbox(size) |          | 294ns    | 1021ns   |          |           |
|gdtoa           | 2180ns   | 4749ns   | 18742ns  | 293000ns | 440000ns  |

6e83f2a to 79c3a42
Testing has gone so well that I decided to push a few more improvements:

Also, I changed the float80 infinity/NaN parsing to more strictly follow 80387 conventions as documented on Wikipedia. In particular, the 8087/80287 infinity is now handled as a signaling NaN.

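As a rough illustration of the float80 special-value handling mentioned above, the sketch below classifies an x87 extended-precision bit pattern when the exponent field is all ones. The names (`example_f80_class`, `example_classify_float80`) are hypothetical and this is not the SwiftDtoa code. Treating every pattern with a clear integer bit as a signaling NaN is this sketch's assumption; the comment above only confirms that treatment for the 8087/80287 infinity.

```c
#include <stdint.h>

/* Hypothetical classification result for this sketch. */
typedef enum {
    EXAMPLE_F80_OTHER,          /* exponent field is not all ones */
    EXAMPLE_F80_INFINITY,
    EXAMPLE_F80_SIGNALING_NAN,
    EXAMPLE_F80_QUIET_NAN
} example_f80_class;

/* sign_exponent: the upper 16 bits (sign bit plus 15-bit exponent).
 * significand:   the lower 64 bits, including the explicit integer bit 63. */
example_f80_class example_classify_float80(uint16_t sign_exponent,
                                           uint64_t significand) {
    if ((sign_exponent & 0x7FFF) != 0x7FFF) {
        return EXAMPLE_F80_OTHER;
    }
    uint64_t top2 = significand >> 62;                         /* bits 63..62 */
    uint64_t rest = significand & ((UINT64_C(1) << 62) - 1);   /* bits 61..0  */
    if (top2 == 2 && rest == 0) {
        return EXAMPLE_F80_INFINITY;       /* 80387-style infinity */
    }
    if (top2 == 3) {
        return EXAMPLE_F80_QUIET_NAN;      /* quiet bit (bit 62) is set */
    }
    /* Integer bit set with quiet bit clear is a signaling NaN. Integer bit
     * clear covers the 8087/80287 infinity (all-zero significand) and the
     * pseudo-NaNs; this sketch reports all of them as signaling NaN. */
    return EXAMPLE_F80_SIGNALING_NAN;
}
```

For example, `example_classify_float80(0x7FFF, 0)` is the old 8087/80287 infinity encoding and is reported here as a signaling NaN, while `example_classify_float80(0x7FFF, 0x8000000000000000)` is reported as infinity.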
@swift-ci Please test

I've also gone ahead and changed the float80 stub interface to be conditionally compiled only on platforms where Swift Stdlib actually supports Float80. That eliminated a chunk of pointless boilerplate that was trying to provide a makeshift float80 formatter for other platforms.

Build failed

Build failed

Fixed a bug in the new float80 nan/inf parsing.

I measured the time for the

@swift-ci Please test

I'm still a little unhappy that binary16 is now slower than binary32 on x86_64, but that can wait for a future update.

@swift-ci Please test Windows platform

SwiftDtoa is the C/C++ code used in the Swift runtime to produce the textual representations used by the `description` and `debugDescription` properties of the standard Swift floating-point types. This update includes a number of algorithmic improvements to SwiftDtoa to improve portability, reduce code size, and improve performance but does not change the actual output.
About SwiftDtoa
In early versions of Swift, the `description` properties used the C library `sprintf` functionality with a fixed number of digits. In April 2018, PR #15474 replaced that logic with the first version of SwiftDtoa which used a fast, adaptive algorithm to automatically choose the correct number of digits for a particular value. The resulting decimal output is always:
Accurate. Parsing the decimal form will yield exactly the same binary floating-point value again. This guarantee holds for any parser that accurately implements IEEE 754. In particular, the Swift standard library can guarantee that for any Double `d` that is not a NaN, `Double(d.description) == d`.
.Short. Among all accurate forms, this form has the fewest significant digits. (Caution: Surprisingly, this is not the same as minimizing the number of characters. In some cases, minimizing the number of characters requires producing additional significant digits.)
Close. If there are multiple accurate, short forms, this code chooses the decimal form that is closest to the exact binary value. If there are two exactly the same distance, the one with an even final digit will be used.
Algorithms that can produce this "optimal" output have been known since at least 1990, when Steele and White published their Dragon4 algorithm. However, Dragon4 and other algorithms from that period relied on high-precision integer arithmetic, which made them slow. More recently, a surge of interest in this problem has produced dramatically better algorithms that can produce the same results using only fast fixed-precision arithmetic.
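To make the "Accurate" and "Short" properties concrete, here is a deliberately slow brute-force sketch in C. It is not the SwiftDtoa algorithm: it simply tries 1 to 17 significant digits with `snprintf` and keeps the first form that `strtod` parses back to the same value. The function name `example_shortest_roundtrip` is hypothetical, and the sketch assumes the C library's `printf` and `strtod` are correctly rounded. The `%g` output style also differs from Swift's `description` text.

```c
#include <stdio.h>
#include <stdlib.h>

/* Writes into buf the shortest "%.{n}g" rendering of d that parses back to
 * exactly d. Returns the number of significant digits used (1 to 17), or 0
 * if no form round-trips (this happens for NaN, which never compares equal).
 * buf should have room for at least 32 bytes. */
int example_shortest_roundtrip(double d, char *buf, size_t buflen) {
    for (int digits = 1; digits <= 17; digits++) {
        snprintf(buf, buflen, "%.*g", digits, d);
        if (strtod(buf, NULL) == d) {
            return digits;
        }
    }
    return 0;
}
```

With a correctly rounded C library, `1.0 / 10.0` needs only one digit and yields `0.1`, while `0.1 + 0.2` needs all 17 digits and yields `0.30000000000000004`. The fast algorithms described above reach the same digits without the repeated format-and-parse loop.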
This format is ideal for JSON and other textual interchange: accuracy ensures that the value will be correctly decoded, shortness minimizes network traffic, and the existence of high-performance algorithms allows this form to be generated more quickly than many `printf`-based implementations.
This format is also ideal for logging, debugging, and other general display. In particular, the shortness guarantee avoids the confusion of unnecessary additional digits, so that the result of `1.0 / 10.0` consistently displays as `0.1` instead of `0.100000000000000000001`.
About SwiftDtoa v2
Compared to the original SwiftDtoa code, this update is:
Better: The core logic is implemented using only C99 features with 64-bit and smaller integer arithmetic. If available, 128-bit integers are used for better performance. The core routines do not require any floating-point support from the C/C++ standard library and with only minor modifications should be usable on systems with no hardware or software floating-point support at all. This version also has experimental support for IEEE 754 binary128 format, though this support is obviously not included when compiling for the Swift standard library.
Smaller: Code size reduction compared to the earlier versions was a primary goal for this effort. In particular, float80 and binary128 share most of their core code, avoiding a full additional copy of the primary algorithm.
Faster: Even with the code size reductions, all formats are noticeably faster than before. The primary performance gains come from three major changes: ASCII digits are now emitted directly in the core routine in a form that requires only minimal adjustment to produce the final text. The main digit generation produces 2, 4, or 8 digits at a time when possible. Finally, the double logic optimistically produces 7 digits in the initial scaling with a Ryu-inspired backtracking when fewer digits suffice.
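The idea of producing several digits per step can be illustrated with the common two-digits-at-a-time lookup table. This is a generic sketch of that technique and not the SwiftDtoa digit generator; `example_digit_pairs` and `example_format_u32` are hypothetical names.

```c
#include <stdint.h>
#include <string.h>

/* ASCII text for every two-digit value 00..99, two bytes per entry. */
static const char example_digit_pairs[] =
    "00010203040506070809"
    "10111213141516171819"
    "20212223242526272829"
    "30313233343536373839"
    "40414243444546474849"
    "50515253545556575859"
    "60616263646566676869"
    "70717273747576777879"
    "80818283848586878889"
    "90919293949596979899";

/* Writes value as decimal ASCII into out (needs at least 11 bytes,
 * including the terminating NUL) and returns the number of digits. */
size_t example_format_u32(uint32_t value, char *out) {
    char tmp[10];                       /* UINT32_MAX has 10 digits */
    char *p = tmp + sizeof tmp;
    while (value >= 100) {
        uint32_t pair = value % 100;    /* one division yields two digits */
        value /= 100;
        p -= 2;
        memcpy(p, example_digit_pairs + 2 * pair, 2);
    }
    if (value >= 10) {
        p -= 2;
        memcpy(p, example_digit_pairs + 2 * value, 2);
    } else {
        *--p = (char)('0' + value);
    }
    size_t len = (size_t)(tmp + sizeof tmp - p);
    memcpy(out, p, len);
    out[len] = '\0';
    return len;
}
```

Each loop iteration replaces two divide-by-10 steps with one divide-by-100 and a table copy. The same idea extends to 4 or 8 digits per step with larger intermediate values.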
SwiftDtoa's algorithms
SwiftDtoa started out as a variation of Florian Loitsch's Grisu2 that addressed the shortness failures of that algorithm. Specifically, it uses wider arithmetic (128-bit for binary64) and a novel initial interval scaling that together address all of the weaknesses of Grisu2. Subsequent work has incorporated ideas from Errol3, Ryu, and other sources to yield a production-quality implementation that is performance- and size-competitive with current research code.
Those who wish to understand the details can read the extensive comments included in the code. Note that float16 actually uses a different algorithm than the other formats, as the extremely limited range can be handled with much simpler techniques. The float80/binary128 logic sacrifices some performance optimizations in order to minimize the code size for these less-used formats; the goal for SwiftDtoa v2 has been to match the float80 performance of earlier implementations while reducing code size and widening the arithmetic routines sufficiently to support binary128.
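The description says the core logic needs only 64-bit integer arithmetic but uses 128-bit integers when available. A minimal sketch of that pattern, assuming a GCC/Clang-style `unsigned __int128` detected through `__SIZEOF_INT128__`, is shown below. The type `example_u128` and function `example_mul_64x64` are hypothetical and not taken from the SwiftDtoa source.

```c
#include <stdint.h>

/* A 128-bit unsigned value held as two 64-bit halves. */
typedef struct {
    uint64_t high;
    uint64_t low;
} example_u128;

/* Full 64x64 -> 128-bit multiply, with a portable fallback. */
example_u128 example_mul_64x64(uint64_t a, uint64_t b) {
    example_u128 r;
#if defined(__SIZEOF_INT128__)
    /* Fast path: the compiler provides a native 128-bit integer type. */
    unsigned __int128 p = (unsigned __int128)a * b;
    r.high = (uint64_t)(p >> 64);
    r.low = (uint64_t)p;
#else
    /* Portable path: four 32x32 -> 64-bit partial products. */
    uint64_t a_lo = a & 0xFFFFFFFFu, a_hi = a >> 32;
    uint64_t b_lo = b & 0xFFFFFFFFu, b_hi = b >> 32;
    uint64_t ll = a_lo * b_lo;
    uint64_t lh = a_lo * b_hi;
    uint64_t hl = a_hi * b_lo;
    uint64_t hh = a_hi * b_hi;
    /* Sum of the middle column; at most 3 * (2^32 - 1), so it cannot overflow. */
    uint64_t mid = (ll >> 32) + (lh & 0xFFFFFFFFu) + (hl & 0xFFFFFFFFu);
    r.low = (mid << 32) | (ll & 0xFFFFFFFFu);
    r.high = hh + (lh >> 32) + (hl >> 32) + (mid >> 32);
#endif
    return r;
}
```

Both paths give the same result, so callers can stay independent of whether the platform offers a 128-bit type.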
SwiftDtoa Testing
A newly-developed test harness generates several large files of test data that include known-correct results computed with high-precision arithmetic routines. The test files include:
In addition, several billion random values for each format were compared to the results from other implementations. For binary16 and binary32 this provided exhaustive validation of every possible input value.
Code Size and Performance
The tables below summarize the code size and performance for the SwiftDtoa C library module on several different processor architectures, with comparisons to several other implementations. When used from Swift, the `.description` and `.debugDescription` implementations incur additional overhead for creating and returning Swift strings that are not captured here.
The code size tables show the total size in bytes of the compiled and stripped `.o` object files for a particular version of that code. The headings indicate the floating-point formats supported by that particular build (e.g., "16,32" for a version that supports binary16 and binary32 but no other formats).
The performance numbers below were obtained from a custom test harness that generates random bit patterns, interprets them as the corresponding floating-point value, and averages the overall time. For float80, the random bit patterns were generated in a way that avoids generating invalid values.
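The measurement approach described above can be sketched as follows. This is not the author's harness: the xorshift generator, the use of `snprintf` as a stand-in for the formatter under test, and all `example_` names are assumptions made for illustration. It also assumes `double` is IEEE 754 binary64, where every 64-bit pattern is a valid value (finite, infinite, or NaN).

```c
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <time.h>

/* xorshift64: a simple generator chosen for this sketch.
 * The state must be non-zero. */
uint64_t example_next_random(uint64_t *state) {
    uint64_t x = *state;
    x ^= x << 13;
    x ^= x >> 7;
    x ^= x << 17;
    *state = x;
    return x;
}

/* Reinterprets a 64-bit pattern as a double without violating aliasing rules. */
double example_double_from_bits(uint64_t bits) {
    double d;
    memcpy(&d, &bits, sizeof d);
    return d;
}

/* Formats `iterations` random doubles and returns the average CPU time per
 * value in nanoseconds. snprintf stands in for the formatter being measured. */
double example_average_ns(size_t iterations, uint64_t seed) {
    char buf[64];
    clock_t start = clock();
    for (size_t i = 0; i < iterations; i++) {
        double d = example_double_from_bits(example_next_random(&seed));
        snprintf(buf, sizeof buf, "%.17g", d);
    }
    clock_t end = clock();
    return (double)(end - start) / CLOCKS_PER_SEC * 1e9 / (double)iterations;
}
```

A float80 variant would additionally need to reject or repair bit patterns that do not encode valid values, as the description notes.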
All code was compiled with the system C/C++ compiler using `-O2` optimization.
A few notes about particular implementations mentioned below:
`Ryu(size)` version was built with the `RYU_OPTIMIZE_SIZE` option.
`.cpp` file that instantiates the template for the specified formats. The `Dragonbox(size)` version used the `cache::compressed` policy to reduce the size of the lookup tables.
x86_64
These were built using Apple clang 12.0.5 on a 2019 16" MacBook Pro (2.4GHz 8-core Intel Core i9) running macOS 11.1.
Code Size
Bold numbers here indicate the configurations that have shipped as part of the Swift runtime.
Performance
ARM64
These were built using Apple clang 12.0.0 on a 2020 M1 Mac Mini running macOS 11.1.
Code Size
Performance
ARM32
These were built using clang 8.0.1 on a BeagleBone Black (500MHz ARMv7) running FreeBSD 12.1-RELEASE.
Code Size
Performance
EDITED 1/8/2021: Thanks to Junekey Jeon for pointing out that I was not building Ryu or Dragonbox with the correct options to get a smaller table size. I added new rows to some of the tables above with numbers for x86_64 and ARM32.
EDITED 1/8/2021: Turns out I was building some code with `-g` and some without. I've rebuilt everything without `-g` and updated the numbers above.
EDITED 1/20/2021: Updated the benchmark results to track some additional code changes: Binary32 now has a separate core implementation, bringing perf in line with Ryu and Dragonbox, with a corresponding increase in code size. Minor perf/size improvements to binary64, float80, binary128.