-
Notifications
You must be signed in to change notification settings - Fork 37
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[DNM] apd: embed small coefficient values in Decimal struct #101
[DNM] apd: embed small coefficient values in Decimal struct #101
Conversation
They cause the benchmarks to run for a very long time. See golang/go#27217. Adjust the benchmarks to have an explicit setup phase and run phase, separated by `b.ResetTimer`.
Cleans up a bit of code.
This reduces the size of the Decimal struct from 48 bytes to 40 bytes.
This commit introduces a performance optimization that embeds small coefficient values directly in their `Decimal` struct, instead of storing these values in a separate heap allocation. Each `Decimal` maintains (through `big.Int`) an internal reference to a variable-length coefficient, which is represented by a `[]big.Word`. The `coeffInline` field and the `lazyInit` method combine to allow us to inline this coefficient within the `Decimal` struct when its value is sufficiently small. In `lazyInit`, we point the `Coeff` field's backing slice at the `coeffInline` array. `big.Int` will avoid re-allocating this array until it is provided with a value that exceeds the initial capacity. We set this initial capacity to accommodate any coefficient that would fit in a 64-bit integer (i.e. values up to 2^64). This is an alternative to an optimization that many other arbitrary precision decimal libraries have where small coefficient values are stored as numeric fields in their data type's struct. Only when this coefficient value gets large enough do these libraries fall back to a variable-length coefficient with internal indirection. We can see the optimization in practice in the `ericlagergren/decimal` library, where each struct contains a `compact uint64` and an `unscaled big.Int`. Prior concern from the authors of `cockroachdb/apd` regarding this form of compact representation optimization was that it traded performance for complexity. The optimization fractured control flow, leaking out across the library and leading to more complex, error-prone code. The approach taken in this commit does not have the same issue. All arithmetic on the decimal's coefficient is still deferred to `bit.Int`. In fact, the entire optimization is best-effort, and bugs that lead to missed calls to `lazyInit` are merely missed opportunities to avoid a heap allocation, and nothing more serious. The internal reference within Decimal is vulnerable to memory aliasing bugs when a `Decimal` is copied by value (and not through the `Decimal.Set` method), but this is not a new concern. Impact on benchmarks: ``` name old time/op new time/op delta GDA/compare-10 61.5µs ±20% 54.5µs ±18% -11.34% (p=0.035 n=10+10) GDA/reduce-10 18.8µs ± 7% 17.5µs ± 6% -7.16% (p=0.001 n=10+10) GDA/remainder-10 67.0µs ±11% 62.2µs ± 6% -7.09% (p=0.023 n=10+10) Exp/P5/S-4/D-2-10 2.56µs ± 4% 2.51µs ± 4% ~ (p=0.060 n=10+10) Exp/P5/S-4/D2-10 2.65µs ± 4% 2.60µs ± 4% ~ (p=0.060 n=10+10) Exp/P5/S-1/D-2-10 6.44µs ± 5% 6.31µs ± 4% ~ (p=0.060 n=10+10) Exp/P5/S-1/D2-10 6.27µs ± 4% 6.15µs ± 4% ~ (p=0.063 n=10+10) Exp/P5/S2/D-2-10 19.3µs ± 4% 19.1µs ± 4% ~ (p=0.063 n=10+10) Exp/P5/S2/D2-10 19.1µs ± 4% 18.9µs ± 4% ~ (p=0.063 n=10+10) Exp/P10/S-4/D-10-10 10.5µs ± 4% 10.4µs ± 4% ~ (p=0.063 n=10+10) Exp/P10/S-4/D-2-10 5.59µs ± 4% 5.49µs ± 5% ~ (p=0.138 n=10+10) Exp/P10/S-4/D2-10 4.68µs ± 4% 4.62µs ± 5% ~ (p=0.138 n=10+10) Exp/P10/S-4/D10-10 10.5µs ± 4% 10.4µs ± 4% ~ (p=0.143 n=10+10) Exp/P10/S-1/D-10-10 22.2µs ± 4% 22.1µs ± 4% ~ (p=0.143 n=10+10) Exp/P10/S-1/D-2-10 12.2µs ± 4% 12.1µs ± 4% ~ (p=0.143 n=10+10) Exp/P10/S-1/D2-10 12.2µs ± 4% 12.0µs ± 4% ~ (p=0.143 n=10+10) Exp/P10/S-1/D10-10 21.8µs ± 4% 21.7µs ± 4% ~ (p=0.247 n=10+10) Exp/P10/S2/D-10-10 44.0µs ± 4% 44.0µs ± 4% ~ (p=0.165 n=10+10) Exp/P10/S2/D-2-10 31.8µs ± 4% 31.7µs ± 4% ~ (p=0.190 n=10+10) Exp/P10/S2/D2-10 28.7µs ± 4% 28.6µs ± 4% ~ (p=0.139 n=10+10) Exp/P10/S2/D10-10 44.0µs ± 4% 44.0µs ± 4% ~ (p=0.143 n=10+10) Exp/P100/S-4/D-100-10 561µs ± 4% 565µs ± 4% ~ (p=0.971 n=10+10) Exp/P100/S-4/D-10-10 297µs ± 4% 299µs ± 4% ~ (p=0.912 n=10+10) Exp/P100/S-4/D-2-10 286µs ± 4% 289µs ± 4% ~ (p=0.315 n=10+10) Exp/P100/S-4/D2-10 287µs ± 4% 289µs ± 4% ~ (p=0.684 n=10+10) Exp/P100/S-4/D10-10 292µs ± 4% 294µs ± 4% ~ (p=0.529 n=10+10) Exp/P100/S-4/D100-10 550µs ± 4% 554µs ± 4% ~ (p=0.739 n=10+10) Exp/P100/S-1/D-100-10 1.05ms ± 4% 1.05ms ± 4% ~ (p=0.631 n=10+10) Exp/P100/S-1/D-10-10 637µs ± 4% 641µs ± 4% ~ (p=0.739 n=10+10) Exp/P100/S-1/D-2-10 636µs ± 4% 641µs ± 4% ~ (p=0.436 n=10+10) Exp/P100/S-1/D2-10 612µs ± 4% 616µs ± 4% ~ (p=0.631 n=10+10) Exp/P100/S-1/D10-10 621µs ± 4% 625µs ± 4% ~ (p=0.912 n=10+10) Exp/P100/S-1/D100-10 1.07ms ± 3% 1.08ms ± 4% ~ (p=0.971 n=10+10) Exp/P100/S2/D-100-10 1.54ms ± 3% 1.55ms ± 4% ~ (p=0.684 n=10+10) Exp/P100/S2/D-10-10 1.06ms ± 4% 1.06ms ± 4% ~ (p=0.971 n=10+10) Exp/P100/S2/D-2-10 918µs ± 4% 924µs ± 4% ~ (p=0.971 n=10+10) Exp/P100/S2/D2-10 1.01ms ± 4% 1.02ms ± 4% ~ (p=0.853 n=10+10) Exp/P100/S2/D10-10 1.05ms ± 4% 1.06ms ± 4% ~ (p=0.971 n=10+10) Exp/P100/S2/D100-10 1.58ms ± 4% 1.59ms ± 4% ~ (p=0.853 n=10+10) Ln/P2/S-100/D2-10 25.1µs ± 4% 25.0µs ± 4% ~ (p=0.143 n=10+10) Ln/P2/S-10/D2-10 22.6µs ± 4% 22.6µs ± 4% ~ (p=0.143 n=10+10) Ln/P2/S-2/D2-10 24.9µs ± 4% 24.8µs ± 4% ~ (p=0.143 n=10+10) Ln/P2/S2/D2-10 28.1µs ± 4% 27.9µs ± 4% ~ (p=0.139 n=10+10) Ln/P2/S10/D2-10 27.0µs ± 4% 26.9µs ± 4% ~ (p=0.143 n=10+10) Ln/P2/S100/D2-10 28.7µs ± 4% 28.5µs ± 4% ~ (p=0.143 n=10+10) Ln/P10/S-100/D2-10 78.1µs ± 4% 78.1µs ± 4% ~ (p=0.218 n=10+10) Ln/P10/S-100/D10-10 74.3µs ± 4% 74.3µs ± 4% ~ (p=0.190 n=10+10) Ln/P10/S-10/D2-10 83.4µs ± 4% 83.4µs ± 4% ~ (p=0.315 n=10+10) Ln/P10/S-10/D10-10 80.9µs ± 4% 80.9µs ± 4% ~ (p=0.143 n=10+10) Ln/P10/S-2/D2-10 82.1µs ± 4% 82.2µs ± 5% ~ (p=0.280 n=10+10) Ln/P10/S-2/D10-10 90.6µs ± 4% 90.6µs ± 5% ~ (p=0.190 n=10+10) Ln/P10/S2/D2-10 80.6µs ± 4% 80.6µs ± 4% ~ (p=0.184 n=10+10) Ln/P10/S2/D10-10 79.7µs ± 4% 79.5µs ± 4% ~ (p=0.143 n=10+10) Ln/P10/S10/D2-10 85.2µs ± 4% 85.2µs ± 5% ~ (p=0.218 n=10+10) Ln/P10/S10/D10-10 74.8µs ± 4% 74.7µs ± 4% ~ (p=0.247 n=10+10) Ln/P10/S100/D2-10 83.8µs ± 4% 83.6µs ± 4% ~ (p=0.165 n=10+10) Ln/P10/S100/D10-10 66.1µs ± 4% 65.9µs ± 5% ~ (p=0.143 n=10+10) Ln/P100/S-100/D2-10 3.87ms ± 4% 3.90ms ± 5% ~ (p=0.853 n=10+10) Ln/P100/S-100/D10-10 3.34ms ± 4% 3.35ms ± 5% ~ (p=0.529 n=10+10) Ln/P100/S-100/D100-10 3.46ms ± 4% 3.48ms ± 5% ~ (p=0.971 n=10+10) Ln/P100/S-10/D2-10 3.63ms ± 4% 3.65ms ± 4% ~ (p=0.739 n=10+10) Ln/P100/S-10/D10-10 3.43ms ± 4% 3.44ms ± 4% ~ (p=0.796 n=10+10) Ln/P100/S-10/D100-10 3.42ms ± 4% 3.45ms ± 4% ~ (p=0.631 n=10+10) Ln/P100/S-2/D2-10 3.51ms ± 4% 3.53ms ± 4% ~ (p=0.853 n=10+10) Ln/P100/S-2/D10-10 3.21ms ± 4% 3.24ms ± 4% ~ (p=0.631 n=10+10) Ln/P100/S-2/D100-10 3.47ms ± 4% 3.50ms ± 4% ~ (p=0.631 n=10+10) Ln/P100/S2/D2-10 3.56ms ± 4% 3.59ms ± 4% ~ (p=0.796 n=10+10) Ln/P100/S2/D10-10 3.82ms ± 4% 3.84ms ± 4% ~ (p=1.000 n=10+10) Ln/P100/S2/D100-10 3.72ms ± 4% 3.75ms ± 5% ~ (p=0.353 n=10+10) Ln/P100/S10/D2-10 3.78ms ± 4% 3.81ms ± 4% ~ (p=0.631 n=10+10) Ln/P100/S10/D10-10 3.50ms ± 4% 3.52ms ± 4% ~ (p=0.529 n=10+10) Ln/P100/S10/D100-10 3.49ms ± 4% 3.52ms ± 4% ~ (p=0.684 n=10+10) Ln/P100/S100/D2-10 3.82ms ± 4% 3.85ms ± 4% ~ (p=0.393 n=10+10) Ln/P100/S100/D10-10 3.72ms ± 4% 3.75ms ± 4% ~ (p=0.393 n=10+10) Ln/P100/S100/D100-10 3.65ms ± 4% 3.68ms ± 4% ~ (p=0.315 n=10+10) GDA/abs-10 10.4µs ± 6% 10.3µs ± 6% ~ (p=0.564 n=10+10) GDA/base-10 110µs ± 4% 111µs ± 5% ~ (p=0.869 n=10+10) GDA/comparetotal-10 30.2µs ± 5% 30.3µs ± 5% ~ (p=0.631 n=10+10) GDA/divide-10 391µs ± 7% 390µs ± 3% ~ (p=0.853 n=10+10) GDA/exp-10 127ms ± 4% 127ms ± 4% ~ (p=0.796 n=10+10) GDA/ln-10 83.6ms ± 3% 83.6ms ± 4% ~ (p=0.218 n=10+10) GDA/log10-10 106ms ± 4% 105ms ± 4% ~ (p=0.218 n=10+10) GDA/minus-10 11.7µs ± 4% 11.7µs ± 7% ~ (p=0.579 n=10+10) GDA/multiply-10 77.3µs ± 9% 80.8µs ± 9% ~ (p=0.089 n=10+10) GDA/plus-10 42.9µs ± 5% 44.2µs ± 4% ~ (p=0.052 n=10+10) GDA/power-10 218ms ± 3% 218ms ± 3% ~ (p=0.631 n=10+10) GDA/powersqrt-10 450ms ± 2% 448ms ± 2% ~ (p=0.280 n=10+10) GDA/quantize-10 143µs ±12% 128µs ± 8% ~ (p=0.052 n=10+10) GDA/randoms-10 3.05ms ± 5% 3.02ms ± 5% ~ (p=0.143 n=10+10) GDA/rounding-10 659µs ± 4% 658µs ± 4% ~ (p=0.579 n=10+10) GDA/squareroot-10 33.1ms ± 3% 32.7ms ± 3% ~ (p=0.143 n=10+10) GDA/subtract-10 172µs ±11% 170µs ± 8% ~ (p=0.912 n=10+10) GDA/tointegral-10 32.9µs ± 6% 32.2µs ± 5% ~ (p=0.190 n=10+10) GDA/tointegralx-10 33.6µs ± 6% 32.7µs ± 4% ~ (p=0.190 n=10+10) GDA/cuberoot-apd-10 2.08ms ± 3% 2.08ms ± 4% ~ (p=0.165 n=10+10) GDA/add-10 904µs ± 5% 919µs ± 5% +1.65% (p=0.043 n=10+10) GDA/divideint-10 28.7µs ± 4% 29.7µs ± 1% +3.51% (p=0.026 n=9+6) name old alloc/op new alloc/op delta GDA/compare-10 10.1kB ± 0% 7.0kB ± 0% -30.72% (p=0.000 n=10+10) GDA/quantize-10 42.3kB ± 0% 36.5kB ± 0% -13.63% (p=0.000 n=9+10) GDA/divideint-10 6.51kB ± 0% 5.70kB ± 0% -12.41% (p=0.000 n=10+10) GDA/remainder-10 22.4kB ± 0% 20.0kB ± 0% -10.60% (p=0.000 n=10+10) GDA/tointegralx-10 19.2kB ± 0% 17.4kB ± 0% -9.28% (p=0.000 n=10+10) GDA/tointegral-10 19.1kB ± 0% 17.4kB ± 0% -8.83% (p=0.000 n=10+10) GDA/reduce-10 2.22kB ± 0% 2.02kB ± 0% -8.66% (p=0.000 n=10+10) GDA/divide-10 74.3kB ± 0% 70.1kB ± 0% -5.62% (p=0.000 n=10+10) GDA/squareroot-10 8.30MB ± 0% 7.96MB ± 0% -4.08% (p=0.000 n=10+10) Exp/P5/S-4/D-2-10 1.23kB ± 0% 1.18kB ± 0% -3.92% (p=0.000 n=10+10) Exp/P10/S-1/D-2-10 4.50kB ± 0% 4.33kB ± 0% -3.75% (p=0.000 n=10+10) Exp/P5/S-1/D-2-10 2.83kB ± 0% 2.72kB ± 0% -3.75% (p=0.000 n=10+10) Exp/P5/S-4/D2-10 1.29kB ± 0% 1.24kB ± 0% -3.72% (p=0.000 n=10+10) Exp/P10/S-4/D2-10 2.02kB ± 0% 1.94kB ± 0% -3.71% (p=0.000 n=10+10) Exp/P10/S2/D-2-10 9.59kB ± 0% 9.24kB ± 0% -3.66% (p=0.000 n=10+10) Exp/P5/S2/D-2-10 7.53kB ± 0% 7.26kB ± 0% -3.65% (p=0.000 n=10+10) Exp/P5/S2/D2-10 7.76kB ± 0% 7.48kB ± 0% -3.65% (p=0.000 n=10+10) Exp/P10/S2/D2-10 9.75kB ± 0% 9.40kB ± 0% -3.60% (p=0.000 n=10+10) Exp/P10/S-4/D-2-10 2.32kB ± 0% 2.24kB ± 0% -3.58% (p=0.000 n=10+10) GDA/rounding-10 246kB ± 0% 237kB ± 0% -3.56% (p=0.000 n=10+10) GDA/randoms-10 1.16MB ± 0% 1.12MB ± 0% -3.55% (p=0.000 n=10+10) Exp/P10/S2/D-10-10 9.79kB ± 0% 9.44kB ± 0% -3.54% (p=0.000 n=10+10) Exp/P10/S2/D10-10 10.7kB ± 0% 10.3kB ± 0% -3.53% (p=0.000 n=10+10) Exp/P5/S-1/D2-10 2.87kB ± 0% 2.77kB ± 0% -3.52% (p=0.000 n=10+10) Exp/P10/S-1/D-10-10 4.95kB ± 0% 4.78kB ± 0% -3.47% (p=0.000 n=10+10) Exp/P10/S-1/D2-10 4.64kB ± 0% 4.49kB ± 0% -3.40% (p=0.000 n=10+10) GDA/cuberoot-apd-10 261kB ± 0% 252kB ± 0% -3.36% (p=0.000 n=10+10) Exp/P10/S-1/D10-10 5.17kB ± 0% 5.00kB ± 0% -3.33% (p=0.000 n=10+10) GDA/powersqrt-10 77.6MB ± 0% 75.1MB ± 0% -3.29% (p=0.000 n=10+10) Ln/P10/S-2/D2-10 17.1kB ± 0% 16.5kB ± 0% -3.26% (p=0.000 n=9+10) Ln/P10/S-2/D10-10 18.6kB ± 0% 18.0kB ± 0% -3.24% (p=0.000 n=9+8) Exp/P10/S-4/D-10-10 2.56kB ± 0% 2.48kB ± 0% -3.24% (p=0.000 n=10+10) Ln/P10/S2/D2-10 17.0kB ± 0% 16.5kB ± 0% -3.23% (p=0.000 n=7+10) Ln/P10/S10/D10-10 15.8kB ± 0% 15.3kB ± 0% -3.23% (p=0.000 n=10+10) Ln/P10/S10/D2-10 18.0kB ± 0% 17.4kB ± 0% -3.22% (p=0.000 n=10+10) Ln/P10/S-10/D2-10 17.0kB ± 0% 16.4kB ± 0% -3.22% (p=0.000 n=8+10) Ln/P10/S2/D10-10 16.5kB ± 0% 16.0kB ± 0% -3.21% (p=0.000 n=10+10) Ln/P10/S-10/D10-10 17.1kB ± 0% 16.5kB ± 0% -3.18% (p=0.000 n=10+7) Exp/P10/S-4/D10-10 2.61kB ± 0% 2.52kB ± 0% -3.18% (p=0.000 n=10+10) Ln/P2/S10/D2-10 10.3kB ± 0% 9.9kB ± 0% -3.15% (p=0.000 n=10+10) Ln/P10/S-100/D2-10 17.1kB ± 0% 16.6kB ± 0% -3.11% (p=0.000 n=10+10) Ln/P10/S100/D2-10 17.8kB ± 0% 17.2kB ± 0% -3.11% (p=0.000 n=8+10) Ln/P2/S2/D2-10 10.3kB ± 0% 9.9kB ± 0% -3.10% (p=0.000 n=10+9) Ln/P2/S-2/D2-10 9.62kB ± 0% 9.32kB ± 0% -3.09% (p=0.000 n=10+10) Ln/P10/S100/D10-10 15.0kB ± 0% 14.5kB ± 0% -3.09% (p=0.000 n=9+10) Ln/P10/S-100/D10-10 16.3kB ± 0% 15.8kB ± 0% -3.08% (p=0.000 n=10+10) Ln/P2/S-10/D2-10 8.91kB ± 0% 8.64kB ± 0% -3.07% (p=0.000 n=10+9) GDA/subtract-10 100kB ± 0% 97kB ± 0% -2.89% (p=0.000 n=10+10) Ln/P2/S100/D2-10 10.9kB ± 0% 10.6kB ± 0% -2.86% (p=0.000 n=10+10) Ln/P2/S-100/D2-10 9.81kB ± 0% 9.53kB ± 0% -2.83% (p=0.000 n=10+10) GDA/ln-10 10.2MB ± 0% 10.0MB ± 0% -2.09% (p=0.000 n=10+10) GDA/log10-10 12.5MB ± 0% 12.3MB ± 0% -1.87% (p=0.000 n=10+10) GDA/power-10 27.0MB ± 0% 26.5MB ± 0% -1.78% (p=0.000 n=10+10) Exp/P100/S-4/D10-10 31.9kB ± 0% 31.4kB ± 0% -1.46% (p=0.000 n=10+10) Exp/P100/S-4/D-10-10 33.5kB ± 0% 33.0kB ± 0% -1.41% (p=0.000 n=10+10) Exp/P100/S-4/D2-10 31.8kB ± 0% 31.4kB ± 0% -1.40% (p=0.000 n=10+9) Exp/P100/S-4/D-2-10 34.6kB ± 0% 34.1kB ± 0% -1.36% (p=0.000 n=10+9) Exp/P100/S-1/D10-10 65.0kB ± 0% 64.2kB ± 0% -1.27% (p=0.000 n=8+10) Exp/P100/S-1/D2-10 64.9kB ± 0% 64.1kB ± 0% -1.24% (p=0.000 n=10+10) Exp/P100/S-1/D-10-10 70.1kB ± 0% 69.3kB ± 0% -1.19% (p=0.000 n=10+10) Exp/P100/S-1/D-2-10 72.6kB ± 0% 71.8kB ± 0% -1.14% (p=0.000 n=9+9) Exp/P100/S2/D10-10 115kB ± 0% 114kB ± 0% -1.05% (p=0.000 n=10+10) Exp/P100/S2/D2-10 113kB ± 0% 112kB ± 0% -1.01% (p=0.000 n=10+10) Exp/P100/S2/D-2-10 107kB ± 0% 106kB ± 0% -1.00% (p=0.000 n=10+10) Exp/P100/S2/D-10-10 117kB ± 0% 116kB ± 0% -0.96% (p=0.000 n=10+10) Ln/P100/S-10/D10-10 318kB ± 0% 315kB ± 0% -0.83% (p=0.000 n=10+8) Ln/P100/S-100/D100-10 328kB ± 0% 326kB ± 0% -0.79% (p=0.000 n=10+10) Ln/P100/S-10/D2-10 336kB ± 0% 334kB ± 0% -0.79% (p=0.000 n=10+9) Ln/P100/S-2/D2-10 326kB ± 0% 324kB ± 0% -0.79% (p=0.000 n=10+10) Ln/P100/S100/D2-10 353kB ± 0% 350kB ± 0% -0.79% (p=0.000 n=10+9) Ln/P100/S10/D10-10 326kB ± 0% 324kB ± 0% -0.78% (p=0.000 n=10+10) Ln/P100/S100/D100-10 340kB ± 0% 338kB ± 0% -0.77% (p=0.000 n=10+10) Ln/P100/S2/D10-10 351kB ± 0% 348kB ± 0% -0.77% (p=0.000 n=10+10) Ln/P100/S-10/D100-10 323kB ± 0% 320kB ± 0% -0.77% (p=0.000 n=10+10) Ln/P100/S2/D100-10 344kB ± 0% 341kB ± 0% -0.77% (p=0.000 n=9+6) Ln/P100/S-100/D10-10 311kB ± 0% 309kB ± 0% -0.76% (p=0.000 n=10+7) Ln/P100/S10/D100-10 323kB ± 0% 321kB ± 0% -0.76% (p=0.000 n=8+9) Ln/P100/S-100/D2-10 362kB ± 0% 360kB ± 0% -0.75% (p=0.000 n=10+10) Ln/P100/S100/D10-10 345kB ± 0% 342kB ± 0% -0.74% (p=0.000 n=9+8) Exp/P100/S-1/D100-10 94.0kB ± 0% 93.3kB ± 0% -0.74% (p=0.000 n=8+10) Exp/P100/S-4/D100-10 47.9kB ± 0% 47.5kB ± 0% -0.74% (p=0.000 n=6+9) Ln/P100/S2/D2-10 333kB ± 0% 330kB ± 0% -0.74% (p=0.000 n=10+10) Exp/P100/S-1/D-100-10 97.5kB ± 0% 96.8kB ± 0% -0.70% (p=0.000 n=10+10) Exp/P100/S-4/D-100-10 51.5kB ± 0% 51.2kB ± 0% -0.70% (p=0.000 n=10+10) Ln/P100/S10/D2-10 346kB ± 0% 343kB ± 0% -0.69% (p=0.000 n=8+10) Ln/P100/S-2/D10-10 306kB ± 0% 304kB ± 0% -0.68% (p=0.000 n=10+10) Ln/P100/S-2/D100-10 329kB ± 0% 326kB ± 0% -0.67% (p=0.000 n=10+8) Exp/P100/S2/D100-10 150kB ± 0% 149kB ± 0% -0.67% (p=0.000 n=10+10) Exp/P100/S2/D-100-10 148kB ± 0% 147kB ± 0% -0.62% (p=0.000 n=9+10) GDA/exp-10 61.3MB ± 0% 61.1MB ± 0% -0.18% (p=0.000 n=10+10) GDA/base-10 24.4kB ± 0% 24.4kB ± 0% ~ (all equal) GDA/comparetotal-10 7.11kB ± 0% 7.11kB ± 0% ~ (all equal) GDA/add-10 712kB ± 0% 720kB ± 0% +1.18% (p=0.000 n=10+8) GDA/plus-10 46.6kB ± 0% 47.5kB ± 0% +2.04% (p=0.000 n=10+9) GDA/multiply-10 35.5kB ± 0% 39.2kB ± 0% +10.32% (p=0.000 n=10+10) GDA/minus-10 2.43kB ± 0% 2.82kB ± 0% +16.12% (p=0.000 n=10+10) GDA/abs-10 2.33kB ± 0% 2.82kB ± 0% +21.31% (p=0.000 n=10+10) name old allocs/op new allocs/op delta GDA/reduce-10 187 ± 0% 43 ± 0% -77.01% (p=0.000 n=10+10) GDA/compare-10 638 ± 0% 250 ± 0% -60.82% (p=0.000 n=10+10) GDA/minus-10 164 ± 0% 73 ± 0% -55.49% (p=0.000 n=10+10) GDA/abs-10 151 ± 0% 73 ± 0% -51.66% (p=0.000 n=10+10) GDA/quantize-10 1.70k ± 0% 0.98k ± 0% -42.30% (p=0.000 n=10+10) GDA/tointegralx-10 677 ± 0% 454 ± 0% -32.94% (p=0.000 n=10+10) GDA/remainder-10 923 ± 0% 626 ± 0% -32.18% (p=0.000 n=10+10) GDA/tointegral-10 665 ± 0% 454 ± 0% -31.73% (p=0.000 n=10+10) GDA/divideint-10 357 ± 0% 256 ± 0% -28.29% (p=0.000 n=10+10) GDA/subtract-10 2.85k ± 0% 2.14k ± 0% -24.89% (p=0.000 n=10+10) GDA/divide-10 2.51k ± 0% 1.91k ± 0% -24.01% (p=0.000 n=10+10) Exp/P5/S-4/D-2-10 47.0 ± 0% 37.0 ± 0% -21.28% (p=0.000 n=10+10) Exp/P5/S-4/D2-10 50.0 ± 0% 40.0 ± 0% -20.00% (p=0.000 n=10+10) Exp/P10/S-4/D2-10 76.0 ± 0% 62.0 ± 0% -18.42% (p=0.000 n=10+10) GDA/rounding-10 9.20k ± 0% 7.59k ± 0% -17.52% (p=0.000 n=10+10) Exp/P10/S-4/D-2-10 87.0 ± 0% 72.0 ± 0% -17.24% (p=0.000 n=10+10) Exp/P10/S-4/D10-10 95.0 ± 0% 79.0 ± 0% -16.84% (p=0.000 n=10+10) GDA/randoms-10 40.4k ± 0% 33.6k ± 0% -16.82% (p=0.000 n=10+10) Exp/P5/S-1/D-2-10 104 ± 0% 87 ± 0% -16.35% (p=0.000 n=10+10) Exp/P5/S-1/D2-10 105 ± 0% 88 ± 0% -16.19% (p=0.000 n=10+10) Exp/P10/S-4/D-10-10 93.0 ± 0% 78.0 ± 0% -16.13% (p=0.000 n=10+10) GDA/squareroot-10 275k ± 0% 230k ± 0% -16.08% (p=0.000 n=10+10) Exp/P10/S-1/D-10-10 174 ± 0% 147 ± 0% -15.52% (p=0.000 n=10+10) GDA/plus-10 587 ± 0% 496 ± 0% -15.50% (p=0.000 n=10+10) Exp/P10/S-1/D-2-10 163 ± 0% 138 ± 0% -15.34% (p=0.000 n=10+10) Exp/P10/S-1/D2-10 166 ± 0% 141 ± 0% -15.06% (p=0.000 n=10+10) Exp/P10/S-1/D10-10 182 ± 0% 155 ± 0% -14.84% (p=0.000 n=10+10) Exp/P10/S2/D-10-10 335 ± 0% 286 ± 0% -14.63% (p=0.000 n=10+10) Ln/P2/S-2/D2-10 342 ± 0% 292 ± 0% -14.62% (p=0.000 n=10+10) Ln/P2/S2/D2-10 363 ± 0% 310 ± 0% -14.60% (p=0.000 n=10+10) Ln/P2/S10/D2-10 366 ± 0% 313 ± 0% -14.48% (p=0.000 n=10+10) Exp/P10/S2/D-2-10 335 ± 0% 287 ± 0% -14.33% (p=0.000 n=10+10) GDA/cuberoot-apd-10 7.67k ± 0% 6.58k ± 0% -14.27% (p=0.000 n=10+10) Ln/P10/S-2/D2-10 582 ± 0% 499 ± 0% -14.26% (p=0.000 n=10+10) Ln/P10/S2/D2-10 582 ± 0% 499 ± 0% -14.26% (p=0.000 n=10+10) Ln/P10/S10/D10-10 541 ± 0% 464 ± 0% -14.23% (p=0.000 n=10+10) Ln/P10/S2/D10-10 563 ± 0% 483 ± 0% -14.21% (p=0.000 n=10+10) Ln/P2/S-10/D2-10 317 ± 0% 272 ± 0% -14.20% (p=0.000 n=10+10) GDA/powersqrt-10 2.63M ± 0% 2.26M ± 0% -14.17% (p=0.000 n=10+10) Ln/P10/S-2/D10-10 637 ± 0% 547 ± 0% -14.13% (p=0.000 n=10+10) Ln/P10/S-10/D2-10 581 ± 0% 499 ± 0% -14.11% (p=0.000 n=10+10) Exp/P10/S2/D10-10 369 ± 0% 317 ± 0% -14.09% (p=0.000 n=10+10) Exp/P10/S2/D2-10 342 ± 0% 294 ± 0% -14.04% (p=0.000 n=10+10) Ln/P10/S10/D2-10 613 ± 0% 527 ± 0% -14.03% (p=0.000 n=9+10) Ln/P10/S-10/D10-10 581 ± 0% 500 ± 0% -13.99% (p=0.000 n=10+10) Exp/P5/S2/D-2-10 272 ± 0% 234 ± 0% -13.97% (p=0.000 n=10+10) Ln/P10/S-100/D2-10 575 ± 0% 495 ± 0% -13.91% (p=0.000 n=10+10) Ln/P10/S100/D2-10 597 ± 0% 514 ± 0% -13.90% (p=0.000 n=10+10) Ln/P10/S100/D10-10 504 ± 0% 434 ± 0% -13.89% (p=0.000 n=10+10) Exp/P5/S2/D2-10 281 ± 0% 242 ± 0% -13.88% (p=0.000 n=10+10) Ln/P10/S-100/D10-10 549 ± 0% 473 ± 0% -13.84% (p=0.000 n=10+10) Ln/P2/S-100/D2-10 336 ± 0% 290 ± 0% -13.69% (p=0.000 n=10+10) Ln/P2/S100/D2-10 376 ± 0% 325 ± 0% -13.56% (p=0.000 n=10+10) GDA/ln-10 269k ± 0% 239k ± 0% -10.88% (p=0.000 n=10+10) GDA/log10-10 326k ± 0% 292k ± 0% -10.44% (p=0.000 n=10+10) GDA/power-10 699k ± 0% 627k ± 0% -10.30% (p=0.000 n=10+10) GDA/add-10 13.2k ± 0% 11.9k ± 0% -9.68% (p=0.000 n=10+10) Exp/P100/S-4/D10-10 701 ± 0% 638 ± 0% -8.99% (p=0.000 n=10+10) Exp/P100/S-4/D-10-10 715 ± 0% 651 ± 0% -8.95% (p=0.000 n=10+10) Exp/P100/S-4/D2-10 686 ± 0% 626 ± 0% -8.75% (p=0.000 n=10+10) Exp/P100/S-4/D-2-10 723 ± 0% 661 ± 0% -8.58% (p=0.000 n=10+10) Exp/P100/S-1/D10-10 1.36k ± 0% 1.26k ± 0% -7.99% (p=0.000 n=10+10) Exp/P100/S-1/D2-10 1.35k ± 0% 1.24k ± 0% -7.72% (p=0.000 n=9+10) Exp/P100/S-1/D-10-10 1.40k ± 0% 1.29k ± 0% -7.70% (p=0.000 n=10+10) Exp/P100/S-1/D-2-10 1.43k ± 0% 1.32k ± 0% -7.48% (p=0.000 n=9+10) Exp/P100/S2/D-2-10 2.03k ± 0% 1.89k ± 0% -6.82% (p=0.000 n=9+10) Exp/P100/S2/D10-10 2.28k ± 0% 2.13k ± 0% -6.81% (p=0.000 n=10+9) Exp/P100/S2/D-10-10 2.19k ± 0% 2.05k ± 0% -6.66% (p=0.001 n=8+9) Exp/P100/S2/D2-10 2.21k ± 0% 2.06k ± 0% -6.61% (p=0.000 n=9+10) Ln/P100/S-10/D2-10 6.00k ± 0% 5.65k ± 0% -5.87% (p=0.000 n=10+9) Ln/P100/S100/D2-10 6.25k ± 0% 5.89k ± 0% -5.80% (p=0.000 n=10+9) Ln/P100/S-2/D2-10 5.79k ± 0% 5.46k ± 0% -5.78% (p=0.000 n=10+10) Ln/P100/S100/D100-10 6.04k ± 0% 5.69k ± 0% -5.76% (p=0.000 n=10+10) Ln/P100/S2/D10-10 6.21k ± 0% 5.86k ± 0% -5.76% (p=0.000 n=10+10) Ln/P100/S2/D100-10 6.09k ± 0% 5.74k ± 0% -5.73% (p=0.000 n=10+8) Exp/P100/S-4/D100-10 855 ± 0% 806 ± 0% -5.73% (p=0.000 n=10+10) Ln/P100/S-10/D10-10 5.62k ± 0% 5.30k ± 0% -5.72% (p=0.000 n=10+8) Ln/P100/S-100/D2-10 6.34k ± 0% 5.98k ± 0% -5.68% (p=0.000 n=10+10) Ln/P100/S10/D100-10 5.73k ± 0% 5.41k ± 0% -5.67% (p=0.000 n=10+10) Ln/P100/S100/D10-10 6.08k ± 0% 5.74k ± 0% -5.66% (p=0.000 n=9+8) Ln/P100/S10/D2-10 6.13k ± 0% 5.78k ± 0% -5.66% (p=0.000 n=8+10) Ln/P100/S-10/D100-10 5.69k ± 0% 5.37k ± 0% -5.64% (p=0.000 n=10+10) Ln/P100/S10/D10-10 5.74k ± 0% 5.41k ± 0% -5.64% (p=0.000 n=10+10) Ln/P100/S-100/D10-10 5.55k ± 0% 5.23k ± 0% -5.63% (p=0.000 n=10+7) Ln/P100/S2/D2-10 5.86k ± 0% 5.53k ± 0% -5.62% (p=0.000 n=10+10) Exp/P100/S-4/D-100-10 893 ± 0% 843 ± 0% -5.60% (p=0.000 n=10+10) Ln/P100/S-100/D100-10 5.73k ± 0% 5.41k ± 0% -5.54% (p=0.000 n=10+10) Exp/P100/S-1/D100-10 1.68k ± 0% 1.58k ± 0% -5.52% (p=0.000 n=8+10) Ln/P100/S-2/D10-10 5.40k ± 0% 5.10k ± 0% -5.49% (p=0.000 n=10+10) Ln/P100/S-2/D100-10 5.77k ± 0% 5.45k ± 0% -5.49% (p=0.000 n=10+10) Exp/P100/S-1/D-100-10 1.68k ± 0% 1.59k ± 0% -5.41% (p=0.000 n=10+10) Exp/P100/S2/D100-10 2.61k ± 0% 2.48k ± 0% -5.03% (p=0.000 n=8+10) Exp/P100/S2/D-100-10 2.47k ± 0% 2.35k ± 0% -4.90% (p=0.000 n=10+10) GDA/multiply-10 931 ± 0% 889 ± 0% -4.51% (p=0.000 n=10+10) GDA/exp-10 911k ± 0% 895k ± 0% -1.75% (p=0.000 n=10+10) GDA/base-10 2.11k ± 0% 2.11k ± 0% ~ (all equal) GDA/comparetotal-10 254 ± 0% 254 ± 0% ~ (all equal) ```
I think we can simplify this even further by creating a wrapper around const inlineWords = 1 // or maybe 2
type BigInt struct {
inner big.Int
inline [inlineWords]big.Word
}
func (b *BigInt) lazyInit() {
if b.inner.Bits() == nil {
b.inline = [inlineWords]big.Word{}
b.inner.SetBits(b.inline[:0])
}
}
func (b *BigInt) Add(x, y *BigInt) *BigInt {
b.lazyInit()
b.inner.Add(&x.inner, &y.inner)
return b
}
...
type Decimal struct {
Form Form
Negative bool
Exponent int32
Coeff BigInt
} We could then re-implement the This |
Replaces cockroachdb/cockroach#74369.
This commit introduces a performance optimization that embeds small coefficient values directly in their
Decimal
struct, instead of storing these values in a separate heap allocation.Each
Decimal
maintains (throughbig.Int
) an internal reference to a variable-length coefficient, which is represented by a[]big.Word
. ThecoeffInline
field and thelazyInit
method combine to allow us to inline this coefficient within theDecimal
struct when its value is sufficiently small. InlazyInit
, we point theCoeff
field's backing slice at thecoeffInline
array.big.Int
will avoid re-allocating this array until it is provided with a value that exceeds the initial capacity. We set this initial capacity to accommodate any coefficient that would fit in a 64-bit integer (i.e. values up to 2^64).This is an alternative to an optimization that many other arbitrary precision decimal libraries have where small coefficient values are stored as numeric fields in their data type's struct. Only when this coefficient value gets large enough do these libraries fall back to a variable-length coefficient with internal indirection. We can see the optimization in practice in the
ericlagergren/decimal
library, where each struct contains acompact uint64
and anunscaled big.Int
. Prior concern from the authors ofcockroachdb/apd
regarding this form of compact representation optimization was that it traded performance for complexity. The optimization fractured control flow, leaking out across the library and leading to more complex, error-prone code.The approach taken in this commit does not have the same issue. All arithmetic on the decimal's coefficient is still deferred to
bit.Int
. In fact, the entire optimization is best-effort, and bugs that lead to missed calls tolazyInit
are merely missed opportunities to avoid a heap allocation, and nothing more serious. The internal reference within Decimal is vulnerable to memory aliasing bugs when aDecimal
is copied by value (and not through theDecimal.Set
method), but this is not a new concern.Impact on benchmarks:
cc. @mjibson