Performance compare with native dragonbox::to_chars #3675

Open
zhiqiang-hhhh opened this issue Oct 9, 2023 · 7 comments

@zhiqiang-hhhh

Hello, I am using dragonbox::to_chars to convert floating-point numbers to strings, and I am trying to replace dragonbox with lib fmt 10.x, since it already integrates dragonbox and offers more control over the output format.

But according to my simple benchmark, dragonbox is almost 1.7x faster than lib fmt at floating-point-to-string conversion.

#include <gtest/gtest.h>
#include <fmt/format.h>
#include <dragonbox/dragonbox_to_chars.h>
#include <random>
#include <vector>

class PerformanceTest : public ::testing::Test {
protected:
    void SetUp() override {
        std::random_device rd;
        std::mt19937 gen(rd());
        std::uniform_real_distribution<double> dis_double(-100000.0, +100000.0);
        std::uniform_real_distribution<float> dis_float(-100000.0, +100000.0);
        
        for (int i = 0; i < 100000000; ++i) {
            values_double.push_back(dis_double(gen));
            values_float.push_back(dis_float(gen));
        }
    }

    void TearDown() override {
    }

    std::vector<double> values_double;
    std::vector<float> values_float;
};

TEST_F(PerformanceTest, FmtPerformanceDouble) {
    char buffer[20];

    for (const auto& value : values_double) {
        auto res = fmt::format_to(buffer, "{}", value);
        *res = '\0';
    }
}

TEST_F(PerformanceTest, DragonboxPerformanceDouble) {
    char buffer[20];

    for (const auto& value : values_double) {
        jkj::dragonbox::to_chars(value, buffer);
    }
}


TEST_F(PerformanceTest, FmtPerformanceFloat) {
    char buffer[20];

    for (const auto& value : values_float) {
        auto res = fmt::format_to(buffer, "{}", value);
        *res = '\0';
    }
}

TEST_F(PerformanceTest, DragonboxPerformanceFloat) {
    char buffer[20];

    for (const auto& value : values_double) {
        jkj::dragonbox::to_chars(value, buffer);
    }
}

int main(int argc, char** argv) {
    ::testing::InitGoogleTest(&argc, argv);
    return RUN_ALL_TESTS();
}

Built in release mode, result:

[==========] Running 4 tests from 1 test suite.
[----------] Global test environment set-up.
[----------] 4 tests from PerformanceTest
[ RUN      ] PerformanceTest.FmtPerformanceDouble
[       OK ] PerformanceTest.FmtPerformanceDouble (11391 ms)
[ RUN      ] PerformanceTest.DragonboxPerformanceDouble
[       OK ] PerformanceTest.DragonboxPerformanceDouble (6636 ms)
[ RUN      ] PerformanceTest.FmtPerformanceFloat
[       OK ] PerformanceTest.FmtPerformanceFloat (10185 ms)
[ RUN      ] PerformanceTest.DragonboxPerformanceFloat
[       OK ] PerformanceTest.DragonboxPerformanceFloat (6649 ms)
[----------] 4 tests from PerformanceTest (34864 ms total)

[----------] Global test environment tear-down
[==========] 4 tests from 1 test suite ran. (34864 ms total)
[  PASSED  ] 4 tests.

As far as I understand, the floating-point-to-decimal conversion itself should take about the same time now that lib fmt has integrated Dragonbox, so is output formatting control the determining factor in the performance difference here?

@vitaut
Contributor

vitaut commented Oct 9, 2023

It is expected that the (default) runtime formatting will be slightly slower than calling Dragonbox directly because the formatting function has to do some extra work. This overhead can be reduced by using format string compilation: https://fmt.dev/latest/api.html#compile-api. It's hard to say anything more specific and numbers don't look particularly meaningful because your test is quite broken: {fmt} cases do additional nul termination, there is stack corruption and you are using gtest instead of a proper benchmark. I recommend looking at an existing benchmark, e.g. https://github.com/miloyip/dtoa-benchmark. Also keep in mind that {fmt} uses compact Dragonbox tables by default so if you want maximum perf at the cost of binary size you could switch to larger tables.
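The measurement issues listed above (an undersized buffer, extra work in only one of the variants, and no real timing harness) can be avoided with a small harness like the sketch below. This is only an illustration under stated assumptions: `std::to_chars` stands in for the conversion being measured, and the names `shortest_length`, `Timing` and `time_conversion` are hypothetical, not part of {fmt}, Dragonbox, or dtoa-benchmark.

```cpp
#include <cassert>
#include <charconv>
#include <chrono>
#include <cstddef>
#include <system_error>
#include <vector>

// Stand-in conversion (an assumption, not Dragonbox or {fmt}): writes the
// shortest round-trip form of `value` and returns the number of characters.
inline std::size_t shortest_length(double value) {
    char buffer[32];  // 24 characters are enough for any double; 32 adds headroom
    auto result = std::to_chars(buffer, buffer + sizeof(buffer), value);
    assert(result.ec == std::errc{});
    return static_cast<std::size_t>(result.ptr - buffer);
}

// Hypothetical result type: average cost per call plus the total output
// length, which the caller can inspect so the loop cannot be optimized away.
struct Timing {
    double ns_per_call;
    std::size_t total_chars;
};

// Times `convert` over pre-generated inputs. Every variant under test should
// be passed through the same function so that each one does identical
// bookkeeping around the conversion itself.
template <typename Convert>
Timing time_conversion(const std::vector<double>& values, Convert convert) {
    std::size_t total = 0;
    const auto start = std::chrono::steady_clock::now();
    for (double v : values) total += convert(v);
    const auto stop = std::chrono::steady_clock::now();
    const double ns =
        std::chrono::duration<double, std::nano>(stop - start).count();
    return {values.empty() ? 0.0 : ns / static_cast<double>(values.size()),
            total};
}
```

To compare two libraries, one would wrap each call in a callable with the same signature as `shortest_length` and pass both through `time_conversion` with the same input vector, ideally repeating the run several times and taking the minimum.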

@vitaut vitaut closed this as completed Oct 9, 2023
@vitaut vitaut added the question label Oct 9, 2023
@zhiqiang-hhhh
Author

I tested again with the optimizations and the benchmark tool you mentioned, but the result still seems to be about 2x slower than dragonbox.

Verifying doubleconv... OK. Length Avg = 22.426, Max = 25
Verifying dragonbox... OK. Length Avg = 22.027, Max = 24
Verifying dragonbox_comp... OK. Length Avg = 22.027, Max = 24
Verifying fmt... OK. Length Avg = 22.445, Max = 24
Verifying fmt_full_cache_test... OK. Length Avg = 22.445, Max = 24
Verifying ostringstream... OK. Length Avg = 22.940, Max = 24
Verifying ostrstream... OK. Length Avg = 22.940, Max = 24
Verifying sprintf... OK. Length Avg = 22.940, Max = 24
Benchmarking randomdigit doubleconv... Done
Benchmarking randomdigit dragonbox... Done
Benchmarking randomdigit dragonbox_comp... Done
Benchmarking randomdigit fmt... Done
Benchmarking randomdigit fmt_full_cache_test... Done
Benchmarking randomdigit null... Done
Benchmarking randomdigit ostringstream... Done
Benchmarking randomdigit ostrstream... Done
Benchmarking randomdigit sprintf... Done
Function            |  Min ns |   RMS ns |  Max ns |  Sum ns | Speedup |
:-------------------|--------:|---------:|--------:|--------:|--------:|
null                |     1.6 |    1.600 |     1.6 |    27.2 | ×597.4  |
dragonbox           |    28.4 |   30.379 |    33.6 |   515.9 | ×31.5   |
dragonbox_comp      |    34.4 |   36.937 |    41.9 |   627.3 | ×25.9   |
fmt_full_cache_test |    53.4 |   59.377 |    68.7 |  1007.5 | ×16.1   |
fmt                 |    53.5 |   59.513 |    68.1 |  1010.0 | ×16.1   |
doubleconv          |    82.9 |  129.439 |   168.7 |  2170.8 | ×7.5    |
sprintf             |   868.0 |  957.211 |  1028.4 | 16249.7 | ×1.0    |
ostrstream          |  1197.1 | 1285.831 |  1357.9 | 21841.8 | ×0.7    |
ostringstream       |  1279.8 | 1377.940 |  1462.4 | 23401.2 | ×0.7    |

Appending the null terminator is necessary for correctness.
fmt_full_cache_test
fmttest
dragonbox

@zhiqiang-hhhh
Author

@vitaut

@vitaut
Contributor

vitaut commented Oct 10, 2023

Will need to look in more detail, but one surprising thing is that the full and compact cache results are identical.

@vitaut
Contributor

vitaut commented Oct 14, 2023

So I looked in more detail, and one obvious problem with the new benchmark is an ODR violation: you are trying to use {fmt} compiled with different configurations in different TUs. This is UB. If you correctly enable the full Dragonbox cache with

cmake -DCMAKE_BUILD_TYPE=Release -DCMAKE_CXX_FLAGS=-DFMT_USE_FULL_CACHE_DRAGONBOX=1 .

you'll get a noticeable speedup from

fmt           |    40.1 |   42.825 |    46.9 |     727.4 | ×17.3   |

to

fmt           |    34.0 |   36.018 |    39.3 |     611.8 | ×20.5   |

on my system.

It is still not as fast as calling Dragonbox directly which is worth investigating further.

@vitaut
Contributor

vitaut commented Oct 14, 2023

Looking at the CPU profile ~40% of time is spent postprocessing and writing the output in do_write_float:

[Image: CPU profile screenshot]

Some of that is inevitable since we have to deal with all the formatting options but it could probably be improved for the common case.
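To make the kind of work involved more concrete, here is a rough sketch of a post-processing step that turns a decimal significand and exponent (the form a shortest-digits algorithm such as Dragonbox produces) into text. It is not {fmt}'s `do_write_float`: the function name `write_decimal`, the notation thresholds, and the use of `std::string` are all assumptions made for illustration, and it ignores precision, width, locale, and every other format option.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>

// Hypothetical, simplified post-processing: formats the value
// significand * 10^exponent. The thresholds below are arbitrary.
std::string write_decimal(std::uint64_t significand, int exponent,
                          bool negative) {
    const std::string digits = std::to_string(significand);
    const int num_digits = static_cast<int>(digits.size());
    // Number of digits that sit in front of the decimal point.
    const int point_pos = num_digits + exponent;
    std::string out = negative ? "-" : "";

    if (point_pos > 16 || point_pos < -3) {
        // Exponential notation: d.ddde+XX
        out += digits[0];
        if (num_digits > 1) {
            out += '.';
            out.append(digits, 1, std::string::npos);
        }
        const int e = point_pos - 1;
        const int abs_e = e < 0 ? -e : e;
        out += 'e';
        out += e < 0 ? '-' : '+';
        if (abs_e < 10) out += '0';  // at least two exponent digits
        out += std::to_string(abs_e);
    } else if (exponent >= 0) {
        // Integer value: digits followed by trailing zeros.
        out += digits;
        out.append(static_cast<std::size_t>(exponent), '0');
    } else if (point_pos > 0) {
        // Decimal point falls inside the digit string.
        out.append(digits, 0, static_cast<std::size_t>(point_pos));
        out += '.';
        out.append(digits, static_cast<std::size_t>(point_pos),
                   std::string::npos);
    } else {
        // Value below one: leading "0." and zero padding.
        out += "0.";
        out.append(static_cast<std::size_t>(-point_pos), '0');
        out += digits;
    }
    return out;
}
```

Even this stripped-down version has four output shapes and several data-dependent branches per value, which gives a sense of why this stage shows up in a profile once the digit generation itself is fast.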

@vitaut vitaut reopened this Oct 15, 2023
@vitaut vitaut removed the question label Oct 15, 2023
@jk-jeon
Contributor

jk-jeon commented Oct 26, 2023

@zhiqiang-hhhh If you really want to test {fmt} with multiple different configurations in a single executable, you can do something like this to avoid the ODR issue: https://github.com/jk-jeon/dtoa-benchmark/blob/master/src/fmt_full_cachetest.cpp

This still feels like a terrible hack, but since it is just for testing I think it should be alright.
