Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Optimization for hexadecimal formatting #1944

Closed
NN--- opened this issue Oct 20, 2020 · 20 comments
Closed

Optimization for hexadecimal formatting #1944

NN--- opened this issue Oct 20, 2020 · 20 comments

Comments

@NN---
Copy link

NN--- commented Oct 20, 2020

Manual implementation is about 10 times faster than fmtlib.
MSVC x64

Manual: 2414
Fmt: 16481
#include <string>
#include <iostream>
#include <chrono>
#include <fmt/format.h>

constexpr char hexchar[] = "0123456789ABCDEF";

template <typename T>
std::string to_hex(T value)
{
    constexpr size_t sz = sizeof(value) * 2;
    std::string buffer(sz, L'\0');
    for (int i = 0; i < 2 * sizeof(value); i++)
        buffer[i] = hexchar[(value >> 4 * (sz - 1 - i)) & 0xf];
    return buffer;
}

template <typename T>
std::string fmt_to_hex(T value)
{
    return fmt::format(FMT_STRING("{0:X}"), value);
}

int main()
{
    int sum = 0;

    auto before =  std::chrono::system_clock::now();

    for (int i = 0; i < 100000000; i++)
    {
        auto a = to_hex(i);
        sum += a.length();
    }

    auto manual =  std::chrono::system_clock::now() - before;

    before=  std::chrono::system_clock::now();

    for (int i = 0; i < 100000000; i++)
    {
        auto a = fmt_to_hex(i);
        sum += a.length();
    }

    auto fmt=  std::chrono::system_clock::now() -before;

    std::cout << "Manual: " << duration_cast<std::chrono::milliseconds>(manual).count() << std::endl;
    std::cout << "Fmt: " << duration_cast<std::chrono::milliseconds>(fmt).count() << std::endl;

    return sum;
}
@vitaut
Copy link
Contributor

vitaut commented Oct 20, 2020

Your implementation of to_hex is not equivalent because it always produces unnecessary leading zeros. That said fmt's hex formatter is indeed quite suboptimal but it's easy to optimize. It's just a matter of adding a fast path in

fmt/include/fmt/compile.h

Lines 537 to 540 in 17fba75

using type = get_type<ID, Args>;
constexpr auto result = parse_specs<type>(str, POS + 2, ID);
return parse_tail<Args, result.end, result.next_arg_id>(
spec_field<char_type, type, ID>{result.fmt}, format_str);
and using format string compilation. PRs are welcome.

@BuzaL
Copy link

BuzaL commented Nov 17, 2020

Fortunate, easy to compare, 'cause it is 10 times slower than sprintf:
https://godbolt.org/z/cPqMz3

@NobodyXu
Copy link
Contributor

NobodyXu commented Feb 25, 2021

@BuzaL Interestingly, if I use FMT_COMPILE and use -std=c++17 -O3 -flto, then fmt::format is faster than snprintf by roughly 3 time.

https://godbolt.org/z/9Ejjoc

@BuzaL
Copy link

BuzaL commented Feb 25, 2021

As I see, its an issue with 7.x.x. The 6.2.1 still faster then reserve + append, but 7.0 and trunk are slow with ANY formatting.
That means something went wrong the on-the-fly format compiler.

@vitaut
Copy link
Contributor

vitaut commented Feb 25, 2021

Looks like the regression is specific to Compiler Explorer, most likely the installed version is compiled with optimizations disabled.

@vitaut
Copy link
Contributor

vitaut commented Feb 25, 2021

Confirmed in #2153 (comment). Don't use Compiler Explorer for benchmarking =). Also don't use system clock.

@vitaut
Copy link
Contributor

vitaut commented Mar 11, 2021

I've landed a few formatting optimizations that improve hex formatting perf among other things.

With these changes, format string compilation and adding size computation to the manual method for consistency

#include <string>
#include <iostream>
#include <chrono>
#include <fmt/compile.h>

constexpr char hexchar[] = "0123456789ABCDEF";

template <typename T>
std::string to_hex(T value)
{
    constexpr size_t sz = sizeof(value) * 2;
    std::string buffer(sz, L'\0');
    int size = 0;
    for (int i = 0; i < 2 * sizeof(value); i++) {
        int digit = (value >> 4 * (sz - 1 - i)) & 0xf;
        buffer[i] = hexchar[digit];
        if (digit != 0) size = digit;
    }
    buffer.resize(size + 1);
    return buffer;
}

template <typename T>
std::string fmt_to_hex(T value)
{
    return fmt::format(FMT_COMPILE("{:X}"), value);
}

int main()
{
    int sum = 0;

    auto before = std::chrono::steady_clock::now();

    for (int i = 0; i < 100000000; i++)
    {
        auto a = to_hex(i);
        sum += a.length();
    }

    auto manual = std::chrono::steady_clock::now() - before;

    before = std::chrono::steady_clock::now();

    for (int i = 0; i < 100000000; i++)
    {
        auto a = fmt_to_hex(i);
        sum += a.length();
    }

    auto fmt = std::chrono::steady_clock::now() - before;

    std::cout << "Manual: " << std::chrono::duration_cast<std::chrono::milliseconds>(manual).count() << std::endl;
    std::cout << "Fmt: " << std::chrono::duration_cast<std::chrono::milliseconds>(fmt).count() << std::endl;

    return sum;
}

we can get comparable performance:

$ g++ test.cc -DNDEBUG -O3 -I include src/format.cc -std=c++17 -flto
$ ./a.out 
Manual: 1099
Fmt: 867

Note that the results are not very reliable, I recommend using a benchmarking framework instead.

@alexezeder
Copy link
Contributor

alexezeder commented Mar 12, 2021

I prepared a bit different still fitting benchmark of hex formatting.
It doesn't check fmt::format() or to_hex from the top message that returns std::string, as well as std::stringstream and std::to_string, because all these approaches introduce additional allocations that may influence on actual formatting. But it does check formatting to the existing buffer, so there are several versions of fmt::format_to, std::to_chars, and custom code (it copied from {fmt} but doesn't introduce any complexity that format_to does).
There are tests named FMTCustomFormatter* which have a slightly modified version of fmt/compile.h that provides a custom int_formatter class for formatting integer types. It's the initial test of my idea of formatter being specific for passed specs, it was previously described here - #2078.
Let me paste some of the results here (full results available in README.md, also you can check it on your PC):

Benchmark Time
CustomCode 2.93 ns
FMTCompileMaster 12.5 ns
FMTCompileSeparateFormatterNOINLINERemovedFromFill 3.66 ns
ToChars 2.57 ns

@vitaut
Copy link
Contributor

vitaut commented Mar 12, 2021

Thanks for the benchmark. I wonder why the results for FMTCompileMaster and FMTCompileMasterNOINLINERemovedFromFill are slightly different in https://github.com/alexezeder/fmt_bnchmrk if fill is not called? Maybe a measurement error or some difference in codegen?

@alexezeder
Copy link
Contributor

alexezeder commented Mar 13, 2021

I wonder why the results for FMTCompileMaster and FMTCompileMasterNOINLINERemovedFromFill are slightly different in https://github.com/alexezeder/fmt_bnchmrk_legacy if fill is not called? Maybe a measurement error or some difference in codegen?

Oops... I made a mistake at the last state of preparing this repo and FMTCompileMasterNOINLINERemovedFromFill actually has the same code as FMTCompileMaster. I updated the benchmark repo, but they are still really close, almost equal.

There is a difference in codegen of course, but looks like as long as fill() function is not used this difference does not really affect performance. So it's probably a measurement error, something like ±0.3 ns which is not bad on my opinion.

EDIT: repo moved from https://github.com/alexezeder/fmt_bnchmrk to https://github.com/alexezeder/fmt_bnchmrk_legacy

@vitaut
Copy link
Contributor

vitaut commented Mar 14, 2021

Regarding FMTCompileSeparateFormatterNOINLINERemovedFromFill , it is a nice speedup but I wonder if we can avoid replicating the format specifier parsing logic. In principle the compiler should be able to optimize basic_format_specs handling since the specs are constexpr. The LTO results seem to confirm that but the question is how to make it work without LTO.

@alexezeder
Copy link
Contributor

In principle the compiler should be able to optimize basic_format_specs handling since the specs are constexpr.

My bet here is that the code is just too complex for the compiler to optimize. We can play with attributes of function to make them always inlined, but that would not probably solve problems with losing knowledge of constness of value in some cases. I mean, I'm not sure that even the formatter object that is created at compile-time can be used inside format method of spec_field at runtime in the most efficient way, i.e. inlined somehow.

The LTO results seem to confirm that but the question is how to make it work without LTO.

I'm not sure how LTO should help in this case, at least for the results in the benchmark. It uses header-only versions of {fmt}, so looks like it could help only with reducing the size of the executable by deleting some of non-inlined functions from different versions. In fact, GCC with -flto did it for my precious benchmark, so I had to build it several times to get the results of different versions. But here are results for LTO and non-LTO builds:

Master SeparatedFormatter
0 11.5 ns (w/o LTO), 10.7 ns (with LTO) 2.44 ns (w/o LTO), 2.20 ns (with LTO)
42 12.0 ns (w/o LTO), 11.0 ns (with LTO) 3.06 ns (w/o LTO), 3.25 ns (with LTO)
273123 13.2 ns (w/o LTO), 12.7 ns (with LTO) 6.11 ns (w/o LTO), 7.10 ns (with LTO)
9223372036854775807 14.3 ns (w/o LTO), 13.5 ns (with LTO) 7.89 ns (w/o LTO), 10.1 ns (with LTO)

@alexezeder
Copy link
Contributor

alexezeder commented Mar 19, 2021

And another set of benchmarks. They are also available in my benchmark repository.

Clang

Clang was added, it shows even better results for SeparatedFormatter tests:

Benchmark, value == 273123 GCC Clang
CustomCode 6.15 ns 5.03 ns
FMTCompileMaster 13.3 ns 14.7 ns
FMTCompileMasterNOINLINERemovedFromFill 13.2 ns 14.8 ns
FMTCompileSeparateFormatter 14.2 ns 13.4 ns
FMTCompileSeparateFormatterNOINLINERemovedFromFill 7.49 ns 5.19 ns
ToChars 3.76 ns 3.44 ns

For Clang there is no overhead of using code from FMTCompileSeparateFormatterNOINLINERemovedFromFill compare to custom code.

Force inlining

To test what I wrote earlier:

We can play with attributes of function to make them always inlined, but that would not probably solve problems with losing knowledge of constness of value in some cases.

I modified all functions of {fmt} (which are invoked by the test code) to add to them always_inline attribute. This forces compiler to make inlining even in case if it decides that the function is too complex. Of course, fill function left untouched for FMTCompileMaster and FMTCompileSeparateFormatter versions of {fmt}. Here are the results (FI stands for force inline version):

Benchmark, value == 42 GCC* GCC* FI Clang Clang FI
CustomCode 2.79 ns 2.79 ns 2.22 ns 2.24 ns
FMTCompileMaster 12.0 ns 11.8 ns 13.3 ns 12.8 ns
FMTCompileMasterNOINLINERemovedFromFill 12.0 ns 11.8 ns 13.4 ns 12.7 ns
FMTCompileSeparateFormatter 11.5 ns 4.83 ns 10.6 ns 10.6 ns
FMTCompileSeparateFormatterNOINLINERemovedFromFill 3.53 ns 3.31 ns 2.22 ns 2.26 ns
ToChars 2.49 ns 2.01 ns 2.21 ns 2.23 ns
Benchmark, value == 9223372036854775807 GCC* GCC* FI Clang Clang FI
CustomCode 6.82 ns 6.85 ns 6.47 ns 6.58 ns
FMTCompileMaster 14.5 ns 14.5 ns 16.0 ns 15.5 ns
FMTCompileMasterNOINLINERemovedFromFill 14.4 ns 14.5 ns 16.0 ns 15.5 ns
FMTCompileSeparateFormatter 16.6 ns 11.6 ns 15.4 ns 15.6 ns
FMTCompileSeparateFormatterNOINLINERemovedFromFill 10.5 ns 9.80 ns 6.49 ns 6.55 ns
ToChars 5.45 ns 4.54 ns 4.92 ns 5.00 ns
  • note: GCC and GCC FI results cannot be directly compared because there is a time gap between these runs

Looks like this trick can only slice some time for the FMTCompileSeparateFormatter test on GCC, but for the current approach in {fmt} it does nothing.

std::string tests

Since this issue started with std::string based test, I decided to make a proper benchmark for this case. I took code from the first message, and since this code fills the result with leading zeros (up to 8 digits), the format string for {fmt}-based test is {:0>8x}. Also, I replaced ToChars test with StringStream test, just to be sure that stringstream still sucks.

Benchmark, value == 42 GCC Clang
CustomCode 3.51 ns 3.24 ns
FMTCompileMaster 36.4 ns 28.1 ns
FMTCompileMasterNOINLINERemovedFromFill 36.5 ns 27.9 ns
FMTCompileSeparateFormatter 19.8 ns 18.5 ns
FMTCompileSeparateFormatterNOINLINERemovedFromFill 14.9 ns 18.8 ns
StringStream 311 ns 322 ns

Considering these results looks like {fmt} has a big overhead, but that's quite false. Code from the first message is actually closer to the usage of fmt::format_to because the buffer size is available at compile-time. Just by changing this code like that:

std::string test_custom_code(unsigned value) {
  constexpr unsigned width = 2 * sizeof(value);
  std::string buffer;
  for (int i = 0; i < width; i++)
    buffer.append(1, hexchar[(value >> 4 * (width - 1 - i)) & 0xf]);
  return buffer;
}

the results from GCC (Clang still optimizes this code) become comparable:

Benchmark, value == 273123 Time
CustomCode 15.6 ns
FMTCompileMaster 35.4 ns
FMTCompileMasterNOINLINERemovedFromFill 35.4 ns
FMTCompileSeparateFormatter 19.4 ns
FMTCompileSeparateFormatterNOINLINERemovedFromFill 14.7 ns
StringStream 321 ns

@vitaut
Copy link
Contributor

vitaut commented Mar 21, 2021

AFAICS the part that prevents optimization for some reason is dynamic width/precision handling:

fmt/include/fmt/format.h

Lines 3540 to 3544 in 0f85a46

auto specs = specs_;
detail::handle_dynamic_spec<detail::width_checker>(specs.width,
specs.width_ref, ctx);
detail::handle_dynamic_spec<detail::precision_checker>(
specs.precision, specs.precision_ref, ctx);

Temporarily commenting it out makes gcc do the expected optimization:

-----------------------------------------------------------------------------------------------------------------
Benchmark                                                                       Time             CPU   Iterations
-----------------------------------------------------------------------------------------------------------------
CustomCode/0                                                                 1.40 ns         1.40 ns    499241121
CustomCode/42                                                                1.99 ns         1.99 ns    350273792
CustomCode/273123                                                            4.93 ns         4.93 ns    143643778
CustomCode/9223372036854775807                                               5.48 ns         5.48 ns    125456338
FMTCompileMaster/0                                                           1.20 ns         1.20 ns    582823513
FMTCompileMaster/42                                                          1.59 ns         1.59 ns    437414846
FMTCompileMaster/273123                                                      2.89 ns         2.89 ns    238470638
FMTCompileMaster/9223372036854775807                                         3.89 ns         3.89 ns    180152587
FMTCompileMasterNOINLINERemovedFromFill/0                                    9.17 ns         9.17 ns     75675294
FMTCompileMasterNOINLINERemovedFromFill/42                                   9.57 ns         9.57 ns     72413567
FMTCompileMasterNOINLINERemovedFromFill/273123                               10.6 ns         10.6 ns     65808614
FMTCompileMasterNOINLINERemovedFromFill/9223372036854775807                  11.6 ns         11.6 ns     60052123
FMTCompileSeparateFormatter/0                                                7.71 ns         7.71 ns     89553288
FMTCompileSeparateFormatter/42                                               8.52 ns         8.52 ns     81365037
FMTCompileSeparateFormatter/273123                                           10.7 ns         10.7 ns     64955329
FMTCompileSeparateFormatter/9223372036854775807                              12.0 ns         12.0 ns     58023609
FMTCompileSeparateFormatterNOINLINERemovedFromFill/0                         1.91 ns         1.91 ns    368577999
FMTCompileSeparateFormatterNOINLINERemovedFromFill/42                        2.63 ns         2.63 ns    265312761
FMTCompileSeparateFormatterNOINLINERemovedFromFill/273123                    5.57 ns         5.57 ns    126824185
FMTCompileSeparateFormatterNOINLINERemovedFromFill/9223372036854775807       8.26 ns         8.26 ns     84841458
ToChars/0                                                                    1.00 ns         1.00 ns    667463669
ToChars/42                                                                   1.99 ns         1.99 ns    350416544
ToChars/273123                                                               2.99 ns         2.99 ns    233728063
ToChars/9223372036854775807                                                  4.32 ns         4.32 ns    161760882

Need to figure out what specifically gcc doesn't like about the above code snippet.

@alexezeder
Copy link
Contributor

alexezeder commented Mar 21, 2021

Temporarily commenting it out makes gcc do the expected optimization:

Wow, these results are great! I didn't expect that compilers can optimize code like that.

This change makes FMTCompileMaster timings even better than timings for std::to_chars, despite {fmt} is way more generic. But that's true for GCC, not for Clang. On Clang, FMTCompileMaster timings become better with this change, but it's still slower than FMTCompileSeparateFormatterNOINLINERemovedFromFill. Here are combined results (DSH - dynamic specs handling):

Benchmark, value == 0 GCC w/o DSH GCC with DSH Clang w/o DSH Clang with DSH
CustomCode 1.76 ns 1.75 ns 1.59 ns 1.55 ns
FMTCompileMaster 1.50 ns 11.5 ns 6.82 ns 12.9 ns
FMTCompileSeparateFormatterNOINLINERemovedFromFill 2.54 ns 2.52 ns 1.59 ns 1.56 ns
ToChars 1.24 ns 1.25 ns 1.25 ns 1.89 ns
Benchmark, value == 42 GCC w/o DSH GCC with DSH Clang w/o DSH Clang with DSH
CustomCode 2.55 ns 2.79 ns 2.26 ns 2.22 ns
FMTCompileMaster 2.00 ns 12.0 ns 7.49 ns 13.3 ns
FMTCompileSeparateFormatterNOINLINERemovedFromFill 3.80 ns 3.53 ns 2.26 ns 2.22 ns
ToChars 2.90 ns 2.49 ns 2.25 ns 2.21 ns
Benchmark, value == 273123 GCC w/o DSH GCC with DSH Clang w/o DSH Clang with DSH
CustomCode 5.81 ns 6.15 ns 5.19 ns 5.03 ns
FMTCompileMaster 3.34 ns 13.3 ns 8.95 ns 14.7 ns
FMTCompileSeparateFormatterNOINLINERemovedFromFill 8.26 ns 7.49 ns 5.51 ns 5.19 ns
ToChars 4.50 ns 3.76 ns 3.50 ns 3.44 ns
Benchmark, value == 9223372036854775807 GCC w/o DSH GCC with DSH Clang w/o DSH Clang with DSH
CustomCode 6.95 ns 6.82 ns 6.72 ns 6.47 ns
FMTCompileMaster 4.86 ns 14.5 ns 10.6 ns 16.0 ns
FMTCompileSeparateFormatterNOINLINERemovedFromFill 11.3 ns 10.5 ns 6.67 ns 6.49 ns
ToChars 5.63 ns 5.45 ns 5.02 ns 4.92 ns

For some reason FMTCompileSeparateFormatterNOINLINERemovedFromFill is slower than FMTCompileMaster on GCC. I mean, they share the same {fmt} code, and int_formatter is way simpler than the current one 😐. But okay, I would investigate what's going on there.

In addition, I did a quick check for a bit more complex format string 0>8x. Here are also combined results:

Benchmark, value == 0 GCC w/o DSH GCC with DSH Clang w/o DSH Clang with DSH
CustomCode 4.07 ns 4.48 ns 5.27 ns 5.23 ns
FMTCompileMaster 21.5 ns 27.8 ns 14.2 ns 19.2 ns
FMTCompileMasterNOINLINERemovedFromFill 4.12 ns 27.7 ns 14.2 ns 19.2 ns
FMTCompileSeparateFormatter 11.6 ns 11.0 ns 9.72 ns 9.97 ns
FMTCompileSeparateFormatterNOINLINERemovedFromFill 5.60 ns 5.97 ns 4.95 ns 5.48 ns
Benchmark, value == 42 GCC w/o DSH GCC with DSH Clang w/o DSH Clang with DSH
CustomCode 4.32 ns 5.61 ns 6.15 ns 6.08 ns
FMTCompileMaster 21.9 ns 28.8 ns 14.8 ns 19.6 ns
FMTCompileMasterNOINLINERemovedFromFill 4.84 ns 28.6 ns 15.0 ns 19.6 ns
FMTCompileSeparateFormatter 11.8 ns 11.8 ns 10.7 ns 10.7 ns
FMTCompileSeparateFormatterNOINLINERemovedFromFill 5.96 ns 6.21 ns 6.29 ns 5.99 ns
Benchmark, value == 273123 GCC w/o DSH GCC with DSH Clang w/o DSH Clang with DSH
CustomCode 6.91 ns 8.07 ns 8.51 ns 8.44 ns
FMTCompileMaster 23.2 ns 29.7 ns 16.3 ns 20.9 ns
FMTCompileMasterNOINLINERemovedFromFill 5.72 ns 29.9 ns 16.2 ns 21.0 ns
FMTCompileSeparateFormatter 14.7 ns 15.1 ns 12.6 ns 13.5 ns
FMTCompileSeparateFormatterNOINLINERemovedFromFill 7.98 ns 8.06 ns 9.37 ns 8.30 ns
Benchmark, value == 9223372036854775807 GCC w/o DSH GCC with DSH Clang w/o DSH Clang with DSH
CustomCode 7.53 ns 9.28 ns 8.38 ns 8.28 ns
FMTCompileMaster 20.5 ns 27.2 ns 14.0 ns 19.3 ns
FMTCompileMasterNOINLINERemovedFromFill 5.07 ns 27.5 ns 13.9 ns 19.0 ns
FMTCompileSeparateFormatter 16.9 ns 16.7 ns 14.9 ns 15.7 ns
FMTCompileSeparateFormatterNOINLINERemovedFromFill 9.38 ns 9.33 ns 11.7 ns 10.6 ns

@alexezeder
Copy link
Contributor

alexezeder commented Mar 23, 2021

Ok, I realized why FMTCompileSeparateFormatter* are so slow compared to FMTCompileMaster* - it was probably a bad idea to take different revisions of {fmt} without rebasing separate formatter test code onto the master version. Especially when format.h was updated in master (integral formatting refactoring).

Results for {:x} format string:

Benchmark, value == 42 GCC Clang
CustomCode 2.51 ns 2.27 ns
FMTCompileMaster 1.99 ns 7.41 ns
FMTCompileMasterNOINLINERemovedFromFill 2.00 ns 7.17 ns
FMTCompileSeparateFormatter 2.75 ns 2.09 ns
FMTCompileSeparateFormatterNOINLINERemovedFromFill 2.76 ns 2.08 ns
ToChars 2.10 ns 2.27 ns

Results for {:0>8x} format string:

Benchmark, value == 42 GCC Clang
CustomCode 4.27 ns 6.19 ns
FMTCompileMaster 22.3 ns 14.8 ns
FMTCompileMasterNOINLINERemovedFromFill 4.82 ns 15.0 ns
FMTCompileSeparateFormatter 8.72 ns 6.85 ns
FMTCompileSeparateFormatterNOINLINERemovedFromFill 5.95 ns 5.32 ns

Still, FMTCompileMaster is a bit faster than FMTCompileSeparateFormatter on GCC, but it's either poor optimization for a non-type template parameter or some optimization magic which cannot be solved.

Anyway, even if those handle_dynamic_spec() calls cannot be optimized out sanely, then I think we still can have that static only specs optimization in {fmt}. This solution requires a new formatter that handles only basic_ not dynamic_ format specs. Also, there is a problem of detecting dynamic format specs on the string parsing step, because we cannot just look for { and } in specs to determine it is dynamic or static. But we can create the default formatter and look at the context argument identifier after calling parse() method of formatter.

@alexezeder
Copy link
Contributor

Probably the final results for hexadecimal formatting.

I ran the same benchmarks for {:x} and {:0>8x} format strings, but this time they were compiled using MSVC (version 19.28.29913), since this compiler is also popular. Again, it's C++20 standard, it's a release configuration, everything remains the same. I also added results from GCC (Cygwin) on Windows just to demonstrate that GCC results on Linux and on Windows are almost identical.

Results for {:x} format string:

Benchmark, value == 42 MSVC GCC (Cygwin) GCC Clang
CustomCode 4.14 ns 2.78 ns 2.51 ns 2.27 ns
FMTCompileMaster 38.5 ns 2.52 ns 1.99 ns 7.41 ns
FMTCompileMasterNOINLINERemovedFromFill 38.5 ns 2.02 ns 2.00 ns 7.17 ns
FMTCompileSeparateFormatter 7.87 ns 2.52 ns 2.75 ns 2.09 ns
FMTCompileSeparateFormatterNOINLINERemovedFromFill 7.89 ns 2.85 ns 2.76 ns 2.08 ns
ToChars 8.07 ns 2.02 ns 2.10 ns 2.27 ns

Results for {:0>8x} format string:

Benchmark, value == 42 MSVC GCC (Cygwin) GCC Clang
CustomCode 6.85 ns 4.78 ns 4.27 ns 6.19 ns
FMTCompileMaster 45.8 ns 27.4 ns 22.3 ns 14.8 ns
FMTCompileMasterNOINLINERemovedFromFill 45.9 ns 5.37 ns 4.82 ns 15.0 ns
FMTCompileSeparateFormatter 16.7 ns 15.4 ns 8.72 ns 6.85 ns
FMTCompileSeparateFormatterNOINLINERemovedFromFill 16.7 ns 6.64 ns 5.95 ns 5.32 ns

Need to figure out what specifically gcc doesn't like about the above code snippet.

I dig a little into the handle_dynamic_spec(), and based on what I saw - every meaningful line (in this function or in functions that are invoked inside) introduces some overhead (mostly it's ~1-2 ns). But the biggest impact comes from basic_format_args and it's get() method (~5 ns). Note that these observations are not very precise, because they actually just cannot be precise because we are talking about compiler optimizations here with all that magic under the hood.

@vitaut
Copy link
Contributor

vitaut commented Mar 26, 2021

Thanks for such a comprehensive testing!

I didn't expect that compilers can optimize code like that.

In principle this is mostly inlining and constant propagation. I'm more surprised that clang does a poor job here. I'll try to dig into the generated assembly to see what's going on.

I think we still can have that static only specs optimization in {fmt}.

Right.

I dig a little into the handle_dynamic_spec(), and based on what I saw - every meaningful line (in this function or in functions that are invoked inside) introduces some overhead (mostly it's ~1-2 ns).

True but it should be optimized away completely if dynamic specs are not used.

@jovibor
Copy link

jovibor commented Mar 27, 2021

@alexezeder That's a nice benchmarks indeed.
I hope you'll add a std::format to this contest, when it's ready.

@vitaut
Copy link
Contributor

vitaut commented May 13, 2021

After a recent series of optimizations the perf is on par with the handwritten code. Here are the results on clang and the current master (d1aebdb):

------------------------------------------------------------------------------------------------------
Benchmark                                                            Time             CPU   Iterations
------------------------------------------------------------------------------------------------------
CustomCode/0                                                      1.86 ns         1.86 ns    351709307
CustomCode/42                                                     3.03 ns         3.02 ns    244754389
CustomCode/273123                                                 5.45 ns         5.43 ns    130468007
CustomCode/9223372036854775807                                    6.79 ns         6.77 ns    102097372
FMTCompileMaster/0                                                1.99 ns         1.99 ns    360523686
FMTCompileMaster/42                                               2.33 ns         2.33 ns    279865664
FMTCompileMaster/273123                                           3.72 ns         3.71 ns    190230315
FMTCompileMaster/9223372036854775807                              5.28 ns         5.26 ns    130711631
ToChars/0                                                         4.42 ns         4.41 ns    160196630
ToChars/42                                                        5.00 ns         4.98 ns    140735201
ToChars/273123                                                    7.26 ns         7.24 ns     95784130
ToChars/9223372036854775807                                       8.77 ns         8.75 ns     75872534

For comparison here's are results on an old version (7.1.3):

Running ./fmt_test
Run on (8 X 2800 MHz CPU s)
CPU Caches:
  L1 Data 32 KiB (x4)
  L1 Instruction 32 KiB (x4)
  L2 Unified 256 KiB (x4)
  L3 Unified 8192 KiB (x1)
Load Average: 22.02, 12.70, 8.66
------------------------------------------------------------------------------------------------------
Benchmark                                                            Time             CPU   Iterations
------------------------------------------------------------------------------------------------------
CustomCode/0                                                      1.81 ns         1.81 ns    369289856
CustomCode/42                                                     2.66 ns         2.63 ns    277249683
CustomCode/273123                                                 5.30 ns         5.27 ns    131162285
CustomCode/9223372036854775807                                    6.92 ns         6.86 ns    101821144
FMTCompileOld/0                                                   15.5 ns         15.5 ns     43302898
FMTCompileOld/42.                                                 16.6 ns         16.6 ns     43278267
FMTCompileOld/273123                                              18.7 ns         18.6 ns     37035861
FMTCompileOld/9223372036854775807                                 19.4 ns         19.4 ns     35243000
ToChars/0                                                         4.33 ns         4.31 ns    162237581
ToChars/42                                                        5.50 ns         5.44 ns    123228589
ToChars/273123                                                    8.47 ns         8.16 ns     92972600
ToChars/9223372036854775807                                       9.84 ns         9.52 ns     63743569

@vitaut vitaut closed this as completed May 13, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

6 participants