-
Notifications
You must be signed in to change notification settings - Fork 2.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Optimization for hexadecimal formatting #1944
Comments
Your implementation of Lines 537 to 540 in 17fba75
|
Fortunate, easy to compare, 'cause it is 10 times slower than sprintf: |
@BuzaL Interestingly, if I use |
As I see, its an issue with 7.x.x. The 6.2.1 still faster then reserve + append, but 7.0 and trunk are slow with ANY formatting. |
Looks like the regression is specific to Compiler Explorer, most likely the installed version is compiled with optimizations disabled. |
Confirmed in #2153 (comment). Don't use Compiler Explorer for benchmarking =). Also don't use system clock. |
I've landed a few formatting optimizations that improve hex formatting perf among other things. With these changes, format string compilation and adding size computation to the manual method for consistency #include <string>
#include <iostream>
#include <chrono>
#include <fmt/compile.h>
constexpr char hexchar[] = "0123456789ABCDEF";
template <typename T>
std::string to_hex(T value)
{
constexpr size_t sz = sizeof(value) * 2;
std::string buffer(sz, L'\0');
int size = 0;
for (int i = 0; i < 2 * sizeof(value); i++) {
int digit = (value >> 4 * (sz - 1 - i)) & 0xf;
buffer[i] = hexchar[digit];
if (digit != 0) size = digit;
}
buffer.resize(size + 1);
return buffer;
}
template <typename T>
std::string fmt_to_hex(T value)
{
return fmt::format(FMT_COMPILE("{:X}"), value);
}
int main()
{
int sum = 0;
auto before = std::chrono::steady_clock::now();
for (int i = 0; i < 100000000; i++)
{
auto a = to_hex(i);
sum += a.length();
}
auto manual = std::chrono::steady_clock::now() - before;
before = std::chrono::steady_clock::now();
for (int i = 0; i < 100000000; i++)
{
auto a = fmt_to_hex(i);
sum += a.length();
}
auto fmt = std::chrono::steady_clock::now() - before;
std::cout << "Manual: " << std::chrono::duration_cast<std::chrono::milliseconds>(manual).count() << std::endl;
std::cout << "Fmt: " << std::chrono::duration_cast<std::chrono::milliseconds>(fmt).count() << std::endl;
return sum;
} we can get comparable performance:
Note that the results are not very reliable, I recommend using a benchmarking framework instead. |
I prepared a bit different still fitting benchmark of hex formatting.
|
Thanks for the benchmark. I wonder why the results for |
Oops... I made a mistake at the last state of preparing this repo and There is a difference in codegen of course, but looks like as long as EDIT: repo moved from https://github.com/alexezeder/fmt_bnchmrk to https://github.com/alexezeder/fmt_bnchmrk_legacy |
Regarding |
My bet here is that the code is just too complex for the compiler to optimize. We can play with attributes of function to make them always inlined, but that would not probably solve problems with losing knowledge of constness of value in some cases. I mean, I'm not sure that even the formatter object that is created at compile-time can be used inside
I'm not sure how LTO should help in this case, at least for the results in the benchmark. It uses header-only versions of {fmt}, so looks like it could help only with reducing the size of the executable by deleting some of non-inlined functions from different versions. In fact, GCC with
|
And another set of benchmarks. They are also available in my benchmark repository. ClangClang was added, it shows even better results for SeparatedFormatter tests:
For Clang there is no overhead of using code from Force inliningTo test what I wrote earlier:
I modified all functions of {fmt} (which are invoked by the test code) to add to them
Looks like this trick can only slice some time for the std::string testsSince this issue started with
Considering these results looks like {fmt} has a big overhead, but that's quite false. Code from the first message is actually closer to the usage of std::string test_custom_code(unsigned value) {
constexpr unsigned width = 2 * sizeof(value);
std::string buffer;
for (int i = 0; i < width; i++)
buffer.append(1, hexchar[(value >> 4 * (width - 1 - i)) & 0xf]);
return buffer;
} the results from GCC (Clang still optimizes this code) become comparable:
|
AFAICS the part that prevents optimization for some reason is dynamic width/precision handling: Lines 3540 to 3544 in 0f85a46
Temporarily commenting it out makes gcc do the expected optimization:
Need to figure out what specifically gcc doesn't like about the above code snippet. |
Wow, these results are great! I didn't expect that compilers can optimize code like that. This change makes
For some reason In addition, I did a quick check for a bit more complex format string
|
Ok, I realized why Results for
Results for
Still, Anyway, even if those |
Probably the final results for hexadecimal formatting. I ran the same benchmarks for Results for
Results for
I dig a little into the |
Thanks for such a comprehensive testing!
In principle this is mostly inlining and constant propagation. I'm more surprised that clang does a poor job here. I'll try to dig into the generated assembly to see what's going on.
Right.
True but it should be optimized away completely if dynamic specs are not used. |
@alexezeder That's a nice benchmarks indeed. |
After a recent series of optimizations the perf is on par with the handwritten code. Here are the results on clang and the current
For comparison here's are results on an old version (7.1.3):
|
Manual implementation is about 10 times faster than fmtlib.
MSVC x64
The text was updated successfully, but these errors were encountered: