-
Notifications
You must be signed in to change notification settings - Fork 30.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
src: use find instead of char-by-char in FromFilePath() #50288
Conversation
Review requested:
|
Co-authored-by: Tobias Nießen <[email protected]>
Co-authored-by: Tobias Nießen <[email protected]>
@tniessen Thanks for the review of the comments. I have committed your proposed changes. I think that the code itself is correct and contains no superfluous lines. Do you agree? |
// Escape '%' characters to a temporary string. | ||
std::string escaped_file_path; | ||
do { | ||
escaped_file_path += file_path.substr(0, pos + 1); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Potential poor performance due to reallocating here. Imagine a pathological input like "%".repeat(1e6)
.
Better to count the number of %
characters, then preallocate a buffer of size file_path.size() + 2 * count
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There is no reallocation due to the string_view.
Clang compiles the second inner loop to the following:
.LBB0_12: # =>This Inner Loop Header: Depth=1
lea r14, [r13 + 1]
cmp r15, r14
mov rdx, r14
cmovb rdx, r15
mov rax, rbp
sub rax, qword ptr [rbx + 8]
cmp rax, rdx
jb .LBB0_13
mov rdi, rbx
mov rsi, r12
call std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >::_M_append(char const*, unsigned long)@PLT
mov rax, qword ptr [rbx + 8]
and rax, -2
cmp rax, qword ptr [rsp + 16] # 8-byte Folded Reload
je .LBB0_17
mov edx, 2
mov rdi, rbx
lea rsi, [rip + .L.str]
call std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >::_M_append(char const*, unsigned long)@PLT
cmp r15, r13
jbe .LBB0_21
add r12, r14
sub r15, r14
je .LBB0_24
mov rdi, r12
mov esi, 37
mov rdx, r15
call memchr@PLT
test rax, rax
je .LBB0_27
mov r13, rax
sub r13, r12
cmp r13, -1
jne .LBB0_12
There are two calls to string append, as you'd expect. Otherwise, there is no allocation. E.g., for example, there are calls to a std::string_view
constructor because it gets optimized away. Remember that std::string_view
instances are non-allocating. That's why, for example, we pass them by value (not by reference), typically.
The adversarial scenario is one where the entire string is made of '%'. You can benchmark this case using my code, while passing --adversarial
to the benchmark program (benchmark --adversarial
). In that scenario, the repeated calls to find
are not a positive. All tested functions are slow in this adversarial case, but the new code is about 50% slower. That's to be expected. If you do expect strings to contain a large fraction of '%' characters, then the PR is not beneficial. But the point of the PR is that on realistic inputs, the PR can multiply the performance by 10x.
Results on my macBook (LLVM14, you can run your own benchmarks):
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Empirically, doing a count + reserve does not improve the performance in the adversarial case where every character is a '%'. It makes the code slightly more complicated, however.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Remember that std::string_view instances are non-allocating
Yes, but escaped_file_path is a string, not a string_view, and that's what being appended to.
Empirically, doing a count + reserve does not improve the performance in the adversarial case
Honestly, I'm not worried about maximum performance here. I'd rather have predictable worst-case performance.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@bnoordhuis I'm going to merge this in a couple of hours since this is not a review that blocks.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd rather have predictable worst-case performance.
Appending character-by-character to an std::string
in C++ provides predictable performance: https://lemire.me/blog/2023/10/23/appending-to-an-stdstring-character-by-character-how-does-the-capacity-grow/
It is a common usage pattern in C++ (with std::vector
, std::string
, etc.). Complexity-wise, it is linear time with or withour reserve. A reserve may improve the performance, but it does not change the complexity of the algorithm. In my mind, you'd only want to do a reserve if it improves the real-world performance. It is effectively a performance/efficiency optimization.
Bugs are always possible, but in my mind, neither this PR nor the previous code (written by @anonrig I think) could degenerate performance-wise. They use linear time, safe algorithms.
I should stress that this code is not nearly as good as it could be, but further gains require changing ada.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Appending character-by-character to an std::string in C++ provides predictable performance
You're making an observation based on a sample size of one. I suspect you're also working on the assumption that (re)allocation is a O(1) operation.
Now, the change in this PR doesn't make the code materially worse (only longer and more complex) so I'm not going to block it but it also doesn't make it materially better, it's just rummaging in the margin.
Count + alloc on the other hand is a material improvement because it changes the asymptotic runtime behavior.
(I'm not interested in discussing this further because time is finite and if I haven't gotten my point across by now, it's never going to happen.)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not interested in discussing this further because time is finite and if I haven't gotten my point across by now, it's never going to happen.
There must be a misunderstanding. I suspect I explained myself poorly. I can see no reason why we would even disagree. Let me try to clarify.
I am saying that if you benchmark the following function...
std::string my_string;
while (my_string.size() <= volume) { my_string += "a"; }
... the time per entry (time/volume) is effectively constant with existing C++ runtime libraries (glibc++, libc++, Visual Studio). My blog post has an analysis, and even a benchmark (complete with C++ code).
Another way to put it is that the running time is proportional to volume. And yes, I mean "the measured, wall-clock time".
They make it so because it is common usage. E.g., it would be considered a bug if it were quadratic time. We would know. The dynamic arrays in Swift, early on, could be tricked into doing one allocation per append. They quickly updated it.
I suspect you're also working on the assumption that (re)allocation is a O(1) operation.
I am assuming that allocating (or copying) N bytes takes O(N) time. With this model, insertion in a dynamic array with capacity that is expanded by a constant factor (which is how C++ runtime libraries do it) ensures that inserting an element is constant time (amortized).
Let us consider a toy example where you construct a 1024-char string, doubling each time the capacity, and starting with a capacity of 1 byte... (that's not realistic but it is simple)
- first character: allocate 1 byte.
- second character: double the capacity to 2 bytes.
- 4th character: double the capacity to 4 bytes.
- 8th character: double the capacity to 8 bytes.
- 16th character: double the capacity to 16 bytes.
- 32nd character: double the capacity to 32 bytes.
- 64th character: double the capacity to 64 bytes.
- 128th character: double the capacity to 128 bytes.
- 256th character: double the capacity to 256 bytes.
- 512th character: double the capacity to 512 bytes.
- 1024th character: double the capacity to 1024 bytes.
If you sum it up you get 1 + 2 + 4 + ... 512 + 1024 = 2047. And the result is general. To construct a string of N bytes, you need ~2N bytes being allocated. The key part is that it has amortized linear complexity. Both glibc++ and libc++ use a factor of 2 whereas Microsoft and Facebook prefer a smaller factor, but the analysis is the same.
Feel free to reach out to me at [email protected] if I did not clarify that point. We can set up a videoconference if needed.
Landed in c89bae1 |
PR-URL: nodejs#50288 Reviewed-By: Yagiz Nizipli <[email protected]> Reviewed-By: Vinícius Lourenço Claro Cardoso <[email protected]> Reviewed-By: Tobias Nießen <[email protected]>
PR-URL: #50288 Reviewed-By: Yagiz Nizipli <[email protected]> Reviewed-By: Vinícius Lourenço Claro Cardoso <[email protected]> Reviewed-By: Tobias Nießen <[email protected]>
PR-URL: #50288 Reviewed-By: Yagiz Nizipli <[email protected]> Reviewed-By: Vinícius Lourenço Claro Cardoso <[email protected]> Reviewed-By: Tobias Nießen <[email protected]>
In PR https://github.com/nodejs/node/pull/50253/files, an optimization was proposed for the FromFilePath() function. This function replaces every occurence of the '%' character by the string '%25'. The expectation is that in this function (FromFilePath), strings typically do not contain the '%', or if they do, they have few such characters. It lead me to write a blog post and write a non-trivial benchmark and realistic data: For processing strings, streams in C++ can be slow. My work suggests that a loop calling
find
is faster. Furthermore, we can do one better and avoid the allocation of a temporarystd::string
in the common case where the '%' is not found.If you use my benchmarking code, you find that code similar to the PR is several times more efficient the current code (5 GB/s vs 0.34 GB/s) (note: the benchmark does not include the URL parsing which is considered separate). Here are my numbers of my macBook with LLVM 14 (you can run your own benchmarks):
Clang compiles the second inner loop to the following:
There are two calls to string append, as you'd expect. Otherwise, there is no allocation. E.g., for example, there are calls to a
std::string_view
constructor because it gets optimized away. Remember thatstd::string_view
instances are non-allocating. That's why, for example, we pass them by value (not by reference), typically.The adversarial scenario is one where the entire string is made of '%'. You can benchmark this case using my code, while passing
--adversarial
to the benchmark program (benchmark --adversarial
). In that scenario, the repeated calls tofind
are not a positive. All tested functions are slow in this adversarial case, but the new code is about 50% slower. That's to be expected. If you do expect strings to contain a large fraction of '%' characters, then the PR is not beneficial. But the point of the PR is that on realistic inputs, the PR can multiply the performance by 10x.Results on my macBook (LLVM14, you can run your own benchmarks):
It is possible to use a slightly more complicated function that does a first pass, counting the number of '%' characters and allocates accordingly. Empirically (see the
_count
results in the screenshot), it does not make much of a difference in the performance, whether you are in the realistic or adversarial case. However, requiring two passes over the data is slightly more complex so I opt for the simplest efficient implementation.Note that appending to an std::string, even character by character, has linear complexity. Each append does not translate into a new allocation. Rather the complexity grows exponentially, doubling a few times.