Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

src: use find instead of char-by-char in FromFilePath() #50288

Merged
merged 6 commits into from
Oct 24, 2023

Conversation

lemire
Copy link
Member

@lemire lemire commented Oct 19, 2023

In PR https://github.com/nodejs/node/pull/50253/files, an optimization was proposed for the FromFilePath() function. This function replaces every occurence of the '%' character by the string '%25'. The expectation is that in this function (FromFilePath), strings typically do not contain the '%', or if they do, they have few such characters. It lead me to write a blog post and write a non-trivial benchmark and realistic data: For processing strings, streams in C++ can be slow. My work suggests that a loop calling find is faster. Furthermore, we can do one better and avoid the allocation of a temporary std::string in the common case where the '%' is not found.

If you use my benchmarking code, you find that code similar to the PR is several times more efficient the current code (5 GB/s vs 0.34 GB/s) (note: the benchmark does not include the URL parsing which is considered separate). Here are my numbers of my macBook with LLVM 14 (you can run your own benchmarks):

Capture d’écran, le 2023-10-22 à 12 52 02

Clang compiles the second inner loop to the following:

.LBB0_12: # =>This Inner Loop Header: Depth=1
  lea r14, [r13 + 1]
  cmp r15, r14
  mov rdx, r14
  cmovb rdx, r15
  mov rax, rbp
  sub rax, qword ptr [rbx + 8]
  cmp rax, rdx
  jb .LBB0_13
  mov rdi, rbx
  mov rsi, r12
  call std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >::_M_append(char const*, unsigned long)@PLT
  mov rax, qword ptr [rbx + 8]
  and rax, -2
  cmp rax, qword ptr [rsp + 16] # 8-byte Folded Reload
  je .LBB0_17
  mov edx, 2
  mov rdi, rbx
  lea rsi, [rip + .L.str]
  call std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >::_M_append(char const*, unsigned long)@PLT
  cmp r15, r13
  jbe .LBB0_21
  add r12, r14
  sub r15, r14
  je .LBB0_24
  mov rdi, r12
  mov esi, 37
  mov rdx, r15
  call memchr@PLT
  test rax, rax
  je .LBB0_27
  mov r13, rax
  sub r13, r12
  cmp r13, -1
  jne .LBB0_12

There are two calls to string append, as you'd expect. Otherwise, there is no allocation. E.g., for example, there are calls to a std::string_view constructor because it gets optimized away. Remember that std::string_view instances are non-allocating. That's why, for example, we pass them by value (not by reference), typically.

The adversarial scenario is one where the entire string is made of '%'. You can benchmark this case using my code, while passing --adversarial to the benchmark program (benchmark --adversarial). In that scenario, the repeated calls to find are not a positive. All tested functions are slow in this adversarial case, but the new code is about 50% slower. That's to be expected. If you do expect strings to contain a large fraction of '%' characters, then the PR is not beneficial. But the point of the PR is that on realistic inputs, the PR can multiply the performance by 10x.

Results on my macBook (LLVM14, you can run your own benchmarks):

Capture d’écran, le 2023-10-22 à 12 52 55

It is possible to use a slightly more complicated function that does a first pass, counting the number of '%' characters and allocates accordingly. Empirically (see the _count results in the screenshot), it does not make much of a difference in the performance, whether you are in the realistic or adversarial case. However, requiring two passes over the data is slightly more complex so I opt for the simplest efficient implementation.

Note that appending to an std::string, even character by character, has linear complexity. Each append does not translate into a new allocation. Rather the complexity grows exponentially, doubling a few times.

@nodejs-github-bot
Copy link
Collaborator

Review requested:

  • @nodejs/url

@nodejs-github-bot nodejs-github-bot added c++ Issues and PRs that require attention from people who are familiar with C++. needs-ci PRs that need a full CI run. whatwg-url Issues and PRs related to the WHATWG URL implementation. labels Oct 19, 2023
@anonrig anonrig added request-ci Add this label to start a Jenkins CI on a PR. commit-queue-squash Add this label to instruct the Commit Queue to squash all the PR commits into the first one. labels Oct 19, 2023
@github-actions github-actions bot removed the request-ci Add this label to start a Jenkins CI on a PR. label Oct 19, 2023
@nodejs-github-bot
Copy link
Collaborator

@H4ad H4ad added the author ready PRs that have at least one approval, no pending requests for changes, and a CI started. label Oct 19, 2023
@nodejs-github-bot
Copy link
Collaborator

src/node_url.cc Outdated Show resolved Hide resolved
src/node_url.cc Outdated Show resolved Hide resolved
src/node_url.cc Outdated Show resolved Hide resolved
lemire and others added 2 commits October 19, 2023 18:33
Co-authored-by: Tobias Nießen <[email protected]>
Co-authored-by: Tobias Nießen <[email protected]>
@lemire
Copy link
Member Author

lemire commented Oct 19, 2023

@tniessen Thanks for the review of the comments. I have committed your proposed changes. I think that the code itself is correct and contains no superfluous lines. Do you agree?

@anonrig anonrig added the request-ci Add this label to start a Jenkins CI on a PR. label Oct 19, 2023
@github-actions github-actions bot removed the request-ci Add this label to start a Jenkins CI on a PR. label Oct 19, 2023
@nodejs-github-bot
Copy link
Collaborator

@nodejs-github-bot
Copy link
Collaborator

@nodejs-github-bot
Copy link
Collaborator

src/node_url.cc Outdated Show resolved Hide resolved
@nodejs-github-bot
Copy link
Collaborator

// Escape '%' characters to a temporary string.
std::string escaped_file_path;
do {
escaped_file_path += file_path.substr(0, pos + 1);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Potential poor performance due to reallocating here. Imagine a pathological input like "%".repeat(1e6).

Better to count the number of % characters, then preallocate a buffer of size file_path.size() + 2 * count.

Copy link
Member Author

@lemire lemire Oct 22, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is no reallocation due to the string_view.

Clang compiles the second inner loop to the following:

.LBB0_12: # =>This Inner Loop Header: Depth=1
  lea r14, [r13 + 1]
  cmp r15, r14
  mov rdx, r14
  cmovb rdx, r15
  mov rax, rbp
  sub rax, qword ptr [rbx + 8]
  cmp rax, rdx
  jb .LBB0_13
  mov rdi, rbx
  mov rsi, r12
  call std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >::_M_append(char const*, unsigned long)@PLT
  mov rax, qword ptr [rbx + 8]
  and rax, -2
  cmp rax, qword ptr [rsp + 16] # 8-byte Folded Reload
  je .LBB0_17
  mov edx, 2
  mov rdi, rbx
  lea rsi, [rip + .L.str]
  call std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >::_M_append(char const*, unsigned long)@PLT
  cmp r15, r13
  jbe .LBB0_21
  add r12, r14
  sub r15, r14
  je .LBB0_24
  mov rdi, r12
  mov esi, 37
  mov rdx, r15
  call memchr@PLT
  test rax, rax
  je .LBB0_27
  mov r13, rax
  sub r13, r12
  cmp r13, -1
  jne .LBB0_12

There are two calls to string append, as you'd expect. Otherwise, there is no allocation. E.g., for example, there are calls to a std::string_view constructor because it gets optimized away. Remember that std::string_view instances are non-allocating. That's why, for example, we pass them by value (not by reference), typically.

The adversarial scenario is one where the entire string is made of '%'. You can benchmark this case using my code, while passing --adversarial to the benchmark program (benchmark --adversarial). In that scenario, the repeated calls to find are not a positive. All tested functions are slow in this adversarial case, but the new code is about 50% slower. That's to be expected. If you do expect strings to contain a large fraction of '%' characters, then the PR is not beneficial. But the point of the PR is that on realistic inputs, the PR can multiply the performance by 10x.

Results on my macBook (LLVM14, you can run your own benchmarks):
Capture d’écran, le 2023-10-22 à 12 20 27

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Empirically, doing a count + reserve does not improve the performance in the adversarial case where every character is a '%'. It makes the code slightly more complicated, however.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Remember that std::string_view instances are non-allocating

Yes, but escaped_file_path is a string, not a string_view, and that's what being appended to.

Empirically, doing a count + reserve does not improve the performance in the adversarial case

Honestly, I'm not worried about maximum performance here. I'd rather have predictable worst-case performance.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@bnoordhuis I'm going to merge this in a couple of hours since this is not a review that blocks.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd rather have predictable worst-case performance.

Appending character-by-character to an std::string in C++ provides predictable performance: https://lemire.me/blog/2023/10/23/appending-to-an-stdstring-character-by-character-how-does-the-capacity-grow/

It is a common usage pattern in C++ (with std::vector, std::string, etc.). Complexity-wise, it is linear time with or withour reserve. A reserve may improve the performance, but it does not change the complexity of the algorithm. In my mind, you'd only want to do a reserve if it improves the real-world performance. It is effectively a performance/efficiency optimization.

Bugs are always possible, but in my mind, neither this PR nor the previous code (written by @anonrig I think) could degenerate performance-wise. They use linear time, safe algorithms.

I should stress that this code is not nearly as good as it could be, but further gains require changing ada.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Appending character-by-character to an std::string in C++ provides predictable performance

You're making an observation based on a sample size of one. I suspect you're also working on the assumption that (re)allocation is a O(1) operation.

Now, the change in this PR doesn't make the code materially worse (only longer and more complex) so I'm not going to block it but it also doesn't make it materially better, it's just rummaging in the margin.

Count + alloc on the other hand is a material improvement because it changes the asymptotic runtime behavior.

(I'm not interested in discussing this further because time is finite and if I haven't gotten my point across by now, it's never going to happen.)

Copy link
Member Author

@lemire lemire Oct 24, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not interested in discussing this further because time is finite and if I haven't gotten my point across by now, it's never going to happen.

There must be a misunderstanding. I suspect I explained myself poorly. I can see no reason why we would even disagree. Let me try to clarify.

I am saying that if you benchmark the following function...

std::string my_string; 
while (my_string.size() <= volume) { my_string += "a"; }

... the time per entry (time/volume) is effectively constant with existing C++ runtime libraries (glibc++, libc++, Visual Studio). My blog post has an analysis, and even a benchmark (complete with C++ code).

Another way to put it is that the running time is proportional to volume. And yes, I mean "the measured, wall-clock time".

They make it so because it is common usage. E.g., it would be considered a bug if it were quadratic time. We would know. The dynamic arrays in Swift, early on, could be tricked into doing one allocation per append. They quickly updated it.

I suspect you're also working on the assumption that (re)allocation is a O(1) operation.

I am assuming that allocating (or copying) N bytes takes O(N) time. With this model, insertion in a dynamic array with capacity that is expanded by a constant factor (which is how C++ runtime libraries do it) ensures that inserting an element is constant time (amortized).

Let us consider a toy example where you construct a 1024-char string, doubling each time the capacity, and starting with a capacity of 1 byte... (that's not realistic but it is simple)

  • first character: allocate 1 byte.
  • second character: double the capacity to 2 bytes.
  • 4th character: double the capacity to 4 bytes.
  • 8th character: double the capacity to 8 bytes.
  • 16th character: double the capacity to 16 bytes.
  • 32nd character: double the capacity to 32 bytes.
  • 64th character: double the capacity to 64 bytes.
  • 128th character: double the capacity to 128 bytes.
  • 256th character: double the capacity to 256 bytes.
  • 512th character: double the capacity to 512 bytes.
  • 1024th character: double the capacity to 1024 bytes.

If you sum it up you get 1 + 2 + 4 + ... 512 + 1024 = 2047. And the result is general. To construct a string of N bytes, you need ~2N bytes being allocated. The key part is that it has amortized linear complexity. Both glibc++ and libc++ use a factor of 2 whereas Microsoft and Facebook prefer a smaller factor, but the analysis is the same.

Feel free to reach out to me at [email protected] if I did not clarify that point. We can set up a videoconference if needed.

@H4ad H4ad added the request-ci Add this label to start a Jenkins CI on a PR. label Oct 22, 2023
@github-actions github-actions bot removed the request-ci Add this label to start a Jenkins CI on a PR. label Oct 22, 2023
@nodejs-github-bot
Copy link
Collaborator

@anonrig anonrig added the commit-queue Add this label to land a pull request using GitHub Actions. label Oct 24, 2023
@nodejs-github-bot nodejs-github-bot removed the commit-queue Add this label to land a pull request using GitHub Actions. label Oct 24, 2023
@nodejs-github-bot nodejs-github-bot merged commit c89bae1 into nodejs:main Oct 24, 2023
25 checks passed
@nodejs-github-bot
Copy link
Collaborator

Landed in c89bae1

alexfernandez pushed a commit to alexfernandez/node that referenced this pull request Nov 1, 2023
PR-URL: nodejs#50288
Reviewed-By: Yagiz Nizipli <[email protected]>
Reviewed-By: Vinícius Lourenço Claro Cardoso <[email protected]>
Reviewed-By: Tobias Nießen <[email protected]>
targos pushed a commit that referenced this pull request Nov 11, 2023
PR-URL: #50288
Reviewed-By: Yagiz Nizipli <[email protected]>
Reviewed-By: Vinícius Lourenço Claro Cardoso <[email protected]>
Reviewed-By: Tobias Nießen <[email protected]>
UlisesGascon pushed a commit that referenced this pull request Dec 11, 2023
PR-URL: #50288
Reviewed-By: Yagiz Nizipli <[email protected]>
Reviewed-By: Vinícius Lourenço Claro Cardoso <[email protected]>
Reviewed-By: Tobias Nießen <[email protected]>
@UlisesGascon UlisesGascon mentioned this pull request Dec 12, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
author ready PRs that have at least one approval, no pending requests for changes, and a CI started. c++ Issues and PRs that require attention from people who are familiar with C++. commit-queue-squash Add this label to instruct the Commit Queue to squash all the PR commits into the first one. needs-ci PRs that need a full CI run. whatwg-url Issues and PRs related to the WHATWG URL implementation.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants