Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Further optimize find_first_of #4497

Closed
4 tasks done
AlexGuteniev opened this issue Mar 22, 2024 · 2 comments · Fixed by #4744
Closed
4 tasks done

Further optimize find_first_of #4497

AlexGuteniev opened this issue Mar 22, 2024 · 2 comments · Fixed by #4744
Labels
fixed Something works now, yay! performance Must go faster

Comments

@AlexGuteniev
Copy link
Contributor

AlexGuteniev commented Mar 22, 2024

  • For one-element needle use find
  • For more than 16 bytes needle check it by parts
  • Do something for 4 and 8 byte size elements
  • Consider enabling for some cases basic_string::find_first_of (or prove it is not possible)
@frederick-vs-ja
Copy link
Contributor

  • Consider enabling for some cases basic_string::find_first_of (or prove it is not possible)

I think this fallback part (wouldn't be used with char or char8_t, but may be used with wide character types) is definitely doable.

STL/stl/inc/xstring

Lines 732 to 737 in 8e2d724

const auto _End = _Haystack + _Hay_size;
for (auto _Match_try = _Haystack + _Start_at; _Match_try < _End; ++_Match_try) {
if (_Traits::find(_Needle, _Needle_size, *_Match_try)) {
return static_cast<size_t>(_Match_try - _Haystack); // found a match
}
}

@StephanTLavavej StephanTLavavej added the performance Must go faster label Mar 27, 2024
@AlexGuteniev
Copy link
Contributor Author

I think this fallback part (wouldn't be used with char or char8_t, but may be used with wide character types) is definitely doable.

I now see that it is possible. I agree that it would work here. But I'm not sure whether the bitmap thing is worth doing for narrow string, if the vector algorithm can be used.

The bitmap would check up to 256 characters of the needle at a time, but only one character of the source string at a time. The pcmpestri would check 16 needle characters against 16 source string characters at a time. The later looks better, unless we usually deal with short strings and long needles, which is not what I'd expect.

Even for wide characters, the pcmpestri would check 8 chars of the needle and 8 chars of the source string at a time, and this looks typically better than what the bitmap gives. And the bitmap may fail in this case.

Of course, there's more than just number of comparisons. There are different instructions, with different latencies and throughputs. Overall, the bitmap is expected to use more instructions, but these are the usual fast ones, unlike high latency pcmpestri, so one iteration of the bitmap loop is likely to be faster. All this should be measured. But my expectation is that vectorized find_first_of may be better than a bitmap.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
fixed Something works now, yay! performance Must go faster
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants