
Call CRT wmemcmp/wmemchr when possible in char_traits for better performance #4873

Merged: 6 commits into microsoft:main, Aug 8, 2024

Conversation

@mcfi (Contributor) commented Aug 2, 2024

__builtin_wmemcmp and __builtin_wmemchr are simply lowered to wchar_t-by-wchar_t loops by the MSVC backend, which may perform worse than the vectorized CRT wmemcmp and wmemchr. Therefore, for non-constexpr cases in string view, we call the CRT functions instead of the builtins.
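For context, a minimal sketch of the dispatch pattern this describes (hypothetical names, assuming C++20 std::is_constant_evaluated; not the actual <__msvc_string_view.hpp> code): keep the builtin on the constant-evaluation path so char_traits remains usable in constant expressions, and defer to the CRT at run time.

```cpp
// Hypothetical sketch of the constexpr/runtime dispatch described above; not the STL's code.
#include <cstddef>     // std::size_t
#include <cwchar>      // std::wmemcmp, std::wmemchr
#include <type_traits> // std::is_constant_evaluated (C++20)

constexpr int sketch_wide_compare(const wchar_t* lhs, const wchar_t* rhs, std::size_t count) noexcept {
    if (std::is_constant_evaluated()) {
        return __builtin_wmemcmp(lhs, rhs, count); // usable in constant expressions
    }
    return std::wmemcmp(lhs, rhs, count); // runtime path: CRT implementation, potentially vectorized
}

constexpr const wchar_t* sketch_wide_find(const wchar_t* haystack, std::size_t count, wchar_t needle) noexcept {
    if (std::is_constant_evaluated()) {
        return __builtin_wmemchr(haystack, needle, count);
    }
    return std::wmemchr(haystack, needle, count);
}
```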

@mcfi requested a review from a team as a code owner on August 2, 2024, 15:35
@AlexGuteniev (Contributor)
I'd like to see a benchmark and its results; I expect the maintainers would be interested in that too.
#4387 is an example of a similar benchmark added to this repo.

I also recall my past observation that wmemchr is slow and not vectorized; see DevCom-1614562. If that is still the case, I suggest reusing the std::find vectorization implementation instead of wmemchr.
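For reference, a rough sketch of what such a benchmark might look like, in the Google Benchmark style of the repo's existing benchmarks; the function name, sizes, and data layout here are illustrative assumptions, not taken from #4387 or this PR.

```cpp
// Illustrative wmemchr benchmark sketch (Google Benchmark); names and sizes are assumptions.
#include <benchmark/benchmark.h>
#include <cstddef>
#include <cwchar>
#include <vector>

void bm_wmemchr(benchmark::State& state) {
    const auto n = static_cast<std::size_t>(state.range(0));
    std::vector<wchar_t> haystack(n, L'a');
    haystack.back() = L'b'; // put the needle last so every call scans the whole buffer
    for (auto _ : state) {
        benchmark::DoNotOptimize(std::wmemchr(haystack.data(), L'b', n));
    }
}
BENCHMARK(bm_wmemchr)->Range(64, 1 << 18);

BENCHMARK_MAIN();
```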

@StephanTLavavej added the performance ("Must go faster") label on Aug 2, 2024
@StephanTLavavej (Member) commented Aug 2, 2024

@mcfi's internal OS-PR-11111227 "[Perf] Optimize wmemcmp/wmemchr in UCRT" has enhanced <wchar.h> with vectorized implementations for x64 and ARM64, which have been benchmarked there as up to 10x faster. Although it'll be some unspecified amount of time before we're able to pick up updated UCRT headers, I'm fine with calling the UCRT now - this is our historical practice and only in highly unusual cases (e.g. #4654) do we avoid the UCRT.

Since this is deferring to our usual dependency, I also don't think it's absolutely necessary to add benchmarks to the STL repo.

Also, we're already calling wmemcmp and wmemchr in C++14 mode (expand context below the diffs), so I have absolutely no concerns here.

@StephanTLavavej changed the title from "Call CRT wmemcmp/wmemchr when possible in string view for better performance" to "Call CRT wmemcmp/wmemchr when possible in char_traits for better performance" on Aug 2, 2024
@StephanTLavavej (Member)
Updated title - the header is <__msvc_string_view.hpp> but this applies to char_traits and thus std::wstring etc.

@AlexGuteniev (Contributor)
Should anything be done with this then?

```cpp
// TRANSITION, ABI: preserved for binary compatibility
const void* __stdcall __std_find_trivial_unsized_2(const void* const _First, const uint16_t _Val) noexcept {
    // TRANSITION, DevCom-1614562: not trying wmemchr
    return __std_find_trivial_unsized_impl(_First, _Val);
}
```

STL/stl/inc/xutility, lines 6005 to 6017 at a357ff1:

```cpp
#else // ^^^ _USE_STD_VECTOR_ALGORITHMS / !_USE_STD_VECTOR_ALGORITHMS vvv
    if constexpr (sizeof(_Iter_value_t<_InIt>) == 1) {
        const auto _First_ptr = _STD _To_address(_First);
        const auto _Result    = static_cast<remove_reference_t<_Iter_ref_t<_InIt>>*>(
            _CSTD memchr(_First_ptr, static_cast<unsigned char>(_Val), static_cast<size_t>(_Last - _First)));
        if constexpr (is_pointer_v<_InIt>) {
            return _Result ? _Result : _Last;
        } else {
            return _Result ? _First + (_Result - _First_ptr) : _Last;
        }
    }
    // TRANSITION, DevCom-1614562: not trying wmemchr
#endif // ^^^ !_USE_STD_VECTOR_ALGORITHMS ^^^
```

STL/stl/inc/xutility, lines 6058 to 6064 at a357ff1:

```cpp
_NODISCARD constexpr _It _Find_unchecked(_It _First, const _Se _Last, const _Ty& _Val, _Pj _Proj = {}) {
    // TRANSITION, DevCom-1614562: not trying wmemchr
    // Only single-byte elements are suitable for unsized optimization
    constexpr bool _Single_byte_elements = sizeof(_Iter_value_t<_It>) == 1;
    constexpr bool _Is_sized             = sized_sentinel_for<_Se, _It>;
    if constexpr (_Vector_alg_in_find_is_safe<_It, _Ty>
```
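For illustration, a self-contained sketch of the idea behind "trying wmemchr" in the paths quoted above: the same shape as the existing memchr branch, but for contiguous wchar_t ranges (hypothetical helper, not the proposed STL change).

```cpp
// Hypothetical sketch mirroring the memchr branch above for wchar_t elements; not the STL's code.
#include <cstddef> // std::size_t
#include <cwchar>  // std::wmemchr

const wchar_t* sketch_find_wchar(const wchar_t* first, const wchar_t* last, wchar_t val) noexcept {
    const wchar_t* result = std::wmemchr(first, val, static_cast<std::size_t>(last - first));
    return result ? result : last; // std::find convention: return last when not found
}
```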

@StephanTLavavej self-assigned this on Aug 6, 2024
@StephanTLavavej (Member)
Good catch @AlexGuteniev! I think we can handle those in a followup PR (except that changing the ABI-retained function is pointless). I've recorded a note.

I'm mirroring this to the MSVC-internal repo - please notify me if any further changes are pushed.

@AlexGuteniev (Contributor)
> except that changing the ABI-retained function is pointless

I don't think so. The function used to contain the optimization that we reverted to make ASAN happy.
Using wmemchr brings that optimization back, mitigating the cost of keeping ASAN happy. (Not fully eliminating it, though, as there are 32-bit and 64-bit element versions too.)

@StephanTLavavej merged commit b5285d1 into microsoft:main on Aug 8, 2024
39 checks passed
@StephanTLavavej (Member)
Thanks for this (future) performance improvement! 🚀 📆 ⏳
