Vectorize more algorithms for x86 / x64 using SSE4.2 and/or AVX2 #4415
I mentioned this on Discord, but it seems some of it didn't make it into edits, so I'll just
Fair chance some of the above aren't worth the effort, but I don't have the numbers or intuition to guess which. Nor do I have any experience or knowledge regarding ARM vectorization.
This sounds like a reasonable analysis to us. We would want to consider PRs for this, with benchmarks, for individual algorithms at a time (not all at once please!). #813 tracks ARM64 vectorization and is orthogonal to this issue.
Speeding things up for 92% of people is definitely worth it. Needs a cpuid check so the last 8% don't crash, but they can just fall back to the scalar version. But yes, 11% is significantly less. And that's 11% of gamers, who tend to have better processors than average. And that's for the third level of AVX512; std::remove needs VBMI2, which is the fifth level. Though judging by the difference between first, 11.26%, and third, 11.14%, fair chance fifth is also about 11%. But while 11% (or whatever the real number is) is a small fraction, it's 11% of Windows installations, which is still many millions. That's why that 'doubt' is there. I don't know how Microsoft's priorities look at that scale. (Definitely should take the AVX2-capable ones first, though.)
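The "cpuid check + scalar fallback" idea mentioned above can be sketched roughly as below. This is a hypothetical illustration, not the STL's actual dispatch code: `count_zeros` and its variants are made-up names, and the feature check uses GCC/Clang's `__builtin_cpu_supports` (MSVC would query `__cpuid`/`__cpuidex` instead).

```cpp
#include <cstddef>

// Scalar fallback: always correct, needs no special instructions.
static std::size_t count_zeros_scalar(const int* p, std::size_t n) {
    std::size_t c = 0;
    for (std::size_t i = 0; i < n; ++i)
        c += (p[i] == 0);
    return c;
}

// Stand-in for an AVX2 implementation; a real one would use intrinsics
// and must only be called after the runtime check succeeds.
static std::size_t count_zeros_avx2(const int* p, std::size_t n) {
    return count_zeros_scalar(p, n);
}

// Dispatch based on a one-time runtime CPU feature check, so binaries
// still run (via the scalar path) on CPUs without AVX2.
std::size_t count_zeros(const int* p, std::size_t n) {
#if defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
    static const bool has_avx2 = __builtin_cpu_supports("avx2");
#else
    static const bool has_avx2 = false; // non-x86, or compiler without the builtin
#endif
    return has_avx2 ? count_zeros_avx2(p, n) : count_zeros_scalar(p, n);
}
```

The key property is that the check happens at run time, not compile time, so one binary serves both the 92% and the 8%.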
We're still talking about a very small fraction of users, because not many programs spend a significant amount of time (if any at all) in these algorithms. I was pleased to learn about #3617, in the sense that there's at least one attempt to use the improved algorithm on a large amount of data.
I'm not sure whether such a strategy is viable:
Tracking any remaining algorithms to vectorize. Algorithms optimized via C runtime library functions, like `memcpy`, count as vectorized too, as those functions are optimized. See also #7.

- `for_each`, `for_each_n`
- `all_of`, `any_of`, `none_of`
- `contains`
- `contains_subrange`
- `find`
- `find_if`, `find_if_not`
- `find_last`
- `find_last_if`, `find_last_if_not`
- `find_end`
- `find_first_of`
- `adjacent_find`
- `count`
- `count_if`
- `mismatch`
- `equal`
- `search`
- `search_n`
- `starts_with`, `ends_with`
- `fold` family
- `copy`, `copy_n`, `copy_backward`
- `copy_if`
- `move`, `move_backward`
- `swap`
- `swap_ranges`
- `iter_swap`
- `transform`
- `replace`
- `replace_if`
- `replace_copy`, `replace_copy_if`
- `fill`, `fill_n`
- `generate`
- `remove`
- `remove_if`
- `remove_copy`
- `remove_copy_if`
- `unique`
- `unique_copy`
- `reverse`, `reverse_copy`
- `rotate`
- `rotate_copy`
- `shift_left`, `shift_right`
- `shuffle` family
- `sample`
- `is_partitioned`, `partition_point`
- `partition` family
- `sort` family, `nth_element`
- `is_sorted_until`
- `binary_search` family
- `includes`
- `set_*` family
- `merge` family
- `*_heap` family
- `minmax_element` family
- `minmax` family
- `clamp`
- `lexicographical_compare`
- `*_permutation` family
- `iota`
- `accumulate`
- `inner_product`
- `reduce`, `transform_reduce`, `*_scan` family
- `adjacent_difference`
- `partial_sum`
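To illustrate the kind of vectorization being tracked here, below is a minimal sketch of a byte-wise `find` using SSE4.2-era baseline intrinsics (SSE2, in fact), with a scalar tail loop. This is in the spirit of the STL's vectorized algorithms but is not its actual implementation; `find_byte` is a hypothetical name, the SSE2 path is gated so the code still compiles (scalar-only) elsewhere, and `__builtin_ctz` is GCC/Clang-specific (MSVC would use `_BitScanForward`).

```cpp
#include <cstddef>

#if defined(__SSE2__)
#include <emmintrin.h> // SSE2 intrinsics
#endif

// Hypothetical sketch: return pointer to first occurrence of val in
// [first, last), or last if not found.
const unsigned char* find_byte(const unsigned char* first,
                               const unsigned char* last,
                               unsigned char val) {
#if defined(__SSE2__)
    // Broadcast the needle byte into all 16 lanes of an XMM register.
    const __m128i needle = _mm_set1_epi8(static_cast<char>(val));
    while (last - first >= 16) {
        const __m128i chunk =
            _mm_loadu_si128(reinterpret_cast<const __m128i*>(first));
        // Per-byte equality compare, then collapse to a 16-bit mask,
        // one bit per lane.
        const unsigned mask = static_cast<unsigned>(
            _mm_movemask_epi8(_mm_cmpeq_epi8(chunk, needle)));
        if (mask != 0)
            return first + __builtin_ctz(mask); // lowest set bit = first match
        first += 16;
    }
#endif
    // Scalar tail (and full fallback when SSE2 isn't available).
    for (; first != last; ++first)
        if (*first == val)
            return first;
    return last;
}
```

The `_mm_movemask_epi8` trick is the common pattern: 16 comparisons happen per iteration, and a single scalar test detects whether any lane matched.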