Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Vectorize more algorithms for x86 / x64 using SSE4.2 and/or AVX2 #4415

Open
AlexGuteniev opened this issue Feb 20, 2024 · 9 comments
Open

Vectorize more algorithms for x86 / x64 using SSE4.2 and/or AVX2 #4415

AlexGuteniev opened this issue Feb 20, 2024 · 9 comments
Labels
performance Must go faster

Comments

@AlexGuteniev
Copy link
Contributor

AlexGuteniev commented Feb 20, 2024

Tracking any remaining algorithms to vectorize.

Optimized via C runtime library functions, like memcpy, counts as vectorized too, as these functions are optimized.

See also #7.

Algorithm Vectorized? Further work Comment
for_each,for_each_n compiler no
all_of,any_of,none_of no no depends on predicate
contains library no
contains_subrange library no
find library no
find_if,find_if_not no no depends on predicate
find_last library no
find_last_if,find_last_if_not no no depends on predicate
find_end library no
find_first_of library no
adjacent_find no maybe can be done in the library, although seems rarely useful
count library no manual vectorization complete, auto vectorization rejected
count_if no no auto vectorization takes too much additional code, rejected
mismatch library no
equal library no
search library no vectorized for 1 and 2 bytes elements, no good results predicted for 4 and 8 bytes
search_n library yes
starts_with, ends_with library no
fold family compiler no
copy, copy_n,copy_backward library no
copy_if no no depends on predicate
move,move_backward library no
swap compiler no
swap_ranges library no
iter_swap non-range no
transform compiler no
replace library no remove library vectorization if compiler implements DevCom-10610477
replace_if no no compiler may implement DevCom-10610477, otherwise nothing can be done
replace_copy,replace_copy_if compiler no remove workaround if compiler implements DevCom-10606350
fill, fill_n library no yield to compiler in the future; see #3642, DevCom-10334038, DevCom-10334032; see also #3167
generate compiler no compiler does it, and depends on predicate
remove library no
remove_if no no depends on predicate
remove_copy no no attempted and reverted due to slow AMD AVX2 mask stores
remove_copy_if no no depends on predicate
unique no yes PR in review
unique_copy no no doesn't look feasible due to slow AMD AVX2 mask stores
reverse, reverse_copy library no
rotate library maybe might be dedicated algorithm instead of reverse
rotate_copy library no
shift_left, shift_right library no
shuffle family no no it is random
sample no no it is random
is_partition,partition_point no no depends on predicate
partition family no no depends on predicate
sort family, nth_element no no too complex to implement/maintain
is_sorted_until no yes see the comment below
binary_search family no no vectorization isn't expected to improve a lot
includes no no too complex, successful vectorization unlikely
set_* family no no too complex, successful vectorization unlikely
merge family no no too complex, successful vectorization unlikely
*_heap family no no too complex, successful vectorization unlikely
minmax_element family library no
minmax family library no
clamp non-range no
lexicographical_compare library no
*_permutation family no maybe can be done in library, although complex and rare enough
iota compiler maybe Vectorized for 4 and 8 bytes elements with some workaround, this can potentially be improved and extended by improving the compiler or by vectorizing manually
accumulate compiler no
inner_product compiler no
reduce,transform_reduce, compiler no
*_scan family no maybe can be done in the library, although seems rarely useful
adjacent_difference compiler no
partial_sum no maybe
@Alcaro
Copy link
Contributor

Alcaro commented Feb 20, 2024

I mentioned this on Discord, but it seems some of it didn't make it into edits, so I'll just take credit repost it.

  • Vectorizing find_first_of in the general case does indeed sound impossible, but specializing it for the most common sizes of s_{first,last} would be significantly more possible.
  • For int8 and int16 (and less than 16 bytes to search for), find_first_of looks like an almost perfect match for one of the SSE4.2 string compare instructions. (Not entirely sure which one, though; it's a complicated instruction, and the documentation is written more from a hardware/implementation perspective than user perspective.)
  • is_sorted sounds easy. Just load pointer and pointer+sizeof(int32), compare less-or-equal, and check if all true. Sure, it's unaligned, but it's faster than scalar. (For unsigned compare, xor both sides with 0x80000000 then do signed compare.) (Alternatively, you could perform aligned loads then use SSSE3 _mm_alignr_epi8 to get the bytes where you want them. AVX2 _mm256_alignr_epi8 is, as far as I can determine, useless - it can't transfer bytes across 128bit lanes.)
  • remove looks like a good match for AVX512_VBMI2 vpcompress*, but AVX512 is currently rare, so I doubt that effort is justified. At least not yet, it can be revisited a few years.

Fair chance some of the above aren't worth the effort, but I don't have the numbers or intuition to guess which. Nor do I have any experience or knowledge regarding ARM vectorization.

@StephanTLavavej StephanTLavavej added the performance Must go faster label Feb 21, 2024
@StephanTLavavej
Copy link
Member

This sounds like a reasonable analysis to us. We would want to consider PRs for this, with benchmarks, for individual algorithms at a time (not all at once please!).

#813 tracks ARM64 vectorization and is orthogonal to this issue.

@jovibor
Copy link
Contributor

jovibor commented Feb 22, 2024

  • but AVX512 is currently rare, so I doubt that effort is justified.

According to the Steam Hardware & Software Survey: January 2024, AVX512 has only ~11% incidence, so it's definitely not worth it. Even the relatively old AVX2 has only 92%.
image

@Alcaro
Copy link
Contributor

Alcaro commented Feb 22, 2024

Speeding things up for 92% of people is definitely worth it. Needs a cpuid check so the last 8% don't crash, but they can just fall back to the scalar version.

But yes, 11% is significantly less. And that's 11% of gamers, who tend to have better processors than average. And that's for the third level of AVX512; std::remove needs VBMI2, which is the fifth level.

image

Though judging by the difference between first, 11.26%, and third, 11.14%, fair chance fifth is also about 11%.

But while 11% (or whatever the real number is) is a small fraction, it's 11% of Windows installations, which is still many millions.

That's why that 'doubt' is there. I don't know how Microsoft's priorities look at that scale.

(Definitely should take the AVX2-capable ones first, though.)

@AlexGuteniev
Copy link
Contributor Author

We're still talking about very small fraction of users, because not many programs would spend in these algorithms significant amount of the time (if any time at all). I was pleased to know about #3617 in the sense there's at least one attempt to use the improved algorithm on a large amount of data.

lexicographical_compare / mismatch and improving minmax look like the priority directions, many programs do compare strings and arrays, and minmax is apparently used.

@AlexGuteniev

This comment was marked as resolved.

@frederick-vs-ja
Copy link
Contributor

I guess swap should be considered conditionally performed on range, see #2683.

Not sure whether such strategy is viable:

(We could possibly separate out the pragma that drags in the import lib / static lib from the rest of the <yvals.h> machinery.)

@AlexGuteniev

This comment was marked as outdated.

@AlexGuteniev AlexGuteniev changed the title Vectorize more algorithms Vectorize more algorithms for x86 / x64 using SSE4.2 and/or AVX2 Sep 11, 2024
@AlexGuteniev

This comment was marked as resolved.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
performance Must go faster
Projects
None yet
Development

No branches or pull requests

5 participants