Vectorize more algorithms for x86 / x64 using SSE4.2 and/or AVX2 #4415

AlexGuteniev · 2024-02-20T13:33:44Z

Tracking any remaining algorithms to vectorize.

Optimized via C runtime library functions, like memcpy, counts as vectorized too, as these functions are optimized.

See also #7.

Algorithm	Vectorized?	Further work	Comment
`for_each`,`for_each_n`	compiler	no
`all_of`,`any_of`,`none_of`	no	no	depends on predicate
`contains`	library	no
`contains_subrange`	library	no
`find`	library	no
`find_if`,`find_if_not`	no	no	depends on predicate
`find_last`	library	no
`find_last_if`,`find_last_if_not`	no	no	depends on predicate
`find_end`	library	no
`find_first_of`	library	no
`adjacent_find`	no	maybe	can be done in the library, although seems rarely useful
`count`	library	no	manual vectorization complete, auto vectorization rejected
`count_if`	no	no	auto vectorization takes too much additional code, rejected
`mismatch`	library	no
`equal`	library	no
`search`	library	no	vectorized for 1 and 2 bytes elements, no good results predicted for 4 and 8 bytes
`search_n`	library	yes
`starts_with`, `ends_with`	library	no
`fold` family	compiler	no
`copy`, `copy_n`,`copy_backward`	library	no
`copy_if`	no	no	depends on predicate
`move`,`move_backward`	library	no
`swap`	compiler	no
`swap_ranges`	library	no
`iter_swap`	non-range	no
`transform`	compiler	no
`replace`	library	no	remove library vectorization if compiler implements DevCom-10610477
`replace_if`	no	no	compiler may implement DevCom-10610477, otherwise nothing can be done
`replace_copy`,`replace_copy_if`	compiler	no	remove workaround if compiler implements DevCom-10606350
`fill`, `fill_n`	library	no	yield to compiler in the future; see #3642, DevCom-10334038, DevCom-10334032; see also #3167
`generate`	compiler	no	compiler does it, and depends on predicate
`remove`	library	no
`remove_if`	no	no	depends on predicate
`remove_copy`	no	no	attempted and reverted due to slow AMD AVX2 mask stores
`remove_copy_if`	no	no	depends on predicate
`unique`	no	yes	PR in review
`unique_copy`	no	no	doesn't look feasible due to slow AMD AVX2 mask stores
`reverse`, `reverse_copy`	library	no
`rotate`	library	maybe	might be dedicated algorithm instead of reverse
`rotate_copy`	library	no
`shift_left`, `shift_right`	library	no
`shuffle` family	no	no	it is random
`sample`	no	no	it is random
`is_partition`,`partition_point`	no	no	depends on predicate
`partition` family	no	no	depends on predicate
`sort` family, `nth_element`	no	no	too complex to implement/maintain
`is_sorted_until`	no	yes	see the comment below
`binary_search` family	no	no	vectorization isn't expected to improve a lot
`includes`	no	no	too complex, successful vectorization unlikely
`set_*` family	no	no	too complex, successful vectorization unlikely
`merge` family	no	no	too complex, successful vectorization unlikely
`*_heap` family	no	no	too complex, successful vectorization unlikely
`minmax_element` family	library	no
`minmax` family	library	no
`clamp`	non-range	no
`lexicographical_compare`	library	no
`*_permutation` family	no	maybe	can be done in library, although complex and rare enough
`iota`	compiler	maybe	Vectorized for 4 and 8 bytes elements with some workaround, this can potentially be improved and extended by improving the compiler or by vectorizing manually
`accumulate`	compiler	no
`inner_product`	compiler	no
`reduce`,`transform_reduce`,	compiler	no
`*_scan` family	no	maybe	can be done in the library, although seems rarely useful
`adjacent_difference`	compiler	no
`partial_sum`	no	maybe

The text was updated successfully, but these errors were encountered:

Alcaro · 2024-02-20T17:30:15Z

I mentioned this on Discord, but it seems some of it didn't make it into edits, so I'll just ~~take credit~~ repost it.

Vectorizing find_first_of in the general case does indeed sound impossible, but specializing it for the most common sizes of s_{first,last} would be significantly more possible.
For int8 and int16 (and less than 16 bytes to search for), find_first_of looks like an almost perfect match for one of the SSE4.2 string compare instructions. (Not entirely sure which one, though; it's a complicated instruction, and the documentation is written more from a hardware/implementation perspective than user perspective.)
is_sorted sounds easy. Just load pointer and pointer+sizeof(int32), compare less-or-equal, and check if all true. Sure, it's unaligned, but it's faster than scalar. (For unsigned compare, xor both sides with 0x80000000 then do signed compare.) (Alternatively, you could perform aligned loads then use SSSE3 _mm_alignr_epi8 to get the bytes where you want them. AVX2 _mm256_alignr_epi8 is, as far as I can determine, useless - it can't transfer bytes across 128bit lanes.)
remove looks like a good match for AVX512_VBMI2 vpcompress*, but AVX512 is currently rare, so I doubt that effort is justified. At least not yet, it can be revisited a few years.

Fair chance some of the above aren't worth the effort, but I don't have the numbers or intuition to guess which. Nor do I have any experience or knowledge regarding ARM vectorization.

StephanTLavavej · 2024-02-21T22:25:18Z

This sounds like a reasonable analysis to us. We would want to consider PRs for this, with benchmarks, for individual algorithms at a time (not all at once please!).

#813 tracks ARM64 vectorization and is orthogonal to this issue.

jovibor · 2024-02-22T09:33:57Z

but AVX512 is currently rare, so I doubt that effort is justified.

According to the Steam Hardware & Software Survey: January 2024, AVX512 has only ~11% incidence, so it's definitely not worth it. Even the relatively old AVX2 has only 92%.

Alcaro · 2024-02-22T10:50:58Z

Speeding things up for 92% of people is definitely worth it. Needs a cpuid check so the last 8% don't crash, but they can just fall back to the scalar version.

But yes, 11% is significantly less. And that's 11% of gamers, who tend to have better processors than average. And that's for the third level of AVX512; std::remove needs VBMI2, which is the fifth level.

Though judging by the difference between first, 11.26%, and third, 11.14%, fair chance fifth is also about 11%.

But while 11% (or whatever the real number is) is a small fraction, it's 11% of Windows installations, which is still many millions.

That's why that 'doubt' is there. I don't know how Microsoft's priorities look at that scale.

(Definitely should take the AVX2-capable ones first, though.)

AlexGuteniev · 2024-02-22T11:56:00Z

We're still talking about very small fraction of users, because not many programs would spend in these algorithms significant amount of the time (if any time at all). I was pleased to know about #3617 in the sense there's at least one attempt to use the improved algorithm on a large amount of data.

lexicographical_compare / mismatch and improving minmax look like the priority directions, many programs do compare strings and arrays, and minmax is apparently used.

frederick-vs-ja · 2024-04-12T10:54:53Z

I guess swap should be considered conditionally performed on range, see #2683.

Not sure whether such strategy is viable:

(We could possibly separate out the pragma that drags in the import lib / static lib from the rest of the <yvals.h> machinery.)

StephanTLavavej added the performance Must go faster label Feb 21, 2024

AlexGuteniev mentioned this issue Mar 10, 2024

Vectorize find_first_of for 8 and 16 bit elements with SSE4.2 pcmpestri #4466

Merged

This comment was marked as resolved.

Sign in to view

This comment was marked as outdated.

Sign in to view

AlexGuteniev changed the title ~~Vectorize more algorithms~~ Vectorize more algorithms for x86 / x64 using SSE4.2 and/or AVX2 Sep 11, 2024

This comment was marked as resolved.

Sign in to view

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Vectorize more algorithms for x86 / x64 using SSE4.2 and/or AVX2 #4415

Vectorize more algorithms for x86 / x64 using SSE4.2 and/or AVX2 #4415

AlexGuteniev commented Feb 20, 2024 •

edited

Loading

Alcaro commented Feb 20, 2024 •

edited

Loading

StephanTLavavej commented Feb 21, 2024

jovibor commented Feb 22, 2024

Alcaro commented Feb 22, 2024

AlexGuteniev commented Feb 22, 2024

This comment was marked as resolved.

frederick-vs-ja commented Apr 12, 2024

This comment was marked as outdated.

This comment was marked as resolved.

Vectorize more algorithms for x86 / x64 using SSE4.2 and/or AVX2 #4415

Vectorize more algorithms for x86 / x64 using SSE4.2 and/or AVX2 #4415

Comments

AlexGuteniev commented Feb 20, 2024 • edited Loading

Alcaro commented Feb 20, 2024 • edited Loading

StephanTLavavej commented Feb 21, 2024

jovibor commented Feb 22, 2024

Alcaro commented Feb 22, 2024

AlexGuteniev commented Feb 22, 2024

This comment was marked as resolved.

frederick-vs-ja commented Apr 12, 2024

This comment was marked as outdated.

This comment was marked as resolved.

AlexGuteniev commented Feb 20, 2024 •

edited

Loading

Alcaro commented Feb 20, 2024 •

edited

Loading