Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Auto-vectorize arrays swap #4991

Merged
merged 14 commits into from
Oct 24, 2024

Conversation

AlexGuteniev
Copy link
Contributor

@AlexGuteniev AlexGuteniev commented Sep 29, 2024

Resolves #2683

📜 The Approach

Instead of trying to call a separately compiled implementation in the import library, reimplement the specific case in headers the way that compiler is able to vectorize it.

Do memcpy swap by portions of nice length, and then tail, so that compiler can use registers as immediate storage, vector registers when possible. Due to compile-time known length this will unroll perfectly.

I didn't investigate a lot which length is nice, I think 64 is a good choice:

  • Not too much stack consumed in debug mode where there will be actual stack allocation
  • Enough to use AVX-512 if available
  • Not too much to use only SSE2
  • I hope it is good for Arm64 too 🤷

⚖️ Self swap check

[utility.swap]/7 says:

Effects: As if by swap_ranges(a, a + N, b).

[alg.swap]/2 says:

Preconditions: The two ranges [first1, last1) and [first2, last2) do not overlap.

Therefore it is UB to swap overlapping arrays.

Added an assert for that and removed the check.

@frederick-vs-ja reported LWG-4165, and the check preserved.

⏱️ Benchmark results

Benchmark main this
std_swap<1, uint8_t> 1.22 ns 1.20 ns
std_swap<5, uint8_t> 2.83 ns 1.44 ns
std_swap<15, uint8_t> 10.7 ns 1.92 ns
std_swap<26, uint8_t> 14.9 ns 1.93 ns
std_swap<38, uint8_t> 25.6 ns 2.41 ns
std_swap<60, uint8_t> 32.4 ns 2.64 ns
std_swap<125, uint8_t> 62.0 ns 4.31 ns
std_swap<800, uint8_t> 399 ns 22.0 ns
std_swap<3000, uint8_t> 1490 ns 90.4 ns
std_swap<9000, uint8_t> 4365 ns 203 ns
std_swap_ranges<uint8_t>/1 1.97 ns 1.91 ns
std_swap_ranges<uint8_t>/5 4.03 ns 3.58 ns
std_swap_ranges<uint8_t>/15 5.26 ns 5.01 ns
std_swap_ranges<uint8_t>/26 3.55 ns 3.12 ns
std_swap_ranges<uint8_t>/38 5.34 ns 4.53 ns
std_swap_ranges<uint8_t>/60 5.61 ns 5.05 ns
std_swap_ranges<uint8_t>/125 7.01 ns 6.50 ns
std_swap_ranges<uint8_t>/800 19.1 ns 16.9 ns
std_swap_ranges<uint8_t>/3000 81.8 ns 79.4 ns
std_swap_ranges<uint8_t>/9000 139 ns 138 ns

🥇 Results interpretation

  • Improvement even for arrays of 5 elements!
  • Sometimes wins and sometimes loses to manually vectorized implementation
    • The auto vectorization of middle sized arrays benefits from knowing the number of elements in advance and full unrolling
    • The manual vectorization of large sized array benefits from CPU dispatch and small loop body

@AlexGuteniev AlexGuteniev requested a review from a team as a code owner September 29, 2024 10:05
@philnik777
Copy link

Can't you just throw in a __restrict? Works just fine with Clang/libc++: https://godbolt.org/z/Ez1hM11jG

@AlexGuteniev
Copy link
Contributor Author

Can't you just throw in a __restrict? Works just fine with Clang/libc++:

No, unfortunately. Pragma loop ivdep also doesn't help

@CaseyCarter CaseyCarter added the performance Must go faster label Sep 29, 2024
@StephanTLavavej StephanTLavavej self-assigned this Sep 29, 2024
@AlexGuteniev
Copy link
Contributor Author

I want to note that ranges::swap does not have any self-swap check, unlike std::swap.

@StephanTLavavej StephanTLavavej changed the title Vectorize arrays swap Auto-vectorize arrays swap Oct 22, 2024
@StephanTLavavej
Copy link
Member

Thanks! 🐈 I pushed fairly minor changes.

Results on my 5950X:

Benchmark Before After Speedup
std_swap<1, uint8_t> 2.98 ns 3.63 ns 0.82
std_swap<5, uint8_t> 4.72 ns 3.84 ns 1.23
std_swap<15, uint8_t> 8.33 ns 3.87 ns 2.15
std_swap<26, uint8_t> 24.0 ns 3.65 ns 6.58
std_swap<38, uint8_t> 38.7 ns 3.86 ns 10.03
std_swap<60, uint8_t> 62.5 ns 3.87 ns 16.15
std_swap<125, uint8_t> 129 ns 4.75 ns 27.16
std_swap<800, uint8_t> 816 ns 21.4 ns 38.13
std_swap<3000, uint8_t> 3019 ns 80.6 ns 37.46
std_swap<9000, uint8_t> 9262 ns 241 ns 38.43
std_swap_ranges<uint8_t>/1 4.70 ns 4.48 ns 1.05
std_swap_ranges<uint8_t>/5 5.60 ns 6.22 ns 0.90
std_swap_ranges<uint8_t>/15 7.72 ns 7.75 ns 1.00
std_swap_ranges<uint8_t>/26 4.96 ns 4.97 ns 1.00
std_swap_ranges<uint8_t>/38 6.71 ns 6.69 ns 1.00
std_swap_ranges<uint8_t>/60 6.30 ns 6.31 ns 1.00
std_swap_ranges<uint8_t>/125 8.02 ns 7.97 ns 1.01
std_swap_ranges<uint8_t>/800 22.5 ns 23.8 ns 0.95
std_swap_ranges<uint8_t>/3000 86.2 ns 89.9 ns 0.96
std_swap_ranges<uint8_t>/9000 124 ns 123 ns 1.01

@StephanTLavavej StephanTLavavej removed their assignment Oct 22, 2024
@AlexGuteniev
Copy link
Contributor Author

Results on my 5950X:

A big difference in "After" columnbetween std_swap<9000, uint8_t> and std_swap_ranges<uint8_t>/9000 made me curious.
It is big for me too.

Looks like the main reason is vector over-alignment. adding alignas(64) to stack array makes significant improvement there.

I don't think the benchmark should be modified to have that though.

@CaseyCarter
Copy link
Contributor

Results on my 5950X:

A big difference in "After" columnbetween std_swap<9000, uint8_t> and std_swap_ranges<uint8_t>/9000 made me curious. It is big for me too.

Looks like the main reason is vector over-alignment. adding alignas(64) to stack array makes significant improvement there.

I don't think the benchmark should be modified to have that though.

If we know that alignment has a significant effect on the results, shouldn't we control it instead of letting it vary and skew the benchmark? Note that "control" doesn't mean "pick an unrealistic value that skews the results" but either pick a value that is representative or average over a set of values that are representative.

@AlexGuteniev
Copy link
Contributor Author

but either pick a value that is representative or average over a set of values that are representative

x64 stack is 16-aligned on function entry, but for multiple variables on the stack the location of each is limited by its own alignment and alignment of the variable next to it.

malloc has also 16 bytes alignment.

So 8 is a good skew maybe to imitate being next to a pointer.
Or 16 to imitate top or the only variable on stack or malloc allocation.

@StephanTLavavej StephanTLavavej self-assigned this Oct 23, 2024
@StephanTLavavej
Copy link
Member

I'm mirroring this to the MSVC-internal repo - please notify me if any further changes are pushed.

@StephanTLavavej
Copy link
Member

🔀 🚀 😻

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
performance Must go faster
Projects
Archived in project
Development

Successfully merging this pull request may close these issues.

std::swap of arrays, why is there no specialization for trivial types
5 participants