Handle the remainder in MemoryExtensions.Count vectorized #82687
Saves an additional vpbroadcastb.
Avoids the signed integer division.
Benchmarking showed that the cost is quite high, so for just a few elements the scalar loop seems better.
Tagging subscribers to this area: @dotnet/area-system-memory
Issue Details
Author: gfoidl
Assignees: -
Labels:
Milestone: -
How much higher? The / 2 feels a bit arbitrary.
The Benchmarks for
Thanks.
Played around with these ideas (back then when creating this PR, may not be fully fleshed out), but I don't think that's something that should be merged here, as
Description
In the current implementation the remainder of the vectorized code-path is processed scalar.
This PR processes the remainder vectorized too.
This is done by reading a vector from the end, comparing with the target vector, and extracting the most significant bits as usual.
As some elements may overlap now, we need to shift them off from the mask to get the correct count.
Benchmarking showed that the cost of doing the remainder vectorized is higher than scalar processing if there are just a few elements.
Thus the remainder is processed vectorized only if the remaining length is more than half of the vector size.
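The steps above can be sketched roughly as follows. This is a minimal illustration for `byte` only, not the code from this PR: the class `CountSketch`, its `Count` method, and the local names are hypothetical, and it assumes .NET 7 or later for `Vector256.LoadUnsafe` and `ExtractMostSignificantBits`.

```csharp
using System;
using System.Numerics;
using System.Runtime.InteropServices;
using System.Runtime.Intrinsics;

// Hypothetical sketch, not the actual MemoryExtensions.Count implementation.
static class CountSketch
{
    public static int Count(ReadOnlySpan<byte> span, byte value)
    {
        int count = 0;
        int i = 0;
        int width = Vector256<byte>.Count; // 32 elements for byte

        if (Vector256.IsHardwareAccelerated && span.Length >= width)
        {
            ref byte start = ref MemoryMarshal.GetReference(span);
            Vector256<byte> target = Vector256.Create(value);

            // Main loop: one full vector per iteration.
            for (; i <= span.Length - width; i += width)
            {
                Vector256<byte> data = Vector256.LoadUnsafe(ref start, (nuint)i);
                uint mask = Vector256.Equals(data, target).ExtractMostSignificantBits();
                count += BitOperations.PopCount(mask);
            }

            int remaining = span.Length - i;
            if (remaining > width / 2)
            {
                // Read the last full vector, counted from the end of the span.
                Vector256<byte> data = Vector256.LoadUnsafe(ref start, (nuint)(span.Length - width));
                uint mask = Vector256.Equals(data, target).ExtractMostSignificantBits();

                // The lowest (width - remaining) lanes were already counted
                // by the main loop, so shift them off the mask.
                mask >>= width - remaining;
                return count + BitOperations.PopCount(mask);
            }
        }

        // Scalar path: short inputs, or a remainder of at most half a vector.
        for (; i < span.Length; i++)
        {
            if (span[i] == value)
            {
                count++;
            }
        }

        return count;
    }
}
```

Bit k of the extracted mask corresponds to element k of the vector, and the overlapping elements sit at the start of the last vector, so a right shift by `width - remaining` drops exactly the lanes that were counted before.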
Benchmarks
Benchmark-code, run on win-x64 with .NET 8 Preview 1.
Benchmarks are done for the lengths Vector256<T>.Count + 1 and Vector256<T>.Count * 2 - 1, so both extreme cases for the remainder are covered.
byte
short
int
long
I have some other ideas on how to improve perf for Count, but a) I'd like to keep the changes separate to make it easier to track the improvements, and b) at the moment it's a bit difficult with time for me...