@folkertdev
Member

I'm picking a chunk size of 16 because on SSE the loop is simply unrolled, while targets with 256-bit wide registers can actually use the full width.

The table size is always a power of 2: either 1 << 16 for state.head, or between 1 << 8 and 1 << 15 for state.prev (it's based on the window size). The length is therefore always a multiple of 16, so it's totally legit to use chunks here, and there is no risk of ignoring trailing elements.
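To illustrate the reasoning above, here is a minimal sketch of a chunked slide-hash step. It is not the code from this PR: the function name `slide_hash_chain`, the `u16` element type, and the `CHUNK` constant are assumptions made for the example, and the saturating subtraction mirrors the classic zlib behaviour of clamping stale positions to 0.

```rust
/// Hypothetical sketch: subtract `wsize` from every table entry,
/// clamping at 0, processing the table in fixed-size chunks.
fn slide_hash_chain(table: &mut [u16], wsize: u16) {
    // Assumed chunk size: 16 x u16 = 256 bits, i.e. one 256-bit register
    // or two 128-bit registers.
    const CHUNK: usize = 16;

    // Table lengths are powers of two of at least 1 << 8, so they are
    // always a multiple of CHUNK and no remainder is left unprocessed.
    debug_assert_eq!(table.len() % CHUNK, 0);

    for chunk in table.chunks_exact_mut(CHUNK) {
        // The fixed-length inner loop gives the compiler a chance to
        // unroll or vectorize it.
        for m in chunk.iter_mut() {
            *m = m.saturating_sub(wsize);
        }
    }
}
```

Calling it on a 256-entry table with `wsize = 256` turns an entry of 300 into 44 and an entry of 5 into 0. Whether the inner loop is actually vectorized depends on the target and compiler version, so the generated assembly is worth checking.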

@codecov

codecov bot commented Feb 20, 2025

Codecov Report

All modified and coverable lines are covered by tests ✅

Files with missing lines Coverage Δ
zlib-rs/src/deflate/slide_hash.rs 99.14% <100.00%> (+0.03%) ⬆️

@folkertdev folkertdev requested a review from bjorn3 February 20, 2025 14:54
@folkertdev
Member Author

On NEON, our custom version appears to generate better code (hard to judge, but there are fewer instructions in the hot loop):

https://godbolt.org/z/sd3Me8fnG

@folkertdev
Member Author

Turns out a chunk size of 32 generates less code, because that appears to be how much LLVM wants to unroll on both aarch64 and x86_64 (probably based on the number of concurrent loads the CPU can do).

@folkertdev folkertdev merged commit e53c1a6 into main Feb 20, 2025
20 checks passed
@folkertdev folkertdev deleted the vectorize-slide-hash branch February 20, 2025 16:08