Consider SIMD for unmasking #36
Comments
It contains unsafe blocks already... so it won't be worse. I'll do benchmarks. Thanks!
FWIW, I've experimented with Faster, which would let us get rid of I feel this is something that the Faster API could improve on, so I've opened an issue on their bug tracker.
I have some C code with benchmarks of different block sizes. I'm not sure if it really gives "true" results, but you can try it anyway! The SIMD code assumes x86_64 with AVX2. https://gist.github.com/bluetech/36ac1d0b21864a4f42fa723de569e5f8
I believe benchmarks against SSE2 would be more interesting, because enabling AVX2 would require runtime detection, which will likely hurt small inputs (e.g. by interfering with inlining). On x86_64, by contrast, SSE2 can be assumed to be present unconditionally. But that is interesting nevertheless, thank you!
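For reference, a minimal sketch of what SSE2-only masking could look like, with no runtime detection involved. The names `apply_mask_sse2` and `xor_scalar` are hypothetical and not taken from this crate, and the sketch has not been benchmarked; it only illustrates the approach under the assumption that the buffer is masked in place.

```rust
/// Byte-at-a-time reference: byte i is XORed with mask[i % 4].
fn xor_scalar(buf: &mut [u8], mask: [u8; 4]) {
    for (i, b) in buf.iter_mut().enumerate() {
        *b ^= mask[i & 3];
    }
}

/// In-place masking, 16 bytes per step, using SSE2 on x86_64.
#[cfg(target_arch = "x86_64")]
pub fn apply_mask_sse2(buf: &mut [u8], mask: [u8; 4]) {
    use std::arch::x86_64::{
        __m128i, _mm_loadu_si128, _mm_set1_epi32, _mm_storeu_si128, _mm_xor_si128,
    };

    let mut chunks = buf.chunks_exact_mut(16);
    // SAFETY: SSE2 is part of the x86_64 baseline, the unaligned load/store
    // intrinsics accept any pointer, and every chunk is exactly 16 bytes.
    unsafe {
        // Repeat the 4-byte key across the register. x86_64 is little-endian,
        // so each 32-bit lane is laid out in memory as mask[0..4].
        let pattern = _mm_set1_epi32(i32::from_le_bytes(mask));
        for chunk in &mut chunks {
            let p = chunk.as_mut_ptr() as *mut __m128i;
            let v = _mm_loadu_si128(p as *const __m128i);
            _mm_storeu_si128(p, _mm_xor_si128(v, pattern));
        }
    }
    // 16 is a multiple of 4, so the tail starts at mask offset 0.
    xor_scalar(chunks.into_remainder(), mask);
}

/// Fallback for other architectures (an assumption of this sketch).
#[cfg(not(target_arch = "x86_64"))]
pub fn apply_mask_sse2(buf: &mut [u8], mask: [u8; 4]) {
    xor_scalar(buf, mask);
}
```

Since masking is its own inverse, calling `apply_mask_sse2` twice with the same key should restore the original buffer, which makes for a cheap sanity check next to the scalar version.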
Yes, I agree. I thought there might be a way to do the runtime detection just once and store the result in a function pointer or something like that, but it's probably difficult. Anyway, I am not sure SIMD actually helps. Assuming my benchmark is correct, on my CPU (
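A minimal sketch of the detect-once idea in Rust, assuming `std::sync::OnceLock` (stable since Rust 1.70) and the standard `is_x86_feature_detected!` macro. All function names here are hypothetical. Whether the indirect call costs more on small inputs than AVX2 saves is exactly the open question in this thread, and this sketch does not answer it.

```rust
use std::sync::OnceLock;

type MaskFn = fn(&mut [u8], [u8; 4]);

/// Scalar fallback: byte i is XORed with mask[i % 4].
fn mask_scalar(buf: &mut [u8], mask: [u8; 4]) {
    for (i, b) in buf.iter_mut().enumerate() {
        *b ^= mask[i & 3];
    }
}

/// AVX2 path, 32 bytes per step. Callers must ensure AVX2 is available.
#[cfg(target_arch = "x86_64")]
#[target_feature(enable = "avx2")]
unsafe fn mask_avx2_impl(buf: &mut [u8], mask: [u8; 4]) {
    use std::arch::x86_64::{
        __m256i, _mm256_loadu_si256, _mm256_set1_epi32, _mm256_storeu_si256, _mm256_xor_si256,
    };
    let mut chunks = buf.chunks_exact_mut(32);
    // SAFETY: each chunk is exactly 32 bytes and the load/store are unaligned.
    unsafe {
        let pattern = _mm256_set1_epi32(i32::from_le_bytes(mask));
        for chunk in &mut chunks {
            let p = chunk.as_mut_ptr() as *mut __m256i;
            let v = _mm256_loadu_si256(p as *const __m256i);
            _mm256_storeu_si256(p, _mm256_xor_si256(v, pattern));
        }
    }
    // 32 is a multiple of 4, so the tail starts at mask offset 0.
    mask_scalar(chunks.into_remainder(), mask);
}

/// Safe wrapper; kept private and only handed out by `select_impl`.
#[cfg(target_arch = "x86_64")]
fn mask_avx2(buf: &mut [u8], mask: [u8; 4]) {
    // SAFETY: `select_impl` returns this function only after detecting AVX2.
    unsafe { mask_avx2_impl(buf, mask) }
}

#[cfg(target_arch = "x86_64")]
fn select_impl() -> MaskFn {
    if is_x86_feature_detected!("avx2") {
        mask_avx2
    } else {
        mask_scalar
    }
}

#[cfg(not(target_arch = "x86_64"))]
fn select_impl() -> MaskFn {
    mask_scalar
}

/// Runs feature detection on the first call only, then reuses the pointer.
pub fn apply_mask_dispatch(buf: &mut [u8], mask: [u8; 4]) {
    static IMPL: OnceLock<MaskFn> = OnceLock::new();
    let f = *IMPL.get_or_init(select_impl);
    f(buf, mask)
}
```

The call through `f` cannot be inlined into the caller, so a real implementation would likely want a length threshold below which it skips the dispatch and uses the scalar loop directly.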
Sorry, closed accidentally due to a browser bug.
I've prototyped SIMD masking using Faster. The good news is that basically the entire

    let mask_u32 = u32::from_bytes(mask);
    let pattern = faster::u32s(mask_u32).be_u8s();
    let mut output = vec![0u8; buf.len()];
    buf.simd_iter(u8s(0)).simd_map(|v| v ^ pattern).scalar_collect()

The bad news is that this code is 2x slower than the scalar version that mutates the buffer in-place. Faster does not have in-place mutation yet, but it is in progress. (In case you're wondering, preallocating a buffer and writing there instead of using
Thank you. I believe in-place is not doable without
Now that SIMD intrinsics for x86 have been stabilized, it might be worthwhile to add explicit SIMD to accelerate unmasking. For example, autobahn-python uses SIMD exactly for this.
This Stack Overflow thread contains a barebones implementation of XOR unmasking for WebSockets in C with x86 intrinsics.
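For context, the operation being accelerated is small: per RFC 6455, section 5.3, byte i of the payload is XORed with byte i mod 4 of the 4-byte masking key, and the same operation both masks and unmasks. A scalar sketch in Rust follows; the function name `apply_mask_scalar` is hypothetical, and this is not this crate's actual implementation. It can serve as a correctness baseline for any SIMD variant.

```rust
/// XOR every byte of `buf` with the 4-byte masking key, in place.
/// Byte i is combined with mask[i % 4], as described in RFC 6455, section 5.3.
pub fn apply_mask_scalar(buf: &mut [u8], mask: [u8; 4]) {
    for (i, byte) in buf.iter_mut().enumerate() {
        // `i & 3` is the same as `i % 4` for unsigned indices.
        *byte ^= mask[i & 3];
    }
}
```

A SIMD version only has to produce the same bytes as this loop for every buffer length, including lengths that are not a multiple of the vector width.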
The SIMDeez crate might be useful for abstracting over instruction widths and for selecting the SIMD implementation at runtime depending on availability. Its description also links to other SIMD crates to consider.
Things to look out for:
unsafe blocks, although it is inherently fragile