
Consider SIMD for unmasking #36

Open · Shnatsel opened this issue Jul 9, 2018 · 8 comments

@Shnatsel (Contributor) commented Jul 9, 2018

Now that SIMD intrinsics for x86 have been stabilized, it might be worthwhile to add explicit SIMD to accelerate unmasking. For example, autobahn-python uses SIMD for exactly this purpose.

This Stack Overflow thread contains a bare-bones implementation of XOR unmasking for WebSockets in C with x86 intrinsics.

The SIMDeez crate might be useful for abstracting over instruction widths and providing runtime selection of the SIMD implementation depending on availability. Its description also links to other SIMD crates worth considering.

Things to look out for:

  • Explicit SIMD intrinsics are unsafe. The faster crate provides a safe iterator abstraction on top of them, but it sacrifices runtime selection of the instruction set and only works on nightly.
  • Compiler auto-vectorization into SSE might achieve the same without unsafe blocks, although it is inherently fragile.
  • Runtime selection of the SIMD implementation may hurt performance on small inputs.
  • Achieving 16-byte alignment (instead of the 4-byte alignment in the current implementation) may hurt performance on small inputs.
@agalakhov (Member)

It already contains unsafe blocks, so it won't get any worse. I'll run benchmarks. Thanks!

@Shnatsel (Contributor, Author) commented Jul 10, 2018

FWIW, I've experimented with Faster, which would let us get rid of the unsafe blocks in Tungstenite, and I think I've figured out a workable solution. However, it still involves casting &[u8; 4] to u32, which may interact with endianness in nontrivial ways.

I feel this is something the Faster API could improve on, so I've opened an issue on their bug tracker.

@bluetech (Contributor)

I have some C code with benchmarks of different block sizes. I'm not sure it gives "true" results, but you can try it anyway. The SIMD code assumes x86_64 with AVX2. https://gist.github.com/bluetech/36ac1d0b21864a4f42fa723de569e5f8

@Shnatsel (Contributor, Author) commented Jul 10, 2018

I believe benchmarks against SSE2 would be more interesting: enabling AVX2 requires runtime detection, which will likely hurt performance on small inputs (e.g. by interfering with inlining), whereas on x86_64 SSE2 can be assumed to be present unconditionally.

But that is interesting nevertheless, thank you!

@bluetech (Contributor)

Yes, I agree. I thought there might be a way to do the runtime detection just once and store the result in a function pointer or something like that, but it's probably difficult.

Anyway, I'm not sure SIMD actually helps. Assuming my benchmark is correct, on my CPU (Intel Core i5-5200U @ 2.20 GHz) I could not get a speedup compared to processing 8 bytes at a time. Maybe the benchmark is wrong.

@agalakhov agalakhov reopened this Jul 10, 2018
@agalakhov (Member)

Sorry, closed accidentally due to a browser bug.

@Shnatsel (Contributor, Author)

I've prototyped SIMD masking using Faster.

The good news is that essentially the entire mask_fast_32() function, with all its unsafe blocks, can be rewritten as:

```rust
let mask_u32 = u32::from_bytes(mask);
let pattern = faster::u32s(mask_u32).be_u8s();
let mut output = vec![0u8; buf.len()];
buf.simd_iter(u8s(0)).simd_map(|v| v ^ pattern).scalar_collect()
```

The bad news is that this code is 2x slower than the scalar version that mutates the buffer in place. Faster does not support in-place mutation yet, but it is in progress.

(In case you're wondering, preallocating a buffer and writing into it instead of using .scalar_collect() is even slower.)

@agalakhov (Member)

Thank you. I believe in-place mutation is not doable without unsafe right now. I also think it can be done in 64-bit blocks by duplicating the mask.
