Consider SIMD for unmasking #36

Shnatsel · 2018-07-09T23:02:28Z

Now that SIMD intrinsics for x86 have been stabilized, it might be worthwhile to add explicit SIMD to accelerate unmasking. For example, autobahn-python uses SIMD exactly for this.

This Stack Overflow thread contains a barebones implementation of XOR unmasking for WebSockets in C with x86 intrinsics.

The SIMDeez crate might be useful for abstracting over instruction widths and for selecting the SIMD implementation at runtime depending on availability. Its description also links to other SIMD crates to consider.

Things to look out for:

- Explicit SIMD instructions are unsafe. The faster crate provides a safe iterator abstraction on top of them, but it sacrifices runtime selection of the instruction set and only works on nightly.
- Compiler auto-vectorization into SSE might achieve the same result without unsafe blocks, although it is inherently fragile.
- Runtime selection of the SIMD implementation may hurt performance on small inputs.
- Aligning to 16 bytes, rather than the 4 bytes the current implementation aligns to, may hurt performance on small inputs.
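
For reference, the operation being discussed is the WebSocket XOR mask: byte i of the payload is XORed with byte (i mod 4) of the 4-byte masking key. A minimal scalar sketch, which is also the kind of loop auto-vectorization would have to pick up, could look like the following. The name `apply_mask_scalar` is hypothetical and is not taken from Tungstenite.

```rust
/// XOR every byte of `buf` with the 4-byte mask, cycling through the mask.
/// Hypothetical helper for illustration; not Tungstenite's actual function.
pub fn apply_mask_scalar(buf: &mut [u8], mask: [u8; 4]) {
    for (i, byte) in buf.iter_mut().enumerate() {
        // `i & 3` is `i % 4`: the mask repeats every four bytes.
        *byte ^= mask[i & 3];
    }
}
```

Because XOR is its own inverse, applying the same mask twice restores the original buffer, so one routine serves both masking and unmasking.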

agalakhov · 2018-07-10T08:01:52Z

It contains unsafe blocks already... so it won't be worse. I'll do benchmarks. Thanks!

Shnatsel · 2018-07-10T13:09:54Z

FWIW I've experimented with Faster which would let us get rid of unsafes in Tungstenite, and I think I've figured out a workable solution, but it still involves casting &[u8; 4] to u32, which may interact with endianness in nontrivial ways.

I feel this is something that Faster API could improve on, so I've opened an issue on their bug tracker.
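
One way to sidestep the endianness question, sketched here with the standard library only, is to convert both the mask and the data with the native-endian conversions. The bytes then land in the same positions they occupy in memory on either kind of machine. The helper name `mask_word` is hypothetical, and whether this maps cleanly onto the Faster API is not something the thread confirms.

```rust
/// XOR one 4-byte chunk with the mask by going through `u32`.
/// Hypothetical helper for illustration only.
pub fn mask_word(chunk: [u8; 4], mask: [u8; 4]) -> [u8; 4] {
    // Native-endian in, native-endian out: the round trip preserves
    // byte positions, so the result equals a byte-by-byte XOR.
    let m = u32::from_ne_bytes(mask);
    let w = u32::from_ne_bytes(chunk);
    (w ^ m).to_ne_bytes()
}
```

Using `from_be_bytes` on one side and `from_le_bytes` on the other would break this property, which is the kind of mismatch the endianness concern is about.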

bluetech · 2018-07-10T16:41:13Z

I have some C code with benchmarks of different block sizes. I'm not sure if it really gives "true" results but you can try it anyway! The SIMD code assumes x86_64 with AVX2. https://gist.github.com/bluetech/36ac1d0b21864a4f42fa723de569e5f8

Shnatsel · 2018-07-10T16:58:41Z

I believe benchmarks against SSE2 would be more interesting, because enabling AVX2 would require runtime detection, which will likely hurt small inputs (e.g. by interfering with inlining), whereas on x86_64 SSE2 can be assumed to be present unconditionally.

But that is interesting nevertheless, thank you!
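
A rough sketch of what an SSE2-only variant could look like with the stabilized `std::arch` intrinsics follows. It is not benchmarked and not taken from Tungstenite; the name `apply_mask_sse2` is hypothetical. Unaligned loads and stores are used so that no 16-byte alignment step is needed, and a scalar fallback is compiled on other architectures.

```rust
/// XOR `buf` with the mask 16 bytes at a time using SSE2 (x86_64 only).
/// Hypothetical sketch; relies on SSE2 being baseline on x86_64.
#[cfg(target_arch = "x86_64")]
pub fn apply_mask_sse2(buf: &mut [u8], mask: [u8; 4]) {
    use std::arch::x86_64::{
        __m128i, _mm_loadu_si128, _mm_set1_epi32, _mm_storeu_si128, _mm_xor_si128,
    };

    // Broadcast the 4 mask bytes into all four 32-bit lanes.
    #[allow(unused_unsafe)]
    let pattern = unsafe { _mm_set1_epi32(i32::from_ne_bytes(mask)) };

    let mut chunks = buf.chunks_exact_mut(16);
    for chunk in &mut chunks {
        let ptr = chunk.as_mut_ptr() as *mut __m128i;
        // SAFETY: `chunk` is exactly 16 bytes long, and the unaligned
        // load/store intrinsics place no alignment requirement on `ptr`.
        unsafe {
            let v = _mm_loadu_si128(ptr as *const __m128i);
            _mm_storeu_si128(ptr, _mm_xor_si128(v, pattern));
        }
    }
    // The tail starts at a multiple of 16, so the mask phase is 0 again.
    for (i, b) in chunks.into_remainder().iter_mut().enumerate() {
        *b ^= mask[i & 3];
    }
}

/// Scalar fallback so the sketch also compiles on other architectures.
#[cfg(not(target_arch = "x86_64"))]
pub fn apply_mask_sse2(buf: &mut [u8], mask: [u8; 4]) {
    for (i, b) in buf.iter_mut().enumerate() {
        *b ^= mask[i & 3];
    }
}
```

Whether this beats a word-at-a-time scalar loop is exactly what the benchmarks would have to show.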

bluetech · 2018-07-10T17:07:14Z

Yes, I agree. I thought there might be a way to do the runtime detection just once and store the result in a function pointer or something like that, but it's probably difficult.
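
The detect-once idea can be sketched with the current standard library using `std::sync::OnceLock` and `is_x86_feature_detected!`. All function names here are hypothetical, and the AVX2 path simply reuses the scalar loop compiled with AVX2 enabled rather than hand-written intrinsics. The indirect call still prevents inlining, so the concern about small inputs remains.

```rust
use std::sync::OnceLock;

type MaskFn = fn(&mut [u8], [u8; 4]);

#[inline]
fn mask_scalar(buf: &mut [u8], mask: [u8; 4]) {
    for (i, b) in buf.iter_mut().enumerate() {
        *b ^= mask[i & 3];
    }
}

// Same loop, but compiled with AVX2 enabled so the compiler may
// vectorize it with wider registers.
#[cfg(target_arch = "x86_64")]
#[target_feature(enable = "avx2")]
unsafe fn mask_avx2_impl(buf: &mut [u8], mask: [u8; 4]) {
    mask_scalar(buf, mask)
}

#[cfg(target_arch = "x86_64")]
fn mask_avx2(buf: &mut [u8], mask: [u8; 4]) {
    // SAFETY: this wrapper is only selected after AVX2 was detected.
    unsafe { mask_avx2_impl(buf, mask) }
}

/// Runs the CPU feature check and picks an implementation.
fn select_mask_fn() -> MaskFn {
    #[cfg(target_arch = "x86_64")]
    {
        if is_x86_feature_detected!("avx2") {
            return mask_avx2;
        }
    }
    mask_scalar
}

/// Public entry point: detection happens on the first call only;
/// later calls load the cached function pointer.
pub fn apply_mask(buf: &mut [u8], mask: [u8; 4]) {
    static MASK_FN: OnceLock<MaskFn> = OnceLock::new();
    let f = *MASK_FN.get_or_init(select_mask_fn);
    f(buf, mask)
}
```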

Anyway, I am not sure SIMD actually helps. Assuming my benchmark is correct, on my CPU (Intel(R) Core(TM) i5-5200U CPU @ 2.20GHz) I could not get a speedup compared to 8 bytes at a time. Maybe the benchmark is wrong.

agalakhov · 2018-07-10T17:35:23Z

Sorry, closed accidentally due to a browser bug.

Shnatsel · 2018-07-12T21:52:06Z

I've prototyped SIMD masking using Faster.

The good news is that basically the entire mask_fast_32() with all its unsafes can be rewritten into this:

let mask_u32 = u32::from_bytes(mask);
let pattern = faster::u32s(mask_u32).be_u8s();
let mut output = vec![0u8; buf.len()];
buf.simd_iter(u8s(0)).simd_map(|v| v ^ pattern).scalar_collect()

The bad news is that this code is 2x slower than the scalar version that mutates the buffer in-place. Faster does not have in-place mutation yet, but it is in progress.

(In case you're wondering, preallocating a buffer and writing there instead of using .scalar_collect() is even slower).

agalakhov · 2018-07-12T22:21:49Z

Thank you. I believe in-place is not doable without unsafe right now. And I also think it is doable in 64-bit blocks by duplicating the mask.
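
The 64-bit idea can be sketched as follows: the 4-byte mask is written twice into a `u64`, and the buffer is processed 8 bytes at a time. This version uses `chunks_exact_mut` and `u64::from_ne_bytes` from the current standard library, so it mutates in place without unsafe. It says nothing about what Faster supported at the time, and `apply_mask_u64` is a hypothetical name.

```rust
/// XOR `buf` in place with the mask, 8 bytes at a time.
/// Hypothetical sketch; not Tungstenite's actual implementation.
pub fn apply_mask_u64(buf: &mut [u8], mask: [u8; 4]) {
    // Duplicate the mask into an 8-byte pattern: m0 m1 m2 m3 m0 m1 m2 m3.
    let wide = u64::from_ne_bytes([
        mask[0], mask[1], mask[2], mask[3],
        mask[0], mask[1], mask[2], mask[3],
    ]);

    let mut chunks = buf.chunks_exact_mut(8);
    for chunk in &mut chunks {
        // Copy out, XOR as one word, copy back. Native-endian conversions
        // in both directions keep every byte in its original position.
        let mut bytes = [0u8; 8];
        bytes.copy_from_slice(chunk);
        let word = u64::from_ne_bytes(bytes) ^ wide;
        chunk.copy_from_slice(&word.to_ne_bytes());
    }

    // The remainder starts at a multiple of 8, so the mask phase is 0.
    for (i, b) in chunks.into_remainder().iter_mut().enumerate() {
        *b ^= mask[i & 3];
    }
}
```

The copies through the temporary array are small and fixed-size, so the compiler can usually reduce them to a single load and store, though that should be confirmed with benchmarks rather than assumed.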

agalakhov closed this as completed Jul 10, 2018

agalakhov reopened this Jul 10, 2018

daniel-abramov added the enhancement and question labels Jul 11, 2018

daniel-abramov removed the question label Jul 17, 2018

daniel-abramov mentioned this issue May 7, 2023

benchmark result to compare with other crates #352

Closed
