Thanks @zeux for filing this, and for collapsing the issues into a structured PR. Given the high interest in this family of instructions, but also the high performance cost on ARM, this would need some prototype data - I think an engine prototype experiment is justified here. In the meantime, if others have opinions about semantic changes, or have benchmarks to share that we can experiment with, please share them here.
Could you also add an entry to ImplementationStatus.md to include bitmask operations?
Here's how it could be lowered for ARM and x86-64 by LLVM codegen:
@AndrewScheidecker reached out to me to note that WAVM implemented 8x16 bitmask (under a different name / encoding, ltz_mask) and the two other instructions were just added: WAVM/WAVM@1b803bb
@dtig Thanks, I've added the instructions to the status table as well. The three main reasons that motivated me to submit this were:
Also, the SIMD proposal in general right now is heavily skewed towards ARM64 being as fast as possible, so maybe it's fair to have an instruction that favors x64 more 😆
I've come up with an alternative lowering for ARM64 that is slightly faster, at 8 instructions (6 vector and 2 scalar, including address generation/vector load):
generated assembly (gcc; clang is subject to an issue similar to #196 and generates less efficient code here):
I've updated the lowering in the PR message with this code. Both the original code snippet and the new one can be seen here: https://gcc.godbolt.org/z/DBRexR.
@zeux @dtig This looks great, thanks for submitting this! This looks like it could be a very important addition to the current SIMD instruction proposal. The spec is not set yet and more data is desired, but looking here: https://github.com/zeux/wasm-simd/blob/master/Instructions.md there is a clear disparity w.r.t. the instruction lowerings between arm64 and x64 for v8. This is concerning; specifically, there are a number of instructions that have a much higher instruction cost on IA (compared to the many 1-to-1 mappings on arm64), which would prevent near-native performance in real workloads on x64. Alternative solutions aside, I am thinking maybe this proposal could help mitigate some of those higher costs by providing an alternative approach to lowering some processing tasks. An example would maybe be the all_true mentioned in #189? This looks very important. What are the next steps for making this officially part of the SIMD proposal?
@jlb6740 Concrete next steps here are that this needs to be prototyped in an engine, and benchmarked to see how the performance compares on real world benchmarks. (If there are benchmarks that rely on this heavily, pointers or links are appreciated!) There is work underway in V8 to do this, and here is the tracking bug if you would like to follow along. Regarding the mismatch in the number of instructions for x64/Arm64 -
Hi @dtig, thanks a lot for the thoughtful reply. I especially appreciate the comment highlighting that the disparities shown in the table are born more from the legacy of Simd.js than from reasons that began with the efforts here. I completely understand that, and understand that it is part of the challenge. I also agree with the sentiment about AVX code gen being important for future performance. Certainly a lot of thought and effort has gone into getting SIMD this far. The table, that skew though, is nonetheless concerning for me at least, and is something that I think collectively deserves pause. It is why, even without numbers yet, I am excited about the possibilities of this proposal. I am maybe coming at that table from a different perspective … when it's mentioned that
for me, this is not the test. Comparing Wasm scalar to Wasm SIMD is not the test - in general some improvement should be shown. The test is to be near native performance and platform agnostic. Any hardware advantages that exist with native performance due to hardware instruction issue, power consumption, CPU frequency, etc., you would expect to remain if your ISA is truly agnostic. The benchmark then is comparing native workloads targeting SIMD to the equivalent compiled to WASM SIMD running on the same platform. Unfortunately, there is a dearth of workloads that allow for such comparisons (I'm looking); this table, though, is probably more telling anyway in some respects than an arbitrary workload. Workloads are obviously important, but they do change because applications change over time, and right now they don't really exist for doing the comparison I speak of (particularly off-browser). Also, workload performance should be analyzed case-by-case in order to study all of the other factors that went into its performance before drawing takeaways on what performance scores say about the ISA it targets.

This table gives us a true bound on the potential losslessness of efficiency when lowering a specific SIMD instruction. All things equal, anything other than a 1:1 mapping of WASM to hardware introduces the ISA as a bottleneck for potential lossless performance/power. As you rightfully point out, it's not expected for all instructions to be equally efficient for all hardware targets, but perhaps before things are baked in there are a few open issues/tweaks that can be visited. For example, this issue here for the bitmask instruction add; another maybe being swizzle (#93) and revisiting the out of range handling; and even the issue you point out, the propagation of NaN for floating point min/max and IEEE754 compliance. From what is here, the WASM spec states that to "propagate NaN payloads from their operands is permitted but not required". To me, to be most compliant with this spec would mean to not mandate any requirement at all. In general with some of these issues, relaxing some aspects of these definitions may go a long way in performance for some future application. I am interested in any comments here.

BTW … would this new bitmask instruction being proposed here (and any instruction) also require an implementation in a second VM? Also, definitely eager to follow the work already begun here. However, I did not have permission even though I have access to other chromium bugs. How does one request permission to view? Thanks.
@jlb6740 Just a note that instruction count is generally indicative more of latency than of throughput, and as such the penalty from a multi-instruction sequence is going to vary case by case. For example, some 8-instruction sequences that v8 synthesizes end up running at 3 cycles/iteration throughput (and 15 cycles latency in some cases), so it's hard to estimate the full real-world impact without profiling real workloads. Conversely, many single instructions in SSE have several cycles of latency, so an apples-to-apples comparison is challenging. This is not to downplay the imbalance - the reason I created the documentation is that I observed the instruction selection to have a real impact on real-world benchmarks - but just to say that you shouldn't look at the table and expect a 10:1 ratio in performance (... with one exception, which is byte shuffles).

We discussed the general performance imbalance questions in Wasm calls before, and I think the overall stance is that while in some cases it's concerning to have an imbalance, the reality of the situation is that in many cases there's no good way to alter the spec to fix that. E.g. 8-bit signed shifts in v8 on x64 (as of a recent submit) lower to an instruction sequence with 3-cycle throughput, 8-bit unsigned shifts lower to an instruction sequence with 2-cycle throughput, and 16-bit unsigned shifts lower to a single instruction with 1-cycle throughput - and that's just what it is until everybody gets AVX512. I would say that at this point (in my opinion - but this is based on thinking about this for a few problematic instructions) it's unlikely that we're going to get semantic changes to existing instructions from this proposal.

The good news is that in most cases the performance issues can be avoided by tuning the application carefully. The other piece of good news is that "fast semantics" instructions, which will come after this proposal as I understand, will likely make it easier to write fast & portable code. And the third piece of good news is that v8 codegen continuously evolves, as you can see from the commit history on the zeux/wasm-simd repository.
@zeux Thanks for the comment and also for your table. I understand your point about throughput data showing promise; that's definitely a positive. The key though, in my mind, still remains approaching those 1:1 mappings and comparing to native, including power as a metric. Obviously, that apples-to-apples comparison with actual workloads has not been trivial. In fact, that for me sort of magnifies the significance of your table. I'd be interested to follow and help this proposal, and to hear comments on NaN propagation and some other open issues that remain. That said, I do see that these issues are apparent and have been discussed, so I am encouraged.
Good point, I agree this is important. https://gitlab.com/wg1/jpeg-xl is one such example; third_party/highway is a cross-platform SIMD layer targeting both Wasm and native intrinsics, so the codec can be tested with either.
Implement i8x16.bitmask, i16x8.bitmask, i32x4.bitmask on interpreter and arm64. These operations are behind wasm_simd_post_mvp flag, as we are only prototyping to evaluate performance. The codegen is based on guidance at WebAssembly/simd#201. Bug: v8:10308 Change-Id: I835aa8a23e677a00ee7897c1c31a028850e238a9 Reviewed-on: https://chromium-review.googlesource.com/c/v8/v8/+/2099451 Reviewed-by: Tobias Tebbi <[email protected]> Reviewed-by: Deepti Gandluri <[email protected]> Commit-Queue: Zhi An Ng <[email protected]> Cr-Commit-Position: refs/heads/master@{#66793}
@zeux Agree with you that the actual performance impact is complicated, as it depends on the CPU architecture, workloads, data dependency, etc. The reason that some WASM workloads do not show a throughput penalty despite complicated lowering on x64 is that its SIMD implementation is generally wider than others. For example, Ice Lake has 3 ALU, 3 FMA, 2 Shift, and 2 Shuffle vector execution units. For workloads that can utilize more of those SIMD execution units in parallel, the throughput impact would be bigger. I think a better measurement of a "performance cliff" would be a WASM/native performance ratio, which measures how much potential performance loss the WASM ISA would introduce. But I do recognize the challenge of finding those SIMD workloads. You probably have more insight into this.
@mingqiusun Yeah, I agree that the ratio between native & wasm is the real test, but it gets complicated. We should probably carry the discussion elsewhere, but the complexity is that beyond the lack of deep code scheduling optimizations (the v8 scheduler is weaker than LLVM's, but it has to run much, much faster), there's also the issue of not being able to special-case the SIMD code to the target ISA, which is commonly done in native land. In essence, WASM SIMD is pursuing the holy grail of (consistent semantics, maximum performance, fast codegen), and as it turns out it's really hard to maximize all three here :)
This may not be a test you agree with, but in the general context of adding this as an extension to the Wasm MVP, knowing how this performs against scalar Wasm with (mostly) consistent performance across different architectures is still useful, and a test that the broader community accepts. While I agree that the ultimate goal is to get near native performance, in the absence of the ideal benchmark that you also recognize a dearth of, I do prefer actual forward progress with real world benchmarks.
Calling any of the workloads used arbitrary is not particularly useful here. For more context, one of the reasons that initial work on this proposal was stalled, was because of opposition from the CG that adding a SIMD extension would not guarantee consistent performance across architectures, and that while it may perform somewhat reasonably in synthetic benchmarks, it has no real world applications. Applications that are working to use this proposal are not arbitrary, but traces of real world usage that benefit from having SIMD operations, so while workloads in general are important, real world workloads and applications are even more important towards standardizing this particular proposal, and I would weigh feedback from these to be more important than synthetic benchmarks. If you do have an analysis that you would be willing to share as an example of the case-by-case study that you mention, please share so we can discuss this with specifics.
I think the compromise in #93 was reached after quite a bit of discussion, but if you have better ideas for balanced semantics, please follow up on that issue. What do you mean by not mandating any requirement at all? That we not specify out of range handling? Or that we allow operations on different platforms to behave differently? One of the requirements to be able to standardize Wasm SIMD as it is now is the focus on minimizing non-determinism, and formally specifying behavior to the extent possible. The same goes for portability, which has been talked about across different issues: the behavior of operations has to be consistent across architectures, and the semantics specified have to guarantee portable performance. The discussion is documented in older meeting notes, so please take a look at #37 and the linked meeting notes for more context. In general, while measuring performance cliffs, if this affects real world applications I think we should evaluate them with care, but when it comes to the performance/non-determinism tradeoff, by design, to be able to standardize a SIMD proposal that is usable in a variety of contexts, this proposal should always be deterministic. Relaxing this criterion in favor of performance in a future add-on proposal is something we've discussed multiple times, and it is still on the roadmap.
I've adjusted the permissions, you should be able to see it now.
There is a high likelihood that web use-cases and standalone/off-browser use-cases evolve differently, with different priorities, requirements and limitations. Comparison with native perf could be more relevant for wasm standalone applications. AFAIK standalone uses were mostly out of scope for the portable-simd or simd.js spec that seeded the current proposal. Agree with @zeux that a spec to meet all the desired goodness will indeed be the holy grail here. A way to address this problem would be to offer ops allowing developers more flexibility in fine tuning their application to the platforms they care about. In that sense bitmasks seem to be very handy to have.
Summary: These experimental new instructions are proposed in WebAssembly/simd#201. Reviewers: aheejin Subscribers: dschuff, sbc100, jgravelle-google, hiraditya, sunfish, cfe-commits Tags: #clang Differential Revision: https://reviews.llvm.org/D76397
Adds full support for the {i8x16,i16x8,i32x4}.abs instructions merged to the SIMD proposal in WebAssembly/simd#128 as well as the {i8x16,i16x8,i32x4}.bitmask instructions proposed in WebAssembly/simd#201.
My comments weren't about workloads but were about the baseline. I see that there is how well an application targets the SIMD ISA, and separately there is how well that SIMD ISA targets the underlying hardware. The comment of "for me, this is not the test" was about the latter. Definitely don't get me wrong; there is no question that when it comes to assessing the end goal of improving real world performance, workloads, particularly those designed from real world usages, are absolutely critical for assessing improvements being made in a VM.
Yes, the comments here were just put out there as suggestions for achieving a better balance with hopefully minimal invasiveness. These are thoughts, and there may be other suggestions. Certainly this issue is promising, and #93 has been discussed and can be followed up on. For the min/max issue, that quote "permitted but not required" - I did read that to mean that the behavior of NaN propagation is not guaranteed to be the same on all platforms … but maybe SIMD min/max is not the context where this applies? Certainly the goal of minimizing non-determinism makes sense, but the suggestion I gave is to relax the criteria for min/max for this iteration.
Yes I do … Thanks. And all my statements are only meant to find improvement, so thanks a lot for your comments!
@arunetm I agree. VMs improve and change, applications and usage patterns change, but hardware is much slower to evolve. That mapping to hardware, if not 1:1, bakes in an inefficiency in lowering that is always there. If after tuning it is not easily seen in some workload, but the instruction is still being executed, that extra work done is likely just manifested in some other metric. You best assess the ISA's relationship with hardware by comparing Wasm SIMD to native SIMD; I think comparing Wasm SIMD to Wasm scalar won't really give you that insight. More instruction options such as bitmask may come in handy, as you say.
FYI this is implemented in LLVM and Binaryen and can be used with tot Emscripten using the clang builtins |
@zeux @ngzhian
I did 10 runs (each, emulated and native), concatenated the output, and dumped them all into a sheet: https://docs.google.com/spreadsheets/d/1F8_vb16_zpnIaHKkjxU8C1-HsaYJEmOfhhxAXDPA4gg/edit#gid=789530331.
@ngzhian 👍 Thanks. Not sure if the sheet was intended to be protected, but I did request access.
Sorry, made it open!
@ngzhian .. Thanks, I checked it out. The run data does not show too much variance. I assume it's sorted by time and that this was taken successively.
Oops sorry, I messed up the sorting; I didn't mean to sort by times across all 10 runs. Updated the sheet with the actual raw results across 10 runs, so it's easier to see variance run to run. Please take a look again: https://docs.google.com/spreadsheets/d/1F8_vb16_zpnIaHKkjxU8C1-HsaYJEmOfhhxAXDPA4gg/edit#gid=937366701
Added an agenda item for the next CG meeting - WebAssembly/meetings#564. @zeux Would you be interested in driving this agenda item at the CG meeting?
@dtig Works for me
I've also rebased this against master with the new opcode numbering (using the slots left for bitmask in the renumbering pass).
Unsure if I'm going to get an invite to the meeting but fwiw I put together slides summarizing the PR discussion and performance measurements: https://docs.google.com/presentation/d/1OZSlYWWnm_pPFsjSu6iGmTQb8Qs0O2Wg2TU34-Fw-jQ/edit?usp=sharing
Thanks @zeux for presenting today, we have consensus from the CG to include this in the proposal.
Changes lgtm after resolving conflicts.
Another note: As far as I can tell, RISC-V also does not have good support for this operation. Then again, their SIMD set is still WIP.
i8x16.bitmask and i32x4.bitmask directly map to SSE movemask instructions; i16x8.bitmask can be synthesized using packs+movemask. These instructions are important for lane-wise processing after a vector comparison - for example, they can be used together with ctz to find the index of the first lane with a matching value after a comparison instruction.
This relands commit d04b5e4. The fix here is in the assembler for pmovmskb, emit_optional_rex_32 should be called after emitting the prefix byte. Original change's description: > [wasm-simd][liftoff][ia32][x64] Implement bitmask > > Implements i8x16 i16x8 i32x4 bitmask. > > This was merged into the proposal in > WebAssembly/simd#201. > > Bug: v8:9909,v8:10308 > Change-Id: I882f0c2697213cdf593e745112e0897cee252009 > Reviewed-on: https://chromium-review.googlesource.com/c/v8/v8/+/2222607 > Commit-Queue: Zhi An Ng <[email protected]> > Reviewed-by: Clemens Backes <[email protected]> > Cr-Commit-Position: refs/heads/master@{#68090} Bug: v8:9909, v8:10308 Change-Id: I4897585c86b87f72dc8f142b275171276d135a24 Reviewed-on: https://chromium-review.googlesource.com/c/v8/v8/+/2225090 Reviewed-by: Clemens Backes <[email protected]> Commit-Queue: Zhi An Ng <[email protected]> Cr-Commit-Position: refs/heads/master@{#68106}
This reverts commit dfbbb4a. Reason for revert: Bitmask added post 84 cut, so it is not part of origin trial. Therefore it is still a post-mvp. Original change's description: > [wasm-simd] Add bitmask to SIMD MVP > > This removes the post-mvp flag for bitmask, since it was accepted into > the proposal, see WebAssembly/simd#201. > > Bug: v8:10308 > Change-Id: I4ced43a6484660125d773bc9de46bdea9f72b13b > Reviewed-on: https://chromium-review.googlesource.com/c/v8/v8/+/2216532 > Reviewed-by: Deepti Gandluri <[email protected]> > Commit-Queue: Zhi An Ng <[email protected]> > Cr-Commit-Position: refs/heads/master@{#67993} [email protected],[email protected] # Not skipping CQ checks because original CL landed > 1 day ago. Bug: v8:10308 Change-Id: I53294be4ea816f37c7cc5f545afb572538dd4770 Reviewed-on: https://chromium-review.googlesource.com/c/v8/v8/+/2233183 Reviewed-by: Zhi An Ng <[email protected]> Reviewed-by: Deepti Gandluri <[email protected]> Commit-Queue: Zhi An Ng <[email protected]> Cr-Commit-Position: refs/heads/master@{#68216}
Port 6da647f Original Commit Message: Now that 86 has branched, we can move bitmask into the SIMD MVP, it will not affect the current OT. (We want any OT extension to include bitmask.) Bitmask was accepted into the proposal in WebAssembly/simd#201. [email protected], [email protected], [email protected], [email protected] BUG= LOG=N Change-Id: I7518e1e8d7513a6931ff026eb3089fa896a6b587 Reviewed-on: https://chromium-review.googlesource.com/c/v8/v8/+/2379227 Reviewed-by: Junliang Yan <[email protected]> Commit-Queue: Milad Farazmand <[email protected]> Cr-Commit-Position: refs/heads/master@{#69587}
…it(to #32373348) Summary: find_first_set and find_last_set method is not optimal for neon, it need to be improved by synthesized with horizontal adds(vaddv) which will reduce the generated assembly code; in the following cases, vaddvq_s16 will generate 2 instructions but vpadd_s16 will generate 4 instrunctions ``` #ifdef __aarch64__ return vaddvq_s16(__asint); // addv h0, v1.8h // smov w1, v0.h[0] #else return vpadd_s16( vpadd_s16(vpadd_s16(__lo64(__asint), __hi64(__asint)), __zero), __zero)[0]; // addp v1.8h,v1.8h,v2.8h // addp v1.8h,v1.8h,v2.8h // addp v1.8h,v1.8h,v2.8h // smov w1, v1.h[0] #endif ``` Further discussion following the linking: [1]. WebAssembly/simd#201 [2]. WebAssembly/simd#131 Test Plan: test_run.sh Reviewers: chengbin.cb, liangbin.mj, yifeng.dongyifeng, longfei.alf, chuanqi.xcq Issue: https://aone.alibaba-inc.com/req/32373348 CR: https://code.aone.alibaba-inc.com/cpp_libs/std-simd/codereview/4534679
Introduction
A class of SIMD operations requires performing processing in vector registers first, and then post-processing some of the lanes using scalar code. Often this involves doing a vectorized comparison, followed by inspection of the lanes of the comparison mask.
This can be done using lane extraction, but it's usually slow - for example, in a vectorized memchr implementation, if any of the lanes compare as equal to the pattern, we need to quickly find the right lane.
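To make this concrete, here is a minimal sketch of such a memchr-style loop using clang's wasm_simd128.h intrinsics (the function name and loop structure are illustrative, and the scalar tail is simplified):

```c
#include <stddef.h>
#include <stdint.h>
#include <wasm_simd128.h>  // clang; compile for wasm32 with -msimd128

// Illustrative memchr core: compare 16 bytes at a time, then use the
// lane mask together with ctz to locate the first matching byte.
static const uint8_t *memchr_simd(const uint8_t *p, uint8_t c, size_t n) {
    v128_t needle = wasm_i8x16_splat((int8_t)c);
    size_t i = 0;
    for (; i + 16 <= n; i += 16) {
        v128_t eq = wasm_i8x16_eq(wasm_v128_load(p + i), needle);
        uint32_t mask = wasm_i8x16_bitmask(eq);  // bit k = high bit of lane k
        if (mask)
            return p + i + __builtin_ctz(mask);  // index of first set bit
    }
    for (; i < n; i++)  // scalar tail for the remaining bytes
        if (p[i] == c) return p + i;
    return NULL;
}
```

Without the bitmask instruction, the `if (mask)` test and the ctz step would require extracting lanes one by one.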
This proposal suggests adding three new instructions that can facilitate this operation - they take the high bit of each lane of the input vector and concatenate them into a scalar mask. The resulting mask can then be used with ctz instructions to find the first matching element, or as a table lookup index / for other needs.
See #131 and #169 for other examples of workloads where this can be beneficial.
These instructions are impossible to synthesize or emulate given the current set of WASM SIMD instructions, short of extracting individual lanes and performing the bit extraction in scalar code.
Lowering for x64 (SSE2)
i8x16.bitmask maps directly to pmovmskb (SSE2)
i16x8.bitmask maps to a two-instruction sequence, assuming we have a zeroed vector register: packsswb converts from 16-bit integers to 8-bit integers with saturation, pmovmskb computes the mask (or 3 instructions if we need to zero a register, or use packsswb on the same vector and then mask off the low 8 bits of the result); see the intrinsics sketch after this list.
i32x4.bitmask maps directly to movmskps (SSE1)
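As an illustration of the i16x8 sequence above, here's a minimal sketch with SSE2 intrinsics (the helper name is ours). Because the high half of the pack comes from a zeroed register, the top 8 bits of the movemask result are already zero and no extra masking is needed:

```c
#include <stdint.h>
#include <emmintrin.h>  // SSE2

// i16x8.bitmask: packsswb's signed saturation preserves each lane's sign
// bit, so packing to bytes and taking pmovmskb yields one bit per 16-bit lane.
static inline uint32_t i16x8_bitmask(__m128i v) {
    __m128i packed = _mm_packs_epi16(v, _mm_setzero_si128());
    return (uint32_t)_mm_movemask_epi8(packed);  // high 8 bits are zero
}
```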
Lowering for PowerPC
PowerPC supports a more general form of this, vbpermq/vbpermd, which allows selecting arbitrary bits from the input vector:
https://www.ibm.com/support/knowledgecenter/SSGH3R_16.1.0/com.ibm.xlcpp161.aix.doc/compiler_ref/vec_bperm_p8.html
Using a predefined mask with indices of the bits (which can be loaded from memory or synthesized using lvsl and a vector shift), the lowering is equivalent to x64 from that point on - i8x16.bitmask lowers to lvsl+shift+vbpermq, etc.
Lowering for ARM64
While ARM doesn't directly support this instruction, it supports horizontal add (which is not exposed in WASM), which makes it possible to emulate these as follows:
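A minimal AArch64 NEON sketch of this horizontal-add approach for i8x16.bitmask (a reconstruction consistent with the instruction counts below, not necessarily the exact sequence proposed in this PR):

```c
#include <stdint.h>
#include <arm_neon.h>

// i8x16.bitmask: broadcast each lane's high bit across the lane, AND with
// per-lane bit weights, then horizontally add each 8-byte half (vaddv).
static inline uint16_t i8x16_bitmask(int8x16_t v) {
    static const uint8_t kWeights[16] = {1, 2, 4, 8, 16, 32, 64, 128,
                                         1, 2, 4, 8, 16, 32, 64, 128};
    uint8x16_t sign = vreinterpretq_u8_s8(vshrq_n_s8(v, 7));  // 0x00 or 0xFF
    uint8x16_t bits = vandq_u8(sign, vld1q_u8(kWeights));
    uint16_t lo = vaddv_u8(vget_low_u8(bits));   // mask bits for lanes 0..7
    uint16_t hi = vaddv_u8(vget_high_u8(bits));  // mask bits for lanes 8..15
    return (uint16_t)(lo | (hi << 8));
}
```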
i8x16.bitmask lowering is the most expensive, 6 vector instructions and 2 scalar instructions; other bitmask instructions lower to 4 vector instructions + 2 scalar instructions (including 1 vector load).
32-bit ARM doesn't have vaddvq, but it can be emulated using paired adds (vpadd); a sketch of that fallback follows.
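A sketch of that 32-bit fallback, replacing each vaddv with a chain of pairwise adds (little-endian assumed for the final reinterpret; the helper name is ours):

```c
#include <stdint.h>
#include <arm_neon.h>

// ARMv7 variant: three vpadd steps collapse the weighted bytes into the
// low and high mask bytes, which land in the first u16 lane.
static inline uint16_t i8x16_bitmask_armv7(int8x16_t v) {
    static const uint8_t kWeights[16] = {1, 2, 4, 8, 16, 32, 64, 128,
                                         1, 2, 4, 8, 16, 32, 64, 128};
    uint8x16_t sign = vreinterpretq_u8_s8(vshrq_n_s8(v, 7));
    uint8x16_t bits = vandq_u8(sign, vld1q_u8(kWeights));
    uint8x8_t sum = vpadd_u8(vget_low_u8(bits), vget_high_u8(bits));
    sum = vpadd_u8(sum, sum);  // 4-byte partial sums
    sum = vpadd_u8(sum, sum);  // bytes 0 and 1 now hold the two mask halves
    return vget_lane_u16(vreinterpret_u16_u8(sum), 0);
}
```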
Performance cliffs
The highest disparity in this proposal is between i8x16.bitmask for ARM64 (6 vector + 2 scalar instructions) and x64 (1 instruction). However, this type of disparity is relatively common for other instructions in this proposal, and, crucially, if the algorithm in question requires i8x16.bitmask, the lowering by the WASM backend is going to be substantially better than any lowering possible in WASM SIMD by using other instructions to emulate this.
Closes #131
Closes #169