This repository has been archived by the owner on Dec 22, 2021. It is now read-only.

Add .bitmask instruction family #201

Merged: 2 commits into WebAssembly:master on May 28, 2020

Conversation

@zeux (Contributor) commented Feb 27, 2020

Introduction

A class of SIMD operations requires performing processing in vector registers first and then post-processing some of the lanes with scalar code. Often this involves a vectorized comparison followed by inspection of the lanes of the comparison mask.

This can be done using lane extraction, but that is usually slow - for example, in a vectorized memchr implementation, if any of the lanes compare equal to the pattern, we need to quickly find the right lane.

This proposal suggests adding three new instructions that facilitate this operation - they take the high bit of each lane of the input vector and concatenate them into a scalar mask. The resulting mask can then be used with ctz instructions to find the first matching element, as a table lookup index, or for other needs.

See #131 and #169 for other examples of workloads where this can be beneficial.

These instructions are impossible to synthesize or emulate with the current set of WASM SIMD instructions, short of extracting individual lanes and performing the bit extraction in scalar code.
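To make the intended usage concrete, here is a rough sketch of such a memchr-style scan. The intrinsic spellings (wasm_simd128.h names such as wasm_i8x16_bitmask, and the helper name) are assumptions standing in for the proposed instruction, not part of this PR:

#include <stddef.h>
#include <wasm_simd128.h>

// Hypothetical vectorized memchr inner loop: compare 16 bytes at a time,
// collapse the comparison mask to a scalar with the proposed i8x16.bitmask,
// and use ctz to locate the first matching lane. Assumes n % 16 == 0.
static const char *memchr_simd(const char *p, char c, size_t n) {
    v128_t pattern = wasm_i8x16_splat(c);
    for (size_t i = 0; i < n; i += 16) {
        v128_t eq = wasm_i8x16_eq(wasm_v128_load(p + i), pattern);
        int mask = wasm_i8x16_bitmask(eq);       // one bit per matching lane
        if (mask)
            return p + i + __builtin_ctz(mask);  // index of the first match
    }
    return NULL;
}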

Lowering for x64 (SSE2)

i8x16.bitmask maps directly to pmovmskb (SSE2)

i16x8.bitmask maps to a two-instruction sequence, assuming we have a zeroed vector register: packsswb converts the 16-bit integers to 8-bit integers with saturation, and pmovmskb computes the mask (or 3 instructions if we need to zero a register, or if we use packsswb with the same vector in both operands and then keep only the low 8 bits of the result); see the sketch after this list.

i32x4.bitmask maps directly to movmskps (SSE1)
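For reference, the i16x8 lowering above can be written as a minimal SSE2 intrinsics sketch (the function name is illustrative, assuming the zeroed-register variant):

#include <emmintrin.h> // SSE2

// i16x8.bitmask as packsswb + pmovmskb: signed saturation preserves each
// 16-bit lane's sign in the packed byte, and packing zeros into the upper
// half guarantees the top 8 bits of the movemask result are zero.
static inline int bitmask_i16x8_sse2(__m128i v) {
    __m128i packed = _mm_packs_epi16(v, _mm_setzero_si128()); // packsswb
    return _mm_movemask_epi8(packed);                         // pmovmskb
}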

Lowering for PowerPC

PowerPC supports a more general form of this via vbpermq/vbpermd, which allow selecting arbitrary bits from the input vector:

https://www.ibm.com/support/knowledgecenter/SSGH3R_16.1.0/com.ibm.xlcpp161.aix.doc/compiler_ref/vec_bperm_p8.html

Using a predefined mask with the indices of the bits (which can be loaded from memory or synthesized using lvsl and a vector shift), the lowering is equivalent to x64 from that point on - i8x16.bitmask lowers to lvsl+shift+vbpermq, etc.

Lowering for ARM64

While ARM doesn't directly support this instruction, it does support horizontal add (not exposed in WASM), which makes it possible to emulate these as follows:

// Assumes #include <arm_neon.h>.
int bitmask_8(int8x16_t val) {
    static const uint8x16_t mask = {1, 2, 4, 8, 16, 32, 64, 128, 1, 2, 4, 8, 16, 32, 64, 128};

    // Replicate each lane's sign bit across the lane, then keep one
    // distinct bit per lane position.
    uint8x16_t masked = vandq_u8(mask, (uint8x16_t)vshrq_n_s8(val, 7));
    // Interleave the low and high halves so lane k and lane k+8 form the low
    // and high bytes of 16-bit element k, then sum all elements horizontally.
    uint8x16_t maskedhi = vextq_u8(masked, masked, 8);

    return vaddvq_u16((uint16x8_t)vzip1q_u8(masked, maskedhi));
}

int bitmask_16(int16x8_t val) {
    static const uint16x8_t mask = { 1, 2, 4, 8, 16, 32, 64, 128 };
    // Sign bit -> per-lane bit value, then a horizontal add collects the mask.
    return vaddvq_u16(vandq_u16((uint16x8_t)vshrq_n_s16(val, 15), mask));
}

int bitmask_32(int32x4_t val) {
    static const uint32x4_t mask = { 1, 2, 4, 8 };
    return vaddvq_u32(vandq_u32((uint32x4_t)vshrq_n_s32(val, 31), mask));
}

i8x16.bitmask lowering is the most expensive, 6 vector instructions and 2 scalar instructions; other bitmask instructions lower to 4 vector instructions + 2 scalar instructions (including 1 vector load).

32-bit ARM doesn't have vaddvq, but it can be emulated with a chain of paired adds (vpadd), as sketched below.
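For completeness, a possible AArch32 NEON sketch of the 16-bit case under that assumption (the function name is illustrative):

#include <arm_neon.h>

// i16x8.bitmask on 32-bit ARM: same mask-and-shift trick as above, with the
// horizontal sum built from a chain of pairwise adds instead of vaddvq.
int bitmask_16_arm32(int16x8_t val) {
    static const uint16x8_t mask = { 1, 2, 4, 8, 16, 32, 64, 128 };
    uint16x8_t masked = vandq_u16((uint16x8_t)vshrq_n_s16(val, 15), mask);
    uint16x4_t sum = vpadd_u16(vget_low_u16(masked), vget_high_u16(masked));
    sum = vpadd_u16(sum, sum); // 4 lanes -> 2
    sum = vpadd_u16(sum, sum); // 2 lanes -> 1
    return vget_lane_u16(sum, 0);
}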

Performance cliffs

The highest disparity in this proposal is between i8x16.bitmask for ARM64 (6 vector + 2 scalar instructions) and x64 (1 instruction). However, this type of disparity is relatively common for other instructions in the overall SIMD proposal, and, crucially, if the algorithm in question requires i8x16.bitmask, the lowering by the WASM backend is going to be substantially better than any lowering possible in WASM SIMD by using other instructions to emulate this.

Closes #131
Closes #169

@zeux (Contributor, Author) commented Feb 27, 2020

Thanks @gnzlbg for the naming suggestion and @MaxGraey for suggestions for the ARM lowering. Comments welcome - after thinking about this and studying other instructions in the proposal, I've become convinced that we should add this one.

@tlively requested a review from dtig on February 27, 2020
@dtig (Member) commented Feb 28, 2020

Thanks @zeux for filing this and collapsing the issues into a structured PR. Given the high interest in this family of instructions, but also the high performance cost on ARM, this would need some prototype data - I think an engine prototype experiment here is justified. In the meantime, if others have opinions about semantic changes, or have benchmarks to share that we can experiment with, please share them here.

@dtig (Member) left a comment:

Could you also add an entry to ImplementationStatus.md to include bitmask operations?

@MaxGraey commented Feb 28, 2020

Here is how it could be lowered for ARM and x86-64 by LLVM codegen:
https://godbolt.org/z/0AT8hI

@zeux (Contributor, Author) commented Feb 28, 2020

@AndrewScheidecker reached out to me to note that WAVM implemented 8x16 bitmask (under a different name / encoding, ltz_mask) and the two other instructions were just added: WAVM/WAVM@1b803bb

@zeux (Contributor, Author) commented Feb 28, 2020

@dtig Thanks, I've added the instructions to the status table as well. The three main reasons that motivated me to submit this were:

  • There's continued interest in these, and some use cases just aren't implementable with the existing all_true instructions
  • There is no good emulation path. Trying to emulate with existing instructions tends to produce less efficient assembly even on ARM64, and dramatically less efficient assembly on x64/PPC.
  • The cost is somewhat steep for 8x16 on ARM, but it's worth noting that all_true/etc. are actually very much not free either (4 instructions on ARM/x64 for all_true, 6 instructions on x64 for any_true). My previous mental model for these was that they're 1-2 instructions everywhere, but that's very much not true, so going from the SIMD to the scalar domain in general is something that needs a bit of caution.

Also the SIMD proposal in general right now is heavily skewed towards ARM64 being as fast as possible so maybe it's fair to have an instruction that favors x64 more 😆

@zeux (Contributor, Author) commented Mar 3, 2020

I've come up with an alternative lowering for ARM64 that is slightly faster, at 8 instructions (6 vector and 2 scalar, including address generation/vector load):

int bitmask_8B(int8x16_t val) {
    static const uint8x16_t mask = {1, 2, 4, 8, 16, 32, 64, 128, 1, 2, 4, 8, 16, 32, 64, 128};

    uint8x16_t masked = vandq_u8(mask, (uint8x16_t)vshrq_n_s8(val, 7));
    uint8x16_t maskedhi = vextq_u8(masked, masked, 8);
    uint16x8_t x = (uint16x8_t)vzip1q_u8(masked, maskedhi);
    return vaddvq_u16(x);
}

generated assembly (gcc; clang is subject to an issue similar to #196 and generates less efficient code here):

bitmask_8B(__Int8x16_t):
        sshr    v0.16b, v0.16b, 7
        adrp    x0, .LC0
        ldr     q1, [x0, #:lo12:.LC0]
        and     v0.16b, v0.16b, v1.16b
        ext     v1.16b, v0.16b, v0.16b, #8
        zip1    v0.16b, v0.16b, v1.16b
        addv    h0, v0.8h
        umov    w0, v0.h[0]

I've updated the lowering in the PR message with this code. Both the original code snippet and the new one can be seen here: https://gcc.godbolt.org/z/DBRexR.

@jlb6740 commented Mar 11, 2020

@zeux @dtig This looks great, thanks for submitting this! This looks like it could be a very important addition to the current SIMD instruction proposal. The spec is not set yet and more data is desired, but looking here: https://github.com/zeux/wasm-simd/blob/master/Instructions.md there is a clear disparity w.r.t. the instruction lowerings between arm64 and x64 for v8. This is concerning; specifically, there are a number of instructions that have a much higher instruction cost on IA (compared to many 1-to-1 mappings on arm64), which would/should disallow near-native performance in real workloads on x64. Alternative solutions aside, I am thinking maybe this proposal could help mitigate some of those higher costs by providing an alternative approach to lowering some processing tasks. An example would maybe be the all_true mentioned here: #189? This looks very important. What are the next steps for making this officially part of the SIMD proposal?

@dtig (Member) commented Mar 17, 2020

@jlb6740 Concrete next steps here are that this needs to be prototyped in an engine and benchmarked to see how the performance compares on real-world workloads. (If there are benchmarks that rely on this heavily, pointers or links are appreciated!) There is work underway in V8 to do this, and here is the tracking bug if you would like to follow along.

Regarding the mismatch of number of instructions for x64/Arm64 -

  • For the set of operations initially proposed as a part of Simd.js, and then adapted for Wasm, the x64 mapping is not 1:1 mostly because of the unavailability of a 1:1 instruction mapping in SSE4.1, which we are looking at as a baseline for this proposal on Intel architectures. The rationale for this set of operations is that they are useful for a broad set of applications, and not having them in the proposal leads to suboptimal performance (even on Intel platforms), as emulating them with the available operations won't result in better codegen.
  • As engines gain support for more AVX code generation, I expect the codegen will get better, as Intel fixed some of these inconsistencies in later AVX extensions.
  • Emulation on Intel platforms with a small number of instructions does not cause performance cliffs, and the performance in real-world benchmarks is consistently better than scalar and on par with, or at times better than, ARM even with these instructions being emulated. In most cases, if you take a closer look at the codegen, these are all one-cycle logical operations, which in practice isn't very expensive at all on Intel platforms. The absence of a 1:1 mapping, especially on Intel platforms, doesn't always imply a guaranteed performance cliff, and where possible we've tried to be judicious about this and bias towards what's useful for applications. (The one case where this is a performance cliff is floating-point min/max, but that is for consistency with the Wasm MVP semantics, which are IEEE 754 compliant.)
  • From a V8 implementer perspective: V8 currently has a prototype implementation for experimentation, and the codegen is at times not the most optimal (suggestions welcome, please file bugs!). We are continuing to work on optimizations to make this better, so while the documentation is useful for applications, it may change as we optimize the engine.

@jlb6740 commented Mar 19, 2020

Hi @dtig, thanks a lot for the thoughtful reply. I especially appreciate the comment highlighting that the disparities shown in the table are more born from the legacy of Simd.js and not from reasons that began with the efforts here. I completely understand that, and understand that it is part of the challenge. I also agree with the sentiment about AVX codegen being important for future performance. Certainly a lot of thought and effort has gone into getting SIMD this far. The table, that skew though, is nonetheless concerning for me at least, and is something that I think collectively deserves a pause. It is why, even without numbers yet, I am excited about the possibilities of this proposal. I am maybe coming at that table from a different perspective … when it's mentioned that

and the performance in real world benchmarks is consistently better than scalar and on par, or at times better than ARM even with these instructions being emulated

for me, this is not the test. Comparing Wasm scalar to Wasm SIMD is not the test .. in general some improvement should be shown. The test is to be near native performance and platform agnostic. Any hardware advantages that exist with native performance due to hardware instruction issue, power consumption, CPU frequency, etc., you would expect to remain if your ISA is truly agnostic. The benchmark then is comparing native workloads targeting SIMD to the equivalent compiled to WASM SIMD running on the same platform. Unfortunately, there is a dearth of workloads that allow for such comparisons (I'm looking); this table, though, is probably more telling anyway in some respects than an arbitrary workload. Workloads are obviously important, but they do change because applications change over time, and right now they don't really exist for doing the comparison I speak of (particularly off-browser). Also, workload performance should be analyzed case by case in order to study all of the other factors that went into its performance before drawing takeaways on what performance scores say about the ISA it targets. This table gives us a true bound on the potential losslessness of efficiency when lowering a specific SIMD instruction. All things equal, anything other than a 1:1 mapping of WASM to hardware introduces the ISA as a bottleneck for potential lossless performance/power.

As you rightfully point out, it's not expected for all instructions to be equally efficient on all hardware targets, but perhaps before things are baked in there are a few open issues/tweaks that can be revisited. For example, this issue here for the bitmask instruction addition; another maybe being swizzle #93 and revisiting the out-of-range handling; and even the issue you point out, the propagation of NaN for floating-point min/max and IEEE 754 compliance. From what is here, the WASM spec states "propagate NaN payloads from their operands is permitted but not required". To me, to be most compliant with this spec would mean to not mandate any requirement at all. In general, with some of these issues, relaxing some aspects of these definitions may go a long way in performance for some future application. I am interested in any comments here.

BTW … would this new bitmask instruction being proposed here (and any instruction) also require an implementation in a second VM? Also, I'm definitely eager to follow the work already begun here. However, I did not have permission, even though I have access to other chromium bugs. How does one request permission to view? Thanks.

@zeux (Contributor, Author) commented Mar 19, 2020

@jlb6740 Just a note that instruction count is generally indicative more of latency than of throughput, and as such the penalty from a multi-instruction sequence is going to vary case by case. For example, some 8-instruction sequences that v8 synthesizes end up running at 3 cycles/iteration throughput (and 15 cycles latency in some cases), so it's hard to estimate the full real-world impact without profiling real workloads. Conversely, many single instructions in SSE have several cycles of latency, so an apples-to-apples comparison is challenging.

This is not to downplay the imbalance - the reason why I created the documentation is that I observed the instruction selection to have a real impact on real-world benchmarks - but just to say that you shouldn't look at the table and expect a 10:1 ratio in performance (... with one exception which is byte shuffles).

We discussed the general performance imbalance questions in Wasm calls before, and I think the overall stance is that while in some cases it's concerning to have an imbalance, the reality of the situation is that in many cases there's no good way to alter the spec to fix it. E.g. 8-bit signed shifts in v8 on x64 (as of a recent submit) lower to an instruction sequence with 3-cycle throughput, 8-bit unsigned shifts lower to an instruction sequence with 2-cycle throughput, and 16-bit unsigned shifts lower to a single instruction with 1-cycle throughput - and that's just what it is until everybody gets AVX-512.

I would say that at this point (in my opinion - but this is based on thinking about this for a few problematic instructions) it's unlikely that we're going to get semantic changes to existing instructions from this proposal. The good news is that in most cases the performance issues can be avoided by tuning the application carefully. The other piece of good news is that "fast semantics" instructions, which will come after this proposal as I understand it, will likely make it easier to write fast & portable code. And the third piece of good news is that v8 codegen continuously evolves, as you can see from the commit history on the zeux/wasm-simd repository.

@jlb6740 commented Mar 19, 2020

@zeux Thanks for the comment and also for your table. I understand your point about throughput data showing promise; that's definitely a positive. The key, though, in my mind still remains approaching those 1:1 mappings and comparing to native, including power as a metric. Obviously, that apples-to-apples comparison with actual workloads has not been trivial. In fact, that for me sort of magnifies the significance of your table. I'd be interested to follow and help this proposal, and to hear comments on NaN propagation and some other open issues that remain. That said, I do see that these issues are apparent and have been discussed, so I am encouraged.

@jan-wassenberg commented, quoting @jlb6740:

The benchmark then is comparing native workloads targeting SIMD to the equivalent compiled to WASM SIMD running on the same platform. Unfortunately, there is a dearth of workloads that allow for such comparisons

Good point, I agree this is important. https://gitlab.com/wg1/jpeg-xl is one such example: third_party/highway is a cross-platform SIMD layer targeting both Wasm and native intrinsics, so the codec can be tested with either.

pull bot pushed a commit to p-g-krish/v8 that referenced this pull request Mar 19, 2020
Implement i8x16.bitmask, i16x8.bitmask, i32x4.bitmask on interpreter and
arm64.

These operations are behind wasm_simd_post_mvp flag, as we are only
prototyping to evaluate performance. The codegen is based on guidance at
WebAssembly/simd#201.

Bug: v8:10308
Change-Id: I835aa8a23e677a00ee7897c1c31a028850e238a9
Reviewed-on: https://chromium-review.googlesource.com/c/v8/v8/+/2099451
Reviewed-by: Tobias Tebbi <[email protected]>
Reviewed-by: Deepti Gandluri <[email protected]>
Commit-Queue: Zhi An Ng <[email protected]>
Cr-Commit-Position: refs/heads/master@{#66793}
@mingqiusun commented:
@zeux Agree with you that the actual performance impact is complicated, as it depends on the CPU architecture, workloads and data dependency, etc. The reason that some WASM workloads do not show throughput penalty despite complicated lowering on x64 is that its SIMD implementation is generally wider than others. For example, Ice Lake has 3 ALU, 3 FMA, 2 Shift, and 2 Shuffle vector execution units. For workloads that can utilize more of those SIMD execution units in parallel, throughput impact would be bigger. I think a better measurement of a "performance cliff" would be a WASM/native performance ratio, which measures how much potential performance loss the WASM ISA would introduce. But I do recognize the challenge of finding those SIMD workloads. You probably have more insight into this.

@zeux (Contributor, Author) commented Mar 19, 2020

@mingqiusun Yeah, I agree that the ratio between native & wasm is the real test, but it gets complicated. We should probably carry the discussion elsewhere, but the complexity is that beyond the lack of deep code scheduling optimizations (v8's scheduler is weaker than LLVM's, but it has to run much, much faster), there's also the issue of not being able to special-case the SIMD code for the target ISA, which is commonly done in native land. In essence, WASM SIMD is pursuing the holy grail of (consistent semantics, maximum performance, fast codegen), and as it turns out it's really hard to maximize all three here :)

@dtig (Member) commented Mar 19, 2020

Hi @dtig, thanks a lot for the thoughtful reply. I especially appreciate the comment highlighting that the disparities shown in the table are more born from the legacy of Simd.js and not from reasons that began with the efforts here. I completely understand that, and understand that it is part of the challenge. I also agree with the sentiment about AVX codegen being important for future performance. Certainly a lot of thought and effort has gone into getting SIMD this far. The table, that skew though, is nonetheless concerning for me at least, and is something that I think collectively deserves a pause. It is why, even without numbers yet, I am excited about the possibilities of this proposal. I am maybe coming at that table from a different perspective … when it's mentioned that

and the performance in real world benchmarks is consistently better than scalar and on par, or at times better than ARM even with these instructions being emulated

for me, this is not the test. Comparing Wasm scalar to Wasm SIMD is not the test .. in general some improvement should be shown. The test is to be near native performance and platform agnostic. Any hardware advantages that exist with native performance due to hardware instruction

This may not be a test you agree with, but in the general context of adding this as an extension to the Wasm MVP, knowing how it performs against scalar Wasm with (mostly) consistent performance across different architectures is still useful, and a test that the broader community accepts. While I agree that the ultimate goal is to get near-native performance, in the absence of the ideal benchmark, which you also recognize there is a dearth of, I do prefer actual forward progress with real-world benchmarks.

issue, power consumption, CPU frequency, etc., you would expect to remain if your ISA is truly agnostic. The benchmark then is comparing native workloads targeting SIMD to the equivalent compiled to WASM SIMD running on the same platform. Unfortunately, there is a dearth of workloads that allow for such comparisons (I'm looking); this table, though, is probably more telling anyway in some respects than an arbitrary workload. Workloads are obviously important, but they do change because applications change over time, and right now they don't really exist for doing the comparison I speak of (particularly off-browser). Also, workload performance should be analyzed case by case in order to study all of the other factors that went into its performance before drawing takeaways on what performance scores say about the ISA it targets.

Calling any of the workloads used arbitrary is not particularly useful here. For more context, one of the reasons that initial work on this proposal stalled was opposition from the CG, on the grounds that adding a SIMD extension would not guarantee consistent performance across architectures, and that while it may perform somewhat reasonably in synthetic benchmarks, it has no real-world applications. Applications that are working to use this proposal are not arbitrary, but traces of real-world usage that benefit from having SIMD operations; so while workloads in general are important, real-world workloads and applications are even more important towards standardizing this particular proposal, and I would weigh feedback from these as more important than synthetic benchmarks. If you do have an analysis that you would be willing to share as an example of the case-by-case study that you mention, please share it so we can discuss this with specifics.

This table gives us a true bound on the potential losslessness of efficiency when lowering a specific SIMD instruction. All things equal, anything other than a 1:1 mapping of WASM to hardware introduces the ISA as a bottleneck for potential lossless performance/power.

As you rightfully point out, it's not expected for all instructions to be equally efficient on all hardware targets, but perhaps before things are baked in there are a few open issues/tweaks that can be revisited. For example, this issue here for the bitmask instruction addition; another maybe being swizzle #93 and revisiting the out-of-range handling; and even the issue you point out, the propagation of NaN for floating-point min/max and IEEE 754 compliance. From what is here, the WASM spec states "propagate NaN payloads from their operands is permitted but not required". To me, to be most compliant with this spec would mean to not mandate any requirement at all. In general, with some of these issues, relaxing some aspects of these definitions may go a long way in performance for some future application. I am interested in any comments here.

I think the compromise in #93 was reached after quite a bit of discussion, but if you have better ideas for balanced semantics, please follow up on that issue.

What do you mean by not mandating any requirement at all? That we not specify out-of-range handling? Or that we allow operations on different platforms to behave differently? One of the requirements to be able to standardize Wasm SIMD as it is now is the focus on minimizing non-determinism and formally specifying behavior to the extent possible. There is also portability, which has been talked about across different issues: the behavior of operations has to be consistent across architectures, and the specified semantics have to guarantee portable performance. The discussion is documented in older meeting notes, so please take a look at #37 and the linked meeting notes for more context.

In general, while measuring performance cliffs, if this affects real-world applications I think we should evaluate them with care; but when it comes to the performance/non-determinism tradeoff, by design, to be able to standardize a SIMD proposal that is usable in a variety of contexts, this proposal should always be deterministic. Relaxing this criterion in favor of performance in a future add-on proposal is something we've discussed multiple times, and it is still on the roadmap.

BTW … would this new bitmask instruction being proposed here (and any instruction) also require an implementation in a second VM? Also, I'm definitely eager to follow the work already begun here. However, I did not have permission, even though I have access to other chromium bugs. How does one request permission to view? Thanks.

I've adjusted the permissions, you should be able to see it now.

@arunetm (Collaborator) commented Mar 19, 2020

There is a high likelihood that web use cases and standalone/off-browser use cases evolve differently, with different priorities, requirements, and limitations. Comparison with native perf could be more relevant for Wasm standalone applications. AFAIK standalone uses were mostly out of scope for the portable-simd or simd.js specs that seeded the current proposal. Agree with @zeux that a spec meeting all the desired goodness will indeed be the holy grail here.

A way to address this problem would be to offer ops allowing developers more flexibility in fine-tuning their applications for the platforms they care about. In that sense, bitmask seems very handy to have.

tlively added a commit to llvm/llvm-project that referenced this pull request Mar 20, 2020
Summary:
These experimental new instructions are proposed in
WebAssembly/simd#201.

Reviewers: aheejin

Subscribers: dschuff, sbc100, jgravelle-google, hiraditya, sunfish, cfe-commits

Tags: #clang

Differential Revision: https://reviews.llvm.org/D76397
tlively added a commit to tlively/binaryen that referenced this pull request Mar 20, 2020
Adds full support for the {i8x16,i16x8,i32x4}.abs instructions merged
to the SIMD proposal in WebAssembly/simd#128
as well as the {i8x16,i16x8,i32x4}.bitmask instructions proposed in
WebAssembly/simd#201.
@jlb6740 commented Mar 20, 2020

Calling any of the workloads used arbitrary is not particularly useful here. For more context, one of the reasons that initial work on this proposal stalled was opposition from the CG, on the grounds that adding a SIMD extension would not guarantee consistent performance across architectures, and that while it may perform somewhat reasonably in synthetic benchmarks, it has no real-world applications. Applications that are working to use this proposal are not arbitrary, but traces of real-world usage that benefit from having SIMD operations; so while workloads in general are important, real-world workloads and applications are even more important towards standardizing this particular proposal, and I would weigh feedback from these as more important than synthetic benchmarks. If you do have an analysis that you would be willing to share as an example of the case-by-case study that you mention, please share it so we can discuss this with specifics.

My comments weren't about workloads but were about the baseline. I see that there is how well an application targets the SIMD ISA, and separately there is how well that SIMD ISA targets the underlying hardware. The comment of "for me, this is not the test" was about the latter. Definitely don't get me wrong: there is no question that when it comes to assessing the end goal of improving real-world performance, workloads, particularly those designed from real-world usages, are absolutely critical for assessing improvements being made in a VM.

What do you mean by not mandating any requirement at all? That we not specify out-of-range handling? Or that we allow operations on different platforms to behave differently? One of the requirements to be able to standardize Wasm SIMD as it is now is the focus on minimizing non-determinism and formally specifying behavior to the extent possible. There is also portability, which has been talked about across different issues: the behavior of operations has to be consistent across architectures, and the specified semantics have to guarantee portable performance. The discussion is documented in older meeting notes, so please take a look at #37 and the linked meeting notes for more context.

In general, while measuring performance cliffs, if this affects real-world applications I think we should evaluate them with care; but when it comes to the performance/non-determinism tradeoff, by design, to be able to standardize a SIMD proposal that is usable in a variety of contexts, this proposal should always be deterministic. Relaxing this criterion in favor of performance in a future add-on proposal is something we've discussed multiple times, and it is still on the roadmap.

Yes, the comments here were just put out there as suggestions for achieving a better balance with hopefully minimal invasiveness. These are thoughts, and there may be other suggestions. Certainly this issue is promising, and #93 has been discussed and can be followed up on. For the min/max issue, that quote "permitted but not required" - I did read that to mean that the behavior of NaN propagation is not guaranteed to be the same on all platforms … but maybe the SIMD min/max is not the context this applies to? Certainly the goal of minimizing non-determinism makes sense, but the suggestion I gave is to relax the criteria for min/max for this iteration.

I've adjusted the permissions, you should be able to see it now.

Yes I do … Thanks. And all my statements are only meant to find improvement, so thanks a lot for your comments!

@jlb6740 commented Mar 20, 2020

There is a high likelihood that web use cases and standalone/off-browser use cases evolve differently, with different priorities, requirements, and limitations. Comparison with native perf could be more relevant for Wasm standalone applications. AFAIK standalone uses were mostly out of scope for the portable-simd or simd.js specs that seeded the current proposal. Agree with @zeux that a spec meeting all the desired goodness will indeed be the holy grail here.

A way to address this problem would be to offer ops allowing developers more flexibility in fine-tuning their applications for the platforms they care about. In that sense, bitmask seems very handy to have.

@arunetm I agree. VMs improve and change, applications and usage patterns change; hardware is much slower to evolve. That mapping to hardware, if not 1:1, bakes in an inefficiency in lowering that is always there. If after tuning it is not easily seen in some workload, and the instruction is still being executed, that extra work is likely just manifested in some other metric. You best assess the ISA's relationship with hardware by comparing Wasm SIMD to native SIMD .. I think comparing Wasm SIMD to Wasm scalar won't really give you that insight. More instruction options such as bitmask may come in handy, as you say.

tlively added a commit to WebAssembly/binaryen that referenced this pull request Mar 20, 2020
Adds full support for the {i8x16,i16x8,i32x4}.abs instructions merged
to the SIMD proposal in WebAssembly/simd#128
as well as the {i8x16,i16x8,i32x4}.bitmask instructions proposed in
WebAssembly/simd#201.
@tlively (Member) commented Mar 23, 2020

FYI this is implemented in LLVM and Binaryen and can be used with tip-of-tree Emscripten using the clang builtins __builtin_wasm_bitmask_{i8x16,i16x8,i32x4}.
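A minimal usage sketch with those builtins (the GCC-style vector typedef, the exact builtin signature, and the function name are assumptions; compile with -msimd128):

typedef signed char v16i8 __attribute__((__vector_size__(16)));

// Returns the index of the first true lane of a comparison result, or -1;
// pairs the proposed i8x16.bitmask with ctz as described in the PR text.
int first_true_lane(v16i8 cmp) {
    int mask = __builtin_wasm_bitmask_i8x16(cmp); // i8x16.bitmask
    return mask ? __builtin_ctz(mask) : -1;
}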

arichardson pushed a commit to arichardson/llvm-project that referenced this pull request Apr 2, 2020
Summary:
These experimental new instructions are proposed in
WebAssembly/simd#201.

Reviewers: aheejin

Subscribers: dschuff, sbc100, jgravelle-google, hiraditya, sunfish, cfe-commits

Tags: #clang

Differential Revision: https://reviews.llvm.org/D76397
@jlb6740 commented May 1, 2020

@ngzhian Thanks! native/emulated ratio summary on ARM, based on the average of reported datapoints:

vertex decode: 1.01x
string search, gap 1: 0.99x
string search, gap 2: 1.01x
string search, gap 4: 1.06x
string search, gap 8: 1.01x
string search, gap 16: 1.21x
string search, gap 32: 1.69x
string search, gap 64: 1.61x

@zeux @ngzhian
Overall this data looks very good imho! How consistent is the data run to run? In reference to the discussion, is there any way, with the data already collected, to calculate the deviation to understand what's reasonable to consider noise?

@ngzhian (Member) commented May 5, 2020

I did 10 runs (each, emulated and native), concatenated the output, and dumped them all into a sheet: https://docs.google.com/spreadsheets/d/1F8_vb16_zpnIaHKkjxU8C1-HsaYJEmOfhhxAXDPA4gg/edit#gid=789530331.

@jlb6740 commented May 5, 2020

I did 10 runs (each, emulated and native), concatenated the output, and dumped them all into a sheet: https://docs.google.com/spreadsheets/d/1F8_vb16_zpnIaHKkjxU8C1-HsaYJEmOfhhxAXDPA4gg/edit#gid=789530331.

@ngzhian 👍 Thanks. Not sure if the sheet was intended to be protected, but I did request access.

@ngzhian (Member) commented May 6, 2020

Sorry, made it open!

@jlb6740 commented May 7, 2020

@ngzhian .. Thanks, I checked it out. The run data does not show too much variance. I assume it's sorted by time and that this was taken successively.

@ngzhian (Member) commented May 7, 2020

Oops, sorry, I messed up the sorting; I didn't mean to sort by times across all 10 runs. I updated the sheet with the actual raw results across 10 runs, so it's easier to see variance run to run. Please take a look again: https://docs.google.com/spreadsheets/d/1F8_vb16_zpnIaHKkjxU8C1-HsaYJEmOfhhxAXDPA4gg/edit#gid=937366701

@dtig (Member) commented May 20, 2020

Added an agenda item for the next CG meeting - WebAssembly/meetings#564. @zeux Would you be interested in driving this agenda item at the CG meeting?

@zeux (Contributor, Author) commented May 20, 2020

@dtig Works for me

@zeux (Contributor, Author) commented May 21, 2020

I've also rebased this against master with the new opcode numbering (using the slots left for bitmask in the renumbering pass).

@zeux (Contributor, Author) commented May 26, 2020

Unsure if I'm going to get an invite to the meeting, but fwiw I put together slides summarizing the PR discussion and performance measurements: https://docs.google.com/presentation/d/1OZSlYWWnm_pPFsjSu6iGmTQb8Qs0O2Wg2TU34-Fw-jQ/edit?usp=sharing

@dtig (Member) commented May 26, 2020

Thanks @zeux for presenting today, we have consensus from the CG to include this in the proposal.

@dtig (Member) left a comment:

Changes lgtm after resolving conflicts.

@taralx commented May 26, 2020

Another note: As far as I can tell, RISC-V also does not have good support for this operation. Then again, their SIMD set is still WIP.

zeux added 2 commits May 26, 2020 11:29
i8x16.bitmask and i32x4.bitmask directly map to SSE movemask
instructions; i16x8.bitmask can be synthesized using packs+movemask.

These instructions are important to be able to do lane-wise processing
after a vector comparison - for example, these can be used together with
ctz to find the index of the first lane with the matching values after a
comparison instruction.
@zeux (Contributor, Author) commented May 26, 2020

@dtig Thanks, resolved!

@taralx Yeah - to my knowledge, the Wasm SIMD proposal currently doesn't track the RISC-V SIMD spec. This would require a separate analysis; I suspect that some standardized instructions would have a similar problem (e.g. I don't think RISC-V supports variable byte shuffle, aka v8x16.swizzle).

@dtig dtig merged commit 710f870 into WebAssembly:master May 28, 2020
pull bot pushed a commit to p-g-krish/v8 that referenced this pull request Jun 2, 2020
This relands commit d04b5e4.

The fix here is in the assembler for pmovmskb, emit_optional_rex_32 should be
called after emitting the prefix byte.

Original change's description:
> [wasm-simd][liftoff][ia32][x64] Implement bitmask
>
> Implements i8x16 i16x8 i32x4 bitmask.
>
> This was merged into the proposal in
> WebAssembly/simd#201.
>
> Bug: v8:9909,v8:10308
> Change-Id: I882f0c2697213cdf593e745112e0897cee252009
> Reviewed-on: https://chromium-review.googlesource.com/c/v8/v8/+/2222607
> Commit-Queue: Zhi An Ng <[email protected]>
> Reviewed-by: Clemens Backes <[email protected]>
> Cr-Commit-Position: refs/heads/master@{#68090}

Bug: v8:9909, v8:10308
Change-Id: I4897585c86b87f72dc8f142b275171276d135a24
Reviewed-on: https://chromium-review.googlesource.com/c/v8/v8/+/2225090
Reviewed-by: Clemens Backes <[email protected]>
Commit-Queue: Zhi An Ng <[email protected]>
Cr-Commit-Position: refs/heads/master@{#68106}
pull bot pushed a commit to p-g-krish/v8 that referenced this pull request Jun 5, 2020
This reverts commit dfbbb4a.

Reason for revert: Bitmask added post 84 cut, so it is not part of origin trial. Therefore it is still a post-mvp.

Original change's description:
> [wasm-simd] Add bitmask to SIMD MVP
> 
> This removes the post-mvp flag for bitmask, since it was accepted into
> the proposal, see WebAssembly/simd#201.
> 
> Bug: v8:10308
> Change-Id: I4ced43a6484660125d773bc9de46bdea9f72b13b
> Reviewed-on: https://chromium-review.googlesource.com/c/v8/v8/+/2216532
> Reviewed-by: Deepti Gandluri <[email protected]>
> Commit-Queue: Zhi An Ng <[email protected]>
> Cr-Commit-Position: refs/heads/master@{#67993}

[email protected],[email protected]

# Not skipping CQ checks because original CL landed > 1 day ago.

Bug: v8:10308
Change-Id: I53294be4ea816f37c7cc5f545afb572538dd4770
Reviewed-on: https://chromium-review.googlesource.com/c/v8/v8/+/2233183
Reviewed-by: Zhi An Ng <[email protected]>
Reviewed-by: Deepti Gandluri <[email protected]>
Commit-Queue: Zhi An Ng <[email protected]>
Cr-Commit-Position: refs/heads/master@{#68216}
pull bot pushed a commit to p-g-krish/v8 that referenced this pull request Aug 27, 2020
Port 6da647f

Original Commit Message:

    Now that 86 has branched, we can move bitmask into the SIMD MVP, it will
    not affect the current OT. (We want any OT extension to include
    bitmask.)

    Bitmask was accepted into the proposal in
    WebAssembly/simd#201.

[email protected], [email protected], [email protected], [email protected]
BUG=
LOG=N

Change-Id: I7518e1e8d7513a6931ff026eb3089fa896a6b587
Reviewed-on: https://chromium-review.googlesource.com/c/v8/v8/+/2379227
Reviewed-by: Junliang Yan <[email protected]>
Commit-Queue: Milad Farazmand <[email protected]>
Cr-Commit-Position: refs/heads/master@{#69587}
yaozhongxiao pushed a commit to yaozhongxiao/std-simd that referenced this pull request May 12, 2021
…it(to #32373348)

Summary: the find_first_set and find_last_set methods are not optimal for NEON;
they need to be improved by synthesizing with horizontal adds (vaddv), which
will reduce the generated assembly code. In the following cases, vaddvq_s16
will generate 2 instructions but vpadd_s16 will generate 4 instructions
```
 #ifdef __aarch64__
   return vaddvq_s16(__asint);
   //  addv    h0, v1.8h
   //  smov    w1, v0.h[0]
 #else
   return vpadd_s16(
       vpadd_s16(vpadd_s16(__lo64(__asint), __hi64(__asint)), __zero),
       __zero)[0];

   // addp v1.8h,v1.8h,v2.8h
   // addp v1.8h,v1.8h,v2.8h
   // addp v1.8h,v1.8h,v2.8h
   // smov    w1, v1.h[0]
 #endif
```

Further discussion following the linking:
[1]. WebAssembly/simd#201
[2]. WebAssembly/simd#131

Test Plan: test_run.sh
Reviewers: chengbin.cb, liangbin.mj, yifeng.dongyifeng, longfei.alf,
           chuanqi.xcq
Issue: https://aone.alibaba-inc.com/req/32373348
CR: https://code.aone.alibaba-inc.com/cpp_libs/std-simd/codereview/4534679

Successfully merging this pull request may close these issues.

Generalise i8x16.any_true so as to support vectorised C strlen, strcmp, strstr, et al.
movemask instruction