
i32x4.dot_i16x8_s instruction #127

Merged
merged 2 commits into WebAssembly:master on Oct 19, 2020

Conversation

Maratyszcza (Contributor) commented Oct 28, 2019

Introduction

Integer arithmetic instructions in WebAssembly SIMD produce results of the same element type as their inputs. To avoid overflow, developers have to pre-extend inputs to wider element types and perform twice as many arithmetic operations on the twice-wider elements. The need for this work-around is particularly concerning for multiplications:

  • Unlike additions and subtractions, where overflow is an exceptional situation, multiplications naturally produce results twice as wide as their inputs, so overflowing the input type is the common case rather than the exception. Thus nearly every multiplication must be accompanied by pre-extending its inputs to a wider element type.
  • On both x86 and ARM, multiplications that produce twice-wider results are cheaper (typically by 2x) than multiplications on the wider type itself. The table below quantifies the throughput cost in cycles (from [1]) of the PMULLW + PMULHW combination (which together compute the 32-bit products of eight 16-bit input pairs) versus two PMULLD instructions (which together compute the 32-bit products of eight 32-bit input pairs) on various x86 microarchitectures:
Microarchitecture    2x PMULLD xmm, xmm    PMULLW xmm, xmm + PMULHW xmm, xmm
AMD Piledriver       4                     2
AMD Zen              4                     2
Intel Nehalem        4                     2
Intel Sandy Bridge   2                     2
Intel Haswell        4                     2
Intel Skylake        2                     1
Intel Goldmont       4                     2

Operations that produce twice-wider results would need to return two SIMD vectors, and would thus depend on the future multi-value proposal. To stay within baseline WebAssembly features, we have to aggregate the two wide results into one, so that the instruction produces a single output vector. Luckily, there is an aggregating operation directly supported in the x86, MIPS, and POWER instruction sets that can also be lowered efficiently on ARM and ARM64: addition of adjacent multiplication results. The resulting combination of a full multiplication and an addition of adjacent products can be interpreted as a dot product of 2-wide subvectors within a SIMD vector, producing elements twice as wide as the inputs (albeit half as many result elements as input elements).

This PR introduces 2-element dot product instructions with signed 16-bit integer input elements and signed 32-bit integer output elements. We don't consider other data types because they can't be expressed efficiently on x86 (e.g. the only multiplication on byte inputs on x86 multiplies signed bytes by unsigned bytes with signed saturation, which is too exotic to build a portable instruction on top of). The new i32x4.dot_i16x8_s instruction returns the dot product directly, and i32x4.dot_i16x8_add_s additionally accumulates it with a third input vector of 32-bit elements. The second instruction was added because accumulating dot-product results is common, and many instruction sets provide a specialized instruction for that case.
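
For concreteness, here is a non-normative scalar sketch in C of the intended lane-wise semantics (an illustration based on the PMADDWD analogy discussed in this thread; the function names are ours, not proposal text):

```c
#include <stdint.h>

/* Non-normative sketch: lane i of the i32x4 result is
   a[2i]*b[2i] + a[2i+1]*b[2i+1]. Each 16x16 product fits in 32 bits;
   the pairwise sum uses wrapping (two's-complement) 32-bit arithmetic.
   It can only wrap when all four 16-bit inputs of a pair are -32768,
   the same corner case documented for PMADDWD. */
static void i32x4_dot_i16x8_s(int32_t out[4],
                              const int16_t a[8], const int16_t b[8]) {
  for (int i = 0; i < 4; i++) {
    uint32_t p0 = (uint32_t)((int32_t)a[2 * i] * (int32_t)b[2 * i]);
    uint32_t p1 = (uint32_t)((int32_t)a[2 * i + 1] * (int32_t)b[2 * i + 1]);
    out[i] = (int32_t)(p0 + p1); /* wrapping add */
  }
}

/* The _add_s variant additionally accumulates a third i32x4 operand. */
static void i32x4_dot_i16x8_add_s(int32_t out[4], const int16_t a[8],
                                  const int16_t b[8], const int32_t c[4]) {
  for (int i = 0; i < 4; i++) {
    uint32_t p0 = (uint32_t)((int32_t)a[2 * i] * (int32_t)b[2 * i]);
    uint32_t p1 = (uint32_t)((int32_t)a[2 * i + 1] * (int32_t)b[2 * i + 1]);
    out[i] = (int32_t)(p0 + p1 + (uint32_t)c[i]);
  }
}
```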

[October 31 update] Applications

Below are examples of optimized libraries using close equivalents of the proposed i32x4.dot_i16x8_s and i32x4.dot_i16x8_add_s instructions:

Mapping to Common Instruction Sets

This section illustrates how the new WebAssembly instructions can be lowered on common instruction sets. These patterns are provided only for convenience; compliant WebAssembly implementations do not have to follow the same code generation patterns.

x86/x86-64 processors with AVX512VNNI and AVX512VL instruction sets

  • i32x4.dot_i16x8_add_s
    • c = i32x4.dot_i16x8_add_s(a, b, c) is lowered to VPDPWSSD xmm_c, xmm_a, xmm_b
    • y = i32x4.dot_i16x8_add_s(a, b, c) is lowered to VMOVDQA xmm_y, xmm_c + VPDPWSSD xmm_y, xmm_a, xmm_b

x86/x86-64 processors with XOP instruction set

  • i32x4.dot_i16x8_add_s
    • y = i32x4.dot_i16x8_add_s(a, b, c) is lowered to VPMADCSWD xmm_y, xmm_a, xmm_b, xmm_c

x86/x86-64 processors with AVX instruction set

  • i32x4.dot_i16x8_s
    • y = i32x4.dot_i16x8_s(a, b) is lowered to VPMADDWD xmm_y, xmm_a, xmm_b
  • i32x4.dot_i16x8_add_s
    • y = i32x4.dot_i16x8_add_s(a, b, c) is lowered to VPMADDWD xmm_tmp, xmm_a, xmm_b + VPADDD xmm_y, xmm_tmp, xmm_c

x86/x86-64 processors with SSE2 instruction set

  • i32x4.dot_i16x8_s
    • a = i32x4.dot_i16x8_s(a, b) is lowered to PMADDWD xmm_a, xmm_b
    • y = i32x4.dot_i16x8_s(a, b) is lowered to MOVDQA xmm_y, xmm_a + PMADDWD xmm_y, xmm_b
  • i32x4.dot_i16x8_add_s
    • c = i32x4.dot_i16x8_add_s(a, b, c) is lowered to MOVDQA xmm_tmp, xmm_a + PMADDWD xmm_tmp, xmm_b + PADDD xmm_c, xmm_tmp
    • y = i32x4.dot_i16x8_add_s(a, b, c) is lowered to MOVDQA xmm_y, xmm_a + PMADDWD xmm_y, xmm_b + PADDD xmm_y, xmm_c

ARM64 processors

  • i32x4.dot_i16x8_s
    • y = i32x4.dot_i16x8_s(a, b) is lowered to:
      • SMULL Vtmp.4S, Va.4H, Vb.4H
      • SMULL2 Vtmp2.4S, Va.8H, Vb.8H
      • ADDP Vy.4S, Vtmp.4S, Vtmp2.4S
  • i32x4.dot_i16x8_add_s
    • c = i32x4.dot_i16x8_add_s(a, b, c) is lowered to:
      • SMULL Vtmp.4S, Va.4H, Vb.4H
      • SMULL2 Vtmp2.4S, Va.8H, Vb.8H
      • ADDP Vtmp.4S, Vtmp.4S, Vtmp2.4S
      • ADD Vc.4S, Vc.4S, Vtmp.4S
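
For reference, the non-accumulating pattern above can be expressed as a NEON-intrinsics sketch in C (AArch64-only: vmull_high_s16 and vpaddq_s32 are A64 intrinsics; the helper name is illustrative):

```c
#include <arm_neon.h>

/* AArch64 sketch of the lowering above: widening multiplies of the low
   and high halves, then a pairwise add of adjacent 32-bit products. */
static int32x4_t dot_i16x8(int16x8_t a, int16x8_t b) {
  int32x4_t lo = vmull_s16(vget_low_s16(a), vget_low_s16(b)); /* SMULL  */
  int32x4_t hi = vmull_high_s16(a, b);                        /* SMULL2 */
  return vpaddq_s32(lo, hi);                                  /* ADDP   */
}
```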

ARMv7 processors with NEON instruction set

  • i32x4.dot_i16x8_s
    • y = i32x4.dot_i16x8_s(a, b) is lowered to:
      • VMULL.S16 Qtmp, Da_lo, Db_lo
      • VMULL.S16 Qtmp2, Da_hi, Db_hi
      • VPADD.I32 Dy_lo, Dtmp_lo, Dtmp_hi
      • VPADD.I32 Dy_hi, Dtmp2_lo, Dtmp2_hi
  • i32x4.dot_i16x8_add_s
    • c = i32x4.dot_i16x8_add_s(a, b, c) is lowered to:
      • VMULL.S16 Qtmp, Da_lo, Db_lo
      • VMULL.S16 Qtmp2, Da_hi, Db_hi
      • VPADD.I32 Dtmp_lo, Dtmp_lo, Dtmp_hi
      • VPADD.I32 Dtmp_hi, Dtmp2_lo, Dtmp2_hi
      • VADD.I32 Qc, Qc, Qtmp

POWER processors with VMX (Altivec) instruction set

  • i32x4.dot_i16x8_s
    • y = i32x4.dot_i16x8_s(a, b) is lowered to VXOR VRy, VRy, VRy + VMSUMSHM VRy, VRa, VRb, VRy
  • i32x4.dot_i16x8_add_s
    • y = i32x4.dot_i16x8_add_s(a, b, c) is lowered to VMSUMSHM VRy, VRa, VRb, VRc

MIPS processors with MSA instruction set

  • i32x4.dot_i16x8_s
    • y = i32x4.dot_i16x8_s(a, b) is lowered to DOTP_S.W Wy, Wa, Wb
  • i32x4.dot_i16x8_add_s
    • c = i32x4.dot_i16x8_add_s(a, b, c) is lowered to DPADD_S.W Wc, Wa, Wb
    • y = i32x4.dot_i16x8_add_s(a, b, c) is lowered to MOVE.V Wy, Wc + DPADD_S.W Wy, Wa, Wb

References

[1] Fog, A. "Instruction Tables." 2019. URL: www.agner.org/optimize/instruction_tables.pdf

dtig (Member) commented Oct 30, 2019

Thanks for the detailed write-up. What do you think about paring this down to just the dot2_s operations, leaving out the dot2add_s operations? The dot2_s operations as detailed here are useful to have, but the dot2add_s operations seem to me out of scope for the MVP of the SIMD proposal.

The i16x8.dot2add_s operation maps directly to an instruction only on x86-64 with XOP enabled; on most other architectures, the codegen would be equivalent to generating an add operation after the dot2_s operation.

penzn (Contributor) commented Oct 31, 2019

We have had two expansions of the instruction set that specifically address intermediate overflow: load with extend (#98) and widening operations (part of #89). Load with extend, in particular, absorbs the cost of extending into the load operation. Is there specific code that regresses without this extension?

tlively (Member) commented Oct 31, 2019

For the instruction naming, the type prefix should be i32x4, because those prefixes always specify the output type. Also, there is no need for the _s suffix, because there is no accompanying unsigned variant. Finally, we have avoided using numerals in instruction names thus far, and I would like to maintain that. How about i32x4.dot or i32x4.dot_prod instead?

Maratyszcza (Contributor, Author)

@dtig I updated the PR description with the list of codebases using _mm_madd_epi16 (the SSE2 intrinsic for PMADDWD) or vec_msum (the VMX/Altivec intrinsic for VMSUMSHM). Nearly all uses of _mm_madd_epi16 involve _mm_add_epi32 (SIMD addition) on its result (on POWER, all cases involve addition, because POWER doesn't provide a variant without accumulation of the dot-product result). Thus, a "2-wide dot product with addition" instruction would be useful in practice and would deliver speedups at least on some processors (an intrinsics sketch of the idiom follows this list):

  • The most recent Intel processors with AVX512-VNNI and AVX512-VL, e.g. Cascade Lake and Ice Lake.
  • AMD processors with XOP (Bulldozer/Piledriver/Steamroller/Excavator cores)
  • All POWER processors with SIMD capabilities
  • All MIPS processors with MSA instruction set
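
To illustrate the idiom, here is a minimal SSE2 sketch in C (_mm_madd_epi16, _mm_add_epi32, and the other intrinsics are real SSE2 intrinsics; the helper itself is hypothetical and assumes n is a multiple of 8):

```c
#include <emmintrin.h> /* SSE2 */
#include <stddef.h>
#include <stdint.h>

/* Dot product of two int16 arrays: PMADDWD forms the 2-wide dot
   products, PADDD accumulates them. This is the pattern that
   i32x4.dot_i16x8_add_s would capture in a single instruction. */
static int32_t dot_i16(const int16_t *a, const int16_t *b, size_t n) {
  __m128i acc = _mm_setzero_si128();
  for (size_t i = 0; i < n; i += 8) { /* assumes n % 8 == 0 */
    __m128i va = _mm_loadu_si128((const __m128i *)(a + i));
    __m128i vb = _mm_loadu_si128((const __m128i *)(b + i));
    acc = _mm_add_epi32(acc, _mm_madd_epi16(va, vb)); /* madd + add */
  }
  /* Horizontally reduce the four 32-bit partial sums. */
  acc = _mm_add_epi32(acc, _mm_shuffle_epi32(acc, _MM_SHUFFLE(1, 0, 3, 2)));
  acc = _mm_add_epi32(acc, _mm_shuffle_epi32(acc, _MM_SHUFFLE(2, 3, 0, 1)));
  return _mm_cvtsi128_si32(acc);
}
```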

Maratyszcza (Contributor, Author)

@penzn I added a list of applications to the PR description. Compared to using load-with-extend and 32x32->32 multiplications, these "dot product" operations have several performance advantages (a sketch of the widening workaround follows this list):

  • "Load with extend" doesn't exist on ARM and would be lowered into two instructions.
  • On most x86 microarchitectures, a 32x32->32 multiplication is twice as expensive (in throughput) as PMADDWD. E.g. Intel Skylake can issue two PMADDWD instructions per cycle, but only one PMULLD instruction per cycle.
  • On ARM Cortex-A72/A73/A75/etc., a 32x32->32 multiplication is twice as expensive (in throughput) as the 16x16->32 multiplication used to simulate i16x8.dot2.
  • Not only is the 32x32->32 multiplication more expensive, we would also need twice as many 32x32->32 multiplication instructions to perform the same number of multiplications as i16x8.dot2.
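
Here is a sketch of that widening workaround using SSE4.1 intrinsics (PMOVSXWD via _mm_cvtepi16_epi32 and PMULLD via _mm_mullo_epi32; the helper name is illustrative):

```c
#include <smmintrin.h> /* SSE4.1 */

/* Widening workaround: two sign-extensions per operand and two
   32x32->32 multiplies to get the same eight 32-bit products that a
   single PMADDWD consumes before its pairwise sum. */
static void widen_mul_i16x8(__m128i a, __m128i b,
                            __m128i *prod_lo, __m128i *prod_hi) {
  __m128i a_lo = _mm_cvtepi16_epi32(a);                    /* lanes 0..3 */
  __m128i b_lo = _mm_cvtepi16_epi32(b);
  __m128i a_hi = _mm_cvtepi16_epi32(_mm_srli_si128(a, 8)); /* lanes 4..7 */
  __m128i b_hi = _mm_cvtepi16_epi32(_mm_srli_si128(b, 8));
  *prod_lo = _mm_mullo_epi32(a_lo, b_lo); /* PMULLD: the expensive part */
  *prod_hi = _mm_mullo_epi32(a_hi, b_hi);
}
```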

@Maratyszcza Maratyszcza changed the title i16x8.dot2_s and i16x8.dot2acc_s instructions i32x4.dot2_s and i32x4.dot2acc_s instructions Nov 1, 2019
@Maratyszcza Maratyszcza changed the title i32x4.dot2_s and i32x4.dot2acc_s instructions i32x4.dot2_s and i32x4.dot2_acc_s instructions Nov 1, 2019
Maratyszcza (Contributor, Author) commented Nov 1, 2019

@tlively Good point about the i32x4 output type; I renamed the instructions to i32x4.dot2_s and i32x4.dot2_add_s accordingly. As for the _s suffix and the 2 in the name, I think it is best to keep them to allow for future extensions:

  • The AVX512-VNNI and NEON DOT extensions provide instructions for a dot product of four 8-bit integers with accumulation into 32 bits. Thus, we might need to distinguish between dot2 and dot4 in the future.
  • It is also possible that we'd want unsigned variants in the future and would need to distinguish between dot2_s and dot2_u. Unsigned variants wouldn't lower as nicely across all architectures, so I left them out of this proposal.

tlively (Member) commented Nov 1, 2019

If we may need to differentiate between different input types, how about i32x4.dot_i16x8_s?

@Maratyszcza Maratyszcza changed the title i32x4.dot2_s and i32x4.dot2_acc_s instructions i32x4.dot_i16x8_s and i32x4.dot_i16x8_add_s instructions Nov 1, 2019
Maratyszcza (Contributor, Author)

@tlively Sounds reasonable; there are already instructions with similar names. Updated the commit & PR description.

tlively added a commit to llvm/llvm-project that referenced this pull request Nov 1, 2019
Summary:
This instruction is not merged to the spec proposal, but we need it to
be implemented in the toolchain to experiment with it. It is available
only on an opt-in basis through a clang builtin.

Defined in WebAssembly/simd#127.

Depends on D69696.

Reviewers: aheejin

Subscribers: dschuff, sbc100, jgravelle-google, hiraditya, sunfish, cfe-commits, llvm-commits

Tags: #clang, #llvm

Differential Revision: https://reviews.llvm.org/D69697
tlively (Member) commented Nov 2, 2019

@Maratyszcza Could you add pseudocode for the semantics of these operations as well? I want to make sure I implement the interpreter correctly in Binaryen.

Maratyszcza (Contributor, Author)

@tlively I don't quite understand the pseudo-code specification style in Wasm SIMD, especially given that these "dot product" instructions are among the few with a "horizontal" component to them. You may refer to the PMADDWD instruction in the Intel architecture manual, which is the analog of i32x4.dot_i16x8_s.

tlively added a commit to tlively/binaryen that referenced this pull request Nov 4, 2019
This experimental instruction is specified in
WebAssembly/simd#127 and is being implemented
to enable further investigation of its performance impact.
tlively added a commit to tlively/binaryen that referenced this pull request Nov 4, 2019
This experimental instruction is specified in
WebAssembly/simd#127 and is being implemented
to enable further investigation of its performance impact.
tlively added a commit to WebAssembly/binaryen that referenced this pull request Nov 4, 2019
This experimental instruction is specified in
WebAssembly/simd#127 and is being implemented
to enable further investigation of its performance impact.
arichardson pushed a commit to arichardson/llvm-project that referenced this pull request Nov 16, 2019
tlively (Member) commented Dec 17, 2019

The opcodes in this PR collide with the opcodes used for the {i8x16,i16x8}.avgr_u instructions. Since the averaging instructions have been merged, I will reassign the opcodes for the dot product instructions to 0xdb and 0xdc in the LLVM and Binaryen implementations.

Edit: I forgot that only one dot product instruction was implemented. i32x4.dot_i16x8_s will have opcode 0xdb.

@Maratyszcza
Copy link
Contributor Author

@tlively Renumbered the opcodes in the PR to match LLVM.

bjacob commented May 5, 2020

I would like to +1 this request, especially the 8-bit by 8-bit flavor accumulating into 32 bits, as in VNNI. It would be a necessary prerequisite to even consider targeting WebAssembly for integer-quantized neural-network inference code. Without such an instruction, 8-bit quantization of neural networks simply won't provide a meaningful computational advantage over float; people would merely 8-bit-quantize to shrink the download size, but then dequantize to float for the client-side computation.

On the ARM side, the VNNI-equivalent instruction is SDOT/UDOT, which is available in currently shipping ARM CPUs and Android devices such as the Pixel 4, and now in lower-end cores as well (Cortex-A55), so this is a present issue, not a future one. It is a 4x speed difference today.

(For both points made above, see this data).

Example production code using these instructions (used by TensorFlow Lite; the machine encodings are there to support older assemblers; an intrinsics sketch follows the link):
https://github.com/google/ruy/blob/57e64b4c8f32e813ce46eb495d13ee301826e498/ruy/kernel_arm64.cc#L3066
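
For reference, the SDOT idiom described above looks like this with NEON intrinsics (a sketch assuming a compiler targeting ARMv8.2-A with the dotprod extension; the helper name is illustrative):

```c
#include <arm_neon.h> /* requires e.g. -march=armv8.2-a+dotprod */

/* SDOT: for each 32-bit lane of acc, multiplies the corresponding four
   signed 8-bit lanes of a and b and accumulates their sum into acc. */
static int32x4_t dot4_accumulate(int32x4_t acc, int8x16_t a, int8x16_t b) {
  return vdotq_s32(acc, a, b);
}
```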

moz-v2v-gh pushed a commit to mozilla/gecko-dev that referenced this pull request Oct 14, 2020
…ccepted status. r=jseward

Background: WebAssembly/simd#127

For the widening dot product instruction:

- remove the internal 'Experimental' opcode suffix in the C++ code
- remove the guard on the instruction in all the C++ decoders
- move the test cases from simd/experimental.js to simd/ad-hack.js

I have checked that current V8 and wasm-tools use the same opcode
mapping.  V8 in turn guarantees the correct mapping for LLVM and
binaryen.

Differential Revision: https://phabricator.services.mozilla.com/D92929
jamienicol pushed a commit to jamienicol/gecko that referenced this pull request Oct 15, 2020
omnisip added a commit to omnisip/simd that referenced this pull request Oct 15, 2020
ngzhian (Member) commented Oct 19, 2020

There is no pseudocode for this op, but I think the text description is straightforward enough. Merging.

@ngzhian ngzhian merged commit 1cfd484 into WebAssembly:master Oct 19, 2020
bmeurer added a commit to bmeurer/wasmparser that referenced this pull request Oct 24, 2020
bmeurer added a commit to wasdk/wasmparser that referenced this pull request Oct 24, 2020
julian-seward1

@Maratyszcza are there any .wast testcases available for i32x4.dot_i16x8_s? I looked around the repo but didn't find any -- I may have looked in the wrong place though.

julian-seward1 added a commit to julian-seward1/wasmtime that referenced this pull request Oct 27, 2020
This patch implements, for aarch64, the following wasm SIMD extensions

  i32x4.dot_i16x8_s instruction
  WebAssembly/simd#127

It also updates dependencies as follows, in order that the new instruction can
be parsed, decoded, etc:

  wat          to  1.0.27
  wast         to  26.0.1
  wasmparser   to  0.65.0
  wasmprinter  to  0.2.12

The changes are straightforward:

* new CLIF instruction `widening_pairwise_dot_product_s`

* translation from wasm into `widening_pairwise_dot_product_s`

* new AArch64 instructions `smull`, `smull2` (part of the `VecRRR` group)

* translation from `widening_pairwise_dot_product_s` to `smull ; smull2 ; addv`

There is no testcase in this commit, because that is a separate repo.  The
implementation has been tested, nevertheless.
Maratyszcza (Contributor, Author)

Not that I'm aware of. @ngzhian and @tlively probably know better.

ngzhian (Member) commented Oct 27, 2020

Not yet; this is a pretty new instruction, so it's not implemented in the interpreter yet. Of course, contributions are welcome; let me know if you're interested (in contributing the implementation, the tests, or both).

julian-seward1 added a commit to julian-seward1/wasmtime that referenced this pull request Nov 3, 2020
julian-seward1 added a commit to bytecodealliance/wasmtime that referenced this pull request Nov 3, 2020
ambroff pushed a commit to ambroff/gecko that referenced this pull request Nov 4, 2020
Implement some of the experimental SIMD opcodes that are supported by
all of V8, LLVM, and Binaryen, for maximum compatibility with test
content we might be exposed to.  Most/all of these will probably make
it into the spec, as they lead to substantial speedups in some
programs, and they are deterministic.

For spec and cpu mapping details, see:

WebAssembly/simd#122 (pmax/pmin)
WebAssembly/simd#232 (rounding)
WebAssembly/simd#127 (dot product)
WebAssembly/simd#237 (load zero)

The wasm bytecode values used here come from the binaryen changes that
are linked from those tickets, that's the best documentation right
now.  Current binaryen opcode mappings are here:
https://github.com/WebAssembly/binaryen/blob/master/src/wasm-binary.h

Also: Drive-by fix for signatures of vroundss and vroundsd, these are
unary operations and should follow the conventions for these with
src/dest arguments, not src0/src1/dest.

Also: Drive-by fix to add variants of vmovss and vmovsd on x64 that
take Operand source and FloatRegister destination.

Differential Revision: https://phabricator.services.mozilla.com/D85982
ambroff pushed a commit to ambroff/gecko that referenced this pull request Nov 4, 2020
ngzhian added a commit to ngzhian/simd that referenced this pull request Nov 4, 2020
It multiplies respective lanes from the 2 input operands, then adds
adjacent lanes.

This was merged into the proposal in WebAssembly#127.
ngzhian added a commit that referenced this pull request Nov 4, 2020
cfallin pushed a commit to bytecodealliance/wasmtime that referenced this pull request Nov 30, 2020
ngzhian added a commit to ngzhian/simd that referenced this pull request Feb 18, 2021
This instruction was added in WebAssembly#127.
ngzhian added a commit that referenced this pull request Feb 24, 2021
This instruction was added in #127.

Co-authored-by: Andreas Rossberg <[email protected]>