
i32x4.dot_i16x8_s instruction #127

Merged
merged 2 commits into WebAssembly:master on Oct 19, 2020

Conversation

Maratyszcza (Contributor) commented Oct 28, 2019

Introduction

Integer arithmetic instructions in WebAssembly SIMD produce results of the same element type as their inputs. To avoid overflow, developers have to pre-extend inputs to wider element types and perform twice as many arithmetic operations on the twice-wider elements. The need for this work-around is particularly concerning for multiplications:

  • Unlike additions and subtractions, where overflow is an exceptional situation, multiplications naturally produce results twice as wide as their inputs, so overflowing the input type is the common case rather than the exception. Thus nearly every multiplication must be accompanied by pre-extending its inputs to a wider element type.
  • On both x86 and ARM, multiplications that produce twice-wider results are cheaper (typically by 2x) than multiplications on the wider type itself. The table below quantifies the throughput cost in cycles (from [1]) of the PMULLW + PMULHW combination (which together compute the 32-bit products of eight 16-bit input pairs) versus two PMULLD instructions (which together compute the 32-bit products of eight 32-bit input pairs) on various x86 microarchitectures:
Microarchitecture    2x PMULLD xmm, xmm    PMULLW xmm, xmm + PMULHW xmm, xmm
AMD Piledriver       4                     2
AMD Zen              4                     2
Intel Nehalem        4                     2
Intel Sandy Bridge   2                     2
Intel Haswell        4                     2
Intel Skylake        2                     1
Intel Goldmont       4                     2

Operations that produce twice-wider results would need to return two SIMD vectors, and would thus depend on the future multi-value proposal. To stay within baseline WebAssembly features, we have to aggregate the two wide results into one, so that the instruction produces a single output vector. Luckily, there is an aggregating operation directly supported in the x86, MIPS, and POWER instruction sets that can also be lowered efficiently on ARM and ARM64: addition of adjacent multiplication results. The resulting combination of a full multiplication and an addition of adjacent products can be interpreted as a dot product of 2-wide subvectors within a SIMD vector, producing elements twice as wide as the inputs (albeit half as many result elements as input elements).

This PR introduces 2-element dot product instructions with signed 16-bit integer input elements and signed 32-bit integer output elements. We don't consider other data types because they can't be expressed efficiently on x86 (e.g. the only multiplication on byte inputs on x86 multiplies signed bytes by unsigned bytes with signed saturation, which is too exotic to build a portable instruction on top of). The new i32x4.dot_i16x8_s instruction returns the dot product directly, and i32x4.dot_i16x8_add_s additionally accumulates it with a third input vector of 32-bit elements. The second instruction was added because accumulating dot-product results is common, and many instruction sets provide a specialized instruction for that case.
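
For concreteness, here is a non-normative scalar sketch in C of the intended lane-wise semantics (an illustration based on the PMADDWD analogy discussed in this thread; the function names are ours, not proposal text):

```c
#include <stdint.h>

/* Non-normative sketch: lane i of the i32x4 result is
   a[2i]*b[2i] + a[2i+1]*b[2i+1]. Each 16x16 product fits in 32 bits;
   the pairwise sum uses wrapping (two's-complement) 32-bit arithmetic.
   It can only wrap when all four 16-bit inputs of a pair are -32768,
   the same corner case documented for PMADDWD. */
static void i32x4_dot_i16x8_s(int32_t out[4],
                              const int16_t a[8], const int16_t b[8]) {
  for (int i = 0; i < 4; i++) {
    uint32_t p0 = (uint32_t)((int32_t)a[2 * i] * (int32_t)b[2 * i]);
    uint32_t p1 = (uint32_t)((int32_t)a[2 * i + 1] * (int32_t)b[2 * i + 1]);
    out[i] = (int32_t)(p0 + p1); /* wrapping add */
  }
}

/* The _add_s variant additionally accumulates a third i32x4 operand. */
static void i32x4_dot_i16x8_add_s(int32_t out[4], const int16_t a[8],
                                  const int16_t b[8], const int32_t c[4]) {
  for (int i = 0; i < 4; i++) {
    uint32_t p0 = (uint32_t)((int32_t)a[2 * i] * (int32_t)b[2 * i]);
    uint32_t p1 = (uint32_t)((int32_t)a[2 * i + 1] * (int32_t)b[2 * i + 1]);
    out[i] = (int32_t)(p0 + p1 + (uint32_t)c[i]);
  }
}
```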

[October 31 update] Applications

Below are examples of optimized libraries using close equivalents of the proposed i32x4.dot_i16x8_s and i32x4.dot_i16x8_add_s instructions:

Mapping to Common Instruction Sets

This section illustrates how the new WebAssembly instructions can be lowered on common instruction sets. These patterns are provided only for convenience; compliant WebAssembly implementations do not have to follow the same code generation patterns.

x86/x86-64 processors with AVX512VNNI and AVX512VL instruction sets

  • i32x4.dot_i16x8_add_s
    • c = i32x4.dot_i16x8_add_s(a, b, c) is lowered to VPDPWSSD xmm_c, xmm_a, xmm_b
    • y = i32x4.dot_i16x8_add_s(a, b, c) is lowered to VMOVDQA xmm_y, xmm_c + VPDPWSSD xmm_y, xmm_a, xmm_b

x86/x86-64 processors with XOP instruction set

  • i32x4.dot_i16x8_add_s
    • y = i32x4.dot_i16x8_add_s(a, b, c) is lowered to VPMADCSWD xmm_y, xmm_a, xmm_b, xmm_c

x86/x86-64 processors with AVX instruction set

  • i32x4.dot_i16x8_s
    • y = i32x4.dot_i16x8_s(a, b) is lowered to VPMADDWD xmm_y, xmm_a, xmm_b
  • i32x4.dot_i16x8_add_s
    • y = i32x4.dot_i16x8_add_s(a, b, c) is lowered to VPMADDWD xmm_tmp, xmm_a, xmm_b + VPADDD xmm_y, xmm_tmp, xmm_c

x86/x86-64 processors with SSE2 instruction set

  • i32x4.dot_i16x8_s
    • a = i32x4.dot_i16x8_s(a, b) is lowered to PMADDWD xmm_a, xmm_b
    • y = i32x4.dot_i16x8_s(a, b) is lowered to MOVDQA xmm_y, xmm_a + PMADDWD xmm_y, xmm_b
  • i32x4.dot_i16x8_add_s
    • c = i32x4.dot_i16x8_add_s(a, b, c) is lowered to MOVDQA xmm_tmp, xmm_a + PMADDWD xmm_tmp, xmm_b + PADDD xmm_c, xmm_tmp
    • y = i32x4.dot_i16x8_add_s(a, b, c) is lowered to MOVDQA xmm_y, xmm_a + PMADDWD xmm_y, xmm_b + PADDD xmm_y, xmm_c

ARM64 processors

  • i32x4.dot_i16x8_s
    • y = i32x4.dot_i16x8_s(a, b) is lowered to:
      • SMULL Vtmp.4S, Va.4H, Vb.4H
      • SMULL2 Vtmp2.4S, Va.8H, Vb.8H
      • ADDP Vy.4S, Vtmp.4S, Vtmp2.4S
  • i32x4.dot_i16x8_add_s
    • c = i32x4.dot_i16x8_add_s(a, b, c) is lowered to:
      • SMULL Vtmp.4S, Va.4H, Vb.4H
      • SMULL2 Vtmp2.4S, Va.8H, Vb.8H
      • ADDP Vtmp.4S, Vtmp.4S, Vtmp2.4S
      • ADD Vc.4S, Vc.4S, Vtmp.4S
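
For reference, the non-accumulating pattern above can be expressed as a NEON-intrinsics sketch in C (AArch64-only: vmull_high_s16 and vpaddq_s32 are A64 intrinsics; the helper name is illustrative):

```c
#include <arm_neon.h>

/* AArch64 sketch of the lowering above: widening multiplies of the low
   and high halves, then a pairwise add of adjacent 32-bit products. */
static int32x4_t dot_i16x8(int16x8_t a, int16x8_t b) {
  int32x4_t lo = vmull_s16(vget_low_s16(a), vget_low_s16(b)); /* SMULL  */
  int32x4_t hi = vmull_high_s16(a, b);                        /* SMULL2 */
  return vpaddq_s32(lo, hi);                                  /* ADDP   */
}
```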

ARMv7 processors with NEON instruction set

  • i32x4.dot_i16x8_s
    • y = i32x4.dot_i16x8_s(a, b) is lowered to:
      • VMULL.S16 Qtmp, Da_lo, Db_lo
      • VMULL.S16 Qtmp2, Da_hi, Db_hi
      • VPADD.I32 Dy_lo, Dtmp_lo, Dtmp_hi
      • VPADD.I32 Dy_hi, Dtmp2_lo, Dtmp2_hi
  • i32x4.dot_i16x8_add_s
    • c = i32x4.dot_i16x8_add_s(a, b, c) is lowered to:
      • VMULL.S16 Qtmp, Da_lo, Db_lo
      • VMULL.S16 Qtmp2, Da_hi, Db_hi
      • VPADD.I32 Dtmp_lo, Dtmp_lo, Dtmp_hi
      • VPADD.I32 Dtmp_hi, Dtmp2_lo, Dtmp2_hi
      • VADD.I32 Qc, Qc, Qtmp

POWER processors with VMX (Altivec) instruction set

  • i32x4.dot_i16x8_s
    • y = i32x4.dot_i16x8_s(a, b) is lowered to VXOR VRy, VRy, VRy + VMSUMSHM VRy, VRa, VRb, VRy
  • i32x4.dot_i16x8_add_s
    • y = i32x4.dot_i16x8_add_s(a, b, c) is lowered to VMSUMSHM VRy, VRa, VRb, VRc

MIPS processors with MSA instruction set

  • i32x4.dot_i16x8_s
    • y = i32x4.dot_i16x8_s(a, b) is lowered to DOTP_S.W Wy, Wa, Wb
  • i32x4.dot_i16x8_add_s
    • c = i32x4.dot_i16x8_add_s(a, b, c) is lowered to DPADD_S.W Wc, Wa, Wb
    • y = i32x4.dot_i16x8_add_s(a, b, c) is lowered to MOVE.V Wy, Wc + DPADD_S.W Wy, Wa, Wb

References

[1] Fog, A. "Instruction Tables." 2019. URL: www.agner.org/optimize/instruction_tables.pdf

dtig (Member) commented Oct 30, 2019

Thanks for the detailed write-up. What do you think about paring this down to just the dot2_s operations, leaving out the dot2add_s operations? The dot2_s operations as detailed here are useful to have, but the dot2add_s operations seem to me out of scope for the MVP of the SIMD proposal.

The i16x8.dot2add_s operation maps directly to an instruction only on x86-64 with XOP enabled; on most other architectures, the codegen would be equivalent to generating an add operation after the dot2_s operation.

penzn (Contributor) commented Oct 31, 2019

We have had two expansions of the instruction set that specifically address intermediate overflow: load with extend (#98) and widening operations (part of #89). Load with extend, in particular, absorbs the cost of extending into the load operation. Is there specific code that regresses without this extension?

tlively (Member) commented Oct 31, 2019

For the instruction naming, the type prefix should be i32x4, because those prefixes always specify the output type. Also, there is no need for the _s suffix, because there is no accompanying unsigned variant. Finally, we have avoided using numerals in instruction names thus far, and I would like to maintain that. How about i32x4.dot or i32x4.dot_prod instead?

Maratyszcza (Contributor, Author)

@dtig I updated the PR description with the list of codebases using _mm_madd_epi16 (the SSE2 intrinsic for PMADDWD) or vec_msum (the VMX/Altivec intrinsic for VMSUMSHM). Nearly all uses of _mm_madd_epi16 involve _mm_add_epi32 (SIMD addition) on its result (on POWER, all cases involve addition, because POWER doesn't provide a variant without accumulation of the dot-product result). Thus, a "2-wide dot product with addition" instruction would be useful in practice and would deliver speedups at least on some processors (an intrinsics sketch of the idiom follows this list):

  • The most recent Intel processors with AVX512-VNNI and AVX512-VL, e.g. Cascade Lake and Ice Lake.
  • AMD processors with XOP (Bulldozer/Piledriver/Steamroller/Excavator cores)
  • All POWER processors with SIMD capabilities
  • All MIPS processors with MSA instruction set
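
To illustrate the idiom, here is a minimal SSE2 sketch in C (_mm_madd_epi16, _mm_add_epi32, and the other intrinsics are real SSE2 intrinsics; the helper itself is hypothetical and assumes n is a multiple of 8):

```c
#include <emmintrin.h> /* SSE2 */
#include <stddef.h>
#include <stdint.h>

/* Dot product of two int16 arrays: PMADDWD forms the 2-wide dot
   products, PADDD accumulates them. This is the pattern that
   i32x4.dot_i16x8_add_s would capture in a single instruction. */
static int32_t dot_i16(const int16_t *a, const int16_t *b, size_t n) {
  __m128i acc = _mm_setzero_si128();
  for (size_t i = 0; i < n; i += 8) { /* assumes n % 8 == 0 */
    __m128i va = _mm_loadu_si128((const __m128i *)(a + i));
    __m128i vb = _mm_loadu_si128((const __m128i *)(b + i));
    acc = _mm_add_epi32(acc, _mm_madd_epi16(va, vb)); /* madd + add */
  }
  /* Horizontally reduce the four 32-bit partial sums. */
  acc = _mm_add_epi32(acc, _mm_shuffle_epi32(acc, _MM_SHUFFLE(1, 0, 3, 2)));
  acc = _mm_add_epi32(acc, _mm_shuffle_epi32(acc, _MM_SHUFFLE(2, 3, 0, 1)));
  return _mm_cvtsi128_si32(acc);
}
```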

Maratyszcza (Contributor, Author)

@penzn I added a list of applications to the PR description. Compared to using load-with-extend and 32x32->32 multiplications, these "dot product" operations have several performance advantages (a sketch of the widening workaround follows this list):

  • "Load with extend" doesn't exist on ARM and would be lowered into two instructions.
  • On most x86 microarchitectures, a 32x32->32 multiplication is twice as expensive (in throughput) as PMADDWD. E.g. Intel Skylake can issue two PMADDWD instructions per cycle, but only one PMULLD instruction per cycle.
  • On ARM Cortex-A72/A73/A75/etc., a 32x32->32 multiplication is twice as expensive (in throughput) as the 16x16->32 multiplication used to simulate i16x8.dot2.
  • Not only is the 32x32->32 multiplication more expensive, we would also need twice as many 32x32->32 multiplication instructions to perform the same number of multiplications as i16x8.dot2.
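
Here is a sketch of that widening workaround using SSE4.1 intrinsics (PMOVSXWD via _mm_cvtepi16_epi32 and PMULLD via _mm_mullo_epi32; the helper name is illustrative):

```c
#include <smmintrin.h> /* SSE4.1 */

/* Widening workaround: two sign-extensions per operand and two
   32x32->32 multiplies to get the same eight 32-bit products that a
   single PMADDWD consumes before its pairwise sum. */
static void widen_mul_i16x8(__m128i a, __m128i b,
                            __m128i *prod_lo, __m128i *prod_hi) {
  __m128i a_lo = _mm_cvtepi16_epi32(a);                    /* lanes 0..3 */
  __m128i b_lo = _mm_cvtepi16_epi32(b);
  __m128i a_hi = _mm_cvtepi16_epi32(_mm_srli_si128(a, 8)); /* lanes 4..7 */
  __m128i b_hi = _mm_cvtepi16_epi32(_mm_srli_si128(b, 8));
  *prod_lo = _mm_mullo_epi32(a_lo, b_lo); /* PMULLD: the expensive part */
  *prod_hi = _mm_mullo_epi32(a_hi, b_hi);
}
```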

@Maratyszcza Maratyszcza changed the title i16x8.dot2_s and i16x8.dot2acc_s instructions i32x4.dot2_s and i32x4.dot2acc_s instructions Nov 1, 2019
@Maratyszcza Maratyszcza changed the title i32x4.dot2_s and i32x4.dot2acc_s instructions i32x4.dot2_s and i32x4.dot2_acc_s instructions Nov 1, 2019
Maratyszcza (Contributor, Author) commented Nov 1, 2019

@tlively Good point about the i32x4 output type; I renamed the instructions to i32x4.dot2_s and i32x4.dot2_add_s accordingly. As for the _s suffix and the 2 in the name, I think it is best to keep them to allow for future extensions:

  • The AVX512-VNNI and NEON DOT extensions provide instructions for a dot product of four 8-bit integers with accumulation into 32 bits. Thus, we might need to distinguish between dot2 and dot4 in the future.
  • It is also possible that we'd want unsigned variants in the future and would need to distinguish between dot2_s and dot2_u. Unsigned variants wouldn't lower as nicely across all architectures, so I left them out of this proposal.

tlively (Member) commented Nov 1, 2019

If we may need to differentiate between different input types, how about i32x4.dot_i16x8_s?

@Maratyszcza Maratyszcza changed the title i32x4.dot2_s and i32x4.dot2_acc_s instructions i32x4.dot_i16x8_s and i32x4.dot_i16x8_add_s instructions Nov 1, 2019
Maratyszcza (Contributor, Author)

@tlively Sounds reasonable; there are already instructions with similar names. Updated the commit & PR description.

tlively added a commit to llvm/llvm-project that referenced this pull request Nov 1, 2019
Summary:
This instruction is not merged to the spec proposal, but we need it to
be implemented in the toolchain to experiment with it. It is available
only on an opt-in basis through a clang builtin.

Defined in WebAssembly/simd#127.

Depends on D69696.

Reviewers: aheejin

Subscribers: dschuff, sbc100, jgravelle-google, hiraditya, sunfish, cfe-commits, llvm-commits

Tags: #clang, #llvm

Differential Revision: https://reviews.llvm.org/D69697
tlively (Member) commented Nov 2, 2019

@Maratyszcza Could you add pseudocode for the semantics of these operations as well? I want to make sure I implement the interpreter correctly in Binaryen.

Maratyszcza (Contributor, Author)

@tlively I don't quite understand the pseudo-code specification style in Wasm SIMD, especially given that these "dot product" instructions are among the few with a "horizontal" component to them. You may refer to the PMADDWD instruction in the Intel architecture manual, which is the analog of i32x4.dot_i16x8_s.

tlively added a commit to tlively/binaryen that referenced this pull request Nov 4, 2019
This experimental instruction is specified in
WebAssembly/simd#127 and is being implemented
to enable further investigation of its performance impact.
tlively added a commit to tlively/binaryen that referenced this pull request Nov 4, 2019
This experimental instruction is specified in
WebAssembly/simd#127 and is being implemented
to enable further investigation of its performance impact.
tlively added a commit to WebAssembly/binaryen that referenced this pull request Nov 4, 2019
This experimental instruction is specified in
WebAssembly/simd#127 and is being implemented
to enable further investigation of its performance impact.
arichardson pushed a commit to arichardson/llvm-project that referenced this pull request Nov 16, 2019
tlively (Member) commented Dec 17, 2019

The opcodes in this PR collide with the opcodes used for the {i8x16,i16x8}.avgr_u instructions. Since the averaging instructions have been merged, I will reassign the opcodes for the dot product instructions to 0xdb and 0xdc in the LLVM and Binaryen implementations.

Edit: I forgot that only one dot product instruction was implemented. i32x4.dot_i16x8_s will have opcode 0xdb.

@Maratyszcza
Copy link
Contributor Author

@tlively Renumbered the opcodes in the PR to match LLVM.

bjacob commented May 5, 2020

I would like to +1 this request, especially the 8-bit by 8-bit flavor accumulating into 32 bits, as in VNNI. It would be a necessary prerequisite to even consider targeting WebAssembly for integer-quantized neural-network inference code. Without such an instruction, 8-bit quantization of neural networks simply won't provide a meaningful computational advantage over float; people would merely 8-bit-quantize to shrink the download size, but then dequantize to float for the client-side computation.

On the ARM side, the VNNI-equivalent instruction is SDOT/UDOT, which is available in currently shipping ARM CPUs and Android devices such as the Pixel 4, and now in lower-end cores as well (Cortex-A55), so this is a present issue, not a future one. It is a 4x speed difference today.

(For both points made above, see this data).

Example production code using these instructions (used by TensorFlow Lite; the machine encodings are there to support older assemblers; an intrinsics sketch follows the link):
https://github.com/google/ruy/blob/57e64b4c8f32e813ce46eb495d13ee301826e498/ruy/kernel_arm64.cc#L3066
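
For reference, the SDOT idiom described above looks like this with NEON intrinsics (a sketch assuming a compiler targeting ARMv8.2-A with the dotprod extension; the helper name is illustrative):

```c
#include <arm_neon.h> /* requires e.g. -march=armv8.2-a+dotprod */

/* SDOT: for each 32-bit lane of acc, multiplies the corresponding four
   signed 8-bit lanes of a and b and accumulates their sum into acc. */
static int32x4_t dot4_accumulate(int32x4_t acc, int8x16_t a, int8x16_t b) {
  return vdotq_s32(acc, a, b);
}
```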

moz-v2v-gh pushed a commit to mozilla/gecko-dev that referenced this pull request Oct 14, 2020
…ccepted status. r=jseward

Background: WebAssembly/simd#127

For the widening dot product instruction:

- remove the internal 'Experimental' opcode suffix in the C++ code
- remove the guard on the instruction in all the C++ decoders
- move the test cases from simd/experimental.js to simd/ad-hack.js

I have checked that current V8 and wasm-tools use the same opcode
mapping.  V8 in turn guarantees the correct mapping for LLVM and
binaryen.

Differential Revision: https://phabricator.services.mozilla.com/D92929
jamienicol pushed a commit to jamienicol/gecko that referenced this pull request Oct 15, 2020
omnisip added a commit to omnisip/simd that referenced this pull request Oct 15, 2020
ngzhian (Member) commented Oct 19, 2020

There is no pseudocode for this op, but I think the text description is straightforward enough. Merging.

@ngzhian ngzhian merged commit 1cfd484 into WebAssembly:master Oct 19, 2020
bmeurer added a commit to bmeurer/wasmparser that referenced this pull request Oct 24, 2020
bmeurer added a commit to wasdk/wasmparser that referenced this pull request Oct 24, 2020
julian-seward1

@Maratyszcza are there any .wast testcases available for i32x4.dot_i16x8_s? I looked around the repo but didn't find any -- I may have looked in the wrong place though.

julian-seward1 added a commit to julian-seward1/wasmtime that referenced this pull request Oct 27, 2020
This patch implements, for aarch64, the following wasm SIMD extensions

  i32x4.dot_i16x8_s instruction
  WebAssembly/simd#127

It also updates dependencies as follows, in order that the new instruction can
be parsed, decoded, etc:

  wat          to  1.0.27
  wast         to  26.0.1
  wasmparser   to  0.65.0
  wasmprinter  to  0.2.12

The changes are straightforward:

* new CLIF instruction `widening_pairwise_dot_product_s`

* translation from wasm into `widening_pairwise_dot_product_s`

* new AArch64 instructions `smull`, `smull2` (part of the `VecRRR` group)

* translation from `widening_pairwise_dot_product_s` to `smull ; smull2 ; addv`

There is no testcase in this commit, because that is a separate repo.  The
implementation has been tested, nevertheless.
Maratyszcza (Contributor, Author)

Not that I'm aware of. @ngzhian and @tlively probably know better.

ngzhian (Member) commented Oct 27, 2020

Not yet; this is a pretty new instruction, so it's not implemented in the interpreter yet. Of course, contributions are welcome; let me know if you're interested (in contributing the implementation, the tests, or both).

julian-seward1 added a commit to julian-seward1/wasmtime that referenced this pull request Nov 3, 2020
julian-seward1 added a commit to bytecodealliance/wasmtime that referenced this pull request Nov 3, 2020
ambroff pushed a commit to ambroff/gecko that referenced this pull request Nov 4, 2020
Implement some of the experimental SIMD opcodes that are supported by
all of V8, LLVM, and Binaryen, for maximum compatibility with test
content we might be exposed to.  Most/all of these will probably make
it into the spec, as they lead to substantial speedups in some
programs, and they are deterministic.

For spec and cpu mapping details, see:

WebAssembly/simd#122 (pmax/pmin)
WebAssembly/simd#232 (rounding)
WebAssembly/simd#127 (dot product)
WebAssembly/simd#237 (load zero)

The wasm bytecode values used here come from the binaryen changes that
are linked from those tickets, that's the best documentation right
now.  Current binaryen opcode mappings are here:
https://github.com/WebAssembly/binaryen/blob/master/src/wasm-binary.h

Also: Drive-by fix for signatures of vroundss and vroundsd, these are
unary operations and should follow the conventions for these with
src/dest arguments, not src0/src1/dest.

Also: Drive-by fix to add variants of vmovss and vmovsd on x64 that
take Operand source and FloatRegister destination.

Differential Revision: https://phabricator.services.mozilla.com/D85982
ambroff pushed a commit to ambroff/gecko that referenced this pull request Nov 4, 2020
ngzhian added a commit to ngzhian/simd that referenced this pull request Nov 4, 2020
It multiplies respective lanes from the 2 input operands, then adds
adjacent lanes.

This was merged into the proposal in WebAssembly#127.
ngzhian added a commit that referenced this pull request Nov 4, 2020
cfallin pushed a commit to bytecodealliance/wasmtime that referenced this pull request Nov 30, 2020
ngzhian added a commit to ngzhian/simd that referenced this pull request Feb 18, 2021
This instruction was added in WebAssembly#127.
ngzhian added a commit that referenced this pull request Feb 24, 2021
This instruction was added in #127.

Co-authored-by: Andreas Rossberg <[email protected]>