This repository has been archived by the owner on Dec 22, 2021. It is now read-only.

v128.load32_zero and v128.load64_zero instructions #237

Merged
merged 1 commit into from
Oct 19, 2020

Conversation

Maratyszcza
Contributor

@Maratyszcza Maratyszcza commented Jun 2, 2020

Introduction

This PR introduces two new variants of load instructions, which load a single 32-bit or 64-bit element into the lowest part of a 128-bit SIMD vector and zero-extend it to the full 128 bits. These instructions map natively to SSE2 and ARM64 NEON instructions, and have two broad use cases:

  1. Non-contiguous loads, where we need to combine elements from disjoint memory locations into a single SIMD vector. Non-contiguous loads are commonly emulated by loading one element at a time and combining the values through shuffles. While this is possible through a combination of scalar loads and v128.replace_lane instructions, the resulting code is inefficient: it uses too many general-purpose registers, produces an overly long dependency chain (every v128.replace_lane depends on the previous one), and hits the long-latency/low-throughput instructions that copy from general-purpose registers to SIMD registers. Non-contiguous loads using the proposed v128.load32_zero and v128.load64_zero instructions avoid all of these bottlenecks.
  2. Processing fewer than 128 bits of data. Sometimes the algorithm or data structure simply doesn't expose enough data to utilize all 128 bits of a SIMD vector, but would nevertheless benefit from processing the available elements in parallel (e.g. adding 8 bytes in one SIMD instruction rather than eight scalar instructions).
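The intended semantics can be sketched in a few lines of Python — a minimal, illustrative model (not normative) that represents the v128 value as 16 little-endian bytes; the function names are descriptive, not taken from the spec:

```python
# Model of the proposed zero-extending loads: take 4 or 8 bytes from
# memory into the lowest lanes of a 16-byte vector and zero the rest.

def load32_zero(mem: bytes, offset: int) -> bytes:
    """v128.load32_zero: 32 bits into the lowest lane, upper 96 bits zeroed."""
    return mem[offset:offset + 4] + b"\x00" * 12

def load64_zero(mem: bytes, offset: int) -> bytes:
    """v128.load64_zero: 64 bits into the lowest lane, upper 64 bits zeroed."""
    return mem[offset:offset + 8] + b"\x00" * 8

mem = bytes(range(16))
assert load32_zero(mem, 4) == b"\x04\x05\x06\x07" + b"\x00" * 12
assert load64_zero(mem, 0) == bytes(range(8)) + b"\x00" * 8
```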

Applications

Mapping to Common Instruction Sets

This section illustrates how the new WebAssembly instructions can be lowered on common instruction sets. These patterns are provided only for convenience; compliant WebAssembly implementations do not have to follow the same code-generation patterns.

x86/x86-64 processors with AVX instruction set

  • v128.load32_zero
    • v = v128.load32_zero(mem) is lowered to VMOVSS xmm_v, [mem]
  • v128.load64_zero
    • v = v128.load64_zero(mem) is lowered to VMOVSD xmm_v, [mem]

x86/x86-64 processors with SSE2 instruction set

  • v128.load32_zero
    • v = v128.load32_zero(mem) is lowered to MOVSS xmm_v, [mem]
  • v128.load64_zero
    • v = v128.load64_zero(mem) is lowered to MOVSD xmm_v, [mem]

ARM64 processors

  • v128.load32_zero
    • v = v128.load32_zero(mem) is lowered to LDR Sv, [mem]
  • v128.load64_zero
    • v = v128.load64_zero(mem) is lowered to LDR Dv, [mem]

ARMv7 processors with NEON instruction set

  • v128.load32_zero
    • v = v128.load32_zero(mem) is lowered to VMOV.I32 Qv, 0 + VLD1.32 {Dv_lo[0]}, [mem]
  • v128.load64_zero
    • v = v128.load64_zero(mem) is lowered to VMOV.I32 Dv_hi, 0 + VLD1.32 {Dv_lo}, [mem]

@tlively
Member

tlively commented Jun 2, 2020

Thanks for the suggestion, @Maratyszcza! Now that we are in phase 3, we have stricter guidelines on adding new instructions. It sounds like these instructions are well supported on multiple architectures, but we need to agree that they are used in multiple important use cases and that they would be expensive to emulate. Can you point to real-world uses of this pattern that we could adapt as benchmarks to determine how much of a benefit these instructions would be?

@Maratyszcza
Contributor Author

@tlively Added examples of applications using these instructions

@Maratyszcza
Contributor Author

XNNPACK has SIMD table-based exp and sigmoid implementations that could be used for evaluation

@jan-wassenberg

@tlively I agree these would be helpful. Another expensive-to-emulate use case is when you have an existing data structure of 1-2 floats and can't be sure 4 floats are accessible. Or the much more common case of remainder handling: using the same code as the main loop, but with 32-bit loads/stores going one element at a time. JPEG XL has several examples of this.

@lemaitre

lemaitre commented Jun 3, 2020

The more general vld1q_lane from Neon might be desirable (https://static.docs.arm.com/den0018/a/DEN0018A_neon_programmers_guide_en.pdf#G15.1154120).

Basically, it can load a single element into any lane, not just the first one, and leaves the other lanes untouched.

The problem I see is that it would need some kind of pattern matching to be lowered efficiently on x86, where we only have "load into the first lane and set the rest to zero".
But we can envision that a sequence of load_lane could be converted into shuffles in SSE, and even into a (masked) gather in AVX2.

If pattern recognition fails (or is disabled), the generated code for a single load_lane would still be faster than a scalar load + insert_lane, as it would be converted into loadl + shuffle and stay in the SIMD register space.
And the shuffle can easily be eliminated if the WASM runtime detects that the lane index is 0 and the input vector already contains zeros.


The store counterpart might also be interesting.
However, the store variant would not be able to handle 8- and 16-bit types efficiently on x86.
We could stay with 32- and 64-bit types, as proposed here, though.
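For contrast with the zero-filling loads proposed in this PR, the vld1q_lane-style semantics described above can be modeled byte-for-byte — an illustrative Python sketch over a 16-byte little-endian vector (function names are assumptions, not from any spec):

```python
def load32_zero(mem: bytes, offset: int) -> bytes:
    # proposed instruction: low lane from memory, upper lanes zeroed
    return mem[offset:offset + 4] + b"\x00" * 12

def load32_lane(mem: bytes, offset: int, v: bytes, lane: int) -> bytes:
    # vld1q_lane-style: one 32-bit lane replaced, other lanes untouched
    return v[:4 * lane] + mem[offset:offset + 4] + v[4 * (lane + 1):]

mem = bytes(range(16))
# With lane 0 and an all-zero input vector, load_lane degenerates to
# load32_zero -- the case where a runtime could eliminate the shuffle.
assert load32_lane(mem, 8, b"\x00" * 16, 0) == load32_zero(mem, 8)
# With a nonzero lane, the other lanes of the input vector survive.
v = bytes(range(100, 116))
out = load32_lane(mem, 0, v, 2)
assert out[8:12] == mem[0:4] and out[:8] == v[:8] and out[12:] == v[12:]
```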

@tlively
Member

tlively commented Jul 31, 2020

For consistency with the load_splat instructions, these instructions should probably have v32x4 and v64x2 prefixes. More descriptive names might be v32x4.load_lane and v64x2.load_lane.

tlively added a commit to tlively/binaryen that referenced this pull request Jul 31, 2020
Specified in WebAssembly/simd#237. Since these
are just prototypes necessary for benchmarking, this PR does not add
support for these instructions to the fuzzer or the C or JS APIs. This
PR also renumbers the QFMA instructions that previously used the
opcodes for these new instructions. The renumbering matches the
renumbering in V8 and LLVM.
@Maratyszcza
Contributor Author

IMO it is best to save load_lane for (future) variants which load a single lane while leaving the others unchanged (i.e. analogs of vld1q_lane_XX in ARM NEON and _mm_insert_epiXX on x86).

We could rename these instructions to v128.load32_u and v128.load64_u for consistency with v128.load32x2_u and other zero-extending instructions.

tlively added a commit to WebAssembly/binaryen that referenced this pull request Aug 3, 2020
tlively added a commit to llvm/llvm-project that referenced this pull request Aug 3, 2020
Specified in WebAssembly/simd#237, these
instructions load the first vector lane from memory and zero the other
lanes. Since these instructions are not officially part of the SIMD
proposal, they are only available on an opt-in basis via LLVM
intrinsics and clang builtin functions. If these instructions are
merged to the proposal, this implementation will change so that the
instructions will be generated from normal IR. At that point the
intrinsics and builtin functions would be removed.

This PR also changes the opcodes for the experimental f32x4.qfm{a,s}
instructions because their opcodes conflicted with those of the
v128.load{32,64}_zero instructions. The new opcodes were chosen to
match those used in V8.

Differential Revision: https://reviews.llvm.org/D84820
@tlively
Member

tlively commented Aug 4, 2020

Saving load_lane for potential future instructions makes sense to me. How about v32x4.load32 and v64x2.load64? The _u suffix doesn't seem necessary because there is no sign interpretation happening. I still think it makes sense to use the prefixes for hinting at the lane interpretation, but I could probably be convinced otherwise as well.

On a different note, prototypes of these instructions have been merged to both LLVM and Binaryen and will be available in the next version of Emscripten via the builtin functions __builtin_wasm_load32_zero and __builtin_wasm_load64_zero.

@ngzhian
Member

ngzhian commented Aug 4, 2020

I think the memory instructions should all start with v128. (Ref: mvp instructions are all of the form <type>.load[<n>_<sx>].)

The shape prefix suggests how the operands are treated, which doesn't apply for loads, since the operands are all memargs. This might be a point of confusion. Making everything start with v128.load_ will help categorize all these variants of load as: "load from memory to get a v128", i.e. these are all the ways you can load something from memory to get a v128. Then the format becomes:

v128.load_<splat/extend/zero/others>_<numberofbytesloaded>_<sign extension>

For load_splat we might even consider changing it to v128.load_splat8, similar to how we have i32.load8_s.

So maybe load zeroes can be: v128.load_zero32.

I think this has a (imo nice) side effect of making the spec text a bit clearer, because you can now say, all instructions that start with the shape prefix describe how they treat their operands (and you don't have to say "except for memory instructions").

@tlively
Member

tlively commented Aug 4, 2020

@ngzhian That seems reasonable and consistent to me. We would want to use the v128 prefix for load-extend operations as well. I think it would look more consistent with MVP if we put <numberofbytesloaded> after load, like in v128.load8_splat or v128.load32_zero. WDYT?

@ngzhian
Member

ngzhian commented Aug 4, 2020

Yea, that looks good to me. It becomes really clear from the name that v128 is the return type, and how many bytes will be loaded. The remaining portion then tells us how to get from those bytes to a v128, of which we can have many different ways.

hanswinderix pushed a commit to hanswinderix/llvm-project that referenced this pull request Aug 5, 2020
moz-v2v-gh pushed a commit to mozilla/gecko-dev that referenced this pull request Aug 12, 2020
Implement some of the experimental SIMD opcodes that are supported by
all of V8, LLVM, and Binaryen, for maximum compatibility with test
content we might be exposed to.  Most/all of these will probably make
it into the spec, as they lead to substantial speedups in some
programs, and they are deterministic.

For spec and cpu mapping details, see:

WebAssembly/simd#122 (pmax/pmin)
WebAssembly/simd#232 (rounding)
WebAssembly/simd#127 (dot product)
WebAssembly/simd#237 (load zero)

The wasm bytecode values used here come from the binaryen changes that
are linked from those tickets, that's the best documentation right
now.  Current binaryen opcode mappings are here:
https://github.com/WebAssembly/binaryen/blob/master/src/wasm-binary.h

Also: Drive-by fix for signatures of vroundss and vroundsd, these are
unary operations and should follow the conventions for these with
src/dest arguments, not src0/src1/dest.

Also: Drive-by fix to add variants of vmovss and vmovsd on x64 that
take Operand source and FloatRegister destination.

Differential Revision: https://phabricator.services.mozilla.com/D85982
moz-v2v-gh pushed a commit to mozilla/gecko-dev that referenced this pull request Aug 12, 2020
gecko-dev-updater pushed a commit to marco-c/gecko-dev-wordified-and-comments-removed that referenced this pull request Aug 16, 2020
gecko-dev-updater pushed a commit to marco-c/gecko-dev-wordified-and-comments-removed that referenced this pull request Aug 16, 2020
gecko-dev-updater pushed a commit to marco-c/gecko-dev-comments-removed that referenced this pull request Aug 16, 2020
gecko-dev-updater pushed a commit to marco-c/gecko-dev-comments-removed that referenced this pull request Aug 16, 2020
gecko-dev-updater pushed a commit to marco-c/gecko-dev-wordified that referenced this pull request Aug 16, 2020
gecko-dev-updater pushed a commit to marco-c/gecko-dev-wordified that referenced this pull request Aug 16, 2020
@tlively
Member

tlively commented Sep 4, 2020

@Maratyszcza we briefly discussed this in the sync meeting today, and there is general support for these instructions, but we still need benchmarking data to make the case for including them. Would you be able to get performance numbers for these?

@omnisip

omnisip commented Oct 6, 2020 via email

@tlively
Member

tlively commented Oct 6, 2020

How is that different from v128.const then?

@omnisip

omnisip commented Oct 6, 2020 via email

@tlively
Member

tlively commented Oct 6, 2020

Then it sounds like an i64x2.replace_lane 0. I guess that requires a full vector to be materialized first, though.

In general, the answer to "why doesn't the instruction set have X" is some combination of "X is not portable enough," "X is not useful enough," or "no one has suggested X yet". You're totally welcome to suggest new instructions when you identify deficiencies in the instruction set. Useful information to include is how the instruction would be lowered on Intel and ARM ISAs, applications that would benefit from the instruction, and any estimates you have for the performance improvement that instruction can bring. See many of @Maratyszcza's PRs for good examples of new instruction proposals.
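The replace_lane-based emulation mentioned above — materializing a zero vector and then replacing lane 0 — produces the same result as v128.load64_zero; a byte-level Python sketch (illustrative names, 16-byte little-endian vector) of that equivalence:

```python
def load64_zero(mem: bytes, offset: int) -> bytes:
    # proposed instruction: 64 bits into the low lane, high lane zeroed
    return mem[offset:offset + 8] + b"\x00" * 8

def replace_lane64(v: bytes, lane: int, value: bytes) -> bytes:
    # i64x2.replace_lane: overwrite one 64-bit lane of v
    assert len(value) == 8
    return v[:8 * lane] + value + v[8 * (lane + 1):]

mem = bytes(range(16))
# v128.const 0 followed by i64x2.replace_lane 0 gives the same value,
# at the cost of materializing the zero vector and a scalar load first.
assert replace_lane64(b"\x00" * 16, 0, mem[0:8]) == load64_zero(mem, 0)
```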

@omnisip

omnisip commented Oct 6, 2020

In this case, it would belong with this PR because it would still be the same underlying instructions with a different argument. @Maratyszcza do you want me to make a patch or do you want to add this yourself?

@tlively
Member

tlively commented Oct 6, 2020

WebAssembly doesn't overload instructions with different kinds of arguments like that, so it will have to be a new instruction proposal.

@omnisip

omnisip commented Oct 6, 2020 via email

@penzn
Contributor

penzn commented Oct 6, 2020

Why doesn't this instruction set have a non-memory load? E.g. to load into the lower part without pulling from memory?

Memory operations either go from memory to a stack slot (load) or from stack slot to memory (store). There are no other possible semantics for memory operations in WebAssembly.

(edit: had virtual registers instead of stack slots, my bad)

@ngzhian
Member

ngzhian commented Oct 16, 2020

In the sync today we agreed to add these 2 instructions.
@Maratyszcza can you please update https://github.com/WebAssembly/simd/blob/master/proposals/simd/NewOpcodes.md as well? Put it in the table of memory instructions.

@Maratyszcza
Contributor Author

@ngzhian Done

@ngzhian
Member

ngzhian commented Oct 19, 2020

Thanks, LGTM

@ngzhian ngzhian merged commit b9b54b0 into WebAssembly:master Oct 19, 2020
bmeurer added a commit to bmeurer/wasmparser that referenced this pull request Oct 24, 2020
bmeurer added a commit to wasdk/wasmparser that referenced this pull request Oct 24, 2020
julian-seward1 added a commit to julian-seward1/wasmtime that referenced this pull request Nov 3, 2020
…ons.

This patch implements, for aarch64, the following wasm SIMD extensions.

  v128.load32_zero and v128.load64_zero instructions
  WebAssembly/simd#237

The changes are straightforward:

* no new CLIF instructions.  They are translated into an existing CLIF scalar
  load followed by a CLIF `scalar_to_vector`.

* the comment/specification for CLIF `scalar_to_vector` has been changed to
  match the actual intended semantics, per consultation with Andrew Brown.

* translation from `scalar_to_vector` to the obvious aarch64 insns.

* special-case zero in `lower_constant_f128` in order to avoid a
  potentially slow call to `Inst::load_fp_constant128`.

* Once "Allow loads to merge into other operations during instruction
  selection in MachInst backends"
  (bytecodealliance#2340) lands,
  we can use that functionality to pattern match the two-CLIF pair and
  emit a single AArch64 instruction.

There is no testcase in this commit, because that is a separate repo.  The
implementation has been tested, nevertheless.
julian-seward1 added a commit to julian-seward1/wasmtime that referenced this pull request Nov 3, 2020
ambroff pushed a commit to ambroff/gecko that referenced this pull request Nov 4, 2020
ambroff pushed a commit to ambroff/gecko that referenced this pull request Nov 4, 2020
julian-seward1 added a commit to julian-seward1/wasmtime that referenced this pull request Nov 4, 2020
…ons.

This patch implements, for aarch64, the following wasm SIMD extensions.

  v128.load32_zero and v128.load64_zero instructions
  WebAssembly/simd#237

The changes are straightforward:

* no new CLIF instructions.  They are translated into an existing CLIF scalar
  load followed by a CLIF `scalar_to_vector`.

* the comment/specification for CLIF `scalar_to_vector` has been changed to
  match the actual intended semantics, per consultation with Andrew Brown.

* translation from `scalar_to_vector` to aarch64 `fmov` instruction.  This
  has been generalised slightly so as to allow both 32- and 64-bit transfers.

* special-case zero in `lower_constant_f128` in order to avoid a
  potentially slow call to `Inst::load_fp_constant128`.

* Once "Allow loads to merge into other operations during instruction
  selection in MachInst backends"
  (bytecodealliance#2340) lands,
  we can use that functionality to pattern match the two-CLIF pair and
  emit a single AArch64 instruction.

* A simple filetest has been added.

There is no comprehensive testcase in this commit, because that is a separate
repo.  The implementation has been tested, nevertheless.
julian-seward1 added a commit to bytecodealliance/wasmtime that referenced this pull request Nov 4, 2020
cfallin pushed a commit to bytecodealliance/wasmtime that referenced this pull request Nov 30, 2020
arichardson pushed a commit to arichardson/llvm-project that referenced this pull request Mar 22, 2021