
Add SIMD Support #903

Open
tiehuis opened this issue Apr 7, 2018 · 55 comments
Labels
accepted This proposal is planned. contributor friendly This issue is limited in scope and/or knowledge of Zig internals. proposal This issue suggests modifications. If it also has the "accepted" label then it is planned.
Milestone

Comments

@tiehuis
Member

tiehuis commented Apr 7, 2018

Current Progress


SIMD is very useful for fast processing of data and, given Zig's goals of going fast, I think we need to look at exposing some way of using these instructions easily and reliably.

Status-Quo

Inline Assembly

It is possible to do SIMD in inline assembly as-is. This is a bit cumbersome, though, and I think we should strive to make any speed gains achievable in the Zig language itself.

Rely on the Optimizer

The optimizer is good, and comptime unrolling helps a lot, but it doesn't guarantee that any specific code will be vectorized. You are at the mercy of LLVM, and you don't want to see your code take a huge hit in performance simply due to a compiler upgrade/change.
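For illustration, "relying on the optimizer" looks like the sketch below (names are illustrative, modern Zig syntax): a plain scalar loop that LLVM's auto-vectorizer may or may not turn into SIMD instructions in release mode, with no guarantee from the language.

```zig
const std = @import("std");

/// A plain scalar loop. In ReleaseFast, LLVM *may* auto-vectorize this
/// into SIMD adds, but nothing in the language guarantees it.
fn addArrays(dst: []i32, a: []const i32, b: []const i32) void {
    for (dst, a, b) |*d, x, y| d.* = x + y;
}

test "auto-vectorization candidate" {
    var dst: [8]i32 = undefined;
    const a = [8]i32{ 1, 2, 3, 4, 5, 6, 7, 8 };
    const b = [8]i32{ 8, 7, 6, 5, 4, 3, 2, 1 };
    addArrays(&dst, &a, &b);
    for (dst) |d| try std.testing.expectEqual(@as(i32, 9), d);
}
```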

LLVM Vector Intrinsics

LLVM supports vector types as first-class objects in its IR. These correspond to SIMD instructions. This provides the bulk of the work; for us, we simply need to expose a way to construct these vector types. This would be analogous to the `__attribute__((vector_size(N)))` extension found in C compilers.


If anyone has any thoughts on the implementation and/or usage, that would be great, since I'm not very familiar with how these are exposed by LLVM. It would be great to get some discussion going in this area, since I'm sure people would like to be able to match the performance of C in all areas with Zig.

@tiehuis tiehuis added the proposal This issue suggests modifications. If it also has the "accepted" label then it is planned. label Apr 7, 2018
@abique

abique commented Apr 7, 2018

I think relying on the compiler vector type is a good solution.
Both LLVM and GCC have it. If they're not present, you can always have a generic "software" fallback.

Syntax: you need a way to describe a vector type; an idea could be:

const value = <[]> f32 {0, 13, 23, 0.4};

So `<[ N_ELTS ]> type` would be the bracket style for vectors in this example.

Also, vectors are used essentially for arithmetic, so regular arithmetic operators should work.

Important things:

  • be able to extract a single element from a vector
  • needs some kind of shuffle vector: `@shuffle(v1, v2, index0, index1, index2, ...)`
  • you should be able to do an addition or multiplication between a scalar and a vector
  • vectors can't be nested

The standard library should also provide SIMD versions of cos, sin, exp, and so on.
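For reference, the operations in this wishlist all landed in later Zig; a sketch in present-day syntax (note that `@splat` and `@shuffle` did not exist at the time of this comment):

```zig
const std = @import("std");

test "wishlist operations in later Zig" {
    const a: @Vector(4, f32) = .{ 1.0, 2.0, 3.0, 4.0 };
    const b: @Vector(4, f32) = .{ 5.0, 6.0, 7.0, 8.0 };

    // extract a single element with ordinary indexing
    try std.testing.expectEqual(@as(f32, 3.0), a[2]);

    // shuffle: mask entry i picks a[i], ~i picks b[i]
    const lo = @shuffle(f32, a, b, [4]i32{ 0, 1, ~@as(i32, 0), ~@as(i32, 1) });
    try std.testing.expectEqual(@as(f32, 5.0), lo[2]);

    // scalar * vector via a broadcast (@splat)
    const scaled = a * @as(@Vector(4, f32), @splat(2.0));
    try std.testing.expectEqual(@as(f32, 8.0), scaled[3]);
}
```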

@andrewrk
Member

andrewrk commented Apr 7, 2018

How about adding operators for arrays? Example:

const std = @import("std");

test "simd" {
    var a = [4]i32{1, 2, 3, 4};
    var b = [4]i32{5, 6, 7, 8};
    var c = a + b;
    std.debug.assert(std.mem.eql(i32, c[0..], ([4]i32{ 6, 8, 10, 12 })[0..]));
}

This would codegen to using vectors in LLVM.

@abique

abique commented Apr 8, 2018

I believe you'll find that using arrays for SIMD vectors introduces more problems than it solves, and that's why LLVM and GCC went a different way.

The first thing is that they might have different alignment requirements. Plus, those vectors are supposed to end up in a single register, so you might want to codegen differently depending on whether something is a vector or an array.

I also worked on a private DSL where we made the distinction between vectors and arrays in the type system, and it was fine as far as I can tell. The vector type also provides useful information during semantic analysis, and you see what you get. Otherwise you have some array magic, which is exactly the kind of thing people want to avoid when switching to a new language, right?

@andrewrk
Member

andrewrk commented Apr 8, 2018

I think you're right - the simplest thing for everyone is to introduce a new vector primitive type and have it map exactly to the LLVM type.

@andrewrk andrewrk added the accepted This proposal is planned. label Apr 8, 2018
@andrewrk andrewrk added this to the 0.4.0 milestone Apr 8, 2018
@BraedonWooding
Contributor

BraedonWooding commented Apr 25, 2018

Also keep in mind the rsqrtss instruction and others, which are seriously fast on systems that support them, showing speed increases of 10x. These articles demonstrate some of the differences well: http://assemblyrequired.crashworks.org/timing-square-root/
and http://adrianboeing.blogspot.com.au/2009/10/timing-square-root-on-gpu.html.

We should aim to utilise this set of faster instructions when we can.

@lmb

lmb commented Jul 16, 2018

I just stumbled on this. There is a blog post series by a (former?) Intel engineer who designed a compiler for a vectorized language: http://pharr.org/matt/blog/2018/04/18/ispc-origins.html
At the least it's an interesting read, but maybe good inspiration as well.

@abique

abique commented Jul 16, 2018

Dense and interesting articles!

@BarabasGitHub
Contributor

One thing to keep in mind here is that even though you can vectorize scalar code, there are a lot of operations supported by SIMD instructions which you can't express in 'normal' scalar code, such as creating bit masks from floating-point comparisons for later use in bitwise operations (often to avoid branches). Plus there are integer operations which expand to wider integers, and other special stuff.

The series of articles linked by @lmb also show well what the difference can be between code/compiler that's designed for SIMD and code/compiler that isn't.

andrewrk added a commit that referenced this issue Jan 31, 2019
See #903

 * create with `@Vector(len, ElemType)`
 * only wrapping addition is implemented

This feature is far from complete; this is only the beginning.
@andrewrk
Member

andrewrk commented Jan 31, 2019

In the above commit I introduced the @Vector(len, ElemType) builtin to create vector types, and then I implemented addition (but I didn't make a test yet, hence the box is unchecked). So the effort here is started. Here is what I believe is left to do:

No mixing vector/scalar support. Instead you will use @splat(N, x) to create a vector of N elements from a scalar value x. Reasoning for this is that it more closely matches the LLVM IR. So for example multiplication would be:

fn vecMulScalar(v: @Vector(10, i32), x: i32) @Vector(10, i32) {
    return v * @splat(10, x);
}
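A usage sketch of the splat-based design described above. Note this uses the later single-argument `@splat` (length inferred from the result type), which differs from the two-argument `@splat(N, x)` form this comment introduces:

```zig
const std = @import("std");

fn vecMulScalar(v: @Vector(10, i32), x: i32) @Vector(10, i32) {
    // broadcast the scalar to all 10 lanes, then do a lane-wise multiply
    return v * @as(@Vector(10, i32), @splat(x));
}

test "vector times scalar via splat" {
    const v: @Vector(10, i32) = @splat(3);
    const r = vecMulScalar(v, 4);
    try std.testing.expectEqual(@as(i32, 12), r[9]);
}
```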

@andrewrk andrewrk added the contributor friendly This issue is limited in scope and/or knowledge of Zig internals. label Jan 31, 2019
@abique

abique commented Jan 31, 2019

The syntax looks ugly, but if it works as well as the LLVM builtin vectors, then it is fine! ;-)

Thank you, and don't forget the shuffle vector!

@abique

abique commented Jan 31, 2019

What do you think of v10i32 ?

@andrewrk
Member

andrewrk commented Feb 1, 2019

What do you think of v10i32 ?

A few things:

  • We need the builtin function anyway (just like we have @IntType (which is planned to be renamed to @Int)), so @Vector is a good starting point. If we switch to syntax, it will be a very small change in the compiler.
  • If there is syntax for it, it should work for ints, floats, and pointers. I'm not sure how the v10i32 example would work for pointer elements, and, if you don't already know about vectors, @Vector seems more discoverable to me than v10i32.
  • Manually putting const v10i32 = @Vector(10, i32); in a file is not so bad. Let's try it out for a while, and maybe we add syntax later if it seems necessary.

Please do feel free to propose syntax for a vector type. What's been proposed so far:

  • <[N]> type

This syntax hasn't been rejected; I'm simply avoiding the syntax question until the feature is done since it's the easiest thing to change at the very end.

@abique

abique commented Feb 1, 2019

Why would you want a vector of pointers? Can you do a vector load from that? Would that even be efficient? Do you want people to do vectorized pointer arithmetic? 🐙

I'd go with the v4f32 style! People will really enjoy writing SIMD that way. But of course it does not work with generics... :) So it might need a more verbose type declaration indeed.

@andrewrk
Member

andrewrk commented Feb 1, 2019

Why would you want a vector of pointers?

Mainly, because LLVM IR supports it, and they're usually pretty good about representing what hardware generally supports. We don't automatically do everything LLVM does, but it's a good null hypothesis.

Can you do a vector load from that?

Yes you can, which yields a vector. So for example you could have a vector of 4 pointers to a struct, and then obtain a vector of 4 floats which are their fields:

const Point = struct {x: f32, y: f32};
fn multiPointMagnitude(points: @Vector(4, *Point)) @Vector(4, f32) {
    return @sqrt(points.x * points.x + points.y * points.y);
}

It's planned for this code to work verbatim once this issue is closed.

Not only can you do vector loads and vector stores from vectors of pointers, you can also do @maskedGather, @maskedScatter, and more. See the LLVM LangRef links in the comment above for explanations.

@travisstaloch
Contributor

How are we supposed to initialize a vector? I couldn't find an example in the newest code. Or is this not implemented yet?

For example, the following doesn't work:

test "initialize vector" {
    const V4i32 = @Vector(4, i32);
    var v: V4i32 = []i32{ 0, 1, 2, 3 };
}

@andrewrk
Member

andrewrk commented Feb 2, 2019

Your example is planned to work. That's the checkbox above labeled "implicit array to vector cast".

andrewrk added a commit that referenced this issue Feb 5, 2019
also vectors and arrays now use the same ConstExprVal representation

See #903
@andrewrk
Member

andrewrk commented Feb 5, 2019

@travisstaloch the array <-> vector casts work now. Here's the passing test case:

test "implicit array to vector and vector to array" {
    const S = struct {
        fn doTheTest() void {
            var v: @Vector(4, i32) = [4]i32{ 10, 20, 30, 40 };
            const x: @Vector(4, i32) = [4]i32{ 1, 2, 3, 4 };
            v +%= x;
            const result: [4]i32 = v;
            assertOrPanic(result[0] == 11);
            assertOrPanic(result[1] == 22);
            assertOrPanic(result[2] == 33);
            assertOrPanic(result[3] == 44);
        }
    };
    S.doTheTest();
    comptime S.doTheTest();
}

andrewrk added a commit that referenced this issue Feb 22, 2019
also fix vector behavior tests, they weren't actually testing
runtime vectors, but now they are.

See #903
@andrewrk andrewrk removed this from the 0.4.0 milestone Mar 22, 2019
@shawnl
Contributor

shawnl commented Oct 13, 2020 via email

@ghost

ghost commented Oct 13, 2020

@floatCast is mentioned as a todo, but @intCast, @truncate, @as, and friends should also be included.

https://zig.godbolt.org/z/9nYcn4

const std = @import("std");

pub fn main() void {
    const v: i32 = 1;
    const a: @Vector(4, i32) = @splat(4, v);
    // These fail due to unexpected types
    const b = @intCast(@Vector(4, i64), a);
    const c = @as(@Vector(4, i64), a);
}
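Until those builtins learned about vectors, one workaround was an element-by-element cast through an array; a hedged sketch in later Zig syntax (the helper name `widen4` is illustrative, and this portable fallback is not what the builtins eventually do):

```zig
const std = @import("std");

/// Widen a @Vector(4, i32) to @Vector(4, i64) one lane at a time.
fn widen4(a: @Vector(4, i32)) @Vector(4, i64) {
    var out: [4]i64 = undefined;
    inline for (0..4) |i| out[i] = a[i]; // i32 coerces losslessly to i64
    return out; // array coerces back to a vector
}

test "elementwise widening fallback" {
    const a: @Vector(4, i32) = .{ 1, -2, 3, -4 };
    const b = widen4(a);
    try std.testing.expectEqual(@as(i64, -4), b[3]);
}
```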

@slimsag
Contributor

slimsag commented Nov 14, 2020

Has there been any thought here around runtime switching of CPU SIMD feature sets? I.e., instead of compiling for a single instruction set (AVX2, AVX-512, SSE3, SSSE3, etc.), allow compiling for multiple and, at runtime, choosing a branch that uses the latest and/or most efficient supported instruction set where reasonable?
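Nothing like that exists in the proposal yet; a hedged sketch of the usual pattern is a function pointer selected once at startup. Here `hasAvx2` is a stand-in for real CPUID-based detection, not an actual std API, and the 8-wide accumulator is an arbitrary choice:

```zig
const std = @import("std");

fn sumScalar(xs: []const f32) f32 {
    var s: f32 = 0;
    for (xs) |x| s += x;
    return s;
}

fn sumVector(xs: []const f32) f32 {
    // wide path: accumulate 8 lanes at a time, then reduce
    var acc: @Vector(8, f32) = @splat(0);
    var i: usize = 0;
    while (i + 8 <= xs.len) : (i += 8) {
        const chunk: @Vector(8, f32) = xs[i..][0..8].*;
        acc += chunk;
    }
    var s = @reduce(.Add, acc);
    while (i < xs.len) : (i += 1) s += xs[i]; // scalar tail
    return s;
}

// stand-in for runtime CPU feature detection (e.g. CPUID on x86);
// assumed true here purely for the sketch
fn hasAvx2() bool {
    return true;
}

var sum: *const fn ([]const f32) f32 = undefined;

pub fn init() void {
    sum = if (hasAvx2()) &sumVector else &sumScalar;
}
```

After `init()`, callers go through `sum` and get whichever path the running CPU supports.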

@ghost

ghost commented Dec 29, 2020

@Vector(N, bool) doesn't have and, or defined, nor &, |, making it questionably useful.

@LemonBoy
Contributor

@Vector(N, bool) doesn't have and, or defined, nor &, |, making it questionably useful.

You can `@bitCast(@Vector(N, u1), your_bool_vector)` and do whatever you want with the resulting vector.
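The workaround looks like this in later (single-argument) `@bitCast` syntax, which postdates the two-argument form in the comment above:

```zig
const std = @import("std");

test "and/or on bool vectors via u1 bitcast" {
    const a: @Vector(4, bool) = .{ true, true, false, false };
    const b: @Vector(4, bool) = .{ true, false, true, false };

    // reinterpret each bool lane as a u1, combine bitwise, reinterpret back
    const ai: @Vector(4, u1) = @bitCast(a);
    const bi: @Vector(4, u1) = @bitCast(b);
    const both: @Vector(4, bool) = @bitCast(ai & bi);
    const either: @Vector(4, bool) = @bitCast(ai | bi);

    try std.testing.expect(both[0] and !both[1]);
    try std.testing.expect(either[2] and !either[3]);
}
```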

@lemire

lemire commented Aug 14, 2022

An interesting test for such an API would be whether one can implement useful artefacts beyond number crunching, like high-speed UTF-8 validation or base64 encoding/decoding.

compiling for multiple and, at runtime, choosing a branch that uses the latest and/or most efficient supported instruction set where reasonable?

A related issue is that instruction sets are evolving. For example, the latest AWS Graviton nodes support SVE/SVE2. The most powerful AWS nodes support a full range of AVX-512 instruction sets (up to VBMI2).

If you build something that is unable to benefit from SVE2 or advanced AVX-512 instructions, then you might not be future-proof.

@sharpobject
Contributor

sharpobject commented Jan 11, 2023

I agree emphatically with @lemire's comment above.

Even for fixed-pattern byte shuffling with @shuffle, the resulting assembly seems quite bad, and I'm not sure what to write to get a SIMD load or store. I ported a 4x4 transpose to use @shuffle today: https://godbolt.org/z/j584eWsx6. I think it should be 4 loads, 8 instructions to do the transpose, and 4 stores, plus whatever other instructions the calling convention requires. Every part of the function is a lot bigger than that :(

The "correct" output for this function would be more like this.
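For reference, the 8-shuffle transpose the comment describes can be written like this in later Zig syntax (a sketch; whether the backend emits the ideal instruction count is exactly the problem being reported):

```zig
const std = @import("std");

const V = @Vector(4, f32);

// 4x4 transpose in 8 shuffles: interleave row pairs, then recombine halves.
// In a @shuffle mask, entry i selects from the first operand, ~i from the second.
fn transpose4x4(r0: V, r1: V, r2: V, r3: V) [4]V {
    const t0 = @shuffle(f32, r0, r1, [4]i32{ 0, ~@as(i32, 0), 1, ~@as(i32, 1) });
    const t1 = @shuffle(f32, r2, r3, [4]i32{ 0, ~@as(i32, 0), 1, ~@as(i32, 1) });
    const t2 = @shuffle(f32, r0, r1, [4]i32{ 2, ~@as(i32, 2), 3, ~@as(i32, 3) });
    const t3 = @shuffle(f32, r2, r3, [4]i32{ 2, ~@as(i32, 2), 3, ~@as(i32, 3) });
    return .{
        @shuffle(f32, t0, t1, [4]i32{ 0, 1, ~@as(i32, 0), ~@as(i32, 1) }),
        @shuffle(f32, t0, t1, [4]i32{ 2, 3, ~@as(i32, 2), ~@as(i32, 3) }),
        @shuffle(f32, t2, t3, [4]i32{ 0, 1, ~@as(i32, 0), ~@as(i32, 1) }),
        @shuffle(f32, t2, t3, [4]i32{ 2, 3, ~@as(i32, 2), ~@as(i32, 3) }),
    };
}

test "4x4 transpose" {
    const c = transpose4x4(
        V{ 1, 2, 3, 4 },
        V{ 5, 6, 7, 8 },
        V{ 9, 10, 11, 12 },
        V{ 13, 14, 15, 16 },
    );
    try std.testing.expectEqual(@as(f32, 5), c[0][1]); // column 0 = {1,5,9,13}
    try std.testing.expectEqual(@as(f32, 12), c[3][2]); // column 3 = {4,8,12,16}
}
```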

@Sahnvour
Contributor

I ported a 4x4 transpose to use @shuffle today here https://godbolt.org/z/j584eWsx6. I think it should be 4 loads, 8 instructions to do the transpose, and 4 stores, plus whatever other instructions to do with the calling convention. Every part of the function is a lot bigger than that :(

It gets a lot better with `-O ReleaseFast` (`-Drelease-fast=true` is for build.zig projects);
cf. https://godbolt.org/z/d6YvTfYGj

@andrewrk andrewrk modified the milestones: 0.11.0, 0.12.0 Apr 9, 2023
@andrewrk andrewrk modified the milestones: 0.13.0, 0.12.0 Jun 29, 2023
@andrewrk
Member

Has there been any thought here around runtime switching of CPU SIMD feature sets? I.e., instead of compiling for a single instruction set (AVX2, AVX-512, SSE3, SSSE3, etc.), allow compiling for multiple and, at runtime, choosing a branch that uses the latest and/or most efficient supported instruction set where reasonable?

@slimsag yeah that's #1018
