
Add SIMD Support #903

Open
tiehuis opened this issue Apr 7, 2018 · 55 comments
Labels
accepted This proposal is planned. contributor friendly This issue is limited in scope and/or knowledge of Zig internals. proposal This issue suggests modifications. If it also has the "accepted" label then it is planned.
Milestone

Comments

@tiehuis
Member

tiehuis commented Apr 7, 2018

Current Progress


SIMD is very useful for fast processing of data and, given Zig's goals of going fast, I think we need to look at exposing some way of using these instructions easily and reliably.

Status-Quo

Inline Assembly

It is possible to do SIMD in inline assembly as-is. This is a bit cumbersome, though, and I think we should strive to make any speed gains achievable in the Zig language itself.

Rely on the Optimizer

The optimizer is good, and comptime unrolling helps a lot, but it doesn't guarantee that any specific code will be vectorized. You are at the mercy of LLVM, and you don't want to see your code take a huge hit in performance simply due to a compiler upgrade/change.
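For illustration, "relying on the optimizer" looks like the sketch below (names are illustrative, modern Zig syntax): a plain scalar loop that LLVM's auto-vectorizer may or may not turn into SIMD instructions in release mode, with no guarantee from the language.

```zig
const std = @import("std");

/// A plain scalar loop. In ReleaseFast, LLVM *may* auto-vectorize this
/// into SIMD adds, but nothing in the language guarantees it.
fn addArrays(dst: []i32, a: []const i32, b: []const i32) void {
    for (dst, a, b) |*d, x, y| d.* = x + y;
}

test "auto-vectorization candidate" {
    var dst: [8]i32 = undefined;
    const a = [8]i32{ 1, 2, 3, 4, 5, 6, 7, 8 };
    const b = [8]i32{ 8, 7, 6, 5, 4, 3, 2, 1 };
    addArrays(&dst, &a, &b);
    for (dst) |d| try std.testing.expectEqual(@as(i32, 9), d);
}
```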

LLVM Vector Intrinsics

LLVM supports vector types as first-class objects in its IR. These correspond to SIMD instructions. This provides the bulk of the work; for us, we simply need to expose a way to construct these vector types. This would be analogous to the `__attribute__((vector_size(N)))` extension found in C compilers.


If anyone has any thoughts on the implementation and/or usage, that would be great, since I'm not very familiar with how these are exposed by LLVM. It would be great to get some discussion going in this area, since I'm sure people would like to be able to match the performance of C in all areas with Zig.

@tiehuis tiehuis added the proposal This issue suggests modifications. If it also has the "accepted" label then it is planned. label Apr 7, 2018
@abique

abique commented Apr 7, 2018

I think relying on the compiler vector type is a good solution.
Both LLVM and GCC have it. If they're not present, you can always have a generic "software" fallback.

Syntax: you need a way to describe a vector type; an idea could be:

const value = <[]> f32 {0, 13, 23, 0.4};

So `<[ N_ELTS ]> type` would be the bracket style for vectors in this example.

Also, vectors are used essentially for arithmetic, so regular arithmetic operators should work.

Important things:

  • be able to extract a single element from a vector
  • needs some kind of shuffle vector: `@shuffle(v1, v2, index0, index1, index2, ...)`
  • you should be able to do an addition or multiplication between a scalar and a vector
  • vectors can't be nested

The standard library should also provide SIMD versions of cos, sin, exp, and so on.
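For reference, the operations in this wishlist all landed in later Zig; a sketch in present-day syntax (note that `@splat` and `@shuffle` did not exist at the time of this comment):

```zig
const std = @import("std");

test "wishlist operations in later Zig" {
    const a: @Vector(4, f32) = .{ 1.0, 2.0, 3.0, 4.0 };
    const b: @Vector(4, f32) = .{ 5.0, 6.0, 7.0, 8.0 };

    // extract a single element with ordinary indexing
    try std.testing.expectEqual(@as(f32, 3.0), a[2]);

    // shuffle: mask entry i picks a[i], ~i picks b[i]
    const lo = @shuffle(f32, a, b, [4]i32{ 0, 1, ~@as(i32, 0), ~@as(i32, 1) });
    try std.testing.expectEqual(@as(f32, 5.0), lo[2]);

    // scalar * vector via a broadcast (@splat)
    const scaled = a * @as(@Vector(4, f32), @splat(2.0));
    try std.testing.expectEqual(@as(f32, 8.0), scaled[3]);
}
```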

@andrewrk
Member

andrewrk commented Apr 7, 2018

How about adding operators for arrays? Example:

const std = @import("std");

test "simd" {
    var a = [4]i32{1, 2, 3, 4};
    var b = [4]i32{5, 6, 7, 8};
    var c = a + b;
    std.debug.assert(std.mem.eql(i32, c[0..], ([4]i32{ 6, 8, 10, 12 })[0..]));
}

This would codegen to using vectors in LLVM.

@abique

abique commented Apr 8, 2018

I believe you'll find that using arrays for SIMD vectors introduces more problems than it solves, and that's why LLVM and GCC went a different way.

The first thing is that they might have different alignment requirements. Plus, those vectors are supposed to end up in a single register, so you might want to codegen differently depending on whether something is a vector or an array.

I also worked on a private DSL where we made the distinction between vectors and arrays in the type system, and it was fine as far as I can tell. The vector type also provides useful information during semantic analysis, and you see what you get. Otherwise you have some array magic, which is exactly the kind of thing people want to avoid when switching to a new language, right?

@andrewrk
Member

andrewrk commented Apr 8, 2018

I think you're right - the simplest thing for everyone is to introduce a new vector primitive type and have it map exactly to the LLVM type.

@andrewrk andrewrk added the accepted This proposal is planned. label Apr 8, 2018
@andrewrk andrewrk added this to the 0.4.0 milestone Apr 8, 2018
@BraedonWooding
Contributor

BraedonWooding commented Apr 25, 2018

Also keep in mind the rsqrtss instruction and others, which are seriously fast on systems that support them, showing speed increases of 10x. These articles demonstrate some of the differences well: http://assemblyrequired.crashworks.org/timing-square-root/
and http://adrianboeing.blogspot.com.au/2009/10/timing-square-root-on-gpu.html.

We should aim to utilise this set of faster instructions when we can.

@lmb

lmb commented Jul 16, 2018

I just stumbled on this. There is a blog post series by a (former?) Intel engineer who designed a compiler for a vectorized language: http://pharr.org/matt/blog/2018/04/18/ispc-origins.html
At the least it's an interesting read, but maybe good inspiration as well.

@abique

abique commented Jul 16, 2018

Dense and interesting articles!

@BarabasGitHub
Contributor

One thing to keep in mind here is that even though you can vectorize scalar code, there are a lot of operations supported by SIMD instructions which you can't express in 'normal' scalar code, such as creating bit masks from floating-point comparisons for later use in bitwise operations (often to avoid branches). Plus there are integer operations which expand to wider integers, and other special stuff.

The series of articles linked by @lmb also show well what the difference can be between code/compiler that's designed for SIMD and code/compiler that isn't.

andrewrk added a commit that referenced this issue Jan 31, 2019
See #903

 * create with `@Vector(len, ElemType)`
 * only wrapping addition is implemented

This feature is far from complete; this is only the beginning.
@andrewrk
Member

andrewrk commented Jan 31, 2019

In the above commit I introduced the @Vector(len, ElemType) builtin to create vector types, and then I implemented addition (but I didn't make a test yet, hence the box is unchecked). So the effort here is started. Here is what I believe is left to do:

No mixing vector/scalar support. Instead you will use @splat(N, x) to create a vector of N elements from a scalar value x. Reasoning for this is that it more closely matches the LLVM IR. So for example multiplication would be:

fn vecMulScalar(v: @Vector(10, i32), x: i32) @Vector(10, i32) {
    return v * @splat(10, x);
}
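A usage sketch of the splat-based design described above. Note this uses the later single-argument `@splat` (length inferred from the result type), which differs from the two-argument `@splat(N, x)` form this comment introduces:

```zig
const std = @import("std");

fn vecMulScalar(v: @Vector(10, i32), x: i32) @Vector(10, i32) {
    // broadcast the scalar to all 10 lanes, then do a lane-wise multiply
    return v * @as(@Vector(10, i32), @splat(x));
}

test "vector times scalar via splat" {
    const v: @Vector(10, i32) = @splat(3);
    const r = vecMulScalar(v, 4);
    try std.testing.expectEqual(@as(i32, 12), r[9]);
}
```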

@andrewrk andrewrk added the contributor friendly This issue is limited in scope and/or knowledge of Zig internals. label Jan 31, 2019
@abique

abique commented Jan 31, 2019

The syntax looks ugly, but if it works as well as the LLVM builtin vectors, then it is fine! ;-)

Thank you, and don't forget the shuffle vector!

@abique

abique commented Jan 31, 2019

What do you think of v10i32 ?

@andrewrk
Member

andrewrk commented Feb 1, 2019

What do you think of v10i32 ?

A few things:

  • We need the builtin function anyway (just like we have @IntType (which is planned to be renamed to @Int)), so @Vector is a good starting point. If we switch to syntax, it will be a very small change in the compiler.
  • If there is syntax for it, it should work for ints, floats, and pointers. I'm not sure how the v10i32 example would work for pointer elements, and, if you don't already know about vectors, @Vector seems more discoverable to me than v10i32.
  • Manually putting const v10i32 = @Vector(10, i32); in a file is not so bad. Let's try it out for a while, and maybe we add syntax later if it seems necessary.

Please do feel free to propose syntax for a vector type. What's been proposed so far:

  • <[N]> type

This syntax hasn't been rejected; I'm simply avoiding the syntax question until the feature is done since it's the easiest thing to change at the very end.

@abique

abique commented Feb 1, 2019

Why would you want a vector of pointers? Can you do a vector load from that? Would that even be efficient? Do you want people to do vectorized pointer arithmetic? 🐙

I'd go with the v4f32 style! People will really enjoy writing SIMD that way. But of course it does not work with generics... :) So it might need a more verbose type declaration indeed.

@andrewrk
Member

andrewrk commented Feb 1, 2019

Why would you want a vector of pointers?

Mainly, because LLVM IR supports it, and they're usually pretty good about representing what hardware generally supports. We don't automatically do everything LLVM does, but it's a good null hypothesis.

Can you do a vector load from that?

Yes you can, which yields a vector. So for example you could have a vector of 4 pointers to a struct, and then obtain a vector of 4 floats which are their fields:

const Point = struct {x: f32, y: f32};
fn multiPointMagnitude(points: @Vector(4, *Point)) @Vector(4, f32) {
    return @sqrt(points.x * points.x + points.y * points.y);
}

It's planned for this code to work verbatim once this issue is closed.

Not only can you do vector loads and vector stores from vectors of pointers, you can also do @maskedGather, @maskedScatter, and more. See the LLVM LangRef links in the comment above for explanations.

@travisstaloch
Contributor

How are we supposed to initialize a vector? I couldn't find an example in the newest code. Or is this not implemented yet?

For example, the following doesn't work:

test "initialize vector" {
    const V4i32 = @Vector(4, i32);
    var v: V4i32 = []i32{ 0, 1, 2, 3 };
}

@andrewrk
Member

andrewrk commented Feb 2, 2019

Your example is planned to work. That's the checkbox above labeled "implicit array to vector cast".

andrewrk added a commit that referenced this issue Feb 5, 2019
also vectors and arrays now use the same ConstExprVal representation

See #903
@andrewrk
Member

andrewrk commented Feb 5, 2019

@travisstaloch the array <-> vector casts work now. Here's the passing test case:

test "implicit array to vector and vector to array" {
    const S = struct {
        fn doTheTest() void {
            var v: @Vector(4, i32) = [4]i32{ 10, 20, 30, 40 };
            const x: @Vector(4, i32) = [4]i32{ 1, 2, 3, 4 };
            v +%= x;
            const result: [4]i32 = v;
            assertOrPanic(result[0] == 11);
            assertOrPanic(result[1] == 22);
            assertOrPanic(result[2] == 33);
            assertOrPanic(result[3] == 44);
        }
    };
    S.doTheTest();
    comptime S.doTheTest();
}

andrewrk added a commit that referenced this issue Feb 22, 2019
also fix vector behavior tests, they weren't actually testing
runtime vectors, but now they are.

See #903
@andrewrk andrewrk removed this from the 0.4.0 milestone Mar 22, 2019
@shawnl
Contributor

shawnl commented Oct 13, 2020 via email

@ghost

ghost commented Oct 13, 2020

@floatCast is mentioned as a todo, but @intCast, @truncate, @as, and friends should also be included.

https://zig.godbolt.org/z/9nYcn4

const std = @import("std");

pub fn main() void {
    const v: i32 = 1;
    const a: @Vector(4, i32) = @splat(4, v);
    // These fail due to unexpected types
    const b = @intCast(@Vector(4, i64), a);
    const c = @as(@Vector(4, i64), a);
}
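Until those builtins learned about vectors, one workaround was an element-by-element cast through an array; a hedged sketch in later Zig syntax (the helper name `widen4` is illustrative, and this portable fallback is not what the builtins eventually do):

```zig
const std = @import("std");

/// Widen a @Vector(4, i32) to @Vector(4, i64) one lane at a time.
fn widen4(a: @Vector(4, i32)) @Vector(4, i64) {
    var out: [4]i64 = undefined;
    inline for (0..4) |i| out[i] = a[i]; // i32 coerces losslessly to i64
    return out; // array coerces back to a vector
}

test "elementwise widening fallback" {
    const a: @Vector(4, i32) = .{ 1, -2, 3, -4 };
    const b = widen4(a);
    try std.testing.expectEqual(@as(i64, -4), b[3]);
}
```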

@slimsag
Contributor

slimsag commented Nov 14, 2020

Has there been any thought here around runtime switching of CPU SIMD feature sets? I.e., instead of compiling for a single instruction set (AVX2, AVX-512, SSE3, SSSE3, etc.), allow compiling for multiple and, at runtime, choosing a branch that uses the latest and/or most efficient supported instruction set where reasonable?
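Nothing like that exists in the proposal yet; a hedged sketch of the usual pattern is a function pointer selected once at startup. Here `hasAvx2` is a stand-in for real CPUID-based detection, not an actual std API, and the 8-wide accumulator is an arbitrary choice:

```zig
const std = @import("std");

fn sumScalar(xs: []const f32) f32 {
    var s: f32 = 0;
    for (xs) |x| s += x;
    return s;
}

fn sumVector(xs: []const f32) f32 {
    // wide path: accumulate 8 lanes at a time, then reduce
    var acc: @Vector(8, f32) = @splat(0);
    var i: usize = 0;
    while (i + 8 <= xs.len) : (i += 8) {
        const chunk: @Vector(8, f32) = xs[i..][0..8].*;
        acc += chunk;
    }
    var s = @reduce(.Add, acc);
    while (i < xs.len) : (i += 1) s += xs[i]; // scalar tail
    return s;
}

// stand-in for runtime CPU feature detection (e.g. CPUID on x86);
// assumed true here purely for the sketch
fn hasAvx2() bool {
    return true;
}

var sum: *const fn ([]const f32) f32 = undefined;

pub fn init() void {
    sum = if (hasAvx2()) &sumVector else &sumScalar;
}
```

After `init()`, callers go through `sum` and get whichever path the running CPU supports.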

@ghost

ghost commented Dec 29, 2020

@Vector(N, bool) doesn't have and, or defined, nor &, |, making it questionably useful.

@LemonBoy
Contributor

@Vector(N, bool) doesn't have and, or defined, nor &, |, making it questionably useful.

You can `@bitCast(@Vector(N, u1), your_bool_vector)` and do whatever you want with the resulting vector.
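The workaround looks like this in later (single-argument) `@bitCast` syntax, which postdates the two-argument form in the comment above:

```zig
const std = @import("std");

test "and/or on bool vectors via u1 bitcast" {
    const a: @Vector(4, bool) = .{ true, true, false, false };
    const b: @Vector(4, bool) = .{ true, false, true, false };

    // reinterpret each bool lane as a u1, combine bitwise, reinterpret back
    const ai: @Vector(4, u1) = @bitCast(a);
    const bi: @Vector(4, u1) = @bitCast(b);
    const both: @Vector(4, bool) = @bitCast(ai & bi);
    const either: @Vector(4, bool) = @bitCast(ai | bi);

    try std.testing.expect(both[0] and !both[1]);
    try std.testing.expect(either[2] and !either[3]);
}
```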

@lemire

lemire commented Aug 14, 2022

An interesting test for such an API would be whether one can implement useful artefacts beyond number crunching, like high-speed UTF-8 validation or base64 encoding/decoding.

compiling for multiple and, at runtime, choosing a branch that uses the latest and/or most efficient supported instruction set where reasonable?

A related issue is that instruction sets are evolving. For example, the latest AWS Graviton nodes support SVE/SVE2. The most powerful AWS nodes support a full range of AVX-512 instruction sets (up to VBMI2).

If you build something that is unable to benefit from SVE2 or advanced AVX-512 instructions, then you might not be future-proof.

@sharpobject
Contributor

sharpobject commented Jan 11, 2023

I agree emphatically with @lemire's comment above.

Even for fixed-pattern byte shuffling with @shuffle, the resulting assembly seems quite bad, and I'm not sure what to write to get a SIMD load or store. I ported a 4x4 transpose to use @shuffle today: https://godbolt.org/z/j584eWsx6. I think it should be 4 loads, 8 instructions to do the transpose, and 4 stores, plus whatever other instructions the calling convention requires. Every part of the function is a lot bigger than that :(

The "correct" output for this function would be more like this.
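For reference, the 8-shuffle transpose the comment describes can be written like this in later Zig syntax (a sketch; whether the backend emits the ideal instruction count is exactly the problem being reported):

```zig
const std = @import("std");

const V = @Vector(4, f32);

// 4x4 transpose in 8 shuffles: interleave row pairs, then recombine halves.
// In a @shuffle mask, entry i selects from the first operand, ~i from the second.
fn transpose4x4(r0: V, r1: V, r2: V, r3: V) [4]V {
    const t0 = @shuffle(f32, r0, r1, [4]i32{ 0, ~@as(i32, 0), 1, ~@as(i32, 1) });
    const t1 = @shuffle(f32, r2, r3, [4]i32{ 0, ~@as(i32, 0), 1, ~@as(i32, 1) });
    const t2 = @shuffle(f32, r0, r1, [4]i32{ 2, ~@as(i32, 2), 3, ~@as(i32, 3) });
    const t3 = @shuffle(f32, r2, r3, [4]i32{ 2, ~@as(i32, 2), 3, ~@as(i32, 3) });
    return .{
        @shuffle(f32, t0, t1, [4]i32{ 0, 1, ~@as(i32, 0), ~@as(i32, 1) }),
        @shuffle(f32, t0, t1, [4]i32{ 2, 3, ~@as(i32, 2), ~@as(i32, 3) }),
        @shuffle(f32, t2, t3, [4]i32{ 0, 1, ~@as(i32, 0), ~@as(i32, 1) }),
        @shuffle(f32, t2, t3, [4]i32{ 2, 3, ~@as(i32, 2), ~@as(i32, 3) }),
    };
}

test "4x4 transpose" {
    const c = transpose4x4(
        V{ 1, 2, 3, 4 },
        V{ 5, 6, 7, 8 },
        V{ 9, 10, 11, 12 },
        V{ 13, 14, 15, 16 },
    );
    try std.testing.expectEqual(@as(f32, 5), c[0][1]); // column 0 = {1,5,9,13}
    try std.testing.expectEqual(@as(f32, 12), c[3][2]); // column 3 = {4,8,12,16}
}
```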

@Sahnvour
Contributor

I ported a 4x4 transpose to use @shuffle today here https://godbolt.org/z/j584eWsx6. I think it should be 4 loads, 8 instructions to do the transpose, and 4 stores, plus whatever other instructions to do with the calling convention. Every part of the function is a lot bigger than that :(

It gets a lot better with `-O ReleaseFast` (`-Drelease-fast=true` is for build.zig projects);
cf. https://godbolt.org/z/d6YvTfYGj

@andrewrk andrewrk modified the milestones: 0.11.0, 0.12.0 Apr 9, 2023
@andrewrk andrewrk modified the milestones: 0.13.0, 0.12.0 Jun 29, 2023
@andrewrk
Member

Has there been any thought here around runtime switching of CPU SIMD feature sets? I.e., instead of compiling for a single instruction set (AVX2, AVX-512, SSE3, SSSE3, etc.), allow compiling for multiple and, at runtime, choosing a branch that uses the latest and/or most efficient supported instruction set where reasonable?

@slimsag yeah that's #1018
