
Commit 54a0e18

Auto merge of #111475 - workingjubilee:sync-simd-2023-may-10, r=workingjubilee
Sync portable-simd to 2023 May 10 Take 2. r? `@ghost`
2 parents: 2d4451f + 6db7a24


42 files changed: +2177 −723 lines

portable-simd/.github/workflows/ci.yml

+4
@@ -241,6 +241,10 @@ jobs:
 - "--features std"
 - "--features generic_const_exprs"
 - "--features std --features generic_const_exprs"
+- "--features all_lane_counts"
+- "--features all_lane_counts --features std"
+- "--features all_lane_counts --features generic_const_exprs"
+- "--features all_lane_counts --features std --features generic_const_exprs"

 steps:
 - uses: actions/checkout@v2

portable-simd/README.md

+12 −22
@@ -24,44 +24,34 @@ or by setting up `rustup default nightly` or else with `cargo +nightly {build,te
 ```bash
 cargo new hellosimd
 ```
-to create a new crate. Edit `hellosimd/Cargo.toml` to be
-```toml
-[package]
-name = "hellosimd"
-version = "0.1.0"
-edition = "2018"
-[dependencies]
-core_simd = { git = "https://github.com/rust-lang/portable-simd" }
-```
-
-and finally write this in `src/main.rs`:
+to create a new crate. Finally write this in `src/main.rs`:
 ```rust
-use core_simd::*;
+#![feature(portable_simd)]
+use std::simd::f32x4;
 fn main() {
     let a = f32x4::splat(10.0);
     let b = f32x4::from_array([1.0, 2.0, 3.0, 4.0]);
     println!("{:?}", a + b);
 }
 ```

-Explanation: We import all the bindings from the crate with the first line. Then, we construct our SIMD vectors with methods like `splat` or `from_array`. Finally, we can use operators on them like `+` and the appropriate SIMD instructions will be carried out. When we run `cargo run` you should get `[11.0, 12.0, 13.0, 14.0]`.
-
-## Code Organization
+Explanation: We construct our SIMD vectors with methods like `splat` or `from_array`. Next, we can use operators like `+` on them, and the appropriate SIMD instructions will be carried out. When you run `cargo run`, you should get `[11.0, 12.0, 13.0, 14.0]`.

-Currently the crate is organized so that each element type is a file, and then the 64-bit, 128-bit, 256-bit, and 512-bit vectors using those types are contained in said file.
+## Supported vectors

-All types are then exported as a single, flat module.
+Currently, vectors may have up to 64 elements, but aliases are provided only up to 512-bit vectors.

 Depending on the size of the primitive type, the number of lanes the vector will have varies. For example, 128-bit vectors have four `f32` lanes and two `f64` lanes.

 The supported element types are as follows:
 * **Floating Point:** `f32`, `f64`
-* **Signed Integers:** `i8`, `i16`, `i32`, `i64`, `i128`, `isize`
-* **Unsigned Integers:** `u8`, `u16`, `u32`, `u64`, `u128`, `usize`
-* **Masks:** `mask8`, `mask16`, `mask32`, `mask64`, `mask128`, `masksize`
+* **Signed Integers:** `i8`, `i16`, `i32`, `i64`, `isize` (`i128` excluded)
+* **Unsigned Integers:** `u8`, `u16`, `u32`, `u64`, `usize` (`u128` excluded)
+* **Pointers:** `*const T` and `*mut T` (zero-sized metadata only)
+* **Masks:** 8-bit, 16-bit, 32-bit, 64-bit, and `usize`-sized masks

-Floating point, signed integers, and unsigned integers are the [primitive types](https://doc.rust-lang.org/core/primitive/index.html) you're already used to.
-The `mask` types are "truthy" values, but they use the number of bits in their name instead of just 1 bit like a normal `bool` uses.
+Floating point, signed integers, unsigned integers, and pointers are the [primitive types](https://doc.rust-lang.org/core/primitive/index.html) you're already used to.
+The mask types have elements that are "truthy" values, like `bool`, but have an unspecified layout because different architectures prefer different layouts for mask types.

 [simd-guide]: ./beginners-guide.md
 [zulip-project-portable-simd]: https://rust-lang.zulipchat.com/#narrow/stream/257879-project-portable-simd
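As a rough illustration of the lanewise semantics the README example relies on, here is a scalar model in stable Rust. No SIMD types are used; `splat` and `add_lanewise` are hypothetical helper names for this sketch, not part of `std::simd`:

```rust
// Scalar model of the README example: `splat` fills every lane with one
// value, and `+` on `f32x4` adds the two vectors lane by lane.
fn splat(v: f32) -> [f32; 4] {
    [v; 4]
}

fn add_lanewise(a: [f32; 4], b: [f32; 4]) -> [f32; 4] {
    let mut out = [0.0_f32; 4];
    for i in 0..4 {
        out[i] = a[i] + b[i];
    }
    out
}

fn main() {
    let a = splat(10.0);
    let b = [1.0, 2.0, 3.0, 4.0];
    println!("{:?}", add_lanewise(a, b)); // [11.0, 12.0, 13.0, 14.0]
}
```

The real `f32x4` performs the four additions in a single vector instruction where the target supports it; this loop is only a model of the result.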

portable-simd/crates/core_simd/Cargo.toml

+4 −5
@@ -13,12 +13,11 @@ default = ["as_crate"]
 as_crate = []
 std = []
 generic_const_exprs = []
+all_lane_counts = []

-[target.'cfg(target_arch = "wasm32")'.dev-dependencies.wasm-bindgen]
-version = "0.2"
-
-[dev-dependencies.wasm-bindgen-test]
-version = "0.3"
+[target.'cfg(target_arch = "wasm32")'.dev-dependencies]
+wasm-bindgen = "0.2"
+wasm-bindgen-test = "0.3"

 [dev-dependencies.proptest]
 version = "0.10"
@@ -0,0 +1,13 @@
+### `stdsimd` examples
+
+This crate is a port of example uses of `stdsimd`, mostly taken from the `packed_simd` crate.
+
+The examples contain, as in the case of `dot_product.rs`, multiple ways of solving the problem, in order to show idiomatic uses of SIMD and iteration on performance designs.
+
+Run the tests with the command
+
+```
+cargo run --example dot_product
+```
+
+and verify the code for `dot_product.rs` on your machine.
@@ -0,0 +1,169 @@
+// Code taken from the `packed_simd` crate
+// Run this code with `cargo test --example dot_product`
+//use std::iter::zip;
+
+#![feature(array_chunks)]
+#![feature(slice_as_chunks)]
+// Add these imports to use the stdsimd library
+#![feature(portable_simd)]
+use core_simd::simd::*;
+
+// This is your barebones dot product implementation:
+// Take 2 vectors, multiply them element wise and *then*
+// go along the resulting array and add up the result.
+// In the next example we will see if there
+// is any difference to adding and multiplying in tandem.
+pub fn dot_prod_scalar_0(a: &[f32], b: &[f32]) -> f32 {
+    assert_eq!(a.len(), b.len());
+
+    a.iter().zip(b.iter()).map(|(a, b)| a * b).sum()
+}
+
+// When dealing with SIMD, it is very important to think about the amount
+// of data movement and when it happens. We're going over simple computation examples here, and yet
+// it is not trivial to understand what may or may not contribute to performance
+// changes. Eventually, you will need tools to inspect the generated assembly and confirm your
+// hypothesis and benchmarks - we will mention them later on.
+// With the use of `fold`, we're doing a multiplication,
+// and then adding it to the sum, one element from both vectors at a time.
+pub fn dot_prod_scalar_1(a: &[f32], b: &[f32]) -> f32 {
+    assert_eq!(a.len(), b.len());
+    a.iter()
+        .zip(b.iter())
+        .fold(0.0, |a, zipped| a + zipped.0 * zipped.1)
+}
+
+// We now move on to the SIMD implementations: notice the following constructs:
+// `array_chunks::<4>`: mapping this over the vector will let us construct SIMD vectors
+// `f32x4::from_array`: construct the SIMD vector from an array
+// `(a * b).reduce_sum()`: Multiply both f32x4 vectors together, and then reduce them.
+// This approach essentially uses SIMD to produce a vector of length N/4 of all the products,
+// and then adds those with `sum()`. This is suboptimal.
+// TODO: ASCII diagrams
+pub fn dot_prod_simd_0(a: &[f32], b: &[f32]) -> f32 {
+    assert_eq!(a.len(), b.len());
+    // TODO handle remainder when a.len() % 4 != 0
+    a.array_chunks::<4>()
+        .map(|&a| f32x4::from_array(a))
+        .zip(b.array_chunks::<4>().map(|&b| f32x4::from_array(b)))
+        .map(|(a, b)| (a * b).reduce_sum())
+        .sum()
+}
+
+// There are some simple ways to improve the previous code:
+// 1. Make a `zero` `f32x4` SIMD vector that we will be accumulating into,
+//    so that there is only one `sum()` reduction when the last `f32x4` has been processed.
+// 2. Exploit fused multiply-add so that the multiplication, addition and sinking into the reduction
+//    happen in the same step.
+// If the arrays are large, minimizing the data shuffling will lead to great perf.
+// If the arrays are small, handling the remainder elements when the length isn't a multiple of 4
+// can become a problem.
+pub fn dot_prod_simd_1(a: &[f32], b: &[f32]) -> f32 {
+    assert_eq!(a.len(), b.len());
+    // TODO handle remainder when a.len() % 4 != 0
+    a.array_chunks::<4>()
+        .map(|&a| f32x4::from_array(a))
+        .zip(b.array_chunks::<4>().map(|&b| f32x4::from_array(b)))
+        .fold(f32x4::splat(0.0), |acc, zipped| acc + zipped.0 * zipped.1)
+        .reduce_sum()
+}
+
+// A lot of knowledgeable use of SIMD comes from knowing specific instructions that are
+// available - let's try to use the `mul_add` instruction, which is the fused multiply-add we were looking for.
+use std_float::StdFloat;
+pub fn dot_prod_simd_2(a: &[f32], b: &[f32]) -> f32 {
+    assert_eq!(a.len(), b.len());
+    // TODO handle remainder when a.len() % 4 != 0
+    let mut res = f32x4::splat(0.0);
+    a.array_chunks::<4>()
+        .map(|&a| f32x4::from_array(a))
+        .zip(b.array_chunks::<4>().map(|&b| f32x4::from_array(b)))
+        .for_each(|(a, b)| {
+            res = a.mul_add(b, res);
+        });
+    res.reduce_sum()
+}
+
+// Finally, we will write the same operation but handling the loop remainder.
+const LANES: usize = 4;
+pub fn dot_prod_simd_3(a: &[f32], b: &[f32]) -> f32 {
+    assert_eq!(a.len(), b.len());
+
+    let (a_extra, a_chunks) = a.as_rchunks();
+    let (b_extra, b_chunks) = b.as_rchunks();
+
+    // These are always true, but for emphasis:
+    assert_eq!(a_chunks.len(), b_chunks.len());
+    assert_eq!(a_extra.len(), b_extra.len());
+
+    let mut sums = [0.0; LANES];
+    for ((x, y), d) in std::iter::zip(a_extra, b_extra).zip(&mut sums) {
+        *d = x * y;
+    }
+
+    let mut sums = f32x4::from_array(sums);
+    std::iter::zip(a_chunks, b_chunks).for_each(|(x, y)| {
+        sums += f32x4::from_array(*x) * f32x4::from_array(*y);
+    });
+
+    sums.reduce_sum()
+}
+
+// Next, we present an iterator version for handling remainders in a scalar fashion at the end of the loop.
+// Unfortunately, this is allocating 1 `XMM` register on the order of `~len(a)` - we'll see how we can get around it in the
+// next example.
+pub fn dot_prod_simd_4(a: &[f32], b: &[f32]) -> f32 {
+    let mut sum = a
+        .array_chunks::<4>()
+        .map(|&a| f32x4::from_array(a))
+        .zip(b.array_chunks::<4>().map(|&b| f32x4::from_array(b)))
+        .map(|(a, b)| a * b)
+        .fold(f32x4::splat(0.0), std::ops::Add::add)
+        .reduce_sum();
+    let remain = a.len() - (a.len() % 4);
+    sum += a[remain..]
+        .iter()
+        .zip(&b[remain..])
+        .map(|(a, b)| a * b)
+        .sum::<f32>();
+    sum
+}
+
+// This version allocates a single `XMM` register for accumulation, and the folds don't allocate on top of that.
+// Notice the use of `mul_add`, which can do a multiply and an add operation per iteration.
+pub fn dot_prod_simd_5(a: &[f32], b: &[f32]) -> f32 {
+    a.array_chunks::<4>()
+        .map(|&a| f32x4::from_array(a))
+        .zip(b.array_chunks::<4>().map(|&b| f32x4::from_array(b)))
+        .fold(f32x4::splat(0.), |acc, (a, b)| a.mul_add(b, acc))
+        .reduce_sum()
+}
+
+fn main() {
+    // Empty main to make cargo happy
+}
+
+#[cfg(test)]
+mod tests {
+    #[test]
+    fn smoke_test() {
+        use super::*;
+        let a: Vec<f32> = vec![1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0];
+        let b: Vec<f32> = vec![-8.0, -7.0, -6.0, -5.0, 4.0, 3.0, 2.0, 1.0];
+        let x: Vec<f32> = [0.5; 1003].to_vec();
+        let y: Vec<f32> = [2.0; 1003].to_vec();
+
+        // Basic check
+        assert_eq!(0.0, dot_prod_scalar_0(&a, &b));
+        assert_eq!(0.0, dot_prod_scalar_1(&a, &b));
+        assert_eq!(0.0, dot_prod_simd_0(&a, &b));
+        assert_eq!(0.0, dot_prod_simd_1(&a, &b));
+        assert_eq!(0.0, dot_prod_simd_2(&a, &b));
+        assert_eq!(0.0, dot_prod_simd_3(&a, &b));
+        assert_eq!(0.0, dot_prod_simd_4(&a, &b));
+        assert_eq!(0.0, dot_prod_simd_5(&a, &b));
+
+        // We can handle vectors that are non-multiples of 4
+        assert_eq!(1003.0, dot_prod_simd_3(&x, &y));
+    }
+}
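The chunk-plus-remainder pattern used by `dot_prod_simd_3` and `dot_prod_simd_4` above can be sketched in stable Rust with scalar arithmetic standing in for the SIMD ops. This is only a sketch of the control flow, not the committed implementation; `dot_prod_chunked` is a hypothetical name:

```rust
// Process the inputs in chunks of 4 (the part a SIMD vector would handle in
// one instruction, done here with a scalar inner loop), then handle the
// trailing `len % 4` elements separately, as `dot_prod_simd_3/4` do.
fn dot_prod_chunked(a: &[f32], b: &[f32]) -> f32 {
    assert_eq!(a.len(), b.len());
    let mut sum = 0.0_f32;
    // Main loop: full chunks of 4 elements from both slices.
    for (ca, cb) in a.chunks_exact(4).zip(b.chunks_exact(4)) {
        for i in 0..4 {
            sum += ca[i] * cb[i];
        }
    }
    // Remainder: the last a.len() % 4 elements, handled scalar-fashion.
    let rem = a.len() - a.len() % 4;
    for (x, y) in a[rem..].iter().zip(&b[rem..]) {
        sum += x * y;
    }
    sum
}

fn main() {
    // 1003 is deliberately not a multiple of 4, like the `x`/`y` test above.
    let x = vec![0.5_f32; 1003];
    let y = vec![2.0_f32; 1003];
    println!("{}", dot_prod_chunked(&x, &y)); // 1003
}
```

The real SIMD versions replace the inner `for i in 0..4` loop with one vector multiply-accumulate, which is where the speedup comes from; the remainder handling is identical in spirit.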
