
Specialize .zip() for efficient slice and slice iteration #33090

Merged
merged 8 commits into rust-lang:master from the special-zip-2 branch on Jun 17, 2016

Conversation

bluss
Member

@bluss bluss commented Apr 19, 2016

The idea is to introduce a private trait TrustedRandomAccess and specialize .zip() for random access iterators into a counted loop.

The implementation in the PR is internal and has no visible effect in the API.

Why a counted loop? To have each slice iterator compile to just a pointer, and both pointers are indexed with the same loop counter value in the generated code. When this succeeds, copying loops are readily recognized and replaced with memcpy, and addition loops autovectorize well.
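In rough outline, the shape is the following (a minimal sketch: the trait name matches the PR, but the method name and the free-standing helper are illustrative rather than the actual libcore code):

```rust
// Sketch only: illustrative names, not the exact code in this PR.
trait TrustedRandomAccess: ExactSizeIterator {
    /// Callers must only use indices `i < self.len()`, each at most once.
    unsafe fn get_unchecked(&mut self, i: usize) -> Self::Item;
}

// Conceptually, the specialized Zip::next becomes one step of a counted loop:
fn zip_step<A, B>(a: &mut A, b: &mut B, index: &mut usize, len: usize)
    -> Option<(A::Item, B::Item)>
where
    A: TrustedRandomAccess,
    B: TrustedRandomAccess,
{
    if *index < len {
        let i = *index;
        *index += 1;
        // `len` was computed up front as the minimum of both lengths,
        // so `i` is in bounds for both iterators.
        unsafe { Some((a.get_unchecked(i), b.get_unchecked(i))) }
    } else {
        None
    }
}
```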

TrustedRandomAccess

The TrustedRandomAccess approach works very well on the surface. Microbenchmarks optimize well, following the ideas above, and that is a dramatic improvement in .zip()'s codegen.

```rust
// old zip before this PR: bad, byte-for-byte loop
// with specialized zip: memcpy
pub fn copy_zip(xs: &[u8], ys: &mut [u8]) {
    for (a, b) in ys.iter_mut().zip(xs) {
        *a = *b;
    }
}

// old zip before this PR: single addition per iteration
// with specialized zip: vectorized
pub fn add_zip(xs: &[f32], ys: &mut [f32]) {
    for (a, b) in ys.iter_mut().zip(xs) { *a += *b; }
}

// old zip before this PR: single addition per iteration
// with specialized zip: vectorized (!!)
pub fn add_zip3(xs: &[f32], ys: &[f32], zs: &mut [f32]) {
    for ((a, b), c) in zs.iter_mut().zip(xs).zip(ys) { *a += *b * *c; }
}
```

Non-null issues

Yet in more complex situations, the .zip() loop can still fall back to its old behavior, where phantom null checks insert spurious "premature end of loop" conditionals. Remember that a NULL inside Option<(&T, &T)> makes it a None value and therefore a premature (in this case) end of the loop.

So even if we have 1) an explicit `Some` in the code and 2) the types of the pointers are `&T` or `&mut T`, which are nonnull, we can still get a phantom null check at that point.
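To see why a null can masquerade as an early end of iteration: references are nonnull, so the null pointer is used as the None niche and the Option adds no extra discriminant. A small illustration (not code from the PR):

```rust
use std::mem::size_of;

fn main() {
    // A reference can never be null, so `None` is encoded as the null
    // pointer and the Option adds no extra discriminant word.
    assert_eq!(size_of::<Option<&u8>>(), size_of::<&u8>());
    assert_eq!(size_of::<Option<(&u8, &u8)>>(), size_of::<(&u8, &u8)>());
}
```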

One example that illustrates the difference is `copy_zip` with slice versus Vec arguments. The involved iterator types are exactly the same, but the Vec version doesn't compile down to memcpy. Investigating this, the function argument metadata emitted to LLVM plays the biggest role. As eddyb summarized, we need nonnull for the loop to autovectorize and noalias for it to be replaced with memcpy.

There was an experiment to use assume to add a non-null assumption on each of the two elements in the specialized zip iterator, but this only helped in some of the test cases and regressed others. Instead I think the nonnull/noalias metadata issue is something we need to solve separately anyway.
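For reference, the kind of hint that was tried looks roughly like the following; this is a hypothetical sketch placed in user code for illustration (nightly-only), whereas the actual experiment put the assumptions inside the specialized zip itself:

```rust
#![feature(core_intrinsics)]
use std::intrinsics::assume;

pub fn copy_zip_assume(xs: &[u8], ys: &mut [u8]) {
    for (a, b) in ys.iter_mut().zip(xs) {
        unsafe {
            // Promise LLVM that the element references are non-null.
            assume(!(&mut *a as *mut u8).is_null());
            assume(!(b as *const u8).is_null());
        }
        *a = *b;
    }
}
```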

Adaptors

These conditionally implement TrustedRandomAccess:

  • Enumerate
  • Zip

These do not implement it:

  • Map is side-effectful. The forward case would be workable, but the double-ended case is complicated.
  • Chain: the exact-length semantics are unclear.
  • Filter, FilterMap, FlatMap and many others don't offer random access and/or an exact length.

@rust-highfive
Collaborator

r? @aturon

(rust_highfive has picked a reviewer for you, use r? to override)

@bluss bluss changed the title from "WIP: Specialize .zip() for efficient slice+slice iteration" to "WIP: Specialize .zip() for efficient slice and slice iteration" on Apr 19, 2016
@Stebalien
Contributor

Why TrustedRandomAccessIterator instead of TrustedExactSizeIterator with a next_unchecked method? Wouldn't that be more general (TrustedRandomAccessIterator implies TrustedExactSizeIterator, but not the other way around)? Would it not optimize as well?

@bluss
Member Author

bluss commented Apr 19, 2016

I haven't tried that. The counted loop motivation explains why I used this approach though.

@Stebalien
Contributor

I see. (Now to practice reading things from start to finish before asking questions...)

@alexcrichton
Member

alexcrichton commented Apr 19, 2016

Nice wins! The design here seems pretty clean, although I agree with @Stebalien that if we could get away with something like:

```rust
unsafe fn next_unchecked(&mut self) -> Self::Item;
```

Then that'd be pretty slick as well. (not sure if it optimizes the same way though).

It's a little unfortunate that the Zip struct is growing by two words that are only used in some situations, but I wonder if we could specialize that as well? Something like:

```rust
struct Zip<A, B> {
    inner: <(A, B) as ZipImpl<A, B>>::Repr,
}

trait ZipImpl<A, B> {
    type Repr;
}

impl<A, B> ZipImpl<A, B> for (A, B) {
    type Repr = (A, B);
}

// ...
```

I'm not sure if that'll actually compile just yet, but perhaps worth a shot?

Also, could this add some benchmarks in-tree, so that if we eventually start tracking them we can follow performance over time?

@bluss
Member Author

bluss commented Apr 19, 2016

@Stebalien @alexcrichton I have tried things similar to .next_unchecked, but not with the rigor or scale of this PR. In general it does not optimize the same way the indexed version does, and does not reduce to memcpy.

Specialization with associated types seems controversial as it is. I'm not worried about the extra struct fields. In the vast majority of use cases, they are locals in a function and can be optimized out if they are unused.

I will attempt to salvage Map by adding back the side-effect case for uneven-length iterators, and see if the performance of the test cases is still preserved.

I'll happily add the examples as benchmark cases, but for me they are more like codegen test cases. Do we have a facility for testing optimizations?

@alexcrichton
Member

But in general it does not optimize the same way the indexed version does and does not reduce to memcpy.

Ah yeah, if LLVM can't connect the dots to realize everything uses the same iteration counter, then it seems fine to leave as is.

Specialization with associated types seems controversial as it is.

Could you elaborate a bit? Is it not implemented currently, or do you think it's likely to be removed? It's true that this would only be a minor win; I'm mostly just interested to see if it works :)

I'll happily add the examples as benchmark cases, but for me they are more like codegen test cases. Do we have a facility for testing optimizations?

We kinda do in src/test/codegen. You could do an operation and then assert that it's compiled to a memcpy or something like that, but unfortunately they're pretty brittle.
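For instance, a hedged sketch of what such a test could look like (the exact compiletest headers and FileCheck directives used in-tree may differ):

```rust
// compile-flags: -O
#![crate_type = "lib"]

// CHECK-LABEL: @zip_copy
// CHECK: memcpy
#[no_mangle]
pub fn zip_copy(xs: &[u8], ys: &mut [u8]) {
    // With the specialized zip, this loop should lower to a single memcpy.
    for (a, b) in ys.iter_mut().zip(xs) {
        *a = *b;
    }
}
```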

@alexcrichton alexcrichton added the T-libs-api Relevant to the library API team, which will review and decide on the PR/issue. label Apr 19, 2016
@bluss
Member Author

bluss commented Apr 21, 2016

Here's a playground snippet where you can compare the counted strategy with the next_unchecked strategy.

https://play.rust-lang.org/?gist=fbb8dece16c32195eb60219b9d769871&version=nightly&backtrace=0

The implementation is of course not so beautiful, since we need to access private fields of the iterators from outside their module.

next_unchecked vectorizes, but it is not recognized as memcpy; you can compare the addition loops' code against each other too. The code next_unchecked generates for the addition loop is generally a bit worse in benchmarks than the indexed approach.
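For readers who don't open the playground, the alternative strategy looks roughly like this (a sketch with illustrative names, not the playground code verbatim):

```rust
// Sketch of the `next_unchecked` strategy, for contrast with the counted loop.
trait TrustedExactSize: ExactSizeIterator {
    /// Caller must guarantee that the iterator has another element.
    unsafe fn next_unchecked(&mut self) -> Self::Item;
}

fn zip_step_unchecked<A, B>(a: &mut A, b: &mut B, remaining: &mut usize)
    -> Option<(A::Item, B::Item)>
where
    A: TrustedExactSize,
    B: TrustedExactSize,
{
    if *remaining > 0 {
        *remaining -= 1;
        // `remaining` started as the minimum of both lengths, so both
        // iterators still have an element to hand out here.
        unsafe { Some((a.next_unchecked(), b.next_unchecked())) }
    } else {
        None
    }
}
```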

@bluss
Member Author

bluss commented Apr 21, 2016

The AssumeNonNull code is gone, because it regressed some of the simplest test cases (even though it improved many others).

This optimization is pretty brittle itself! I've added benchmarks, and they optimize as they should, but only after I moved the code out into separate functions. The codegen test also passes.

I think Map will not participate in this specialization. I planned to reproduce the needed side effects, but then I realized the next_back case requires so much more that I don't know how it can be done.

@bluss
Member Author

bluss commented Apr 21, 2016

It's fragile and it depends on something that's outside the zip iterator's control.

Example: https://gist.github.com/bluss/5a088d3f420d12406689439e2940d731

Same input iterators but different sources (slice vs Vec).
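A hypothetical reconstruction of the kind of comparison in the gist (not the gist verbatim): identical iterator types, but only the slice-argument version becomes a memcpy.

```rust
// Becomes a memcpy: slice arguments carry noalias/nonnull metadata.
pub fn copy_zip_slices(xs: &[u8], ys: &mut [u8]) {
    for (a, b) in ys.iter_mut().zip(xs) {
        *a = *b;
    }
}

// Does not become a memcpy (at the time of this PR): the Vec arguments
// lack that metadata, even though the zip iterator is exactly the same.
pub fn copy_zip_vecs(xs: &Vec<u8>, ys: &mut Vec<u8>) {
    for (a, b) in ys.iter_mut().zip(xs.iter()) {
        *a = *b;
    }
}
```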

@bluss bluss force-pushed the special-zip-2 branch 2 times, most recently from 90ff0a4 to 1346fa0, on April 22, 2016 at 15:45
@bluss
Member Author

bluss commented Apr 22, 2016

Regarding the nonnull/noalias optimization failures: after looking through this yesterday, I think it's a very interesting issue, but it's orthogonal to, and independent of, this improvement to zip, even though the missing nonnull/noalias metadata readily destroys the optimization in many cases. A reliable workaround is to use an intermediate function with slice arguments (this works well for ndarray, for example); see the sketch below. In the long run we must ensure that slices and slice iterators created directly from Vec variables or Vec function arguments are treated the same way slice function arguments are.
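A minimal sketch of that workaround, with made-up function names:

```rust
// Do the actual work behind slice arguments, which get the
// noalias/nonnull metadata on the function boundary...
fn add_slices(xs: &[f32], ys: &mut [f32]) {
    for (a, b) in ys.iter_mut().zip(xs) {
        *a += *b;
    }
}

// ...and keep the Vec-facing entry point as a thin wrapper
// (deref coercion turns &Vec<f32> into &[f32] at the call site).
pub fn add_vecs(xs: &Vec<f32>, ys: &mut Vec<f32>) {
    add_slices(xs, ys);
}
```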

@alexcrichton I made one attempt at associated types and found some interesting effects. The implementation diff on top of the current state of the PR is here https://gist.github.com/bluss/ee1396d6b845f8eefa04ead1fa940533 and the resulting compiler error is also in there. It looks like the compiler has to treat the default case as if it has a dropck obligation. This is all pretty new to me, and I don't know what's correct or how it can be worked around.

I think the approach using Default is the way to go without the lattice rule. There's no way we can have it know the associated data is always () in the default implementation case.

@bluss bluss changed the title from "WIP: Specialize .zip() for efficient slice and slice iteration" to "Specialize .zip() for efficient slice and slice iteration" on Apr 22, 2016
@aturon
Member

aturon commented Apr 22, 2016

@alexcrichton

Could you elaborate a bit? Is it not implemented currently, or do you think it's likely to be removed? It's true that this would only be a minor win; I'm mostly just interested to see if it works :)

To clarify the overall status: it seems pretty clear that we're going to want specializable associated types in some form (and today's implementation has them). We're already starting to collect some pretty significant use cases. The main question is precisely who gets to assume what about these types -- there are a few ways we can take the design, and we want to get feedback on the use cases they enable.

@bluss
Member Author

bluss commented Apr 22, 2016

I found a use of associated types that compiles and works as before (optimizes the same). It's in the latest commit. It uses the trait bounds `'static + Default + Clone + Debug` on the associated type. The `'static` bound was the ingredient needed to avoid the dropck issue, `Default` allows creating the field in the default impl, and `Clone` and `Debug` support the derives on Zip.

The trait owning the associated type is a separate trait implemented for every type, so that no new trait bounds are needed on the Zip struct.
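In outline, the shape is something like this (a hedged sketch with illustrative names; it needs the nightly specialization feature, and the specialized case would pick a non-trivial Data type):

```rust
#![feature(specialization)]
#![allow(incomplete_features)]
use std::fmt::Debug;

// A separate trait, implemented for every type, owns the extra state as an
// associated type, so the Zip struct itself needs no new bounds.
trait ZipImplData {
    // 'static sidesteps the dropck issue, Default lets the default impl
    // create the field, and Clone + Debug keep the derives on Zip working.
    type Data: 'static + Default + Clone + Debug;
}

impl<T> ZipImplData for T {
    // Default case: no extra state.
    default type Data = ();
}

// A specialized impl for TrustedRandomAccess iterators would instead pick
// something like a (usize, usize) pair for the loop index and length.
```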

@alexcrichton
Member

Nice!

I'm definitely in favor of merging :) (but will defer to libs triage at least)

@bluss
Member Author

bluss commented Apr 22, 2016

It's in good shape for merging now. Avenues for expansion:

  • Could be expanded to a few more adaptors
  • The specialization trait could go pub unstable so that libcollections can implement it
  • Some brainstorming on how to bring in .map().

@bluss
Member Author

bluss commented Apr 22, 2016

The dropck thing is probably a bug, though.

@arielb1
Contributor

arielb1 commented Apr 23, 2016

it seems pretty clear that we're going to want specializable associated types in some form (and today's implementation has them).

This still feels like a scary type-system feature to me - and lo and behold, the current implementation has tons of bugs.

If you ignore the orphan rules suddenly becoming Very Important, method specialization is basically equivalent to matching on types - which you can basically do (unreliably) even before specialization with sufficient unsafe code tricks.

Specialized associated types, however, add that kind of matching to the type system itself. We don't even have associated types working in all cases even without specialization - I doubt that adding it will make the situation better.

@arielb1
Contributor

arielb1 commented Apr 23, 2016

@aturon

We still don't know how to do type equality with HRTB and associated types (at least, I don't, and rustc doesn't do it correctly). Selecting types based on that sounds... bogus.

@bstrie
Contributor

bstrie commented Apr 26, 2016

Is this related to specializing to take advantage of a trusted size_hint? (I didn't see any issues open for such a thing.)

@bluss
Member Author

bluss commented Apr 26, 2016

@bstrie It's not formally connected to it, but it could be. The specialization here has a much stricter demand (that you can access elements unchecked, by index).

@alexcrichton
Member

The libs team discussed this during triage today, and the conclusion was positive all around. @aturon would like to scrutinize before landing, so I'll defer the final review to him.

@bors
Contributor

bors commented May 19, 2016

☔ The latest upstream changes (presumably #33694) made this pull request unmergeable. Please resolve the merge conflicts.

@bluss
Member Author

bluss commented Jun 17, 2016

Thank you. Sorry that I was away for some weeks. Let's give this a go (I'm not 100% sure that the memcpy test will pass the buildbots).

@bors r=aturon

@bors
Contributor

bors commented Jun 17, 2016

📌 Commit 5df05c6 has been approved by aturon

@bors
Contributor

bors commented Jun 17, 2016

⌛ Testing commit 5df05c6 with merge 0a35504...

@alexcrichton
Member

@bors: retry force clean

  • restarted buildbot

@bors
Contributor

bors commented Jun 17, 2016

⌛ Testing commit 5df05c6 with merge c8eff68...

bors added a commit that referenced this pull request Jun 17, 2016
Specialize .zip() for efficient slice and slice iteration

@bors bors merged commit 5df05c6 into rust-lang:master Jun 17, 2016
@Stebalien
Contributor

Unless I'm mistaken, this should also have implementations for Skip and Take.

@frewsxcv
Member

Could this have made Zip invariant? #35727

@Stebalien
Contributor

Could this have made Zip invariant? #35727

It is.

@krdln
Contributor

krdln commented Sep 16, 2016

@bluss Could you elaborate on why exactly implementing this for Map (and Cloned) would be a problem? I fail to see what kind of side effects would pose a problem, can you give an example? (Also, is this comment thread a proper place to ask such a question?)

@bluss
Member Author

bluss commented Sep 16, 2016

@krdln Why not here? I don't mind.

I didn't know how to faithfully reproduce the side effects.

As an example, take two mapped iterators of length 4 and 3, like this: `(0..4).map(f).zip((0..3).map(g))`

The quirky part is that f will be called four times, and g three times. The fourth call extracts the last element from the first iterator, before we realize there is no fourth element in the second iterator.

The specialized version, as it is implemented now, computes a length of 3 and only attempts to extract 3 elements from each iterator.

I remember I tried to replicate this in the specialized case. Maybe we can try again.
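A small self-contained illustration of that quirk with the unspecialized zip (a hypothetical example, not from the PR's tests):

```rust
fn main() {
    let mut f_calls = 0;
    let mut g_calls = 0;
    let pairs: Vec<_> = (0..4)
        .map(|x| { f_calls += 1; x })
        .zip((0..3).map(|y| { g_calls += 1; y }))
        .collect();
    assert_eq!(pairs.len(), 3);
    // The fourth call to the first closure happens before Zip notices that
    // the second iterator is already exhausted.
    assert_eq!(f_calls, 4);
    assert_eq!(g_calls, 3);
}
```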

@krdln
Contributor

krdln commented Sep 16, 2016

Oh, I totally didn't think about this edge case. So, by implementing this "naively" (by that I mean without replicating the original behaviour), you would introduce an inconsistency between the non-specialized and specialized versions. So my question now is: why exactly is that a problem?

Is it because the specialized version would have different semantics? If that's really needed for correctness, then the naive implementation is indeed a no-go.

Or is it because it may potentially break somebody's code that relies on one more call to the closure? I would be really surprised if somebody was relying on this for correctness, especially as this behaviour is not documented anywhere. I would consider that change about as breaking as changing the order of comparisons in the sort implementation.

Anyway, I mostly asked the question because of the Cloned adapter. I was sad that my code wasn't autovectorizing and started to refactor it to unsafe manual indexing, when I noticed that removing .cloned() made it vectorize! So, would it at least be a good idea to implement TrustedRandomAccess for Cloned where T: Copy? The docs say:

More formally: if T: Copy, x: T, and y: &T, then let x = y.clone(); is equivalent to let x = *y;. Manual implementations should be careful to uphold this invariant; however, unsafe code must not rely on it to ensure memory safety.

so I guess it would be fine?
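A hypothetical illustration of the pattern described above (made-up function names): at the time, the .cloned() version missed the fast path while the plain reference version vectorized.

```rust
// Did not autovectorize with .cloned() in the chain...
pub fn dot_cloned(xs: &[f32], ys: &[f32]) -> f32 {
    xs.iter().cloned()
        .zip(ys.iter().cloned())
        .map(|(x, y)| x * y)
        .sum()
}

// ...while the same computation over plain slice iterators did.
pub fn dot_refs(xs: &[f32], ys: &[f32]) -> f32 {
    xs.iter().zip(ys).map(|(&x, &y)| x * y).sum()
}
```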

@bluss
Member Author

bluss commented Sep 16, 2016

Cloned is much simpler than Map; I don't see much problem with saying that .clone() might not be called. It's even simpler if the type is Copy.

I think that when we specialize, we need to pay very close attention to the details and always try to make it seamless. Otherwise, code that builds on top of, for example, zip suddenly gets very hard to understand.

Manishearth added a commit to Manishearth/rust that referenced this pull request Sep 17, 2016
…ition, r=alexcrichton

Remove data structure specialization for .zip() iterator

Go back on half the specialization, the part that changed the Zip
struct's fields themselves depending on the types of the iterators.

Previous PR: rust-lang#33090

This means that the Zip iterator will always carry two usize fields,
which are sometimes unused. If a whole for loop using a .zip() iterator is
inlined, these are simply removed and have no effect.

The same improvement for Zip of, for example, slice iterators remains, and
they still optimize well. However, as when the specialization of zip
was merged, the compiler is still very sensitive to the exact context.

For example, this code only autovectorizes if the function is used, not
if the body of zip_sum_i32 is inserted inline where it was called:

```rust
fn zip_sum_i32(xs: &[i32], ys: &[i32]) -> i32 {
    let mut s = 0;
    for (&x, &y) in xs.iter().zip(ys) {
        s += x * y;
    }
    s
}

fn zipdot_i32_default_zip(b: &mut test::Bencher)
{
    let xs = vec![1; 1024];
    let ys = vec![1; 1024];

    b.iter(|| {
        zip_sum_i32(&xs, &ys)
    })
}
```

Include a test that checks that `Zip<T, U>` is covariant w.r.t. T and U.

Fixes rust-lang#35727
bors added a commit that referenced this pull request Sep 17, 2016
Remove data structure specialization for .zip() iterator