Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

StepBy<_, Range<_>> optimises poorly #31155

Closed
huonw opened this issue Jan 24, 2016 · 14 comments
Closed

StepBy<_, Range<_>> optimises poorly #31155

huonw opened this issue Jan 24, 2016 · 14 comments
Labels
A-LLVM Area: Code generation parts specific to LLVM. Both correctness bugs and optimization-related issues. C-enhancement Category: An issue proposing an enhancement or a PR with one. I-slow Issue: Problems and improvements with respect to performance of generated code. T-libs-api Relevant to the library API team, which will review and decide on the PR/issue.

Comments

@huonw
Copy link
Member

huonw commented Jan 24, 2016

There's a lot going on inside StepBy<_, Range<_>>'s Iterator implementation and LLVM does a reasonable job of cutting things down, but doesn't get all the way (definition inlined for context if it changes in future, and to allow easy experimentation):

#![crate_type = "lib"]

use std::mem;

struct R {
    start: i32,
    end: i32,
    step_by: i32,
}

impl Iterator for R
{
    type Item = i32;

    fn next(&mut self) -> Option<i32> {
        let rev = self.step_by < 0;
        if (rev && self.start > self.end) ||
           (!rev && self.start < self.end)
        {
            match self.start.checked_add(self.step_by) { // inlined & stable version of <i32 as std::iter::Step>::step
                Some(mut n) => {
                    mem::swap(&mut self.start, &mut n);
                    Some(n)
                },
                None => {
                    let mut n = self.end.clone();
                    mem::swap(&mut self.start, &mut n);
                    Some(n)
                }
            }
        } else {
            None
        }
    }
}


pub fn iterator() -> usize {
    R { start: 0, end: 100000, step_by: 2 }.count()
}

pub fn while_() -> usize {
    let mut count = 0;
    let mut i = 0;
    while i < 100000 {
        count += 1;
        i += 2;
    }
    count
}

Optimised asm (it'd be great for the first to be like the second):

_ZN8iterator20h37c73f52cfd206d4LbaE:
    .cfi_startproc
    xorl    %eax, %eax
    movl    $100000, %ecx
    xorl    %edx, %edx
    .align  16, 0x90
.LBB1_1:
    addl    $2, %edx
    cmovol  %ecx, %edx
    incq    %rax
    cmpl    $100000, %edx
    jl  .LBB1_1
    retq
_ZN6while_20h2b6d290fcc3d36d0UbaE:
    .cfi_startproc
    movl    $50000, %eax
    retq

https://play.rust-lang.org/?gist=a926869a4cf59d6683c4
#24660 previously had a somewhat similar problem, although this one is compounded by using checked_add implemented in terms of LLVM's overflow intrinsics, which the LLVM performance tips explicitly recommend against:

Avoid using arithmetic intrinsics unless you are required by your source language specification to emit a particular code sequence. The optimizer is quite good at reasoning about general control flow and arithmetic, it is not anywhere near as strong at reasoning about the various intrinsics. If profitable for code generation purposes, the optimizer will likely form the intrinsics itself late in the optimization pipeline. It is very rarely profitable to emit these directly in the language frontend. This item explicitly includes the use of the overflow intrinsics.

@huonw huonw added A-libs I-slow Issue: Problems and improvements with respect to performance of generated code. A-LLVM Area: Code generation parts specific to LLVM. Both correctness bugs and optimization-related issues. labels Jan 24, 2016
@bluss
Copy link
Member

bluss commented Jan 24, 2016

Here's my ideal step by: https://play.rust-lang.org/?gist=f967691b33eef2f2de70&version=stable

For example, the boundary checks in the loop are eliminated. The idea is simply to do exactly what a for loop in C would do.

@dirk
Copy link
Contributor

dirk commented Jan 24, 2016

I'm guessing this is very likely a case of the @llvm.sadd.with.overflow.i32 intrinsic preventing the LLVM optimizer from being able to do proper constant folding, see this playground for an example. Check the LLVM IR in release mode, look for the function @_ZN8iterator... near the bottom when commenting/uncommenting lines 23 and 28. You'll see that with an unchecked addition it's able to fully constant fold the entire expression, but when the checked addition intrinsic is used it's unable to constant fold the expression.

@ranma42
Copy link
Contributor

ranma42 commented Jan 24, 2016

It looks like it might be possible to get rid of checked_add by computing an appropriate range end. This would also be needed to fix a bug causing infinite repetition in StepBy<_, RangeFrom<_>>.
Although the example states that (0u8..).step_by(2) only iterates on all even u8 values, it will actually do so an infinite number of times: http://is.gd/EaJ6ur

@bluss
Copy link
Member

bluss commented Jan 24, 2016

I think we decided the user has to ensure that they don't take too many elements from a 0.. iterator. It overflows with debug assertions.

@bluss
Copy link
Member

bluss commented Jan 24, 2016

i.e. behavior past the overflow point is not part of the interface.

@huonw
Copy link
Member Author

huonw commented Jan 25, 2016

@bluss hm, while it's nice to have something to compare against for the simple cases, your ideal version is unfortunately somewhat useless: we obviously want to do what a C for loop does (that's essentially what zero cost abstractions means for things like this), but the code as written it doesn't handle wrap-around properly (e.g. (250u8..255).step_by(10)), nor does it handle negative steps.

@bluss
Copy link
Member

bluss commented Jan 25, 2016

Maybe useless to replace current step_by, yes.

It's a tall order, isn't it? Do what the for loop does, Zero overhead, Check for overflow. Computing the end up front, it sounds doable though.

For the RangeFrom iterator we evidently chose to not care for overflow (overflow is user error), that could be the model here too.

@huonw
Copy link
Member Author

huonw commented Jan 25, 2016

I think that making (250_u8..255).step_by(10) have infinite length would be a massive footgun, almost to the point of maliciousness on the part of us library designers. At least with RangeFrom, step_by is just mirroring the behaviour of the underlying iterator.

It seems to me that it makes most sense for some_iter.step_by(n) to be equivalent to something like

let mut count = 0;
some_iter.filter_map(|x| { if count == 0 { count = n - 1; Some(x) } else { count -= 1; None })

(So, yes, some scheme that computes the appropriate upper bound seems like it may be the best plan of attack.)

@bluss
Copy link
Member

bluss commented Jan 25, 2016

I was against having 0.. not check for overflow, but it is a solution that leads to zero overhead. We already have the 0u8.. footgun now.

But I don't agree that it has infinite length. It has a debug assertion for overflow.

@huonw
Copy link
Member Author

huonw commented Jan 25, 2016

Yes, the RangeFrom footgun is somewhat unfortunate, but it seems much less severe to me. If someone is using a range with an upper bound, it seems bad to explicitly disobey it, and, it seems worse to have different behaviour between two similar variations of a type.

Debug assertions makes the "real" behaviour a type a little bit of a grey area, but I think one certainly couldn't say that an overflowing version of that iterator is finite: no matter what build configuration you have, you'll never get a None back from next. I believe size_hint would have to return None as its second value.

@bluss
Copy link
Member

bluss commented Jan 25, 2016

Very fair point. The nice behavior is certainly preferable. When people pull out a while loop on the forum and discover "this has better performance than step_by", it's because they completely disregard the overflow (wraparound) case, though.

RangeFrom is neither finite nor infinite, but it ends with a bang 😉

@dirk
Copy link
Contributor

dirk commented Jan 25, 2016

Introducing new footguns definitely doesn't sound like a desirable thing. It seems to me that the current behavior (ie. overflow checking) is the most desirable behavior for the general use of the StepBy on a Range given that it's the safest (compared to a while loop).

That (250_u8..255).step_by(10) is both safe and optimizes pretty well (even without constant folding) is a good thing in my book.

I think it's worth asking whether we can—and whether it's worth—creating an optimized path for StepBy that can be taken when we can be sure of the safety preconditions you guise have mentioned:

  • The start of the range is known
  • The end of the range is known
  • The step is known to not overflow at any point along the series from the start to the end

@steveklabnik steveklabnik added T-libs-api Relevant to the library API team, which will review and decide on the PR/issue. and removed A-libs labels Mar 24, 2017
@Mark-Simulacrum Mark-Simulacrum added the C-enhancement Category: An issue proposing an enhancement or a PR with one. label Jul 24, 2017
@alexcrichton
Copy link
Member

Today's reproduction still doesn't optimize

bors added a commit that referenced this issue Jun 21, 2018
Specialize StepBy<Range(Inclusive)>

Part of #51557, related to #43064, #31155

As discussed in the above issues, `step_by` optimizes very badly on ranges which is related to
1. the special casing of the first `StepBy::next()` call
2. the need to do 2 additions of `n - 1` and `1` inside the range's `next()`

This PR eliminates both by overriding `next()` to always produce the current element and also step ahead by `n` elements in one go. The generated code is much better, even identical in the case of a `Range` with constant `start` and `end` where `start+step` can't overflow. Without constant bounds it's a bit longer than the manual loop. `RangeInclusive` doesn't optimize as nicely but is still much better than the original asm.
Unsigned integers optimize better than signed ones for some reason.

See the following two links for a comparison.

[godbolt: specialization for ..](https://godbolt.org/g/haHLJr)
[godbolt: specialization for ..=](https://godbolt.org/g/ewyMu6)

`RangeFrom`, the only other range with an `Iterator` implementation can't be specialized like this without changing behaviour due to overflow. There is no way to save "finished-ness".

The approach can not be used in general, because it would produce side effects of the underlying iterator too early.

May obsolete #51435, haven't checked.
@steveklabnik
Copy link
Member

Today, the feature is no longer in nightly, and we get fully optimized results for both:


playground::iterator: # @playground::iterator
# %bb.0:
	movl	$50000, %eax            # imm = 0xC350
	retq
                                        # -- End function

playground::while_: # @playground::while_
# %bb.0:
	movl	$50000, %eax            # imm = 0xC350
	retq
                                        # -- End function

closing!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A-LLVM Area: Code generation parts specific to LLVM. Both correctness bugs and optimization-related issues. C-enhancement Category: An issue proposing an enhancement or a PR with one. I-slow Issue: Problems and improvements with respect to performance of generated code. T-libs-api Relevant to the library API team, which will review and decide on the PR/issue.
Projects
None yet
Development

No branches or pull requests

7 participants