Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improve the slice iterator's searching methods #37972

Merged
merged 3 commits into from
Jan 18, 2017

Conversation

bluss
Copy link
Member

@bluss bluss commented Nov 23, 2016

Improve all, any, find, position, rposition by explicitly unrolling the loop for the slice iterators.

  • Introduce a few extension methods and functions for raw pointers make the new code easy to express
  • Introduce helper methods search_while, rsearch_while that generalize all the searching methods

LLVM doesn't unroll the loop in .find() by default (clang is the same), so performance benefits a lot from explicit unrolling here. An iterator method without conditional exits (like .fold()) does not need this on the other hand.

One of the raw pointer extension methods is fn post_inc(&mut self) -> Self which is the rustic equivalent of “ptr++”, and it is a nice way to express the raw pointer loop (see commit 3).

Specific development notes about search_while: I tried both computing an end pointer "rounded" to 4, as well as the ptrdistance >= 4 loop condition, ptrdistance was better. I tried handling the last 0-3 elements unrolled or with a while loop, the loop was better.

@rust-highfive
Copy link
Collaborator

r? @sfackler

(rust_highfive has picked a reviewer for you, use r? to override)

@bluss
Copy link
Member Author

bluss commented Nov 23, 2016

benchmarks: https://gist.github.com/bluss/3fed8a20667d525a5f8a4713442fe540 (updated)

 name                     iter-search-old-3 ns/iter  iter-search-new-3 ns/iter  diff ns/iter   diff % 
 iter_all                 17,826 (1838 MB/s)         11,188 (2928 MB/s)               -6,638  -37.24% 
 iter_find                17,849 (1835 MB/s)         11,149 (2939 MB/s)               -6,700  -37.54% 
 iter_find_split_loop_u8  309 (1656 MB/s)            237 (2160 MB/s)                     -72  -23.30% 
 iter_position            27,389 (1196 MB/s)         11,165 (2934 MB/s)              -16,224  -59.24% 

}
}
false
!self.all(move |x| !f(x))
Copy link
Member

@nagisa nagisa Nov 23, 2016

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a breaking change in semantics for code like this:

impl Iterator for _ {
    fn all(...) -> bool where ... { !self.any(move |x| !f(x)) }
    // no `any` def’n.
}

Currently it would just work, but after this change it would become an infinite loop.

I noticed this, because this is a common issue in Haskell.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh nice catch. Indeed, it ICEs with a massive recursive type.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've removed that part from the PR entirely.

@bluss
Copy link
Member Author

bluss commented Nov 24, 2016

Well it looks like there is something real broken from the travis tests.. (Oh.. ZST offsetting is broken! Edit: And now fixed, of course.)

@bluss
Copy link
Member Author

bluss commented Nov 24, 2016

nom's benchmark improves with the PR (some kind of http test) (commit 2e2730cdb451a55)

 name      nom-before ns/iter  nom-after ns/iter  diff ns/iter   diff % 
 one_test  764                 618                        -146  -19.11%

cc @Geal Do you have more nom benchmarks? This PR is kind of relevant for nom, I hope?

@Geal
Copy link
Contributor

Geal commented Nov 24, 2016

@bluss indeed, it is relevant. I have other benchmarks here: https://github.com/Geal/nom_benchmarks
but I have not updated those in a long time. I'll migrate them to the new nom that' I'll release in a couple of days.

@bluss
Copy link
Member Author

bluss commented Nov 24, 2016

chomp benchmarks cc @m4rw3r (commit d838dd267611edc)

name                            before-2 ns/iter  after-2 ns/iter  diff ns/iter   diff % 
 count_vec_10k                   8,361             8,382                      21    0.25% 
 count_vec_10k_maybe_incomplete  8,182             8,196                      14    0.17% 
 count_vec_1k                    867               865                        -2   -0.23% 
 float_f32                       26,866            27,112                    246    0.92% 
 float_f64                       1,052             1,068                      16    1.52% 
 float_no_conversion             92                64                        -28  -30.43% 
 float_small_f32                 54                54                          0    0.00% 
 float_small_f64                 55                54                         -1   -1.82% 
 float_small_no_conversion       22                23                          1    4.55% 
 from_str_f32                    28,417            27,089                 -1,328   -4.67% 
 from_str_f64                    1,007             987                       -20   -1.99% 
 from_str_small_f32              29                29                          0    0.00% 
 from_str_small_f64              30                29                         -1   -3.33% 
 many1_vec_10k                   10,057            10,070                     13    0.13% 
 many1_vec_10k_maybe_incomplete  10,126            10,115                    -11   -0.11% 
 many1_vec_1k                    1,422             1,415                      -7   -0.49% 
 many_vec_10k                    9,082             9,086                       4    0.04% 
 many_vec_10k_maybe_incomplete   9,355             9,138                    -217   -2.32% 
 many_vec_1k                     1,272             1,236                     -36   -2.83% 
 match_float                     70                43                        -27  -38.57% 
 multiple_requests               53,834            46,884                 -6,950  -12.91% 
 single_request                  786               678                      -108  -13.74% 
 single_request_large            1,195             1,008                    -187  -15.65% 
 single_request_minimal          117               118                         1    0.85%

@Geal
Copy link
Contributor

Geal commented Nov 24, 2016

@bluss I updated the benchmarks on https://github.com/Geal/nom_benchmarks , they should work with nom master now

@bluss
Copy link
Member Author

bluss commented Nov 24, 2016

Ok, I tried to run the benchmarks I found. nom-benchmarks commit 5246454ce735d4fb8c

mp4

 name               before-2 ns/iter  after-1 ns/iter  diff ns/iter  diff % 
 bigbuckbunny_test  225               226                         1   0.44% 
 small_test         241               235                        -6  -2.49%

nom-http

 name         before-1 ns/iter  after-1 ns/iter  diff ns/iter   diff % 
 bigger_test  319,699           262,205               -57,494  -17.98% 
 small_test   60,897            49,495                -11,402  -18.72% 

Use an extension trait for the slice iterator's pointer manipulations.
Introduce a helper method .search_while() that generalizes internal
iteration (Iterator's all, find, position, fold and so on).

The compiler does not unroll loops with conditional exits; we can do
this manually instead to improve the performance of for example
Iterator::find and Iterator::position when used on the slice iterators.

The unrolling is patterned on libstdc++'s implementation of std::find_if.
@bluss
Copy link
Member Author

bluss commented Nov 24, 2016

I started with the idea of general internal iteration, but now it's been simplified into something better. It's search_while instead of fold_while.

@arthurprs
Copy link
Contributor

Nice!

Is this a general optimization though? I guess if the FnMut is large LLVM will just refuse to inline it.

@bluss
Copy link
Member Author

bluss commented Nov 25, 2016

Good question. There would be cases where this manual unrolling is not the right thing to do, but very often the search's predicate is simple.

Hidden in commit 3's log it says “The unrolling is patterned on libstdc++'s implementation of std::find_if.”; so there is a precedent, that library unrolls by 4 explicitly as well.

@arthurprs
Copy link
Contributor

arthurprs commented Nov 25, 2016

It's probably worth testing the behavior when the fn is complex. It's hardly the common case for filters though.

@bluss
Copy link
Member Author

bluss commented Nov 25, 2016

I would prefer to find an existing workload for that rather than something synthetic.

This explicit unroll should be removed again when llvm does this by itself. Until then we have a lot to win, it pains me to see byte-by-byte loops non-unrolled, and it has some significant wins for the parsing libraries.

@bluss
Copy link
Member Author

bluss commented Nov 25, 2016

I found another parsing library, gimli (commit ccc49fa5d4cd2) cc @philipc and it has some improvements and regressions, I couldn't quite understand exactly why two of the cases regressed.

 name                                                       before-best ns/iter  after-best ns/iter  diff ns/iter   diff % 
 bench_executing_line_number_programs                       408,366              429,549                   21,183    5.19% 
 bench_parsing_debug_abbrev                                 14,876               16,176                     1,300    8.74% 
 bench_parsing_debug_aranges                                22,993               23,124                       131    0.57% 
 bench_parsing_debug_info                                   1,866,797            1,908,916                 42,119    2.26% 
 bench_parsing_debug_info_tree                              2,140,274            2,155,311                 15,037    0.70% 
 bench_parsing_debug_pubnames                               106,288              83,556                   -22,732  -21.39% 
 bench_parsing_debug_types                                  50,729               41,663                    -9,066  -17.87% 
 bench_parsing_line_number_program_opcodes                  348,240              349,769                    1,529    0.44% 
 cfi::eval_longest_fde_instructions_new_ctx_everytime       460                  445                          -15   -3.26% 
 cfi::eval_longest_fde_instructions_same_ctx                480                  444                          -36   -7.50% 
 cfi::iterate_entries_and_do_not_parse_any_fde              72,258               74,956                     2,698    3.73% 
 cfi::iterate_entries_and_parse_every_fde                   733,124              730,244                   -2,880   -0.39% 
 cfi::iterate_entries_and_parse_every_fde_and_instructions  1,216,472            1,204,091                -12,381   -1.02% 
 cfi::iterate_entries_evaluate_every_fde                    1,967,683            1,868,310                -99,373   -5.05% 
 cfi::parse_longest_fde_instructions                        213                  212                           -1   -0.47% 

One of the main beneficiaries (bench_parsing_debug_pubnames) spends a lot of time in this: let null_idx = input.iter().position(|ch| *ch == 0)

@arthurprs
Copy link
Contributor

arthurprs commented Nov 25, 2016

ptrdistance can be expensive for non-power-of-2 sizes.

What about specializing the PR for the numerical types (char, ints, floats) and small {2}arrays/tuples?

@bluss
Copy link
Member Author

bluss commented Nov 25, 2016

I'm ok with that, if we can introduce that conditional without making it too messy. It could be as simple as using unrolling only for small elements ("size_of <= size_of<u64>") or for a white list of types. None of those are particularly good solutions. It could even be re-targeted to specifically byte size element loops only, which is quite the hack but also understandable.

Upstream llvm bug for not unrolling simple loops. https://llvm.org/bugs/show_bug.cgi?id=27360

clang does the same thing, no unrolling

#include<vector>

std::size_t find_zero(std::vector<char>& v) {
    auto it = v.begin();
    for (; it != v.end(); ++it) {
        if (!*it) {
            break;
        }
    }
    return std::distance(v.begin(), it);
}
; Function Attrs: norecurse nounwind readonly uwtable
define i64 @_Z9find_zeroRSt6vectorIcSaIcEE(%"class.std::vector"* nocapture readonly dereferenceable(24) %v) #0 {
  %1 = getelementptr inbounds %"class.std::vector", %"class.std::vector"* %v, i64 0, i32 0, i32 0, i32 0
  %2 = load i8*, i8** %1, align 8, !tbaa !1
  %3 = getelementptr inbounds %"class.std::vector", %"class.std::vector"* %v, i64 0, i32 0, i32 0, i32 1
  %4 = load i8*, i8** %3, align 8, !tbaa !1
  %5 = icmp eq i8* %2, %4
  %6 = ptrtoint i8* %2 to i64
  br i1 %5, label %._crit_edge, label %.lr.ph.preheader

.lr.ph.preheader:                                 ; preds = %0
  br label %.lr.ph

.lr.ph:                                           ; preds = %.lr.ph.preheader, %9
  %it.sroa.0.02 = phi i8* [ %10, %9 ], [ %2, %.lr.ph.preheader ]
  %7 = load i8, i8* %it.sroa.0.02, align 1, !tbaa !5
  %8 = icmp eq i8 %7, 0
  br i1 %8, label %._crit_edge.loopexit, label %9

; <label>:9                                       ; preds = %.lr.ph
  %10 = getelementptr inbounds i8, i8* %it.sroa.0.02, i64 1
  %11 = icmp eq i8* %10, %4
  br i1 %11, label %._crit_edge.loopexit, label %.lr.ph

._crit_edge.loopexit:                             ; preds = %9, %.lr.ph
  %it.sroa.0.0.lcssa.ph = phi i8* [ %4, %9 ], [ %it.sroa.0.02, %.lr.ph ]
  br label %._crit_edge

._crit_edge:                                      ; preds = %._crit_edge.loopexit, %0
  %it.sroa.0.0.lcssa = phi i8* [ %2, %0 ], [ %it.sroa.0.0.lcssa.ph, %._crit_edge.loopexit ]
  %12 = ptrtoint i8* %it.sroa.0.0.lcssa to i64
  %13 = sub i64 %12, %6
  ret i64 %13
}

@ranma42
Copy link
Contributor

ranma42 commented Nov 27, 2016

@bluss is LLVM unable to remove the bound checking if indexes are used instead of raw pointers? Are we already tracking this in an issue (and/or an LLVM bug)?
This might make it possible to get rid of the unsafe block and of ptrdiff in this PR and it would also likely improve the code generated in other places.

@bluss
Copy link
Member Author

bluss commented Nov 27, 2016

@ranma42 Do you mean that it should compute the equivalent slice and use an indexed loop over that instead? In that case I've already experimented with that kind of solution (example) and it was slower. (Yes bounds checks are removed in that formulation)

I don't completely understand the question about using indexing in the slice iterator. Indexing doesn't work as well with the iterator (start, end pointer representation) as with the slice (pointer, length representation).

@ranma42
Copy link
Contributor

ranma42 commented Nov 27, 2016

I meant that it would be possible to use a start/end indexes pair (or even a slice itself) to track the not-yet-consumed part in the iterator. If LLVM was able to remove all of the bound checking, I would expect this to have good performance (as the ptr representation can trivially be obtained with a linear transformation, which IIRC is often applied to loop variables)

@bluss
Copy link
Member Author

bluss commented Nov 27, 2016

I don't think that's a natural way to express these algorithms in the slice iterator. It then needs some code to fixup the start, end pointer pair (update from the indices) when it exits the loop.

I was quite pleased with the code in search_while because it is so clear that it is "obviously" correct. Link to that line Ok, don't laugh 😄... Yes there's a macro, a closure call, a method call there on each line. But the methods and macros are introduced to make it neat, unsafe code should be neat and easy to review.

@philipc
Copy link
Contributor

philipc commented Nov 28, 2016

@bluss For the gimli benchmarks, I'm definitely seeing that improvement for bench_parsing_debug_pubnames and bench_parsing_debug_types. However, I think all the other regressions you saw were just noise. I'm not seeing any significant differences there over multiple runs.

Copy link
Member

@Kimundi Kimundi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good!

@steveklabnik
Copy link
Member

ping @sfackler , you're the reviewer here, but you haven't commented. Should we move this review to someone else?

@sfackler
Copy link
Member

sfackler commented Jan 3, 2017

@bors r+

@bluss
Copy link
Member Author

bluss commented Jan 11, 2017

Relevant parallel issue that hasn't been resolved - A discussion if libc++ should adopt libstdc++'s unrolling for the C++ find. https://llvm.org/bugs/show_bug.cgi?id=19708 And look at that delicious Duff's device

@alexcrichton
Copy link
Member

@bors: r=sfackler

@bors
Copy link
Contributor

bors commented Jan 17, 2017

📌 Commit a54ddfb has been approved by sfackler

@bors
Copy link
Contributor

bors commented Jan 17, 2017

⌛ Testing commit a54ddfb with merge c07a6ae...

bors added a commit that referenced this pull request Jan 17, 2017
Improve the slice iterator's searching methods

Improve all, any, find, position, rposition by explicitly unrolling the loop for the slice iterators.

- Introduce a few extension methods and functions for raw pointers make the new code easy to express
- Introduce helper methods `search_while, rsearch_while` that generalize all the searching methods

LLVM doesn't unroll the loop in `.find()` by default (clang is the same), so performance benefits a lot from explicit unrolling here. An iterator method without conditional exits (like `.fold()`) does not need this on the other hand.

One of the raw pointer extension methods is `fn post_inc(&mut self) -> Self` which is the rustic equivalent of “`ptr++`”, and it is a nice way to express the raw pointer loop (see commit 3).

Specific development notes about `search_while`: I tried both computing an end pointer "rounded" to 4, as well as the `ptrdistance >= 4` loop condition, ptrdistance was better. I tried handling the last 0-3 elements unrolled or with a while loop, the loop was better.
@bors
Copy link
Contributor

bors commented Jan 18, 2017

☀️ Test successful - status-appveyor, status-travis
Approved by: sfackler
Pushing c07a6ae to master...

@bors bors merged commit a54ddfb into rust-lang:master Jan 18, 2017
@bluss bluss deleted the iter-find-is-on-a-roll branch January 18, 2017 09:40
@gnzlbg
Copy link
Contributor

gnzlbg commented Jan 25, 2017

@bluss that moment when I click on a PR that I find interesting to see if I can improve it using SIMD and I follow your link to an LLVM bug report that I also found interesting in which surprisingly the last comment is from myself quoting a SIMD implementation of the algorithm 🤣

I've tried to think if we can use specialization to implement SSE2 and AVX versions "manually" but since the predicate is opaque we cannot.

We would need to provide in std "generic predicates", like e.g. Eq<T, U>(value: U) (and Le/Ge/Leq/Geq/Neq...) that use fn_traits to become callable and do x == value. This would allow specializing on e.g. slices of particular integer types and particular comparison operators (since Eq is its own type), and then use cfg_target_feature and extern-intrinsic to use SIMD when supported by the target. For users to benefit from the optimization, they would need to call iter().find(Eq(3)) instead of iter().find(|&&x| x == 3)which... is IMO too much trouble (and confusingness) for an optimization that should "just trigger" for closures as well...

@bluss
Copy link
Member Author

bluss commented Jan 25, 2017

@gnzlbg The compiler making the decision about unrolling and vectorization is definitely preferable, since code size etc decisions can be made according to settings and context. Don't you think that autovectorization would follow as well if the compiler just learns to unroll this loop by itself?

By the way, maybe you can improve on the memchr fallback that's in libstd if you are interested.

@gnzlbg
Copy link
Contributor

gnzlbg commented Jan 25, 2017

@bluss I will look into the memchr fallback, thanks for the pointer.

I agree that it would be better if auto-vectorization would just work but... Why do you think that it will ever work? All the data I have points that it never will: after 30 years of GCC and 15 years of LLVM, inlining, which is required for auto-vectorization, is believed to be an unsolvable problem, and auto-vectorization optimizations themselves still fail to reliably optimize even the most trivial loops.

@bluss
Copy link
Member Author

bluss commented Jan 25, 2017

I think there are many examples where it works amazingly well.

@gnzlbg
Copy link
Contributor

gnzlbg commented Jan 25, 2017

I should have emphasized the word reliably.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.