Improve the slice iterator's searching methods #37972

bluss · 2016-11-23T22:18:12Z

Improve all, any, find, position, rposition by explicitly unrolling the loop for the slice iterators.

Introduce a few extension methods and functions for raw pointers make the new code easy to express
Introduce helper methods search_while, rsearch_while that generalize all the searching methods

LLVM doesn't unroll the loop in .find() by default (clang is the same), so performance benefits a lot from explicit unrolling here. An iterator method without conditional exits (like .fold()) does not need this on the other hand.

One of the raw pointer extension methods is fn post_inc(&mut self) -> Self which is the rustic equivalent of “ptr++”, and it is a nice way to express the raw pointer loop (see commit 3).

Specific development notes about search_while: I tried both computing an end pointer "rounded" to 4, as well as the ptrdistance >= 4 loop condition, ptrdistance was better. I tried handling the last 0-3 elements unrolled or with a while loop, the loop was better.

rust-highfive · 2016-11-23T22:18:25Z

r? @sfackler

(rust_highfive has picked a reviewer for you, use r? to override)

bluss · 2016-11-23T23:12:21Z

benchmarks: https://gist.github.com/bluss/3fed8a20667d525a5f8a4713442fe540 (updated)

 name                     iter-search-old-3 ns/iter  iter-search-new-3 ns/iter  diff ns/iter   diff % 
 iter_all                 17,826 (1838 MB/s)         11,188 (2928 MB/s)               -6,638  -37.24% 
 iter_find                17,849 (1835 MB/s)         11,149 (2939 MB/s)               -6,700  -37.54% 
 iter_find_split_loop_u8  309 (1656 MB/s)            237 (2160 MB/s)                     -72  -23.30% 
 iter_position            27,389 (1196 MB/s)         11,165 (2934 MB/s)              -16,224  -59.24%

nagisa · 2016-11-23T23:48:10Z

src/libcore/iter/iterator.rs

-            }
-        }
-        false
+        !self.all(move |x| !f(x))


This is a breaking change in semantics for code like this:

impl Iterator for _ { fn all(...) -> bool where ... { !self.any(move |x| !f(x)) } // no `any` def’n. }

Currently it would just work, but after this change it would become an infinite loop.

I noticed this, because this is a common issue in Haskell.

Oh nice catch. Indeed, it ICEs with a massive recursive type.

I've removed that part from the PR entirely.

bluss · 2016-11-24T00:23:26Z

Well it looks like there is something real broken from the travis tests.. (Oh.. ZST offsetting is broken! Edit: And now fixed, of course.)

bluss · 2016-11-24T01:20:29Z

nom's benchmark improves with the PR (some kind of http test) (commit 2e2730cdb451a55)

 name      nom-before ns/iter  nom-after ns/iter  diff ns/iter   diff % 
 one_test  764                 618                        -146  -19.11%

cc @Geal Do you have more nom benchmarks? This PR is kind of relevant for nom, I hope?

Geal · 2016-11-24T10:12:30Z

@bluss indeed, it is relevant. I have other benchmarks here: https://github.com/Geal/nom_benchmarks
but I have not updated those in a long time. I'll migrate them to the new nom that' I'll release in a couple of days.

bluss · 2016-11-24T13:36:57Z

chomp benchmarks cc @m4rw3r (commit d838dd267611edc)

name                            before-2 ns/iter  after-2 ns/iter  diff ns/iter   diff % 
 count_vec_10k                   8,361             8,382                      21    0.25% 
 count_vec_10k_maybe_incomplete  8,182             8,196                      14    0.17% 
 count_vec_1k                    867               865                        -2   -0.23% 
 float_f32                       26,866            27,112                    246    0.92% 
 float_f64                       1,052             1,068                      16    1.52% 
 float_no_conversion             92                64                        -28  -30.43% 
 float_small_f32                 54                54                          0    0.00% 
 float_small_f64                 55                54                         -1   -1.82% 
 float_small_no_conversion       22                23                          1    4.55% 
 from_str_f32                    28,417            27,089                 -1,328   -4.67% 
 from_str_f64                    1,007             987                       -20   -1.99% 
 from_str_small_f32              29                29                          0    0.00% 
 from_str_small_f64              30                29                         -1   -3.33% 
 many1_vec_10k                   10,057            10,070                     13    0.13% 
 many1_vec_10k_maybe_incomplete  10,126            10,115                    -11   -0.11% 
 many1_vec_1k                    1,422             1,415                      -7   -0.49% 
 many_vec_10k                    9,082             9,086                       4    0.04% 
 many_vec_10k_maybe_incomplete   9,355             9,138                    -217   -2.32% 
 many_vec_1k                     1,272             1,236                     -36   -2.83% 
 match_float                     70                43                        -27  -38.57% 
 multiple_requests               53,834            46,884                 -6,950  -12.91% 
 single_request                  786               678                      -108  -13.74% 
 single_request_large            1,195             1,008                    -187  -15.65% 
 single_request_minimal          117               118                         1    0.85%

Geal · 2016-11-24T14:29:14Z

@bluss I updated the benchmarks on https://github.com/Geal/nom_benchmarks , they should work with nom master now

bluss · 2016-11-24T14:35:25Z

Ok, I tried to run the benchmarks I found. nom-benchmarks commit 5246454ce735d4fb8c

mp4

 name               before-2 ns/iter  after-1 ns/iter  diff ns/iter  diff % 
 bigbuckbunny_test  225               226                         1   0.44% 
 small_test         241               235                        -6  -2.49%

nom-http

 name         before-1 ns/iter  after-1 ns/iter  diff ns/iter   diff % 
 bigger_test  319,699           262,205               -57,494  -17.98% 
 small_test   60,897            49,495                -11,402  -18.72%

Use an extension trait for the slice iterator's pointer manipulations.

Introduce a helper method .search_while() that generalizes internal iteration (Iterator's all, find, position, fold and so on). The compiler does not unroll loops with conditional exits; we can do this manually instead to improve the performance of for example Iterator::find and Iterator::position when used on the slice iterators. The unrolling is patterned on libstdc++'s implementation of std::find_if.

bluss · 2016-11-24T15:58:07Z

I started with the idea of general internal iteration, but now it's been simplified into something better. It's search_while instead of fold_while.

arthurprs · 2016-11-25T09:50:24Z

Nice!

Is this a general optimization though? I guess if the FnMut is large LLVM will just refuse to inline it.

bluss · 2016-11-25T10:49:32Z

Good question. There would be cases where this manual unrolling is not the right thing to do, but very often the search's predicate is simple.

Hidden in commit 3's log it says “The unrolling is patterned on libstdc++'s implementation of std::find_if.”; so there is a precedent, that library unrolls by 4 explicitly as well.

arthurprs · 2016-11-25T10:56:20Z

It's probably worth testing the behavior when the fn is complex. It's hardly the common case for filters though.

bluss · 2016-11-25T12:53:48Z

I would prefer to find an existing workload for that rather than something synthetic.

This explicit unroll should be removed again when llvm does this by itself. Until then we have a lot to win, it pains me to see byte-by-byte loops non-unrolled, and it has some significant wins for the parsing libraries.

bluss · 2016-11-25T13:41:21Z

I found another parsing library, gimli (commit ccc49fa5d4cd2) cc @philipc and it has some improvements and regressions, I couldn't quite understand exactly why two of the cases regressed.

 name                                                       before-best ns/iter  after-best ns/iter  diff ns/iter   diff % 
 bench_executing_line_number_programs                       408,366              429,549                   21,183    5.19% 
 bench_parsing_debug_abbrev                                 14,876               16,176                     1,300    8.74% 
 bench_parsing_debug_aranges                                22,993               23,124                       131    0.57% 
 bench_parsing_debug_info                                   1,866,797            1,908,916                 42,119    2.26% 
 bench_parsing_debug_info_tree                              2,140,274            2,155,311                 15,037    0.70% 
 bench_parsing_debug_pubnames                               106,288              83,556                   -22,732  -21.39% 
 bench_parsing_debug_types                                  50,729               41,663                    -9,066  -17.87% 
 bench_parsing_line_number_program_opcodes                  348,240              349,769                    1,529    0.44% 
 cfi::eval_longest_fde_instructions_new_ctx_everytime       460                  445                          -15   -3.26% 
 cfi::eval_longest_fde_instructions_same_ctx                480                  444                          -36   -7.50% 
 cfi::iterate_entries_and_do_not_parse_any_fde              72,258               74,956                     2,698    3.73% 
 cfi::iterate_entries_and_parse_every_fde                   733,124              730,244                   -2,880   -0.39% 
 cfi::iterate_entries_and_parse_every_fde_and_instructions  1,216,472            1,204,091                -12,381   -1.02% 
 cfi::iterate_entries_evaluate_every_fde                    1,967,683            1,868,310                -99,373   -5.05% 
 cfi::parse_longest_fde_instructions                        213                  212                           -1   -0.47%

One of the main beneficiaries (bench_parsing_debug_pubnames) spends a lot of time in this: let null_idx = input.iter().position(|ch| *ch == 0)

arthurprs · 2016-11-25T13:54:12Z

ptrdistance can be expensive for non-power-of-2 sizes.

What about specializing the PR for the numerical types (char, ints, floats) and small {2}arrays/tuples?

bluss · 2016-11-25T14:19:58Z

I'm ok with that, if we can introduce that conditional without making it too messy. It could be as simple as using unrolling only for small elements ("size_of <= size_of<u64>") or for a white list of types. None of those are particularly good solutions. It could even be re-targeted to specifically byte size element loops only, which is quite the hack but also understandable.

Upstream llvm bug for not unrolling simple loops. https://llvm.org/bugs/show_bug.cgi?id=27360

clang does the same thing, no unrolling

#include<vector>

std::size_t find_zero(std::vector<char>& v) {
    auto it = v.begin();
    for (; it != v.end(); ++it) {
        if (!*it) {
            break;
        }
    }
    return std::distance(v.begin(), it);
}

; Function Attrs: norecurse nounwind readonly uwtable
define i64 @_Z9find_zeroRSt6vectorIcSaIcEE(%"class.std::vector"* nocapture readonly dereferenceable(24) %v) #0 {
  %1 = getelementptr inbounds %"class.std::vector", %"class.std::vector"* %v, i64 0, i32 0, i32 0, i32 0
  %2 = load i8*, i8** %1, align 8, !tbaa !1
  %3 = getelementptr inbounds %"class.std::vector", %"class.std::vector"* %v, i64 0, i32 0, i32 0, i32 1
  %4 = load i8*, i8** %3, align 8, !tbaa !1
  %5 = icmp eq i8* %2, %4
  %6 = ptrtoint i8* %2 to i64
  br i1 %5, label %._crit_edge, label %.lr.ph.preheader

.lr.ph.preheader:                                 ; preds = %0
  br label %.lr.ph

.lr.ph:                                           ; preds = %.lr.ph.preheader, %9
  %it.sroa.0.02 = phi i8* [ %10, %9 ], [ %2, %.lr.ph.preheader ]
  %7 = load i8, i8* %it.sroa.0.02, align 1, !tbaa !5
  %8 = icmp eq i8 %7, 0
  br i1 %8, label %._crit_edge.loopexit, label %9

; <label>:9                                       ; preds = %.lr.ph
  %10 = getelementptr inbounds i8, i8* %it.sroa.0.02, i64 1
  %11 = icmp eq i8* %10, %4
  br i1 %11, label %._crit_edge.loopexit, label %.lr.ph

._crit_edge.loopexit:                             ; preds = %9, %.lr.ph
  %it.sroa.0.0.lcssa.ph = phi i8* [ %4, %9 ], [ %it.sroa.0.02, %.lr.ph ]
  br label %._crit_edge

._crit_edge:                                      ; preds = %._crit_edge.loopexit, %0
  %it.sroa.0.0.lcssa = phi i8* [ %2, %0 ], [ %it.sroa.0.0.lcssa.ph, %._crit_edge.loopexit ]
  %12 = ptrtoint i8* %it.sroa.0.0.lcssa to i64
  %13 = sub i64 %12, %6
  ret i64 %13
}

ranma42 · 2016-11-27T09:53:01Z

@bluss is LLVM unable to remove the bound checking if indexes are used instead of raw pointers? Are we already tracking this in an issue (and/or an LLVM bug)?
This might make it possible to get rid of the unsafe block and of ptrdiff in this PR and it would also likely improve the code generated in other places.

bluss · 2016-11-27T12:24:56Z

@ranma42 Do you mean that it should compute the equivalent slice and use an indexed loop over that instead? In that case I've already experimented with that kind of solution (example) and it was slower. (Yes bounds checks are removed in that formulation)

I don't completely understand the question about using indexing in the slice iterator. Indexing doesn't work as well with the iterator (start, end pointer representation) as with the slice (pointer, length representation).

ranma42 · 2016-11-27T17:47:31Z

I meant that it would be possible to use a start/end indexes pair (or even a slice itself) to track the not-yet-consumed part in the iterator. If LLVM was able to remove all of the bound checking, I would expect this to have good performance (as the ptr representation can trivially be obtained with a linear transformation, which IIRC is often applied to loop variables)

bluss · 2016-11-27T17:53:28Z

I don't think that's a natural way to express these algorithms in the slice iterator. It then needs some code to fixup the start, end pointer pair (update from the indices) when it exits the loop.

I was quite pleased with the code in search_while because it is so clear that it is "obviously" correct. Link to that line Ok, don't laugh 😄... Yes there's a macro, a closure call, a method call there on each line. But the methods and macros are introduced to make it neat, unsafe code should be neat and easy to review.

philipc · 2016-11-28T03:20:56Z

@bluss For the gimli benchmarks, I'm definitely seeing that improvement for bench_parsing_debug_pubnames and bench_parsing_debug_types. However, I think all the other regressions you saw were just noise. I'm not seeing any significant differences there over multiple runs.

Kimundi

Looks good!

steveklabnik · 2017-01-03T16:40:41Z

ping @sfackler , you're the reviewer here, but you haven't commented. Should we move this review to someone else?

sfackler · 2017-01-03T17:45:34Z

@bors r+

bluss · 2017-01-11T18:45:43Z

Relevant parallel issue that hasn't been resolved - A discussion if libc++ should adopt libstdc++'s unrolling for the C++ find. https://llvm.org/bugs/show_bug.cgi?id=19708 ^{And look at that delicious Duff's device}

alexcrichton · 2017-01-17T21:10:25Z

@bors: r=sfackler

bors · 2017-01-17T21:10:26Z

📌 Commit a54ddfb has been approved by sfackler

bors · 2017-01-17T21:51:31Z

⌛ Testing commit a54ddfb with merge c07a6ae...

Improve the slice iterator's searching methods Improve all, any, find, position, rposition by explicitly unrolling the loop for the slice iterators. - Introduce a few extension methods and functions for raw pointers make the new code easy to express - Introduce helper methods `search_while, rsearch_while` that generalize all the searching methods LLVM doesn't unroll the loop in `.find()` by default (clang is the same), so performance benefits a lot from explicit unrolling here. An iterator method without conditional exits (like `.fold()`) does not need this on the other hand. One of the raw pointer extension methods is `fn post_inc(&mut self) -> Self` which is the rustic equivalent of “`ptr++`”, and it is a nice way to express the raw pointer loop (see commit 3). Specific development notes about `search_while`: I tried both computing an end pointer "rounded" to 4, as well as the `ptrdistance >= 4` loop condition, ptrdistance was better. I tried handling the last 0-3 elements unrolled or with a while loop, the loop was better.

bors · 2017-01-18T00:26:54Z

☀️ Test successful - status-appveyor, status-travis
Approved by: sfackler
Pushing c07a6ae to master...

gnzlbg · 2017-01-25T09:03:15Z

@bluss that moment when I click on a PR that I find interesting to see if I can improve it using SIMD and I follow your link to an LLVM bug report that I also found interesting in which surprisingly the last comment is from myself quoting a SIMD implementation of the algorithm 🤣

I've tried to think if we can use specialization to implement SSE2 and AVX versions "manually" but since the predicate is opaque we cannot.

We would need to provide in std "generic predicates", like e.g. Eq<T, U>(value: U) (and Le/Ge/Leq/Geq/Neq...) that use fn_traits to become callable and do x == value. This would allow specializing on e.g. slices of particular integer types and particular comparison operators (since Eq is its own type), and then use cfg_target_feature and extern-intrinsic to use SIMD when supported by the target. For users to benefit from the optimization, they would need to call iter().find(Eq(3)) instead of iter().find(|&&x| x == 3)which... is IMO too much trouble (and confusingness) for an optimization that should "just trigger" for closures as well...

bluss · 2017-01-25T12:15:36Z

@gnzlbg The compiler making the decision about unrolling and vectorization is definitely preferable, since code size etc decisions can be made according to settings and context. Don't you think that autovectorization would follow as well if the compiler just learns to unroll this loop by itself?

By the way, maybe you can improve on the memchr fallback that's in libstd if you are interested.

gnzlbg · 2017-01-25T13:31:02Z

@bluss I will look into the memchr fallback, thanks for the pointer.

I agree that it would be better if auto-vectorization would just work but... Why do you think that it will ever work? All the data I have points that it never will: after 30 years of GCC and 15 years of LLVM, inlining, which is required for auto-vectorization, is believed to be an unsolvable problem, and auto-vectorization optimizations themselves still fail to reliably optimize even the most trivial loops.

bluss · 2017-01-25T13:55:00Z

I think there are many examples where it works amazingly well.

gnzlbg · 2017-01-25T14:13:47Z

I should have emphasized the word reliably.

rust-highfive assigned sfackler Nov 23, 2016

nagisa reviewed Nov 23, 2016

View reviewed changes

bluss force-pushed the iter-find-is-on-a-roll branch from 4c90af8 to 57429b6 Compare November 24, 2016 00:25

bluss added 3 commits November 24, 2016 16:55

core: Convert utility macros for the slice iterator into a trait

c1db77e

Use an extension trait for the slice iterator's pointer manipulations.

core: Add ptrdistance to slice module

7611e42

bluss force-pushed the iter-find-is-on-a-roll branch from 90fd57d to a54ddfb Compare November 24, 2016 15:56

Kimundi reviewed Nov 28, 2016

View reviewed changes

bluss mentioned this pull request Nov 29, 2016

Add find_map to the Iterator trait rust-lang/rfcs#1801

Closed

bluss mentioned this pull request Dec 30, 2016

Rust doesn't unroll a loop with constant data #38694

Closed

bluss mentioned this pull request Jan 3, 2017

Extra null check on as_bytes() iteration with break #25306

Closed

bors merged commit a54ddfb into rust-lang:master Jan 18, 2017

bluss deleted the iter-find-is-on-a-roll branch January 18, 2017 09:40

scottmcm mentioned this pull request Sep 19, 2019

Simplify some Iterator methods. #64572

Closed

Improve the slice iterator's searching methods #37972

Improve the slice iterator's searching methods #37972

Conversation

bluss commented Nov 23, 2016 • edited Loading

rust-highfive commented Nov 23, 2016

bluss commented Nov 23, 2016 • edited Loading

nagisa Nov 23, 2016 • edited Loading

Choose a reason for hiding this comment

bluss Nov 24, 2016

Choose a reason for hiding this comment

bluss Nov 24, 2016

Choose a reason for hiding this comment

bluss commented Nov 24, 2016 • edited Loading

bluss commented Nov 24, 2016 • edited Loading

Geal commented Nov 24, 2016

bluss commented Nov 24, 2016 • edited Loading

Geal commented Nov 24, 2016

bluss commented Nov 24, 2016

bluss commented Nov 24, 2016

arthurprs commented Nov 25, 2016

bluss commented Nov 25, 2016

arthurprs commented Nov 25, 2016 • edited Loading

bluss commented Nov 25, 2016

bluss commented Nov 25, 2016

arthurprs commented Nov 25, 2016 • edited Loading

bluss commented Nov 25, 2016 • edited Loading

ranma42 commented Nov 27, 2016

bluss commented Nov 27, 2016 • edited Loading

ranma42 commented Nov 27, 2016

bluss commented Nov 27, 2016 • edited Loading

philipc commented Nov 28, 2016

Kimundi left a comment

Choose a reason for hiding this comment

steveklabnik commented Jan 3, 2017

sfackler commented Jan 3, 2017

bluss commented Jan 11, 2017

alexcrichton commented Jan 17, 2017

bors commented Jan 17, 2017

bors commented Jan 17, 2017

bors commented Jan 18, 2017

gnzlbg commented Jan 25, 2017 • edited Loading

bluss commented Jan 25, 2017

gnzlbg commented Jan 25, 2017 • edited Loading

bluss commented Jan 25, 2017

gnzlbg commented Jan 25, 2017

bluss commented Nov 23, 2016 •

edited

Loading

bluss commented Nov 23, 2016 •

edited

Loading

nagisa Nov 23, 2016 •

edited

Loading

bluss commented Nov 24, 2016 •

edited

Loading

bluss commented Nov 24, 2016 •

edited

Loading

bluss commented Nov 24, 2016 •

edited

Loading

arthurprs commented Nov 25, 2016 •

edited

Loading

arthurprs commented Nov 25, 2016 •

edited

Loading

bluss commented Nov 25, 2016 •

edited

Loading

bluss commented Nov 27, 2016 •

edited

Loading

bluss commented Nov 27, 2016 •

edited

Loading

gnzlbg commented Jan 25, 2017 •

edited

Loading

gnzlbg commented Jan 25, 2017 •

edited

Loading