
Reduced allocations in merge_sort for short vectors #12029

Merged — 1 commit merged into rust-lang:master on Feb 7, 2014

Conversation

@zkamsler (Contributor) commented Feb 4, 2014

This pull request:

  1. Changes the initial insertion sort to be in-place, and defers allocation of working set until merge is needed.
  2. Increases the maximum run length for which insertion sort is used from 8 to 32 elements. This increases the size of vectors that will not allocate, and reduces the number of merge passes by two. It seemed to be the sweet spot in the benchmarks that I ran.
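The strategy described above — insertion-sort short runs in place, and only allocate the working buffer once a merge pass is actually needed — can be sketched roughly like this. This is a simplified modern-Rust illustration of the idea, not the actual patch (the real implementation works with raw pointers into an uninitialized buffer rather than `Clone`):

```rust
// Sketch of the deferred-allocation merge sort: runs of up to
// INSERTION_THRESHOLD elements are sorted in place, so short vectors
// never allocate at all.
const INSERTION_THRESHOLD: usize = 32;

// In-place insertion sort: no temporary buffer required.
fn insertion_sort<T: Ord>(v: &mut [T]) {
    for i in 1..v.len() {
        let mut j = i;
        // Shift v[i] left until it reaches its sorted position.
        while j > 0 && v[j] < v[j - 1] {
            v.swap(j, j - 1);
            j -= 1;
        }
    }
}

// Stable merge of two sorted runs v[..mid] and v[mid..], using `buf` as
// scratch space.
fn merge<T: Ord + Clone>(v: &mut [T], mid: usize, buf: &mut Vec<T>) {
    buf.clear();
    buf.extend_from_slice(v);
    let (left, right) = buf.split_at(mid);
    let (mut i, mut j) = (0, 0);
    for slot in v.iter_mut() {
        // `<=` keeps the sort stable: ties are taken from the left run.
        if j >= right.len() || (i < left.len() && left[i] <= right[j]) {
            *slot = left[i].clone();
            i += 1;
        } else {
            *slot = right[j].clone();
            j += 1;
        }
    }
}

fn merge_sort<T: Ord + Clone>(v: &mut [T]) {
    let len = v.len();
    // Phase 1: insertion-sort each run in place.
    for chunk in v.chunks_mut(INSERTION_THRESHOLD) {
        insertion_sort(chunk);
    }
    if len <= INSERTION_THRESHOLD {
        return; // short vectors: done, with zero allocations
    }
    // Phase 2: only now allocate scratch space and merge runs bottom-up.
    let mut buf: Vec<T> = Vec::with_capacity(len);
    let mut width = INSERTION_THRESHOLD;
    while width < len {
        for start in (0..len).step_by(2 * width) {
            let end = (start + 2 * width).min(len);
            if start + width < end {
                merge(&mut v[start..end], width, &mut buf);
            }
        }
        width *= 2;
    }
}

fn main() {
    let mut v: Vec<u64> = (0..200u64).rev().collect();
    merge_sort(&mut v);
    assert!(v.windows(2).all(|w| w[0] <= w[1]));
}
```

With a 32-element threshold, any vector of up to 32 elements is fully sorted in phase 1 and skips the allocation entirely, which is what drives the `sort_random_small` improvements below.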

Here are the results of some benchmarks. Note that they are sorting u64s, so types that are more expensive to compare or copy may have different behaviors.
Before changes:

```
test vec::bench::sort_random_large      bench:    719753 ns/iter (+/- 130173) = 111 MB/s
test vec::bench::sort_random_medium     bench:      4726 ns/iter (+/- 742) = 169 MB/s
test vec::bench::sort_random_small      bench:       344 ns/iter (+/- 76) = 116 MB/s
test vec::bench::sort_sorted            bench:    437244 ns/iter (+/- 70043) = 182 MB/s
```

Deferred allocation (8 element insertion sort):

```
test vec::bench::sort_random_large      bench:    702630 ns/iter (+/- 88158) = 113 MB/s
test vec::bench::sort_random_medium     bench:      4529 ns/iter (+/- 497) = 176 MB/s
test vec::bench::sort_random_small      bench:       185 ns/iter (+/- 49) = 216 MB/s
test vec::bench::sort_sorted            bench:    425853 ns/iter (+/- 60907) = 187 MB/s
```

Deferred allocation (16 element insertion sort):

```
test vec::bench::sort_random_large      bench:    692783 ns/iter (+/- 165837) = 115 MB/s
test vec::bench::sort_random_medium     bench:      4434 ns/iter (+/- 722) = 180 MB/s
test vec::bench::sort_random_small      bench:       187 ns/iter (+/- 38) = 213 MB/s
test vec::bench::sort_sorted            bench:    393783 ns/iter (+/- 85548) = 203 MB/s
```

Deferred allocation (32 element insertion sort):

```
test vec::bench::sort_random_large      bench:    682556 ns/iter (+/- 131008) = 117 MB/s
test vec::bench::sort_random_medium     bench:      4370 ns/iter (+/- 1369) = 183 MB/s
test vec::bench::sort_random_small      bench:       179 ns/iter (+/- 32) = 223 MB/s
test vec::bench::sort_sorted            bench:    358353 ns/iter (+/- 65423) = 223 MB/s
```

Deferred allocation (64 element insertion sort):

```
test vec::bench::sort_random_large      bench:    712040 ns/iter (+/- 132454) = 112 MB/s
test vec::bench::sort_random_medium     bench:      4425 ns/iter (+/- 784) = 180 MB/s
test vec::bench::sort_random_small      bench:       179 ns/iter (+/- 81) = 223 MB/s
test vec::bench::sort_sorted            bench:    317812 ns/iter (+/- 62675) = 251 MB/s
```

This is the best I could manage with the basic merge sort while keeping the invariant that the original vector must contain each element exactly once when the comparison function is called. If one is not married to a stable sort, an in-place n*log(n) sorting algorithm may have better performance in some cases.

for #12011
cc @huonw

@emberian (Member) commented Feb 4, 2014

Awesome! LGTM

@huonw (Member) commented Feb 4, 2014

Cool! Will review in a little while.

@huonw (Member) commented Feb 4, 2014

> Changes the initial insertion sort to be in-place, and defers allocation of working set until merge is needed.

I'm not sure about this change; maybe it would be better to handle the short case completely separately, and do the initial sort directly into the temporary space for longer vectors. (Of course this is getting a little non-DRY. :( )

> If one is not married to a stable sort, an in-place n*log(n) sorting algorithm may have better performance in some cases.

There was some discussion of having both sort and stable_sort (or unstable_sort and sort). I don't think a decision was made.
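For context on why the choice matters: stability is observable whenever elements that compare equal carry extra data. A small illustration (using today's standard-library `sort_by_key`, which is stable, as a stand-in for the stable merge sort discussed here):

```rust
fn main() {
    // Stability matters when elements that compare equal carry extra data:
    // a stable sort keeps their original relative order.
    let mut pairs = vec![(1, 'b'), (0, 'a'), (1, 'a'), (0, 'b')];
    // Sort by the numeric key only; ties keep their input order.
    pairs.sort_by_key(|&(k, _)| k);
    assert_eq!(pairs, vec![(0, 'a'), (0, 'b'), (1, 'b'), (1, 'a')]);
    // An unstable sort is free to emit (1, 'a') before (1, 'b'),
    // typically in exchange for sorting in place with no allocation.
}
```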

```
// `compare` fails.
let mut working_space = with_capacity(2 * len);
// these both are buffers of length `len`.
let mut buf_dat = working_space.as_mut_ptr();
```

@huonw (Member): This assignment is never read, and could just be `let mut buf_dat = buf_v;`, with a comment.

@zkamsler (Contributor, Author): It is read by `buf_tmp`, but that could be inlined if that is clearer.

@huonw (Member): Oh, yes, so it is; inlining would be good, or just `working_space.unsafe_mut_ref(len)` for `buf_tmp`.

In any case, I would prefer out-of-place insertion sort for large vectors (or at least experimenting with it), which would mean this discussion is moot anyway.

@brson (Contributor) commented Feb 4, 2014

👍

@zkamsler (Contributor, Author) commented Feb 5, 2014

I have updated the commit based on the minor comments that you had. I am certainly open to making the in-place sorting a separate path, but I am not sure what benefit that would bring if it would be doing almost the same thing. The in-place sorting does not appear to carry a performance penalty (the benchmarks show a marginal improvement for the non-short case with in-place sorting, although that is hardly authoritative).

Keeping the short path separate does afford more future flexibility to change them independently, but nothing prevents that from happening later. What is your motivation for doing so?

@huonw (Member) commented Feb 6, 2014

I was mainly thinking of the extra cost of doing two copies while insertion sorting (i.e. copying into tmp, then copying tmp back into the vector); but, as you say, there appears to still be a speed-up.

I imagine it won't be any different for "small" types like u64, but it might show some difference for a type like (u64, u64, u64, u64) (or even larger, but not so large as to make using ~ the correct option), where copies are more significant. Could you add three benchmarks (i.e. small, medium, large) with a larger type than just u64?
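As a rough standalone way to eyeball the effect of element size without the full `vec::bench` harness, one could time sorts directly (a hypothetical sketch; `time_sort` is made up for illustration, and the standard-library `sort` stands in for the merge sort under test):

```rust
use std::time::Instant;

// Hypothetical timing sketch comparing a small element type (u64) with a
// large one ((u64, u64, u64, u64)), where every copy moves 4x the bytes.
fn time_sort<T: Ord>(mut data: Vec<T>, label: &str) -> Vec<T> {
    let start = Instant::now();
    data.sort(); // stand-in for the merge sort under test
    println!("{}: {:?}", label, start.elapsed());
    data
}

fn main() {
    // Deterministic pseudo-random input via a multiplicative hash.
    let small: Vec<u64> = (0..100_000u64)
        .map(|i| i.wrapping_mul(2_654_435_761))
        .collect();
    let big: Vec<(u64, u64, u64, u64)> =
        small.iter().map(|&x| (x, x, x, x)).collect();

    let sorted = time_sort(small, "u64");
    assert!(sorted.windows(2).all(|w| w[0] <= w[1]));
    let sorted_big = time_sort(big, "(u64, u64, u64, u64)");
    assert!(sorted_big.windows(2).all(|w| w[0] <= w[1]));
}
```

Single wall-clock runs like this are much noisier than the iterating benchmark harness, so they only show large effects.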

@zkamsler (Contributor, Author) commented Feb 6, 2014

Here are the results of some benchmarks that I ran locally. The `_big` ones sort vectors of (u64, u64, u64, u64). Out-of-place might be slightly faster for the big ones, but it is noisy enough that I cannot say whether the difference is significant. The larger insertion-sort runs do appear to be slower when sorting the tuples, which makes sense given that insertion sort does a lot more writes than merge sort.

in-place insertion (32 elem max run):

```
test vec::bench::sort_big_random_large    ... bench:   1512893 ns/iter (+/- 154790) = 211 MB/s
test vec::bench::sort_big_random_medium   ... bench:      8542 ns/iter (+/- 1369) = 374 MB/s
test vec::bench::sort_big_random_small    ... bench:       246 ns/iter (+/- 147) = 650 MB/s
test vec::bench::sort_big_sorted          ... bench:    536906 ns/iter (+/- 61414) = 595 MB/s
test vec::bench::sort_random_large        ... bench:    685387 ns/iter (+/- 86290) = 116 MB/s
test vec::bench::sort_random_medium       ... bench:      4277 ns/iter (+/- 915) = 187 MB/s
test vec::bench::sort_random_small        ... bench:       183 ns/iter (+/- 50) = 218 MB/s
test vec::bench::sort_sorted              ... bench:    364977 ns/iter (+/- 63217) = 219 MB/s
```

in-place insertion (16 elem max run):

```
test vec::bench::sort_big_random_large    ... bench:   1447282 ns/iter (+/- 247736) = 220 MB/s
test vec::bench::sort_big_random_medium   ... bench:      8063 ns/iter (+/- 1136) = 396 MB/s
test vec::bench::sort_big_random_small    ... bench:       242 ns/iter (+/- 56) = 661 MB/s
test vec::bench::sort_big_sorted          ... bench:    580873 ns/iter (+/- 347205) = 550 MB/s
test vec::bench::sort_random_large        ... bench:    693611 ns/iter (+/- 130341) = 115 MB/s
test vec::bench::sort_random_medium       ... bench:      4461 ns/iter (+/- 1039) = 179 MB/s
test vec::bench::sort_random_small        ... bench:       180 ns/iter (+/- 36) = 222 MB/s
test vec::bench::sort_sorted              ... bench:    382591 ns/iter (+/- 80074) = 209 MB/s
```

in-place insertion (8 elem max run):

```
test vec::bench::sort_big_random_large    ... bench:   1453949 ns/iter (+/- 225381) = 219 MB/s
test vec::bench::sort_big_random_medium   ... bench:      7975 ns/iter (+/- 1034) = 401 MB/s
test vec::bench::sort_big_random_small    ... bench:       241 ns/iter (+/- 42) = 663 MB/s
test vec::bench::sort_big_sorted          ... bench:    607379 ns/iter (+/- 78551) = 526 MB/s
test vec::bench::sort_random_large        ... bench:    791019 ns/iter (+/- 178413) = 101 MB/s
test vec::bench::sort_random_medium       ... bench:      4426 ns/iter (+/- 519) = 180 MB/s
test vec::bench::sort_random_small        ... bench:       180 ns/iter (+/- 26) = 222 MB/s
test vec::bench::sort_sorted              ... bench:    419979 ns/iter (+/- 50808) = 190 MB/s
```

out-of-place insertion (32 elem max run):

```
test vec::bench::sort_big_random_large    ... bench:   1454360 ns/iter (+/- 276385) = 219 MB/s
test vec::bench::sort_big_random_medium   ... bench:      8259 ns/iter (+/- 1190) = 387 MB/s
test vec::bench::sort_big_random_small    ... bench:       242 ns/iter (+/- 72) = 661 MB/s
test vec::bench::sort_big_sorted          ... bench:    526373 ns/iter (+/- 168484) = 607 MB/s
test vec::bench::sort_random_large        ... bench:    692873 ns/iter (+/- 140006) = 115 MB/s
test vec::bench::sort_random_medium       ... bench:      4342 ns/iter (+/- 677) = 184 MB/s
test vec::bench::sort_random_small        ... bench:       177 ns/iter (+/- 35) = 225 MB/s
test vec::bench::sort_sorted              ... bench:    380630 ns/iter (+/- 66918) = 210 MB/s
```

out-of-place insertion (16 elem max run):

```
test vec::bench::sort_big_random_large    ... bench:   1393579 ns/iter (+/- 224143) = 229 MB/s
test vec::bench::sort_big_random_medium   ... bench:      7802 ns/iter (+/- 1219) = 410 MB/s
test vec::bench::sort_big_random_small    ... bench:       245 ns/iter (+/- 39) = 653 MB/s
test vec::bench::sort_big_sorted          ... bench:    584058 ns/iter (+/- 143346) = 547 MB/s
test vec::bench::sort_random_large        ... bench:    702402 ns/iter (+/- 178147) = 113 MB/s
test vec::bench::sort_random_medium       ... bench:      4500 ns/iter (+/- 900) = 177 MB/s
test vec::bench::sort_random_small        ... bench:       183 ns/iter (+/- 36) = 218 MB/s
test vec::bench::sort_sorted              ... bench:    413594 ns/iter (+/- 65994) = 193 MB/s
```

out-of-place insertion (8 elem max run):

```
test vec::bench::sort_big_random_large    ... bench:   1405837 ns/iter (+/- 285931) = 227 MB/s
test vec::bench::sort_big_random_medium   ... bench:      8434 ns/iter (+/- 2950) = 379 MB/s
test vec::bench::sort_big_random_small    ... bench:       251 ns/iter (+/- 60) = 637 MB/s
test vec::bench::sort_big_sorted          ... bench:    663738 ns/iter (+/- 109683) = 481 MB/s
test vec::bench::sort_random_large        ... bench:    752714 ns/iter (+/- 135182) = 106 MB/s
test vec::bench::sort_random_medium       ... bench:      5793 ns/iter (+/- 3496) = 138 MB/s
test vec::bench::sort_random_small        ... bench:       187 ns/iter (+/- 60) = 213 MB/s
test vec::bench::sort_sorted              ... bench:    430830 ns/iter (+/- 113507) = 185 MB/s
```

@huonw (Member) commented Feb 6, 2014

Ah, cool! Thanks for investigating.

It looks like, as you say: (a) there's a small benefit to out-of-place sorting for large types, but (b) there's a much larger benefit to using a smaller insertion sort threshold. I imagine we could have something like

```
static BASE_INSERTION: uint = 32;
static LARGE_INSERTION: uint = 16;

let insertion = if size_of::<T>() <= 16 { BASE_INSERTION } else { LARGE_INSERTION };
```

But I'm slightly nervous about including something platform-specific like this.
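In today's syntax the suggestion would read roughly as follows (a sketch only; `usize`/`const` replace the era's `uint`/`static`, and the 16-byte cutoff is the untested guess from the snippet above, not a tuned value):

```rust
use std::mem::size_of;

const BASE_INSERTION: usize = 32;
const LARGE_INSERTION: usize = 16;

// Use shorter insertion-sort runs for large element types: insertion sort
// does many more writes than merging, and each write copies
// size_of::<T>() bytes.
fn insertion_threshold<T>() -> usize {
    if size_of::<T>() <= 16 {
        BASE_INSERTION
    } else {
        LARGE_INSERTION
    }
}

fn main() {
    // u64 is 8 bytes -> long insertion runs are cheap.
    assert_eq!(insertion_threshold::<u64>(), 32);
    // The 32-byte tuple crosses the cutoff -> shorter runs.
    assert_eq!(insertion_threshold::<(u64, u64, u64, u64)>(), 16);
}
```

Since `size_of` is a compile-time constant per monomorphization, the branch costs nothing at runtime; the nervousness above is about the cutoff itself, which encodes one machine's cache and copy behavior.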

If you add the `_big` benchmarks to this PR, I'm happy to approve it, provided you file a bug along the lines of "tune vec.sort()'s insertion threshold" with the numbers you have above.

I'm also happy to approve any combination of out-of-place for longer vectors and naively "tuned" insertion runs as above. (I would like the bug filed no matter what you choose; and if you do decide to do the tuning, it should have a comment along the lines of "// FIXME # this adjustment seems to make sorting vectors of large elements a little faster on some platforms, but hasn't been tested/tuned extensively" (or something).)

@zkamsler (Contributor, Author) commented Feb 7, 2014

In for a penny, in for a pound. I opened #12092, added the _big tests, and switched to out-of-place insertion for large vectors with a varying threshold.

```
test vec::bench::sort_big_random_large    ... bench:   1389769 ns/iter (+/- 256837) = 230 MB/s
test vec::bench::sort_big_random_medium   ... bench:      7605 ns/iter (+/- 1051) = 420 MB/s
test vec::bench::sort_big_random_small    ... bench:       227 ns/iter (+/- 72) = 704 MB/s
test vec::bench::sort_big_sorted          ... bench:    585882 ns/iter (+/- 118334) = 545 MB/s
test vec::bench::sort_random_large        ... bench:    712451 ns/iter (+/- 119542) = 112 MB/s
test vec::bench::sort_random_medium       ... bench:      4578 ns/iter (+/- 868) = 174 MB/s
test vec::bench::sort_random_small        ... bench:       165 ns/iter (+/- 55) = 242 MB/s
test vec::bench::sort_sorted              ... bench:    379892 ns/iter (+/- 70648) = 210 MB/s
```

Added a separate in-place insertion sort for short vectors.
Increased the insertion-sort threshold from 8 to 32 elements
for small types and 16 for larger types. Added benchmarks
for sorting larger types.
@zkamsler (Contributor, Author) commented Feb 7, 2014

@huonw Removed the parentheses and added a comment.

bors added a commit that referenced this pull request Feb 7, 2014
@bors closed this Feb 7, 2014
@bors merged commit cebe5e8 into rust-lang:master Feb 7, 2014