Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Faster view creation #19259

Merged
merged 4 commits into from
Jan 5, 2017
Merged

Faster view creation #19259

merged 4 commits into from
Jan 5, 2017

Conversation

mbauman
Copy link
Member

@mbauman mbauman commented Nov 8, 2016

This is a series of performance patches that I wrote a while ago, but never had a chance to fully vet their performance impact. At a minimum, this fixes #19257. We really need better benchmarks for subarray creation in nanosoldier; I don't think SubArray creation is really tested there.

The fourth commit is the most micro- of micro-optimizations, and it might be a little too cute… but I think there's good reason to do it beyond the micro-optimization. See the commit message for more details.

@mbauman
Copy link
Member Author

mbauman commented Nov 8, 2016

May as well see what the soldier reports in any case: @nanosoldier runbenchmarks("ALL", vs = ":master")

@nanosoldier
Copy link
Collaborator

Your benchmark job has completed, but no benchmarks were actually executed. Perhaps your tag predicate contains mispelled tags? cc @jrevels

@jrevels
Copy link
Member

jrevels commented Nov 8, 2016

(ALL is a keyword for run all benchmarks, "ALL" means run all benchmarks tagged "ALL")
@nanosoldier runbenchmarks(ALL, vs = ":master")

@nanosoldier
Copy link
Collaborator

Your benchmark job has completed - possible performance regressions were detected. A full report can be found here. cc @jrevels

@timholy
Copy link
Member

timholy commented Nov 9, 2016

Nice! I like the direction, good catch. What kind of speedups are you seeing?

Aside from the test failures, might want to look into the lu regression. The rest I bet are spurious.

@KristofferC KristofferC added potential benchmark Could make a good benchmark in BaseBenchmarks performance Must go faster labels Nov 9, 2016
parent::P
indexes::I
offset1::L # for linear indexing and pointer, only stored when LinearFast
stride1::L # for linear indexing, only stored when LinearFast
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

tabs -> spaces

@mbauman
Copy link
Member Author

mbauman commented Nov 10, 2016

I didn't push this sooner since I had a hard time constructing benchmarks that actually demonstrated that this was an improvement. The inlining changes were purely based upon simplifications to @code_typed/llvm, but I had a hard time measuring an impact there.

The reason I pushed it when I did was because I didn't want others to duplicate this work as they investigated #19257. That result seems to have been somewhat spurious, but we need more benchmarks here in any case. I'm not sure when I'll have time to test this further.

@StefanKarpinski
Copy link
Member

So what's your call, @mbauman, merge or not? If you don't think it's a regression, I'd be in favor.

@mbauman
Copy link
Member Author

mbauman commented Nov 11, 2016

Lemme drop the last commit here and propose it separately since it's breaking.

@mbauman
Copy link
Member Author

mbauman commented Jan 2, 2017

I'd like to merge this for 0.6, but it'd be good to wait until JuliaCI/BaseBenchmarks.jl#54 makes it onto Nanosoldier for a final perf check.

@mbauman mbauman added this to the 0.6.0 milestone Jan 2, 2017
compute_offset1(parent, stride1::Integer, dims::Tuple{Int}, inds::Tuple{Colon}, I::Tuple) = compute_linindex(parent, I) - stride1*first(indices(parent, dims[1])) # index-preserving case
compute_offset1(parent, stride1::Integer, dims, inds, I::Tuple) = compute_linindex(parent, I) - stride1 # linear indexing starts with 1
compute_offset1(parent, stride1::Integer, dims::Tuple{Int}, inds::Tuple{Colon}, I::Tuple) = (@_inline_meta; compute_linindex(parent, I) - stride1*first(indices(parent, dims[1]))) # index-preserving case
compute_offset1(parent, stride1::Integer, dims, inds, I::Tuple) = (@_inline_meta; compute_linindex(parent, I) - stride1) # linear indexing starts with 1
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

bit overly long lines here, wrap?

@mbauman
Copy link
Member Author

mbauman commented Jan 3, 2017

@nanosoldier runbenchmarks(ALL, vs = ":master")

@jrevels
Copy link
Member

jrevels commented Jan 3, 2017

Had to kick nanosoldier, retriggering:

@nanosoldier runbenchmarks(ALL, vs = ":master")

@nanosoldier
Copy link
Collaborator

Your benchmark job has completed - possible performance regressions were detected. A full report can be found here. cc @jrevels

@mbauman
Copy link
Member Author

mbauman commented Jan 3, 2017

Hm, I cannot reproduce any of the largest regressions… in fact "sumeach_view","linspace(1.0,2.0,10000)" is nearly 2x faster with this branch on my computer (instead of 2x slower). It is pretty old, though, and doesn't have the same fancy SIMD registers that nanosoldier has. But there are many more performance gains — including a few constant-folds, which is pretty cool.

Since I can't reproduce the perf issues, I think this is as good as I can make it.

@KristofferC
Copy link
Member

KristofferC commented Jan 4, 2017

As another data point, on my laptop I'm getting a ~3x slowdown for the benchmark in the comment above. The LLVM codes are here: https://gist.github.com/KristofferC/1cf9e09d97289b521f494c9c68958043.

Edit: Removed some confusion...

@mbauman
Copy link
Member Author

mbauman commented Jan 4, 2017 via email

@KristofferC
Copy link
Member

I get 115us on PR and ~45us on master. On 0.5 it is around 115us as well. This PR improves much more than it regresses though so while it is sometimes good to be greedy, perhaps this can just be merged?

@StefanKarpinski
Copy link
Member

The failure is the current ongoing 32-bit Linux problem. I'm with @KristofferC on this.

@JeffBezanson
Copy link
Member

Sounds good to me too.

@KristofferC
Copy link
Member

@yuyichao postulated that the slowdown is due to LLVM generating bad native code on newer architectures causing a partial register stall when converting an integer to double. I confirmed this by recompiling the sysimg for x86-64 which made the relevant benchmark almost 3x faster. (The same problem appears for the pi_sum benchmark at

function pisum()
which is 2x slower when compiling for native architecture).

@vchuravy
Copy link
Member

vchuravy commented Jan 5, 2017

In that light I would say we should merge this and tackle the code generation issue later.

@KristofferC KristofferC merged commit 20b704a into master Jan 5, 2017
@tkelman tkelman deleted the mb/fastersubcreation branch January 5, 2017 14:48
@mbauman mbauman mentioned this pull request Jan 12, 2017
@timholy
Copy link
Member

timholy commented Mar 5, 2017

I wonder if this has regressed?

master:

julia> A = rand(1000,1000,1);

julia> @benchmark view(A, :, :, 1) seconds=1
BenchmarkTools.Trial: 
  memory estimate:  880 bytes
  allocs estimate:  40
  --------------
  minimum time:     14.049 μs (0.00% GC)
  median time:      14.456 μs (0.00% GC)
  mean time:        14.703 μs (0.00% GC)
  maximum time:     64.914 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1
  time tolerance:   5.00%
  memory tolerance: 1.00%

julia-0.5:

julia> A = rand(1000,1000,1);

julia> @benchmark view(A, :, :, 1) seconds=1
BenchmarkTools.Trial: 
  memory estimate:  48 bytes
  allocs estimate:  1
  --------------
  minimum time:     27.119 ns (0.00% GC)
  median time:      28.238 ns (0.00% GC)
  mean time:        35.656 ns (6.56% GC)
  maximum time:     1.478 μs (95.10% GC)
  --------------
  samples:          10000
  evals/sample:     995
  time tolerance:   5.00%
  memory tolerance: 1.00%

(I'd fix this myself except I'm up to my neck in another project right now.)

@KristofferC
Copy link
Member

I get:

julia> A = rand(1000,1000,1);

julia> @btime view($A, :, :, 1)
 10.903 ns (1 allocation: 64 bytes)

@KristofferC
Copy link
Member

Looks like the missed interpolation of $A was the problem

@timholy
Copy link
Member

timholy commented Mar 5, 2017

You are so right. Newbie error 😄. With it, they're the same speed on 0.5 and 0.6.

@Sacha0 Sacha0 added deprecation This change introduces or involves a deprecation needs news A NEWS entry is required for this change labels May 14, 2017
@tkelman tkelman removed the needs news A NEWS entry is required for this change label May 16, 2017
tkelman pushed a commit that referenced this pull request May 16, 2017
@KristofferC KristofferC removed the potential benchmark Could make a good benchmark in BaseBenchmarks label Oct 8, 2018
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
deprecation This change introduces or involves a deprecation performance Must go faster
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Creating a view (SubArray) of a Vector allocates memory