Giant rewrite for Julia v0.5 #102
Conversation
Major Changes:

- Drop `Vector` storage for `Partials`. This reduces indirection in code that had to handle multiple container types. I have yet to see a case where `Tuple` storage wasn't faster due to GC overhead.
- Consolidate all `ForwardDiffNumber` types into the new type `DiffNumber`, which is structured like a `GradientNumber`. `HessianNumber` and `TensorNumber` have been replaced by the ability to nest `DiffNumber`s (see the sketch below). This allows for `Tuple` storage of higher-order partial components, and cuts out a lot of code.
- API functions have been replaced by macros to allow type-stable passage of constant kwargs. This includes some small breaking changes. For example, the target function is ALWAYS the first argument to the new API functions (to maintain consistency with other higher-order mutating functions in Base, like `map!`). Additionally, some kwarg names have changed.
- Experimental multithreading support can be enabled on some API functions by passing in `multithread = true`.
- The caching layer has been simplified, and is now thread-safe.
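For intuition, here's a minimal sketch of the nesting idea with made-up names (`MiniDual` is not the package's actual type); it shows why Tuple-backed dual numbers compose to give higher-order partials:

```julia
# A minimal sketch of the DiffNumber idea (hypothetical names, not
# ForwardDiff's actual types): a value plus N partial components in a Tuple.
immutable MiniDual{N,T<:Real} <: Real
    value::T
    partials::NTuple{N,T}
end

# Product rule: d(a*b) = a'*b + a*b'
import Base: *
*{N,T}(a::MiniDual{N,T}, b::MiniDual{N,T}) =
    MiniDual(a.value * b.value,
             ntuple(i -> a.partials[i]*b.value + a.value*b.partials[i], N))

# First-order seed for d/dx at x = 3.0:
x = MiniDual(3.0, (1.0,))
println((x * x).partials[1])  # 6.0, i.e. d(x^2)/dx at 3.0

# Nesting (e.g. MiniDual{1,MiniDual{1,Float64}}) carries second-order
# components; this is what replaces HessianNumber/TensorNumber.
```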
```diff
@@ -1,6 +1,5 @@
 language: julia
 julia:
-  - 0.4
```
Is this branch dropping support for 0.4? Why?
`Threads` doesn't exist in 0.4, so trying to load the package will fail. I guess I could wrap all the multithreading code in conditionals against `VERSION`.
Yes please, otherwise JuMP (and anyone who has JuMP installed) will be stuck on ForwardDiff 0.1 for a long time
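A sketch of the kind of `VERSION` conditional being discussed; the file names here are hypothetical:

```julia
# Hypothetical sketch: only load the threaded code path on Julia 0.5+,
# so that the package still loads on 0.4 (where Base.Threads doesn't exist).
if VERSION >= v"0.5-"
    import Base.Threads
    include("api/multithread.jl")   # hypothetical file with the threaded kernels
else
    include("api/singlethread.jl")  # hypothetical single-threaded fallbacks
end
```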
Cool stuff! I am impressed you managed to reduce the different number types down to one.

Some thoughts about the cache implementation. From previous experience, I have found that caching arrays in a `Dict` is in some cases just too slow. Especially now, with the removal of the vector gradient numbers, the vectors used will tend to be small, so the overhead will be significant.

One possible idea to improve performance of the cache without any impact on the API (sketched below): instead of having the cache hold arrays, we let it hold types that contain as many arrays as the problem needs (jacobian, gradient, etc.). The accesses to the dict would then only have to happen once per API call. This requires a bit of extra code to implement the different caches, but would significantly reduce the number of typeasserts and calls to the dict needed.

My second comment is that #99 still seems relevant, and with the macros it could now be implemented in a type-stable way. The hardest part would be to figure out a clean API for it.

Regarding SIMD, there has been some discussion about it starting here: eschnett/SIMD.jl#1 (comment). With JuliaLang/julia#14976, I am willing to put in work to implement this, but it would be good to have some discussions with others to know if this is something more people support.
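Here's a rough sketch of that cache idea (all names hypothetical, not taken from either branch): the `Dict` is hit once per call, and a single typeassert recovers a concrete container holding every array the problem needs:

```julia
# Hypothetical sketch of the "typed container" cache: one Dict access and
# one typeassert per API call, instead of one per cached array.
immutable GradientWorkspace{T}
    duals::Vector{T}   # seeded dual-number input vector
    output::Vector{T}  # preallocated gradient storage
end

const WORKSPACES = Dict{Tuple{DataType,Int},Any}()

function fetch_workspace{T}(::Type{T}, len::Int)
    key = (T, len)
    if haskey(WORKSPACES, key)
        return WORKSPACES[key]::GradientWorkspace{T}  # the single typeassert
    else
        ws = GradientWorkspace{T}(Vector{T}(len), Vector{T}(len))
        WORKSPACES[key] = ws
        return ws
    end
end
```

This keeps the dict overhead fixed no matter how many arrays a given mode needs.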
I implemented the first idea in KristofferC@1a65099. Benchmarks with:

```julia
function rosenbrock(x)
    a = one(eltype(x))
    b = 100 * a
    result = zero(eltype(x))
    for i in 1:length(x)-1
        result += (a - x[i])^2 + b*(x[i+1] - x[i]^2)^2
    end
    return result
end
```

That commit:

```julia
julia> @time for i = 1:10^5 g(rand(5)) end
  0.166368 seconds (900.00 k allocations: 39.673 MB, 5.06% gc time)

julia> @time for i = 1:10^5 g(rand(20)) end
  0.292666 seconds (900.00 k allocations: 61.035 MB, 2.32% gc time)

julia> @time for i = 1:10^5 g(rand(35)) end
  0.624599 seconds (900.00 k allocations: 94.604 MB, 2.74% gc time)
```

This branch:

```julia
julia> @time for i = 1:10^5 g(rand(5)) end
  0.352903 seconds (800.00 k allocations: 35.095 MB, 1.12% gc time)

julia> @time for i = 1:10^5 g(rand(20)) end
  0.508113 seconds (800.00 k allocations: 56.458 MB, 1.23% gc time)

julia> @time for i = 1:10^5 g(rand(35)) end
  0.986225 seconds (800.00 k allocations: 90.027 MB, 1.27% gc time)
```

Didn't test it with multithreading yet. It currently wastes a bit of space (allocating for all threads even if running single-threaded, and allocating the remainder chunk even if it is not needed).
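The per-thread allocation being described might look roughly like this (a hypothetical sketch; `make_buffers` and `local_buffer` are made-up names). Allocating `nthreads()` buffers up front is the space waste mentioned above:

```julia
# Hypothetical sketch: one work buffer per thread, indexed by threadid(),
# so threads never contend. Note that buffers for all threads are
# allocated even when running single-threaded.
immutable ThreadedBuffers{T}
    buffers::Vector{Vector{T}}
end

make_buffers{T}(::Type{T}, len::Int) =
    ThreadedBuffers{T}([Vector{T}(len) for _ in 1:Base.Threads.nthreads()])

# Each thread only touches its own slot, so no locking is required:
local_buffer(tb::ThreadedBuffers) = tb.buffers[Base.Threads.threadid()]
```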
I liked your idea in #99; I was actually going to eventually use that idea here. Your approach for separating individual threads' cache data also saves us from having to lock the cache on every access.

If you want to implement your cache idea as a follow-up PR to this one, that would be awesome (otherwise, I'll incorporate your ideas later).
Makes sense. When I started this refactor, I tried rewriting
```julia
if N == L
    body = VEC_MODE_EXPR
else
    nthreads = Threads.nthreads()
```
These need to either be qualified as `Base.Threads`, or a `using Base.Threads` is needed, I think.
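Concretely, a sketch of the two options:

```julia
# Option 1: qualify every reference explicitly.
n = Base.Threads.nthreads()

# Option 2: pull the exported names into scope once, then use them bare.
using Base.Threads
n = nthreads()
```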
Thank you for your comments. I have a few busy days coming up, but when I get time I will try to flesh out the cache stuff, unless you get to it before me. I'll open a PR against this branch when something is sort of ready.

I thought a bit about benchmarking as well. Since I am likely to do some benchmarking during the cache work, would it make sense to implement those benchmarks on top of

As a side note, I thought I would mention that, to me, the code is a bit harder to read and reason about now than before. There are a few places like

```julia
@generated function call_gradient!{S,N,L,M}(f, output::Vector{S}, x::Vector, ::Type{Val{N}}, ::Type{Val{L}}, ::Type{Val{M}})
```

where I am having trouble remembering which length or type each parameter refers to.
Regarding a global cache: we could either document the
Yeah, there are a lot of mysterious single-letter identifiers at the moment. I'll change the type parameter names so that they reflect their corresponding keywords, which should make it slightly less confusing.
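Something like the following is what's meant; a hypothetical renaming, not the final code:

```julia
# Hypothetical sketch: the same signature with type parameters named after
# the keywords they encode, instead of single letters like N, L, and M.
@generated function call_gradient!{S,CHUNK,INPUTLEN,MULTITHREAD}(
        f, output::Vector{S}, x::Vector,
        ::Type{Val{CHUNK}}, ::Type{Val{INPUTLEN}}, ::Type{Val{MULTITHREAD}})
    # generator body unchanged; elided here
    return :(nothing)
end
```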
Transparent documentation and a

I'm just going to save any additional cache refactoring for when the rest of this PR is complete, which will probably take some time, given my current schedule. If #105 is ready by that point, I'll just defer to that PR. Either way, you're gathering some valuable information on where the pain points are, and what the right way to solve them might be. Thanks for looking into this!
#105 is shaping up OK. What is left is adding the functionality to explicitly pass a created cache as a keyword argument to the macros. What do you think about removing the option to set the chunk size? It would make some stuff simpler, because then the chunk size is simply a function of the input length.
I've thought about it, but we'd have to be really confident in our chunk-picking heuristic. The right chunk size for any given problem depends not only on the dimension of the input, but on the performance characteristics of the target function itself. If we forced the user to defer to our heuristic with no option to override it, I'd want to make sure that the dependence on the target function is near negligible in the general case.
Given that we have no good sense of how to choose a good chunk size, I think we should leave this as an option.
OK, makes sense!
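For reference, a chunk-size heuristic that's purely a function of input length, as proposed above, might look like this (a hypothetical sketch; the cutoff is made up):

```julia
# Hypothetical heuristic: chunk size as a pure function of input length.
# The catch discussed above: the best chunk size also depends on the
# target function itself, which a length-only rule can't see.
function pick_chunk(input_length::Int, max_chunk::Int = 10)
    input_length <= max_chunk && return input_length
    # prefer an even divisor so no remainder chunk is needed
    for c in max_chunk:-1:1
        input_length % c == 0 && return c
    end
    return max_chunk
end
```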
The segfault in the tests on 0.5 goes away for me with

Edit: It goes away when doing many things
Yeah, it's a finicky one. When I added these tests, I played around for a little bit to try to come up with a minimal repro of the segfault, to no avail. Part of the reason I pushed was to see if anybody else could figure it out 😛

I would guess this bug is similar to JuliaLang/julia#15043, but I'm not totally sure how it actually behaves yet. IIRC, I found that substituting uses of
Maybe it is worth making an issue upstream? I played around a bit with trying to reduce it, and if I removed some tests I stopped getting segfaults and instead got some GC corruption crash. It would be interesting to go back before
Everything here is now done, besides getting tests to pass on Travis - I have no idea what's going on there. If anybody is curious about benchmarks, here's the difference between some benchmarks run on master and this branch on Julia v0.4 (the data is located in the benchmarks folder so you can play with it yourself):

```julia
julia> using JLD, BenchmarkTools

julia> g41 = JLD.load("benchmark/results_julia_v04_forwarddiff_v01.jld", "results");

julia> g42 = JLD.load("benchmark/results_julia_v04_forwarddiff_v02.jld", "results");

julia> compare(a, b, k) = judge(minimum(a[k]), minimum(b[k]))
compare (generic function with 1 method)

julia> compare(g42, g41, "derivative")
6-element BenchmarkTools.BenchmarkGroup:
  tags: []
  "deriv_test_6" => TrialJudgement(-55.51% => improvement)
  "deriv_test_4" => TrialJudgement(-95.39% => improvement)
  "deriv_test_5" => TrialJudgement(-96.40% => improvement)
  "deriv_test_2" => TrialJudgement(-94.97% => improvement)
  "deriv_test_1" => TrialJudgement(-88.74% => improvement)
  "deriv_test_3" => TrialJudgement(-89.70% => improvement)

julia> compare(g42, g41, "gradient")
3-element BenchmarkTools.BenchmarkGroup:
  tags: []
  "rosenbrock" => TrialJudgement(-21.89% => improvement)
  "ackley" => TrialJudgement(+0.11% => invariant)
  "self_weighted_logit" => TrialJudgement(-36.30% => improvement)

julia> compare(g42, g41, "hessian")
3-element BenchmarkTools.BenchmarkGroup:
  tags: []
  "rosenbrock" => TrialJudgement(-81.98% => improvement)
  "ackley" => TrialJudgement(-80.44% => improvement)
  "self_weighted_logit" => TrialJudgement(-78.50% => improvement)

julia> compare(g42, g41, "jacobian")
3-element BenchmarkTools.BenchmarkGroup:
  tags: []
  "trigonometric" => TrialJudgement(-10.58% => improvement)
  "chebyquad" => TrialJudgement(-10.07% => improvement)
  "brown_almost_linear" => TrialJudgement(-9.19% => improvement)
```
That's what you get for using case-insensitive file systems ;) The file is called
```julia
        return ($F){N,R,switch_eltype(C, R)}
    end
end

if v"0.4" <= VERSION < v"0.5-"
```
please, please, please be more specific with `VERSION`-dependent code like this
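One common way to be more specific at the time was to pin the check to the dev build where the relevant change landed (the build number below is made up for illustration):

```julia
# Hypothetical sketch: bound the check by the specific development build
# that changed the behavior, instead of the entire v"0.5-" prerelease range.
if v"0.4" <= VERSION < v"0.5.0-dev+2280"   # build number made up
    # code for the old behavior
else
    # code for the new behavior
end
```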
There's been a bunch of discussion between ForwardDiff developers about where this package should go, and with v0.5 on the way, I think now is the right time to start implementing some of these changes. A lot of these changes are breaking and depend on each other, so it's unfortunately going to be one monolithic PR if we want to avoid intermittent breakage.

After this PR, it will be time to upgrade ForwardDiff's version from `v0.1.x` to `v0.2.x`. We can maintain a `release-0.1` branch for backporting purposes.

This PR is only about 2/3 of the way complete, but aims to fix #36, #88, #92, #94, #96. It paves the way for us to address #83, #89, and #98.

Major Changes:

- Drop `Vector` storage for epsilon components. This reduces indirection in `Partials` code that had to handle multiple container types. I have yet to see a case where `Tuple` storage wasn't faster due to GC overhead.
- Consolidate all `ForwardDiffNumber` types into the new type `Dual{N,T}`, which is structured similarly to the former `GradientNumber`. `HessianNumber` and `TensorNumber` have been replaced by the ability to nest `Dual`s. This allows for `Tuple` storage of higher-order partial components, cuts out a lot of code, and should allow for more cache-friendly higher-order API methods (since indexing patterns will be more straightforward).
- `@generated` functions have been removed, and the API has been simplified by introducing the `Chunk` immutable and allowing subtypes of `ForwardDiffResults` to be passed as mutable arguments to the API functions. The documentation will explain this in more detail, once I write it.
- API functions that returned closures are gone; it's easy enough to write `j(x) = ForwardDiff.jacobian(f, x)` yourself (see the sketch after this list).
- `tensor`/`tensor!` has been removed, but is still easily implementable by users.
- Experimental multithreading support can be enabled by passing `multithread = true`. I'm getting a 2x speed-up vs. the single-threaded implementation when using 4 threads to take the gradient of `rosenbrock` with large (> 10,000 elements) input vectors. I haven't benchmarked enough to know how this scales, though I'd expect it to asymptote to N-times speed-up for N threads as the input length goes to infinity.

TODO:

- `jacobian`/`jacobian!`
- `hessian`/`hessian!`
- `v0.1.x` to `v0.2.x`
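Based on the description above, usage of the reworked API might look roughly like this (a sketch only; names like `Chunk` come from the bullets above, but the exact signatures were still in flux at this point):

```julia
# Hypothetical sketch of the reworked v0.2-style API described above.
using ForwardDiff

f(x) = sum(x .* x)
x = rand(10)

# Closure-returning methods are gone; defining one yourself is trivial:
g(x) = ForwardDiff.gradient(f, x)

# Chunk size selected via the Chunk immutable:
grad = ForwardDiff.gradient(f, x, ForwardDiff.Chunk{5}())

# Preallocated output passed as a mutable argument:
out = similar(x)
ForwardDiff.gradient!(out, f, x)
```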