The Purpose and Goals of Flux.jl #1734
Thanks Chris for starting this conversation. Leaving some initial thoughts here that I will update given more time/ideas:
We should strive for good out-of-the-box performance and composability with the rest of the Julia ecosystem. The goal is not to compete with other ML packages, but to do more, do better, or do different things. For example, you can be on par with PyTorch by PyCalling it, so parity by itself is not a goal - but if we need to pull in a few kernels from elsewhere while we work on our native Julia kernels, we should definitely do it. People come to Julia for performance, after all!
Absolutely, and let me take this opportunity to double down on how important TTFG (time to first gradient) is. We're already within ±50% of PyTorch for pretty much any random model you can find on Torch Hub, and with Dhairya and Julian's work on distributed training that should extend beyond a single process/GPU (more dynamic architectures like neural ODEs, AIUI, range from very competitive to a complete bloodbath). However, the sleight of hand is that I'm only counting the post-warmup time for most models. It should not take 50+ seconds to do a single training step on a simple MLP. Heck, it probably shouldn't even take 5 seconds, but at least at that range you might beat out XLA!
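For concreteness, "TTFG" here is the cost of the very first gradient call, compilation included, as opposed to the warm per-step time that benchmarks usually quote. A minimal sketch (the model and sizes are purely illustrative):

```julia
using Flux

model = Chain(Dense(784, 64, relu), Dense(64, 10))
x = rand(Float32, 784, 32)
y = Flux.onehotbatch(rand(0:9, 32), 0:9)
ps = Flux.params(model)
loss() = Flux.Losses.logitcrossentropy(model(x), y)

@time Flux.gradient(loss, ps)   # first call: dominated by compilation (the "time to first gradient")
@time Flux.gradient(loss, ps)   # warm call: the post-warmup number usually reported
```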
We should review the Flux ecosystem for anything we can do to improve compile latency - but otherwise that remains an ongoing task for the compiler folks.
It's great that we are having this discussion. Here are my thoughts on the questions.
Some of this should be reflected on the website and in the docs.
I'm very much aligned with what Brian and Kyle said. Adding a few, partially overlapping thoughts.
We have a vision of being flexible and scalable while optimising hot paths. We take on complexity where we need it - our AD is one of the most complex out there, if I may say so myself, and our forward mode is also written with SIMD in mind. We focused on PL-like use cases, so a lot of our design philosophy was written around a natural-feeling API that can be extended and stays close to the language semantics. That definitely included some quirks (broadcasting over different dimensions comes to mind), but those are hard to get rid of entirely. This is our approach to differentiable programming, and it was reflected in an old post (https://julialang.org/blog/2017/12/ml-pl/) which seems especially relevant today.

We stayed generic to give ourselves the best chance of getting new use cases to work (tracked arrays, CuArrays, images, dual numbers etc. also helped, since we needed to work with them), and that paid massive dividends over time. In most ML cases we have to optimise for GPUs for large networks, and now increasingly for smaller networks for SciML. We are already competitive on CPUs (compared to other frameworks) but can of course do more with the likes of Octavian.jl. For GPUs we follow the same cuDNN paths as other frameworks. Our philosophy leaves space for accommodating specially optimised dispatches, since we have the core infrastructure in place - remember when we explored TVM as an intermediate pass?

So I think composability and being Julian are key to hitting the sweet spot between performance and flexibility. Performance follows from good design, and we actively optimise hot paths on an ongoing basis - we want a fast library, after all. What that good design looks like can be subjective, but we have made great headway with our approach so far, and if there are specific performance considerations we need to address, we will do that.
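To make the "staying generic" point concrete, here is a small hedged sketch (not Flux internals, just an illustration): because layers are plain generic Julia code, the same `Dense` layer runs on ordinary arrays, on dual numbers for forward-mode AD, and - with CUDA.jl loaded - on the GPU, purely via multiple dispatch.

```julia
using Flux, ForwardDiff

d = Dense(3, 2, tanh)                          # forward pass is just σ.(W*x .+ b)

d(rand(Float32, 3))                            # ordinary CPU arrays
d(ForwardDiff.Dual.(rand(Float32, 3), 1f0))    # dual numbers flow straight through the same code
# using CUDA; d_gpu = fmap(cu, d); d_gpu(cu(rand(Float32, 3)))   # same code on the GPU
```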
I need to find the time to finish up some basic layers and functionality in NNlibCPU.jl, and also to add some more specialized support to DiffEqFlux.jl. For now, I've been using/experimenting a bit in SimpleChains.jl, adding functionality as I need it. Example:

```julia
using SimpleChains, DiffEqFlux, Flux
using Zygote  # Zygote.gradient is called explicitly below
x = rand(Float32, 24, 199); y = rand(Float32, 2, 199);
chn = Chain(Dense(24, 24, tanh), Dense(24, 2));
opt_chn = Flux.Optimiser(WeightDecay(1f-4), ADAM())
chn_loss = let m = chn
    (x, y) -> Flux.Losses.mse(m(x), y)
end
@time foreach(_ -> Flux.train!(chn_loss, Flux.params(chn), ((x,y),), opt_chn), 1:10_000) # Flux
@time foreach(_ -> Flux.train!(chn_loss, Flux.params(chn), ((x,y),), opt_chn), 1:10_000) # Flux
chn_fast = FastChain(FastDense(24, 24, tanh), FastDense(24, 2));
function train!(chn::FastChain, X, Y, vparam = DiffEqFlux.initial_params(chn); iters = 1_000, opt = Flux.Optimiser(WeightDecay(1f-4), ADAM()))
    f = let X = X, Y = Y
        p -> Flux.mse(chn(X, p), Y)
    end
    for _ ∈ 1:iters
        g = Zygote.gradient(f, vparam)[1]
        Flux.Optimise.update!(opt, vparam, g)
    end
    vparam
end
vp_fc = @time train!(chn_fast, x, y);
@time train!(chn_fast, x, y, vp_fc);
chn_simple = SimpleChain(TurboDense(tanh, (static(24), static(24))), TurboDense(identity, (static(24), static(2))));
function train!(_chn::SimpleChain, X, Y, vparam = SimpleChains.init_params(_chn), g = similar(vparam); iters = 1_000, opt = ADAM(), λ = 1f-4)
    chn = L2Penalty(SimpleChains.add_loss(_chn, SquaredLoss(Y)), λ * length(Y))  # use the λ keyword rather than a hard-coded 1f-4
    for _ ∈ 1:iters
        valgrad!(g, chn, X, vparam)  # writes the gradient into g (and returns the loss)
        Flux.Optimise.update!(opt, vparam, g)
    end
    vparam
end
vp_simple = SimpleChains.init_params(chn_simple); g = similar(vp_simple);
@time train!(chn_simple, x, y, vp_simple, g, iters = 10_000); # SimpleChains
@time train!(chn_simple, x, y, vp_simple, g, iters = 10_000); # SimpleChains
```

I get:

```
julia> @time foreach(_ -> Flux.train!(chn_loss, Flux.params(chn), ((x,y),), opt_chn), 1:10_000) # Flux
1.448004 seconds (3.51 M allocations: 1.438 GiB, 6.94% gc time, 0.66% compilation time)
julia> @time foreach(_ -> Flux.train!(chn_loss, Flux.params(chn), ((x,y),), opt_chn), 1:10_000) # Flux
1.452637 seconds (3.51 M allocations: 1.438 GiB, 7.79% gc time, 0.65% compilation time)
julia> @time train!(chn_fast, x, y, vp_fc, iters = 10_000); # DiffEqFlux
1.545734 seconds (3.89 M allocations: 1.841 GiB, 8.54% gc time)
julia> @time train!(chn_fast, x, y, vp_fc, iters = 10_000); # DiffEqFlux
1.557841 seconds (3.89 M allocations: 1.841 GiB, 9.47% gc time)
julia> @time train!(chn_simple, x, y, vp_simple, g, iters = 10_000); # SimpleChains
0.314526 seconds (10.01 k allocations: 318.391 KiB)
julia> @time train!(chn_simple, x, y, vp_simple, g, iters = 10_000); # SimpleChains
0.228624 seconds (10.01 k allocations: 318.391 KiB)
```

So for problems like this, a 5x improvement over Flux is achievable.
I think reducing the burden on Zygote should help. SimpleChains' initial compile times aren't great though, because LoopVectorization's compile times are still bad.
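One way of "reducing the burden on Zygote" is hand-written rules, so the AD never has to trace through a kernel's internals. A minimal sketch using ChainRulesCore (the function and rule here are purely illustrative, not anything in Flux or NNlib):

```julia
using ChainRulesCore

# A toy activation; in practice this would be an optimised kernel that
# Zygote would struggle to differentiate efficiently on its own.
myrelu(x::Real) = max(x, zero(x))

function ChainRulesCore.rrule(::typeof(myrelu), x::Real)
    y = myrelu(x)
    myrelu_pullback(ȳ) = (NoTangent(), x > zero(x) ? ȳ : zero(ȳ))
    return y, myrelu_pullback
end
```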
Is there a separate issue for tracking compile times?
I had a few conversations with contributors, and one of the things I would like to sum up from them is this idea of a triangle: simplicity, performance, and flexibility. In general you can pick two, and it will come at the detriment of the third. Of course, you can build things well so that the third isn't completely sacrificed, but there will still be an undeniable engineering trade-off. @chriselrod brings up a very good point that simplicity + flexibility can still give "performance" sometimes, but only under a narrow definition of performance that does not include compile times. At the end of the day, expanding our

From these discussions though, it seems like "simplicity" is not the major focus of Flux anymore. But it's what the docs open with:
I think a PR updating the philosophy and promises of Flux that the reader first sees would be a good start. Maybe here's some starter text given the discussions I've had:
Octavian 😅
So I'll be mostly quiet, because my contributions to Flux have been mostly indirect and minimal, but you know me :) - I give unsolicited user feedback at times... What I think people want/expect/need out of Flux is pretty simple.
There hasn't been a stable enough DL library that supports CPU & GPU in the Julia ecosystem since I've been around (Julia 0.6). Knet was OK back in the day, but it was also slow, inflexible, etc.; it might be better now. I take that back - a few weeks before Zygote was introduced, Flux was pretty solid. I used one of those versions exclusively for about a year or so because it worked so well (except for 1D convs and a few other things I had to tweak).

That's problematic for end users. In my experience (today included), I've started a project in Flux and then realized: "wow, I can't ___ without getting an error I don't have time to solve right now. A few versions ago I'm pretty sure I could do this... I'll go back to Torch, because rewriting this loading job and model topology will only take 20 min."

I guess what I am saying is: the syntax Flux allows for is amazing. Personally, I'd take some performance cost at v1.0 for more stability and the same flexibility.
Since we merged #1736, I will close this one, but let the discussion continue. It might be better for the discussion to be on Discourse, though.
Given that so much has happened with Flux, I think it might be a good time to take a step back and reevaluate what the goals of Flux.jl and the FluxML project are. Since the maintainers running the project are not its founders, it would be good for everyone to be on the same page as to "what is Flux?" and "why is Flux?". I think there have been somewhat different opinions on these questions, which leads to different technical directions. But as @oxinabox has said, technical problems are usually social problems.

So there are a few questions I wanted to put out there, now that we have had years to really evaluate what the answers would mean.
One of them is about the "`Dense` is exactly how you would implement it!" kinds of things: "You could have written Flux", "Doing the obvious thing". Why? For most people, `Dense` on CPU would be faster if it used https://github.com/JuliaLinearAlgebra/Octavian.jl. Would fancy code making it use an alternative JuliaBLAS be against the goals of being a legible teaching repository, or is Flux.jl a production machine learning code base that would accept a speed boost that isn't the obvious thing a standard user would have done? (A sketch of what such a non-obvious `Dense` might look like follows at the end of this post.) If @chriselrod added a bunch of `rrule`s that reduce the reliance on Zygote/Diffractor to optimize things away, is that what the Flux.jl project wants? I think the push and pull of this question is one of the big ones that has been a source of disagreements.

Lastly, I just want to politely ask that these really are questions for the maintainers. I can see a lot of users jumping in here and saying "I think Flux should be X", but if you're not actively trying to work on Flux and the greater ML ecosystem to be X then 🤷 that's really just derailing the conversation. Users want everything, everyone knows that. But I think it's good for people working on and around Flux to have a good sense of what everyone else is putting their time in for. If we can understand why people are donating their time then we can all have a better appreciation of each other's work.
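For concreteness, here is a hedged sketch of the kind of "non-obvious" implementation being asked about - a Dense-like layer whose CPU forward pass calls Octavian's pure-Julia `matmul!` instead of the generic `W * x`. The name `OctavianDense` and its structure are purely illustrative, not an actual Flux proposal:

```julia
using Flux, Octavian

struct OctavianDense{M<:AbstractMatrix, V<:AbstractVector, F}
    W::M
    b::V
    σ::F
end
Flux.@functor OctavianDense

function (d::OctavianDense)(x::AbstractMatrix{Float32})
    y = similar(x, size(d.W, 1), size(x, 2))
    Octavian.matmul!(y, d.W, x)    # pure-Julia gemm in place of the BLAS call
    # NB: a real version would also need an rrule, since Zygote cannot
    # differentiate through the in-place matmul!.
    return d.σ.(y .+ d.b)
end

layer = OctavianDense(randn(Float32, 2, 24), zeros(Float32, 2), identity)
layer(rand(Float32, 24, 16))       # same call syntax as Flux's Dense
```

Whether that kind of code belongs in Flux itself is exactly the teaching-repository vs. production-library question above.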