The Purpose and Goals of Flux.jl #1734

Closed
ChrisRackauckas opened this issue Oct 4, 2021 · 13 comments

@ChrisRackauckas (Member)

Given how much has happened with Flux, I think it might be a good time to take a step back and reevaluate what the goals of Flux.jl and the FluxML project are. Since the maintainers now running the project are not its founders, it would be good for everyone to be on the same page as to "what is Flux?" and "why is Flux?". I think there have been somewhat different opinions on these questions, which leads to different technical directions. But as @oxinabox has said, technical problems are usually social problems.

So there are a few questions I wanted to put out there, now that we have had years to really evaluate what the answers would mean.

  1. What is the purpose of Flux.jl being pure Julia? Is it good marketing, or is it something more? If that something more is the standard answer in Julia (support for non-standard types), then I think in some respects the technical approaches have not aligned with this goal; for example, convolution layers are sensitive to dual numbers (Steady State GPU Training Hangs and Quits, SciML/DiffEqFlux.jl#567). Is this something that is a major priority of the Flux project? Or is it moving the other way: would incorporating more of Torch.jl into Flux.jl be a goal of the Flux project?
  2. Is the goal of Flux.jl to be a teaching and research repository or a production ML library? Early presentations of Flux.jl touted how simple the code of Flux.jl is: "Dense is exactly how you would implement it!" kinds of things, "You could have written Flux", "Doing the obvious thing". Why? For most people, Dense on CPU would be faster if it used https://github.com/JuliaLinearAlgebra/Octavian.jl . Would fancy code making it use an alternative JuliaBLAS be against the goals of being a legible teaching repository, or is Flux.jl a production machine learning code base that would accept a speed boost that isn't the obvious thing a standard user would have done? If @chriselrod added a bunch of rrules that reduce the amount of reliance on Zygote/Diffractor to optimize things away, is that what the Flux.jl project wants? I think the push and pull of this question is one of the big ones that has been a source of disagreements.
  3. Is Flux.jl a modular interface or the source of the top-notch implementations? Should the Flux.jl documentation include examples using GeometricFlux.jl and showcase all that the FluxML ecosystem has to offer, or is Flux.jl just the core? If Flux.jl is just the core, is there any canonical description of the Flux.jl universe and how it is used? Which of those layers should be in Flux.jl and which should not?
  4. What is the overarching goal of Flux? What's the thing that is like "if Flux achieves X then it did its job"? Is it Flux hitting feature parity with PyTorch? Flux being the fastest library? Flux being good enough while still being legible internally to the average ML undergrad?

Lastly, I just want to politely ask that these questions really stay for the maintainers. I can see a lot of users jumping in here and saying "I think Flux should be X", but if you're not actively trying to work on Flux and the greater ML ecosystem to be X, then 🤷 that's really just derailing the conversation. Users want everything, everyone knows that. But I think it's good for people working on and around Flux to have a good sense of what everyone else is putting their time in for. If we can understand why people are donating their time, then we can all have a better appreciation of each other's work.

@ToucheSir pinned this issue Oct 4, 2021
@ToucheSir (Member) commented Oct 4, 2021

Thanks Chris for starting this conversation. Leaving some initial thoughts here that I will update given more time/ideas:

  1. My feeling is that this is actually a proxy for a more fundamental divide between the "quasi-static" (to steal a term from you) camp and the "diffprog" camp. Examples of the former would be Metalhead or Transformers.jl, while the latter would include SciML and Turing. The former wants better throughput and scalability, and is willing to sacrifice some flexibility for it by, for example, wrapping optimized external libraries. The latter is generally not concerned with huge networks, but does need more flexibility for, say, higher-order derivatives or broader language + library support. Given limited maintainer bandwidth, I'm not sure how we should balance this tension at a strategic level.

  2. I'm agnostic about whether Flux source code should focus on pedagogy, but I do think it should err on the side of being extensible for downstream users instead of trying to create its own island like the TensorFlows of the world. This is why you'll often see me hemming and hawing about PRs that add a bunch of framework-internal functionality for a very specific set of use cases (e.g. a complex loss function that's only been used for 2 NLP benchmarks).

    However, what I am comfortable rejecting outright are changes that have an outsized impact on the time to first gradient (TTFG). I personally picked up Flux right before Tracker was phased out. I've never been able to replicate that Python-framework-like responsiveness since, and frankly I'm not sure I would've stuck around had I started with the current version of Flux. So in that sense, I think some of the resistance to taking on dependencies is not necessarily about code complexity, but about the execution overhead those deps add to an already subpar startup. Thus, if something like this:

    If @chriselrod added a bunch of rrules that reduce the amount of reliance on Zygote/Diffractor to optimize things away

    can be pulled off with no material increase in (or even a decrease in) TTFG, then I am 100% for it. Or if we could come up with an equivalent to "22 seconds to 3 and now more: Let's fix all of the DifferentialEquations.jl + universe compile times!" (SciML/DifferentialEquations.jl#786) that opens up breathing room for future, more compilation-heavy changes. What I fear is that the biggest culprits may be out of our control, but given how well the DiffEq effort went, those fears may be unfounded. (A rough sketch of what I mean by TTFG follows after this list.)

  3. My current opinions are:
    a. The former.
    b. Yes, but not in the reference docs.
    c. I'm not aware of any, outside of the tutorial documentation that folks like @logankilpatrick and @lilianabs have worked on.
    d. If 2+ dependents have it, it should probably be in core. Otherwise, FastAI has shown us how well layering can work without Flux itself needing to get involved.

  4. I'd be curious to know this too 😅. @ElOceanografo's post on Discourse was a real lightbulb moment because it provided a tangible framework for how we might drive adoption. Personally, I have no idea how large and diverse our group B is, let alone our group D. It did seem like there was a specific vision of where Flux(ML) was headed circa v0.10, but I've not been involved long enough to know what that was.
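For reference, "time to first gradient" here means roughly the following kind of fresh-session measurement (a minimal sketch, not an official benchmark; the model and sizes are arbitrary):

```julia
# Run in a fresh Julia session: everything below compiles from scratch,
# so the reported time is dominated by compilation rather than the arithmetic.
using Flux

@time begin
    m  = Chain(Dense(10, 32, relu), Dense(32, 1))
    x  = rand(Float32, 10, 16)
    gs = gradient(() -> sum(m(x)), Flux.params(m))
end
```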

@ViralBShah (Member) commented Oct 4, 2021

We should strive for good out-of-the-box performance and composability with the rest of the Julia ecosystem. The goal is not to compete with other ML packages, but to do more, do better, or do different things. For example, you can already be on par with PyTorch just by PyCall-ing it.

Hence that is not a goal in itself. But if we need to pull in a few kernels from elsewhere while we work on our native Julia kernels, we should definitely do it. People come to Julia for performance, after all!

@ToucheSir (Member) commented Oct 4, 2021

Absolutely, and let me take this opportunity to double down on how important TTFG is. We're already within ±50% of PyTorch for pretty much any random model you can find on Torch Hub, and with Dhairya and Julian's work on distributed training that should extend beyond a single process/GPU (more dynamic architectures like neural ODEs, AIUI, range from very competitive to a complete bloodbath). However, the sleight of hand is that I'm only counting the post-warmup time for most models. It should not take 50+ seconds to do a single training step on a simple MLP. Heck, it probably shouldn't even take 5 seconds, but at least in that range you might beat out XLA!

@ViralBShah (Member)

We should review the Flux ecosystem for anything we can do to improve compile latency, but otherwise that remains an ongoing task for the compiler folks.

@darsnack (Member) commented Oct 5, 2021

It's great that we are having this discussion. Here are my thoughts on the questions.

  1. The purpose of being pure Julia is to maximize the amount of code that can be differentiated and to make composing with the rest of the ecosystem easier. I have no qualms about calling various external or internal libraries for performance, as long as that stuff lives somewhere like NNlib.jl and can be swapped out by more advanced users. But composing with non-standard types is a basic priority feature IMO. I agree that the technical implementation does not align with this goal. NNlib.jl needs more interface hooks to make it extensible; in general, that's the most sustainable approach here IMO. It's not Flux's job to provide a generic implementation that "just works." Instead, make the API design flexible enough so that assumptions (like the "safe types" for convolution) are not baked in: provide a hook and a default value for it (see the sketch after this list).
  2. I think just aiming to be usable for current-day research is a pretty high bar that we do not meet. I never liked the Dense example, because you can write σ(W*x + b) in most frameworks and it works. Support for writing a line like that does not mean the default implementations in Flux.jl need to be that simple (convolution is far more complex for us!). If we can get performance from more complex implementations, then we should. If one day the "generic" implementation works and is fast all the time, great. Until then, there's no need to hang onto it.
  3. Flux.jl should be modular, and it should not aim to provide a solution for everything. Packages that implement things like transformers, GNNs, data loading, datasets, etc. should be considered part of the Flux ecosystem. We should be documenting how to use those packages in the Flux docs (can be the Flux.jl docs or some other page as long as it is front and center). Instead of reinventing code that exists elsewhere, we should be improving it, helping maintain it, and documenting it. That will benefit users more than a duplicate implementation.
  4. I'm not sure what the answer is here either. Should there ever be a point where we say "Flux is done"? If you mean at what point do we stop developing Flux.jl and start developing more stuff around it, then for me the answer is once the explicit-gradient work is complete (that includes the Diffractor transition, Optimisers.jl, etc.). That's the only remaining major change IMO. The rest of the development is outside Flux.jl, and the development inside is for performance, fixes, and maintenance.
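To make the "hook" idea in point 1 concrete, here is a minimal sketch of the pattern I mean. The names (conv_backend, Im2ColConv, DirectConv) are hypothetical, not actual NNlib API:

```julia
# Hypothetical dispatch hook: the convolution entry point asks which backend to
# use, and downstream packages can overload conv_backend for their own array or
# element types without the "safe types" list being baked into the fast path.
abstract type ConvBackend end
struct Im2ColConv <: ConvBackend end   # fast default for plain Float32/Float64 arrays
struct DirectConv <: ConvBackend end   # generic fallback that works for any eltype

conv_backend(::AbstractArray{<:Union{Float32,Float64}}) = Im2ColConv()
conv_backend(::AbstractArray) = DirectConv()   # e.g. arrays of dual numbers land here

# Hypothetical usage: myconv(x, w) = _myconv(conv_backend(x), x, w)
```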

@ViralBShah (Member)

Some of this should be reflected on the website and the docs.

@CarloLucibello (Member)

I'm very much aligned with what Brian and Kyle said. Adding a few, partially overlapping thoughts.

  1. Flux.jl is written in Julia for composability and because we love to write Julia code. And people love looking into Flux and being able to understand it. Honestly, I'm not interested in incorporating PyTorch (which doesn't mean that we shouldn't), but I'm definitely interested in supporting as many types as possible while also optimizing the most common scenarios.

  2. For me: Correctness > Performance (especially on common use cases) > Composability > Implementation Simplicity.
    Docs can always be pedagogic and custom types can always be easy to define,
    but as Flux matures I want it to squeeze out every drop of performance I can.
    Fortunately (and thanks, Julia!) I don't expect a large tradeoff in most cases.
    I doubt the Dense layer will ever be more complicated than something like @turbo σ.(W*x .+ b).

  3. I think Flux.jl should be the core: it should only provide basic deep learning components, implementing them directly or importing them from other packages (NNlib, Optimisers.jl, Zygote.jl, ...).
    PyTorch is a good reference: we want to do everything PyTorch does, but with support for generic types. Flux.jl's docs can refer to the wider ML ecosystem (we already do), but I wouldn't go all the way to a full showcase; we don't want the burden of keeping everything in sync, and that is something the packages themselves can do better.

  4. Reaching feature parity with PyTorch and being the fastest DL library are top priorities for me; composability and ease of use/customization are top priorities as well. Being legible is a second-order one (but again, I hope we won't have to lose much ground here).

@DhairyaLGandhi (Member)

We have a vision of being flexible and scalable while optimising hot paths. We have complexity where we need it; our AD is one of the most complex out there, if I do say so myself. Our forward mode is also written with SIMD in mind.

We focused on PL-like use cases, so a lot of our design philosophy was also built around a natural-feeling API which can be extended and stays close to the language semantics. That definitely included some quirks (broadcasting over different dimensions comes to mind), but those are hard to get rid of entirely. This is our approach to differentiable programming, and it was reflected in an old post (https://julialang.org/blog/2017/12/ml-pl/) which seems especially relevant today.

We stayed generic to give us the best chance of getting new use cases to work (tracked arrays, CuArrays, images, dual numbers, etc. also helped, since we needed to work with them), and that has paid massive dividends over time.

In most ML cases we have to optimise for GPUs for large networks, and now increasingly for smaller networks for SciML. We are already competitive on CPUs (compared to other frameworks) but can of course do more with the likes of Octavian.jl. For GPUs we follow the same cuDNN paths as other frameworks.

Our philosophy leaves space for accommodating specially optimised dispatches, since we have the core infrastructure in place. Remember when we explored TVM as an intermediate pass? So I think optimised rrules are fair game. We have workshopped the likes of NNlibCPU to get started on what that would look like.

Composability and being Julian are key to hitting the sweet spot of performance and flexibility. Performance follows from good design, and we actively optimise hot paths on an ongoing basis; we want a fast library, after all. What good design looks like can be subjective, but we have made great headway with our approach so far, and if there are specific performance considerations we need to address, we will do that.

@chriselrod

I need to find the time to finish up some basic layers and functionality in NNlibCPU.jl, and also to add some more specialized support to DiffEqFlux.jl.

For now, I've been using/experimenting a bit in SimpleChains.jl, adding functionality as I need it.

Example:

using SimpleChains, DiffEqFlux, Flux, Zygote  # Zygote is needed below for Zygote.gradient

x = rand(Float32, 24, 199); y = rand(Float32, 2, 199);
chn = Chain(Dense(24, 24, tanh), Dense(24, 2));

opt_chn = Flux.Optimiser(WeightDecay(1f-4), ADAM())
chn_loss = let m = chn
  (x, y) -> Flux.Losses.mse(m(x), y)
end

@time foreach(_ -> Flux.train!(chn_loss, Flux.params(chn), ((x,y),), opt_chn), 1:10_000) # Flux
@time foreach(_ -> Flux.train!(chn_loss, Flux.params(chn), ((x,y),), opt_chn), 1:10_000) # Flux

chn_fast = FastChain(FastDense(24, 24, tanh), FastDense(24, 2));
function train!(chn::FastChain, X, Y, vparam = DiffEqFlux.initial_params(chn); iters = 1_000, opt = Flux.Optimiser(WeightDecay(1f-4), ADAM()))
  f = let X = X, Y = Y
    p -> Flux.Losses.mse(chn(X, p), Y)
  end
  for _ in 1:iters
    g = Zygote.gradient(f, vparam)[1]
    Flux.Optimise.update!(opt, vparam, g)
  end
  vparam
end
vp_fc = @time train!(chn_fast, x, y);  # warm-up/compile run
@time train!(chn_fast, x, y, vp_fc, iters = 10_000); # DiffEqFlux
@time train!(chn_fast, x, y, vp_fc, iters = 10_000); # DiffEqFlux


chn_simple = SimpleChain(TurboDense(tanh, (static(24), static(24))), TurboDense(identity, (static(24), static(2))));
function train!(_chn::SimpleChain, X, Y, vparam = SimpleChains.init_params(_chn), g = similar(vparam); iters = 1_000, opt = ADAM(), λ = 1f-4)
  chn = L2Penalty(SimpleChains.add_loss(_chn, SquaredLoss(Y)), λ * length(Y))
  for _ in 1:iters
    valgrad!(g, chn, X, vparam)           # writes the gradient into g, returns the loss
    Flux.Optimise.update!(opt, vparam, g)
  end
  vparam
end
vp_simple = SimpleChains.init_params(chn_simple); g = similar(vp_simple);
@time train!(chn_simple, x, y, vp_simple, g, iters = 10_000); # SimpleChains
@time train!(chn_simple, x, y, vp_simple, g, iters = 10_000); # SimpleChains

I get:

julia> @time foreach(_ -> Flux.train!(chn_loss, Flux.params(chn), ((x,y),), opt_chn), 1:10_000) # Flux
  1.448004 seconds (3.51 M allocations: 1.438 GiB, 6.94% gc time, 0.66% compilation time)

julia> @time foreach(_ -> Flux.train!(chn_loss, Flux.params(chn), ((x,y),), opt_chn), 1:10_000) # Flux
  1.452637 seconds (3.51 M allocations: 1.438 GiB, 7.79% gc time, 0.65% compilation time)

julia> @time train!(chn_fast, x, y, vp_fc, iters = 10_000); # DiffEqFlux
  1.545734 seconds (3.89 M allocations: 1.841 GiB, 8.54% gc time)

julia> @time train!(chn_fast, x, y, vp_fc, iters = 10_000); # DiffEqFlux
  1.557841 seconds (3.89 M allocations: 1.841 GiB, 9.47% gc time)

julia> @time train!(chn_simple, x, y, vp_simple, g, iters = 10_000); # SimpleChains
  0.314526 seconds (10.01 k allocations: 318.391 KiB)

julia> @time train!(chn_simple, x, y, vp_simple, g, iters = 10_000); # SimpleChains
  0.228624 seconds (10.01 k allocations: 318.391 KiB)

So for problems like this, a 5x improvement over Flux is achievable.
Adding some instrumentation shows that >10% of the time in train!(chn_simple, ...) is being spent in Flux.Optimise.update!, and this is also the source of all the memory allocations, so that may be a reasonable target for optimization, at which point SimpleChains.jl may become usable as a standalone library.
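For reference, the kind of non-allocating update that would remove this overhead might look like the following hedged sketch of an in-place Adam step with preallocated state (hypothetical names, not the Flux.Optimise implementation):

```julia
# Preallocated Adam state; each step mutates p, mt, and vt in place, so the
# optimiser itself contributes no per-iteration allocations.
struct AdamState{T}
    mt::Vector{T}
    vt::Vector{T}
end
AdamState(p::Vector) = AdamState(zero(p), zero(p))

function adam_update!(p, g, s::AdamState, t; η = 1f-3, β1 = 0.9f0, β2 = 0.999f0, ϵ = 1f-8)
    @. s.mt = β1 * s.mt + (1 - β1) * g      # first-moment estimate
    @. s.vt = β2 * s.vt + (1 - β2) * g^2    # second-moment estimate
    @. p -= η * (s.mt / (1 - β1^t)) / (sqrt(s.vt / (1 - β2^t)) + ϵ)  # bias-corrected step
    return p
end
```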

can be pulled off with no material increase in (or even a decrease in) TTFG, then I am 100% for it.

I think reducing the burden on Zygote should help.

SimpleChains' initial compile times aren't great though, because LoopVectorization's compile times are still bad.
That will also be the case for NNlibCPU.jl, but at least using it will be optional.

@ViralBShah (Member)

Is there a separate issue for tracking compile times?

@ChrisRackauckas (Member, Author) commented Oct 5, 2021

I had a few conversations with contributors, and one of the things I would like to sum up from them is this idea of a triangle: there's simplicity, performance, and flexibility. In general you can pick two, and it will come at the detriment of the third. Of course, you can build things well so that it doesn't completely sacrifice the other, but there will still be an undeniable engineering trade-off. @chriselrod brings up a very good point that simplicity + flexibility can sometimes still give "performance", but only under a narrow definition of performance that does not include compile times.

At the end of the day, expanding our rrules to make Zygote/Diffractor do less work will always decrease compile times. We saw in DifferentialEquations.jl that we could reduce compile times by an order of magnitude for the most common cases (SciML/DifferentialEquations.jl#786), and one of the most substantial pieces of this was just taking the CPU dispatches and expanding the broadcasts to loops by hand (SciML/OrdinaryDiffEq.jl#1465). You could probably drop Zygote compile times from around 50 seconds to under 5 seconds just by adding a bunch of rrules on the standard layers and unrolling any CPU broadcast. The cost, of course, is simplicity. The question is: would such a PR be accepted? Is it worth it? Agreed this is worth an issue of its own to diagnose and then play with.
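For reference, the kind of hand-written rule in question might look something like this hedged sketch using ChainRulesCore, with dense_affine standing in for the real layer forward pass (it is not the actual Flux/NNlib code):

```julia
using ChainRulesCore

# Stand-in for a Dense-style forward pass. With a hand-written rrule,
# Zygote/Diffractor no longer has to trace through the matmul + broadcast
# machinery at compile time; it just calls this rule.
dense_affine(W, b, x) = W * x .+ b

function ChainRulesCore.rrule(::typeof(dense_affine), W, b, x)
    y = dense_affine(W, b, x)
    function dense_affine_pullback(ȳ)
        Ȳ = unthunk(ȳ)
        # Gradients for W, b, x of y = W*x .+ b (x is features × batch)
        return NoTangent(), Ȳ * x', vec(sum(Ȳ; dims = 2)), W' * Ȳ
    end
    return y, dense_affine_pullback
end
```

Whether additionally unrolling the bias/activation broadcasts into explicit loops inside such rules is worth the readability cost is exactly the question above.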

From these discussions, though, it seems like "simplicity" is no longer the major focus of Flux. But it's what the docs open with:

Flux: The Julia Machine Learning Library

Flux is a library for machine learning. It comes "batteries-included" with many useful tools built in, but also lets you use the full power of the Julia language where you need it. We follow a few key principles:

- Doing the obvious thing. Flux has relatively few explicit APIs for features like regularisation or embeddings. Instead, writing down the mathematical form will work – and be fast.
- You could have written Flux. All of it, from LSTMs to GPU kernels, is straightforward Julia code. When in doubt, it’s well worth looking at the source. If you need something different, you can easily roll your own.
- Play nicely with others. Flux works well with Julia libraries from data frames and images to differential equation solvers, so you can easily build complex data processing pipelines that integrate Flux models.

I think a PR updating the philosophy and promises of Flux that the reader first sees would be a good start. Maybe here's some starter text given the discussions I've had:

Flux: The Julia Machine Learning Library

Flux is a library for machine learning in Julia geared towards high-performance and production-quality pipelines. It comes "batteries-included" with many useful tools built in, but also lets you use the full power of the Julia language where you need it. This flexibility + performance is the core philosophy of Flux.

We follow a few key principles:

- Composability is core. Flux works directly with the Julia package ecosystem, even with packages which were never intended to be used with machine learning! Flux works well with Julia libraries from data frames and images to differential equation solvers, so you can easily build complex data processing pipelines that integrate Flux models.
- Performance is key. Flux integrates with high-performance AD tools like Zygote to allow for fast generated code. Both CPU and GPU performance is highly optimized. Integration with parallelism tools like DaggerFlux.jl makes distributed and multi-GPU training easy for scaling production. Please report any performance issue as a bug.
- Flux is easily extendable. Flux itself is written in Julia, so adding to the ML framework can be done by you! New layers are easily added by external packages like NNlibCPU and GeometricFlux. Specializations of AD rules can be done via ChainRules.jl. You can write your own GPU kernels with CUDA.jl. From just one language, your new research kernels can be optimized and made production-ready.

I doubt the Dense layer will ever be more complicated than something like @turbo σ.(W*x .+ b).

Octavian 😅
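For concreteness, a hedged sketch of what an Octavian-backed CPU fast path could look like; this is an illustration of the trade-off, not the actual Flux Dense implementation:

```julia
using Octavian  # pure-Julia matmul kernels

# Hypothetical CPU fast path for a Dense-like layer on Float32 matrices.
function dense_octavian(W::Matrix{Float32}, b::Vector{Float32}, x::Matrix{Float32}, σ = identity)
    y = Matrix{Float32}(undef, size(W, 1), size(x, 2))
    Octavian.matmul!(y, W, x)   # replaces the "obvious" W * x
    return σ.(y .+ b)
end
```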

@caseykneale

So I'll be mostly quiet, because my contributions to Flux have been mostly indirect and minimal, but you know me :) I give unsolicited user feedback at times... What I think people want/expect/need out of Flux is pretty simple.

  1. What is the purpose of Flux.jl being pure Julia? Good for bug tracking, faster iterations, maintainability, easy to install/deploy remotely (less surface area for trouble), etc.

  2. Is the goal of Flux.jl to be a teaching and research repository or a production ML library? Not sure, but the latter would be important for a lot of users. Research is super important, but researching how to build tools is a very specific kind of research that not many people who intend to use libraries like this actually perform.

There hasn't been a stable enough DL library that supports CPU & GPU in the Julia ecosystem since I've been around (Julia 0.6). Knet was OK back in the day, but it was also slow, inflexible, etc.; it might be better now. I take that back: a few weeks before Zygote was introduced, Flux was pretty solid. I used one of those versions exclusively for about a year or so because it worked so well (except for 1D convs and a few other things I had to tweak). That's problematic for end users. In my experience (today included), I've started a project in Flux and then realized "wow, I can't ___ without getting an error I don't have time to solve right now. A few versions ago I'm pretty sure I could do this... I'll go back to Torch, because rewriting this loading job and model topology will only take 20 min".

I guess what I am saying is: the syntax Flux allows for is amazing. Personally, I'd take some performance cost at v1.0 for more stability and the same flexibility.

  3. Is Flux.jl a modular interface or the source of the top-notch implementations? Doesn't matter to me as an end user, as long as the pkg.add doesn't OOM a typical workstation, relentlessly version-clash, or take an hour to install.

  4. What is the overarching goal of Flux? Not for me to reason about.

@ViralBShah (Member)

Since we merged #1736, I will close this one. But let the discussion continue; it might be better for it to happen on Discourse, though.
