Archiving DataLoaders.jl #90

Open · 6 tasks · lorenzoh opened this issue May 28, 2022 · 20 comments
@lorenzoh (Contributor) commented May 28, 2022

Now that MLUtils.jl has feature parity with DataLoaders.DataLoader, it's time to deprecate DataLoaders.jl.
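
As context for the parity claim, a minimal migration sketch (hedged: the DataLoaders.jl call follows its README's positional-batchsize form, and the MLUtils keywords follow its documented DataLoader API; the parallel flag is the one discussed later in this thread):

data = rand(Float32, 28, 28, 1_000)   # any container; last dim is the obs dim

# Before: background-threaded batch loading with DataLoaders.jl
# (separate sessions below; both packages export a DataLoader)
using DataLoaders
dl = DataLoader(data, 16)             # positional batch size

# After: the equivalent MLUtils.jl loader
using MLUtils
dl = DataLoader(data; batchsize = 16, parallel = true)

for batch in dl
    # batch is a 28×28×16 array here
end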

We should:

@CarloLucibello (Member) commented May 28, 2022

I'm having second thoughts on this, since I've heard complaints that MLUtils.jl is now too heavy for someone who needs only basic utilities like a train/test split. For instance, the latency of using MLUtils is halved by removing the parallel stuff:

julia> @time using MLUtils # on master
  2.090915 seconds (2.10 M allocations: 124.960 MiB, 3.50% gc time, 68.02% compilation time)
julia> @time using MLUtils  # after removing FLoops and FoldsThreads
  0.928274 seconds (890.85 k allocations: 53.444 MiB, 2.36% gc time, 73.03% compilation time)

@CarloLucibello (Member) commented May 28, 2022

Maybe we can keep DataLoaders.jl, move it to the JuliaML org, copy the implementation over from here, and tag a breaking release. Then we have Flux reexport it.

For MLUtils, we keep eachobs without the parallel option.

As a further breakdown, I also suggest moving getobs and numobs and their implementations for the Base types to an MLCore.jl repo. But that can come later; with the excision of the parallel data loader, MLUtils is already a lean dependency for someone who wants to implement the obs interface.
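
For reference, the obs interface such an MLCore.jl would carry is tiny. A hedged sketch of the generic fallbacks plus a couple of Base-type methods (modeled on MLUtils' documented behavior, not copied from it):

# Generic fallbacks: any type with getindex/length participates for free.
numobs(data) = length(data)
getobs(data, idx) = data[idx]

# Base-type methods: arrays treat the last dimension as the obs dimension,
# tuples apply the interface elementwise (assuming equal numobs throughout).
numobs(x::AbstractArray) = size(x, ndims(x))
getobs(x::AbstractArray, i) = x[ntuple(_ -> Colon(), ndims(x) - 1)..., i]
numobs(t::Tuple) = numobs(first(t))
getobs(t::Tuple, i) = map(d -> getobs(d, i), t)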

Or we could just go ahead with the plan outlined in the OP. What do you think, @lorenzoh @darsnack @ToucheSir?

@lorenzoh (Contributor, Author)

Right, I hadn't thought of the latency. I think what you're proposing makes sense, and it makes for an easier migration as well.

@juliohm (Member) commented May 28, 2022

My opinion is that we should really consider how many people are annoyed by this latency issue, and whether the demand is real enough to justify dividing the work into multiple repositories again.

Premature modularization is a serious issue in Julia organizations, where people split things into packages because one user commented about load time. We all hope that Julia will eventually fix the latency issue, so let's think carefully before deciding to go separate ways.

Can you think of users of DataLoaders.jl who are not users of MLUtils.jl? If the answer is yes, then we have a more solid argument for keeping these efforts separate with such a tiny community of maintainers.

@ToucheSir (Contributor)

How much can we reduce import times without removing core deps? (FLoops can probably go, though, since it's mostly used for syntactic sugar.) I don't think this package has any precompilation, for example.

@darsnack (Member)

I agree with Brian that we should look into reducing the load time contributed by FLoops first. It would be nicer as a user not to have to load different packages to set up the data pipeline.

I think the main complaint re: load times comes from packages that want to extend the interface, not from end-user load times. For this, let's see how far people get with the getindex and length interface (a sketch of that route follows below). If we find that many types need to define getobs and numobs specifically, then we can split the interface out. I think that makes more sense than splitting out the parallel loading.
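
To illustrate the getindex-and-length route (a hedged sketch, assuming MLUtils' documented fallbacks of numobs to length and getobs to getindex; SquaresDataset is a made-up example type), a dataset needs no dependency on MLUtils at all:

using MLUtils   # loaded by the end user, not by the dataset's package

# A dataset that only implements plain Base indexing:
struct SquaresDataset
    n::Int
end
Base.length(d::SquaresDataset) = d.n
Base.getindex(d::SquaresDataset, i) = i .^ 2   # handles Int and Vector{Int}

data = SquaresDataset(100)
numobs(data)     # 100, via the length fallback
getobs(data, 3)  # 9, via the getindex fallback

for batch in DataLoader(data; batchsize = 10)
    # each batch is data[idxs] for a 10-element index block
end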

@lorenzoh (Contributor, Author) commented May 29, 2022

As another data point, the import time difference is also quite big on my machine:

julia> @time using MLUtils
  0.222900 seconds (468.17 k allocations: 27.956 MiB, 1.72% compilation time) (no parallel deps)
  1.010500 seconds (1.67 M allocations: 95.612 MiB, 1.94% gc time, 43.90% compilation time) (master)
  0.995093 seconds (1.56 M allocations: 88.791 MiB, 46.31% compilation time) (without FLoops)

Edit: added time without FLoops.jl

@CarloLucibello (Member)

> Can you think of users of DataLoaders.jl who are not users of MLUtils.jl? If the answer is yes, then we have a more solid argument for keeping these efforts separate with such a tiny community of maintainers.

I can think of many users of MLUtils who are not users of DataLoaders.

> It would be nicer as a user not to have to load different packages to set up the data pipeline.

True. This is mitigated, though, by the fact that Flux users just import Flux and have everything they need, without having to know anything about MLUtils or DataLoaders. Non-deep-learning users can just use eachobs without the parallel option.
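
For instance (a minimal sketch; eachobs iterates observations one at a time, serially, per MLUtils' docs):

using MLUtils

X = rand(2, 100)        # 100 observations, each a 2-element column
for x in eachobs(X)     # equivalent to getobs(X, i) for i in 1:numobs(X)
    @assert length(x) == 2
end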

> Edit: added time without FLoops.jl

So removing FLoops doesn't help much.

As for the maintenance burden of another repo, I expect DataLoaders.jl to be very low-activity; we will hardly change anything until we start experimenting with other parallel loading strategies.

@ToucheSir (Contributor)

What about precompilation? I don't think any packages in JuliaML use it?

@CarloLucibello (Member)

Precompilation would help with TTFX, but not with speeding up using, right?
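
(For context, a hedged sketch of what a precompile directive looks like; the module and function are illustrative, not MLUtils code. The directive caches a compiled specialization when the package is precompiled, which helps TTFX, while the using cost of loading the package and its deps remains:)

module MyDataUtils

collate(xs::Vector{Vector{Float32}}) = reduce(hcat, xs)

# Runs at package precompile time; the cached specialization is reused
# at runtime, so the first call to collate skips most of the compilation.
precompile(collate, (Vector{Vector{Float32}},))

end # module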

@ToucheSir (Contributor)

I thought it was part of SciML/DifferentialEquations.jl#786, but reading back through SciML/DifferentialEquations.jl#786 (comment) it appears invalidations + Requires were more important for import times. Has anyone tried running @time_imports from 1.9 on MLUtils? I don't have a nightly Julia installed locally.

@theabhirath (Contributor)

This is what I get:

julia> @time_imports using MLUtils
     10.4 ms    ┌ MacroTools
     17.8 ms  ┌ ZygoteRules 34.58% compilation time
      0.6 ms  ┌ DefineSingletons
      0.3 ms    ┌ IteratorInterfaceExtensions
      0.9 ms  ┌ TableTraits
      1.1 ms      ┌ DelimitedFiles
      3.3 ms    ┌ Compat
      1.2 ms    ┌ Requires
      0.4 ms    ┌ DataValueInterfaces
      1.0 ms      ┌ DataAPI
     11.2 ms    ┌ Tables
      1.0 ms    ┌ ConstructionBase
     16.2 ms    ┌ Setfield 29.96% compilation time (22% recompilation)
     13.6 ms    ┌ InitialValues
    127.6 ms  ┌ BangBang 55.14% compilation time (2% recompilation)
     10.0 ms  ┌ FunctionWrappers
     51.4 ms  ┌ DataStructures
      0.4 ms    ┌ CompositionsBase
     15.4 ms  ┌ Accessors 24.46% compilation time
      1.5 ms  ┌ ContextVariablesX
     60.8 ms    ┌ ChainRulesCore
     62.5 ms  ┌ ChangesOfVariables
      0.8 ms    ┌ InverseFunctions
      2.8 ms    ┌ DocStringExtensions 52.48% compilation time
      3.6 ms    ┌ IrrationalConstants
      0.7 ms    ┌ StatsAPI
      0.7 ms    ┌ SortingAlgorithms
      1.4 ms    ┌ LogExpFunctions
      6.1 ms    ┌ Missings
     38.9 ms  ┌ StatsBase 3.78% compilation time
      6.9 ms    ┌ MicroCollections
      0.7 ms    ┌ Adapt
      0.7 ms    ┌ ArgCheck
      6.4 ms    ┌ SplittablesBase
     34.7 ms    ┌ Baselet
     85.8 ms  ┌ Transducers 10.24% compilation time
     28.4 ms  ┌ MLStyle
      1.1 ms  ┌ PrettyPrint
      5.1 ms  ┌ ShowCases
    265.3 ms  ┌ FoldsThreads 86.56% compilation time
      1.2 ms  ┌ FLoopsBase
      0.7 ms  ┌ NameResolution
      3.4 ms  ┌ JuliaVariables
      6.1 ms  ┌ FLoops
    747.3 ms  MLUtils 43.14% compilation time (<1% recompilation)

@CarloLucibello (Member)

We could do the lazy import (or Requires-based import) thing we do in MLDatasets.jl here as well for FoldsThreads (see https://github.com/johnnychen94/LazyModules.jl), but it is a hackish solution that we should avoid if possible.
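
For the record, a sketch of what that would look like (assuming LazyModules' @lazy import Name = "uuid" form; the UUID below is a placeholder, not FoldsThreads' real one):

using LazyModules

# Placeholder UUID: substitute FoldsThreads' registered UUID from General.
@lazy import FoldsThreads = "00000000-0000-0000-0000-000000000000"

# MLUtils would then load without paying FoldsThreads' import cost; the
# first access to FoldsThreads (e.g. building its executor) pays it instead.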

@ToucheSir (Contributor)

Some finer-grained timing locally suggests that the majority of import overhead for BangBang and FoldsThreads is self time and not from dependencies (unless there is significant invalidation happening, but that does not appear to be the case). It's not clear to me why this is, though. @tkf, do you have any thoughts?

@CarloLucibello (Member)

I'm waiting for the outcome of this discussion before tagging a new release that adds the parallel API to DataLoader, which we may have to break in the future if the parallel implementation is moved out. But a few changes have now accumulated, most importantly bug fixes like #97, so we should reach a decision soon.

@lorenzoh (Contributor, Author)

Would it be possible to add parallel as an experimental option? It could be removed later by simply ignoring it.

Thinking it through, I think it would be better to have DataLoader in MLUtils.jl. If criticisms mount afterwards, it can always be removed.

Re: being able to implement the data container interface without taking on a large dep: I think getindex and length should suffice in most use cases; where they don't, it is mostly for built-in types like Array, Tuple, and Dict, which are already handled in MLUtils.jl.
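
A hedged sketch of the "ignore it later" idea (hypothetical names, not MLUtils' actual constructor; the point is that dropping the feature later changes behavior, not the call signature):

struct SimpleLoader      # hypothetical stand-in for the serial loader
    data
    batchsize::Int
end

function makeloader(data; batchsize = 1, parallel = false)
    if parallel
        # If the parallel backend is later excised, this branch keeps old
        # call sites working: warn once and fall back to the serial path.
        @warn "parallel=true is experimental and currently ignored" maxlog = 1
    end
    return SimpleLoader(data, batchsize)
end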

@darsnack (Member) commented Jun 10, 2022

I share Lorenz's view here about waiting to see what people say. Also, keeping it in means we can release now, so I suggest doing that and moving this discussion to the next cycle. The long load times and DataLoader as a type are already released, right?

@lorenzoh (Contributor, Author)

Yes, the load time is already released.

@CarloLucibello (Member)

Ok, let's release then.

@ablaom commented Nov 3, 2023

> As a further breakdown, I also suggest moving getobs and numobs and their implementations for the Base types to an MLCore.jl repo.

Yes, please.
