
Use PrecompileTools.jl #284

Open
RomeoV opened this issue Aug 5, 2023 · 9 comments

RomeoV commented Aug 5, 2023

Motivation and description

Currently, the startup time for using this package is quite long.
For example, running the code snippet below takes about 80s on my machine, of which 99% is overhead (the two epochs are practically instant).

For comparison, a basic Flux model only takes about 6s after startup. Since in Julia 1.9 and 1.10 a lot of the compile time can be "cached away", I think we'd greatly benefit from integrating something like PrecompileTools.jl into the packages.

Possible Implementation

I saw there's already a workload.jl file (which basically just runs all the tests) that is used for sysimage creation. Perhaps we can do something similar for the PrecompileTools.jl directive.
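
Very roughly, and only as a sketch (the block types mirror the sample code below; the mocked data is an assumption so that nothing needs to be downloaded at precompile time), such a directive could look like:

using PrecompileTools: @setup_workload, @compile_workload

@setup_workload begin
    labels = string.(0:9)
    @compile_workload begin
        # mocked data stands in for a downloaded dataset
        data = ([rand(RGB{N0f8}, 32, 32) for _ in 1:10],
                [rand(labels) for _ in 1:10])
        blocks = (Image{2}(), Label{String}(labels))
        task = ImageClassificationSingle(blocks)
        learner = tasklearner(task, data)
        fitonecycle!(learner, 1)
    end
end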

I can try to get a PR started in the coming days.

Sample code

using FastAI, FastVision, Metalhead, Random

# Load the MNIST recipe and subsample 100 observations to keep the two epochs short.
data, blocks = load(datarecipes()["mnist_png"])
idx = randperm(length(data[1]))[1:100]
data_ = (mapobs(data[1].f, data[1].data[idx]), mapobs(data[2].f, data[2].data[idx]))

task = ImageClassificationSingle(blocks)
learner = tasklearner(task, data_, backbone=ResNet(18).layers[1], callbacks=[ToGPU()])
fitonecycle!(learner, 2)
exit()
lorenzoh (Member) commented Aug 5, 2023

Sounds good! Have you tested the speedup already?

RomeoV (Author) commented Aug 5, 2023

Some issues I'm running into:

  • datarecipes is not available at precompile time. Also, I suppose we don't have access to any data that needs to be downloaded. @lorenzoh, does FastVision ship with some datarecipes before running the module's __init__ function? Otherwise I'll just mock some data.
  • Currently, CUDA precompilation seems to be broken. See "Cannot precompile GPU code with PrecompileTools" (JuliaGPU/CUDA.jl#2006).
  • Since FastVision doesn't depend on Metalhead (fair enough), I can't precompile with the ResNet, so it probably also makes sense to mock a model.

For now, I'm just trying to test what speedup we could hope for by making a separate "startup" package (as suggested here) that loads all of FastVision, Metalhead, etc. and basically has my code above as a precompile workload, but without the GPU and with mocked data. I'll report what speedup that brings.
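
For reference, the skeleton of such a startup package might look like this (the name FastAIStartup and the layout are placeholders):

FastAIStartup/
├── Project.toml            # depends on FastAI, FastVision, Metalhead, Flux, PrecompileTools
└── src/
    └── FastAIStartup.jl    # loads the dependencies and runs a precompile workload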

RomeoV (Author) commented Aug 5, 2023

Hmm, this approach brings the TTF-epoch from 77s down to about 65s, which is a speedup for sure, but I was hoping for even more. I'll have to look a bit deeper at where the time is spent. It might be all GPU stuff, in which case we'll need to wait for the above-mentioned issue to conclude. There's also the possibility that on first execution cuDNN has to run a bunch of micro-benchmarks to determine some algorithm choices. I filed a WIP PR to cache that a while ago (JuliaGPU/CUDA.jl#1948) but haven't looked at it in a while. If it turns out that the TTF-epoch is dominated by that, I'll push on it a bit more.

RomeoV (Author) commented Aug 7, 2023

Another update: I ran a training similar to the code above, but without any FastAI.jl/FluxTraining.jl, i.e. just Flux.jl and Metalhead.jl (see code below).

Using the precompile approach from above, the timings are 27s for CPU only and 55s for GPU.

In particular, 55s is only about 15% less than 65s. In other words, my 65s measurement above seems to be dominated not by the FastAI infrastructure but by GPUCompiler etc. It might still be worth following through with this issue, or at least writing some instructions on how to make a startup package, but further improvements must come from the Flux infrastructure itself.

See code:

using Flux, Metalhead
import Flux: gradient
import Flux.OneHotArrays: onehotbatch

device_ = Flux.gpu  # set to Flux.cpu for the CPU-only timing
labels = ["0", "1", "2", "3", "4", "5", "6", "7", "8", "9"]
data = ([rand(Float32, 32, 32, 3) for _ in 1:100],
        [rand(labels) for _ in 1:100])
model = ResNet(18; nclasses=10) |> device_
train_loader = Flux.DataLoader(data, batchsize=10, shuffle=true, collate=true)
opt = Flux.Optimise.Descent()
ps = Flux.params(model)
loss = Flux.Losses.logitcrossentropy
for epoch in 1:2
    for (x, y) in train_loader
        yb = onehotbatch(y, labels) |> device_
        model(x |> device_)  # warm-up forward pass
        grads = gradient(ps) do
            loss(model(x |> device_), yb)
        end
        Flux.Optimise.update!(opt, ps, grads)
    end
end
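
A minimal timing harness for numbers like these: save the script under some name, say train_flux.jl (a hypothetical filename), and time it from a fresh Julia session so nothing is compiled yet:

@time include("train_flux.jl")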

ToucheSir (Member) commented:

I suspected as much. You'll want to drill further down into the timings to see if something like JuliaGPU/GPUCompiler.jl#65 is at play.
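
For example, one generic way to drill down on the Julia side (a sketch assuming SnoopCompile.jl; not something prescribed here) is to capture inference timings for the first run:

using SnoopCompileCore
tinf = @snoopi_deep include("train_flux.jl")  # hypothetical script from above
using SnoopCompile
itimes = flatten(tinf)
sort(itimes; by=exclusive)[end-9:end]  # the ten most expensive inference frames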

RomeoV (Author) commented Aug 7, 2023

Thanks. When I find some time, I'll also check whether JuliaGPU/CUDA.jl#1947 helps. But I'll probably move that discussion somewhere else.

RomeoV (Author) commented Sep 26, 2023

Update on this: since JuliaGPU/CUDA.jl#2006 seems to be fixed, it's possible to just write your own little precompile directive, which reduces the TTF-epoch to about 12 seconds -- quite workable!

FastAIStartup.jl:

module FastAIStartup
using FastAI, FastVision, Metalhead
import FastVision: RGB, N0f8
import Flux
import Flux: gradient
import Flux.OneHotArrays: onehotbatch

import PrecompileTools: @setup_workload, @compile_workload
@setup_workload begin
    labels = ["0", "1", "2", "3", "4", "5", "6", "7", "8", "9"]
    @compile_workload begin
        # with FastAI.jl; data is mocked so precompilation needs no downloads
        data = ([rand(RGB{N0f8}, 32, 32) for _ in 1:100],
                [rand(labels) for _ in 1:100])
        blocks = (Image{2}(), FastAI.Label{String}(labels))
        task = ImageClassificationSingle(blocks)
        learner = tasklearner(task, data,
                              backbone=backbone(EfficientNet(:b0)),
                              callbacks=[ToGPU()])
        fitonecycle!(learner, 2)
    end
end
end # module FastAIStartup
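
To benefit from the cached code, the package would be dev'ed into the environment and loaded before anything else; a hypothetical session:

# pkg> dev ./FastAIStartup    (hypothetical local path)
using FastAIStartup   # brings in FastAI etc. with cached native code
include("benchmark.jl")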

benchmark.jl:

using FastAI, FastVision, Metalhead
import FastVision: RGB, N0f8
import Flux
import Flux: gradient
import Flux.OneHotArrays: onehotbatch

labels = ["0", "1", "2", "3", "4", "5", "6", "7", "8", "9"]
data = ([rand(RGB{N0f8}, 64, 64) for _ in 1:100],
        [rand(labels) for _ in 1:100])
blocks = (Image{2}(), FastAI.Label{String}(labels))
task = ImageClassificationSingle(blocks)
learner = tasklearner(task, data,
                      backbone=backbone(EfficientNet(:b0)),
                      callbacks = [ToGPU()])
fitonecycle!(learner, 2)

julia> @time include("benchmark.jl")
 11.546966 seconds (7.37 M allocations: 731.768 MiB, 4.15% gc time, 27.73% compilation time: 3% of which was recompilation)

RomeoV (Author) commented Sep 26, 2023

I still think it makes sense to move some of the precompile directives into this package itself.
Very broadly, something like:

@compile_workload begin
    labels = ["0", "1", "2", "3", "4", "5", "6", "7", "8", "9"]
    data = ([rand(RGB{N0f8}, 64, 64) for _ in 1:100],
            [rand(labels) for _ in 1:100])
    blocks = (Image{2}(), FastAI.Label{String}(labels))
    task = ImageClassificationSingle(blocks)
    learner = tasklearner(task, data,
                          backbone=backbone(mockmodel(task)))
    fitonecycle!(learner, 2)

    # enable this somehow only if CUDA is loaded?
    learner_gpu = tasklearner(task, data,
                              backbone=backbone(mockmodel(task)),
                              callbacks=[ToGPU()])
    fitonecycle!(learner_gpu, 2)
end
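
Regarding the "only if CUDA is loaded" part: one option (a sketch assuming Julia 1.9 package extensions; not a settled design) would be to declare CUDA as a weak dependency and run the GPU workload from an extension that only loads together with CUDA.jl:

# In Project.toml (the UUID is CUDA.jl's registered one):
#
#   [weakdeps]
#   CUDA = "052768ef-5323-5732-b1bb-66c8b64840ba"
#
#   [extensions]
#   FastAICUDAExt = "CUDA"

# ext/FastAICUDAExt.jl -- hypothetical extension module
module FastAICUDAExt

using FastAI, CUDA
import PrecompileTools: @compile_workload

if CUDA.functional()  # skip machines without a working GPU
    @compile_workload begin
        # GPU variant of the workload above, i.e. with callbacks = [ToGPU()]
    end
end

end # module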

ToucheSir (Member) commented Sep 27, 2023

I'm on board with adding precompile workloads, but only if we can ensure they don't use a bunch of CPU + memory at runtime (compile time is fine), don't modify any global state (e.g. the default RNG), and don't do any I/O. That last one is the most important, because it has caused hangs during precompilation for other packages. That may mean strategic calls to precompile in some places instead of relying solely on PrecompileTools.
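
For illustration, such strategic calls might look like the following (the argument types are assumptions that depend on FastAI's internals); precompile only compiles a method and never runs it, so it cannot do I/O or touch global state:

precompile(tasklearner, (SupervisedTask, Tuple{Vector, Vector}))
precompile(fitonecycle!, (Learner, Int))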
