
Use PrecompileTools.jl #284

Open
RomeoV opened this issue Aug 5, 2023 · 9 comments

RomeoV commented Aug 5, 2023

Motivation and description

Currently, the startup time for using this package is quite long.
For example, running the code snippet below takes about 80s on my machine, of which 99% is overhead (the two epochs are practically instant).

For comparison, a basic Flux model only takes about 6s after startup. Since in Julia 1.9 and 1.10 a lot of the compile time can be "cached away", I think we'd greatly benefit from integrating something like PrecompileTools.jl into the packages.

Possible Implementation

I saw there's already a workload.jl file (which basically just runs all the tests) that is used for sysimage creation. Perhaps we can do something similar for the PrecompileTools.jl directive.
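
Very roughly, and only as a sketch (the block types mirror the sample code below; the mocked data is an assumption so that nothing needs to be downloaded at precompile time), such a directive could look like:

using PrecompileTools: @setup_workload, @compile_workload

@setup_workload begin
    labels = string.(0:9)
    @compile_workload begin
        # mocked data stands in for a downloaded dataset
        data = ([rand(RGB{N0f8}, 32, 32) for _ in 1:10],
                [rand(labels) for _ in 1:10])
        blocks = (Image{2}(), Label{String}(labels))
        task = ImageClassificationSingle(blocks)
        learner = tasklearner(task, data)
        fitonecycle!(learner, 1)
    end
end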

I can try to get a PR started in the coming days.

Sample code

using FastAI, FastVision, Metalhead, Random

# Load the MNIST recipe and subsample 100 observations to keep the two epochs short.
data, blocks = load(datarecipes()["mnist_png"])
idx = randperm(length(data[1]))[1:100]
data_ = (mapobs(data[1].f, data[1].data[idx]), mapobs(data[2].f, data[2].data[idx]))

task = ImageClassificationSingle(blocks)
learner = tasklearner(task, data_, backbone=ResNet(18).layers[1], callbacks=[ToGPU()])
fitonecycle!(learner, 2)
exit()
lorenzoh (Member) commented Aug 5, 2023

Sounds good! Have you tested the speedup already?

RomeoV (Author) commented Aug 5, 2023

Some issues I'm running into:

  • datarecipes is not available at precompile time. Also, I suppose we don't have access to any data that needs to be downloaded. @lorenzoh, does FastVision ship with some datarecipes before running the module's __init__ function? Otherwise I'll just mock some data.
  • Currently, CUDA precompilation seems to be broken. See "Cannot precompile GPU code with PrecompileTools" (JuliaGPU/CUDA.jl#2006).
  • Since FastVision doesn't depend on Metalhead (fair enough), I can't precompile with the ResNet, so it probably also makes sense to mock a model.

For now, I'm just trying to test what speedup we could hope for by making a separate "startup" package (as suggested here) that loads all of FastVision, Metalhead, etc. and basically has my code above as a precompile workload, but without the GPU and with mocked data. I'll report what speedup that brings.
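
For reference, the skeleton of such a startup package might look like this (the name FastAIStartup and the layout are placeholders):

FastAIStartup/
├── Project.toml            # depends on FastAI, FastVision, Metalhead, Flux, PrecompileTools
└── src/
    └── FastAIStartup.jl    # loads the dependencies and runs a precompile workload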

RomeoV (Author) commented Aug 5, 2023

Hmm, this approach brings the TTF-epoch from 77s down to about 65s, which is a speedup for sure, but I was hoping for even more. I'll have to look a bit deeper at where the time is spent. It might be all GPU stuff, in which case we'll need to wait for the above-mentioned issue to conclude. There's also the possibility that on first execution cuDNN has to run a bunch of micro-benchmarks to determine some algorithm choices. I filed a WIP PR to cache that a while ago (JuliaGPU/CUDA.jl#1948) but haven't looked at it in a while. If it turns out that the TTF-epoch is dominated by that, I'll push on it a bit more.

RomeoV (Author) commented Aug 7, 2023

Another update: I ran a training similar to the code above, but without any FastAI.jl/FluxTraining.jl, i.e. just Flux.jl and Metalhead.jl (see code below).

Using the precompile approach from above, the timings are 27s for CPU only and 55s for GPU.

In particular, 55s is only about 15% less than 65s. In other words, my 65s measurement above seems to be dominated not by the FastAI infrastructure but by GPUCompiler etc. It might still be worth following through with this issue, or at least writing some instructions on how to make a startup package, but further improvements must come from the Flux infrastructure itself.

See code:

using Flux, Metalhead
import Flux: gradient
import Flux.OneHotArrays: onehotbatch

device_ = Flux.gpu  # set to Flux.cpu for the CPU-only timing
labels = ["0", "1", "2", "3", "4", "5", "6", "7", "8", "9"]
data = ([rand(Float32, 32, 32, 3) for _ in 1:100],
        [rand(labels) for _ in 1:100])
model = ResNet(18; nclasses=10) |> device_
train_loader = Flux.DataLoader(data, batchsize=10, shuffle=true, collate=true)
opt = Flux.Optimise.Descent()
ps = Flux.params(model)
loss = Flux.Losses.logitcrossentropy
for epoch in 1:2
    for (x, y) in train_loader
        yb = onehotbatch(y, labels) |> device_
        model(x |> device_)  # warm-up forward pass
        grads = gradient(ps) do
            loss(model(x |> device_), yb)
        end
        Flux.Optimise.update!(opt, ps, grads)
    end
end
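
A minimal timing harness for numbers like these: save the script under some name, say train_flux.jl (a hypothetical filename), and time it from a fresh Julia session so nothing is compiled yet:

@time include("train_flux.jl")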

ToucheSir (Member) commented:

I suspected as much. You'll want to drill further down into the timings to see if something like JuliaGPU/GPUCompiler.jl#65 is at play.
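
For example, one generic way to drill down on the Julia side (a sketch assuming SnoopCompile.jl; not something prescribed here) is to capture inference timings for the first run:

using SnoopCompileCore
tinf = @snoopi_deep include("train_flux.jl")  # hypothetical script from above
using SnoopCompile
itimes = flatten(tinf)
sort(itimes; by=exclusive)[end-9:end]  # the ten most expensive inference frames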

RomeoV (Author) commented Aug 7, 2023

Thanks. When I find some time, I'll also check whether JuliaGPU/CUDA.jl#1947 helps. But I'll probably move that discussion somewhere else.

RomeoV (Author) commented Sep 26, 2023

Update on this: since JuliaGPU/CUDA.jl#2006 seems to be fixed, it's possible to just write your own little precompile directive, which reduces the TTF-epoch to about 12 seconds -- quite workable!

FastAIStartup.jl:

module FastAIStartup
using FastAI, FastVision, Metalhead
import FastVision: RGB, N0f8
import Flux
import Flux: gradient
import Flux.OneHotArrays: onehotbatch

import PrecompileTools: @setup_workload, @compile_workload
@setup_workload begin
    labels = ["0", "1", "2", "3", "4", "5", "6", "7", "8", "9"]
    @compile_workload begin
        # with FastAI.jl; data is mocked so precompilation needs no downloads
        data = ([rand(RGB{N0f8}, 32, 32) for _ in 1:100],
                [rand(labels) for _ in 1:100])
        blocks = (Image{2}(), FastAI.Label{String}(labels))
        task = ImageClassificationSingle(blocks)
        learner = tasklearner(task, data,
                              backbone=backbone(EfficientNet(:b0)),
                              callbacks=[ToGPU()])
        fitonecycle!(learner, 2)
    end
end
end # module FastAIStartup
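
To benefit from the cached code, the package would be dev'ed into the environment and loaded before anything else; a hypothetical session:

# pkg> dev ./FastAIStartup    (hypothetical local path)
using FastAIStartup   # brings in FastAI etc. with cached native code
include("benchmark.jl")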

benchmark.jl:

using FastAI, FastVision, Metalhead
import FastVision: RGB, N0f8
import Flux
import Flux: gradient
import Flux.OneHotArrays: onehotbatch

labels = ["0", "1", "2", "3", "4", "5", "6", "7", "8", "9"]
data = ([rand(RGB{N0f8}, 64, 64) for _ in 1:100],
        [rand(labels) for _ in 1:100])
blocks = (Image{2}(), FastAI.Label{String}(labels))
task = ImageClassificationSingle(blocks)
learner = tasklearner(task, data,
                      backbone=backbone(EfficientNet(:b0)),
                      callbacks = [ToGPU()])
fitonecycle!(learner, 2)

julia> @time include("benchmark.jl")
 11.546966 seconds (7.37 M allocations: 731.768 MiB, 4.15% gc time, 27.73% compilation time: 3% of which was recompilation)

RomeoV (Author) commented Sep 26, 2023

I still think it makes sense to move some of the precompile directives into this package itself.
Very broadly, something like:

@compile_workload begin
    labels = ["0", "1", "2", "3", "4", "5", "6", "7", "8", "9"]
    data = ([rand(RGB{N0f8}, 64, 64) for _ in 1:100],
            [rand(labels) for _ in 1:100])
    blocks = (Image{2}(), FastAI.Label{String}(labels))
    task = ImageClassificationSingle(blocks)
    learner = tasklearner(task, data,
                          backbone=backbone(mockmodel(task)))
    fitonecycle!(learner, 2)

    # enable this somehow only if CUDA is loaded?
    learner_gpu = tasklearner(task, data,
                              backbone=backbone(mockmodel(task)),
                              callbacks=[ToGPU()])
    fitonecycle!(learner_gpu, 2)
end
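
Regarding the "only if CUDA is loaded" part: one option (a sketch assuming Julia 1.9 package extensions; not a settled design) would be to declare CUDA as a weak dependency and run the GPU workload from an extension that only loads together with CUDA.jl:

# In Project.toml (the UUID is CUDA.jl's registered one):
#
#   [weakdeps]
#   CUDA = "052768ef-5323-5732-b1bb-66c8b64840ba"
#
#   [extensions]
#   FastAICUDAExt = "CUDA"

# ext/FastAICUDAExt.jl -- hypothetical extension module
module FastAICUDAExt

using FastAI, CUDA
import PrecompileTools: @compile_workload

if CUDA.functional()  # skip machines without a working GPU
    @compile_workload begin
        # GPU variant of the workload above, i.e. with callbacks = [ToGPU()]
    end
end

end # module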

ToucheSir (Member) commented Sep 27, 2023

I'm on board with adding precompile workloads, but only if we can ensure they don't use a bunch of CPU + memory at runtime (compile time is fine), don't modify any global state (e.g. the default RNG), and don't do any I/O. That last one is the most important, because it has caused hangs during precompilation for other packages. That may mean strategic calls to precompile in some places instead of relying solely on PrecompileTools.
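
For illustration, such strategic calls might look like the following (the argument types are assumptions that depend on FastAI's internals); precompile only compiles a method and never runs it, so it cannot do I/O or touch global state:

precompile(tasklearner, (SupervisedTask, Tuple{Vector, Vector}))
precompile(fitonecycle!, (Learner, Int))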
