
Significant compile time latency in Flux with default optimization #1126

Closed
Alexander-Barth opened this issue Nov 27, 2021 · 21 comments

@Alexander-Barth

Alexander-Barth commented Nov 27, 2021

I am trying to optimize a super-resolution GAN, and I see very significant compile time.
This issue is discussed here:
https://discourse.julialang.org/t/significant-compile-time-latency-in-flux-with-a-gan/68518

@ToucheSir was able to reduce this issue to the code below (thanks a lot!).

I am using Julia 1.6.3 on Linux and these packages:

(@v1.6) pkg> st Zygote ZygoteRules Flux
      Status `~/.julia/environments/v1.6/Project.toml`
  [587475ba] Flux v0.12.8 `https://github.com/FluxML/Flux.jl.git#master`
  [e88e6eb3] Zygote v0.6.30
  [700de1a5] ZygoteRules v0.2.2

It appears that this issue is sensitive to the optimization level and to the Julia version used.

With Julia 1.6.3 (default optimization):

 2.129568 seconds (6.12 M allocations: 352.760 MiB, 5.21% gc time, 99.96% compilation time)
  0.000323 seconds (1.80 k allocations: 138.938 KiB)
2740.221683 seconds (62.97 M allocations: 3.596 GiB, 0.05% gc time, 0.44% compilation time)
  0.004320 seconds (12.04 k allocations: 2.761 MiB)

Note the timing of the first call to loss_grad(lr_images, ps).

With Julia 1.5.3 I get the following timing:

  1.969913 seconds (8.92 M allocations: 458.885 MiB, 6.31% gc time)
  0.000265 seconds (1.30 k allocations: 102.344 KiB)
 25.426727 seconds (57.88 M allocations: 2.927 GiB, 3.93% gc time)
  0.005641 seconds (14.71 k allocations: 1.047 MiB)

with "julia 1.6.3 -O1":

 1.389465 seconds (6.12 M allocations: 352.764 MiB, 25.61% gc time, 99.94% compilation time)
  0.000403 seconds (2.10 k allocations: 143.625 KiB)
 28.301099 seconds (62.99 M allocations: 3.596 GiB, 3.96% gc time, 26.40% compilation time)
  0.004889 seconds (12.94 k allocations: 2.775 MiB)
The reduced code:

using Flux

channels = 4

function resblock(channels)
    return SkipConnection(Chain(
        Conv((3, 3), channels => channels, pad=1),
        Conv((3, 3), channels => channels, pad=1),
    ), +)
end

model = Chain(
    SkipConnection(
        Chain(
            resblock(channels),
            resblock(channels),
            resblock(channels),
            resblock(channels),
            resblock(channels),

            resblock(channels),
            resblock(channels),
            resblock(channels),
            resblock(channels),
            resblock(channels),

            resblock(channels),
            resblock(channels),
            resblock(channels),
            resblock(channels),
            resblock(channels),
        ),
    +),
    AdaptiveMeanPool((1, 1))
)

display(model)
println()
@show typeof(model) 

loss(x) = sum(model(x))

lr_images = randn(Float32, 2, 2, channels, 1)
@time loss(lr_images)
@time loss(lr_images) 
    
loss_grad(x, ps) = gradient(() -> loss(x), ps)
    
ps = Flux.params(model)
@time loss_grad(lr_images, ps)
@time loss_grad(lr_images, ps)

This issue might be related: #1119

@mcabbott
Member

With Julia master I get this:

julia> @time loss_grad(lr_images, ps);
 63.622937 seconds (67.18 M allocations: 3.529 GiB, 1.54% gc time, 99.91% compilation time)

julia> @time loss_grad(lr_images, ps);
  0.030501 seconds (13.07 k allocations: 2.931 MiB)

julia> @btime loss_grad($lr_images, $ps);
  min 33.048 ms, mean 33.933 ms (12887 allocations, 2.91 MiB. GC mean 0.94%)

julia> model.layers[1].layers.layers |> length  # 40s compile time with 10 layers, no huge jump
15

julia> VERSION
v"1.8.0-DEV.1063"

If I replace the long tuple with a vector, very crudely:

struct VChain
  layers
  VChain(xs...) = new(collect(xs))
end
Flux.@forward VChain.layers Base.getindex, Base.length, Base.first, Base.last,
  Base.iterate, Base.lastindex, Base.keys

Flux.functor(::Type{<:VChain}, c) = c.layers, ls -> VChain(ls...)

(c::VChain)(x) = applychain(c.layers, x)
applychain(fs, x) = isempty(fs) ? x : applychain(fs[2:end], first(fs)(x))

... then I get better times, which is perhaps a similar story to the -O1 times above.

julia> @time loss_grad(lr_images, ps);
 14.622136 seconds (66.20 M allocations: 3.480 GiB, 5.22% gc time, 99.71% compilation time)

julia> @time loss_grad(lr_images, ps);
  0.032955 seconds (13.47 k allocations: 1.664 MiB)

julia> @btime loss_grad($lr_images, $ps);
  min 36.019 ms, mean 36.910 ms (13381 allocations, 1.66 MiB. GC mean 0.39%)
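
For concreteness, the same 15-resblock model from the issue could be rebuilt on VChain like this (hypothetical usage, reusing resblock from the original post; vmodel is a made-up name):

using Flux  # plus the VChain definition above

vmodel = Chain(
    SkipConnection(VChain((resblock(channels) for _ in 1:15)...), +),
    AdaptiveMeanPool((1, 1)),
)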

@ToucheSir
Member

ToucheSir commented Nov 28, 2021

Interesting that it allocates less as well. I wonder if switching from applychain to foldl, now that the latter has been optimized, would help. Another thing to test would be applying @nograd to DenseConvDims and ConvDims. I tried ChainRulesCore.@non_differentiable with no luck, but there's a good chance that was user error. Creating an MWE without Flux or NNlib is tough, because it basically requires copying over definitions from both wholesale.
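
For reference, a minimal sketch of that second idea. The two-argument constructor signature here is an assumption and would need checking against NNlib's actual methods:

using ChainRulesCore, NNlib

# Treat construction of the conv-dims descriptor as non-differentiable
# bookkeeping so AD does not trace through it (signature is an assumption;
# the other ConvDims subtypes would be handled analogously):
ChainRulesCore.@non_differentiable NNlib.DenseConvDims(::Any, ::Any)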

@mcabbott
Member

mcabbott commented Nov 29, 2021

Oh right, foldl is a good idea, instead of applychain's tuple-peeling. Quick attempt:

using Flux
struct FChain{T}
  layers::T
  FChain(xs...) = new{typeof(xs)}(xs)
  FChain(xs::AbstractVector) = new{typeof(xs)}(xs)  # allows FChain(Any[Layer, Layer, ...]), type-unstable
end
Flux.@forward FChain.layers Base.getindex, Base.length, Base.first, Base.last,
  Base.iterate, Base.lastindex, Base.keys

Flux.functor(::Type{<:FChain}, c) = c.layers, ls -> FChain(ls...)
Flux.functor(::Type{<:FChain{<:AbstractVector}}, c) = c.layers, ls -> FChain(ls)  # maybe?

# (c::FChain)(x) = foldl((y,f) -> f(y), c.layers; init=x) # NO, this forgets the gradient for x
(c::FChain)(x) = foldl((y,f) -> f(y), (x, c.layers...))
(c::FChain{<:AbstractVector})(x) = foldl((y,f) -> f(y), vcat([x], c.layers))

julia> @time loss_grad(lr_images, ps);
 23.586338 seconds (68.82 M allocations: 3.611 GiB, 3.41% gc time, 99.85% compilation time)

julia> @time loss_grad(lr_images, ps);
  0.026522 seconds (13.36 k allocations: 2.719 MiB)

julia> model.layers[1].layers.layers |> length
15

julia> @btime loss_grad($lr_images, $ps);
  min 32.178 ms, mean 32.809 ms (13357 allocations, 2.72 MiB. GC mean 0.75%)

That's on a cold start. By accident, running this example after some unrelated checks (m = FChain(Dense(2,3), Dense(3,4)); g = gradient(() -> sum(m([1,2])), params(m)); g[m.layers[1].weight]) I got a much faster time:

 13.804523 seconds (23.11 M allocations: 1.178 GiB, 1.70% gc time, 99.72% compilation time)

So perhaps 10s of this is generic startup (and perhaps that could be helped by precompilation?) and not the effects which scale. Comparable warm start for the original Chain:

 52.581168 seconds (21.92 M allocations: 1.122 GiB, 0.59% gc time, 99.88% compilation time)

@ToucheSir
Member

There are precompilation statements for explicit params, but not for the version of pullback that uses implicit params: https://github.com/FluxML/Zygote.jl/blob/master/src/precompile.jl. I'm not sure if, or how much of, a difference this would make.
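
As an illustration, a statement covering the implicit-params entry point might look roughly like this (a guessed signature, not something taken from that file):

precompile(Tuple{typeof(Zygote.pullback), Function, Zygote.Params})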

@DhairyaLGandhi
Member

Most of the code is shared between the two modes, and that shared code would be cached by precompilation anyway, so I don't know how much of a difference it would make.

@Alexander-Barth
Author

Thank you all for the insights. I am impressed by how easy it is to implement an alternative Chain structure. The proposed foldl approach also works well in my original SRGAN implementation.

@mcabbott
Member

The foldl approach might be worth adopting in general. My only reservation is that I'm not certain it will handle 2nd derivatives; it's meant to, but this isn't tested.

A bigger change would be to allow, or even automatically switch to, a deliberately type-unstable approach. This could be nicer (and probably faster) than my crude VChain above. But some people want very small Chains to be fast; an automatic switch would probably need some heuristic to decide when (by counting sizes on construction?) and sounds like a big project.
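
A sketch of what such a switch might look like, using the VChain above as the type-unstable fallback (make_chain and the cutoff value are hypothetical):

# Hypothetical heuristic: keep a tuple-backed Chain (type-stable, fast for
# short chains) below a cutoff, and fall back to the Vector-backed VChain
# (cheaper to compile) above it.
function make_chain(layers...; unroll_limit = 10)
    length(layers) <= unroll_limit ? Chain(layers...) : VChain(layers...)
end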

@ToucheSir
Member

ToucheSir commented Nov 29, 2021

@Alexander-Barth if you have them, I'd love to see how the times look now on 1.6/1.7.

If we can think of some easy tests to ensure foldl is 2nd derivative safe, I'd vote for switching to it asap. For the type stability tradeoff, we might be able to get away with simply allowing Chain.layers to be non-tuple container types. Then folks not terribly worried about small chain performance could opt to pass a Vector{Any} instead.
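
One cheap smoke test along those lines, using the FChain above; whether Zygote can actually nest gradients here is exactly what such a test would check:

using Flux, Zygote  # plus the FChain definition above

m = FChain(Dense(2, 2, tanh), Dense(2, 1))
x = randn(Float32, 2, 3)

# Differentiate the squared norm of the input gradient, i.e. a 2nd derivative:
g2 = gradient(x) do y
    sum(abs2, gradient(z -> sum(m(z)), y)[1])
end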

@mcabbott
Member

True, allowing Chain([Layer, Layer, Layer]) might not be so strange. I believe foldl should handle that too; I have added a method to FChain above. Although it would be worth running the race against VChain.

@Alexander-Barth
Author

@ToucheSir Of course, here are the times for FChain (with a simplified SRGAN model)

Julia 1.7.0-rc3 (first and 2nd call of pullback)

 33.978640 seconds (76.75 M allocations: 3.952 GiB, 4.16% gc time, 99.75% compilation time)
  0.655267 seconds (49.21 k allocations: 3.040 MiB, 98.58% compilation time)

Julia 1.6.3:

 32.954155 seconds (71.80 M allocations: 4.020 GiB, 4.61% gc time, 66.53% compilation time)
  0.579941 seconds (51.56 k allocations: 3.501 MiB, 98.33% compilation time)

So the timings are very similar. Surprisingly, VChain is a bit faster:

Julia 1.7.0-rc3 (first and 2nd call of pullback)

24.264638 seconds (64.14 M allocations: 3.350 GiB, 5.62% gc time, 99.71% compilation time)                                                                           
  0.041324 seconds (48.69 k allocations: 2.564 MiB, 78.66% compilation time)       

Julia 1.6.3:

24.329194 seconds (61.18 M allocations: 3.471 GiB, 5.11% gc time, 64.15% compilation time)
  0.042680 seconds (51.36 k allocations: 3.016 MiB, 78.65% compilation time)

@mcabbott
Member

Thanks for checking. Interesting that the 2nd run with FChain is 10x slower than with VChain. Do they settle down (or @btime) to similar numbers?

@Alexander-Barth
Author

Indeed, with @btime the runtimes of both approaches are very similar:

         Julia 1.6.3                               Julia 1.7.0-rc3
FChain   7.922 ms (8874 allocations: 1.13 MiB)     7.736 ms (8571 allocations: 1009.80 KiB)
VChain   7.931 ms (8675 allocations: 687.95 KiB)   7.696 ms (8023 allocations: 548.80 KiB)

@ToucheSir
Member

Is foldl with init not working a Zygote thing or a ChainRules thing? I imagine getting that ironed out would result in a far more competitive FChain.

@mcabbott
Member

I think it's both, really -- by convention neither package believes keyword arguments should have gradients. This is perhaps the most convincing example I've seen where the gradient would be desired. Cc @oxinabox for design thoughts?

I edited the code above to make x the first element: foldl((y,f) -> f(y), (x, c.layers...)). I believe this should be approximately free when c.layers isa Tuple, but haven't checked.
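
To illustrate the pitfall on a tiny example (a sketch; per the convention above, the gradient through init is expected to be dropped):

using Zygote

fs = (sin, cos)
gradient(x -> foldl((y, f) -> f(y), fs; init = x), 1.0)  # x enters via a kwarg, so its gradient is expected to be lost
gradient(x -> foldl((y, f) -> f(y), (x, fs...)), 1.0)    # x is a positional element, so it gets a gradient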

@oxinabox
Member

In theory, I think we can write rules for gradients w.r.t keyword arguments by writing rules for:

rrule(::RuleConfig, ::Core.kwftype(typeof(foldl)), kwargs::NamedTuple, ::typeof(foldl), op, iter)

But I have never tested that.
It may be that the kwarg handling code that is there now doesn't allow that to work because it redirects to

rrule(::RuleConfig, ::typeof(foldl), op, iter; kwargs...)

I think it might work if the version that takes kwargs without gradients isn't defined though.

@mcabbott
Member

mcabbott commented Jan 10, 2022

Here's the effect on the above example of adding Base.Experimental.@optlevel 0; Base.Experimental.@max_methods 1 to both Zygote and IRTools.
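
For clarity about placement, these annotations go at the top of a module body and apply to all methods defined inside it (toy module for illustration):

module MyADCore
    Base.Experimental.@optlevel 0     # compile this module's code at -O0
    Base.Experimental.@max_methods 1  # let inference consider at most one method match per call site
    slowpath(x) = x + 1
end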

I believe that most of the speedup is coming from adding it to Zygote; it would be worth separating the effect of each of these. We should also benchmark some functions with more Zygote and less BLAS, I guess, to see whether this has bad effects elsewhere:

julia> # @eval Flux (c::Chain)(x) = Base.afoldl(|>, x, c.layers...)

       @time loss(lr_images);  # first Flux.Chain call, cold start, no Zygote
  4.260597 seconds (30.96 M allocations: 1.634 GiB, 7.06% gc time, 99.79% compilation time)  # tagged
  4.621733 seconds (31.05 M allocations: 1.638 GiB, 7.00% gc time, 99.78% compilation time)  # Base.Experimental IRTools + Zygote

julia> @time loss(lr_images);
  0.007167 seconds (2.47 k allocations: 259.031 KiB)

julia> @btime loss($lr_images);
  min 7.199 ms, mean 7.768 ms (1882 allocations, 203.14 KiB)  # tagged
  min 7.373 ms, mean 7.585 ms (2432 allocations, 255.34 KiB)  # tagged, later

julia> loss_grad(x, ps) = gradient(() -> loss(x), ps);

julia> ps = Flux.params(model);

julia> @time loss_grad(lr_images, ps);  # first Zygote.gradient, cold start
 61.904481 seconds (67.77 M allocations: 3.523 GiB, 1.08% gc time, 99.88% compilation time)  # tagged
 14.570048 seconds (63.99 M allocations: 3.330 GiB, 3.65% gc time, 99.67% compilation time)  # Base.Experimental IRTools + Zygote
 14.803469 seconds (64.02 M allocations: 3.331 GiB, 3.65% gc time, 99.66% compilation time)  # Base.Experimental IRTools + Zygote + ZygoteRules
 15.253339 seconds (64.76 M allocations: 3.369 GiB, 3.76% gc time, 99.68% compilation time)  # Base.Experimental Zygote

julia> @time loss_grad(lr_images, ps);  # second
  0.036655 seconds (13.07 k allocations: 2.928 MiB)
  0.037986 seconds (13.14 k allocations: 2.983 MiB)

julia> @btime $loss_grad($lr_images, $ps);
  min 38.108 ms, mean 38.486 ms (13060 allocations, 2.93 MiB)  # tagged
  min 37.827 ms, mean 38.431 ms (13024 allocations, 2.92 MiB)  # tagged, later
  min 39.340 ms, mean 40.376 ms (13090 allocations, 2.98 MiB)  # Base.Experimental IRTools + Zygote
  min 40.165 ms, mean 41.283 ms (13122 allocations, 2.98 MiB)  # Base.Experimental IRTools + Zygote + ZygoteRules
  min 39.469 ms, mean 40.169 ms (13124 allocations, 2.98 MiB)  # Base.Experimental Zygote

julia> length(model.layers[1].layers.layers)  # count of resblock(channels) layers
15

##############################
# With Base.afoldl for Chain:

julia> @time loss(lr_images);  # first Flux.Chain call, cold start, no Zygote
  3.790401 seconds (30.96 M allocations: 1.634 GiB, 8.04% gc time, 99.83% compilation time)  # tagged
  3.732328 seconds (30.98 M allocations: 1.635 GiB, 7.78% gc time, 99.83% compilation time)  # Base.Experimental Zygote

julia> @time loss(lr_images);
  0.005440 seconds (2.47 k allocations: 245.625 KiB)

julia> @btime loss($lr_images);
  min 5.854 ms, mean 6.082 ms (2453 allocations, 243.89 KiB)  # tagged
  min 6.009 ms, mean 6.153 ms (2462 allocations, 245.47 KiB)  # Base.Experimental Zygote

julia> @time loss_grad(lr_images, ps);  # first Zygote.gradient, cold start
 21.698580 seconds (78.74 M allocations: 4.044 GiB, 3.35% gc time, 99.76% compilation time)  # tagged
 14.395047 seconds (75.68 M allocations: 3.890 GiB, 4.68% gc time, 99.68% compilation time)  # Base.Experimental IRTools + Zygote
 13.474056 seconds (75.70 M allocations: 3.891 GiB, 4.54% gc time, 99.66% compilation time)  # Base.Experimental IRTools + Zygote + ZygoteRules
 14.054168 seconds (76.63 M allocations: 3.934 GiB, 4.57% gc time, 99.68% compilation time)  # Base.Experimental Zygote

julia> @btime $loss_grad($lr_images, $ps);
  min 41.539 ms, mean 43.545 ms (17348 allocations, 1.80 MiB)  # tagged
  min 40.177 ms, mean 40.682 ms (17390 allocations, 1.81 MiB)  # tagged, later

  min 40.027 ms, mean 40.519 ms (17439 allocations, 1.82 MiB)  # Base.Experimental IRTools + Zygote
  min 41.294 ms, mean 46.930 ms (16991 allocations, 1.78 MiB)  # Base.Experimental IRTools + Zygote + ZygoteRules
  min 40.574 ms, mean 41.111 ms (17421 allocations, 1.82 MiB)  # Base.Experimental Zygote

########################
# With `foldl` instead:

julia> # @eval Flux (c::Chain)(x) = Base.afoldl(|>, x, c.layers...)
       @eval Flux (c::Chain)(x) = foldl(|>, (x, c.layers...))

julia> @time loss(lr_images);  # first Flux.Chain call, cold start, no Zygote
  4.412313 seconds (32.94 M allocations: 1.749 GiB, 7.79% gc time, 99.76% compilation time)  # tagged
  4.478969 seconds (32.97 M allocations: 1.750 GiB, 6.88% gc time, 99.81% compilation time)  # Base.Experimental Zygote

julia> @time loss(lr_images);
  0.008333 seconds (2.50 k allocations: 267.562 KiB)

julia> @btime loss($lr_images);
  min 8.338 ms, mean 8.797 ms (2452 allocations, 262.05 KiB)
  min 8.208 ms, mean 8.842 ms (2480 allocations, 265.52 KiB)

julia> loss_grad(x, ps) = gradient(() -> loss(x), ps);

julia> ps = Flux.params(model);

julia> @time loss_grad(lr_images, ps);  # first Zygote.gradient, cold start
 45.873403 seconds (72.27 M allocations: 3.775 GiB, 1.52% gc time, 99.86% compilation time)  # tagged
 30.731573 seconds (69.52 M allocations: 3.630 GiB, 2.07% gc time, 99.82% compilation time)  # Base.Experimental Zygote

julia> @time loss_grad(lr_images, ps);  # second
  0.048166 seconds (14.88 k allocations: 4.123 MiB)

julia> @btime $loss_grad($lr_images, $ps);
  min 49.859 ms, mean 51.324 ms (14801 allocations, 4.12 MiB)  # tagged
  min 46.516 ms, mean 48.572 ms (14894 allocations, 4.09 MiB)  # Base.Experimental Zygote

julia> length(model.layers[1].layers.layers)  # count of resblock(channels) layers
15

@ToucheSir
Member

There are most certainly better test cases, but off the top of my head RNNs (especially pointwise-heavy cells like LSTM) should have a higher non-BLAS to BLAS ratio.

@mcabbott
Member

Indeed, maybe from FluxML/Flux.jl#1761. (Which still isn't merged?)

@ToucheSir
Member

Pre-#1761 might be better, as AD is more involved there.
(A lot of Flux PRs are blocked on Bors working with CI, which should be addressed by FluxML/Flux.jl#1822.)

@mcabbott
Member

A much simpler test of some complicated broadcasting shows 3-10x slowdowns: #1147 (comment). So I doubt -O1 for the whole module is acceptable. Perhaps someone can see how to slice it into Zygote.Fast and Zygote.Slow?

Another idea is to replace foldl with a hand-unrolled @generated version. This is not as quick to compile as Base.afoldl above, but is faster at runtime: FluxML/Flux.jl#1809 (comment).
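
A sketch of that idea, in the same spirit as the linked comment (the exact code in the PR may differ):

# Unroll the chain at compile time: each layer application becomes its own
# statement in the generated body, avoiding both recursion and foldl.
@generated function applychain(layers::Tuple{Vararg{Any,N}}, x) where {N}
    symbols = vcat(:x, [gensym() for _ in 1:N])
    calls = [:($(symbols[i+1]) = layers[$i]($(symbols[i]))) for i in 1:N]
    Expr(:block, calls..., symbols[end])
end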

@CarloLucibello
Member

Closing in favor of #1119; please reopen if the two issues are not the same.
