Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add dropout #454

Merged
merged 6 commits into from
Jan 5, 2023
Merged

Add dropout #454

merged 6 commits into from
Jan 5, 2023

Conversation

mcabbott
Copy link
Member

@mcabbott mcabbott commented Jan 3, 2023

This adds a dropout function.

At least on CPU, this should be about twice as fast as the one in Flux, and use less memory. There are some Flux PRs, but the code is messy, and I got lost & started from scratch.

It's not quite the fastest variant. The gradient can be faster if we believe that every zero in the output comes from the mask, never from the input. That may not be a terrible assumption -- if you don't have a relu before it, zeros are rare; if you do, then the gradient associated to a zero will be discarded anyway. But... perhaps it's best to be correct.

I can put a gist of these variations somewhere. But local benchmarks are:

let x = randn(Float32, 100, 100)
    @btime dropout($x, 0.3) # Flux
    @btime mydropout($x, 0.3)  # sans mask -- this PR
    @btime mydropout2($x, 0.3)  # with mask
end;
  # min 10.375 μs, mean 19.879 μs (13 allocations, 78.45 KiB)
  # min 5.292 μs, mean 10.736 μs (2 allocations, 39.11 KiB)
  # min 12.958 μs, mean 17.476 μs (5 allocations, 44.67 KiB)

let x = randn(Float32, 100, 100)
    @btime gradient(x -> sum(dropout(x, 0.3)), $x) # Flux
    @btime gradient(x -> sum(mydropout(x, 0.3)), $x)  # sans mask... but sadly wrong answers
    @btime gradient(x -> sum(mydropout2(x, 0.3)), $x)  # with boolean mask
    @btime gradient(x -> sum(mydropout3(x, 0.3)), $x)  # keep random numbers -- this PR
end;
  # min 28.292 μs, mean 49.702 μs (70 allocations, 158.55 KiB)
  # min 12.500 μs, mean 22.952 μs (20 allocations, 78.70 KiB)
  # min 26.916 μs, mean 37.291 μs (37 allocations, 84.69 KiB)
  # min 13.208 μs, mean 26.490 μs (23 allocations, 117.92 KiB)


let x = randn(Float32, 100, 100)
    @btime dropout($x, 0.3; dims=1)
    @btime mydropout($x, 0.3; dims=1)
    @btime mydropout2($x, 0.3; dims=1)
end;
  # min 1.363 μs, mean 5.047 μs (3 allocations, 39.59 KiB)
  # min 1.346 μs, mean 6.003 μs (3 allocations, 39.59 KiB)
  # min 1.675 μs, mean 5.735 μs (5 allocations, 39.72 KiB)

let x = randn(Float32, 100, 100)
    @btime gradient(x -> sum(dropout(x, 0.3; dims=1)), $x)
    @btime gradient(x -> sum(mydropout(x, 0.3; dims=1)), $x)
    @btime gradient(x -> sum(mydropout2(x, 0.3; dims=1)), $x)
    @btime gradient(x -> sum(mydropout3(x, 0.3; dims=1)), $x)
    @btime gradient(x -> sum(mydropout4(x, 0.3; dims=1)), $x)  # changed to same strategy as dims=:
end;
  # min 25.458 μs, mean 47.425 μs (108 allocations, 122.16 KiB)
  # min 8.722 μs, mean 17.797 μs (21 allocations, 79.19 KiB)
  # min 13.709 μs, mean 24.887 μs (33 allocations, 79.61 KiB)
  # min 8.680 μs, mean 18.442 μs (21 allocations, 79.19 KiB)
  # min 9.916 μs, mean 22.267 μs (21 allocations, 79.25 KiB)  # slightly slower, but simpler

Edit: timing things on the GPU, at first the dims=: case seemed slow, but now it's not...

# ] add https://github.com/mcabbott/NNlib.jl/tree/dropout

using NNlib, CUDA, BenchmarkTools, Zygote, Flux, ChainRulesCore, Random

NNlib._rng_from_array(::CuArray) = CUDA.default_rng()

NNlib.dropout(cu(ones(3,5)), 0.3)
NNlib.dropout(cu(ones(3,5)), 0.4; dims=1)

Flux.dropout(cu(ones(3,5)), 0.3)
Flux.dropout(cu(ones(3,5)), 0.4; dims=1)

# To try out the copy-the-RNG idea, only for gradients:

@eval Base.copy(r::CUDA.RNG) = $(Expr(:new, CUDA.RNG, :(r.seed), :(r.counter)))

using NNlib: _fast_broadcast!

mydropout6(args...; kw...) = dropout(args...; kw...)  # same forward as this PR
function ChainRulesCore.rrule(::typeof(mydropout6), rng::AbstractRNG, x::AbstractArray, p::Real; dims = :)
    dup_rng = copy(rng)   # save for backward pass
    y = dropout(dup_rng, x, p; dims)  # for other dims, could re-use buffer...
    val = convert(eltype(y), 1/(1-p))
    function back6(Δ)
        dims isa Colon || error("will be wrong")
        dx = rand!(dup_rng, similar(y))
        _fast_broadcast!(dx, unthunk(Δ)) do q, dy
            (q>p) * val * dy
        end
        (NoTangent(), NoTangent(), dx, NoTangent())
    end
    y, back6
end

ChainRulesCore.@non_differentiable CUDA.default_rng()

mydropout6(CUDA.default_rng(), cu(ones(3,5)), 0.3)
gradient(x -> sum(mydropout6(CUDA.default_rng(), x, 0.3)), cu(ones(3,5)))


##### Forward times & allocations

julia> let x = CUDA.rand(100, 1000)
           @btime CUDA.@sync Flux.dropout($x, 0.3)
           @btime CUDA.@sync NNlib.dropout($x, 0.3)
           println()
           @btime CUDA.@sync Flux.dropout($x, 0.3; dims=1)
           @btime CUDA.@sync NNlib.dropout($x, 0.3; dims=1)
       end;
  49.163 μs (94 allocations: 4.27 KiB)
  29.623 μs (48 allocations: 2.31 KiB)

  43.097 μs (85 allocations: 4.05 KiB)
  42.857 μs (85 allocations: 3.98 KiB)

# one pass strategy, for the last:
  39.118 μs (88 allocations: 5.33 KiB)


# First run?
  # 49.339 μs (94 allocations: 4.27 KiB)
  # 261.194 μs (41 allocations: 392.56 KiB)  # awful

  # 42.835 μs (85 allocations: 4.05 KiB)
  # 47.702 μs (77 allocations: 4.05 KiB)

julia> let x = CUDA.rand(1000, 10_000)
           CUDA.@time Flux.dropout(x, 0.3)
           CUDA.@time NNlib.dropout(x, 0.3)
           println()
           CUDA.@time Flux.dropout(x, 0.3; dims=1)
           CUDA.@time NNlib.dropout(x, 0.3; dims=1)
       end;
  0.000415 seconds (94 CPU allocations: 4.266 KiB) (2 GPU allocations: 76.294 MiB, 5.40% memmgmt time)
  0.000269 seconds (48 CPU allocations: 2.312 KiB) (1 GPU allocation: 38.147 MiB, 2.50% memmgmt time)

  0.000185 seconds (86 CPU allocations: 4.062 KiB) (2 GPU allocations: 38.151 MiB, 6.87% memmgmt time)
  0.000186 seconds (86 CPU allocations: 4.000 KiB) (2 GPU allocations: 38.151 MiB, 4.62% memmgmt time)

# one pass:
  0.000214 seconds (88 CPU allocations: 5.328 KiB) (2 GPU allocations: 38.151 MiB, 3.88% memmgmt time)

# Earlier:
  # 0.000474 seconds (94 CPU allocations: 4.266 KiB) (2 GPU allocations: 76.294 MiB, 2.89% memmgmt time)
  # 0.046926 seconds (41 CPU allocations: 38.149 MiB) (1 GPU allocation: 38.147 MiB, 0.02% memmgmt time)

  # 0.000227 seconds (86 CPU allocations: 4.062 KiB) (2 GPU allocations: 38.151 MiB, 12.16% memmgmt time)
  # 0.000235 seconds (78 CPU allocations: 7.641 KiB) (2 GPU allocations: 38.151 MiB, 4.83% memmgmt time)


##### Gradient times & allocations

julia> let x = CUDA.randn(100, 1000), rng = CUDA.default_rng()
           @btime CUDA.@sync gradient(x -> sum(Flux.dropout(x, 0.3)), $x)
           @btime CUDA.@sync gradient(x -> sum(NNlib.dropout(x, 0.3)), $x)
           @btime CUDA.@sync gradient(x -> sum(mydropout6($rng, x, 0.3)), $x)
           println()
           @btime CUDA.@sync gradient(x -> sum(Flux.dropout(x, 0.3; dims=1)), $x)
           @btime CUDA.@sync gradient(x -> sum(NNlib.dropout(x, 0.3; dims=1)), $x)
       end;
  225.455 μs (364 allocations: 16.27 KiB)
  133.620 μs (242 allocations: 11.19 KiB)
  146.186 μs (258 allocations: 12.05 KiB)

  342.355 μs (509 allocations: 22.77 KiB)
  165.137 μs (284 allocations: 12.70 KiB)


# First try, without mydropout6
  # 194.412 μs (364 allocations: 16.27 KiB)
  # 318.048 μs (235 allocations: 401.44 KiB)  # bad, but not because of alloc

  # 259.827 μs (509 allocations: 22.77 KiB)
  # 150.939 μs (276 allocations: 12.77 KiB)  # great



julia> let x = CUDA.randn(1000, 10_000), rng = CUDA.default_rng()
           CUDA.@time gradient(x -> sum(Flux.dropout(x, 0.3)), x)
           CUDA.@time gradient(x -> sum(NNlib.dropout(x, 0.3)), x)
           CUDA.@time gradient(x -> sum(mydropout6(rng, x, 0.3)), x)
           println()
           CUDA.@time gradient(x -> sum(Flux.dropout(x, 0.3; dims=1)), x)
           CUDA.@time gradient(x -> sum(NNlib.dropout(x, 0.3; dims=1)), x)
       end;
  0.116546 seconds (61.41 k CPU allocations: 2.759 MiB, 26.81% gc time) (7 GPU allocations: 190.735 MiB, 8.33% memmgmt time)
  0.078563 seconds (52.12 k CPU allocations: 2.933 MiB) (6 GPU allocations: 152.589 MiB, 0.07% memmgmt time)
  0.078514 seconds (42.42 k CPU allocations: 2.552 MiB) (5 GPU allocations: 114.442 MiB, 0.07% memmgmt time)

  0.090443 seconds (54.14 k CPU allocations: 2.910 MiB) (8 GPU allocations: 152.596 MiB, 0.07% memmgmt time)
  0.083492 seconds (44.09 k CPU allocations: 2.394 MiB) (6 GPU allocations: 114.445 MiB, 0.07% memmgmt time)

# First try, without mydropout6
  # 0.070830 seconds (35.09 k CPU allocations: 1.953 MiB) (7 GPU allocations: 190.735 MiB, 0.10% memmgmt time)
  # 0.149181 seconds (52.21 k CPU allocations: 41.074 MiB, 14.75% gc time) (6 GPU allocations: 152.589 MiB, 0.18% memmgmt time)

  # 0.087902 seconds (54.21 k CPU allocations: 2.912 MiB) (8 GPU allocations: 152.596 MiB, 0.07% memmgmt time)
  # 0.082117 seconds (44.14 k CPU allocations: 2.399 MiB) (6 GPU allocations: 114.445 MiB, 0.06% memmgmt time)


##### CPU, on cyclops

julia> let x = rand(100, 1000)
           @btime Flux.dropout($x, 0.3)
           @btime NNlib.dropout($x, 0.3)
           println()
           @btime Flux.dropout($x, 0.3; dims=1)
           @btime CUDA.@sync NNlib.dropout($x, 0.3; dims=1)
       end;
  491.842 μs (13 allocations: 1.53 MiB)
  297.053 μs (2 allocations: 781.30 KiB)

  139.822 μs (3 allocations: 782.17 KiB)
  143.676 μs (3 allocations: 782.17 KiB)

# one pass strategy, for the last:
  143.043 μs (3 allocations: 782.17 KiB)

@ToucheSir
Copy link
Member

One trick for GPU is to save the RNG state and lazily generate the mask in a kernel instead of materializing it (see https://triton-lang.org/master/getting-started/tutorials/04-low-memory-dropout.html for an implementation). This works because the default RNG is counter-based and doesn't have any dependence on previous states. Wikipedia tells me Xoshiro does not fall into this category, but we'd likely be doing a linear traversal on CPU and allocations are cheap(er) anyhow. All this to say that as long as things aren't slower on GPU, we can always optimize them separately in NNlibCUDA.

@mcabbott
Copy link
Member Author

mcabbott commented Jan 3, 2023

Ah that's an idea I did not consider, would probably pay on CPU too.

Copying Xoshiro is certainly fast and appears to work. It's just some bits... where do you read that it has other state?

julia> let rs = [copy(Random.default_rng()) for _ in 1:10]
         @test rand(rs[1]) == rand(rs[2])
         @test randn(rs[3], 3) == randn(rs[4], 3)
         @test rand!(rs[5], similar(1:0, 10, 10)) == rand!(rs[6], similar(1:0, 10, 10))
         rand(rs[1])
       end
0.8655824473102265

julia> dump(copy(Random.default_rng()))
Xoshiro
  s0: UInt64 0x3c8c337c7236f4ab
  s1: UInt64 0x5758ee14bcde3f3c
  s2: UInt64 0xb7bbc6e271015391
  s3: UInt64 0xb76992934d86a2b9

What I'm not sure of is whether copy(::AbstractRNG) is guaranteed to work like this. Copying MersenneTwister takes a μs.

A quick attempt is not quite as fast, but close, and saves memory.

mydropout6(args...; kw...) = dropout(args...; kw...)  # same forward as this PR
function ChainRulesCore.rrule(::typeof(mydropout6), rng::AbstractRNG, x::AbstractArray, p::Real; dims = :)
    dup_rng = copy(rng)   # save for backward pass
    y = dropout(dup_rng, x, p; dims)  # for other dims, could re-use buffer...
    val = convert(eltype(y), 1/(1-p))
    function back6(Δ)
        dims isa Colon || error("will be wrong")
        dx = rand!(dup_rng, similar(y))
        _fast_broadcast!(dx, unthunk(Δ)) do q, dy
            (q>p) * val * dy
        end
        (NoTangent(), NoTangent(), dx, NoTangent())
    end
    y, back6
end
let x = randn(Float32, 100, 100), rng = Random.default_rng()
    @btime gradient(x -> sum(dropout($rng, x, 0.3)), $x)     # first commit of PR
    @btime gradient(x -> sum(mydropout6($rng, x, 0.3)), $x)  # copy the RNG
end;
  # min 13.041 μs, mean 23.654 μs (22 allocations, 117.95 KiB)
  # min 16.916 μs, mean 24.421 μs (22 allocations, 78.97 KiB)

Edit: Here's a test with CUDA, which seems to reliably repeat on small sizes, but fail on large arrays. I presume that means it's dividing the work up in a non-deterministic way. (Maybe there are ways to control that?)

julia> @eval Base.copy(r::CUDA.RNG) = $(Expr(:new, CUDA.RNG, :(r.seed), :(r.counter)))

julia> let r1 = copy(CUDA.default_rng()), r2 = copy(CUDA.default_rng())
         x1 = rand(r1, 3000)
         x2 = rand(r2, 3000)
         x1==x2
       end
true

julia> let r1 = copy(CUDA.default_rng()), r2 = copy(CUDA.default_rng())
         x1 = rand(r1, 30_000)
         x2 = rand(r2, 30_000)
         x1==x2
       end
false

@ToucheSir
Copy link
Member

ToucheSir commented Jan 3, 2023

where do you read that it has other state?

I was going off https://en.wikipedia.org/wiki/Counter-based_random_number_generator_(CBRNG)#Background. It's not that it has other state, but that you can't get the Nth random value without first calculating some previous N-x values to advance the internal state. Whereas for the Philox RNG CUDA.jl uses, you just set the counter to however many steps forward you want and it gives you a value.

What I'm not sure of is whether copy(::AbstractRNG) is guaranteed to work like this.

This was my concern as well. We could always use dispatch to have a fallback path for unknown RNGs that materializes the mask.

src/dropout.jl Outdated
similar(A, T, ntuple(d -> d in dims ? size(A,d) : 1, ndims(A)))
end
rand!(rng, keep)
Y = @. (keep>p) * A * val
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If the mask idea doesn't work out, another idea for making this type stable while still allowing short-circuiting on 0 and 1 is to make a lazy mask wrapper which only creates the mask if necessary. Depending on how devious we want to be, we can even use promote_op or a custom mask_type(xtype::Type{A}, ...) where A <: AbstractArray to determine what type the wrapped mask should be (since we can't determine it in the short-circuiting case).

@mcabbott
Copy link
Member Author

mcabbott commented Jan 3, 2023

but that you can't get the Nth random value without first calculating some previous N-x values to advance the internal state

Ah but that's fine I think. We don't need to skip into the future. We only need that the two copies used forward and backward produce the same, starting from whatever we happen to start.

We could always use dispatch to have a fallback path for unknown RNGs that materializes the mask.

Oh man, I'd be quite keen not to have to maintain two paths. (Already the no-gradient and gradient versions differ a bit here...)

@ToucheSir
Copy link
Member

ToucheSir commented Jan 3, 2023

Yes, the skipping is more relevant on GPU. IIRC there's a way to split a Xoshiro RNG for multithreading, so that's likely not an issue on CPU.

On divergent paths, I don't think it's too too bad. The dispatch split point could be on what Flux calls dropout_mask. We then override broadcast(*, ::LowMemDropoutMask, x) or introduce our own apply_mask function with a x .* mask fallback. The rrule is technically not required because AD will do dx = mask .* dy automatically.

@mcabbott
Copy link
Member Author

mcabbott commented Jan 3, 2023

I think we should do this, it's a step forward.

The copy-the-RNG story does work, and save memory on the GPU. But needs some constructor methods, and perhaps consultation with experts about safety. And copying MerseinTwister on 1.6 is pretty slow... we should kick the can down the road I think. No chance that changes the interface.

This saves memory & time over what's now in Flux, and will let us remove that. It needs a 2-line NNlibCUDA PR, and some tests there. has a companion PR: FluxML/NNlibCUDA.jl#60

Copy link
Member

@ToucheSir ToucheSir left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No objections from me. All the aforementioned tricks can be done as additive or at least non-breaking changes in future PRs.

src/dropout.jl Show resolved Hide resolved
src/dropout.jl Show resolved Hide resolved
src/dropout.jl Outdated Show resolved Hide resolved
Comment on lines +147 to +155
"""
_rng_from_array(x)

Return the random number generator most appropriate for `x`:
`CUDA.default_rng()` for `CuArray`, else `Random.default_rng()`
"""
_rng_from_array(::AbstractArray) = Random.default_rng()

@non_differentiable _rng_from_array(::Any)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just simplifying a bit, I couldn't figure out why there were so many functions. Why make a different choice on 1.6

julia> using Random

julia> Random.default_rng()
MersenneTwister(0x9687b6121c4ccb062f473c9c3c8bccc6)

julia> Random.GLOBAL_RNG
Random._GLOBAL_RNG()

julia> VERSION
v"1.6.0"

compared to master:

julia> using Random

julia> Random.default_rng()
TaskLocalRNG()

julia> Random.GLOBAL_RNG
Random._GLOBAL_RNG()

julia> VERSION
v"1.10.0-DEV.204"

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't remember now, but based on FluxML/Flux.jl#1849 (comment) it might've been related to thread safety?

Copy link
Member

@ToucheSir ToucheSir Jan 3, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cthulhu tells me that rand(...) uses default_rng() on 1.6 as well and it returns a thread-local RNG, so maybe this was much ado about nothing. cc @darsnack if I've missed something though, and I think this function can be public like the Flux one.

@mcabbott mcabbott merged commit 2bae421 into FluxML:master Jan 5, 2023
@mcabbott mcabbott deleted the dropout branch January 5, 2023 11:49
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants