
Atomic read-modify-write gets much slower with the new API #41843

Open
bicycle1885 opened this issue Aug 9, 2021 · 1 comment
Labels
multithreading Base.Threads and related functionality performance Must go faster

Comments

@bicycle1885
Member

I suspect the core devs already know about this, but I couldn't find a dedicated issue, so let me file one.

I discovered that the new per-field atomics API doesn't match the previous API in performance.

mutable struct Atomic{T}
    @atomic data::T
end
increment!(x::Atomic) = @atomic x.data += 1
increment!(x::Threads.Atomic) = Threads.atomic_add!(x, 1)
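As a sanity check (my own sketch, not from the issue), the two definitions can be exercised side by side to confirm they agree semantically; this assumes Julia ≥ 1.7 for the per-field @atomic macro:

```julia
# Sketch: confirm both APIs compute the same result (assumes Julia >= 1.7).
mutable struct Atomic{T}
    @atomic data::T
end
increment!(x::Atomic) = @atomic x.data += 1          # returns the new value
increment!(x::Threads.Atomic) = Threads.atomic_add!(x, 1)  # returns the old value

a = Atomic{Int}(0)
b = Threads.Atomic{Int}(0)
for _ in 1:1000
    increment!(a)
    increment!(b)
end
@assert (@atomic a.data) == 1000
@assert b[] == 1000
```

Note the differing return conventions: @atomic x.data += 1 evaluates to the new value, while Threads.atomic_add! returns the old one.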

As can be seen, the new API is almost 30 times slower:

julia> x = Atomic{Int}(0);

julia> @btime for _ in 1:1_000_000
           increment!($x)
       end
  109.877 ms (3000000 allocations: 61.04 MiB)

julia> x = Threads.Atomic{Int}(0);

julia> @btime for _ in 1:1_000_000
           increment!($x)
       end
  3.895 ms (0 allocations: 0 bytes)

Of course, this is because the @atomic x.data += 1 call fails to be optimized down to a lock xadd instruction on AMD64. If we deprecate the old API, I think the new API should provide an alternative with comparable performance.
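One way to check this (a sketch of mine, not from the issue) is to dump the generated assembly for both implementations; an optimal atomic increment on x86-64 appears as a single lock xadd (or lock add), while the unoptimized form shows a cmpxchg retry loop:

```julia
# Sketch: inspect the x86-64 assembly generated for each implementation.
using InteractiveUtils  # provides code_native in scripts

mutable struct Atomic{T}
    @atomic data::T
end
increment!(x::Atomic) = @atomic x.data += 1
increment!(x::Threads.Atomic) = Threads.atomic_add!(x, 1)

# Ideally both print a single `lock xadd`; the slow case instead shows
# a cmpxchg loop around an out-of-line call to `+`.
code_native(increment!, (Atomic{Int},); syntax = :intel)
code_native(increment!, (Threads.Atomic{Int},); syntax = :intel)
```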

@vtjnash vtjnash added performance Must go faster multithreading Base.Threads and related functionality labels Aug 9, 2021
@tkf
Member

tkf commented Sep 14, 2021

FWIW, here's yet another benchmark.

The performance difference in the OP seems to have already been resolved (I'm checking this after #41859 and #42017), but a difference is still observable in a parallel setting:

mutable struct Atomic{T}
    @atomic data::T
end
increment!(x::Atomic) = @atomic x.data += 1
increment!(x::Threads.Atomic) = Threads.atomic_add!(x, 1)

function serial_increments!(ref, n = 2^20)
    for _ in 1:n
        increment!(ref)
    end
end

function parallel_increments!(ref, n = 2^20)
    Threads.@threads for _ in 1:Threads.nthreads()
        serial_increments!(ref, n)
    end
end

using BenchmarkTools

suite = BenchmarkGroup()
suite["serial Threads"] = @benchmarkable serial_increments!(Threads.Atomic{Int}(0))
suite["serial @atomic"] = @benchmarkable serial_increments!(Atomic{Int}(0))
suite["parallel Threads"] = @benchmarkable parallel_increments!(Threads.Atomic{Int}(0))
suite["parallel @atomic"] = @benchmarkable parallel_increments!(Atomic{Int}(0))

result = run(suite; verbose = true)

I get

julia> sort!(collect(Dict(result)); by = first)
4-element Vector{Pair{String, BenchmarkTools.Trial}}:
 "parallel @atomic" => 457.105 ms
 "parallel Threads" => 227.970 ms
   "serial @atomic" => 5.950 ms
   "serial Threads" => 5.636 ms

julia> Threads.nthreads()
16

(Aside: I find it puzzling that the difference can be observed in the parallel benchmark but not in the serial one. I thought it'd be the other way around because the cache traffic would hide the function call overhead.)

This may be because modifyfield! is not completely inlined:

julia> @code_llvm increment!(Atomic(0))
;  @ ....
define i64 @"julia_increment!_1222"({}* nonnull align 8 dereferenceable(8) %0) #0 {
top:
; ┌ @ Base.jl:56 within `modifyproperty!`
   %1 = bitcast {}* %0 to i64*
   %2 = load atomic i64, i64* %1 monotonic, align 8
   br label %xchg

done_xchg:                                        ; preds = %xchg
; └
  ret i64 %4

xchg:                                             ; preds = %xchg, %top
; ┌ @ Base.jl:56 within `modifyproperty!`
   %3 = phi i64 [ %2, %top ], [ %6, %xchg ]
   %4 = call i64 @"j_+_1224"(i64 signext %3, i64 signext 1) #0
   %5 = cmpxchg i64* %1, i64 %3, i64 %4 seq_cst monotonic
   %6 = extractvalue { i64, i1 } %5, 0
   %7 = extractvalue { i64, i1 } %5, 1
   br i1 %7, label %done_xchg, label %xchg
; └
}
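For contrast (my own sketch, not part of the comment above), the IR for the old API can be dumped the same way; Threads.atomic_add! lowers to a single atomicrmw add, which is what the cmpxchg loop above would collapse to if the call to + were inlined and pattern-matched:

```julia
# Sketch: dump the old API's LLVM IR for comparison with the cmpxchg loop.
using InteractiveUtils

# Threads.atomic_add! compiles down to LLVM's `atomicrmw add` instruction,
# avoiding the compare-and-swap retry loop entirely.
code_llvm(Threads.atomic_add!, (Threads.Atomic{Int}, Int))
```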


3 participants