As can be seen, the new API is almost 30 times slower:
julia> x = Atomic{Int}(0);

julia> @btime for _ in 1:1_000_000
           increment!($x)
       end
  109.877 ms (3000000 allocations: 61.04 MiB)

julia> x = Threads.Atomic{Int}(0);

julia> @btime for _ in 1:1_000_000
           increment!($x)
       end
  3.895 ms (0 allocations: 0 bytes)
Of course, this is because the @atomic x.data += 1 call fails to be optimized down to a sequence of lock and xadd instructions on AMD64. If we deprecate the old API, I think the new API should provide an alternative that is comparable in terms of performance.
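For anyone reproducing this, one way to verify that claim is to inspect the generated code directly. The following is a sketch assuming the same Atomic wrapper and increment! definitions used later in this thread:

```julia
using InteractiveUtils  # provides @code_native / @code_llvm

# Same wrapper and increment! as used elsewhere in this issue.
mutable struct Atomic{T}
    @atomic data::T
end

increment!(x::Atomic) = @atomic x.data += 1

x = Atomic{Int}(0)

# If the field atomic optimizes well, the native code should contain a
# single `lock xadd` (lowered from an LLVM `atomicrmw add`) rather than
# a call into the runtime.
@code_native increment!(x)
```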
The performance difference in the OP seems to be already resolved (I'm checking this after #41859 and #42017). But the difference is observable in a parallel setting:
mutable struct Atomic{T}
    @atomic data::T
end

increment!(x::Atomic) = @atomic x.data += 1
increment!(x::Threads.Atomic) = Threads.atomic_add!(x, 1)

function serial_increments!(ref, n = 2^20)
    for _ in 1:n
        increment!(ref)
    end
end

function parallel_increments!(ref, n = 2^20)
    Threads.@threads for _ in 1:Threads.nthreads()
        serial_increments!(ref, n)
    end
end

using BenchmarkTools

suite = BenchmarkGroup()
suite["serial Threads"] = @benchmarkable serial_increments!(Threads.Atomic{Int}(0))
suite["serial @atomic"] = @benchmarkable serial_increments!(Atomic{Int}(0))
suite["parallel Threads"] = @benchmarkable parallel_increments!(Threads.Atomic{Int}(0))
suite["parallel @atomic"] = @benchmarkable parallel_increments!(Atomic{Int}(0))
result = run(suite; verbose = true)
I get
julia> sort!(collect(Dict(result)); by = first)
4-element Vector{Pair{String, BenchmarkTools.Trial}}:
 "parallel @atomic" => 457.105 ms
 "parallel Threads" => 227.970 ms
 "serial @atomic"   => 5.950 ms
 "serial Threads"   => 5.636 ms
julia> Threads.nthreads()
16
(Aside: I find it puzzling that the difference can be observed in the parallel benchmark but not in the serial one. I thought it'd be the other way around because the cache traffic would hide the function call overhead.)
This may be because modifyfield! is not completely inlined:
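One way to check the inlining hypothesis is to look at the typed IR for the increment. This is a sketch; Atomic and increment! are the same definitions as in the benchmark above:

```julia
using InteractiveUtils  # provides @code_typed

mutable struct Atomic{T}
    @atomic data::T
end

increment!(x::Atomic) = @atomic x.data += 1

# If modifyfield! is fully inlined, the typed IR should contain the
# atomic modify operation directly rather than an opaque invoke of
# Base.modifyfield!.
@code_typed increment!(Atomic{Int}(0))
```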
I guess core devs already know about this issue, but I couldn't find a dedicated issue, so let me file one.
I discovered that the new per-field atomics API can't match the previous API in terms of performance.