Define and export `undefs`, similar to `zeros` and `ones` #42620

Conversation
Cross references:
Your idea for a function was down-voted by, e.g., Kristoffer, whom I highly respect (and you too), so I'm just guessing at his concern. I think the point of the current way of doing this is to be intentionally hard. I note that both your function and the status quo, in your example, return arrays full of zeros (why only for Ints? Is it more likely for them, or just coincidence?). I could support adding your function with one change: if you set the first byte of the array to something non-zero (e.g. 0xFF; I'm not sure, it need not be random), it will not lure people into a false sense of security. It's not that I don't want that for the status quo (I seem to remember it being discussed), but maybe it's harder to do there, so that's an argument for your function. It should be a simple change (I don't think we allow truly 0-length arrays; I mean, empty arrays also allocate some memory, unlike in C. EDIT: or maybe we do; why the bug below? Then this wouldn't work...). FYI, in both my 1.6 and 1.7.0-rc1:
Yes, that's a bug. Please open an issue.
This has been discussed a fair bit in the past. While this functionality does seem convenient, I think in this case the proposed name is a bit misleading. A similar argument can of course be made for Stefan's proposed
I've cited some of the prior discussion above. This is a controversial topic. The main new aspect here is a concrete implementation using a distinct symbol,
I really do not see what the point of making something intentionally hard is. All we are doing is making it harder than necessary for people who are new to Julia to write efficient code.
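For reference, the two spellings being compared look roughly like this (`undefs` in this thread is the proposed convenience function, not something that exists in Base):

```julia
# Status quo: uninitialized allocation uses the curly-brace constructor
# with the `undef` singleton; the element type is mandatory.
A = Array{Float64}(undef, 1024, 1024)

# What newcomers typically reach for instead, paying for the zero-fill:
B = zeros(1024, 1024)   # defaults to Float64
```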
I've pursued the bug in the other issue, but I keep wondering what you were actually trying to do here. It seems to me that the current syntax might actually be quite confusing for you as well. Here you've tried to create an array with an element type of `UndefInitializer`:

```julia
julia> Array{UndefInitializer}(undef, 0)
UndefInitializer[]

julia> Array{UndefInitializer}(undef, ())
0-dimensional Array{UndefInitializer, 0}:
UndefInitializer()
```

Did the syntax trip you up as well? I'm not really sure why you would try to set the type to the value,
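To make the confusion concrete, here is a minimal sketch contrasting the two calls (assuming a Float64 array was what was actually intended):

```julia
# Passing UndefInitializer as the *element type* creates an array that
# stores UndefInitializer values -- almost certainly not what was meant.
confusing = Array{UndefInitializer}(undef, 0)
eltype(confusing)   # UndefInitializer

# The intended call puts the element type in the braces and passes the
# `undef` singleton as the first argument.
intended = Array{Float64}(undef, 0)
eltype(intended)    # Float64
```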
I would be less bothered by this if there were less of a difference in performance between
The lack of

Julia:

```julia
julia> @benchmark zeros((1024,1024))
BenchmarkTools.Trial: 2190 samples with 1 evaluation.
 Range (min … max):  1.661 ms … 7.812 ms  ┊ GC (min … max):  0.00% … 65.22%
 Time  (median):     1.826 ms             ┊ GC (median):     0.00%
 Time  (mean ± σ):   2.278 ms ± 1.036 ms  ┊ GC (mean ± σ):  19.89% ± 22.23%
```

Python / NumPy:

```python
In [2]: %timeit np.zeros((1024,1024))
90.6 µs ± 274 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
```
Out of curiosity, do you have some non-microbenchmark code to share as well where this has a significant impact?
It's "fake" performance though: with

The downside would be this: let's say someone allocates an array of 10 items and then sets the values in a for-loop. Currently the profiler would assign responsibility where it lies, with the use of
That's not quite true. The operating system may have a pool of memory that it knows has already been initialized, and may be able to just allocate it. https://stackoverflow.com/questions/2688466/why-mallocmemset-is-slower-than-calloc/2688522#2688522
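The `calloc` point can be illustrated from Julia via `Libc.calloc`: per the linked answer, the allocator can hand back pages that are guaranteed to read as zero without eagerly memset-ting them (a sketch; the deferred-zeroing behavior is up to the OS, not observable from this code):

```julia
# Allocate 1 MiB through calloc. The memory is guaranteed to read as
# zero, but the OS may defer physically zeroing pages until first write.
n = 1 << 20
p = Ptr{UInt8}(Libc.calloc(n, 1))
v = unsafe_wrap(Array, p, n; own = true)  # own = true: GC frees it later
all(iszero, v)   # true
```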
That text indicates that the shared memory is used when you access the values, but setting values triggers actual allocation. A similar discussion started at https://discourse.julialang.org/t/julias-applicable-context-is-getting-narrower-over-time/55042/10?u=tim.holy. The conclusion there seems to be the

Switching to
That's exactly what @timholy was getting at here:
at which point, if the program doesn't write to the
I would expect to see a large difference here between A and C as I write to it. Do we see it? Edit: fixed the benchmark implementation; note the below is on Windows.

```julia
julia> faster_zeros(::Type{T}, dims...) where T = unsafe_wrap(Array{T}, Ptr{T}(Libc.calloc(prod(dims), sizeof(T))), dims; own = true)
faster_zeros (generic function with 1 method)

julia> @btime faster_zeros(Float64, 1024, 1024);
  12.600 μs (2 allocations: 8.00 MiB)

julia> @btime zeros(Float64, 1024, 1024);
  1.641 ms (2 allocations: 8.00 MiB)

julia> inds = CartesianIndices(1:5:1024*1024);

julia> @benchmark A[$inds] .= 3 setup = ( A = zeros(Float64, 1024, 1024) ) evals=1
BenchmarkTools.Trial: 1995 samples with 1 evaluation.
 Range (min … max):  81.800 μs … 274.700 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     92.600 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   99.389 μs ±  21.563 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

  ▃▆██▆▅▄▃▄▂▁▁                                                ▁
  ▇█████████████▇▇▇▇▆▆▆▅▇▇▇▄▆▅▆▁▆▆▆▇▆▆▅▅▄▆▅▆▅▇▆▆▅▄▆▄▅▄▅▄▄▁▄▁▆▆ █
  81.8 μs      Histogram: log(frequency) by time        198 μs <

 Memory estimate: 80 bytes, allocs estimate: 2.

julia> @benchmark C[$inds] .= 3 setup = ( C = faster_zeros(Float64, 1024, 1024) ) evals=1
BenchmarkTools.Trial: 9547 samples with 1 evaluation.
 Range (min … max):  315.100 μs …   2.603 ms  ┊ GC (min … max):  0.00% … 70.03%
 Time  (median):     334.400 μs               ┊ GC (median):     0.00%
 Time  (mean ± σ):   503.536 μs ± 355.562 μs  ┊ GC (mean ± σ):  30.18% ± 26.42%

  █▆▃▄▄▃▂                                  ▃▂▂▃▃▃▂             ▂
  █████████▇▇▆▆▅▆▄▅▄▁▄▃▄▃▁▁▁▃▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▃▁▆▇██████████▇▇▇▆▆ █
  315 μs       Histogram: log(frequency) by time        1.46 ms <

 Memory estimate: 80 bytes, allocs estimate: 2.

julia> versioninfo()
Julia Version 1.6.3
Commit ae8452a9e0 (2021-09-23 17:34 UTC)
Platform Info:
  OS: Windows (x86_64-w64-mingw32)
  CPU: Intel(R) Core(TM) i9-9880H CPU @ 2.30GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-11.0.1 (ORCJIT, skylake)
```
Yes, I do observe that difference.

further, the

For what it's worth (and relevant to the original PR), I think having something like
Introduction

After talking to @Seelengrab on Zulip, part of the difference seems attributable to the operating system. I therefore compared Julia under Windows and under Windows Subsystem for Linux 2 (WSL2).

Windows

Under Windows,

```julia
julia> @btime zeros(Float64, 1024, 1024);
  1.646 ms (2 allocations: 8.00 MiB)

julia> @btime faster_zeros(Float64, 1024, 1024);
  12.700 μs (2 allocations: 8.00 MiB)

julia> @btime Array{Float64}(undef, 1024, 1024);
  12.500 μs (2 allocations: 8.00 MiB)

julia> versioninfo()
Julia Version 1.6.3
Commit ae8452a9e0 (2021-09-23 17:34 UTC)
Platform Info:
  OS: Windows (x86_64-w64-mingw32)
  CPU: Intel(R) Core(TM) i9-9880H CPU @ 2.30GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-11.0.1 (ORCJIT, skylake)
```

Windows Subsystem for Linux 2, on the same machine

I then repeated the same benchmarks under WSL2 on the same machine.

```julia
julia> @btime zeros(Float64, 1024, 1024);
  464.700 μs (2 allocations: 8.00 MiB)

julia> @btime faster_zeros(Float64, 1024, 1024);
  33.300 μs (2 allocations: 8.00 MiB)

julia> @btime Array{Float64}(undef, 1024, 1024);
  14.463 μs (2 allocations: 8.00 MiB)

julia> versioninfo()
Julia Version 1.6.3
Commit ae8452a9e0 (2021-09-23 17:34 UTC)
Platform Info:
  OS: Linux (x86_64-pc-linux-gnu)
  CPU: Intel(R) Core(TM) i9-9880H CPU @ 2.30GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-11.0.1 (ORCJIT, skylake)
```
Repeating the earlier set of benchmarks on Windows Subsystem for Linux 2:

```julia
julia> using BenchmarkTools

julia> faster_zeros(::Type{T}, dims...) where T = unsafe_wrap(Array{T}, Ptr{T}(Libc.calloc(prod(dims), sizeof(T))), dims; own = true)
faster_zeros (generic function with 1 method)

julia> @btime faster_zeros(Float64, 1024, 1024);
  16.700 μs (2 allocations: 8.00 MiB)

julia> @btime zeros(Float64, 1024, 1024);
  461.300 μs (2 allocations: 8.00 MiB)

julia> inds = CartesianIndices(1:5:1024*1024);

julia> @benchmark A[$inds] .= 3 setup = ( A = zeros(Float64, 1024, 1024) )
BenchmarkTools.Trial: 6658 samples with 1 evaluation.
 Range (min … max):  49.700 μs … 358.800 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     92.000 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   90.246 μs ±  28.876 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

  ▄▅▅▅▂▁▃▅▆▅▆▆▅▃▂▄▇██▅▃▃▃▂▂▁▁ ▁                               ▂
  ██████████████████████████████▇█▆▇████████▇▇▇▆▇▇▆▆▆█▆▆▆▆▆▆▆▅ █
  49.7 μs      Histogram: log(frequency) by time        203 μs <

 Memory estimate: 80 bytes, allocs estimate: 2.

julia> @benchmark C[$inds] .= 3 setup = ( C = faster_zeros(Float64, 1024, 1024) )
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  40.800 μs …  44.661 ms  ┊ GC (min … max):  0.00% … 99.74%
 Time  (median):     46.900 μs               ┊ GC (median):     0.00%
 Time  (mean ± σ):  109.898 μs ± 601.519 μs  ┊ GC (mean ± σ):  53.03% ± 31.13%

  █▆▂▃▁ ▁▁▁▁              ▂▄▃▂▂▂▁                              ▁
  ███████████▇▆▅▅▅▃▄▄▄▃▃▁▁▄▁▄▁▁▁▁▁▁▁▁▁▁▁▁▁▁████████▇▇▇▆▆▆▇▆▇▇▇▆ █
  40.8 μs      Histogram: log(frequency) by time        448 μs <

 Memory estimate: 80 bytes, allocs estimate: 2.

julia> versioninfo()
Julia Version 1.6.3
Commit ae8452a9e0 (2021-09-23 17:34 UTC)
Platform Info:
  OS: Linux (x86_64-pc-linux-gnu)
  CPU: Intel(R) Core(TM) i9-9880H CPU @ 2.30GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-11.0.1 (ORCJIT, skylake)
```
Continuing the benchmarks, I wondered what the total time to initialize, modify, and retrieve was using arrays obtained from

Windows

On Windows,

```julia
julia> function a_func()
           A = zeros(Float64, 1024, 1024)
           A[1:5:length(A)] .= 2.0
           sum(A)
       end
a_func (generic function with 1 method)

julia> function c_func()
           A = faster_zeros(Float64, 1024, 1024)
           A[1:5:length(A)] .= 2.0
           sum(A)
       end
c_func (generic function with 1 method)

julia> @btime a_func()
  2.373 ms (4 allocations: 8.00 MiB)
419432.0

julia> @btime c_func()
  2.071 ms (4 allocations: 8.00 MiB)
419432.0
```

Windows Subsystem for Linux 2

On WSL2,

```julia
julia> function a_func()
           A = zeros(Float64, 1024, 1024)
           A[1:5:length(A)] .= 2.0
           sum(A)
       end
a_func (generic function with 1 method)

julia> function c_func()
           C = faster_zeros(Float64, 1024, 1024)
           C[1:5:length(C)] .= 2.0
           sum(C)
       end
c_func (generic function with 1 method)

julia> @btime a_func()
  1.168 ms (4 allocations: 8.00 MiB)
419432.0

julia> @btime c_func()
  581.300 μs (4 allocations: 8.00 MiB)
419432.0
```

Conclusions

Overall, it seems that the
Regarding the name, I'm not fixed on
With

On Windows, `zeros_via_calloc` and `undef` initialization take ~12.5 microseconds. Broadcasting addition across the array takes 3.2 ms when initialized via `zeros_via_calloc`, and 3.7 ms when initialized via `zeros`. On WSL2, broadcasting addition across the array takes 0.8 ms using `zeros_via_calloc` and 1.3 ms using `zeros`.

```julia
# Windows
julia> @btime zeros_via_calloc(Float64, 1024, 1024);
  12.400 μs (2 allocations: 8.00 MiB)

julia> @btime Array{Float64}(undef, 1024, 1024);
  12.500 μs (2 allocations: 8.00 MiB)

julia> @btime zeros_via_calloc(Float64, 1024, 1024) .+ 1;
  3.244 ms (4 allocations: 16.00 MiB)

julia> @btime zeros(Float64, 1024, 1024) .+ 1;
  3.742 ms (4 allocations: 16.00 MiB)

# Windows Subsystem for Linux 2
julia> @btime zeros_via_calloc(Float64, 1024, 1024);
  205.700 μs (2 allocations: 8.00 MiB)

julia> @btime Array{Float64}(undef, 1024, 1024);
  13.289 μs (2 allocations: 8.00 MiB)

julia> @btime zeros_via_calloc(Float64, 1024, 1024) .+ 1;
  811.800 μs (4 allocations: 16.00 MiB)

julia> @btime zeros(Float64, 1024, 1024) .+ 1;
  1.308 ms (4 allocations: 16.00 MiB)
```

If
I've seen enough samples across Windows, Linux, and Mac showing that a
Part of this is a false dichotomy. The existence of

That said, we may need to consider how we might separate the language into a "Convenience API" and a "Sophisticated API". Perhaps there should be another default module called

Presently, the issue I am trying to address is that we have a convenience API that is slower than the sophisticated one for a common task: performant array creation. It is particularly slow on Windows relative to Linux, an important aspect since few developers are looking into these differences. It's harder for me to evangelize for Julia when this is the case. I end up having to say something along the lines of "Yes, Julia can be quite fast, but you have to use this sophisticated syntax that I do not have time to explain at the moment. Also, you may need to switch operating systems."

One development in light of
Some breadcrumbs for future people:
Create an easy-to-use function, `undefs`, that creates an uninitialized array using syntax similar to `zeros` and `ones`.

New users to Julia often use `zeros` or `ones` to initialize arrays even though initialization may not be needed. Part of the reason is that the syntax of `zeros` and `ones` is uncomplicated and reminiscent of similar functions in other languages and frameworks. Additionally, these functions have a default element type of `Float64`.

To make the creation of uninitialized arrays easier for users new to Julia, this adds a method `undefs` that mimics the syntax and argument order of `ones` and `zeros`. Like `ones` and `zeros`, it also has a default element type of `Float64`.

While the functionality is redundant with `Array{T}(undef, dims)`, apart from the `Float64` default, or with `similar`, the syntax is straightforward and requires neither curly braces nor an existing array.

Let's make efficient Julia easier to use with `undefs`.

See also `numpy.empty`.
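Since the PR's diff is not shown here, a minimal sketch of what the description implies (my reconstruction, not the actual patch) might look like:

```julia
# Hypothetical `undefs`, mirroring the signatures of `zeros` and `ones`:
# element type first (defaulting to Float64), then dimensions as either
# varargs or a tuple. Contents are uninitialized.
undefs(::Type{T}, dims::Integer...) where {T} = Array{T}(undef, dims...)
undefs(::Type{T}, dims::Tuple) where {T} = Array{T}(undef, dims)
undefs(dims::Integer...) = undefs(Float64, dims...)  # Float64 default
undefs(dims::Tuple) = undefs(Float64, dims)

undefs(2, 3)     # 2x3 Matrix{Float64}, contents undefined
undefs(Int, 4)   # 4-element Vector{Int}, contents undefined
```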