
Define and export undefs similar to zeros and ones #42620

Closed
wants to merge 2 commits into from

Conversation

mkitti
Contributor

@mkitti mkitti commented Oct 13, 2021

Create an easy-to-use function, undefs, that creates an uninitialized array using syntax similar to that of zeros and ones.

New users of Julia often use zeros or ones to initialize arrays even though initialization may not be needed. Part of the reason is that the syntax of zeros and ones is uncomplicated and reminiscent of similar functions in other languages and frameworks. These functions also have a default element type of Float64.

To make the creation of uninitialized arrays easier for users new to Julia, this PR adds a method, undefs, that mimics the syntax and argument order of ones and zeros. Like ones and zeros, it has a default element type of Float64.

While the functionality is redundant with Array{T}(undef, dims) and similar constructors (apart from the Float64 default), the proposed syntax is straightforward: it requires neither curly braces nor an existing array.

Let's make efficient Julia easier to use with undefs.

julia> Array{Float64}(undef, 5,5)
5×5 Matrix{Float64}:
 5.0e-324      2.5e-323      7.64061e-316  7.64062e-316  5.0e-323
 2.0e-323      7.64061e-316  7.64062e-316  4.4e-323      5.4e-323
 7.6406e-316   7.64061e-316  4.0e-323      4.4e-323      7.64063e-316
 7.64061e-316  3.0e-323      4.0e-323      7.64063e-316  7.64064e-316
 2.5e-323      3.5e-323      7.64062e-316  7.64063e-316  0.0

julia> undefs(5, 5)
5×5 Matrix{Float64}:
 7.63889e-316  1.22467e-315  5.99379e-316  7.63893e-316  7.64422e-316
 7.63889e-316  7.64418e-316  7.63891e-316  7.63893e-316  5.99388e-316
 7.63889e-316  7.63889e-316  7.63892e-316  7.64421e-316  5.99402e-316
 7.64418e-316  7.64418e-316  7.63892e-316  5.99387e-316  7.63894e-316
 5.99377e-316  7.64419e-316  7.63892e-316  7.63872e-316  0.0

julia> Array{Int}(undef, 5,5)
5×5 Matrix{Int64}:
 0  0  0  0  0
 0  0  0  0  0
 0  0  0  0  0
 0  0  0  0  0
 0  0  0  0  0

julia> undefs(Int, 5, 5)
5×5 Matrix{Int64}:
 0  0  0  0  0
 0  0  0  0  0
 0  0  0  0  0
 0  0  0  0  0
 0  0  0  0  0

See also numpy.empty
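
For reference, a minimal sketch of what such a method could look like (hypothetical; the actual code in this PR may differ, e.g. in how tuple dims are handled):

```julia
# Hypothetical sketch of the proposed method; the PR's actual implementation may differ.
undefs(::Type{T}, dims::Integer...) where {T} = Array{T}(undef, dims...)
undefs(dims::Integer...) = undefs(Float64, dims...)  # Float64 default, like zeros/ones
```

With these two methods, undefs(5, 5) returns an uninitialized Matrix{Float64} and undefs(Int, 5, 5) an uninitialized Matrix{Int64}, matching the REPL examples above.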

@mkitti
Contributor Author

mkitti commented Oct 13, 2021

Cross references:

@PallHaraldsson
Contributor

PallHaraldsson commented Oct 14, 2021

Your idea for a function was down-voted by, e.g., Kristoffer, whom I highly respect (and you too), so I'm just guessing at his concern.

I think the point of the current way of doing this is to be intentionally hard. I note that in your examples both your function and the status quo return arrays full of zeros (why only for Ints? Is that more likely for them, or just coincidence?). I could support adding your function with one change: if you set the first byte of the array to something non-zero (e.g. 0xFF; I'm not sure, it need not be random), it will not lead people into a false sense of security.

It's not that I don't want that for the status quo (I seem to remember it being discussed), but maybe it's harder to do there, so that's an argument for your function. It should be a simple change (I don't think we allow true 0-length arrays; I mean, empty arrays also allocate some memory, unlike in C. EDIT: or maybe we do, given the bug below? Then this wouldn't work...).

FYI: In both my 1.6 and 1.7.0-rc1:
julia> n = Array{undef}[] # gives long error, I'm probably doing it wrong, but still a bug
Internal error: encountered unexpected error in runtime:

@KristofferC
Member

julia> n = Array{undef}[]  # long error, I'm probably doing it wrong, but still a bug?

Yes, that's a bug. Please open an issue.

@ararslan
Member

This has been discussed a fair bit in the past. While this functionality does seem convenient, I think in this case the proposed name is a bit misleading: ones and zeros actually give you ones and zeros (in the sense of one and zero) of the specified type but undefs doesn't always give you actually undefined values (in the sense of !isdefined) as its name seems to imply, it just gives you values that happen to be in the memory it grabbed when allocating the array, which will be !isdefined for non-bits-types.
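
The distinction can be seen with isassigned: for a bits element type every slot contains (arbitrary) bytes and is considered assigned, while for a non-bits element type the slots are genuinely undefined references. A small illustration:

```julia
a = Vector{Float64}(undef, 2)  # bits type: slots hold junk values but are "assigned"
b = Vector{String}(undef, 2)   # non-bits type: slots are truly undefined references

isassigned(a, 1)  # true  — a[1] returns whatever bytes were in that memory
isassigned(b, 1)  # false — b[1] throws UndefRefError
```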

A similar argument can of course be made for Array{T}(undef, dims) but note that the undef there used to be uninitialized, which was more correct but also painfully long. It was shortened as a compromise, since there are cases where you do want this, but the thinking was that generally speaking it should be encouraged to ask for initialized memory.

Stefan proposed junk(T, dims) for this, which I think is actually more descriptive and is at the very least more amusing. Though I'm personally not super convinced that this niche-ish case warrants a more concise constructor.

@mkitti
Contributor Author

mkitti commented Oct 15, 2021

I've cited some of the prior discussion above. This is a controversial topic. The main new aspect here is a concrete implementation using a distinct symbol, undefs. Elements of the discussion trace back to 2014 in #9147 and 2011 in #130.

I think the point of the current way of doing this is to be intentionally hard. I note that in your examples both your function and the status quo return arrays full of zeros (why only for Ints? Is that more likely for them, or just coincidence?). I could support adding your function with one change: if you set the first byte of the array to something non-zero (e.g. 0xFF; I'm not sure, it need not be random), it will not lead people into a false sense of security.

I really do not see what the point of making something intentionally hard is. All we are doing is making it harder than necessary for people who are new to Julia to write efficient code.

n = Array{undef}[]

I've pursued the bug in the other issue, but I keep wondering what you were actually trying to do here. It seems to me that the current syntax might actually be quite confusing for you as well. Here you've tried to create an array with an element type of undef and no elements. Your stated intention was to create some kind of zero-length array. Perhaps you meant something like this?

julia> Array{UndefInitializer}(undef,0)
UndefInitializer[]

julia> Array{UndefInitializer}(undef,())
0-dimensional Array{UndefInitializer, 0}:
UndefInitializer()

Has the syntax tripped you up as well? I'm not really sure why you would try to set the type parameter to the value undef rather than the type UndefInitializer.

@timholy
Member

timholy commented Oct 15, 2021

Array{T}(undef, sz...) generalizes to MySpecialArray{T}(undef, sz...). undefs does not. I think it's better to push people to generalizable syntax that teaches them to be sophisticated Julia users & developers, and not pander to lessons learned from painfully limited programming languages.
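
The generalization works with any array type that implements the undef constructor pattern; for instance, Base's BitArray can be built with exactly the same syntax:

```julia
a = Array{Float64}(undef, 2, 3)  # the generic pattern
b = BitArray(undef, 2, 3)        # a different array type, same construction syntax
```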

zeros is not my favorite Julia function, and ones is even worse (it should be oneunits which is awful).

@mkitti
Contributor Author

mkitti commented Oct 15, 2021

I would be less bothered by this if there were less of a difference in performance between zeros(...) and Array{T}(undef, ...).

@mkitti
Contributor Author

mkitti commented Oct 15, 2021

The lack of calloc and the resulting timings got to me today.

Julia:

julia> @benchmark zeros((1024,1024))
BenchmarkTools.Trial: 2190 samples with 1 evaluation.
 Range (min … max):  1.661 ms … 7.812 ms  ┊ GC (min … max):  0.00% … 65.22%
 Time  (median):     1.826 ms             ┊ GC (median):     0.00%
 Time  (mean ± σ):   2.278 ms ± 1.036 ms  ┊ GC (mean ± σ):  19.89% ± 22.23%

Python / NumPy:

In [2]: %timeit np.zeros((1024,1024))
90.6 µs ± 274 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)

@KristofferC
Member

KristofferC commented Oct 15, 2021

Out of curiosity, do you have some non-microbenchmark code to share as well where this has a significant impact?

@timholy
Member

timholy commented Oct 15, 2021

It's "fake" performance though: with calloc you pay the price when you set values in the array, rather than when you allocate it; it's truly faster only if the array always stays all-zeros, which I'm guessing is pretty rare.

The downside would be this: let's say someone allocates an array of 10 items and then sets the values in a for-loop. Currently the profiler would assign responsibility where it lies, with the use of zeros, and the user will discover they should switch to Array{T}(undef, ...). If we switch to calloc, then the use of zeros will look fast, but the first iteration through the loop will be slow, and the profiler will flag the loop as the problem. That makes it seem like loops are slow and that there's nothing the user can do to fix this, when that's not actually true. Personally I think it's better to be transparent about where the issues arise from, but I don't dispute that use of calloc makes us look bad.

@mkitti
Contributor Author

mkitti commented Oct 15, 2021

That's not quite true. The operating system may have a pool of memory that it knows has already been initialized, and may be able to just allocate it.

https://stackoverflow.com/questions/2688466/why-mallocmemset-is-slower-than-calloc/2688522#2688522

Some processes allocate memory and then read from it without modifying it. This means that a lot of pages in memory across different processes may be filled with pristine zeroes returned from mmap(). Since these pages are all the same, the kernel makes all these virtual addresses point a single shared 4 KiB page of memory filled with zeroes. If you try to write to that memory, the processor triggers another page fault and the kernel steps in to give you a fresh page of zeroes that isn't shared with any other programs.

@timholy
Member

timholy commented Oct 15, 2021

That text indicates that the shared memory is used when you access the values, but setting values triggers allocation.

A similar discussion started at https://discourse.julialang.org/t/julias-applicable-context-is-getting-narrower-over-time/55042/10?u=tim.holy. The conclusion there seems to be that calloc doesn't actually help, though of course different systems may yield different results.

Switching to calloc would be pretty trivial, so if someone can clearly demonstrate an advantage without major disadvantages (that requires more than a single, simple benchmark), I'd change my tune.

@Seelengrab
Contributor

Some processes allocate memory and then read from it without modifying it.

That's exactly what @timholy was getting at here:

with calloc you pay the price when you set values in the array, rather than when you allocate it; it's truly faster only if the array always stays all-zeros, which I'm guessing is pretty rare.

at which point, if the program doesn't write to the zeros array at all, why allocate it?

@mkitti
Contributor Author

mkitti commented Oct 15, 2021

I would expect to see a large difference here between A and C as I write to it. Do we see it?

Edit: fixed benchmark implementation, note the below is on Windows

julia> faster_zeros(::Type{T}, dims...) where T = unsafe_wrap(Array{T}, Ptr{T}(Libc.calloc(prod(dims), sizeof(T))), dims; own = true)
faster_zeros (generic function with 1 method)

julia> @btime faster_zeros(Float64, 1024, 1024);
  12.600 μs (2 allocations: 8.00 MiB)

julia> @btime zeros(Float64, 1024, 1024);
  1.641 ms (2 allocations: 8.00 MiB)

julia> inds = CartesianIndices(1:5:1024*1024);

julia> @benchmark A[$inds] .= 3 setup = ( A = zeros(Float64, 1024, 1024) ) evals=1
BenchmarkTools.Trial: 1995 samples with 1 evaluation.
 Range (min … max):  81.800 μs … 274.700 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     92.600 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   99.389 μs ±  21.563 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

    ▃▆██▆▅▄▃▄▂▁▁                                               ▁
  ▇█████████████▇▇▇▇▆▆▆▅▇▇▇▄▆▅▆▁▆▆▆▇▆▆▅▅▄▆▅▆▅▇▆▆▅▄▆▄▅▄▅▄▄▁▄▁▆▆ █
  81.8 μs       Histogram: log(frequency) by time       198 μs <

 Memory estimate: 80 bytes, allocs estimate: 2.

julia> @benchmark C[$inds] .= 3 setup = ( C = faster_zeros(Float64, 1024, 1024) ) evals=1
BenchmarkTools.Trial: 9547 samples with 1 evaluation.
 Range (min … max):  315.100 μs …   2.603 ms  ┊ GC (min … max):  0.00% … 70.03%
 Time  (median):     334.400 μs               ┊ GC (median):     0.00%
 Time  (mean ± σ):   503.536 μs ± 355.562 μs  ┊ GC (mean ± σ):  30.18% ± 26.42%

  █▆▃▄▄▃▂                                       ▃▂▂▃▃▃▂         ▂
  █████████▇▇▆▆▅▆▄▅▄▁▄▃▄▃▁▁▁▃▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▃▁▆▇██████████▇▇▇▆▆ █
  315 μs        Histogram: log(frequency) by time       1.46 ms <

 Memory estimate: 80 bytes, allocs estimate: 2.

julia> versioninfo()
Julia Version 1.6.3
Commit ae8452a9e0 (2021-09-23 17:34 UTC)
Platform Info:
  OS: Windows (x86_64-w64-mingw32)
  CPU: Intel(R) Core(TM) i9-9880H CPU @ 2.30GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-11.0.1 (ORCJIT, skylake)

@Seelengrab
Contributor

Yes, I do observe that difference.

Click for benchmarks
julia> z1 = @benchmark A[$inds] .= 5.0 setup=(A=zeros(Float64, 1024,1024)) evals=1
BenchmarkTools.Trial: 2916 samples with 1 evaluation.
 Range (min … max):  172.588 μs … 335.784 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     211.805 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   217.761 μs ±  21.527 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

             ▂▃▂ ▃▄▆▆▅█▆             ▃▁
  ▂▂▁▁▁▁▁▂▃▄▇████████████▆▆▄▃▃▃▂▃▃▃▄▆██▄▄▄▃▃▃▂▂▂▃▂▂▂▂▂▂▂▂▂▂▂▂▂▂ ▄
  173 μs           Histogram: frequency by time          293 μs <

 Memory estimate: 80 bytes, allocs estimate: 2.

julia> z2 = @benchmark A[$inds] .= 5.0 setup=(A=faster_zeros(Float64, 1024,1024)) evals=1
BenchmarkTools.Trial: 5091 samples with 1 evaluation.
 Range (min … max):  177.771 μs … 63.472 ms  ┊ GC (min … max):  0.00% … 99.00%
 Time  (median):     218.067 μs              ┊ GC (median):     0.00%
 Time  (mean ± σ):   368.626 μs ±  1.183 ms  ┊ GC (mean ± σ):  28.69% ± 23.18%

  ▃▇█▇▄▃▃▂▂▁               ▅▂  ▁                   ▁▄▄▃▂▂▂▂▂   ▂
  ███████████▇▅▅▁▁▁▁▁▁▁▁▁▁████▆██▇▇▅▄▁▃▁▁▁▁▁▁▁▁▁▁▁▁██████████▆ █
  178 μs        Histogram: log(frequency) by time       918 μs <

 Memory estimate: 80 bytes, allocs estimate: 2.

julia> z3 = @benchmark A[$inds] .= 5.0 setup=(A=Array{Float64}(undef, 1024,1024)) evals=1
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  169.891 μs … 697.408 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     206.907 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   258.157 μs ± 119.616 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

  ▂▆▇▅██▇▅▃▃▃▂▃▂▁▁▁                            ▂▅▂▁ ▁▅▄▂▂▂▁▁▁▁▁ ▂
  ███████████████████▇▅▄▅▃▁▁▃▃▃▁▁▁▃▁▁▁▁▁▃▁▁▁▁▁▁████████████████ █
  170 μs        Histogram: log(frequency) by time        579 μs <

 Memory estimate: 80 bytes, allocs estimate: 2.

Further, the Array{Float64}(undef, ...) approach is fastest across the board.

For what it's worth (and relevant to the original PR), I think having something like junk(T, dims..) is a good idea, calling it undefs isn't.

@mkitti
Contributor Author

mkitti commented Oct 15, 2021

Introduction

After talking to @Seelengrab on Zulip, part of the difference seems attributable to operating system. I therefore compared Julia under Windows and under Windows Subsystem for Linux 2 (WSL2).

Windows

Under Windows, zeros is considerably slower (more than 100x) than either faster_zeros or undef initialization, which benchmark almost identically.

julia> @btime zeros(Float64, 1024, 1024);
  1.646 ms (2 allocations: 8.00 MiB)

julia> @btime faster_zeros(Float64, 1024, 1024);
  12.700 μs (2 allocations: 8.00 MiB)

julia> @btime Array{Float64}(undef, 1024, 1024);
  12.500 μs (2 allocations: 8.00 MiB)

julia> versioninfo()
Julia Version 1.6.3
Commit ae8452a9e0 (2021-09-23 17:34 UTC)
Platform Info:
  OS: Windows (x86_64-w64-mingw32)
  CPU: Intel(R) Core(TM) i9-9880H CPU @ 2.30GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-11.0.1 (ORCJIT, skylake)

Windows Subsystem for Linux 2, on the same machine

I then repeated the same benchmarks under WSL2 on the same machine. zeros on WSL2 is faster than zeros on Windows by a factor of 3. However, faster_zeros still outperforms zeros here by more than 10x, although it is slower than undef initialization by a factor of two.

julia> @btime zeros(Float64, 1024, 1024);
  464.700 μs (2 allocations: 8.00 MiB)

julia> @btime faster_zeros(Float64, 1024, 1024);
  33.300 μs (2 allocations: 8.00 MiB)

julia> @btime Array{Float64}(undef, 1024, 1024);
  14.463 μs (2 allocations: 8.00 MiB)

julia> versioninfo()
Julia Version 1.6.3
Commit ae8452a9e0 (2021-09-23 17:34 UTC)
Platform Info:
  OS: Linux (x86_64-pc-linux-gnu)
  CPU: Intel(R) Core(TM) i9-9880H CPU @ 2.30GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-11.0.1 (ORCJIT, skylake)

@mkitti
Contributor Author

mkitti commented Oct 15, 2021

Repeating the earlier set of benchmarks on Windows Subsystem for Linux 2:

julia> using BenchmarkTools

julia> faster_zeros(::Type{T}, dims...) where T = unsafe_wrap(Array{T}, Ptr{T}(Libc.calloc(prod(dims), sizeof(T))), dims; own = true)
faster_zeros (generic function with 1 method)

julia> @btime faster_zeros(Float64, 1024, 1024);
  16.700 μs (2 allocations: 8.00 MiB)

julia> @btime zeros(Float64, 1024, 1024);
  461.300 μs (2 allocations: 8.00 MiB)

julia> inds = CartesianIndices(1:5:1024*1024);

julia> @benchmark A[$inds] .= 3 setup = ( A = zeros(Float64, 1024, 1024) )
BenchmarkTools.Trial: 6658 samples with 1 evaluation.
 Range (min … max):  49.700 μs … 358.800 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     92.000 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   90.246 μs ±  28.876 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

  ▄▅▅▅▂▁▃▅▆▅▆▆▅▃▂▄▇██▅▃▃▃▂▂▁▁ ▁                                ▂
  ██████████████████████████████▇█▆▇████████▇▇▇▆▇▇▆▆▆█▆▆▆▆▆▆▆▅ █
  49.7 μs       Histogram: log(frequency) by time       203 μs <

 Memory estimate: 80 bytes, allocs estimate: 2.

julia> @benchmark C[$inds] .= 3 setup = ( C = faster_zeros(Float64, 1024, 1024) )
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):   40.800 μs …  44.661 ms  ┊ GC (min … max):  0.00% … 99.74%
 Time  (median):      46.900 μs               ┊ GC (median):     0.00%
 Time  (mean ± σ):   109.898 μs ± 601.519 μs  ┊ GC (mean ± σ):  53.03% ± 31.13%

  █▆▂▃▁ ▁▁▁▁                               ▂▄▃▂▂▂▁              ▁
  ███████████▇▆▅▅▅▃▄▄▄▃▃▁▁▄▁▄▁▁▁▁▁▁▁▁▁▁▁▁▁▁████████▇▇▇▆▆▆▇▆▇▇▇▆ █
  40.8 μs       Histogram: log(frequency) by time        448 μs <

 Memory estimate: 80 bytes, allocs estimate: 2.

julia> versioninfo()
Julia Version 1.6.3
Commit ae8452a9e0 (2021-09-23 17:34 UTC)
Platform Info:
  OS: Linux (x86_64-pc-linux-gnu)
  CPU: Intel(R) Core(TM) i9-9880H CPU @ 2.30GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-11.0.1 (ORCJIT, skylake)

@mkitti
Contributor Author

mkitti commented Oct 15, 2021

Continuing the benchmarks, I wondered what the total time to initialize, modify, and retrieve was using arrays obtained from zeros or faster_zeros under both Windows and WSL2.

Windows

On Windows, zeros was slower than faster_zeros, but the resulting array from zeros was faster to write to than the array from faster_zeros. In the combined operation below, the version with faster_zeros takes 87% of the time of the version with zeros.

julia> function a_func()
           A = zeros(Float64, 1024, 1024)
           A[1:5:length(A)] .= 2.0
           sum(A)
       end
a_func (generic function with 1 method)

julia> function c_func()
           A = faster_zeros(Float64, 1024, 1024)
           A[1:5:length(A)] .= 2.0
           sum(A)
       end
c_func (generic function with 1 method)

julia> @btime a_func()
  2.373 ms (4 allocations: 8.00 MiB)
419432.0

julia> @btime c_func()
  2.071 ms (4 allocations: 8.00 MiB)
419432.0

Windows Subsystem for Linux 2

On WSL2, zeros was also slower than faster_zeros. The resulting array from zeros was more consistent in write performance, although its mean was similar to that of the array from faster_zeros. In the combined operation below, the version with faster_zeros takes 50% of the time of the version with zeros.

julia> function a_func()
           A = zeros(Float64, 1024, 1024)
           A[1:5:length(A)] .= 2.0
           sum(A)
       end
a_func (generic function with 1 method)

julia> function c_func()
           C = faster_zeros(Float64, 1024, 1024)
           C[1:5:length(C)] .= 2.0
           sum(C)
       end
c_func (generic function with 1 method)

julia> @btime a_func()
  1.168 ms (4 allocations: 8.00 MiB)
419432.0

julia> @btime c_func()
  581.300 μs (4 allocations: 8.00 MiB)
419432.0

Conclusions

Overall, the calloc-based faster_zeros does tend to outperform the current implementation of zeros. Initialization via faster_zeros is faster than zeros under both Windows and Linux. Write operations were slower on the array from faster_zeros than on the array from zeros. However, in a combined benchmark measuring initialization, writes, and summation, faster_zeros outperformed zeros on both Windows and Linux.

@mkitti
Contributor Author

mkitti commented Oct 15, 2021

Regarding the name, I'm not fixed on undefs. junk is perfectly fine for me as well.

@mkitti
Contributor Author

mkitti commented Oct 16, 2021

With calloc on Windows, zeros_via_calloc(dims) becomes nearly as fast as Array{Float64}(undef, dims), and time is saved even after subsequent operations. This is a significant improvement over the current zeros, based on malloc followed by memset, which is half a millisecond slower after subsequent operations. Contrary to preconceived notions, it seems that by using calloc we can take advantage of some fusion of operations.

On Windows, `zeros_via_calloc`, and `undef` initialization take ~12.5 microseconds. Broadcasting addition across the array takes 3.2 ms when initialized via `zeros_via_calloc`, and 3.7 ms when initialized via `zeros`. On WSL2, broadcasting addition across the array takes 0.8 ms using `zeros_via_calloc` and 1.3 ms using `zeros`.
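
zeros_via_calloc is not defined in this comment; presumably it is the same calloc-backed wrapper as the earlier faster_zeros, along these lines:

```julia
# Presumed definition (same shape as the earlier faster_zeros; shown for context):
zeros_via_calloc(::Type{T}, dims::Integer...) where {T} =
    unsafe_wrap(Array{T}, Ptr{T}(Libc.calloc(prod(dims), sizeof(T))), dims; own = true)
```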
# Windows

julia> @btime zeros_via_calloc(Float64, 1024, 1024);
  12.400 μs (2 allocations: 8.00 MiB)

julia> @btime Array{Float64}(undef, 1024, 1024);
  12.500 μs (2 allocations: 8.00 MiB)

julia> @btime zeros_via_calloc(Float64, 1024, 1024) .+ 1;
  3.244 ms (4 allocations: 16.00 MiB)

julia> @btime zeros(Float64, 1024, 1024) .+ 1;
  3.742 ms (4 allocations: 16.00 MiB)
  
 # Windows Subsystem for Linux 2

julia> @btime zeros_via_calloc(Float64, 1024, 1024);
  205.700 μs (2 allocations: 8.00 MiB)

julia> @btime Array{Float64}(undef, 1024, 1024);
  13.289 μs (2 allocations: 8.00 MiB)

julia> @btime zeros_via_calloc(Float64, 1024, 1024) .+ 1;
  811.800 μs (4 allocations: 16.00 MiB)

julia> @btime zeros(Float64, 1024, 1024) .+ 1;
  1.308 ms (4 allocations: 16.00 MiB) 

If zeros(dims) becomes essentially as fast as Array{T}(undef, dims), then a convenience constructor for the latter is no longer needed.

Switching to calloc would be pretty trivial, so if someone can clearly demonstrate an advantage without major disadvantages (that requires more than a single, simple benchmark), I'd change my tune.

calloc was originally brought up in #130 in 2011 and attempted in #22953 in 2017. Unfortunately, it appears to be more than trivial, in part because of the desire to align the memory allocation and to integrate that alignment into garbage collection.

I've seen enough samples across Windows, Linux, and Mac showing that a calloc-based zeros comes out ahead of the current malloc-and-memset zeros on average across all three operating systems, even after considering write and read operations on the array. This does not appear to have been in dispute since 2011.

Array{T}(undef, sz...) generalizes to MySpecialArray{T}(undef, sz...). undefs does not. I think it's better to push people to generalizable syntax that teaches them to be sophisticated Julia users & developers, and not pander to lessons learned from painfully limited programming languages.

zeros is not my favorite Julia function, and ones is even worse (it should be oneunits which is awful).

Part of this is a false dichotomy. The existence of ones, zeros, and undefs does not preclude the generalizable Array{T}(undef, sz...) syntax, or vice versa. I want to be able to teach Julia in an Introduction to Programming setting without having to get into the nuances of type parameters and singletons on the first day, and explaining what Array{T}(undef, sz...) does requires some explanation of those concepts. That syntax becomes worthwhile when you actually do need a different array type. I also do not think that all users of Julia necessarily need to become sophisticated users and developers; not everyone wants to spend the time to develop that sophistication. In the spirit of the two-language problem, Julia can be both a simple-to-use language and a sophisticated one.

That said, we may need to consider how we might separate the language into a "Convenience API" and a "Sophisticated API". Perhaps there should be another default module, Convenience, where we put methods like ones, zeros, and possibly undefs, while reserving Base for the one true syntax. In some situations, such as within Base itself, the Convenience module would not be imported.

Presently the issue I am trying to address is that we have a convenience API that is slower than the sophisticated one for a common task, performant array creation. It is particularly slower on Windows than Linux, which is an important aspect where there are few developers looking into these differences. It's harder for me to evangelize for Julia when this is the case. I end up having to say something along the lines of "Yes, Julia can be quite fast, but you have to use this sophisticated syntax that I do not have time to explain at the moment. Also, you may need to switch operating systems".

One development in light of calloc is that I now see a clear case for a ZeroInitializer with a singleton zeroinit that could be used via syntax like Array{T}(zeroinit, dims...). The use of zeroinit would indicate that the array should be allocated via calloc rather than malloc, whenever that is implemented.
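
By analogy with Base's UndefInitializer/undef pair, such an initializer could be sketched as follows (hypothetical; none of this exists in Base, and the names mirror the proposal above):

```julia
# Hypothetical sketch, mirroring how UndefInitializer/undef are defined in Base.
struct ZeroInitializer end
const zeroinit = ZeroInitializer()

# A constructor taking the initializer could dispatch to a calloc-backed
# allocation instead of malloc + memset:
function Base.Array{T}(::ZeroInitializer, dims::Integer...) where {T}
    unsafe_wrap(Array{T}, Ptr{T}(Libc.calloc(prod(dims), sizeof(T))), dims; own = true)
end
```

In a real implementation this would live inside Base's allocator rather than wrapping Libc.calloc, so that alignment and garbage-collection accounting are handled correctly.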

@Seelengrab
Contributor

Some breadcrumbs for future people:

6 participants