Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

RFC: Get rid of #undef and replace it with null in Array{Union{Null, T}} #23721

Closed
nalimilan opened this issue Sep 15, 2017 · 21 comments
Closed
Labels
arrays [a, r, r, a, y, s] breaking This change will break code

Comments

@nalimilan
Copy link
Member

I mentioned this idea at #16029 (comment) but I thought this proposal entails large enough changes that it deserves its own issue. The idea is to get rid of #undef entries in non-isbits arrays, and replace it with null(from #23642). This would have several advantages:

  • Use a single representation for missing values everywhere: currently #undef is kind of weird since you can get this value, but never set it again, and having multiple notions of missing/uninitialized value increases the complexity of the language.
  • Guarantee that an Array{T} always contains valid values of type T: currently there's always the possibility that indexing into a non-isbits array throws an UndefRefError, which is akin to the "billion dollar mistake" (though in a less severe form since it only affects arrays).

Concretely, in order to be able to set uninitialized values to null, we must force all uninitialized array constructors to create Array{Union{Null, T}} rather than Array{T} objects. For non-isbits arrays, this is not a problem since there is room to store the type tag already, or a NULL pointer can be translated to null as a built-in special case. For isbits arrays, with @quinnj's recent work to optimize Unions (#22441), the performance impact is limited: it is equivalent to allocating an Array{T} plus an Array{UInt8} (the latter storing the type, i.e. whether the value is initialized or not). Moreover, with Union{Null, T} the type tag for Null is equal to 0, which means the Array{UInt8} part can be allocated directly and efficiently using calloc. But of course for isbits arrays it would be more common to create an initialized Array{T} directly using zeros or fill (which would be the recommended constructors).

Once all entries in an unitialized Array{Union{Null, T}} have been replaced with valid objects, the array can be turned into an Array{T} via a simple call to convert. This never requires making a copy, so it's very efficient. Then the Array{T} can safely be passed to functions which expect it to contain only valid T objects, which is now always guaranteed.

If it turns out that creating the Array{UInt8} type tag part is too costly in some very particular cases, an unsafe_array function could be provided, which would allow creating uninitialized Array{T} objects for isbits types. This should probably not be allowed for non-isbits types (though if it was the current behavior of #undef could possibly be retained, with the expectation that most users would never see it).

Since it forces filling arrays with valid values on construction, this proposal effectively amounts to forbidding uninitialized arrays (#9147).

@nalimilan nalimilan changed the title RFC: Get rid of #undef and replace it with null in Array{Union{T, Null} RFC: Get rid of #undef and replace it with null in Array{Union{Null, T}} Sep 15, 2017
@quinnj
Copy link
Member

quinnj commented Sep 15, 2017

@vtjnash and I discussed this at JuliaCon (getting rid of #undef). I'm not sure about the Union{T, Null} approach, as it seems to imply every array would potentially cost an extra byte per element. Jameson and I talked about just getting rid of the uninitialized Array constructors (i.e. Vector{T}(n)), so you'd be left w/ fill(val) to initialize arrays. My impression is that it actually wouldn't be that crazy of a deprecation/removal, as you almost always know a value to initialize with.

While it would certainly be nice to reduce language complexity w/ one less form of "undef"/"missinginess", we mainly discussed this in the context of #18632, which has been a long-desired optimization. While that PR has been deferred due to being "an optimization", we should still consider any deprecations that might be needed in order to facilitate those optimizations.

@nalimilan
Copy link
Member Author

nalimilan commented Sep 15, 2017

I think "getting rid of the uninitialized Array constructors" is essentially equivalent to this proposal, since then you have no way to create an "empty" Array{T} except by creating an Array{Union{Null, T}} and converting it once it's fully initialized. So (as I say in the description) in most cases you'd rather use zeros or fill if you know a reasonable initial value and it's not too costly to create.

@quinnj
Copy link
Member

quinnj commented Sep 15, 2017

Oh, one other thing I'd mention is that with nice syntax for Union{T, Null}, it would be a very trivial change to require of users to go from Vector{T}(n) to Vector{T?}(n).

@mbauman
Copy link
Member

mbauman commented Sep 15, 2017

Since #undef values are entirely poisonous, would this change be breaking? I think this could be a feature in 1.x.

@mbauman mbauman added this to the 1.x milestone Sep 15, 2017
@mbauman mbauman added the arrays [a, r, r, a, y, s] label Sep 15, 2017
@nalimilan
Copy link
Member Author

What would be breaking would be removing uninitialized array constructors, which is mostly what this proposal is about. So I don't think the milestone is appropriate.

@mbauman
Copy link
Member

mbauman commented Sep 18, 2017

You're absolutely right - and the representation of the eltype would need to be carefully considered, too.

@mbauman mbauman removed this from the 1.x milestone Sep 18, 2017
@mbauman mbauman added breaking This change will break code triage This should be discussed on a triage call labels Sep 18, 2017
@JeffBezanson
Copy link
Member

If we remove #undef, we also need to talk about undefined variables. Some variables (namely captured mutable variables) are implemented as struct fields, so variables and struct fields need to support the same behaviors with respect to undefined-ness.

My interpretation of the "billion dollar mistake" is the introduction of first-class null references --- i.e. null references you can pass to a function. The problem is that null references propagate, so that a problem can appear far from the source. Immediately raising an error on reading a null reference helps localize the problem. So whatever the merit of replacing #undef with first-class Null values might be, it moves closer to the "billion dollar mistake" rather than farther IMO.

Also, Any includes Null, so would Vector{Any}(n) yield an array full of nulls? I understand wanting to encourage constructors like fill, but replacing an array of undefined references with an array of objects that are technically "defined" but not what you want is not necessarily an improvement.

@StefanKarpinski
Copy link
Member

StefanKarpinski commented Sep 19, 2017

If we remove #undef, we also need to talk about undefined variables.

The implementation could be changed to implicitly use T? or Some{T}? for values that can be uninitialized (depending on whether nothing isa T or not).

My suspicion is that the really problematic part of the billion dollar mistake is not that null is first-class in C/C++/Java, but rather that its exclusion cannot be easily expressed in their type systems. In other words, for any reference type in those languages (pointers in C/C++, objects in Java), you cannot express that definitely don't have a null value, so you always have to be worried that you might have a null instead of a valid reference to a value of that type. In this approach that would still be a problem for Any but assuming Any is the parent type of Void, it wouldn't be a problem for any other abstract types: if you have a Vector{<:Number} you can be sure that there are no nothing values; you could only have nothing values in a Vector{<:Number?}.

@StefanKarpinski
Copy link
Member

I should add that I think we're getting a bit late in the 1.0 timeline to be considering such a fundamental change. However, since it's hard to write code that relies on #undef precisely because it's not a first-class value, we may want to consider if it might be possible to leave ourselves the option of changing this behavior in a non-breaking or minimally breaking way.

@JeffBezanson
Copy link
Member

JeffBezanson commented Sep 19, 2017

Yes, being able to express the exclusion of null mitigates the issue to a large extent, but the point is that raising an error on reading an undefined location is not the billion dollar mistake. In fact it's almost as far as you can get from it --- not only can you express that something is non-null, you're forced to express it in many contexts (i.e. it implicitly puts a ::NonNull on certain operations).

@nalimilan
Copy link
Member Author

Deprecating uninitialized array constructors would clearly be breaking; fortunately it's not terribly hard to do (marking them as unsafe_*). Subtler issues about undefined variables and fields could be less breaking and may require more work.

@JeffBezanson
Copy link
Member

I'm definitely interested in changing the array constructors for #16029. Keeping our current semantics but making it "harder" to get uninitialized arrays is something I could be on board with.

@andyferris
Copy link
Member

andyferris commented Sep 20, 2017

I agree with everything Jeff has said.

The typical "legitimate" use case for #undef is when constructing (in multiple steps) a mutable struct or array (eg make an array and populate the elements in a loop). One way to accommodate this might be to have a "construction" phase followed by a "publishing" step where all refs are validated to be defined (and perhaps that any user defined invariants are followed)? It's a bit of an odd idea, I admit.

@nalimilan
Copy link
Member Author

One way to accommodate this might be to have a "construction" phase followed by a "publishing" step where all refs are validated to be defined (and perhaps that any user defined invariants are followed)? It's a bit of an odd idea, I admit.

That's more or less what this proposal suggests, by requiring a call to convert(Array{T}, x) once initialization is done, which would guarantee early failures if there are nulls.

@andyferris
Copy link
Member

@nalimilan I wonder if I'd prefer something... completely orthogonal to the type system, I suppose. For example, I want my Vector{Any} to only contain nulls where I put them, so if I forget to populate each element explicitly it throws.

@nalimilan
Copy link
Member Author

Yeah, but that system doesn't work for isbits types, where an uninitialized value cannot be distinguished from a valid one. It's annoying to have inconsistent safety features depending on the kind of type you are using (or which the caller may have passed to you).

@ChrisRackauckas
Copy link
Member

I recall reading early threads which wanted the uninitialized array constructors because zeros and ones was far less performant. It seems that's still the case?

@benchmark zeros(1000)

 
BenchmarkTools.Trial: 
  memory estimate:  7.94 KiB
  allocs estimate:  1
  --------------
  minimum time:     455.375 ns (0.00% GC)
  median time:      1.205 μs (0.00% GC)
  mean time:        1.926 μs (25.12% GC)
  maximum time:     24.584 μs (90.79% GC)
  --------------
  samples:          10000
  evals/sample:     144

@benchmark Vector{Float64}(1000)

BenchmarkTools.Trial: 
  memory estimate:  7.94 KiB
  allocs estimate:  1
  --------------
  minimum time:     101.663 ns (0.00% GC)
  median time:      183.116 ns (0.00% GC)
  mean time:        381.337 ns (44.79% GC)
  maximum time:     2.417 μs (68.70% GC)
  --------------
  samples:          10000
  evals/sample:     956

@JeffBezanson
Copy link
Member

Yes, we'd still need a way to get uninitialized arrays. But the syntax for it can change.

@JeffBezanson JeffBezanson removed the triage This should be discussed on a triage call label Sep 21, 2017
@JeffBezanson
Copy link
Member

Conclusion from triage: radical changes to #undef are too much to bite off for 1.0; we should focus on the array constructors.

@nalimilan
Copy link
Member Author

Array constructors are indeed the main source of #undef values, so removing/hiding them would go a long way already. Though I've realized resize! creates #undef too, and it's hard to change.

BTW, regarding the more general discussion, I've bumped into this comment from Erik Lippert (one of the C# creators), which was posted during discussions on Nullable (#19034 (comment)):

Visual Basic 6 had Null (database null), Nothing (invalid object reference), Empty (default value) , Undefined (an uninitialized variant), Missing (the value passed when an optional argument is omitted) and Error types. OMG WHAT A MESS. No one understood the distinctions between them, how they compared, what the semantics were, and so on. I spent so much time in the VBScript runtime getting all this nonsense sorted out and I will never get that time back.

If we had to do it all over again, I think I would want to make the connection between null reference and null value stronger and more consistent. Starting with nullable reference types, then grafting on nullable value types one version later, and then attempting to do non-nullable reference types a decade after that, made for a bit of a mess. I wouldn’t want to multiply the number of kinds of nulls.

And while we’re dreaming, it also would have been nice to be really, really clear about the intended semantics of null. “Unknown”, “missing” and “invalid” all have subtly different semantics but we use null for all of them. (“True or unknown” is definitely true, but “true or invalid” could arguably resolve to invalid.)

@vtjnash
Copy link
Member

vtjnash commented Jun 3, 2021

Union{T,Nothing} does this now

@vtjnash vtjnash closed this as completed Jun 3, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
arrays [a, r, r, a, y, s] breaking This change will break code
Projects
None yet
Development

No branches or pull requests

8 participants