Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Representation of unions #13

Closed
nikomatsakis opened this issue Aug 30, 2018 · 31 comments
Closed

Representation of unions #13

nikomatsakis opened this issue Aug 30, 2018 · 31 comments
Assignees
Labels
A-layout Topic: Related to data structure layout (`#[repr]`) S-writeup-needed Status: Ready for a writeup and no one is assigned

Comments

@nikomatsakis
Copy link
Contributor

Discussing how unions are laid out.

  • Is #[repr(C)] meaningful when applie to a union?
  • When (if ever) do we guarantee that all fields start at offset 0?
  • When (if ever) do we guarantee that all fields have the same address?
  • Any key things to note re: FFI interop?
@nikomatsakis nikomatsakis added the A-layout Topic: Related to data structure layout (`#[repr]`) label Aug 30, 2018
@joshtriplett
Copy link
Member

#[repr(C)] is meaningful (it guarantees that the union will have the same layout as an equivalent C union would); we do need to determine if #[repr(Rust)] wants to diverge from that, though. (Or if we want to guarantee that it'll never diverge in the future.)

Also, some relevant text from C11:
6.5.3.6.6:

One special guarantee is made in order to simplify the use of unions: if a union contains several structures that share a common initial sequence (see below), and if the union object currently contains one of these structures, it is permitted to inspect the common initial part of any of them anywhere that a declaration of the completed type of the union is visible. Two structures share a common initial sequence if corresponding members have compatible types (and, for bit-fields, the same widths) for a sequence of one or more initial members.

6.5.8.5:

All pointers to members of the same union object compare equal.

6.7.2.1.16:

A pointer to a union object, suitably converted, points to each of its members (or if a member is a bit- field, then to the unit in which it resides), and vice versa.

@hanna-kruppe
Copy link

AFAIK the first paragraph you quoted (6.5.3.6.6) is solely about strict aliasing/TBAA, which Rust doesn't do. The other two quotes seem to practically guarantee that all members of the union start at the same offset, and offset 0 at that unless the union foo * -> struct union_member* cast adjusts the pointer. Does that sound right?

@joshtriplett
Copy link
Member

joshtriplett commented Aug 30, 2018

@rkruppe Yes. (Also, I think the "suitably converted" there exclusively means a type conversion, not a value change.)

@avadacatavra avadacatavra added A-layout Topic: Related to data structure layout (`#[repr]`) active discussion topic and removed A-layout Topic: Related to data structure layout (`#[repr]`) labels Aug 31, 2018
@nikomatsakis
Copy link
Contributor Author

we do need to determine if #[repr(Rust)] wants to diverge from that, though. (Or if we want to guarantee that it'll never diverge in the future.)

Seems like we might as well reserve the right for that, although I don't see much motivation. Maybe we should drill in to some of the more specific questions:

  • Do we guarantee that fields begin at offset 0 for #[repr(Rust)] unions?
  • Or that they have the same address?

I am sort of of tempted to do so, because I don't know that there is much practical utility to doing otherwise, but I'd be curious to hear of use cases.

@joshtriplett
Copy link
Member

In the interests of full evaluation of alternatives: the only argument I've heard for doing otherwise would be if we could detect that all the variants in a repr(Rust) union had "holes" in their representations, and then arrange those representations within the union such as to give the overall union a "hole". We could only do so if we 1) allowed flexibility in representation for repr(Rust) unions, and 2) prohibited repr(Rust) unions from containing arbitrary unknown bit patterns not expressed by the field types (which repr(C) unions have to allow).

I don't believe we should do either of those things, but I wanted to mention the arguments for doing so for completeness.

@hanna-kruppe
Copy link

hanna-kruppe commented Sep 5, 2018

That's an interesting line of thought! However, what could we actually use these "holes" for? I assume you're referring to padding. I'm not aware of any way to stash discriminants or other data in a contained type's padding. Any write or copy is allowed to omit or clobber padding bytes at random. For example, suppose Result<(u8, u32), u8> wanted to place the discriminant or the u8 in the padding of the tuple, this breaks as soon as someone takes a mutable reference to the tuple and writes to it.

@gnzlbg
Copy link
Contributor

gnzlbg commented Sep 5, 2018

Do we guarantee that fields begin at offset 0 for #[repr(Rust)] unions?

Do we want #[repr(Rust)] unions? What do they allow ? E.g. I would be fine with just requiring that all unions must be #[repr(C)] for now, adding a warning for non #[repr(C)] unions, and maybe in the 2018 edition turning that into an error.

We can then, at some point, re-consider adding #[repr(Rust)] unions, with suitable motivation. I am not saying that they would be useless, but that if we are going to specify them exactly the same as #[repr(C)], we don't need both, and not being able to use them in C-FFI would be a downside of repr(Rust) union w.r.t. repr(C) here.

We don't have to allow all kinds of types for all reprs, and doing so makes us waste time in the specification of each repr.

@joshtriplett
Copy link
Member

joshtriplett commented Sep 5, 2018

@gnzlbg We already have repr(Rust) unions in stable, and people are actively using them. People want to be able to build the space-efficient data structures they allow, and similar.

@joshtriplett
Copy link
Member

@rkruppe I'm not talking about padding. I'm talking about things like enums and bool not using all the bits in their representation.

If I have a repr(Rust) union of a bool and a three-variant enum, and I then wrap that in an Option, could that fit in one byte?

Again, I don't think that we should do that, but people have suggested doing so.

@hanna-kruppe
Copy link

Ah, that makes more sense. I also don't think we should do this, though, I'm in favor of the "unions are bags of uninterpreted bits" approach that we seem to be slowly converging on (e.g. with the disposition to merge rust-lang/rfcs#2514).

@gnzlbg
Copy link
Contributor

gnzlbg commented Sep 6, 2018

We already have repr(Rust) unions in stable, and people are actively using them. People want to be able to build the space-efficient data structures they allow, and similar.

My question was more about: how is repr(Rust) for unions different from repr(C) ? AFAICT they are identical with the sole exception that repr(C) unions can be passed to C code via FFI while doing so with repr(Rust) unions is UB. So if boths reprs are going to be identical except for this difference, I'd rather prefer to have one single repr that can be passed to C, and that would be repr(C).

We don't really have to break any code for doing this change. We can just say that unions are repr(C) by default, so that writing repr(C) on them becomes redundant, and that if you want a repr(Rust) union you have to actually write and spell repr(Rust), which is something you can't do now.

@RalfJung
Copy link
Member

RalfJung commented Sep 6, 2018

If I have a repr(Rust) union of a bool and a three-variant enum, and I then wrap that in an Option, could that fit in one byte?

I think this is part of the next discussion, the one about validity invariants: We first have to decide which bit patterns are valid for a union; once that is fixed, if the decision rules out some bit patterns as invalid, we can talk about layout optimizations.

@hanna-kruppe
Copy link

That's a good point, but it's also inconvenient since pretty much everything that there is to decide about union layout now depends on the validity invariant =/

@RalfJung
Copy link
Member

RalfJung commented Sep 6, 2018

Well, there are still the things @joshtriplett brought up above.

But yeah, while for enums layout came first and now we retrofit the validity invariant, it seems more reasonable to me to proceed the other way around with unions.

I guess one question we could discuss here, that could inform that validity discussion, is: Is there a strong need for layout optimizations on unions? We know for sure that some unions should not get layout optimized (MaybeUninit, for example). Would it be a noticeable loss to never do layout optimizations for unions? If yes, that might be a motivation to have a non-trivial validity invariant as proposed by @petrochenkov.

My personal stanza on this is that if someone really wants their union layout optimized, we should provide attributes to let them do that -- attributes that could also be used on struct, to e.g. make NonNull* types definable by a library. We shouldn't use the field types for this, because unions are often used to "get around" the type system so those field types might not be as reliable as they are in other places.

@hanna-kruppe
Copy link

Well, there are still the things @joshtriplett brought up above.

Which ones? The C standard quotes just more or less confirm what I think everyone agrees on for repr(C) layout, and as far as I can see all the points on repr(Rust) are driven by layout optimizations that interact with the validity invariant.

My personal stanza on this is that if someone really wants their union layout optimized, we should provide attributes to let them do that -- attributes that could also be used on struct, to e.g. make NonNull* types definable by a library. We shouldn't use the field types for this, because unions are often used to "get around" the type system so those field types might not be as reliable as they are in other places.

+1

@petrochenkov
Copy link

Is there a strong need for layout optimizations on unions?

My primary use case was ManuallyDrop, but it was converted to a struct recently.

@nikomatsakis
Copy link
Contributor Author

So, this seems like a place where we can "describe the controversy", rather than laying out specific rules. I think I could summarize the comments like so thus far (let me know if I am missing something):

  • #[repr(C)] unions are guaranteed to be layout and ABI compatible with C unions
    • presuming their interior types share the same guarantee etc
  • the default layout of unions is not yet fixed, because we are trying to figure out if we want room to do layout optimizations
    • we'll be able to figure this out after we work through the validity invariants
    • but we currently know of no strong use cases for unions to have layout optimizations and we would like people to send us examples

Personally, I lean towards the view that we should not assume anything about the bits of a union until we see an actual access (which then gives us the type we can use). This seems to imply that we can't do layout optimizations, because we can't be sure that the union is even initialized, and all layout optimizations rely on a notion of initialization. (But perhaps this is overly strict?)

@joshtriplett
Copy link
Member

@nikomatsakis That would be my preference. And even then, the compiler can't necessarily keep assuming that type over time.

@alercah
Copy link

alercah commented Sep 11, 2018

I think the idea here is that you may be able to do layout optimizations provided that you can find some guarantee that applies to all members? I think the "bag of bits" thing is a bit of a red herring here, although it does depend on the exact decisions we make about uninitialized memory.

For instance, given union<T, U> TwoPointer(&T, &U);, it seems to me that it ought to be possible to null-pointer optimize Option<TwoPointer<T, U>>, because the set of valid representations of a union ought to be the union of the set of valid representations of its fields.

Now that I think about it more, I don't think there's a sensible way for the compiler to offset fields within the union in order to achieve this result more generally. The niches into which we perform enum optimizations are quite different from padding holes, which we have discussed already cannot be assumed to remain constant (and requiring that may force undersized assignments, which may be a pessimization). Without going to too much detail right now, I'm pretty sure we cannot make use of this unless, as Niko floated for enums, we have some way to insist that you cannot bind references to individual fields.

Furthermore, I think we're converging on saying that a single-field struct always lays its field at offset 0. If that is the case, then the same should probably apply to unions, so we would say that all fields are laid out at offset 0.

This leaves two questions unanswered, however: a) can a union be more strictly aligned than any of its fields and b) can a union have unnecessary trailing padding? As far as I can tell, C answers "yes" to both, though I do not know why. I think this implies that repr(Rust) should also do so, as providing guarantees more strict than the C ABI on the platform seems odd. But I'm also tempted to say that repr(Rust) may not match the C ABI exactly, in that we may choose to omit the padding/stricter alignment.

@nikomatsakis
Copy link
Contributor Author

@alercah

Furthermore, I think we're converging on saying that a single-field struct always lays its field at offset 0. If that is the case, then the same should probably apply to unions, so we would say that all fields are laid out at offset 0.

I don't recall us saying anything about this in the struct chapter. Perhaps I just missed it. In any case, it seems like something we might want to add into #31.

@RalfJung
Copy link
Member

RalfJung commented Oct 5, 2018

For instance, given union<T, U> TwoPointer(&T, &U);, it seems to me that it ought to be possible to null-pointer optimize Option<TwoPointer<T, U>>, because the set of valid representations of a union ought to be the union of the set of valid representations of its fields.

The latter part of your statement is not correct. We should allow things like this. The union is not valid for any of its variants after the assignment to f.y.1, but that code should remain legal.

So, if anything we'd have to do that union bytewise. I'd rather if we didn't. Unions are bags of bits with names to access some offsets, end of story. That's already complicated enough to use that I wouldn't want to further complicate the story by adding layout optimizations to the mix.

If people show compelling use-cases for layout-optimized unions, I'd suggest we work towards stabilizing something like rust-lang/rust#54032. That covers your union-of-two-references.

@nikomatsakis nikomatsakis added the S-writeup-needed Status: Ready for a writeup and no one is assigned label Oct 11, 2018
@RalfJung
Copy link
Member

I heard people want me to do the writeup. Assigning to me then.

I didn't hear of any deadline though... ;)

@RalfJung RalfJung self-assigned this Oct 11, 2018
@gnzlbg
Copy link
Contributor

gnzlbg commented Oct 12, 2018

What is the layout of union variants when the repr of the different variants differ? E.g.:

union U {
    a: __m128, // repr(simd)
    b: (),
}

Currently, U.a appears to have array layout instead of repr(simd) layout.

@hanna-kruppe
Copy link

hanna-kruppe commented Oct 12, 2018

Memory layout is no different. Calling convention details are different (and that factors into the relatively superficial difference in the IR we produce observed over there in that PR), but I see no reason to specify those for repr(rust) unions, as they are irrelevant outside of FFI which one should use repr(C) for anyway.

@RalfJung
Copy link
Member

So the distinction between[...] and <...> in LLVM should just affect calling conventions, but it does affect more, and that's the problem in that PR?

@hanna-kruppe
Copy link

No, arrays vs vectors is a quite important distinction for the IR, but none of those differences except ABI lowering affect the sort of visible behavior we are documenting here.

@RalfJung
Copy link
Member

To summarize the discussion that happened here, the consensus seems to be that repr(C) unions have all their fields at offset 0, and for repr(Rust) that's likely the most sensible option but we'll have to await the discussion about validity invariants for unions to rule out use-cases like #13 (comment). Any objections?

@nikomatsakis nikomatsakis assigned RalfJung and unassigned RalfJung Oct 23, 2018
@nikomatsakis
Copy link
Contributor Author

@RalfJung sounds great to me! Do you think you can get a write-up done by this Thursday? Would be good to have somethng by the meeting. =)

@RalfJung
Copy link
Member

Done: #39

Feels rather short, but what else is there to say?

@gnzlbg
Copy link
Contributor

gnzlbg commented Nov 20, 2018

It would be useful to use unions like MaybeUninit<T> in C FFI, but for that MaybeUninit<T> would need to have the same repr as T. Currently we don't have anything like repr(transparent) for unions, and for the case of MaybeUninit<T> which is an union with two variants, it's unclear to me whether something like repr(transparent) would work.

@RalfJung
Copy link
Member

Turns out regex relies on repr(Rust) unions having their field at offset 0:

https://github.com/rust-lang/regex/blob/172898a4fda4fd6a2d1be9fc7b8a0ea971c84ca6/src/vector/ssse3.rs#L80-L83

I bet they are not the only ones...

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A-layout Topic: Related to data structure layout (`#[repr]`) S-writeup-needed Status: Ready for a writeup and no one is assigned
Projects
None yet
Development

No branches or pull requests

8 participants