Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add unaligned volatile intrinsics #52391

Merged
merged 1 commit into from
Jul 25, 2018
Merged

Conversation

Amanieu
Copy link
Member

@Amanieu Amanieu commented Jul 14, 2018

Surprisingly enough, it turns out that unaligned volatile loads are actually useful for certain (very niche) types of lock-free code. I included unaligned volatile stores for completeness, but I currently do not know of any use cases for them.

These are only exposed as intrinsics for now. If they turn out to be useful in practice, we can work towards stabilizing them.

r? @alexcrichton

@rust-highfive rust-highfive added the S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. label Jul 14, 2018
@hanna-kruppe
Copy link
Contributor

hanna-kruppe commented Jul 15, 2018

lock-free code

Huh? Volatile accesses don't synchronize, so how can these intrinsics not be a source of data races in this context?

@Amanieu
Copy link
Member Author

Amanieu commented Jul 15, 2018

@rkruppe I have a large array of AtomicU8, each byte representing one bucket in the hash table. I want to perform an unaligned load of a __m128 from this array (movups) so that I can scan 16 bytes of the table at once.

Each individual byte is independent and there is no memory ordering requirement between them. I use an Acquire fence after the load to enforce memory ordering.

The load needs to be volatile because the underlying bytes may be concurrently modified by other threads as they are being read. volatile guarantees that the compiler will only read from the memory once and will not assume that the data at that address stays the same.

@hanna-kruppe
Copy link
Contributor

I'm fairly certain that's a data race, I don't see how the later fence can help. The "right" way would be an unaligned atomic load, but that's rarely supported and especially not for anything larger than word size.

Because of that, I am actually skeptical whether movups is even doing what you need it to do here at the hardware level (leaving aside all the additional restrictions imposed by Rust and LLVM): is there any guarantee by the ISA that it will actually be a single atomic load in all circumstances, even e.g. if it straddels a cache line? I'm not super familiar with this area, but the architecture manual specifically says (vol. 3, §8.2.3.1):

The Intel-64 memory-ordering model guarantees that, for each of the following memory-access instructions, the constituent memory operation appears to execute as a single memory access:

  • Instructions that read or write a single byte.
  • Instructions that read or write a word (2 bytes) whose address is aligned on a 2 byte boundary.
  • Instructions that read or write a doubleword (4 bytes) whose address is aligned on a 4 byte boundary.
  • Instructions that read or write a quadword (8 bytes) whose address is aligned on an 8 byte boundary.

[...]

Other instructions may be implemented with multiple memory accesses. From a memory-ordering point of view, there are no guarantees regarding the relative order in which the constituent memory accesses are made. There is also no guarantee that the constituent operations of a store are executed in the same order as the constituent operations of a load.

So it would appear that even movaps isn't atomic, and even if it were, atomic accesses must generally be aligned.

@Amanieu
Copy link
Member Author

Amanieu commented Jul 15, 2018

@rkruppe I don't need the full 16 byte load to be atomic: it's fine if each of the 16 bytes is loaded individually in an arbitrary order.

In other words, I want to perform 16 individual relaxed AtomicU8::load, but with a single instruction.

@hanna-kruppe
Copy link
Contributor

I see, thanks for clarifying. But at the same time I still don't see how the Intel language could be read to guarantee that movups will behave like 16 atomic byte-sized accesses. It seems that at minimum you'd have to make some additional assumptions about how the smaller split-up accesses behave. More generally, you're in a very dark corner of weak memory models. For example, I've recently learned that virtually all academic work on the subject has even ignored overlapping aligned accesses of different sizes! I am very reluctant to believe any of this works in the absence of an explicit affirmative statement by Intel or other authorities.

In any case, even if this turns out to be guaranteed by Intel, I still don't think a volatile access + acq fence is sufficient to rule out language/IR level UB. I would recommend side-stepping the issue by using movups explicitly via inline assembly.

@alexcrichton
Copy link
Member

Given the thoughts from @rkruppe is this still needed for your use case @Amanieu?

@Amanieu
Copy link
Member Author

Amanieu commented Jul 16, 2018

But at the same time I still don't see how the Intel language could be read to guarantee that movups will behave like 16 atomic byte-sized accesses. It seems that at minimum you'd have to make some additional assumptions about how the smaller split-up accesses behave.

16 separate byte loads with relaxed ordering is literally the weakest restriction that can possibly put on such a load: you cannot split it up any further, and there is no restriction on the order that the individual bytes are loaded. If the hardware loads multiple bytes at once then that only makes the access stronger, not weaker.

In fact, this is exactly what the Intel documentation states:

Other instructions may be implemented with multiple memory accesses. From a memory-ordering point of view, there are no guarantees regarding the relative order in which the constituent memory accesses are made.

The reason I can't just use a normal unaligned load is that the underlying bytes can be changed by another thread at any time, even in the middle of a load. By default, the compiler assumes that normal memory accesses are data-race free and two consecutive loads will always return the same value, which is not the case here.

The use of the volatile keyword removes the data-race free requirement and guarantees that the data is only read once (the Linux kernel actually relies on this for its READ_ONCE macro). This means that the volatile read could return a split value (i.e. half of the bytes from before a store and the other half from after that store), but this isn't a problem for me because I only care about individual bytes (see above).

@alexcrichton So yes, I still need this for my use case. We aren't stabilizing anything here, I would just like to have this available so that I can experiment on nightly.

@hanna-kruppe
Copy link
Contributor

16 separate byte loads with relaxed ordering is literally the weakest restriction that can possibly put on such a load: you cannot split it up any further, and there is no restriction on the order that the individual bytes are loaded. If the hardware loads multiple bytes at once then that only makes the access stronger, not weaker.

This assumes byte-sized accesses are the smallest possible access at the relevant hardware levels, which seems natural but doesn't appear to be explicitly stated anywhere in the memory model document. There may also be other implicit assumptions I don't see right now. This all sounds a bit remote, admittedly, but as I said, this is an extremely dark and obscure corner of memory models to begin with. I wouldn't be surprised if none of the people who wrote that document have ever actually thought about this specific scenario.

In any case, let's leave the ISA details behind us, I'll go on assuming you are right. Even then I object to this intrinsic.

The use of the volatile keyword removes the data-race free requirement [...]

This is incorrect. Volatile accesses don't do anything like that, they have "no cross-thread synchronization behavior" (LLVM LangRef) and there is no exception carved out for volatile accesses in the Rust-level statement "data races are UB". In LLVM there is a vague bullet point that the byte value read by a volatile read "is target dependent", but this is justified by referencing signal handlers and "addresses that do not behave like normal memory", with no mention of cross-thread communication through ordinary memory.

While volatile is always murky and partially implementation-defined and compilers tend to err on the side of caution, it is factually incorrect that volatile side-steps memory ordering concerns.

[...] and guarantees that the data is only read once (the Linux kernel actually relies on this for its READ_ONCE macro)

Yes, a volatile access is (if possible) lowered to a single load or store, but that has nothing to do with data race freedom. Furthermore, I don't care much for precedent from the Linux kernel when it comes to matters of standard/rules lawyering. The kernel is very happy to rely on GCC implementation-defined behavior, and even altering GCC's default behavior when they don't like it (e.g., strict aliasing rules, or null dereferences being UB).

So yes, I still need this for my use case.

Again, you can use inline assembly (with the volatile specifier) to get the instruction you want, and be in the clear w.r.t. Rust and LLVM rules. You've been reasoning in terms of movups and the Intel memory model, so even besides UB concerns it seems like it would be more natural to just request a movups in your source code rather than obscure your intent with this intrinsic.

And as a bonus, we won't have this huge footgun lying around. Concurrency aside, on quite a few ISAs unaligned loads and stores don't even exist, so anyone (correctly) using volatile to e.g. do MMIO (where the "single instruction" rule would be relevant) would be screwed if they tried to use the unaligned variant for that.

@Amanieu
Copy link
Member Author

Amanieu commented Jul 16, 2018

This is incorrect. Volatile accesses don't do anything like that, they have "no cross-thread synchronization behavior" (LLVM LangRef) and there is no exception carved out for volatile accesses in the Rust-level statement "data races are UB". In LLVM there is a vague bullet point that the byte value read by a volatile read "is target dependent", but this is justified by referencing signal handlers and "addresses that do not behave like normal memory", with no mention of cross-thread communication through ordinary memory.

While volatile is always murky and partially implementation-defined and compilers tend to err on the side of caution, it is factually incorrect that volatile side-steps memory ordering concerns.

From your link, you can see that if R is not atomic and R_byte may see more than one write, then it falls through to the last bullet point and R_byte returns undef. This is what I mean by a data race, and this is what I am trying to avoid.

However this does not apply when the access is volatile, instead the result is target-dependent. In practice, this means that the result is whatever would be returned by a load instruction on that target, which is never undef (it will just pick one of the previous writes arbitrarily).

This is of course separate from any memory orderings on the load itself (e.g. load(Acquire)), for which I can just use an explicit fence(Acquire) instead.

Again, you can use inline assembly (with the volatile specifier) to get the instruction you want, and be in the clear w.r.t. Rust and LLVM rules. You've been reasoning in terms of movups and the Intel memory model, so even besides UB concerns it seems like it would be more natural to just request a movups in your source code rather than obscure your intent with this intrinsic.

I want to use this on more than just x86, I only used __m128 as an example. On ARM and AArch64 I am using this same intrinsic to perform a volatile unaligned u64 load, which allows me to load 8 bytes at once.

And as a bonus, we won't have this huge footgun lying around. Concurrency aside, on quite a few ISAs unaligned loads and stores don't even exist, so anyone (correctly) using volatile to e.g. do MMIO (where the "single instruction" rule would be relevant) would be screwed if they tried to use the unaligned variant for that.

For architectures without unaligned memory access, LLVM will lower this to a sequence of single byte load. Also I disagree that this is a huge footgun, we are only adding an unstable intrinsic in the PR. We are not stabilizing anything yet.

@hanna-kruppe
Copy link
Contributor

From your link, you can see that if R is not atomic and R_byte may see more than one write, then it falls through to the last bullet point and R_byte returns undef. This is what I mean by a data race, and this is what I am trying to avoid.

That's not the Rust definition of data races though, so not relevant unless Rust defers to LLVM here. But it's not even clear whether Rust is using LLVM's definition of happens-before etc. or C11's or C++11's or something subtly different, much less whether we're treating volatile accesses the same way. I mostly referenced the LLVM document to point out that even there, doing cross-thread synchronization with target ISA semantics is explicitly not the intended use of volatile accesses.

This is of course separate from any memory orderings on the load itself (e.g. load(Acquire)), for which I can just use an explicit fence(Acquire) instead.

Again, I don't see how the later fence can establish any happens-before for the earlier load. Furthermore, your argument up to this point seemed to be that happens-before doesn't matter because your load is going straight to the target hardware semantics, so why are we back to caring about happens-before?

(For that matter, wouldn't you also need the writes to be volatile to ensure they work by target machine semantics?)

I want to use this on more than just x86, I only used __m128 as an example. On ARM and AArch64 I am using this same intrinsic to perform a volatile unaligned u64 load, which allows me to load 8 bytes at once.

There, too, you will have to reason about the target ISA's memory model to verify whether your algorithm is correct. So it seems like a feature, not a bug, that you'd need a separate bit of asm for every architecture.

For architectures without unaligned memory access, LLVM will lower this to a sequence of single byte load.

Yes, that's precisely why it would be completely wrong to use for anyone who wants to produce a single machine level memory access.

Also I disagree that this is a huge footgun, we are only adding an unstable intrinsic in the PR. We are not stabilizing anything yet.

You'd be using it, others can too. I'm particularly worried since in other contexts unaligned accesses are a safer default (since they assume strictly less), but here it's different depending on target ISA details.

@pietroalbini
Copy link
Member

Ping from triage! What's the status of this PR?

@Amanieu
Copy link
Member Author

Amanieu commented Jul 23, 2018

The PR is fine and the added functions do exactly what the name says: it's a volatile load/store with an alignment of 1.

You'd be using it, others can too. I'm particularly worried since in other contexts unaligned accesses are a safer default (since they assume strictly less), but here it's different depending on target ISA details.

I don't see how this is any different from volatile_copy_memory which we already have. The only difference is that volatile_copy_memory makes both the load and the store volatile, instead of just one of them.

@alexcrichton
Copy link
Member

Ok I agree with @Amanieu that these intrinsics are unstable and just here for experimentation so it's fine to land them. I don't particularly have an opinion on the usage of them in an atomic situation.

@bors: r+

@bors
Copy link
Contributor

bors commented Jul 24, 2018

📌 Commit 303306c has been approved by alexcrichton

@bors bors added S-waiting-on-bors Status: Waiting on bors to run and complete tests. Bors will change the label on completion. and removed S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. labels Jul 24, 2018
@kennytm
Copy link
Member

kennytm commented Jul 24, 2018

@bors rollup

Mark-Simulacrum added a commit to Mark-Simulacrum/rust that referenced this pull request Jul 24, 2018
…richton

Add unaligned volatile intrinsics

Surprisingly enough, it turns out that unaligned volatile loads are actually useful for certain (very niche) types of lock-free code. I included unaligned volatile stores for completeness, but I currently do not know of any use cases for them.

These are only exposed as intrinsics for now. If they turn out to be useful in practice, we can work towards stabilizing them.

r? @alexcrichton
bors added a commit that referenced this pull request Jul 25, 2018
Rollup of 7 pull requests

Successful merges:

 - #52391 (Add unaligned volatile intrinsics)
 - #52402 (impl PartialEq+Eq for BuildHasherDefault)
 - #52645 (Allow declaring existential types inside blocks)
 - #52656 (Stablize Redox Unix Sockets)
 - #52658 (Prefer `Option::map`/etc over `match` wherever it improves clarity)
 - #52668 (clarify pointer offset function safety concerns)
 - #52677 (Release notes: add some missing 1.28 libs stabilization)

Failed merges:

r? @ghost
@bors bors merged commit 303306c into rust-lang:master Jul 25, 2018
@RalfJung
Copy link
Member

RalfJung commented Aug 2, 2018

The use of the volatile keyword removes the data-race free requirement and guarantees that the data is only read once

That's just wrong. The Linux kernel relies on GCC-specific behavior. volatile has nothing to do with concurrency. It is never correct to use volatile to get rid of data races. The only effect of volatile is to prevent the compiler from duplicating or removing the accesses. It has no bearing on the relative ordering of other, non-volatile accesses (including atomic accesses).

If that is the only motivation for this PR, I think it should get reverted. I mean, I can totally see that unaligned volatile accesses are useful, e.g. for memory-mapped IO registers. They are just never useful for lock-free code.

@ghost
Copy link

ghost commented Aug 2, 2018

@RalfJung

Looks like there's been some confusion. Perhaps we should look at a concrete example in crossbeam-deque (this is an implementation of the Chase-Lev deque):

https://github.com/stjepang/crossbeam-deque/blob/939da6f7be4148b5765fc61c2ebcd34c03711147/src/lifo.rs#L344-L356

So this part of the steal method works in three steps:

  1. We read a value from the shared array at a specific index. This value might be concurrently overwritten by another thread. This is a data race, but it's totally OK because we're going to check whether a data race happened before reading the value! It's the only way to implement the Chase-Lev deque.

  2. We try to "commit" the steal operation by compare-and-swapping an atomic index.

  3. If the CAS succeeded, we can be 100% sure there was no data race in the first step and we can safely return the value. However, if the CAS failed, that means the value we've read is "broken". We must throw it away with mem::forget and go back to the beginning.

So the crucial question here is: How should the value in first step be read?

I'm going to offer three options:

  1. Use ptr::read.
  2. Use ptr::read_volatile.
  3. Cast the pointer *const T to *const AtomicU8, read the value byte-by-byte using load(Relaxed), then transmute the read bytes into a value of type T.

The first one is dubious - maybe it's OK but I wouldn't bet on it.
The second one is okay, at least I hope so.
The third one is definitely okay.

What do you think - is the second option okay?

@RalfJung
Copy link
Member

RalfJung commented Aug 3, 2018

Chase-Lev deque.

Ah, IIRC that one is famous for causing trouble. ;) Sounds somewhat similar to a seq lock to me, which ha a similar problem (doing a read that is racy but checking later).

We read a value from the shared array at a specific index. This value might be concurrently overwritten by another thread. This is a data race, but it's totally OK because we're going to check whether a data race happened before reading the value!

What do you think - is the second option okay?

If it happens through an atomic load (option 3), your reasoning is fine, I think. Can you not use some Atomic type directly for the array? Ah, I guess you cannot because things are not aligned?

If this happens through a non-atomic access, unfortunately data races are insta-UB no matter what you do with the value. The docs say we are following the C11 memory model, and this is how that model works. There are optimizations which are incorrect if you weaken this to "read-write races make the read return undef". (And even if we followed that model, you may need MaybeInitialized because we might not want to have undef outside of that.)

So, just from the reference/standard, option 1 and 2 are equally UB. Whether they lead to actual problems is a different question, but I think we should not play games of the sort "but the compiler will never be smart enough to exploit this" -- in particular not with concurrency where exhaustive testing is pretty much impossible.

@RalfJung
Copy link
Member

RalfJung commented Aug 3, 2018

That said, the only optimization I am aware of that would break this is a redundant read elimination, which is not valid when using volatile loads.

Still, that's not a theorem. The only ways to actually follow the standard are:

  • Chang Rust's concurrency standard to an LLVM-style model of data races: Write-write races are still insta-UB, but read-write races just make the read return undef or poison or so. However, then we'd also have to figure out questions about when it is okay to have undef in your program.
  • Do relaxed atomic reads. I suppose you are thinking of a primitive like
unsafe fn read_atomic_relaxed(ptr: *const u8) -> u8 {
    use std::sync::atomic;
    let aptr = ptr as *const atomic::AtomicU8;
    (&*aptr).load(atomic::Ordering::Relaxed)
}

I could also imagine having load_unaligned and store_unaligned methods in the atomic types, if LLVM has anything like that.

@RalfJung
Copy link
Member

RalfJung commented Aug 3, 2018

I opened a PR to clarify the documentation of the stable volatile pointer methods: #53022

@ghost
Copy link

ghost commented Aug 3, 2018

@RalfJung

Thank you for clarifying, that's very helpful!

Sounds somewhat similar to a seq lock to me, which ha a similar problem (doing a read that is racy but checking later).

Right. Incidentally, we're also facing the same problem in a seq lock here. :)

Write-write races are still insta-UB, but read-write races just make the read return undef or poison or so. However, then we'd also have to figure out questions about when it is okay to have undef in your program.

So in Chase-Lev and seq lock we have read-write races. We could just wrap data in MaybeUninit<T> and use ptr::read_volatile to get a MaybeUninit<T> - that 'd solve the poison problem, right?

The current blockers for this are:

  • Changing Rust's concurrency standard.
  • Unions with non-Copy types are unstable.

Do relaxed atomic reads. I suppose you are thinking of a primitive like

Unfortunately, AtomicU8 is still unstable so we can't do that. :(

What would you suggest to do at this moment? ptr::read_volatile will only probably work, but I don't see any other solution on stable Rust.

@RalfJung
Copy link
Member

RalfJung commented Aug 3, 2018

Right. Incidentally, we're also facing the same problem in a seq lock here. :)

Heh. ;) You might be interested in http://www.hpl.hp.com/techreports/2012/HPL-2012-68.pdf

We could just wrap data in MaybeUninit and use ptr::read_volatile to get a MaybeUninit - that 'd solve the poison problem, right?

Precisely. And then only access the inner value after checking that there was no race.

This solves undef/poison, but not the data race issue.

Unions with non-Copy types are unstable.

I'm working on that. ;)

Unfortunately, AtomicU8 is still unstable so we can't do that. :(

The same goes for the intrinsic you added here, though?

What would you suggest to do at this moment? ptr::read_volatile will only probably work, but I don't see any other solution on stable Rust.

Hm, fair point. Coming from a formal/theory background, I tend to sometimes see things very black -and-white: If the theorem doesn't hold, it's false and all hope is lost. So please stand by while I reconfigure to a more applied researcher... ;)

I suppose using volatile is better than not using it, as it "keeps the compiler away from your code", so to speak.

Other than that, you could... push stabilization of Stuff (TM)? ;)

@Amanieu
Copy link
Member Author

Amanieu commented Aug 3, 2018

The problem with using AtomicU8 is that the compiler is dumb and will issue individual single-byte loads to read the data. If we have a 16-byte struct to read then it will be done using 16 load instructions.

My intent with unaligned volatile loads was to perform a 16-byte load in a single instruction, but without any requirement that the load be atomic (which is the case if the address is unaligned). Where the memory model allows a read to "see" more than one write, each byte of the load is allowed to return the corresponding byte from any one of those writes (different bytes may return data from different writes).

This requires slightly stronger guarantees than just returning undef on a read-write race, since valid data (which might be a byte-level mix from different writes) is returned. AFAIK LLVM's memory model does make this guarantee.

@RalfJung
Copy link
Member

RalfJung commented Aug 3, 2018

My intent with unaligned volatile loads was to perform a 16-byte load in a single instruction, but without any requirement that the load be atomic (which is the case if the address is unaligned).

I understand. However, according to C11 concurrency, and hence in Rust, a non-atomic load racing with an atomic write is undefined behavior. Not "unspecified value" or so, but full UB.

This requires slightly stronger guarantees than just returning undef on a read-write race, since valid data (which might be a byte-level mix from different writes) is returned. AFAIK LLVM's memory model does make this guarantee.

No, it does not:

  • If R is volatile, the result is target-dependent. (Volatile is supposed to give guarantees which can support sig_atomic_t in C/C++, and may be used for accesses to addresses that do not behave like normal memory. It does not generally provide cross-thread synchronization.)
  • Otherwise, if there is no write to the same byte that happens before Rbyte, Rbyte returns undef for that byte.
  • Otherwise, if Rbyte may see exactly one write, Rbyte returns the value written by that write.
  • Otherwise, if R is atomic, and all the writes Rbyte may see are atomic, it chooses one of the values written. See the Atomic Memory Ordering Constraints section for additional constraints on how the choice is made.
  • Otherwise Rbyte returns undef.

So certainly for non-volatile, the result is undef. For volatile, it doesn't specify, but note the part about not providing cross-thread synchronization.

@ghost
Copy link

ghost commented Aug 3, 2018

@RalfJung

For volatile, it doesn't specify, but note the part about not providing cross-thread synchronization.

Not disagreeing with you, but: Relaxed atomic accesses do not provide cross-thread synchronization either (i.e. they don't establish happens-before relationships), so I believe that sentence is about a completely separate issue.


I just took a peek into the C11 standard and it seems you're right about data races (more specifically, "torn" reads) being instant UB, even if the data is never read. So we really need that AtomicU8.

cramertj added a commit to cramertj/rust that referenced this pull request Aug 3, 2018
volatile operations docs: clarify that this does not help wrt. concurrency

Triggered by rust-lang#52391. Cc @stjepang @Amanieu

Should the intrinsics themselves also get more documentation? They generally do not seem to have much of that.
kennytm added a commit to kennytm/rust that referenced this pull request Aug 4, 2018
volatile operations docs: clarify that this does not help wrt. concurrency

Triggered by rust-lang#52391. Cc @stjepang @Amanieu

Should the intrinsics themselves also get more documentation? They generally do not seem to have much of that.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
S-waiting-on-bors Status: Waiting on bors to run and complete tests. Bors will change the label on completion.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

8 participants