-
Notifications
You must be signed in to change notification settings - Fork 13k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
std: Force Instant::now()
to be monotonic
#56988
Conversation
r? @Kimundi (rust_highfive has picked a reviewer for you, use r? to override) |
The job Click to expand the log.
I'm a bot! I can only do what humans tell me to, so if this was not helpful or you have suggestions for improvements, please ping or otherwise contact |
Thanks for fixing this so quickly @alexcrichton! |
cf6b327
to
0f132c6
Compare
let now = cmp::max(LAST_NOW, os_now); | ||
LAST_NOW = now; | ||
Instant(now) | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A mutex seems a bit heavy-handed. Many uses of rdtsc (where available) are for minimal-overhead, thread-local timing of functions.
Wouldn't rdtsc + some atomic ops that prevent things from going backwards be much lighter than potential thread suspension?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree that a full-on mutex is quite a heavy hammer for this use case! I wasn't sure though how to best to minimize the cost here.
The Windows documentation at least "strongly discourages" rdtsc for handling VM migration issues as well as some supposed hardware. If that's the case I think we probably want to avoid that?
I figured it'd probably be best to start from a conservative position and we can always come in later as necessary and try to use atomics and/or different tricks.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I didn't mean to use rdtsc
directly. We can still defer to QueryPerformanceCounter
/clock_gettime
, i.e. the current Instant
implementation which in the end boils down to rdtsc on many x86 systems.
Just tack on a sanity check/correction with atomics instead of a mutex.
I'm mostly concerned that thread-contention might unexpectedly hit people if they litter their code with instants because it used to be fast. I have no concrete examples, just experience with gratuitous use of timing functions that were fast on linux suddenly making an application slow on windows because it decided to not use rdtsc.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh I'm all for adding a more lightweight implementation, do you have one in mind? Instant
has a varying size across platforms, which makes it difficult to select an appropriate atomic and/or more lightweight method
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I had sizeof checks, some type punning and AtomicU128/U64 in mind. Beyond that it would be your standard read, check, CAS. similar to what you're now doing in the lock's critical section, except in a loop.
The mutex would still be needed as fallback if the checks don't work out.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah ok, that's what I thought, and yeah my worry about that is that it wouldn't solve the original problem of monotonic clocks going backwards, so I'm afraid we'd still end up a the solution proposed in this PR.
We, as far as I know, don't have a great handle on how big the errors are.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have thought some more about optimization potential
- we can use relaxed atomics everywhere. Justification: One thread cannot observe another thread's
Instant
s without some external synchronization happening, e.g. other ordered loads and stores. So until those happen only intra-thread ordering is relevant, Relaxed is sufficient for that. - in the good case we only need to do a test and return from the perspective of the main sequence of instructions. There's no dependency on the writes to global state happening, so this should be friendly to instruction parallelism.
- we can limit the XCHG loop by bailing out early if it fails because another thread updated it to a larger value than we are trying. it doesn't prevent the cache-line from bouncing around but at least can allow multiple threads to make progress simultaneously.
It could approximately look like this:
static mut LAST_NOW: AtomicU128 = 0.into();
let last_now = LAST_NOW.load(Relaxed); // insert type punning here
let os_now = time::Instant::now();
if likely(os_now > last_now) {
loop {
match LAST_NOW.compare_exchange_weak(last_now, os_now, Relaxed, Relaxed) {
Ok(_) => break,
Err(x) if x >= os_now => break, // some other thread is ahead of us, no need to update
_ => {}
}
return os_now;
}
return last_now
It's a bit smaller hammer but still not the rubber kind. To soften it further we either need a I (don't) care about broken systems switch somewhere or use better platform detection.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@the8472 yes that's all possible but 128-bit atomics are only available on a few platforms, so we can't use them generally.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah ok, that's what I thought, and yeah my worry about that is that it wouldn't solve the original problem of monotonic clocks going backwards, so I'm afraid we'd still end up a the solution proposed in this PR.
Wouldn't it still make sense to try the cheaper thread-local version first and switch to a full lock if it does turn out to be insufficient? If we directly go to the lock, then we will not be able to determine whether a cheaper thread-local variant would also have been sufficient. @the8472's theory that this arises only due to migration between cores at least sounds very plausible, so I think it would be worthwhile to try this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@nikic I think it's incorrect to avoid the full lock though? If the time is less than a thread-local version than you definitely have to acquire the lock, but even if it's greater than a thread local version you need to check the lock for the global one as well. Right now the bug primarily happens on one thread, but the documented guarantees of this API are that it works across all threads
The job Click to expand the log.
I'm a bot! I can only do what humans tell me to, so if this was not helpful or you have suggestions for improvements, please ping or otherwise contact |
0f132c6
to
4ce8d27
Compare
Is there a reason we're using the "internal" Mutex for the implementation though? I'd kind of expect that we could just use |
@Mark-Simulacrum the |
// * https://bugzilla.mozilla.org/show_bug.cgi?id=1487778 - a similar | ||
// Firefox bug | ||
// | ||
// It simply seems that this it just happens so that a lot in the wild |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Delete 'this'?
Maybe we should get a perf run on this for a system where |
return Instant(os_now) | ||
} | ||
|
||
static LOCK: Mutex = Mutex::new(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we have dead-code elimination in MIR debug builds ? Otherwise LLVM-IR for this code will always be emitted independently of the result of actually_monotonic
, and whether that code will end up generating machine code will depend on the optimization level.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We don't, now, but actually_monotonic
is a function that'll be trivially inlined so LLVM will optimize this away
r? @sfackler |
Hardware Is Bad @bors r+ |
📌 Commit 4ce8d27b0d76157c3c02125c519c08a99c6ef4ed has been approved by |
⌛ Testing commit 4ce8d27b0d76157c3c02125c519c08a99c6ef4ed with merge 1896be8145237e298d75c1b685fd2ae7dea733f5... |
💔 Test failed - status-travis |
The job Click to expand the log.
I'm a bot! I can only do what humans tell me to, so if this was not helpful or you have suggestions for improvements, please ping or otherwise contact |
Tested on commit rust-lang/rust@2f19f8c. Direct link to PR: <rust-lang/rust#56988> 🎉 rls on linux: test-fail → test-pass (cc @nrc @Xanewok, @rust-lang/infra).
} | ||
} | ||
|
||
pub fn actually_monotonic() -> bool { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@edelsonh Can you confirm ppc/ppc64 has a reliable monotonic clock?
Why not switch to |
My understanding of adjtime is that they specifically do not cause any monotonicity problems. That is, adjtime can't be the cause. If we trust manual pages. |
FWIW |
NTP can just yank a clock backwards of the higher stratum side if the delta is over 128ms. https://github.com/ntp-project/ntp/blob/stable/parseutil/dcfd.c#L1059 |
Hm, that should fail with -EINVAL if the new offset is less than the current monotonic offset... If anyone wants to investigate this further, I made a short program to test the different clocks on linux. Specifically, @IntrepidPig seems to be able to reproduce it. |
I know this is a little late, but an atomic umax would be better than the compare-exchange this currently uses. LLVM has the atomicrmw umax instruction and RISC-V and probably other architectures have an instruction for that. Using the atomic umax instruction allows the atomic operation to be executed wherever the memory is cached (TileLink has special support for that) instead of having to move the cached memory, saving time. |
Does it affect |
It would have an atomic op either way. |
@vi |
Hello there, It's a bit late, and I'm more on the system than dev side, but anyway here's some hints / events / configurations on x86 & Linux that may (or may not) help you work on this matter (to be aware of the bad things that can happen). So for Linux, you have multiple timing functions (some are available since a more recent kernel) CS = Clock Source (incremental monotonic counter)
On modern x86, most of the selection work is done in Standard calls for time (so, the non-coarse) means to deliver precise timing, and calling for a time delta between the last timekeeper update and the effective function call. To speedup things, these standard calls also try to use direct hardware: vread_pvclock (para-virtualized) vread_hvclock (hyper-v), vread_tsc() or vendor-supplied timeeking in VMware So, here's how I choose which one depending on the use-case:
some pitfalls
Whatever the OS, at the lower level, timing may be found from multiple hardware sources: TimeStampCounter (TSC)An incremental counter based on each cpu, value read by RDTSC or RDTSCP
High Precision Event Timer (HPET)It's composed of a single counter (fixed frequency), and many comparators that will generate an interrupt when the counter reaches a value they are waiting for. The counter frequency should just be higher than 10MHz. Real Time Clock (RTC)Well, not used anymore... I know you have to cover multiple arches, and OS types, and OS versions, so always keep a skeptical eye when you implement a low-level feature :) A feature I do on all language I use: a timing function/class that asks for what I intend to measure, and choose the right function for me. Maybe you can integrate it directly. |
This commit is an attempt to force
Instant::now
to be monotonicthrough any means possible. We tried relying on OS/hardware/clock
implementations, but those seem buggy enough that we can't rely on them
in practice. This commit implements the same hammer Firefox recently
implemented (noted in #56612) which is to just keep whatever the lastest
Instant::now()
return value was in memory, returning that instead ofthe OS looks like it's moving backwards.
Closes #48514
Closes #49281
cc #51648
cc #56560
Closes #56612
Closes #56940