Sleeping computer causes span timings to be incorrect in browser #852
Comments
Originally reported by @nvolker |
@draffensperger @obecny would like to get your expertise on this. |
One solution would be to use |
Is this only affecting browsers or also node? As far as I know node uses also |
Can you provide more details? Are spans still being created when the computer sleeps, or how does this really affect that? |
Sleep your computer for a few hours. Any spans created after you wake up have timestamps that show a few hours in the past because the clock paused. This behavior continues until the browser is restarted. |
and what system / laptop has that - as I'm working on mac and my computer sleeps every day, browser is not restarted for weeks - I think I would notice that before hmm. |
The machine which had the problem was a mac using chrome. Possibly this is a sleep v hibernate issue? |
ok I see it now too |
Maybe we should raise this as a Chrome bug? |
This issue of someone sleeping the computer, or leaving the tab open to go get coffee, etc. is sort of a basic limitation of measurement in the browser world. If a user starts an operation, shuts their computer, and then comes back 8 hours later and it finishes - then how long did the operation take? Here are a few ideas for how we could deal with this:
I think it would be worth us digging more into the browser specs, docs, etc. to better understand this Chrome on Mac behavior and what was intended by the authors for When I run |
just close laptop for 30 seconds, open and run it, seems like a valid bug |
I think all we have to do is to keep the performance, but then when we export spans we should add delta. |
I think you would have to add the delta at the time that you generate the timestamp. if you do it at export time the whole span will shift if it is started before the computer sleeps |
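The difference between the two correction points can be worked through with numbers. This is only an illustrative sketch: the `RawSpan` shape and both function names are hypothetical, and the drift values are assumed to have been measured elsewhere as system clock minus performance-based time.

```typescript
// Span timestamps as produced by the (possibly paused) performance clock, in epoch ms.
interface RawSpan {
  startMs: number;
  endMs: number;
}

// Export-time correction: one drift value, sampled at export, is added to both ends.
function shiftAtExport(span: RawSpan, driftAtExportMs: number): RawSpan {
  return {
    startMs: span.startMs + driftAtExportMs,
    endMs: span.endMs + driftAtExportMs,
  };
}

// Generation-time correction: each timestamp gets the drift observed when it was taken.
function correctAtGeneration(
  span: RawSpan,
  driftAtStartMs: number,
  driftAtEndMs: number,
): RawSpan {
  return {
    startMs: span.startMs + driftAtStartMs,
    endMs: span.endMs + driftAtEndMs,
  };
}

const HOUR_MS = 60 * 60 * 1000;

// The span starts at time 0 with no drift, runs 2 s, the machine sleeps for 1 h,
// and the span runs 3 s more. The paused clock only sees the 5 s of awake time.
const raw: RawSpan = { startMs: 0, endMs: 5000 };

// Whole span moves one hour forward; the start time is now wrong.
const exported = shiftAtExport(raw, HOUR_MS);

// Start stays where it was; only the end moves past the sleep.
const generated = correctAtGeneration(raw, 0, HOUR_MS);
```

With export-time correction the span keeps its 5 s duration but its start lands an hour late. With generation-time correction the start is preserved and the duration includes the hour asleep.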
@obecny do you know if this affects the CORS preflight timing? |
@dyladan this might affect all, but I think we should create an issue to investigate what is affected and then fix everything, I can take it after finishing grpc |
I tried NodeJS 13.11.0 and Firefox 74.0 on Windows 10 on my Dell E7440. I would not expect that NodeJs processes for servers are executed on notebooks going to hibernate frequently but AWS lambda suspends/freezes the containers used. |
This is a difficult thing to test in lambda as you have so little control over the sleep |
This means there is simply no way to get true wall time on those platform. These have the same appearance:
Though the first is 30 minutes wall time and second is not.
IDK, but it's a common issue. Go has the same problem. |
In opentelemetry JS we use the performance timer which is not affected by adjustments to the system clock, so (2) would have the correct duration. |
Here are the options so far:
For what should be obvious reasons, I do not think we should go with (1). On the call today we discussed options 2 and 3. Option 2 has the advantage of having timings always accurate according to the system clock, but if the system clock changes during a span the timings could be incorrect. Option 3 has the advantage of using a monotonic high resolution clock for duration independent from system clock changes, but may report incorrect span timings if the performance timer pauses during a span. |
Aren't (1) and (3) the same? |
No. With (1) your spans may show as the whole span being in the past even if the system clock is correct, and the duration could be incorrect if the performance clock pauses during the span. With (3) you will get the span start time as the correct time (if the system clock is correct), but the duration could be incorrect if the performance timer pauses during the span. |
My suggestion would be that we try to keep the precision of the monotonic clock where possible, for all aspects of the span, its start/end times and its timed events. This is because it can help people troubleshoot networking issues, or short timings in a UI to have that kind of accuracy (e.g. how long did it take to set up an SSL connection, how long did it take to render these frames, etc.) One idea would be to focus first on detecting the condition and annotating spans about it. That way at least we are beginning to communicate the challenge to users in such a way that they could tell how often it's happening in their own data. Then in a basic way they could just exclude those spans from their queries, etc. Then as the next phase once we have a better understanding of how often this happens and how exactly the different browser clocks relate, we could do some more effective mitigations. Just a thought! |
@draffensperger so you're suggesting allow the span timings to be incorrect, but maybe annotate spans with the offset when the system clock diverges from the performance timer by more than a second or so? |
Correct - my suggestion would be for version 1 of this work, we just detect when the span timings are probably incorrect and we add some extra labels to the span to indicate how we think it's incorrect. For starters it could just be a Maybe version 1.1 of the idea would be to add labels that communicate what we think about the pause e.g. Then in version 2, once we better understand how often this happens, and maybe do some experimentation with the performance.now clock, maybe write some design doc / RFC thing, etc. and get a solid understanding of the limitations of different browsers, time changes, users adjusting their time, etc. then we actually write code to do the correction. I like the idea of shifting the span forward, but it feels tricky to do that well. What if the first part of a larger operation triggers an API call that finishes, but the second part (that they did after their coffee break and resumed their computer) triggers a second API call. Shifting the span forwards or backwards would dislocate it from the backend API spans that it's associated with. So I think it's hard to really win here. But the idea would be that with the version 2, some mitigation strategy, we would still annotate everything we know about the span as best we can. E.g. add a |
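The "detect and annotate" idea could be sketched roughly as follows. The attribute keys `clock.drift_detected` and `clock.drift_ms` are made up for illustration (the label names proposed in the thread are not recoverable here), and the one-second default threshold is taken from the "more than a second or so" suggestion above.

```typescript
type DriftAttributes = Record<string, boolean | number>;

// Returns attributes to attach to a span when the system clock and the
// performance-based clock disagree by at least thresholdMs.
// Attribute names are hypothetical, not an OpenTelemetry convention.
function driftAttributes(
  wallNowMs: number,      // e.g. Date.now()
  perfBasedNowMs: number, // e.g. performance.timeOrigin + performance.now()
  thresholdMs: number = 1000,
): DriftAttributes {
  const driftMs = wallNowMs - perfBasedNowMs;
  if (Math.abs(driftMs) < thresholdMs) {
    return {}; // clocks agree closely enough; leave the span untouched
  }
  return {
    "clock.drift_detected": true,
    "clock.drift_ms": driftMs,
  };
}
```

Timestamps are left unchanged in this version; the attributes only let users find or exclude affected spans in their queries.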
I think another user is being affected by this. This time in a kubernetes setup. It seems like for some reason the performance timer may not be reliable in his setup. I'm not familiar enough with kubernetes to know why this might be behind the scenes. /cc @romilpunetha |
Hey all, wanted to chime in since I've spent several hours highly confused as to why I couldn't query my spans and it sounds like this might be why. It looks to me like Until a few minutes ago, I hadn't restarted Firefox for a couple weeks, but had put my computer to sleep every night. As a result, all my OpenTelemetry traces had timestamps from Dec 11, and I was pulling out my hair trying to understand why I couldn't find my spans in Lightstep, Honeycomb, or Elastic. Restarting Firefox magically solved my problem. Am I correct that this bug is probably to blame? And does that mean this bug is minor for Chrome but kinda huge for Firefox? (I.e., any Firefox user on my site will be invisible in my monitoring tool of choice unless they've recently restarted their browser.) |
I saw an issue similar to this in chrome. I ended up sampling the delta just before initializing the exporter and then adjusting starttimes before upload. I doubt this is enough to actually address all possible permutations of clockskew though :/ |
Most likely
Unfortunately this is also likely correct I tried to tackle this a while ago in #1019. It is unfortunately a tricky problem. I do think it should be solved though. |
I see two related but separate issues here:
Considerations:
Not quite sure of the solution; I suppose it is impossible to completely solve this in general. |
I suspect it may be impossible to truly solve this in general without some server to synchronize clocks. One solution is to use the low-resolution clock for start times, and the high-resolution monotonic clock for durations. This would ensure (on systems whose system time is correct) that spans have close to accurate start times and very accurate durations, while ensuring that system clock changes do not affect span timings. |
I think we have been discussing this for quite long time :). I would be in favor of creating some solution and then validate it. If the only thing that is 100% up to date is |
Is there a way to make this configurable on the client so people can experiment? For example, provide a clock implementation to the opentelemetry API to use to generate timestamps? |
The interface for a clock implementation would have to make assumptions about the final implementation. For instance do we implement |
I think we should differentiate between node and browser regarding this. I don't think we should add unneeded overhead/complexity in node because of hibernate/sleep issues affecting effectively only browsers. Besides that I think the API should offer both, get an absolute timestamp and get a duration. Currently users can provide timestamps for span start/end but they have no possibility via API to use the same timesource as SDK. Having a clock interface in API would improve this. |
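For illustration, a clock abstraction that offers both an absolute timestamp and a duration might look something like this. The `Clock` interface, `systemClock`, and `timeIt` are hypothetical names invented for this sketch; none of them is an existing OpenTelemetry API.

```typescript
// Hypothetical clock abstraction; not an existing OpenTelemetry interface.
interface Clock {
  /** Absolute time in epoch milliseconds. */
  now(): number;
  /** Opaque reading used only for measuring durations. */
  mark(): number;
  /** Milliseconds elapsed since a value returned by mark(). */
  elapsedSince(mark: number): number;
}

// Simplest implementation: the system clock for everything.
// Durations are low resolution and follow system clock adjustments.
const systemClock: Clock = {
  now: () => Date.now(),
  mark: () => Date.now(),
  elapsedSince: (mark) => Date.now() - mark,
};

// Code that needs timing depends only on the interface, so a different
// implementation can be swapped in to experiment.
function timeIt<T>(
  clock: Clock,
  fn: () => T,
): { result: T; startMs: number; durationMs: number } {
  const startMs = clock.now();
  const mark = clock.mark();
  const result = fn();
  return { result, startMs, durationMs: clock.elapsedSince(mark) };
}
```

A browser build could supply an implementation whose `mark()` reads the high-resolution timer, while a Node build keeps whatever is cheapest there; callers of `timeIt` would not change.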
We are actually affected by this bug on Microsoft Azure (confirmed with Azure Functions Node.js v14.18.1 on Windows), we observe time drifts of the span start in the order of 10 seconds relative to the wall clock time. I can also reproduce the issue locally by hibernating my notebook, with Node v15.9.0 on Windows 11. So it seems to not only affect Node 8. |
I think the wall clock time could get out of sync with the high resolution timer for any number of reasons, so many that I think it can be taken for a fact that it will drift if the application is running for any longer amount of time. Examples: Only affecting wall clock time but not high resolution timer:
Affecting high resolution timer:
It seems like You found nodejs/node#17893 already, but even if the offset is calculated correctly at startup, there is just no way to solve this issue with a constant calculated-only-once offset. There is libuv/libuv#1674 for the underlying uv_hrtime API, which has been closed as stale. uv_hrtime is implemented with QueryPerformanceCounter on Windows. https://github.com/libuv/libuv/blob/f250c6c73ee45aa93ec44133c9e0c635780ea741/src/win/util.c#L490-L514, which is definitely liable to the aforementioned time drift. The Linux implementation (actually Linux-specific, other Unixes have different implementations) is more complicated: The generic Unix entrypoint has only one line https://github.com/libuv/libuv/blob/0b1c752b5c40a85d5c749cd30ee6811997a8f71e/src/unix/core.c#L110-L112 calling the specific part with UV_CLOCK_PRECISE https://github.com/libuv/libuv/blob/c40f8cb9f8ddf69d116952f8924a11ec0623b445/src/unix/linux-core.c#L121-L155. This ends up calling clock_gettime(CLOCK_MONOTONIC), which is a libc/POSIX API, described for Linux e.g. here https://man7.org/linux/man-pages/man3/clock_gettime.3.html:
On Linux there would be CLOCK_BOOTTIME which would (supposedly) solve this problem but I believe it is not accessible through Node. Also, there is no equivalent for Windows (unless you want something a bit less precise; though maybe CLOCK_BOOTTIME is also less precise than CLOCK_MONOTONIC). You could look at how other SDKs solve this problem. Python has a different API https://docs.python.org/3/library/time.html#time.time_ns, which ends up in this implementation https://github.com/python/cpython/blob/0ff626f210c69643d0d5afad1e6ec6511272b3ce/Python/pytime.c#L847-L955, but that's probably not really applicable to Node.js as (an is a "a bit less precise" at least on Windows). I think you could implement rather one to one what Java does though: There, each local root span takes a timestamp with both the most precise available wall clock time and the high resolution timer when it starts to calculate the offset. From then on, only the HR timer is used, i.e. relative offsets of local child span start times and all end times are precise. Of course, if a suspension / leap seconds / ... happens during the trace, you still have the time drift, but that is much less likely and the next trace will be correct again. This is implemented with the aptly named https://github.com/open-telemetry/opentelemetry-java/blob/main/sdk/trace/src/main/java/io/opentelemetry/sdk/trace/AnchoredClock.java and this logic at span creation: https://github.com/open-telemetry/opentelemetry-java/blob/16be81aed803e15694de29c9cea25f7bcf4d77c1/sdk/trace/src/main/java/io/opentelemetry/sdk/trace/SdkSpan.java#L151-L182 |
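The per-trace anchoring described above might translate to TypeScript along these lines. This is a loose sketch based only on the description in this comment, not a port of the Java class; `TraceAnchoredClock` is a hypothetical name, and the two time sources are injected so the behaviour can be tried with fake clocks.

```typescript
// One instance is created when a local root span starts and is shared by its
// child spans. Hypothetical sketch; not the Java AnchoredClock itself.
class TraceAnchoredClock {
  private readonly epochAnchorMs: number;
  private readonly monoAnchorMs: number;
  private readonly monoNow: () => number;

  constructor(wallNow: () => number, monoNow: () => number) {
    // Read both clocks once, as close together as possible.
    this.epochAnchorMs = wallNow();
    this.monoAnchorMs = monoNow();
    this.monoNow = monoNow;
  }

  // Epoch milliseconds derived from the anchor plus monotonic elapsed time.
  // Offsets between spans of the same trace keep the monotonic clock's precision.
  now(): number {
    return this.epochAnchorMs + (this.monoNow() - this.monoAnchorMs);
  }
}
```

In a browser the sources would be `() => Date.now()` and `() => performance.now()`. A pause during a trace still skews that trace, but the next root span creates a fresh anchor, so the error does not persist until the process or browser restarts.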
Is this issue fixed now that #3134 has been merged? |
It is waiting on #3259 for the release but yes |
* chore(deps): update dependency @types/node to v16 * fix: typescript issues
While debugging an issue for a user here https://gitter.im/open-telemetry/opentelemetry-node?at=5e6a4b54d17593652b7c8154 it was found that while a computer is slept or hibernated, the performance.now() monotonic clock may be paused. This causes the assumption that performance.timeOrigin + performance.now() ~= Date.now() to be incorrect by some arbitrary amount of time which may be hours or days.
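A minimal sketch of how the broken assumption could be observed. The `TimeSources` shape and the `clockDriftMs` name are hypothetical; the time sources are passed in explicitly so the check can run in a browser, in Node, or against fake values.

```typescript
// Injected time sources (hypothetical shape, for illustration only).
interface TimeSources {
  wallNow(): number;      // epoch milliseconds, e.g. Date.now()
  timeOrigin: number;     // epoch milliseconds, e.g. performance.timeOrigin
  monotonicNow(): number; // milliseconds since timeOrigin, e.g. performance.now()
}

// Difference between the system clock and the performance-based timestamp.
// A positive result means the performance-based timestamp lags behind the
// system clock, which is what a paused monotonic clock would produce.
function clockDriftMs(t: TimeSources): number {
  return t.wallNow() - (t.timeOrigin + t.monotonicNow());
}
```

In a browser the three sources would be `Date.now()`, `performance.timeOrigin`, and `performance.now()`. A result close to zero means the assumption holds; a result of hours or days matches the behaviour reported here.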