Measure between whole iterations of the event loop #1
base: master
Conversation
The original version checked uv_hrtime() against uv_now(handle->loop), which doesn't seem to work -- it doesn't detect e.g. long event loop iterations caused by a large for loop. Checking against the time of the same point in the previous iteration of the event loop does.
17ec69c to bbff77e
Impressive, nice work
Bugger, this doesn't work after all. It can't tell the difference between the event loop being busy-blocked and node being completely idle (no IO events, no timers firing, etc.), i.e. libuv just not scheduling any event loop iterations. :/ Will keep trying
Annoyingly, I can't see a way to do the event-loop-stats thing properly; libuv doesn't give us enough hooks. If I measure from one event loop iteration to the next, I can't distinguish idleness from blocking. If I measure from update-loop-time to check, it fails to detect blockage -- still not sure why, maybe libuv has started updating the loop time elsewhere as well. Think I might have to go back to my original idea of just using js timers...
If you're prepared to have a js timer running with a delay comparable to the delays you're trying to detect, then you also know that the native method will work, because you'll be limiting the max time that can be spent idle.
Good point, could have a js timer (or a native one for that matter, using http://docs.libuv.org/en/v1.x/timer.html ). But what would we actually gain by doing both (eventloop-to-eventloop timing plus a separate interval timer) over just measuring between ticks of an interval timer (which'd avoid the native code and the extra dependency)? Either way you can only be sure about delays >= your interval timer period.
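For the native-timer variant, a minimal sketch (plain libuv; the 500ms period and 50ms reporting threshold are arbitrary, not anything from this PR):

```c
#include <stdio.h>
#include <uv.h>

#define INTERVAL_MS 500

static uint64_t last_tick;

/* Fires every INTERVAL_MS; any excess over the requested period is
   time the event loop couldn't get around to running timers. */
static void on_tick(uv_timer_t* handle) {
  (void)handle;
  uint64_t now = uv_hrtime() / 1000000;
  int64_t lag = (int64_t)(now - last_tick) - INTERVAL_MS;
  if (lag > 50)
    printf("event loop delayed by ~%lldms\n", (long long)lag);
  last_tick = now;
}

int main(void) {
  uv_timer_t timer;
  last_tick = uv_hrtime() / 1000000;
  uv_timer_init(uv_default_loop(), &timer);
  uv_timer_start(&timer, on_tick, INTERVAL_MS, INTERVAL_MS);
  return uv_run(uv_default_loop(), UV_RUN_DEFAULT);
}
```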
I don't think either is a great solution. The UV eventloop could trivially tell you the information, without any penalty, if it wanted to. It would be good to know for sure that we can't make that work before looking at other solutions.
I'm open to suggestions. Here's the core event loop, here's all the hooks we get, here's the order they're called in.

So: we can easily detect a blocked event loop if it's blocked in the calling-timeouts (+ nextTick callbacks) phase, by checking the time from the loop-time update to the prepare handle. The difficulty is if there's a problem in the 'poll for IO' phase, ie in the callback for some IO event. Because if there's no IO, the event loop blocks in that phase. So I can't see any way of telling whether a 5s gap in that phase (ie from the prepare handle to the check handle) is a blocked callback or just an idle loop.

I tried this experimentally (spin in a setTimeout and log what a check handler sees). Am I missing anything?
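To make the phase boundaries concrete, here's a sketch of the two hooks that bracket the poll phase (plain libuv, illustrative names -- not the addon's code). The comment marks exactly where the ambiguity lives:

```c
#include <stdio.h>
#include <uv.h>

static uint64_t prepare_time;

/* prepare: the last hook before the loop enters the poll-for-IO phase. */
static void on_prepare(uv_prepare_t* handle) {
  (void)handle;
  prepare_time = uv_hrtime();
}

/* check: the first hook after the poll phase returns. */
static void on_check(uv_check_t* handle) {
  (void)handle;
  uint64_t poll_ms = (uv_hrtime() - prepare_time) / 1000000;
  /* Ambiguous: a large poll_ms could be an IO callback that spun for
     seconds, or simply no IO arriving and the loop sleeping. libuv
     gives no hook between "IO event arrived" and "its callback ran". */
  printf("prepare -> check: %llums\n", (unsigned long long)poll_ms);
}

int main(void) {
  uv_prepare_t prep;
  uv_check_t chk;
  uv_prepare_init(uv_default_loop(), &prep);
  uv_prepare_start(&prep, on_prepare);
  uv_check_init(uv_default_loop(), &chk);
  uv_check_start(&chk, on_check);
  return uv_run(uv_default_loop(), UV_RUN_DEFAULT);
}
```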
Can we tell the difference by looking at the cpu cycles? We should be able to tell the difference between the process being stalled on blocking I/O versus the process being busy. We're already periodically getting cpu usage (see https://github.com/ably/realtime/blob/master/common/lib/util/cpuusage.js).
Not sure if you're suggesting looking at the cpu info from js code in the eventloop health monitor, or from native code on each tick. If from js, it'd be way too coarse-grained (30s) to correlate anything with any confidence, so presumably native. We could reimplement CpuUsage in c++ using …. So ISTM that that'd take more effort than the interval timer solution, and be a worse heuristic -- at least with the setInterval solution you know the issue is with the current process, not some other one on the same box.
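If we did go native, something along these lines might be the shape of it -- a sketch assuming uv_getrusage() (which libuv does provide); the busy-vs-blocked inference in the comments is exactly the heuristic in question:

```c
#include <stdio.h>
#include <uv.h>

static uint64_t last_wall_ms, last_cpu_ms;

/* uv_timeval_t to milliseconds. */
static uint64_t tv_ms(uv_timeval_t tv) {
  return (uint64_t)tv.tv_sec * 1000 + (uint64_t)tv.tv_usec / 1000;
}

/* Run once per loop iteration from a check handle. */
static void on_check(uv_check_t* handle) {
  (void)handle;
  uv_rusage_t usage;
  if (uv_getrusage(&usage) != 0) return;
  uint64_t wall = uv_hrtime() / 1000000;
  uint64_t cpu = tv_ms(usage.ru_utime) + tv_ms(usage.ru_stime);
  /* cpu delta close to wall delta suggests a busy loop; much lower
     suggests idle or blocking IO. But a noisy neighbour stealing cpu
     also lowers it, which is why this is only a heuristic. */
  printf("wall +%llums, cpu +%llums\n",
         (unsigned long long)(wall - last_wall_ms),
         (unsigned long long)(cpu - last_cpu_ms));
  last_wall_ms = wall;
  last_cpu_ms = cpu;
}

int main(void) {
  uv_check_t chk;
  uv_check_init(uv_default_loop(), &chk);
  uv_check_start(&chk, on_check);
  return uv_run(uv_default_loop(), UV_RUN_DEFAULT);
}
```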
FWIW, even if the changes proposed here do not cover all cases, they seem to be an improvement over the previous stats collection method used in the upstream repo. I suggested incorporating them upstream, and while doing so found a little issue regarding the handling of …
Just trying to understand the discussion a little better: is this the only problem with the approach implemented here? Because it sounds a bit like an "academic" problem. Production apps would then only falsely report a large event loop lag if they do not receive any requests at all, is that right? If that is the case, code using the …
Sure, I'm glad it was helpful, even if it didn't work for us.
This was a while ago, but from memory, I think so, yes.
Perhaps. Not sure that's as easy as it sounds.

For a start, you need the thing reporting the timestamp of incoming calls to be external to the node process, else you have no way of distinguishing an IO event at time X (where node was idle for the previous 500ms) from an IO event that appeared to come in at time X but would have happened at X - 500 if node hadn't been in a busy loop for 500ms. So you now need to proxy all incoming calls through another process. Which also means the correlation either has to be done over IPC, or the timestamps persisted (so every socket read now entails a disk write, which could be a nontrivial extra load even with rollups) and the correlation done in a batch job.

And for all but the simplest apps, tracking all IO events is not necessarily trivial -- for our usecase the list of possible IO events is large enough (rest requests, websocket messages, redis and cassandra queries returning, several internal transports for various purposes (grpc, axon-mq, gossip tcp sockets, ...), a large number of timeouts and interval timers, ...) that trying to track them all to correlate with apparent eventloop delays would be a bit of a fool's errand. You don't want to sink a lot of effort investigating an apparent blocked event loop if it might just be some io event you forgot to correlate.

We ended up just going with a simple pure-javascript timer, with an interval of a few hundred ms.
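Roughly like this (a sketch -- the exact interval, threshold, and reporting are illustrative):

```js
const intervalMs = 500; // "a few hundred ms"
let last = Date.now();

setInterval(() => {
  const now = Date.now();
  // Timers fire no earlier than requested, so anything beyond the
  // interval is time the event loop was blocked (or the process was
  // starved of cpu).
  const delay = now - last - intervalMs;
  if (delay > 100) {
    console.warn(`event loop blocked for ~${delay}ms`);
  }
  last = now;
}, intervalMs);
```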
@basti1302 We ended up writing our own extension (we needed more stats than what is provided in …).

@SimonWoolf The problem with the …
Of course -- it's a balance. An interval of x will catch event loop delays greater than x, but you can only prove a delay of (reported delay - x); e.g. with a 100ms interval, a reported delay of 600ms only proves the loop was blocked for at least 500ms. So a reasonable heuristic might be to set the interval to about half the smallest event loop delay long enough to be considered a bug -- probably somewhere in the range of 50-250ms. You would never set it to 0 (and even the low milliseconds would be pretty pointless -- for one thing, the default kernel cpu scheduler min granularity on aws is already 1.5ms).
Yes, that would be interesting - or maybe provide a link to your extension in case it is public. |
@paddybyers @mattheworiordan
Context: https://github.com/ably/realtime/issues/1130 . Measures from the check handle of one iteration to the check handle of the next, rather than from the loop-time update to the check handle within one iteration.

I don't know why the original approach doesn't work. Looking at http://docs.libuv.org/en/v1.x/design.html#the-i-o-loop , it seems like it should work. But it doesn't seem to.
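For reference, the approach in isolation, as a minimal plain-libuv sketch (not the addon's actual code; the 100ms threshold is arbitrary): a check handle records uv_hrtime() each iteration and compares against the previous iteration's value.

```c
#include <stdio.h>
#include <uv.h>

static uint64_t last_check;

/* Runs once per loop iteration, just after the poll phase. */
static void on_check(uv_check_t* handle) {
  (void)handle;
  uint64_t now = uv_hrtime();
  uint64_t delta_ms = (now - last_check) / 1000000;
  /* A long gap here is either a slow iteration... or an idle loop
     parked in the poll phase -- the ambiguity discussed above. */
  if (delta_ms > 100)
    printf("loop iteration took ~%llums\n", (unsigned long long)delta_ms);
  last_check = now;
}

int main(void) {
  uv_check_t chk;
  last_check = uv_hrtime();
  uv_check_init(uv_default_loop(), &chk);
  uv_check_start(&chk, on_check);
  return uv_run(uv_default_loop(), UV_RUN_DEFAULT);
}
```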
Before this change, with an artificial large `for` loop:

After: