This repository has been archived by the owner on Mar 3, 2021. It is now read-only.

Measure between whole iterations of the event loop #1

Open · wants to merge 2 commits into master

Conversation

SimonWoolf

@paddybyers @mattheworiordan

Context: https://github.com/ably/realtime/issues/1130. Measures from the check-handles phase on one iteration to the check-handles phase on the next, rather than from the loop-time update to the check-handles phase within one iteration.

I don't know why the original approach doesn't work. Looking at http://docs.libuv.org/en/v1.x/design.html#the-i-o-loop, it seems like it should work, but it doesn't seem to.

Before this change, with an artificial large for loop:

stats { min: 0, max: 12, num: 8623, sum: 99 }
stats { min: 0, max: 4, num: 4468, sum: 28 }
starting artificial delay
finished artificial delay
Detected huge event loop iteration (6190ms)!
stats { min: 0, max: 4, num: 3657, sum: 15 }
stats { min: 0, max: 4, num: 4396, sum: 25 }

After:

stats { min: 0, max: 9, num: 4477, sum: 5000 }
stats { min: 0, max: 23, num: 4484, sum: 5000 }
starting artificial delay
finished artificial delay
Detected huge event loop iteration (6407ms)!
stats { min: 0, max: 6185, num: 2902, sum: 9414 }
stats { min: 0, max: 10, num: 486, sum: 562 }
stats { min: 0, max: 23, num: 4452, sum: 4999 }

Original version checked uv_hrtime() against uv_now(handle->loop), which
doesn't seem to work -- it doesn't detect e.g. long event loop iterations
caused by a large for loop. Checking against the time of the same point at
the last iteration of the event loop does.
@mattheworiordan
Member

Impressive, nice work

@SimonWoolf
Author

Bugger, this doesn't work after all. It can't tell the difference between the event loop being busy-blocked and node being completely idle (no IO events, no timers ending, etc.), i.e. libuv just not scheduling any event loop iterations. :/

Will keep trying

@SimonWoolf
Author

Annoyingly, I can't see a way to do the event-loop-stats thing properly; libuv doesn't give us enough hooks. If I measure from one event loop iteration to the next, I can't distinguish idleness from blocking. If I measure from the loop-time update to check, it fails to detect blockage -- still not sure why; maybe libuv has started updating uv_now in more places than just the start of the loop (maybe something to do with this?)

Think I might have to go back to my original idea of just using js timers...

@paddybyers
Member

Think I might have to go back to my original idea of just using js timers...

If you're prepared to have a js timer running with a delay comparable with the delays you're trying to detect, then you also know that the native method will work, because you'll be limiting the max time that can be spent idle.

@SimonWoolf
Author

SimonWoolf commented Jun 19, 2017

If you're prepared to have a js timer running with a delay comparable with the delays you're trying to detect, then you also know that the native method will work, because you'll be limiting the max time that can be spent idle.

Good point, we could have a js timer (or a native one for that matter, using http://docs.libuv.org/en/v1.x/timer.html). But what would we actually be gaining by doing that (eventloop-to-eventloop timing plus a separate interval timer) over just measuring between ticks of an interval timer (which'd avoid the native code and extra dependency)? Either way you can only be sure about delays of >= your interval timer period.

@paddybyers
Member

But what would we actually be gaining by doing that (eventloop-to-eventloop timing plus a separate interval timer) over just measuring between ticks of an interval timer (which'd avoid the native code and extra dependency)? Either way you can only be sure about delays of >= your interval timer period.

I don't think either is a great solution. The UV eventloop could trivially tell you the information, without any penalty, if it wanted to. It would be good to know for sure that we can't make that work before looking at other solutions.

@SimonWoolf
Author

SimonWoolf commented Jun 19, 2017

I don't think either is a great solution. The UV eventloop could trivially tell you the information, without any penalty, if it wanted to. It would be good to know for sure that we can't make that work before looking at other solutions.

I'm open to suggestions. Here's the core event loop, here's all the hooks we get, here's the order they're called in.

So: we can easily detect a blocked event loop if it's blocked in the calling-timeouts (+ nextTick callbacks) phase, by checking the time from check in iteration n to prepare in iteration n+1.

The difficulty is if there's a problem in the 'poll for IO' phase, i.e. in the callback for some IO event, because if there's no IO, the event loop blocks in that phase. So I can't see any way of telling whether a 5s gap in that phase (i.e. from prepare to check) is due to the event loop idly blocking for IO for 5s due to a lack of activity, or a 5s busy loop triggered by some IO.

I tried this experimentally. Spinning in a setTimeout callback, a check handler reports since_last_check: 519, since_last_prepare: 20. Spinning in a websocket onConnection callback gives since_last_check: 3109, since_last_prepare: 3109 -- the same as if you just do nothing for 3s: time(prepare->check) == time(check->check).
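
For reference, a minimal sketch of the two scenarios in that experiment (plain node; the since_last_check / since_last_prepare numbers come from the instrumented native check/prepare handlers, not from this snippet, and the http server below is just a stand-in for the websocket onConnection case):

const http = require('http');

// Scenario 1: busy-spin inside a timer callback (runs in the timers phase).
setTimeout(() => {
	const end = Date.now() + 3000;
	while (Date.now() < end) {} // block the loop for ~3s
}, 100);

// Scenario 2: busy-spin inside an I/O callback (runs in the poll phase).
http.createServer((req, res) => {
	const end = Date.now() + 3000;
	while (Date.now() < end) {} // block the loop for ~3s, from within poll
	res.end('done');
}).listen(8080); // hypothetical port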

Am I missing anything?

@paddybyers
Member

Can we tell the difference by looking at the cpu cycles? We should be able to tell the difference between the process being stalled on blocking I/O and the process being busy. We're already periodically getting cpu usage (see https://github.com/ably/realtime/blob/master/common/lib/util/cpuusage.js).

@SimonWoolf
Author

SimonWoolf commented Jun 21, 2017

Can we tell the difference by looking at the cpu cycles? We should be able to tell the difference between the process being stalled on blocking I/O and the process being busy. We're already periodically getting cpu usage (see https://github.com/ably/realtime/blob/master/common/lib/util/cpuusage.js).

Not sure if you're suggesting looking at the cpu info from js code in the eventloop health monitor, or from native code on each tick. If from js, it'd be way too coarse-grained (30s) to correlate anything with any confidence, so presumably native. We could reimplement CpuUsage in c++ using GetCPUInfo / uv_cpu_info, and update it every second or so so we're not doing it on every tick. But it still wouldn't be a very good heuristic -- maybe cpu usage was high since the last tick because some other process on the box is using a lot of cpu and our node process has just been idle for a few seconds, or maybe the event loop is blocked because someone used a sync method that didn't cause high cpu, etc.

So ISTM that that'd take more effort than the interval timer solution, and be a worse heuristic -- at least with the setInterval solution you know the issue is with the current process, not some other one on the same box.

@basti1302

FWIW, even if the changes proposed here do not cover all cases, they seem to be an improvement over the previous stats collection method used in the upstream repo. I suggested incorporating them upstream and, while doing so, found a little issue regarding the handling of previous_now, which I fixed (TL;DR: it's not a good idea to reset previous_now on every sense call).

@basti1302

Bugger, this doesn't work after all. It can't tell the difference between the event loop being busy-blocked and node being completely idle (no IO events, no timers ending, etc.), i.e. libuv just not scheduling any event loop iterations. :/

Just trying to understand the discussion a little better: Is this the only problem with the approach implemented here? Because that sounds a bit like an "academic" problem. Production apps would then only falsely report a large event loop lag if they do not receive any request, is that right?

If that is the case, code using the event-loop-stats package would need to correlate the produced stats with the number of incoming calls (which can be acquired by other means) to make a somewhat intelligent decision about whether there is an actual problem because of an event loop lag or the app is simply sitting idle and therefore reporting a false event loop lag.

@SimonWoolf
Author

I suggested incorporating them upstream

Sure, I'm glad it was helpful, even if it didn't work for us.

Just trying to understand the discussion a little better: Is this the only problem with the approach implemented here?

This was a while ago, but from memory, I think so, yes.

Production apps would then only falsely report a large event loop lag if they do not receive any request, is that right? If that is the case, code using the event-loop-stats package would need to correlate the produced stats with the number of incoming calls

Perhaps. Not sure that's as easy as it sounds.

For a start, you need the thing reporting the timestamp of incoming calls to be external to the node process, else you have no way of distinguishing an IO event at time X (where node was idle for the previous 500ms) from an IO event that appeared to come in at time X but would have happened at time X - 500 if node hadn't been in a busy loop for 500ms. So you now need to proxy all incoming calls through another process. Which also means the correlation either has to be done over IPC, or the timestamps persisted (which means every socket read now entails a disk write, which could be a nontrivial extra load even with rollups) and the correlation done in a batch job.

And for all but the simplest apps, tracking all IO events is not necessarily trivial: for our usecase the list of possible IO events is large enough (rest requests, websocket messages, redis and cassandra queries returning, several internal transports for various purposes (grpc, axon-mq, gossip tcp sockets, ...), a large number of timeouts and interval timers, ...) that trying to track them all to correlate with apparent eventloop delays would be a bit of a fool's errand. You don't want to sink a lot of effort investigating an apparently blocked event loop if it might just be some io event you forgot to correlate.

We ended up just going with a simple pure-javascript timer, with an interval of a few hundred ms:

let lastNow = Date.now();
this.interval = setInterval(() => {
	const now = Date.now(),
		delay = now - (lastNow + checkInterval);

	/* actual delay might be anywhere between now - (lastNow + checkInterval)
	 * and now - lastNow, but we can only prove the lowest end of that range */
	if(delay > maxDelay) {
		// snip logging & stats collection
	}
	lastNow = now;
}, checkInterval);
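
(checkInterval and maxDelay here are configuration values from the surrounding code, not shown above. As a hypothetical example, with checkInterval = 200 and maxDelay = 100, a callback that fires 700ms after lastNow reports a delay of 500ms -- the provable lower bound on the blockage, since up to one checkInterval of the gap may just be the timer waiting to fire.)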

@rochdev

rochdev commented Mar 13, 2019

@basti1302 We ended up writing our own extension (we needed more stats than what is provided in event-loop-stats anyway) that calculates the CPU usage between ticks to know if something was actually blocking. Backporting this logic to event-loop-stats shouldn't be too difficult, I can open a PR if you want :) Not sure if it's a silver bullet either, but so far it seems to report exactly the numbers we expect in every scenario.

@SimonWoolf The problem with the setInterval approach is that it doesn't only monitor the event loop but also triggers it. So for example, if you used an interval of 0 (the worst-case scenario) then your app could easily go from ~50 ticks per second to millions of ticks per second, resulting in unnecessary processing.
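
For illustration only, a rough pure-js sketch of the cpu-correlation idea (not rochdev's native extension), reusing the hypothetical checkInterval / maxDelay values from the earlier snippet: sample process.cpuUsage() on each interval tick and compare cpu time burned against the wall-clock gap to estimate whether the gap was busy or idle.

let lastNow = Date.now();
let lastCpu = process.cpuUsage(); // { user, system } in microseconds

setInterval(() => {
	const now = Date.now(),
		cpu = process.cpuUsage(lastCpu), // cpu time consumed since the last tick
		delay = now - (lastNow + checkInterval);

	if (delay > maxDelay) {
		// A ratio near 1 suggests the loop was busy-blocked in this process;
		// a ratio near 0 suggests idleness, or blockage on something that
		// didn't consume cpu (e.g. a sync IO call).
		const cpuMs = (cpu.user + cpu.system) / 1000;
		const busyRatio = cpuMs / (now - lastNow);
		// snip: log/record delay and busyRatio
	}

	lastNow = now;
	lastCpu = process.cpuUsage();
}, checkInterval);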

@SimonWoolf
Author

The problem with the setInterval approach is that it doesn't only monitor the event loop but also triggers it.

Of course -- it's a balance. An interval of x will catch event loop delays greater than x, but you can only prove a delay of (reported delay - x), so a reasonable heuristic might be to set the interval at about half of the smallest event loop delay long enough to be considered a bug. Probably somewhere in the range of 50-250ms. You would never set it to 0 (and even the low milliseconds would be pretty pointless -- for one thing, the default kernel cpu scheduler min granularity on aws is already 1.5ms).
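
(A hypothetical illustration of that heuristic: if a blockage of 500ms or more is what counts as a bug, a check interval of around 250ms means a genuine 500ms blockage is reported as a delay of at least 250ms, comfortably above normal timer jitter.)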

@basti1302

@rochdev

We ended up writing our own extension (we needed more stats than what is provided in event-loop-stats anyway) that calculates the CPU usage between ticks to know if something was actually blocking. Backporting this logic to event-loop-stats shouldn't be too difficult, I can open a PR if you want :) Not sure if it's a silver bullet either, but so far it seems to report exactly the numbers we expect in every scenario.

Yes, that would be interesting - or maybe provide a link to your extension in case it is public.
