Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[discuss] event loop idle metrics #33026

Closed
jasnell opened this issue Apr 23, 2020 · 26 comments
Closed

[discuss] event loop idle metrics #33026

jasnell opened this issue Apr 23, 2020 · 26 comments
Labels
feature request Issues that request new features to be added to Node.js. perf_hooks Issues and PRs related to the implementation of the Performance Timing API. performance Issues and PRs related to the performance of Node.js.

Comments

@jasnell
Copy link
Member

jasnell commented Apr 23, 2020

@trevnorris is working on landing a change to libuv that will track and report the amount of time the event loop spends in idle time. It's an extremely useful metric that can provide for us a measurement of "event loop utilization". In a world of worker threads, monitoring CPU no longer becomes an effective way of monitoring performance and event loop delay is not enough on it's own, so having a built in mechanism for measuring event loop utilization would be fantastic. While there is some work still to be done to get the PR landed in libuv and get that new libuv version landed in core, I did want to briefly discuss how the new metric should be exposed in core.

In this comment @trevnorris suggests a simple performance.idleTime() that returns the direct value of this metric, which records the cumulative time spent in idle since the loop was configured to track. To calculate event loop utilization, however, we also need to know how long the event loop has been running (well, to be specific, how long it's been since the loop was configured to collect the data, which can be turned on but not turned off). Assuming we started the loop and started collecting the metric from the start, we do already record the start time of the event loop (using the performance milestones) so someone could calculate the utilization on their own by accessing those values. However, I think it might make more sense for us to just do the calculation for users and provide an API like performance.idleTime() that returns an object with two values { idle: n, utilization: y } where idle is the raw idle time and utilization is the calculated utilization value. The API should be very low cost to sample using AliasedArray or AliasedStruct as a backing.

/cc @nodejs/diagnostics @addaleax @mcollina

@jasnell jasnell added perf_hooks Issues and PRs related to the implementation of the Performance Timing API. performance Issues and PRs related to the performance of Node.js. labels Apr 23, 2020
@sam-github
Copy link
Contributor

I think its important that what the Node.js API returns is mappable to Prometheus in as high-value a way as possible, I'll take a look at this in the next day or so.

@jasnell
Copy link
Member Author

jasnell commented Apr 23, 2020

@sam-github >>> nodejs/TSC#853

@Flarna
Copy link
Member

Flarna commented Apr 23, 2020

Is utilization here the wall clock time since measurement starts or the actual CPU time consumed by the thread executing the loop tasks?

@trevnorris
Copy link
Contributor

@Flarna It's a bit in depth, but you should read my post on libuv/libuv#2725 (comment) about the differences.

TL;DR "idle time" is measured using uv_hrtime(), which I believe defaults to monotonic, and represents the amount of time the event loop was idle, not the amount of time the event loop was spent in the event provider (e.g. poll). "event loop utilization" is "idle time" divided by the amount of time the event loop has been running (also measured using uv_hrtime()).

@Flarna
Copy link
Member

Flarna commented Apr 23, 2020

@trevnorris Good post!

I think a common use case would be to calculate utilization for regular intervals not always from loop startup. Therefore not sure if adding this overall utilization from loop start is that useful.
But as long as a call to performance.idleTime() isn't stateful this is possible for everyone to do.

Looks very useful to me.

@jasnell
Copy link
Member Author

jasnell commented Apr 23, 2020

Currently this mechanism uses uv_configure_loop() to enable the metrics tracking which enables configuration options but currently does not implement any semantics for disabling those options. Once enabled, the idle stats in libuv are calculated continuously over the lifetime of the event loop and are not reset when it is read. The idle time would accumulate indefinitely until the uint64_t overflows.

In theory, we could record the moment performance.idleTime() is called internally in Node.js and track that in the response and as part of the calculation of the utilization but that starts to make the api more complicated... for instance...

// Get the idle time since the start of the event loop
let idle = performance.idleTime()
// idle = { idle: n, mark: x, utilization: y }
// where mark == the uv_hrtime() when performance.idleTime() was processed.

// then later...
idle = performance.idleTime(idle)
// idle = { idle: n, mark: x, utilitization: y }
// where idle is now the difference between the current idle time and the provided idle.idle.
// mark is the new uv_hrtime()
// and utilization is the new n / the time since the provided idle.mark

Definitely possible but obviously starts to increase the complexity of the API so we need to be careful.

@trevnorris
Copy link
Contributor

trevnorris commented Apr 23, 2020

@Flarna Instead of reporting "utilization", if idleTime() returned { idle: n, active: m }, where active is the cumulative time spent in uv_run(), then it would be easy enough for users to implement their own.

A simple example:

function loopUtilization(last, current = performance.idleTime()) {
  return (current.idle - last.idle) / (current.active - last.active);
}

const now = performance.idleTime();
const t = Date.now();
while (Date.now() - t < 3000);
console.log(loopUtilization(now));

Though if active is included then maybe it shouldn't be called idleTime() anymore. :P

@jasnell
Copy link
Member Author

jasnell commented Apr 23, 2020

Yep, so the key question here is: do we provide the calculation of utilization for users or do we just provide the idle time and leave it to them to calculate.

@trevnorris
Copy link
Contributor

trevnorris commented Apr 23, 2020

If it's similar to my example above, to prevent yet another one-liner module we might just as well add it. So maybe it would operate something like so:

// This simply logs out the loop utilization for a three second duration
setTimeout((last) => {
  console.log(performance.loopUtilization(last));
}, 3000, performance.idleTime());

Internally we just store active_time = uv_hrtime(); just before calling uv_run(). So, some pseudo code to show the calls between JS and C++, the return value of idleTime() would be something like:

{
  idle: uv_metrics_idle_time(event_loop()),
  active: active_time,
}

@Flarna
Copy link
Member

Flarna commented Apr 23, 2020

I'm in favor of an easy, stateless API. If we add state we have to think about multi-user support.
As shown above utilization is easy calculated if idle time is available. I assume active above is actually the wall clock time in the same unit as idle time.

@trevnorris
Copy link
Contributor

trevnorris commented Apr 23, 2020

I assume active above is actually the wall clock time in the same unit as idle time.

Yes. It is the same unit as idle time.

@jasnell
Copy link
Member Author

jasnell commented Apr 23, 2020

Given the likely complexity of any other option, I'm +1 on the really simple Just Give Me The Idle Time API. We can calculate active from the existing perf milestones, we don't actually need the new API to give us that.

@addaleax
Copy link
Member

@jasnell Just as a note here, we should probably make idle time include Atomics.wait() once we implement an API for this in Node.js. That’s not hard to do, just something that seems like it’s easy to overlook :)

@jasnell
Copy link
Member Author

jasnell commented Apr 24, 2020

I'm not super familiar with how we'd be able to observe waits but +1 to including those. @trevnorris , have you looked into that at all?

@addaleax
Copy link
Member

@jasnell Luckily, I’ve done quite a bit of work around that in the past: https://github.com/v8/v8/blob/1d00b7856f9071c2c62ad66dbdbfdebff6dc370c/include/v8.h#L8867-L8944

Adding support for measuring that timing should be fairly straightforward, we basically register a callback that’s called once when Atomics.wait() starts and once when it ends.

(I’ve also been meaning to add a logging API for those events to Node.js, for better debugging Atomics.wait() deadlocks.)

@Flarna
Copy link
Member

Flarna commented Apr 24, 2020

To my understanding the time spent in Atomic.wait is no idle time in the sense that this thread can process something else. Therefore I don't think we should add the time here.
But monitoring the time spent in Atomic.wait is for sure a good enhancement.

@addaleax
Copy link
Member

@Flarna Well … that’s a good point, but I don’t think it’s a bit more complex than that. Maybe we need to make this customizable.

I’m currently working on code that uses Atomics.wait() when nothing else is happening on the event loop in a Worker, as a way of reducing latency and overall overhead. That means that the Atomics.wait() call effectively fulfills the function of the event loop wait, and should count as idle time.

@trevnorris
Copy link
Contributor

trevnorris commented Apr 24, 2020

@addaleax

That means that the Atomics.wait() call effectively fulfills the function of the event loop wait, and should count as idle time.

The metric should only measure idle time in the event provider. Specifically the amount of time idle while it waits for an event to be received (reason for this is explained in libuv/libuv#2725 (comment)). Looking at AtomicsWaitEvent, it seems like Atomic.wait() could be considered a type of event provider. Though if it's working in conjunction with uv_run() it'd probably be better to not include it.

Well, I guess that actually depends on whether the Worker's event loop has any handles from which it expects to receive an event. If not then Atomic.wait() would be considered the event provider. And since signal handlers can't be added from a Worker thread, there wouldn't be another way to interrupt the event provider. And if you're running with UV_RUN_NOWAIT then there technically wouldn't ever be idle time in uv_run()'s event provider. So I can see what you're getting at.

That being said, measuring the amount of time spent in Atomic.wait() would be a great step towards calculating total CPU idle time per thread. Since there isn't a platform agnostic way to do getrusage(RUSAGE_THREAD, ..., being able to include all other types of idle time is immensely helpful for analysis.

So for usability maybe just provide both times as separate values? IMO that would be the most helpful.

TL;DR idleTime() should only return the amount of idle time within the event provider. Not just idle time in uv_run(). So if Atomic.wait() is the acting event provider, and uv_run() is running with UV_RUN_NOWAIT, then idleTime() should return the amount of time in wait().

@mhdawson
Copy link
Member

In respect to the question about whether to just return the raw number since idle time metric was enabled or trying to return a calculated value over an interval, my preference is to leave it as the raw number.

Otherwise it's a slippery slope of people wanting to be able to specify the interval etc. so probably better to just let them calculate from the raw number in the first place and keep it as simple as possible.

@jasnell
Copy link
Member Author

jasnell commented Apr 24, 2020

Just had a great catch up with @trevnorris... it's been too long since we've chatted! On this specific API, I'm going to move forward with a draft PR that pulls in his libuv changes and adds a field to the existing Performance API nodeTiming object to get the current idle time. Using that, and the already existing loopStart timestamp, it ought to be trivial to calculate the utilization value.

nodeTiming.idleTime / (process.hrtime() - nodeTiming.loopStart)

We can provide the cumulative Atomics.wait time as a separate property off nodeTiming.

@jasnell jasnell added the feature request Issues that request new features to be added to Node.js. label Jul 6, 2020
@trevnorris
Copy link
Contributor

trevnorris commented Aug 25, 2020

Now that the patch has landed in libuv v1.39.0 I'll create a PR with this feature once it has now since it has landed in node.

@cjihrig
Copy link
Contributor

cjihrig commented Aug 25, 2020

It has already landed in Node in #34915.

EDIT: Ah, I see the original comment was already updated. Ignore this comment 😄

@trevnorris
Copy link
Contributor

PR created: #34938

@mhdawson
Copy link
Member

mhdawson commented Sep 2, 2020

PR #34938 landed @trevnorris I assume this can be closed now?

@trevnorris
Copy link
Contributor

Probably. :)

@jasnell jasnell closed this as completed Sep 18, 2020
@devSahinur
Copy link

image

why show this ?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
feature request Issues that request new features to be added to Node.js. perf_hooks Issues and PRs related to the implementation of the Performance Timing API. performance Issues and PRs related to the performance of Node.js.
Projects
None yet
Development

No branches or pull requests

9 participants