Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improve Event Loop Lag Metric #370

Open
ChristianBoehlke opened this issue May 4, 2020 · 10 comments
Open

Improve Event Loop Lag Metric #370

ChristianBoehlke opened this issue May 4, 2020 · 10 comments

Comments

@ChristianBoehlke
Copy link

Hello,

Thank you for all the work with this library!
We are using our own implementation of the event loop lag metric because the default metric did not work well for us and I was wondering whether this is something we could contribute back? Here are the two things we have changed:

Reset libuv-Histogram
The mean/max/percentiles that libuv provides (through perf_hooks.monitorEventLoopDelay) are never reset. After a couple of days all numbers are pretty much set and do not change anymore. There is no way to even distinguish between high load times during the day and quiet times during the night. I think it would be better to expose a moving average/min/max instead of the total-average/min/max.

By default the resolution that libuv uses is 10ms which means that libuv generates ~100 measurements per second. It would be possible to reset libuv's histogram every second (or so) and record its mean and max in circular buffers. The length of the buffers should be a bit longer than Prometheus' scape interval and get averaged when collected.
Instead of the circular buffer, a histogram could be used as well (proposed in #309).
There is some discussion about this in #278.

Subtract resolution from values
libuv uses a timer and records the time that has passed since the last invocation. This means that all values that libuv provides have the resolution value "added" to them. I think it would be better if this library could subtract this number before exposing the metrics. This could also be done in Prometheus but this is a rather specific implementation detail and could vary from process to process because it is configurable.
There is some discussion about changing how libuv measures the lag but this didn't get merged: nodejs/node#32018 and nodejs/node#32102. There is also some future work in libuv/libuv#2725.

I am happy to open a PR for this so that you can have a look.

@sam-github
Copy link
Contributor

Hi, appreciate the feedback, and reading through your comments in detail is on my list.

Any interest in making a PR?

Also, pending me looking more closely, do you think some of this should be reported directly to nodejs as issues with our core APIs?

@jomi-se
Copy link

jomi-se commented Aug 7, 2020

Hey @ChristianBoehlke I came across this issue. Maybe you can chime in in this other related issue I just opened: nodejs/node#34661

Not exactly the same thing, but I thought it might interest you

@johanneswuerbach
Copy link

@ChristianBoehlke I observed the same behavior and would be interested in your changes. Did you ever had the chance to open this PR?

@jmealo
Copy link

jmealo commented Feb 19, 2021

@ChristianBoehlke: Hey, I hope that you're well. Word on the block is that you have an awesome fix for event loop lag metrics. Is this true?

@ChristianBoehlke
Copy link
Author

Hello! Sorry I totally forgot about this. I have some time this weekend to open up a PR with our implementation.

@yarsky-tgz
Copy link
Contributor

Good evening, @ChristianBoehlke !

Are you able to check my PR and share your thoughts regarding it?

My solution is much simpler than proposed by you (one line added actually) and I am interested to know, why more complex solution can be needed, if it is actually needed.

@yarsky-tgz
Copy link
Contributor

Seems that @ChristianBoehlke is too busy to respond on this.

From our side i would like to share experience of using #459 fork

We have now exactly that information, that we want to see from beginning:

image

It would be possible to reset libuv's histogram every second (or so) and record its mean and max in circular buffers. The length of the buffers should be a bit longer than Prometheus' scape interval and get averaged when collected.
Instead of the circular buffer, a histogram could be used as well (proposed in #309).

I still have no idea, why not just use that libuv's histogram, provided by monitorEventLoopDelay, collect that precalculated stats and reset histogram. No data will be lost such way, and user can see accurate stats for measurement period, as you can see on screenshot above.

@zbjornson
Copy link
Collaborator

@yarsky-tgz's PR #459 looks reasonable and seems to address the very valid issue raised here.

The discussion in the Node.js issued linked in the OP are useful. In particular, it looks like we should expose an Event Loop Utilization default metric to help with interpretation (added in v12.19, 14.10, https://nodejs.org/dist/latest-v14.x/docs/api/perf_hooks.html#perf_hooks_performance_eventlooputilization_utilization1_utilization2).

@zbjornson
Copy link
Collaborator

Fixed in 0f444cd (v14 coming out tonight).

@andrew-demb
Copy link

Is it a resolved issue (as of #370 (comment))?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

8 participants