Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: implement advanced event loop monitoring #278

Merged
merged 8 commits into from
Feb 12, 2020

Conversation

yvasiyarov
Copy link
Contributor

Guys, I'm sorry for bothering you with one more pull request :-)
I've implemented more advanced event loop monitoring based on functionality provided by Node11.
perf_hooks.monitorEventLoopDelay() use libuv timer, so its measurements are more precise.

@yvasiyarov
Copy link
Contributor Author

@SimenB can you please have a look at this pool request ?

@zbjornson
Copy link
Collaborator

I'm +1 on this.

This PR has commits from your other PRs; I didn't look closely to see if this really depends on the other commits, but the last commit looks like it's the only relevant one.

Can you update the changelog please? (It's also worth mentioning in the changelog that this makes the existing event loop lag metric obey the timestamps config.)

Sorry for the slow reviews!

@zbjornson
Copy link
Collaborator

It's also worth mentioning in the changelog that this makes the existing event loop lag metric obey the timestamps config

Actually, per #177, it sounds like we shouldn't be doing this in any of the stock metrics.

@sam-github
Copy link
Contributor

Could you rebase this so it doesn't have #274 included? That would allow it to be more easily reviewed.

I took a look, but I'm confused. I thought I'd see calls to https://nodejs.org/api/perf_hooks.html#perf_hooks_perf_hooks_monitoreventloopdelay_options
since the topic says event loop monitoring, but the implementation is collecting the bootstrap times. I'm not convinced they are generally useful. They can be helpful to debug node internals, to see where startup time is, but slow startup, and a need to pinpoint it to a particular part of node's internal bootstrap process, doesn't seem to me to be a generally useful metric.

@SimenB
Copy link
Collaborator

SimenB commented Feb 1, 2020

#274 has been merged now, so this really needs a rebase 🙂

@yvasiyarov
Copy link
Contributor Author

@SimenB I've updated PR, please, have a look

@yvasiyarov
Copy link
Contributor Author

@sam-github I'm sorry for mixing up several PR's in one branch. There was no response to this PR for a while so I thought it was abandoned and reused this branch in one of my projects.

I've cleaned up the branch and removed everything except event loop monitoring.

@sam-github
Copy link
Contributor

Thanks @yvasiyarov -- just to let you all know, there's a Node.js sec release imminent, it'll be a couple days before I'm free to look at prom-client again.

Copy link
Contributor

@sam-github sam-github left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A couple quick suggestions. Also, I wonder if given how different these metrics are, if the new ones can be called by the same name node, their source, calls them "event loop delay" (not lag)?

I think it would be helpful if the README somewhere pointed to the node docs and said that's what the events are (so people can go look there to see what they mean).

Its really unfortunate that these have to be gauges, but that has to be fixed upstream in node. Its on my TODO list, but not at the top :-(/.

lib/metrics/eventLoopLag.js Outdated Show resolved Hide resolved
lib/metrics/eventLoopLag.js Outdated Show resolved Hide resolved
lib/metrics/eventLoopLag.js Outdated Show resolved Hide resolved
lib/metrics/eventLoopLag.js Outdated Show resolved Hide resolved
@sam-github
Copy link
Contributor

sam-github commented Feb 7, 2020

I looked more closely, this will only report lag if node is v10.x, for 12.x and 13.x it won't report lag, and instead will report the others.

I don't know, perhaps it should either always report lag, calculated the old way?

Or perhaps, on 10.x it reports lag the old way but not the others, and on 12+ it reports the event loop delay mean as the lag?

Or perhaps it doesn't matter, people don't expect metrics to be consistent across node.js major bumps?

@yvasiyarov
Copy link
Contributor Author

I looked more closely, this will only report lag if node is v10.x, for 12.x and 13.x it won't report lag, and instead will report the others.

I don't know, perhaps it should either always report lag, calculated the old way?

Or perhaps, on 10.x it reports lag the old way but not the others, and on 12+ it reports the event loop delay mean as the lag?

Or perhaps it doesn't matter, people don't expect metrics to be consistent across node.js major bumps?

My intention was to keep reporting lag exactly as it was reported before (for the sake of backward compatibility). If lag is not reported, it's definitely a bug.
I've made a quick check on node v13.x and lag is reported on pair with newly introduced metrics.

@yvasiyarov
Copy link
Contributor Author

A couple quick suggestions. Also, I wonder if given how different these metrics are, if the new ones can be called by the same name node, their source, calls them "event loop delay" (not lag)?

I think it's definitely worth to mention in README that lag calculated differently than the rest of the event loop metrics and that lag should not be directly compared with them. But I'm not sure if we should go that deep into implementation details.

If we name metrics differently - then we should explain what is the difference between https://github.com/nodejs/node/blob/743c28104f3db504963409e8e91700e430131221/src/node_perf.cc#L508 and https://github.com/yvasiyarov/prom-client/blob/73083f542c3f4b78a3c7121c5378b289ccdf5a4d/lib/metrics/eventLoopLag.js#L22 . For many people "lag" and "delay" are synonymous, so we can not explain the difference between two implementations by using different names.

I think it would be helpful if the README somewhere pointed to the node docs and said that's what the events are (so people can go look there to see what they mean).

The general event loop description available here - https://nodejs.org/uk/docs/guides/event-loop-timers-and-nexttick/. But it does not has that much technical details.
There is better description here - https://github.com/libuv/libuv/blob/v1.x/docs/src/design.rst#the-io-loop, but it require some level of understanding node.js internals.

Which one would you prefer ?

Its really unfortunate that these have to be gauges, but that has to be fixed upstream in node. Its on my TODO list, but not at the top :-(/.
If you can give me a hint how proper event loop monitoring implementation can looks like I will be happy to submit a pull request to node upstream.

I can pass every measurement of event loop delay to JS word by using perf_hooks API.
So prom-client can use performanceObserver.observe() to subscribe to this events and pass it to histogram. But I'm afraid that it can be too expensive from performance perspective. What's why Node.js developers decided to aggregate events in C++ code and pass to JS word only aggregated values.

@sam-github
Copy link
Contributor

What's why Node.js developers decided to aggregate events in C++ code and pass to JS word only aggregated values.

Yes, but they chose to summarize, rather than just count into buckets (like a prom-style historgram).

I think per_hooks should allow the user of the API to supply buckets, and for the measured values to be simply counted into the buckets. This would allow prom-client to take it's bucket config, and basically pass it into Node.js C++-land where the counting would occur. That the api is notfication based rather than scraped is probably not such a big deal, accumulation could occur in prom-client.

I don't see that change as being more costly in C++, at least not obviously, it just replaces a bunch of math to calculate the percentiles with simple bucker level comparisons and increment of counters.

Copy link
Contributor

@sam-github sam-github left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. I think you can make a more specific file-global eslint ignore, its more global, but its also clear what's being ignored. I wasn't sure if it was some style rule that you disagreed with, once I backed out the eslint I saw what it was. For the record, eslint hates perf_hooks:

error  The 'perf_hooks.monitorEventLoopDelay' is not supported until Node.js 11.10.0. The configured version range is '>=10'  node/no-unsupported-features/node-builtins

lib/metrics/eventLoopLag.js Outdated Show resolved Hide resolved
lib/metrics/eventLoopLag.js Outdated Show resolved Hide resolved
lib/metrics/eventLoopLag.js Outdated Show resolved Hide resolved
@yvasiyarov
Copy link
Contributor Author

Thank you, @sam-github !
Regarding changes in node.js - I will try to submit PR in a near future

@siimon siimon merged commit c63689b into siimon:master Feb 12, 2020
@sam-github
Copy link
Contributor

@yvasiyarov nodejs/node#32018 (FYI) Node.js default behaviour is changing, for the better IMO.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants