Improve worker event loop lag monitoring #6720

jeluard · 2024-05-01T08:13:58Z

collectNodeJSMetrics relies on prom-client to collect eventLoopMonitoring. This in turns relies on node:perf_hooks/monitorEventLoopDelay that only works for the main thread or worker from the cluster module (as documented here).
woker_threads support is unclear, although we relies on this for worker_threads monitoring (network and discv5).

Proposed action items

Improve event loop monitoring, especially for workers. It looks preferable not to rely on prom-client for this specific monitoring and add support directly in lodestar.

add ELU monitoring for finer grain data (monitorEventLoopDelay relies on histogram and averages)
add new dashboards

The text was updated successfully, but these errors were encountered:

nflaig · 2024-05-01T10:40:25Z

node:perf_hooks that only works for the main thread

I would expect perf hooks to work inside worker threads, where did you read this?

wemeetagain · 2024-05-01T15:09:28Z

Specifically, woker_threads support is yet missing- siimon/prom-client#401
although we incorrectly relies on [prom-client] for workers monitoring

This isn't true, not in the way its suggested here. prom-client doesn't automagically instrument workers. We just do it manually, so we are not incorrectly relying prom-client.

The PR linked in that issue does this too. It has the main thread communicate with the worker to get data from the worker and include it in the metric registry of the main thread.

This is exactly what we do in lodestar. See https://github.com/ChainSafe/lodestar/blob/unstable/packages/beacon-node/src/node/nodejs.ts#L277

jeluard · 2024-05-02T07:40:23Z

monitorEventLoopDelay

worker_threads defines eventLoopUtilization in addition to the perf_hooks one, with this comment:

The same call as `perf_hooks eventLoopUtilization`, except the values of the worker instance are returned.

Its implementation itself required low-level changes. There is no equivalent to worker_threads for monitorEventLoopDelay so I would lean towards thinking it reports main thread info when called from within a worker. I couldn't find conclusive info on this though.

My main feeling is that eventLoopUtilization provides extra interesting information, and explicitly supports worker_threads.
I will update this PR so that to clarify this and correct imprecisions.

nflaig · 2024-05-02T08:09:15Z

I would lean towards thinking it reports main thread info when called from within a worker

This would be really bad if nodejs just does this without having a note in the docs, which they usually do a great job of noting limitations. This is really simple to reproduce as well, if this is true we might also wanna report this upstream to node.

Based on previous data #5604, the event loop lag seems to be reported correctly.

wemeetagain · 2024-05-02T14:12:52Z

eventLoopUtilization ... explicitly supports worker_threads

This is another case where "supports worker_threads" means "automagically handles communication between a worker and main thread".

Would agree with @nflaig that based on the data we've seen and used in the past, we can assume that monitorEventLoopDelay works properly in workers (just that the data must be manually communicated to the main thread).

eventLoopUtilization provides extra interesting information

Yeah, agree there, could be worth adding as an additional metric

jeluard · 2024-05-02T14:23:36Z

This is another case where "supports worker_threads" means "automagically handles communication between a worker and main thread".

By this I mean that there are 2 distinct API, performance.eventLoopUtilization and worker.performance.eventLoopUtilization
Both can be accessed from the main thread (forworker.performance.eventLoopUtilization, via the worker reference) and worker threads.

wemeetagain · 2024-05-02T14:33:23Z

Both can be accessed from the main thread

I don't think we want to use this feature. (it gets more complicated when workers spawn other workers, like we currently do)

Rather we can just use the current pattern of having each worker collect its own metrics (and as a byproduct, reuse the existing mechanism for communicating those to the reporting thread).

Recommend appending new metrics collection here, which will automatically be applied to all our workers: https://github.com/ChainSafe/lodestar/blob/unstable/packages/beacon-node/src/metrics/nodeJsMetrics.ts

jeluard added the scope-performance Performance issue and ideas to improve performance. label May 1, 2024

jeluard self-assigned this May 1, 2024

jeluard changed the title ~~Incorrect worker event loop lag monitoring~~ Improve worker event loop lag monitoring May 3, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Improve worker event loop lag monitoring #6720

Improve worker event loop lag monitoring #6720

jeluard commented May 1, 2024 •

edited

Loading

nflaig commented May 1, 2024

wemeetagain commented May 1, 2024

jeluard commented May 2, 2024 •

edited

Loading

nflaig commented May 2, 2024

wemeetagain commented May 2, 2024

jeluard commented May 2, 2024

wemeetagain commented May 2, 2024

Improve worker event loop lag monitoring #6720

Improve worker event loop lag monitoring #6720

Comments

jeluard commented May 1, 2024 • edited Loading

Proposed action items

nflaig commented May 1, 2024

wemeetagain commented May 1, 2024

jeluard commented May 2, 2024 • edited Loading

nflaig commented May 2, 2024

wemeetagain commented May 2, 2024

jeluard commented May 2, 2024

wemeetagain commented May 2, 2024

jeluard commented May 1, 2024 •

edited

Loading

jeluard commented May 2, 2024 •

edited

Loading