
Supporting multiple trace providers: Trace Events, DTrace, Lttng, ETW #163

Closed · jasnell opened this issue Feb 15, 2018 · 41 comments
@jasnell
Member

jasnell commented Feb 15, 2018

Node.js core supports multiple trace providers... we currently have to instrument for each individually, which increases maintenance burden. Before we consider making any changes here, however, we need to understand the critical use cases and requirements for the existing probes. Let's use this thread to discuss.

@jasnell
Member Author

jasnell commented Feb 15, 2018

Some feedback received so far:

  • The ability to switch on tracing probe points on a running process is critical.
  • The data collected at each probe point is also critical. We have existing probe points with existing data collections that should be maintained. Whether or not there are more that should be included needs to be determined.
  • The semantics of the probe points must be maintained (e.g. gc_start must emit at the start of gc) but the specific function that triggers those is not as critical.

(will update this list as we go)

@jasnell
Member Author

jasnell commented Feb 15, 2018

One key question that needs to be answered: does a single Node.js binary need to support multiple trace mechanisms? For instance, current builds on OSX support both DTrace and v8 Trace Events independently. If we include a single instrumentation mechanism, does that need to support multiple trace backends at runtime, or is it ok to pick one backend at compile time?

@jclulow

jclulow commented Feb 15, 2018

As long as DTrace probes are included in the OS X and SmartOS (and FreeBSD?) binaries provided by the project, I would definitely be OK with leaving out the v8 Trace Events probes.

@jasnell
Member Author

jasnell commented Feb 15, 2018

I have a distinct feeling that current users of the v8 trace event probes will say the exact same thing. I think we need to be resigned to the fact that the default Node.js binary is simply going to be required to support multiple trace targets for the foreseeable future. The trick will be finding a way of efficiently instrumenting once for those multiple targets.
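To make "instrument once, emit to every backend" concrete, here is a minimal C++ sketch. All of the macro names and build guards below are illustrative assumptions rather than Node's actual identifiers: Node's real DTrace probes are generated from a .d provider file rather than taken from <sys/sdt.h>, and the trace-event side assumes the Chromium-style TRACE_EVENT_INSTANT2 macro from the vendored trace_event.h is in scope.

```cpp
// Hypothetical fan-out macro: one call site, every compiled-in backend.
// A backend that is switched off at runtime should cost little more than a branch.
#if defined(HAVE_DTRACE) || defined(HAVE_SYSTEMTAP)
# include <sys/sdt.h>  // DTRACE_PROBE2() emits a USDT-style static probe
# define PROBE_USDT2(name, a0, a1) DTRACE_PROBE2(node, name, (a0), (a1))
#else
# define PROBE_USDT2(name, a0, a1) do {} while (0)
#endif

#if defined(HAVE_TRACE_EVENTS)
// Chromium-style trace event macro; arguments simplified for the sketch.
# define PROBE_TRACE2(name, a0, a1)                                \
    TRACE_EVENT_INSTANT2("node", #name, TRACE_EVENT_SCOPE_THREAD,  \
                         "arg0", (a0), "arg1", (a1))
#else
# define PROBE_TRACE2(name, a0, a1) do {} while (0)
#endif

// The single macro an instrumentation point would use.
#define NODE_PROBE2(name, a0, a1) \
  do {                            \
    PROBE_USDT2(name, a0, a1);    \
    PROBE_TRACE2(name, a0, a1);   \
  } while (0)

// Example call site:
//   NODE_PROBE2(http_server_request, port, remote_family);
```

The point of the sketch is only the shape: adding another backend (ETW, LTTng) would mean adding one more PROBE_* branch, while call sites stay unchanged.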

@bnoordhuis
Member

Trace events are a lot more flexible, maintainable and consumable than static probes so if we had to pick one or the other, it's trace events all the way. I don't think it has to be either/or, though.

That said, the static probes do suffer from technical debt; see e.g. nodejs/node#18074. Get involved if you want to ensure their bright future.

@danielkhan
Contributor

I also thought that we agreed to think about deprecating DTrace in favor of trace events. If there is no feature gap to be expected, maybe we can initiate that process and think of EOLing DTrace with ~ v12?

@jasnell
Member Author

jasnell commented Feb 19, 2018

Given the feedback I've received, I don't believe we're going to be able to deprecate any time soon.

@bnoordhuis
Member

I'm curious, feedback from whom? I can't imagine many people caring about our static probes; it might be a literal fingers-on-one-hand deal.

Data point: I quietly removed them from libuv a few years back and no one complained.

@AndreasMadsen
Member

AndreasMadsen commented Feb 20, 2018

Instead of supporting LTTng, could we look into using the Linux native trace events? https://lwn.net/Articles/379903/

If I remember correctly, they are also consumed by LTTng, so hopefully it will make little difference for the LTTng users. (edit: I remember correctly.)

@AndreasMadsen
Member

new LTTng PR: nodejs/node#18945 - it suggests there is some use.

@bnoordhuis
Member

Probably very little though. The build was broken for close to two years without anyone noticing.

@simark

simark commented Feb 23, 2018

I opened the PR mentioned above, and was asked to provide some feedback here.

My problem is that we have a node process with an apparent memory leak (we let it run, and the memory usage goes up endlessly). Inspecting with the devtools didn't show anything obvious; there wasn't a big difference before/after in the number or total size of allocated objects. But there was a difference of tens of gigabytes of mapped virtual memory (original issue: eclipse-theia/theia#1284).

I happen to know LTTng and saw that node had some LTTng instrumentation, so I wanted to try it; that's when I hit the issues addressed by my PR. I don't know if it will actually help, but I thought I'd give it a shot. I haven't looked yet, but I assume there are some tracepoints in the memory allocation/deallocation functions. If there aren't, I thought I could add some. I am not familiar with the node code, so I'm in exploration mode :)

I think that properly instrumenting the binary with tracepoints placed at strategic places is useful. You won't use it often, but the day you need it you're happy it's there.

In the end I don't really mind if it's LTTng or another tracer. But one thing to keep in mind is that some tricky performance problems or rare crashes can only be reproduced in production, so it's sometimes necessary to enable tracing in production. LTTng is very efficient and designed for such use cases.

Tracing needs to have the smallest possible overhead. If it's possible to support multiple trace backends dynamically without any performance cost, then why not. But otherwise, if having to choose at compile time results in a more efficient solution, I would prefer that.
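For reference, a user-space LTTng tracepoint (the LTTng-UST 2.x API) is declared roughly as in the sketch below; the provider, event, and field names are made-up examples, not anything that exists in Node today.

```cpp
// node_tp.h -- hypothetical LTTng-UST tracepoint provider header.
// In exactly one .cc file, #define TRACEPOINT_DEFINE before including this header.
#undef TRACEPOINT_PROVIDER
#define TRACEPOINT_PROVIDER node_example

#undef TRACEPOINT_INCLUDE
#define TRACEPOINT_INCLUDE "./node_tp.h"

#if !defined(NODE_TP_H) || defined(TRACEPOINT_HEADER_MULTI_READ)
#define NODE_TP_H

#include <lttng/tracepoint.h>

TRACEPOINT_EVENT(
    node_example,                                   // provider
    gc_start,                                       // event name
    TP_ARGS(int, gc_type),                          // C arguments at the call site
    TP_FIELDS(ctf_integer(int, gc_type, gc_type))   // fields recorded in the trace
)

#endif  // NODE_TP_H

#include <lttng/tracepoint-event.h>
```

At the instrumentation site the call is `tracepoint(node_example, gc_start, type);`, which compiles down to a check of a per-event enabled flag, so the disabled cost is close to a single branch. Events can then be switched on against a running process with `lttng enable-event --userspace 'node_example:*'`.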

@ofrobots
Contributor

@simark lttng based tracing would probably not help find the memory leak if it happens to be in the JS heap. V8 JIT generates allocation code inline at each allocation site – something that would be hard to intercept using a tracing mechanism. You might want to check out the experimental sampling heap profiler in V8 that is designed for in-production memory leak debugging.
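For anyone who wants to try that route from native code, the embedder-side V8 API looks roughly like the sketch below (it assumes a live v8::Isolate*; the sample interval and stack depth shown are V8's documented defaults).

```cpp
#include <memory>
#include <v8-profiler.h>

// Sketch: sample allocations while the suspect workload runs, then inspect the
// resulting allocation profile for call sites whose allocations keep growing.
void SampleHeapAllocations(v8::Isolate* isolate) {
  v8::HeapProfiler* profiler = isolate->GetHeapProfiler();
  profiler->StartSamplingHeapProfiler(512 * 1024 /* avg bytes between samples */,
                                      16 /* max stack depth per sample */);

  // ... let the leaking workload run for a while ...

  std::unique_ptr<v8::AllocationProfile> profile(profiler->GetAllocationProfile());
  v8::AllocationProfile::Node* root = profile->GetRootNode();
  (void)root;  // walk root->children and each node's allocations to find hot sites
  profiler->StopSamplingHeapProfiler();
}
```

The same profiler is also reachable without embedder code via the DevTools inspector protocol (HeapProfiler.startSampling), which is usually the more practical path for a running Node process.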

@simark

simark commented Feb 26, 2018

> @simark lttng based tracing would probably not help find the memory leak if it happens to be in the JS heap. V8 JIT generates allocation code inline at each allocation site – something that would be hard to intercept using a tracing mechanism.

And that JITed code doesn't call a common allocation function?

> You might want to check out the experimental sampling heap profiler in V8 that is designed for in-production memory leak debugging.

Thanks, I'll give it a try.

@simark

simark commented Feb 26, 2018

> @simark in the interest of keeping the thread on topic of tracing, let's move the memory leak discussion to a separate thread, perhaps in the https://github.com/nodejs/help repo. If the leak is suspected in Node core itself, the issue can be in https://github.com/nodejs/node.

Sorry, I had both issues opened, and posted in the wrong tab. You can delete all history that this ever happened :)

@mike-kaufman
Contributor

mike-kaufman commented Feb 26, 2018

I feel it would be helpful (at least for me) if we had a matrix kind of like the following, so we could look at the different trace libraries we're discussing and understand the differences in capabilities. The rows aren't complete yet, so don't get too hung up on that.

Do others find this useful? If so, should we move it to the top of the issue and iterate on filling it out in more detail? Or move it into a markdown doc somewhere so people can PR it?

| | chromium trace_events | Linux Native TRACE_EVENTS | DTrace | LTTNG | ETW |
| --- | --- | --- | --- | --- | --- |
| supported OS? | all | linux | SmartOS, Mac OS X (partially), FreeBSD | linux | windows |
| requires re-compile? | no | yes (no, assuming kernel >= v4.2 [2015]) | no | yes | no |
| dynamic categories? | yes | no | no (yes, with libusdt) | no | no |
| dynamic listeners? | no (yes, with effort) | yes | yes | yes | yes |
| interrupt (e.g. stack trace) | no | yes | yes | yes | yes |
| documented in Node.js | Yes (poorly) | No | No | No | No |
| developer experience | TBD | TBD | TBD | TBD | TBD |
| others? | | | | | |

edit: filled out by @AndreasMadsen
edit: minor clarification on ETW traces by @mike-kaufman
edit: added easiness of usage row by @mcollina
edit: added FreeBSD to Dtrace column by @mike-kaufman
edit: renamed "easiness of usage" to "developer experience" and tagged all columns as TBD by @mcollina

@mcollina
Member

I would add another row that assesses how easy each tool is to use.

I would say that using ETW is extremely hard (and requires a lot of knowledge of Windows internals). dtrace needs Solaris (but maybe not in the future). LTTNG is being removed.

@mike-kaufman
Contributor

@mcollina - feel free to edit table above to add rows you think are useful.

"Easy to use" risks being a subjective thing, but it's valid, and would be great to see if there's consensus around different tools being "easy"/"hard".

The other angle here (not captured in the table above) is what the expectations are for a runtime on each target OS. E.g., for Windows, the expectation is probably ETW. Not saying that's a strong argument to support one library over another, but it warrants consideration IMO.

@mcollina
Member

@mike-kaufman I've added the row

@jclulow

jclulow commented Feb 28, 2018

Please remove the "easiness of usage" row. It's completely subjective, whereas at least the other rows seem like they could be definitively measured one way or another. Given that you've put "hard" against everything that isn't Chromium Trace Events, it feels fairly politically motivated as well.

You should also add FreeBSD to the list of operating systems providing DTrace support.

@mike-kaufman
Contributor

mike-kaufman commented Feb 28, 2018

@jclulow - I agree the "easy"/"hard" classification is subjective. However, usability is a key criterion. I'd like to get to a place where we have more detailed specifics about what's contributing to the easy/hard perception. Matteo's classification at least starts that conversation.

If you have specifics or details that you think contribute to usability of any these libraries, it would be really helpful if you could chime in with them.

@jclulow

jclulow commented Feb 28, 2018

I can't stress enough how frustrating it is to have an entirely fictional criterion added, in which the features that I use every day are immediately on the back foot with no explanation. If you want to add some concrete criteria in the future, that's fine -- we can talk about that.

This is not that, though, and it needs to go immediately. Alternatively, you can change the entry for DTrace from "hard" to "easy", at which point obviously I agree! (See the problem?)

@mike-kaufman
Contributor

mike-kaufman commented Feb 28, 2018

@mcollina - Can you help the thread understand where your "easy"/"hard" perceptions come from?
@jclulow - can you provide more color around how an end-user would consume DTrace events?

@mcollina
Member

@jclulow I used both dtrace and perf in the past. Setting up and using those tools is not something a typical Node.js user is able to do. I think developer experience is a key criterion for these tools, as most of those options are either a) poorly documented, b) not documented at all, or c) not working at all (lttng).

dtrace, lttng, and Linux TRACE_EVENT are completely undocumented within our organization. We document Chromium trace events: https://nodejs.org/api/tracing.html. Taking documentation as one of the key metrics for usability, all of them except Chromium trace events are hard to use.
I'm not talking about the tools themselves, but rather their usability within the Node.js project.
In most cases, adding some documentation could definitely improve usability.

@jclulow I kindly ask you to use calmer language. No one is offending you here.

@jclulow

jclulow commented Feb 28, 2018

I am completely calm. Like I said, if you want to talk about something more concrete where we can measure alternatives by some objective criteria, that's fine. But, again, that's not what's up there right now!

By way of example, if you wanted to put a row for "described in the Node documentation", that would be fine -- it's an objectively measurable fact. It also has the benefit of being something we could take action on to fix, if needed.

The criterion "easiness of usage" is, by contrast, simply playing favourites with no justification. It should be taken down, first, then (if it makes sense) replaced with objective criteria afterwards.

As an aside, I would note that you can hardly tell somebody else whether they are offended or not.

@mike-kaufman
Contributor

mike-kaufman commented Feb 28, 2018

@jclulow - the point here (at least from my pov) is to have a conversation about the different solutions & try to come to some objective, shared understanding. The easy/hard thing is simply a starting point for that conversation, which I think we're having. I don't foresee that column lasting too long.

@mike-kaufman
Contributor

RE developer experience, there are two different facets here:

  1. experience of someone adding traces
  2. experience of someone consuming traces

Complexity on the consumption side can generally be addressed with additional tools if that is a concern.

@mcollina
Member

I've put the "easiness of usage" as TBD, and added a documentation line.

I strongly argue that we need a measure for this, not of the tools themselves but of their usage with Node.js. perf as a tool can be simple or hard to use, but are we making it harder to use Node.js with perf? How can we improve the developer experience on the various platforms?

@jasnell
Member Author

jasnell commented Feb 28, 2018

One thing that I'd like to clarify... the "hard" / "easy" bit is not a criterion for removing or favoring one mechanism over another. We will be continuing to support DTrace/eBPF/ETW style tracing along with v8 style tracing. Each has its own complexities, limitations, strengths, and relative difficulty of use. For instance, v8 trace events are theoretically "easier" but suffer from a lack of comprehensive tooling.

@geek
Member

geek commented Feb 28, 2018

@jasnell please expand with specifics on your claim

> we currently have to instrument for each individually, which increases maintenance burden

Is this really true? Are tracing providers getting in our way? What problems aren't we able to solve because of our various trace providers? Are developers getting stuck or slowed down in some way because of our trace providers?

@misterdjules

@jasnell The premise of this discussion seems to be:

> Node.js core supports multiple trace providers... we currently have to instrument for each individually, which increases maintenance burden.

Can we be a bit more specific about (and maybe quantify somehow?) the "maintenance burden" of the specific tracing frameworks that are currently implemented in node core? Otherwise I'm not sure why we need to have this discussion.

I think I'm familiar with what is and has been needed to maintain the DTrace support in node core, but I may have missed some things:

  1. Maintaining the ustack helper: this can be challenging and can block V8 upgrades, but has usually been fixed in a timely manner by either @davepacheco, @cjihrig, myself or others (I'm sure I'm forgetting people here, my apologies)

  2. Maintaining the build side of that support, which IIRC has not been a major challenge, or hasn't been a problem too often

Is there anything else? If not, it doesn't seem to be such a challenge to me, but again, I may be missing things.

@bnoordhuis mentioned nodejs/node#18074 as an example of maintenance burden for static probes in general (not just for DTrace), but unless I'm missing something those probes are not supported by DTrace, so I don't believe this example applies to it (not saying Ben was suggesting that).

Now, I can also provide my feedback as a user of the DTrace support in node: in general, I haven't had the need to have static probes in libuv, which is maybe a sign of why removing them a while ago hasn't generated any significant pushback.

On the question of:

> understand[ing] the critical use cases and requirements for the existing probes

I would say that I use the existing DTrace static probes very often to understand the behavior of node HTTP clients and servers (latency, distribution of request methods/paths/etc., and so on) while:

  1. having the guarantee that my observations do not have a significant impact on performance/stability of the observed system

  2. having access to the whole system's view when those probes fire (e.g native stack frames)

@davepacheco mentioned specific use cases in a previous Twitter thread at https://twitter.com/dapsays/status/963921954309730304 that are similar to mine. I can definitely help anyone understand those use cases better. If people are interested, just let me know.

However, just to make sure this is not lost in the length of this comment: I would really like us to be a bit more precise about what the "maintenance burden" is. If there's any way users of the DTrace support can help with that, let's talk about that too.

@jasnell
Member Author

jasnell commented Feb 28, 2018

@geek ... look at the lttng probes, for instance... apparently building with --with-lttng has been broken for a while and no one noticed. We're now just going ahead and pulling those out because apparently no one is using them.

For the dtrace probes, I'm now evaluating the best way to instrument the new http2 code paths, and I'm trying to avoid having to instrument for dtrace and v8 trace events separately. I also want to instrument more for the performance API. Part of this entire discussion is to see if it's possible to develop a single instrumentation approach that allows us to support multiple providers and ensures that we have appropriate test coverage across those.
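As a rough illustration of the shape such a shared probe point could take — every identifier below is hypothetical; the *_ENABLED() predicate mimics the pattern of Node's dtrace-generated probe headers, and TraceCategoryEnabled()/EmitTraceEvent() stand in for the trace-event backend — the property that matters is that a disabled backend costs only a predicate check before any arguments are marshalled:

```cpp
// Hypothetical shared instrumentation point for an http2 stream event.
static inline void ProbeHttp2StreamStart(int32_t stream_id, const char* path) {
  if (NODE_HTTP2_STREAM_START_ENABLED()) {     // USDT-style is-enabled predicate
    NODE_HTTP2_STREAM_START(stream_id, path);  // fire the static probe
  }
  if (TraceCategoryEnabled("node.http2")) {    // runtime trace-event category switch
    EmitTraceEvent("node.http2", "stream_start", stream_id, path);
  }
}
```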

@jasnell
Member Author

jasnell commented Feb 28, 2018

To put folks at ease a bit: no one is considering removing the DTrace support :-) ... I floated the idea as a hypothetical trial balloon to solicit feedback. As soon as I did, enough folks came back with panicked screams that the very notion could be quickly discarded (as I suspected it would be). The key point is: v8 trace events are not an alternative to dtrace probes, and it would be fantastic to have an approach that supports both ... preferably an approach that makes things more useful for everyone.

@misterdjules

@mcollina

> dtrace, lttng, and Linux TRACE_EVENT are completely undocumented within our organization.

FWIW, @davepacheco had submitted a PR to document tracing at https://github.com/nodejs/node-v0.x-archive/pull/9206/files. That never landed, but after taking another brief look now, it looks like it would require at most minor changes to port it to the current documentation.

@mcollina
Member

mcollina commented Feb 28, 2018

@misterdjules that would be fantastic to add. Possibly it best sits in https://github.com/nodejs/nodejs.org/tree/master/locale/en/docs/guides (https://nodejs.org/en/docs/guides/), considering the current structure of the project and documentation. I think we should open a separate discussion on this. It might be better to have it in nodejs/node, as it can change between versions. I think this is a feature that we should document.

@misterdjules

@mcollina I'll see if I can move that forward.

@bnoordhuis
Member

> Given that you've put "hard" against everything that isn't Chromium Trace Events, it feels fairly politically motivated as well.

Trace events integrate seamlessly with DevTools. That's the default debugging experience these days so I think it's a fair comparison, nothing political about it.

@jkrems
Contributor

jkrems commented Mar 1, 2018

I think at the summit we had similar disagreements about "easy/hard", and it was brought up that "complex/simple to use" might be more to the point (how many moving pieces there are in setting it up and evaluating the data).

Worth noting that this is always a reflection of the current state. A tool that requires a lot of setup today might become one-click tomorrow. But with that restriction it seems relatively objective (and useful).

@mike-kaufman
Contributor

RE motivation here, in addition to reducing maintenance, we want a consistent set of trace data across supported OS platforms. Bonus points if this can be done in a way that is idiomatic with the tools/practices of each platform.

@github-actions

This issue is stale because it has been open many days with no activity. It will be closed soon unless the stale label is removed or a comment is made.

github-actions bot added the stale label Jul 18, 2020
@mmarchini
Contributor

Keeping the stale label for the same reason as #10 (this likely falls under the new tracing strategic initiative)
