Skip to content
This repository has been archived by the owner on Aug 7, 2023. It is now read-only.

Triggering post-mortem diagnostics in Node core #44

Closed
rnchamberlain opened this issue Feb 24, 2017 · 19 comments
Closed

Triggering post-mortem diagnostics in Node core #44

rnchamberlain opened this issue Feb 24, 2017 · 19 comments

Comments

@rnchamberlain
Copy link
Contributor

This came up in the Diagnostics WG meeting 23 Feb 2017: nodejs/diagnostics#85

  1. Why doesn't Node.js build in more comprehensive diagnostics?
  2. If it did, would this functionality be in Node itself, or in /deps - libuv, V8 or other?

There are some diagnostic features in Node.js:

  • stdout/stderr messages - e.g. stack traces on uncaught exception, GC information on heap OOM
  • Ben Noordhuis recently added native stack traces to stderr for crash scenarios
  • --abort_on_uncaught_exception
  • debugger support: https://nodejs.org/api/debugger.html

The kind of things that other runtimes provide are:

  • options to produce selected/appropriate dump types on various failures e.g. heap dumps on OOM
  • signal handlers and exception handlers to catch native signals and exceptions in native code
  • loop and deadlock detection
  • APIs to allow configuration and triggering of diagnostic output
  • support for external triggering of diagnostic output

Some arguments for adding diagnostics into Node core are:

  • diagnostics available out-of-the-box
  • closely coupled diagnostics do a better job

Some arguments against are:

  • fundamental principle of keeping core small/lean/fast (re-iterated at NodeInterative 2016)
  • anything that can be provided as an add-on module should be
  • quicker to deliver new features as add-on modules, rather than in the Node.js release cycle

Note: post-mortem tooling is separate issue, the arguments for bundling it in core are likely to be weaker. However, there is discussion on upstreaming core dump tooling to V8: nodejs/llnode#64

@richardlau
Copy link
Member

cc @nodejs/diagnostics

@mhdawson
Copy link
Member

I think node core should include at least a minimal set of diagnostics out of the box. At the very least you want to be able to generate capture the dumps/information that you can then use with external tools. I'd start be making sure that out of the box you can easily generate the following on request, on uncaught exception or crash:

  1. human readable snapshot of key information (node-report)
  2. heapdump
  3. core dump

@sam-github
Copy link

Is a heapdump deriveable from a core dump? In theory, a heapdump is a walk of the in-memory v8 data structures, and those members got serialized to disk with the core dump.

@bnoordhuis
Copy link
Member

You can to a first approximation by interpreting the node_isolate global as a v8::internal::Isolate pointer and walking the heap_.roots_ array.

However, references to heap objects also exist on the stack, in registers, in handle scopes and in the global handles list (what stores v8::Persistent<T> references) so getting them all is pretty complicated. It's doubly hard when the heap isn't in a consistent state, such as during GC.

@jclulow
Copy link

jclulow commented Feb 24, 2017

That's why mdb_v8 scans the entire process address space and uses various heuristics to identify active objects, rather than starting a walk from a particular list of roots. The heuristics could, I suspect, be improved by including reachability from the roots, but it's not required right now to find all objects.

@bnoordhuis
Copy link
Member

You scan for anything that looks like a pointer into one of the JS heap ranges? I've tinkered with that approach but it wasn't reliable enough to my liking. What else do you do?

@davepacheco
Copy link

Among the heuristics that @jclulow alluded to: when mdb_v8 finds a candidate object, it attempts to enumerate its properties. (It needs to do this anyway to classify objects to allow users to search for objects by property name or constructor.) In order to do that, several data structures in different C++ objects need to be self-consistent, and their types need to match well-known values. (There's a detailed comment including diagrams of some of the data structures in the mdb_v8 source.) If any of these don't match up, mdb_v8 assumes this isn't a valid object and doesn't report it. For each property of the object, mdb_v8 also fetches the value (which itself sometimes requires traversing another data structure whose type has to match what's expected) and checks that its type corresponds to a valid type that it knows about. We have not found many objects that are consistent in all these ways but that are actually false positives, nor have we found too many that are pruned by these heuristics that are actually valid. (There are some exceptions, and there's more we could do here. For example, I'd like to build a dcmd that iterates the objects we pruned and makes sure that they're not reachable from the graph of objects that we found. That would help us smoke out any cases where we're pruning something incorrectly or not pruning something we should be.)

@rnchamberlain
Copy link
Contributor Author

llnode uses a similar brute force algorithm, see FindJSObjectsVisitor::Visit() in https://github.com/nodejs/llnode/blob/master/src/llscan.cc

@hhellyer had a close look at producing a JSON heap dump from a core dump, using llnode. It would provide a useful route to the DevTools GUI with its retained size analysis. However, we don't have enough metadata to find the heap roots, handles etc that Ben describes above, and that we'd need for the heap dump format.

I think you can get enough from line-mode mdb_v8 or lldb/llnode to solve memory leaks, just it's a bit harder work.

@mrkmarron
Copy link

I have been experimenting with some of these, and other related ideas, for improving the developer experience when dealing with remotely deployed applications (e.g., running in the cloud). In particular I was experimenting a system that provided:

  • An easy out of the box experience that provides value without extra configuration by automatically hooking:
    • Unhandled exceptions
    • Exit with a non-zero error code (optionally any error code).
    • SigInt (optional)
    • Assert failures
    • Writes to console.warn and console.error with configurable sampling rates and limits.
  • Support for sampling and binning reports (so we don't flood a developer with redundant reports or live-lock the application).
  • Provide hooks for automatically moving any error reports off the local machine or cloud server to a location of a developers choice for later offline analysis.
  • My initial implementation is hard-coded to work with/emit time-travel diagnostics traces that can be debugged offline in VSCode but the framework can easily be parametrized to handle other actions.

The prototype work is in my fork here (enabled using a --record flag) with the major changes being:

After playing around with how this works I am really excited about the great developer experience that can be created with this kind of approach (particularly if it includes heap dumps or execution trace information).

@rnchamberlain
Copy link
Contributor Author

Adding better hooks in node core would be a good starting point, and could perhaps get around the 'keep core small' rule. Improved hooks are needed in c/c++ code as used by node-report, as well as in .js code (as per @mrkmarron).

@mhdawson
Copy link
Member

mhdawson commented Apr 12, 2017

I still be believe that we can balance small core with what makes sense to be in the core. I believe diagnostics, particularly where native code is required, meets the bar as to what should be in core.

@misterdjules
Copy link

@mhdawson

I believe diagnostics, particularly where native code is required, meets the bar as to what should be in core.

I would think determining what meets the bar for what should be in core requires us to take a closer look at the "what" in question, and then answering why it needs to be added into node's core.

I also think that "requiring native code" is not a sufficient reason to mandate inclusion in core.

From @rnchamberlain's original post, I have the following questions:

options to produce selected/appropriate dump types on various failures e.g. heap dumps on OOM

What does happen if the node process runs out of memory (not only out of the V8 JS heap's space)? Would a mechanism that generates a heapdump still be able to find the resources needed to generate it? I would think that in some cases at least that could be problematic, and that an approach based on generating OS-level core dumps would not have these problems.

signal handlers and exception handlers to catch native signals and exceptions in native code

Can you expand on that? I don't understand what this describes. Can't node programs already catch signals and exceptions raised/thrown from native code?

loop and deadlock detection

Can you expand on that? Could you describe representative use cases and what tools/mechanisms you have in mind to deal with them?

The following argument for adding the above mentioned diagnostic facilities into node's core:

diagnostics available out-of-the-box

is not clear to me, since we could use this argument for pretty much anything that is useful for any group of node users. I would think the question to ask ourselves is closer to: "Do users of node need to have these diagnostic facilities available out of the box and why?".

The second argument for adding the above mentioned diagnostic facilities into node's core:

closely coupled diagnostics do a better job

is also not clear to me. How better would the above mentioned diagnostic tools be if they were included in core instead of available as a npm module?

In general, I would think that for the project to properly evaluate the pros and cons of what adding the above mentioned diagnostic tools means and why we'd want to do that, writing a detailed enhancement proposal in nodejs/node-eps would be a good way forward.

@mhdawson
Copy link
Member

mhdawson commented Apr 12, 2017

The idea of a proposal in nodejs/node-eps might be the way to go.

Not to answer all of your questions, but there was recent example we had an individual reach out to @MylesBorins, @ofrobots and myself on twitter asking how to debug a memory leak. When we suggested generating heapdumps with the heapdump module they initially had trouble getting the module compiled and working in their environment. It's a simple case where there would have been value to them by simply having a mechanism to generate heapdumps pre-built and available as part of the runtime.

@misterdjules
Copy link

Not to answer all of your questions, but there was recent example we had an individual reach out to @MylesBorins, @ofrobots and myself on twitter asking how to debug a memory leak.

I believe this is https://twitter.com/ofrobots/status/847260522982526976. Adding it so that people in this discussion can get a better idea of what that discussion was.

It's a simple case where there would have been value to them by simply having a mechanism to generate heapdumps pre-built and available as part of the runtime.

It's not clear to me how having a mechanism to generate heapdumps built into node's core would have improved the user experience in that specific case.

But most importantly it's not clear to me that not having that mechanism in core would not allow to address these issues at least as well.

In any case, I would expect that an "enhancement proposal" in nodejs/node-eps would document this use case and others in greater details than a Twitter thread.

@bnoordhuis
Copy link
Member

It's only a matter of time before the inspector grows programmatic hooks and then user-programmable heapdumps are a fait accompli and this whole discussion will be moot. :-)

@rnchamberlain
Copy link
Contributor Author

It's only a matter of time before the inspector grows programmatic hooks and then user-programmable heapdumps are a fait accompli and this whole discussion will be moot. :-)

That is what happened more-or-less in Java. After a few iterations of the debugger support, an architected API appeared https://docs.oracle.com/javase/7/docs/platform/jvmti/jvmti.html, which a variety of tools could exploit.

@misterdjules
Copy link

@bnoordhuis

It's only a matter of time before the inspector grows programmatic hooks and then user-programmable heapdumps are a fait accompli and this whole discussion will be moot. :-)

The availability of that API doesn't imply that it should be used e.g to generate heap dumps on out of memory errors.

@rnchamberlain

That is what happened more-or-less in Java. After a few iterations of the debugger support, an architected API appeared https://docs.oracle.com/javase/7/docs/platform/jvmti/jvmti.html, which a variety of tools could exploit.

The JVMTI document describes a broad set of APIs, some of which at least are not related to post-mortem debugging. This repository (and thus I assumed this discussion) is about post-mortem debugging. Thus it's not clear to me how the two relate to each other.

Can we scope this issue so that it's clear what needs to be discussed in terms of post-mortem debugging?

@rnchamberlain
Copy link
Contributor Author

So I think the scope of this issue, and what we are discussing as a first pass is now:

  1. Should Node.js have built-in support for producing heapdumps, core dumps and node-reports?
    and/or
  2. Should we add more APIs to Node.js/V8 to allow add-ons to produce dumps on a variety of events?

Possible events are for example: JS heap OOM or threshold, uncaught exception, slow response time, tracepoint etc.

If there is enthusiasm we could progress to writing the detail on https://github.com/nodejs/node-eps

@richardlau
Copy link
Member

Closing due to inactivity. If this work still needs to be tracked, please open a new issue over in https://github.com/nodejs/diagnostics.

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

No branches or pull requests

9 participants