Skip to content

Running tests in Sway repo creates zombi test processes that keep consuming memory #7449

@ironcev

Description

@ironcev

Description

This issue was originally noticed while working on parallel execution of E2E tests. Parallel execution worked fine and as expected brought significant improvement in execution speed. On my machine, it reduced the overall test execution from ~11 minutes to under 2 minutes. But it also created thousand of zombie test processes, one for each test.

The existence of zombie processes was not related to the changes done in parallel execution. Zombie processes are an existing behavior of test that we were not aware of before. Also, it is likely not related only to the Sway tests but to all forc tools. More on that in #7450.

To reproduce the issue, just run sway/test$ cargo run ideally with some small number of tests for quick execution. E.g.: cargo r -r -- --release language/match_.

During execution of tests, two entries can be seen in, e.g. System Monitor. One will consume ~8MB and the other ~37MB. After completion, the larger one will "survive", consuming zero CPU but still holding memory:

Image

Analysis and Potential Root-cause

After going through code and connecting the dots, I realized that the issue lays in the intersection of tokio, tracing_subscriber with fuel-telemetry, and inherited stdio in child processes/threads.

Telemetry is a layer of our tracing_subscriber configuration, that uses tracing_appender::non_blocking::WorkerGuard. That guard is kept in forc-tracing within a thread local storage TELEMETRY_GUARD:

thread_local! {
    static TELEMETRY_GUARD: std::cell::RefCell<Option<WorkerGuard>> = const { std::cell::RefCell::new(None) };
}

This is the first thing I've noticed as an issue. Namely, the init_tracing_subscriber will set up a global tracing subscriber for the whole process, that will has a layer, telemetry, that gets dropped when the local thread finishes. Essentially, we have a layer that potentially doesn't live long enough if the init_tracing_subscriber is called from a thread which is not the main process thread.

And that is exactly what happens, because the main of the test binary is a #[tokio::main] which means Tokio will by default run it on a separate worker thread.

Now, fuel-telemetry does a lot of manual interaction around stdout, stderr, piping, etc. E.g., see daemonise_with_helpers function. The comments and code related to file descriptors can be found everywhere. E.g., this comment:

// We need to close all file descriptors in `daemonise()`, but need to skip the
// first three (0 = STDIN, 1 = STDOUT, 2 = STDERR) as we deal with stdio later
const FIRST_NON_STDIO_FD: i32 = 3;

So my assumption is the following. When test's main finishes on the Tokio worker thread, the drop will be called for the telemetry WorkerGuard. The telemetry subscriber will want to flush all remaining input but will also wait for other stderr and stdout file handler to be dropped to ensure there are no more writes coming. At that moment, our global tracing subscriber is still alive. It will get out of scope only once the wrapping Tokio's main finishes. So the fmt layer still has file descriptors opened for stderr and stdout. And that keeps the telemetry's non-blocking worker thread from closing.

I didn't have time to debug this assumption in-depth, but I've tried three changes in code, that should, if the assumption holds, remove the zombie processes. The three changes were:

  • compiling forc-tracing without telemetry,
  • using std::io::empty() in the fmt layer,
  • using #[tokio::main(flavor = "current_thread") to force TELEMETRY_GUARD to live in the same thread as the rest of the subscribers.

As expected, all three changes resulted in no zombie processes once test terminates.

Fix

For test async execution in general does not bring benefits. And having Tokio runtime running on a worker thread neither. So, strictly for removing test zombies, using #[tokio::main(flavor = "current_thread") is an acceptable solution. Considering that multiple worker threads anyhow do not bring any benefits to test this should be done regardless of the zombies issue.

However, we still need to investigate and fix the underlying root-cause which is likely the one explained above. But that needs additional analysis and confirmation. For that, see #7450.

Metadata

Metadata

Assignees

Labels

bugSomething isn't workingteam:compilerCompiler TeamtestingGeneral testing

Type

No type

Projects

No projects

Milestone

No milestone

Relationships

None yet

Development

No branches or pull requests

Issue actions