Running tests in Sway repo creates zombi `test` processes that keep consuming memory

## Description

This issue was originally noticed while working on parallel execution of E2E tests. Parallel execution worked fine and as expected brought significant improvement in execution speed. On my machine, it reduced the overall test execution from ~11 minutes to under 2 minutes. But it also created thousand of zombie `test` processes, one for each test.

The existence of zombie processes was not related to the changes done in parallel execution. Zombie processes are an existing behavior of `test` that we were not aware of before. Also, it is likely not related only to the Sway tests but to all `forc` tools. More on that in #7450.

To reproduce the issue, just run `sway/test$ cargo run` ideally with some small number of tests for quick execution. E.g.: `cargo r -r -- --release language/match_`.

During execution of tests, two entries can be seen in, e.g. System Monitor. One will consume ~8MB and the other ~37MB. After completion, the larger one will "survive", consuming zero CPU but still holding memory: 

<img width="932" height="113" alt="Image" src="https://github.com/user-attachments/assets/d5f957d5-b404-4d79-9792-91c811c9bc8a" />

## Analysis and Potential Root-cause

After going through code and connecting the dots, I realized that the issue lays in the intersection of `tokio`, `tracing_subscriber` with `fuel-telemetry`, and inherited `stdio` in child processes/threads.

Telemetry is a layer of our `tracing_subscriber` configuration, that uses `tracing_appender::non_blocking::WorkerGuard`. That guard is kept in `forc-tracing` within a **thread local** storage `TELEMETRY_GUARD`:

```rust
thread_local! {
    static TELEMETRY_GUARD: std::cell::RefCell<Option<WorkerGuard>> = const { std::cell::RefCell::new(None) };
}
```

This is the first thing I've noticed as an issue. Namely, the `init_tracing_subscriber` will set up a **global** tracing subscriber for the whole process, that will has a layer, telemetry, that gets dropped when the **local** thread finishes. Essentially, we have a layer that potentially doesn't live long enough if the `init_tracing_subscriber` is called from a thread which is not the main process thread.

And that is exactly what happens, because the `main` of the `test` binary is a `#[tokio::main]` which means Tokio will by default run it on a separate worker thread.

Now, `fuel-telemetry` does a lot of manual interaction around `stdout`, `stderr`, piping, etc. E.g., see `daemonise_with_helpers` function. The comments and code related to file descriptors can be found everywhere. E.g., this comment:

```
// We need to close all file descriptors in `daemonise()`, but need to skip the
// first three (0 = STDIN, 1 = STDOUT, 2 = STDERR) as we deal with stdio later
const FIRST_NON_STDIO_FD: i32 = 3;
```

So my assumption is the following. When `test`'s `main` finishes on the Tokio worker thread, the drop will be called for the telemetry `WorkerGuard`. The telemetry subscriber will want to flush all remaining input but will also wait for other `stderr` and `stdout` file handler to be dropped to ensure there are no more writes coming. At that moment, our global tracing subscriber is still alive. It will get out of scope only once the wrapping Tokio's `main` finishes. So the `fmt` layer still has file descriptors opened for `stderr` and `stdout`. And that keeps the telemetry's non-blocking worker thread from closing.

I didn't have time to debug this assumption in-depth, but I've tried three changes in code, that should, if the assumption holds, remove the zombie processes. The three changes were:
- compiling `forc-tracing` without telemetry,
- using `std::io::empty()` in the `fmt` layer,
- using `#[tokio::main(flavor = "current_thread")` to force `TELEMETRY_GUARD` to live in the same thread as the rest of the subscribers.

As expected, all three changes resulted in no zombie processes once `test` terminates.

## Fix

For `test` async execution in general does not bring benefits. And having Tokio runtime running on a worker thread neither. So, strictly for removing `test` zombies, using `#[tokio::main(flavor = "current_thread")` is an acceptable solution. Considering that multiple worker threads anyhow do not bring any benefits to `test` this should be done regardless of the zombies issue.

However, we still need to investigate and fix the underlying root-cause which is likely the one explained above. But that needs additional analysis and confirmation. For that, see #7450.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Running tests in Sway repo creates zombi `test` processes that keep consuming memory #7449

Description

Analysis and Potential Root-cause

Fix

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Running tests in Sway repo creates zombi test processes that keep consuming memory #7449

Description

Description

Analysis and Potential Root-cause

Fix

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions

Running tests in Sway repo creates zombi `test` processes that keep consuming memory #7449