Currently, we parallelize the sweeping work by making one work packet for the global pool and one packet for each mutator. This is fine for multi-threaded workloads, but when there is only one mutator, it hits a pathological case where the Release stage is dominated by a single long-running ReleaseMutator work packet. Here is a timeline captured using eBPF when executing the Liquid benchmark using the Ruby binding (a single mutator, but multiple GC workers):

In comparison, here is the timeline for the lusearch benchmark in the DaCapo Chopin benchmark suite (with eager sweeping force-enabled). The mutators are swept in parallel, which is better, but the Release work packet still does not run in parallel with the ReleaseMutator work packets.

We should parallelize it by creating multiple work packets, each releasing a reasonable number of blocks.
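
To make the proposal concrete, here is a minimal sketch of the idea using only the Rust standard library. It is not based on the actual scheduler API: `Block`, `SweepBlocks`, `make_sweep_packets`, `run_packets`, and the packet size are all hypothetical names and values chosen for illustration. The point is that the blocks are cut into bounded batches, so a single mutator (or the global pool) with many blocks yields many small packets that idle GC workers can pick up.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::thread;

/// Hypothetical stand-in for a block owned by the global pool or a mutator.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct Block(pub usize);

/// Hypothetical work packet that sweeps a bounded batch of blocks.
pub struct SweepBlocks {
    pub blocks: Vec<Block>,
}

impl SweepBlocks {
    /// Sweep every block in this packet and return how many were processed.
    pub fn do_work(&self) -> usize {
        // A real sweeper would free dead cells or release empty blocks here.
        self.blocks.len()
    }
}

/// Split `blocks` into packets holding at most `blocks_per_packet` blocks each.
pub fn make_sweep_packets(blocks: &[Block], blocks_per_packet: usize) -> Vec<SweepBlocks> {
    assert!(blocks_per_packet > 0, "packet size must be positive");
    blocks
        .chunks(blocks_per_packet)
        .map(|chunk| SweepBlocks { blocks: chunk.to_vec() })
        .collect()
}

/// Run the packets on `workers` threads that claim packets through a shared
/// counter (a simplified stand-in for a real work-packet scheduler).
/// Returns the total number of blocks swept.
pub fn run_packets(packets: &[SweepBlocks], workers: usize) -> usize {
    let next = AtomicUsize::new(0);
    let swept = AtomicUsize::new(0);
    thread::scope(|s| {
        for _ in 0..workers {
            s.spawn(|| loop {
                // Claim the next unprocessed packet, if any remain.
                let i = next.fetch_add(1, Ordering::Relaxed);
                match packets.get(i) {
                    Some(packet) => {
                        swept.fetch_add(packet.do_work(), Ordering::Relaxed);
                    }
                    None => break,
                }
            });
        }
    });
    swept.load(Ordering::Relaxed)
}
```

With this shape, the packet size is the tuning knob: too small and scheduling overhead dominates, too large and we are back to a few long-running packets. What counts as "reasonable" would need to be measured.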