Description
When using harness_begin and harness_end to enable statistics collection, the reported number of GCs gets stuck at 2047 if more GCs than that are triggered, and a warning is printed to STDERR:
Warning: number of GC phases exceeds MAX_PHASES
The message is printed in Stats::start_gc:
pub fn start_gc(&self) {
    self.gc_count.fetch_add(1, Ordering::SeqCst);
    if !self.get_gathering_stats() {
        return;
    }
    if self.get_phase() < MAX_PHASES - 1 {
        let counters = self.counters.lock().unwrap();
        for counter in &(*counters) {
            counter.lock().unwrap().phase_change(self.get_phase());
        }
        self.shared.increment_phase();
    } else if !self.exceeded_phase_limit.load(Ordering::SeqCst) {
        eprintln!("Warning: number of GC phases exceeds MAX_PHASES");
        self.exceeded_phase_limit.store(true, Ordering::SeqCst);
    }
}

The number of GCs printed in harness_end is actually self.get_phase() / 2, and it comes from the following function:
pub fn print_stats<VM: VMBinding>(&self, mmtk: &'static MMTK<VM>) {
    println!(
        "============================ MMTk Statistics Totals ============================"
    );
    let scheduler_stat = mmtk.scheduler.statistics();
    self.print_column_names(&scheduler_stat); // This prints "GC time.other time.stw ....."
    print!("{}\t", self.get_phase() / 2); // This is the number under "GC"
    let counter = self.counters.lock().unwrap();
    // ...
Because the phase counter stops incrementing once it reaches MAX_PHASES - 1, and MAX_PHASES is defined as 1 << 12 (i.e. 4096), the printed number of GCs is capped at 4095 / 2 = 2047.
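As a quick sanity check of that arithmetic, here is a small standalone snippet (independent of mmtk-core, and assuming end_gc uses the same guard as start_gc) that reproduces the cap:

const MAX_PHASES: usize = 1 << 12; // 4096, as in mmtk-core

fn main() {
    // start_gc only advances the phase while get_phase() < MAX_PHASES - 1,
    // so the phase value can never exceed MAX_PHASES - 1 = 4095.
    let capped_phase = MAX_PHASES - 1;
    // print_stats reports get_phase() / 2 under the "GC" column,
    // so the reported GC count is stuck at 4095 / 2 = 2047.
    assert_eq!(capped_phase / 2, 2047);
}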
When running lusearch from DaCapo Chopin with the GenCopy or GenImmix plan and a heap size of 2.5x the minimum heap size, the number of GCs exceeds 2047. This may happen with other benchmarks, too. For example, if it happens while running xalan, the warning Warning: number of GC phases exceeds MAX_PHASES printed to STDERR will cause the benchmark validation to fail.
There is a comment near the MAX_PHASES constant:
// TODO: Increasing this number would cause JikesRVM die at boot time. I don't really know why.
// E.g. using 1 << 14 will cause JikesRVM segfault at boot time.
pub const MAX_PHASES: usize = 1 << 12;
In each EventCounter, there is an embedded array of MAX_PHASES elements:
pub struct EventCounter {
    name: String,
    pub implicitly_start: bool,
    merge_phases: bool,
    count: Box<[u64; MAX_PHASES]>, // This one
    current_count: u64,
    running: bool,
    stats: Arc<SharedStats>,
}

It looks like this is intended to record the value for each individual STW and "other" phase. That works fine as long as the benchmark never triggers more than 2047 GCs, but in practice it often does, especially with a smaller heap and a generational plan. Once the number of phases exceeds the limit, subsequent counter values (such as time.stw and time.other) are silently dropped. This explains some of the anomalies in #1062. From the un-normalized plots, we can see that the number of GCs for GenCopy and GenImmix on lusearch is stuck at 2047, which means the time.other and time.stw of later GCs were ignored. This explains the otherwise hard-to-explain increases and decreases in mutator time across different machines.
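For illustration only, here is a simplified, hypothetical per-phase counter (not the actual mmtk-core implementation). It shows why a fixed-size array forces a hard cap: once every slot has been used, later phases have nowhere to record their values.

const MAX_PHASES: usize = 1 << 12;

/// Simplified sketch of a per-phase counter backed by a fixed-size array.
struct PerPhaseCounter {
    count: Box<[u64; MAX_PHASES]>,
    current_count: u64,
}

impl PerPhaseCounter {
    fn new() -> Self {
        PerPhaseCounter {
            count: Box::new([0; MAX_PHASES]),
            current_count: 0,
        }
    }

    /// Record an event in the current phase.
    fn increment(&mut self) {
        self.current_count += 1;
    }

    /// Called on a phase change: store the value accumulated in the old
    /// phase and reset for the new one. Once old_phase runs past the end
    /// of the array, the value is silently dropped.
    fn phase_change(&mut self, old_phase: usize) {
        if old_phase < MAX_PHASES {
            self.count[old_phase] = self.current_count;
        }
        self.current_count = 0;
    }
}

fn main() {
    let mut c = PerPhaseCounter::new();
    for phase in 0..(MAX_PHASES + 10) {
        c.increment();
        c.phase_change(phase); // values for phase >= MAX_PHASES are lost
    }
}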
What should we do?
We never look at the STW time and "other" time of individual phases. We use the statistics counters mainly for printing "GC", "time.other", and "time.stw" when harness_end is called, and that only requires the accumulated time across all STW phases and all "other" phases. It would be more practical to keep just two numbers per counter: the aggregated value of all odd phases and the aggregated value of all even phases, as sketched below.
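A minimal sketch of what that could look like (hypothetical names, not a patch against the actual mmtk-core code), assuming each counter only needs the two running totals and the GC count:

/// Hypothetical aggregated counter: keeps only two running totals
/// instead of one value per phase, so it never hits a phase limit.
struct AggregatedCounter {
    even_total: u64, // sum of values recorded in even-numbered phases
    odd_total: u64,  // sum of values recorded in odd-numbered phases
    current: u64,    // value accumulated in the current phase
    phase: usize,    // current phase index (can grow without bound)
}

impl AggregatedCounter {
    fn new() -> Self {
        AggregatedCounter { even_total: 0, odd_total: 0, current: 0, phase: 0 }
    }

    fn increment(&mut self, delta: u64) {
        self.current += delta;
    }

    /// On every phase change, fold the current value into the right total.
    /// No array indexing, so the number of GCs is unbounded.
    fn phase_change(&mut self) {
        if self.phase % 2 == 0 {
            self.even_total += self.current;
        } else {
            self.odd_total += self.current;
        }
        self.current = 0;
        self.phase += 1;
    }

    /// Aggregated totals for printing at harness_end:
    /// (total over even phases, total over odd phases).
    fn totals(&self) -> (u64, u64) {
        (self.even_total, self.odd_total)
    }

    /// Number of completed GCs: one per pair of phases.
    fn gc_count(&self) -> usize {
        self.phase / 2
    }
}

fn main() {
    let mut c = AggregatedCounter::new();
    // Simulate 10,000 GCs (20,000 phase changes); no cap is hit.
    for _ in 0..20_000 {
        c.increment(1);
        c.phase_change();
    }
    assert_eq!(c.gc_count(), 10_000);
    let (even, odd) = c.totals();
    assert_eq!(even + odd, 20_000);
}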
If we want to inspect individual phases (e.g. to find the number of work packets executed), we now have a better tool for that: the eBPF tracing tools.