Description
When using harness_begin and harness_end to enable statistics collection, the reported number of GCs gets stuck at 2047 if more GCs than that are triggered, and a warning is printed to STDERR:
Warning: number of GC phases exceeds MAX_PHASES
The message is printed in Stats::start_gc:
pub fn start_gc(&self) {
    self.gc_count.fetch_add(1, Ordering::SeqCst);
    if !self.get_gathering_stats() {
        return;
    }
    if self.get_phase() < MAX_PHASES - 1 {
        let counters = self.counters.lock().unwrap();
        for counter in &(*counters) {
            counter.lock().unwrap().phase_change(self.get_phase());
        }
        self.shared.increment_phase();
    } else if !self.exceeded_phase_limit.load(Ordering::SeqCst) {
        eprintln!("Warning: number of GC phases exceeds MAX_PHASES");
        self.exceeded_phase_limit.store(true, Ordering::SeqCst);
    }
}

The number of GCs printed in harness_end is actually self.get_phase() / 2, and it comes from the following function:
pub fn print_stats<VM: VMBinding>(&self, mmtk: &'static MMTK<VM>) {
    println!(
        "============================ MMTk Statistics Totals ============================"
    );
    let scheduler_stat = mmtk.scheduler.statistics();
    self.print_column_names(&scheduler_stat); // This prints "GC time.other time.stw ....."
    print!("{}\t", self.get_phase() / 2); // This is the number under "GC"
    let counter = self.counters.lock().unwrap();
    // ...
Because the phase counter stops incrementing once it reaches MAX_PHASES - 1, and MAX_PHASES is defined as 1 << 12 (i.e. 4096), the printed number of GCs is capped at 4095 / 2 = 2047.
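As a quick sanity check of that arithmetic, here is a small standalone snippet (independent of mmtk-core, and assuming end_gc uses the same guard as start_gc) that reproduces the cap:

const MAX_PHASES: usize = 1 << 12; // 4096, as in mmtk-core

fn main() {
    // start_gc only advances the phase while get_phase() < MAX_PHASES - 1,
    // so the phase value can never exceed MAX_PHASES - 1 = 4095.
    let capped_phase = MAX_PHASES - 1;
    // print_stats reports get_phase() / 2 under the "GC" column,
    // so the reported GC count is stuck at 4095 / 2 = 2047.
    assert_eq!(capped_phase / 2, 2047);
}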
When running lusearch from DaCapo Chopin with the GenCopy or GenImmix plan and a heap size of 2.5x the minimum heap size, the number of GCs exceeds 2047. This may happen with other benchmarks, too. For example, if it happens while running xalan, the warning Warning: number of GC phases exceeds MAX_PHASES printed to STDERR will cause the benchmark validation to fail.
There is a comment near the MAX_PHASES constant:
// TODO: Increasing this number would cause JikesRVM die at boot time. I don't really know why.
// E.g. using 1 << 14 will cause JikesRVM segfault at boot time.
pub const MAX_PHASES: usize = 1 << 12;
In each EventCounter, there is an embedded array of MAX_PHASES elements:
pub struct EventCounter {
    name: String,
    pub implicitly_start: bool,
    merge_phases: bool,
    count: Box<[u64; MAX_PHASES]>, // This one
    current_count: u64,
    running: bool,
    stats: Arc<SharedStats>,
}

It looks like this is intended to record the value for each individual STW and "other" phase. That works fine as long as the benchmark never triggers more than 2047 GCs, but in practice it often does, especially with a smaller heap and a generational plan. Once the number of phases exceeds the limit, subsequent counter values (such as time.stw and time.other) are silently dropped. This explains some of the anomalies in #1062. From the un-normalized plots, we can see that the number of GCs for GenCopy and GenImmix on lusearch is stuck at 2047, which means the time.other and time.stw of later GCs were ignored. This explains the otherwise hard-to-explain increases and decreases in mutator time across different machines.
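For illustration only, here is a simplified, hypothetical per-phase counter (not the actual mmtk-core implementation). It shows why a fixed-size array forces a hard cap: once every slot has been used, later phases have nowhere to record their values.

const MAX_PHASES: usize = 1 << 12;

/// Simplified sketch of a per-phase counter backed by a fixed-size array.
struct PerPhaseCounter {
    count: Box<[u64; MAX_PHASES]>,
    current_count: u64,
}

impl PerPhaseCounter {
    fn new() -> Self {
        PerPhaseCounter {
            count: Box::new([0; MAX_PHASES]),
            current_count: 0,
        }
    }

    /// Record an event in the current phase.
    fn increment(&mut self) {
        self.current_count += 1;
    }

    /// Called on a phase change: store the value accumulated in the old
    /// phase and reset for the new one. Once old_phase runs past the end
    /// of the array, the value is silently dropped.
    fn phase_change(&mut self, old_phase: usize) {
        if old_phase < MAX_PHASES {
            self.count[old_phase] = self.current_count;
        }
        self.current_count = 0;
    }
}

fn main() {
    let mut c = PerPhaseCounter::new();
    for phase in 0..(MAX_PHASES + 10) {
        c.increment();
        c.phase_change(phase); // values for phase >= MAX_PHASES are lost
    }
}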
What should we do?
We never look at the STW time and "other" time of individual phases. We use the statistics counters mainly for printing "GC", "time.other", and "time.stw" when harness_end is called, and that only requires the accumulated time across all STW phases and all "other" phases. It would be more practical to keep just two numbers per counter: the aggregated value of all odd phases and the aggregated value of all even phases, as sketched below.
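A minimal sketch of what that could look like (hypothetical names, not a patch against the actual mmtk-core code), assuming each counter only needs the two running totals and the GC count:

/// Hypothetical aggregated counter: keeps only two running totals
/// instead of one value per phase, so it never hits a phase limit.
struct AggregatedCounter {
    even_total: u64, // sum of values recorded in even-numbered phases
    odd_total: u64,  // sum of values recorded in odd-numbered phases
    current: u64,    // value accumulated in the current phase
    phase: usize,    // current phase index (can grow without bound)
}

impl AggregatedCounter {
    fn new() -> Self {
        AggregatedCounter { even_total: 0, odd_total: 0, current: 0, phase: 0 }
    }

    fn increment(&mut self, delta: u64) {
        self.current += delta;
    }

    /// On every phase change, fold the current value into the right total.
    /// No array indexing, so the number of GCs is unbounded.
    fn phase_change(&mut self) {
        if self.phase % 2 == 0 {
            self.even_total += self.current;
        } else {
            self.odd_total += self.current;
        }
        self.current = 0;
        self.phase += 1;
    }

    /// Aggregated totals for printing at harness_end:
    /// (total over even phases, total over odd phases).
    fn totals(&self) -> (u64, u64) {
        (self.even_total, self.odd_total)
    }

    /// Number of completed GCs: one per pair of phases.
    fn gc_count(&self) -> usize {
        self.phase / 2
    }
}

fn main() {
    let mut c = AggregatedCounter::new();
    // Simulate 10,000 GCs (20,000 phase changes); no cap is hit.
    for _ in 0..20_000 {
        c.increment(1);
        c.phase_change();
    }
    assert_eq!(c.gc_count(), 10_000);
    let (even, odd) = c.totals();
    assert_eq!(even + odd, 20_000);
}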
If we want to inspect individual phases (e.g. to find the number of work packets executed), we now have a better tool for that: the eBPF tracing tools.