
panic on underflowing subtraction during recording #26

Open · pobrn opened this issue Feb 15, 2023 · 7 comments

pobrn commented Feb 15, 2023

  • Version: 3b5f5e1 (built simply with cargo build)
  • Platform: Linux 6.1.11

Command that triggers the issue:

samply record gdb --version

gdb is from the Arch Linux repositories (version: 12.1-2).


thread '<unnamed>' panicked at 'attempt to subtract with overflow', samply/src/linux_shared/context_switch.rs:... (multiple places)

One particular panic:

Thread 2 "samply" hit Breakpoint 3, samply::linux_shared::context_switch::ContextSwitchHandler::handle_sample (self=0x7ffff7c8c2f8, timestamp=62712187706648, thread=0x7ffff0d96338)
    at samply/src/linux_shared/context_switch.rs:141
141	        let off_cpu_sample = match thread.state {
$605 = 62712187706648
$606 = samply::linux_shared::context_switch::ThreadContextSwitchData {
  state: samply::linux_shared::context_switch::ThreadState::On{last_observed_on_timestamp: 62712191834216},
  on_cpu_duration_since_last_sample: 0,
  off_cpu_duration_since_last_off_cpu_sample: 191576
}
thread '<unnamed>' panicked at 'attempt to subtract with overflow', samply/src/linux_shared/context_switch.rs:147:35

Note that the sample's timestamp ($605 = 62712187706648) is earlier than the thread's last_observed_on_timestamp (62712191834216), so subtracting the latter from the former underflows.

Instrumenting all ContextSwitchHandler::handle_{sample,context_in,context_out} calls gives the following:
call-log.txt

ishitatsuyuki (Contributor) commented

Note that samply is usually built and used in release mode, to avoid the debug-build slowdown that leads to missed samples. In release mode, integer overflow checks are replaced with silent wrapping arithmetic, and most of the overflows in samply end up being harmless.
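For illustration, here is a minimal sketch of that distinction (the helper duration_since is hypothetical, not samply's actual code at context_switch.rs:147): checked_sub makes the out-of-order case explicit in both build modes.

// Hypothetical sketch, not samply's actual code: how a subtraction like
// the one at context_switch.rs:147 could tolerate out-of-order timestamps.
fn duration_since(sample_timestamp: u64, last_observed_on_timestamp: u64) -> u64 {
    // checked_sub returns None instead of panicking (debug builds) or
    // silently wrapping (release builds) when the sample is older.
    sample_timestamp
        .checked_sub(last_observed_on_timestamp)
        .unwrap_or(0) // treat an out-of-order sample as zero elapsed time
}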

mstange (Owner) commented Feb 15, 2023

The unordered timestamps are probably due to the code around here, which gets records from the perf events of multiple CPUs and doesn't sort them:

self.event_buffer.extend(perf.iter());

It would be great if somebody could read up on what the guarantees are about the ordering of records from perf events from multiple CPUs. Is sorting just "the current batch" enough, or do new records from the current batch sometimes need to be ordered before records from the previous batch? I think the latter is the case, and I also needed to deal with it when reading records from a perf.data file. The code for the batch reordering is here: https://github.com/mstange/linux-perf-data/blob/f0c5b59781303eaf1c7e6aa523e711a4018f5c09/src/sorter.rs
However, if reordering across batches is necessary, it would be good to come up with a solution which doesn't require copying the pending unsorted records to the heap.
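For illustration, a minimal sketch of what per-batch sorting could look like (the types are hypothetical, not samply's actual record types); note that this cannot fix cross-batch misordering:

// Hypothetical sketch (not samply's actual types): sort the freshly
// extended batch by timestamp. This fixes ordering within one batch only;
// a later batch can still contain records that predate records from an
// earlier batch, which is the open cross-batch question above.
fn sort_batch(event_buffer: &mut Vec<(u64, Vec<u8>)>) {
    // Stable sort preserves the original order of records with equal timestamps.
    event_buffer.sort_by_key(|&(timestamp, _)| timestamp);
}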

ishitatsuyuki (Contributor) commented

It looks like the out-of-order delivery is also causing other trouble; for example, fork events can arrive before the corresponding execve event:

[samply/src/linux/profiler.rs:404] e = ForkOrExitRecord {
    pid: 1708112,
    ppid: 1708112,
    tid: 1708115,
    ptid: 1708112,
    timestamp: 38419594592932,
}
[samply/src/linux/profiler.rs:407] e = {
    "pid": 1708112,
    "tid": 1708112,
    "name": "fossilize-repla",
    "is_execve": true,
}

This messes up memory map tracking and causes symbolization to fail.

I'm not sure what to do about the reordering, though; it does seem to be a legitimately hard and annoying problem.

ishitatsuyuki (Contributor) commented Feb 22, 2023

> It would be great if somebody could read up on what the guarantees are about the ordering of records from perf events from multiple CPUs. Is sorting just "the current batch" enough, or do new records from the current batch sometimes need to be ordered before records from the previous batch? I think the latter is the case, and I also needed to deal with it when reading records from a perf.data file. The code for the batch reordering is here: https://github.com/mstange/linux-perf-data/blob/f0c5b59781303eaf1c7e6aa523e711a4018f5c09/src/sorter.rs

I think it comes down to something similar to the "rounds" logic in linux-perf-data. There's no hard guarantee from perf, but for causally related events (e.g. fork → exec) it's possible to reason under the assumption that events are written synchronously with respect to the syscall: when event B happens-after event A and we have just read event B from one of the buffers, acquire-release ordering guarantees that event A is already visible (or has already been seen) in one of the buffers as well.

Hence, if we read in the order of (CPU 0, Round x), (CPU 1, Round x), ..., (CPU N, Round x), (CPU 0, Round x+1), then we can commit (CPU 0, Round x) as soon as we have examined records up to (CPU N, Round x) for re-sorting. This is a slightly tighter bound compared to the "rounds" formulation.

This doesn't, however, guarantee anything about events that are not causally related. In that case, event timestamping can sometimes happen in a different order than the sample writes finish, and timestamps have other reasons to drift (TSC synchronization issues can cause a small amount of drift). But those cases should never lead to logic errors and can be safely ignored.
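For illustration, here is a minimal sketch of the commit bound above, realized as a per-stream watermark (all names are hypothetical; it also assumes every CPU keeps producing events, so an idle CPU would stall the watermark and would need extra handling in real code):

use std::cmp::Reverse;
use std::collections::BinaryHeap;

// Hypothetical sketch: buffer events in a min-heap and release them once
// every per-CPU stream has been examined past their timestamp.
struct Reorderer {
    // Min-heap of (timestamp, cpu index, raw event bytes) awaiting release.
    pending: BinaryHeap<Reverse<(u64, usize, Vec<u8>)>>,
    // Timestamp of the last record examined on each CPU's ring buffer.
    last_seen: Vec<u64>,
}

impl Reorderer {
    fn push(&mut self, cpu: usize, timestamp: u64, event: Vec<u8>) {
        self.last_seen[cpu] = timestamp;
        self.pending.push(Reverse((timestamp, cpu, event)));
    }

    // Anything at or below the slowest stream's last examined timestamp is
    // safe to commit: no causally earlier event can still be unread.
    fn drain_committed(&mut self) -> Vec<(u64, usize, Vec<u8>)> {
        let safe_until = self.last_seen.iter().copied().min().unwrap_or(0);
        let mut committed = Vec::new();
        while let Some(Reverse((timestamp, _, _))) = self.pending.peek() {
            if *timestamp > safe_until {
                break;
            }
            let Reverse(item) = self.pending.pop().unwrap();
            committed.push(item);
        }
        committed
    }
}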

Right now, a blocker for live event reordering is that we only read a limited number of events per EventIter call:

fn new(perf: &'a mut Perf) -> Self {
    let mut locations = Vec::with_capacity(32);
    {
        let state = Arc::get_mut(&mut perf.event_ref_state)
            .expect("Perf::iter called while the previous iterator hasn't finished processing");
        let state = state.get_mut();
        // At most 31 events are read per call, even if more are available.
        for _ in 0..31 {
            state.positions[locations.len()] = perf.position;
            let raw_event_location =
                match next_raw_event(perf.buffer, perf.size, &mut perf.position) {
                    Some(location) => location,
                    None => break,
                };
            locations.push(raw_event_location);
        }
        state.positions[locations.len()] = perf.position;
        state.done = !0;
    }

To exploit the causal relationship, all available events need to be pushed into the buffer at once, which will likely require some refactoring. The bitmask tracking is the tricky part, so if we at least make the reordering part consume events in order, it will likely simplify the tracking logic required for this.
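For illustration, the draining part of that refactor could be as small as this sketch (a hypothetical helper; next_event stands in for next_raw_event(perf.buffer, perf.size, &mut perf.position), and the positions/bitmask bookkeeping from the excerpt above is deliberately elided):

// Hypothetical sketch: drain every available event instead of stopping
// after 31, so the reorderer sees the whole batch at once.
fn drain_all<T>(mut next_event: impl FnMut() -> Option<T>) -> Vec<T> {
    let mut locations = Vec::new();
    while let Some(location) = next_event() {
        locations.push(location);
    }
    locations
}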

mstange (Owner) commented Feb 22, 2023

Thank you so much for investigating this. Please feel free to do whatever refactoring is needed; I am not attached to this code at all. I imported it from https://github.com/koute/not-perf/ just to get a basic working version of samply running, and it has already survived for longer than I expected.

ishitatsuyuki (Contributor) commented

I've made some preliminary changes in #29 and #30. I have a binary-heap-based sorter at ishitatsuyuki@664e1ae, but I then realized that the sorter from linux-perf-data is also more or less usable (except that it releases buffers at round granularity instead of stream granularity).

So first, there's the question of which sorter to use. The pro of the binary-heap sorter is that it's more granular; the con is that a binary heap falls short in absolute sorting performance (quicksort/pdqsort is extremely well optimized).
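For reference, a minimal sketch of the round-granularity alternative (hypothetical types): Rust's slice::sort_unstable family is pdqsort-based, which is the well-optimized path mentioned above.

// Hypothetical sketch: buffer one round's worth of events from all CPUs,
// then sort the whole round in one shot with pdqsort.
fn commit_round(mut round: Vec<(u64, Vec<u8>)>) -> Vec<(u64, Vec<u8>)> {
    round.sort_unstable_by_key(|&(timestamp, _)| timestamp);
    round
}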

Then there's the question of where the sorter should reside. Right now, the sorter in linux-perf-data is private, so if we want to use it in samply we'll need to make it public. Alternatively, we could put the sorter in samply, making it independent of linux-perf-data.

mstange (Owner) commented Feb 27, 2023

Minimizing the number of lost perf events is more important than minimizing the CPU overhead of the sampler. So if the more granular releasing of the binary-heap approach lets us advance the read position in data_tail earlier, allowing the kernel to write more records into the freed-up space, then I think it's preferable to the per-round sorting.
