stackcollapse-perf.pl ignores period of samples #165

hardc0der · 2018-03-20T05:44:11Z

It seems like stackcollapse-perf.pl ignores the period associated with each sample record so all recorded samples are given the same weight. This ends up in the generated flamegraph not being consistent with what perf report outputs for the overhead of the different call stacks. I usually sample by specifying a frequency and the hardware CPU cycles event with something like perf record -e cpu-cycles -F 997 and I can get pretty different output. I was wondering if you run into the same thing.

The text was updated successfully, but these errors were encountered:

guoshimin · 2020-09-16T18:06:17Z

Also ran into the same issue. @hardc0der did you somehow resolve it?

brendangregg · 2020-09-16T20:47:33Z

FWIW, practically all the sampling we do is using cpu-clock since cpu-cycles typically isn't available in cloud environments (perf warns: "The cycles event is not supported, trying to fall back to cpu-clock-ticks"), and with cpu-clock that field shows the frequency and can be ignored. If there's a problem with cpu-cycles then I may be unaware as I rarely get a chance to use it, so please let me know.

What do you think period means, anyway? E.g., I just tried on a bare metal system; what does this mean?:

# perf record -F 1 -a 
^C[ perf record: Woken up 1 times to write data ]
[ perf record: Captured and wrote 4.160 MB perf.data (41 samples) ]
# perf script
            bash 14404 [007] 2431612.075407: 2358327118 cycles:ppp:  ffffffff88dd479c security_inode_permission ([kernel.kallsyms])

(No, bash was not hot on-CPU.)

guoshimin · 2020-09-16T21:08:12Z

From https://perf.wiki.kernel.org/index.php/Tutorial#Event-based_sampling_overview:

The period is expressed as the number of occurrences of an event, not the number of timer ticks.

And later on the same page:

... the kernel is dynamically adjusting the sampling period to achieve the target average rate. The adjustment in period is reported in the raw profile data

I think it's the dynamically chosen sampling period, expressed in terms of number of events, for this sample.

In your example, the requested sampling frequency is 1 Hz, and the period turned out to be 2358327118 cycles. Does that match the clock frequency of your machine?

brendangregg · 2020-09-16T21:24:32Z

Sure, that roughly matches the clock frequency. But what do we do with this information? bash was not in this function for 1 second.

guoshimin · 2020-09-16T21:36:27Z

I take that to mean that this sample was taken after that many cycle events fired, not that the IP stayed there for that long.

brendangregg · 2020-09-16T21:37:19Z

Right, so what do we do with this information? What's the proposed change to stackcollapse-perf.pl?

guoshimin · 2020-09-16T21:40:12Z

I guess we should weight the samples by this amount. I'm going to check how perf report does it.

guoshimin · 2020-09-16T22:38:47Z

Yeah perf report weight the samples by period. The code is https://github.com/torvalds/linux/blob/5925fa68fe8244651b3f78a88c4af99190a88f0d/tools/perf/ui/hist.c#L36-L40

brendangregg · 2020-09-16T23:10:59Z

So in the same profile:

         swapper     0 [007] 2431604.697259:          1 cycles:ppp:  ffffffff88a6e448 native_write_msr ([kernel.kallsyms])

Currently it weights them equally since it ignores the period. But you're saying you'd weight the bash sample as 2358327118 and this sample as 1? I don't think bash consumed 2.3 billion times more CPU in security_inode_permission() than this sample in native_write_msr(). For this example that seems even more wrong than just ignoring the period.

If all the samples were roughly 2358327118, then I could understand that this is a problem of sampling granularity (since I choose -F 1). But that's not the case. I get periods from 1 through to 6281124533.

Usually perf is a great reference of doing things right, but in this case I don't understand it.

brendangregg · 2020-09-16T23:28:12Z

I'm looking at other profiles where if you sum the period, it shows more than >10x the real CPU time in the process (as shown by other tools).

I choose the -F 1 to magnify sampling and period issues. If someone can 100% explain the periods in an -F 1 profile, including why they vary from 1 to 1B+, and why their sum greatly exceeds measured CPU, then we should be in a position to fix stackcollapse-perf.pl if needed (and the fix may just be to use the period as a weight). I'm nervous about fixing something I don't completely understand.

guoshimin · 2020-09-17T20:27:03Z

Yeah I'm getting astronomical numbers that clearly don't make sense in the period field as well when I'm using the cycles event. I don't know if it's a kernel issue (I'm on 4.4.0), a bug in the perf tool, or a hardware issue.

I still think the period field is meant to give weights to samples, but it seems the field is currently not reliable.

I think for me the solution is to use cpu-clock for now.

brendangregg · 2020-09-21T21:24:21Z

It strikes me if I ignore the period=1 samples, then perhaps the other periods makes sense. Perhaps the period=1 events are due to some odd kernel code. And by weighting on period, they are effectively ignored anyway.

I'm still thinking about the consequences. Given mixed periods, it means the flame graph will have mixed resolution. You may be able to zoom into some towers and see detail, but other towers and not see detail where the frames are elided (and wrongly assume the CPU time is in the parent frame). I don't know of a good solution though, other than documenting it (or artificially ignoring extra resolution).

guoshimin · 2020-09-21T21:41:05Z

Yeah I think that's fair. Maybe recommend setting -c instead of -F when using cycles to get constant periods.

Knio · 2020-09-21T23:15:43Z

I may be totally wrong here, but just a note:

On any Intel or AMD processor, the cycle event does not count when the processor is idle, i.e., when it calls mwait(). 1

The kernel's automatic frequency adjustment seems to be designed for counters that mostly tick at a constant rate, and I think the behavior of cycles stopping entirely when the CPU is sleeping results in abnormally low periods right after a wakeup when the event tries to catch up, and then high periods once on-CPU. 1 is also the initial value for frequency-based counters.

brendangregg · 2020-11-19T02:27:45Z

@Knio thanks for the kernel code link: that sample_period = 1; sure sounds like the source of these period=1 events.

I think we understand this enough to justify the change, and pay attention to the period as @hardc0der suggested. Someone want to do the PR? (plus I have other PRs to catch up on, now I've finished another book.)

Summary: Disclaimer - I am not an expert on Perl, perf or FlameGraph. This is a hack for brendangregg#165 By hack I mean that it works for me but I didn't update all of the test files. The problem is that stackcollapse assumes that all stack traces have an equal weight but they don't based on based on current perf output. For example, there can be lines like this where 90689, 32298287 and 1 are the weights but stackcollapse assumes they all have weight 1, and by weight I mean the equivalent of the number of times this stack trace happens. Thus, the relative frequency of xpl_accept-3 is 1 / (90689 + 32289287 + 1). xpl_accept-1 2777710 12891973.429160: 90689 cycles: xpl_accept-2 2777710 12891975.431112: 32289287 cycles: xpl_accept-3 2777710 12891975.431112: 1 cycles: This is based on brendangregg#250 but I am not sure that PR is correct, it is also ~20 diffs behind current FlameGraph. Test Plan: works for me Reviewers: Subscribers: Tasks: Tags:

mdcallag · 2022-03-31T14:54:34Z

Are you open to another PR to fix this? Because I get extremely misleading flamegraphs today with:

perf record -g -p $pid -- sleep $perf_secs
perf script > o.scr
cat o.scr | ./stackcollapse-perf.pl > o.fold
./flamegraph.pl o.fold > o.svg

I am not sure the PR is correct and it is behind latest by ~20 diffs. My hack is here. It works for me and is current with latest.

kgibm · 2022-11-21T19:32:54Z

I found this issue because the SVG produced by flamegraph.pl reported astronomical "sample" counts for a short perf run:

Digging into this was very confusing, so I thought it would be useful to others to summarize. Please correct me if I'm wrong:

perf record tries to use -e cpu-cycles by default if it's available (as it was in my case) and if -e isn't otherwise specified.
If -e cpu-cycles is unavailable, as is common with many environments, then perf record will fall back to -e cpu-clock. This was the original use case stackcollapse-perf.pl was built for.
-e cpu-clock measures the passage of time.
-e cpu-cycles measures the number of CPU cycles executed.
When specifying sampling frequency (-F), as is often desired for flame graphs, since -e cpu-cycles is not directly time-based, the Linux kernel tries to constantly adjust the period of CPU cycles at which it samples to try to match the requested frequency:

the kernel is dynamically adjusting the sampling period to achieve the target average rate. The adjustment in period is reported in the raw profile data
stackcollapse-perf.pl was originally designed for -e cpu-clock, thus it treats every stack sample equally. In perf script, there is a value to the left of the cpu-clock event which is the clock frequency, and this was ignored.
If -e cpu-cycles is used, then, in perf script, there is a value to the left of the cpu-cycles event which is the last period (in CPU cycles) that the kernel used for that event.
If -e cpu-cycles is used, then, perf report "weights" the samples using this period. I still don't really understand what this means.
Before pull request take period into account when collapsing stacks #250, stackcollapse-perf.pl would simply count each cpu-cycles event as it did with -e cpu-clock, treating each event as a single sample.
With -e cpu-cycles, this seems to possibly over-count certain samples because the kernel may record many events within a short period of time (less than -F) with a period of 1, although I still don't understand under what conditions it does this. It might also not take into account other side effects of the period scaling, although I don't understand what this means.
Pull request take period into account when collapsing stacks #250 changed stackcollapse-perf.pl to "weight" the samples based on the value to the left of the event. Therefore, for both -e cpu-clock and -e cpu-cycles, it is expected to now get astronomical "sample" counts.
flamegraph.pl has an option --countname that defaults to "samples". This seems to now be incorrect, because the intuitive understanding of "samples" is a count of event samples, rather than sum(clock frequency|cycle periods).
If a user wants something more intuitive, they can either:
1. Explicitly use perf record -e cpu-clock
2. Use -c instead of -F with -e cpu-cycles, although it's not totally clear to me what exactly this means because although this is more clear on the period at which perf records, this counter only increments on CPU cycles, so it's still not wallclock-based.

I don't have an intuitive understanding of what it means to sample based on CPU cycles executed:

What does it mean that the kernel creates many events nearly at the same time with period=1?
Under what conditions is this data more accurate than sampling based on wallclock time?

If the answer to the last question is "unclear," then perhaps #250 should be reverted (given the confusing nature of the "samples" count and its interpretation), and users should just be encouraged, for this use case of FlameGraphs, to always explicitly specify perf record -e cpu-clock?

kgibm · 2022-11-23T14:30:54Z

If a user wants something more intuitive, they can either:

Explicitly use perf record -e cpu-clock

Not so simple as cpu-clock isn't always available:

Info 11/22/2022 15:25:32   cpu-clock: PMU Hardware doesn't support sampling/overflow-interrupts. Try 'perf stat'

Might be related to using --call-graph lbr.

Seems to have be because of this bug: brendangregg/FlameGraph#165. One workaround there was to record data with `-e cpu-clock`. I don't know what that means but that is what this commit does.

Knio mentioned this issue Sep 21, 2020

Does imprecise sample frequency affect correctness of flame graph analysis? #209

Open

guoshimin mentioned this issue Mar 11, 2021

take period into account when collapsing stacks #250

Merged

mhansen mentioned this issue Oct 8, 2021

stackcollapse-perf.pl: fails to read Simpleperf data because regex can't handle space inside "([JIT app cache])" #262

Open

vipwangerxiao mentioned this issue Sep 9, 2022

stackcollapse-perf.pl: Use event period to count folded stack #289

Closed

brendangregg closed this as completed in #250 Sep 18, 2022

kgibm mentioned this issue Nov 29, 2022

Add --counts option to ignore periods and just count stacks #297

Open

jg-linetco mentioned this issue Jan 5, 2024

CPU flamegraphs using perf show wrong sample counts #326

Open

This was referenced Jun 25, 2024

graph diffs from perl flamegraph jonhoo/inferno#318

Closed

take period into account when collapsing stacks jonhoo/inferno#319

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

stackcollapse-perf.pl ignores period of samples #165

stackcollapse-perf.pl ignores period of samples #165

hardc0der commented Mar 20, 2018

guoshimin commented Sep 16, 2020

brendangregg commented Sep 16, 2020

guoshimin commented Sep 16, 2020

brendangregg commented Sep 16, 2020

guoshimin commented Sep 16, 2020

brendangregg commented Sep 16, 2020

guoshimin commented Sep 16, 2020

guoshimin commented Sep 16, 2020

brendangregg commented Sep 16, 2020

brendangregg commented Sep 16, 2020

guoshimin commented Sep 17, 2020

brendangregg commented Sep 21, 2020

guoshimin commented Sep 21, 2020

Knio commented Sep 21, 2020

brendangregg commented Nov 19, 2020

mdcallag commented Mar 31, 2022

kgibm commented Nov 21, 2022

kgibm commented Nov 23, 2022

stackcollapse-perf.pl ignores period of samples #165

stackcollapse-perf.pl ignores period of samples #165

Comments

hardc0der commented Mar 20, 2018

guoshimin commented Sep 16, 2020

brendangregg commented Sep 16, 2020

guoshimin commented Sep 16, 2020

brendangregg commented Sep 16, 2020

guoshimin commented Sep 16, 2020

brendangregg commented Sep 16, 2020

guoshimin commented Sep 16, 2020

guoshimin commented Sep 16, 2020

brendangregg commented Sep 16, 2020

brendangregg commented Sep 16, 2020

guoshimin commented Sep 17, 2020

brendangregg commented Sep 21, 2020

guoshimin commented Sep 21, 2020

Knio commented Sep 21, 2020

brendangregg commented Nov 19, 2020

mdcallag commented Mar 31, 2022

kgibm commented Nov 21, 2022

kgibm commented Nov 23, 2022