off-cpu: Use a probability value for the threshold #460
christos68k merged 8 commits into open-telemetry:main
@open-telemetry/ebpf-profiler-approvers Can I get feedback please?
// configured off-cpu threshold.
// To not lose too many scheduling events but also not oversize sched_times,
// calculate a size based on an assumed upper bound of scheduler events per
// second (1000hz) multiplied by an average time a task remains off CPU (3s),
[..] an average time a task remains off CPU (3s) - is there some evidence for this number? From a local workload I see significantly different (lower) values.
I didn't change the text, just moved it. Maybe @christos68k can give some background on how exactly he measured this number.
Otherwise, what is your preferred number here @florianl?
I went for relaxed rather than tight sizing here.
	// Guarantee a minimal size of 16.
	return 16
}
if size > 4096 {
With a given probability value of 1.0 I would expect every scheduling event to show up. With this change, this is not possible, as the size of sched_times becomes the limiting factor.
The behavior for 1.0 is the same as for 1000 before.
Before: adaption["sched_times"] = (4096 * cfg.OffCPUThreshold) / support.OffCPUThresholdMax results in 4096 when all events are sampled (threshold = 1000).
This PR: The user enters 1.0, which maps to threshold = math.MaxUint32, so the result is also 4096.
florianl left a comment
Use a probability value of [0..1] for the -off-cpu-threshold as suggested in #458.
This might not be correct, as #458 was closed with this comment:
Ok, that's my fault, I was testing a patch to increase OffCPUThresholdMax to 1 billion, and that was causing the problem. Thanks for help, closing the issue.
The issue description included "and BTW it would be better to have more granularity than per-mille for off cpu profiling which is giving me thousands of samples per sec". This was the reason why the OP (desperately!?) tried to patch the code. With this PR, nobody has to patch the code, and users can go lower than 1 out of a thousand samples.
e8d7ece to edaf5d9
Co-authored-by: Florian Lehner <florian.lehner@elastic.co>
PoC for using a probability value for the off-cpu threshold.
Use a probability value of [0..1] for the -off-cpu-threshold as suggested in #458.
Fixes #459