Fix excessive memory allocation for sched_times (#320) #389

Merged
florianl merged 1 commit into main from ck/offcpu/size on Mar 19, 2025
Conversation

@christos68k (Member) commented Mar 11, 2025

Summary

On my 16-core local system (a VM with possibleCPUs: 128 and onlineCPUs: 16), the previous logic allocated more than 5 GB of kernel memory in the worst case (off-cpu-threshold set to 1000), and the agent took ~20 seconds to start due to swapping.

This PR reduces the worst case to 5 MB.

Fixes #320.
Also see #320 (comment).

@christos68k requested review from a team as code owners March 11, 2025 13:16
@christos68k self-assigned this Mar 11, 2025
@christos68k added the bug label Mar 11, 2025
@christos68k requested a review from florianl March 11, 2025 13:22
@florianl enabled auto-merge (squash) March 18, 2025 15:53
Comment thread tracer/tracer.go
Comment on lines +553 to +555
// second (1000hz) multiplied by an average time a task remains off CPU (3s),
// scaled by the probability of capturing a trace.
adaption["sched_times"] = (4096 * cfg.OffCPUThreshold) / support.OffCPUThresholdMax
Contributor

Not a blocker, but it is not obvious to me how 4096 reflects the mentioned "average time a task remains off CPU (3s)". How exactly was the number 4096 determined, and is it the best value or does it need further refinement?

Member Author

pow2(1000*3), i.e. 1000 * 3 = 3000 rounded up to the next power of two, which is 4096.
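
For readers following the thread, the arithmetic behind the reply and the sizing line under review can be sketched in Go. This is only an illustration, not the code in tracer/tracer.go: nextPow2 and schedTimesEntries are hypothetical helper names, and offCPUThresholdMax is a stand-in for support.OffCPUThresholdMax whose value (1000) is assumed from the worst case mentioned in the summary.

```go
package main

import (
	"fmt"
	"math/bits"
)

// offCPUThresholdMax is a hypothetical stand-in for support.OffCPUThresholdMax.
// The value 1000 is an assumption based on the worst case named in the summary.
const offCPUThresholdMax = 1000

// nextPow2 rounds n up to the nearest power of two (hypothetical helper).
func nextPow2(n uint32) uint32 {
	if n <= 1 {
		return 1
	}
	return uint32(1) << bits.Len32(n-1)
}

// schedTimesEntries mirrors the shape of the reviewed expression:
// (4096 * threshold) / max, where 4096 comes from nextPow2(1000 * 3).
func schedTimesEntries(threshold uint32) uint32 {
	base := nextPow2(1000 * 3) // 1000 events/s * 3 s off CPU, rounded up
	return base * threshold / offCPUThresholdMax
}

func main() {
	fmt.Println("base:", nextPow2(1000*3))
	for _, t := range []uint32{1, 100, 1000} {
		fmt.Printf("threshold=%d entries=%d\n", t, schedTimesEntries(t))
	}
}
```

Under these assumptions the map size grows linearly with the threshold and tops out at 4096 entries when the threshold equals its maximum.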

Development

Successfully merging this pull request may close these issues.

Reconsider ebpf map size for off-cpu profiling

4 participants