kallsyms: update bpf addresses without full /proc/kallsyms reload#1198
Conversation
0f92ead to
ce31b1e
Compare
ce31b1e to
bbd29c6
Compare
|
Could you refactor the bpf symbolizer to be a separate package? It has nothing to do with kallsyms, and I am really hoping the kallsyms package does not get entangled with any bpf machinery. So best is the bpf symbols stuff is separate package and the tracer uses it in parallel with kallsyms package. |
What's the rough outline of how you see this working? Currently bpf code depends on |
2d0cfda to
5256651
Compare
|
I updated the code to separate it a bit from kallsyms, but it's still in the same package. It now also addresses #1199 for bpf symbols. Production testing shows a nice drop in CPU usage (red line machine has the new code):
Flamegraph comparison shows kallsyms parsing going poof (it is also a lot smoother):
|
5256651 to
09941d0
Compare
| continue | ||
| } | ||
|
|
||
| switch ksymbol := record.(type) { |
There was a problem hiding this comment.
As there is a <-ctx.Done() case in every case statement, should we have this check maybe before switch ksymbol ... instead?
There was a problem hiding this comment.
But they it will not be in the same select, so it wouldn't be able to break out of a blocked send.
Maybe I misunderstand what you're suggesting.
So the perf events are superior because they return the JITted program code length. This allows you to create mapping of [start-address,stop-address] for each symbol. The length of a symbol is not directly available from Perhaps, the baseline can be established with:
and then inspecting the program info data. I believe The If both the start/end is collected in baseline and from the perf symbol updates, you can just create independent symbolizer and accurately match the symbols. Also, since the Would this sound feasible approach to you? |
Establishing the baseline as you suggest would be a lot more expensive than just going through In practice, on modern kernels bpf symbols are in a contiguous block, but there's no guarantee that it will stay that way.
It's a one time thing, so I think it's fine to read and skip bpf rather than just stop. I don't think there's any promise that no non-bpf symbols will appear after bpf.
I would probably move that effort in a follow-up PR, unless you feel strongly about it. It would be good to address the existing slowness and #1199 here first. |
09941d0 to
11b0397
Compare
711b29e to
422fdb9
Compare
bobrik
left a comment
There was a problem hiding this comment.
I pushed a bunch of commits to resolve issues and added an integration test.
From my end it should be ready to go. I'm not really sure what to do with codeql complaints.
I can squash into one commit once it's approved.
| continue | ||
| } | ||
|
|
||
| switch ksymbol := record.(type) { |
There was a problem hiding this comment.
But they it will not be in the same select, so it wouldn't be able to break out of a blocked send.
Maybe I misunderstand what you're suggesting.
|
Linux v5.4 is giving me a hard time again |
More expensive in what sense? The problem you are solving that reading Yes, its a bit more code and syscalls. But I think it will be much more efficient in CPU usage. Also, you get the JIT code length data which helps a lot to not incorrectly symbolize random addresses as some bpf symbol.
Right. Which is another reason why collecting and matching with bpf code length will help.
Kernel code guarantees that the actual kernel and module symbols come first. After
Fair enough. Lets not mix that in at this time.
I would really like to not introduce something we want to change again. This applies mostly the initial synchronization.
Fixing #1199 could be a separate more self contained PR. Also something needs fixing since tests are failing. Are you able to determine and fix the issue? |
422fdb9 to
2a6279f
Compare
|
I updated the code to iterate bpf programs instead of parsing kallsyms for the initial pass. The tests only fail on v5.4. I'm not sure if it's worth worrying about if we're dropping it in #1178. |
|
This PR was marked stale due to lack of activity. It will be closed in 14 days. |
florianl
left a comment
There was a problem hiding this comment.
Thanks for the reminder!
Reading and testing the code again, i'm in favor of this approach as it simplifies things and uses the perf subsystem functionality without the overhead of getting triggered in eBPF space.
| case *perf.LostRecord: | ||
| // nil as a sentinel value to indicate lost events | ||
| select { | ||
| case s.records <- nil: |
There was a problem hiding this comment.
Can we handle lost records and KSymbolRecord separatly? With just reporting nil, we loose the information on how many events were actually lost from LostRecord.Lost.
There was a problem hiding this comment.
What exactly do you have in mind here? A separate struct with the number of lost events as a member or something else?
I'm not sure how useful it is to know how many events were lost. I can see the case for logging the number, but we can do it right here.
There was a problem hiding this comment.
My primary concern is the potential data loss regarding the number of dropped events when sending nil over the channel. It seems counterintuitive to signal an occurrence while losing the specific data associated with it.
I suggest we consider one of the following alternatives:
- Implement a metric to track these lost events, comparable to the
lostEventsCountused in startPerfEventMonitor() - Simply log the count locally and avoid sending any signal over the channel entirely.
There was a problem hiding this comment.
We send nil to trigger full re-scan to avoid data loss. Not sending nil would mean data loss.
There was a problem hiding this comment.
Sending nil to trigger a full re-scan is fine, but I think this should be documented better. So far we only have nil as a sentinel value to indicate lost events. And the information, on how many events are lost, is still also lost.
I'm thinking about asking for a dedicated channel to trigger a full re-scan. This could help separating both cases in a better way.
There was a problem hiding this comment.
I expanded the comment to make it clearer.
I deliberately avoided having two streams for updates as it makes it harder to reason about correctness. With one stream that can be re-synchronized between updates it's much clearer, as it cannot race with another stream.
9e2e2a5 to
594a0f7
Compare
|
I rebased and squashed the commits. CI is seeing weird issues. I've seen this on v6.8 and v6.12 on different runs: The latest re-run does not have it and it feels unrealted to the changes here. |
e15a602 to
6f347d6
Compare
fabled
left a comment
There was a problem hiding this comment.
lgtm! thanks!
Just one question about potential GC pressure point. But approving at this point.
|
|
||
| // Insert the new symbol into the right position to maintain sorting. | ||
| newSym := bpfSymbol{address: addr, size: size, name: name} | ||
| newSymbols := make([]bpfSymbol, len(oldSymbols)+1) |
There was a problem hiding this comment.
This (and the similar line in next function) always creates a new slice for the bpf symbol table for every individual bpf symbol change. I suspect the size of this slice can be fairly large.
I'm wondering how much this causes GC pressure in your system with large bpf program volatility.
Would it make sense to swap between two buffers and reallocate only if a larger capacity is needed? And when increading the size do it in larger increments than +1. Perhaps even use sync.Pool to store the other buffer?
There was a problem hiding this comment.
You need to squint really hard to even notice this stuff (hovering one of the greyed out columns):
For comparison, symbolizeKernelFrames is 60x more expensive.
If you zoom in, most of the time is spent moving things around, not allocating:
It doesn't seem to be worth worrying about this bit too much, there are bigger candidates that I intend to look into once this lands.
BPF programs come and go much more frequently than modules and doing a full re-parsing of `/proc/kallsyms` is very expensive, comparatively speaking. Here we subscribe to updates for both additions and removals of bpf symbols through `PERF_RECORD_KSYMBOL` mechanism of perf events. Instead of triggering full parsing, we update the pre-existing mapping for bpf pseudo-module whenever possible.
|
CI is at it again: #1198 (comment). Here's one failed build: |
fe4b243 to
a16794a
Compare
|
I added some error checking code and the error disappeared. |
florianl
left a comment
There was a problem hiding this comment.
Thanks for the update and sorry for the delay, as I was off-desk last week.


BPF programs come and go much more frequently than modules and doing a full re-parsing of
/proc/kallsymsis very expensive, comparatively speaking. Here we subscribe to updates for both additions and removals of bpf symbols throughPERF_RECORD_KSYMBOLmechanism of perf events. Instead of triggering full parsing, we update the pre-existing mapping for bpf pseudo-module whenever possible.See: #1151.