kallsyms: update bpf addresses without full /proc/kallsyms reload #1198

Merged
fabled merged 2 commits into open-telemetry:main from bobrik:ivan/bpf-updates
May 11, 2026

Conversation

@bobrik
Contributor

@bobrik bobrik commented Feb 21, 2026

BPF programs come and go much more frequently than modules, and fully re-parsing /proc/kallsyms is comparatively very expensive. Here we subscribe to updates for both additions and removals of bpf symbols through the PERF_RECORD_KSYMBOL mechanism of perf events. Instead of triggering a full parse, we update the pre-existing mapping for the bpf pseudo-module whenever possible.
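
As a rough illustration of the idea (not the code in this PR), the sketch below applies register and unregister events to an in-memory table instead of re-reading /proc/kallsyms. The `ksymbolEvent` and `bpfSymbols` types are hypothetical simplifications of what a PERF_RECORD_KSYMBOL record carries (address, length, name, and an unregister flag).

```go
package main

import "fmt"

// ksymbolEvent is a hypothetical, simplified view of a PERF_RECORD_KSYMBOL
// record: address, length and name of the symbol, plus a flag that says
// whether the symbol was unregistered.
type ksymbolEvent struct {
	addr       uint64
	length     uint32
	name       string
	unregister bool
}

// bpfSymbols is a hypothetical table for the bpf pseudo-module, keyed by
// the start address of each symbol.
type bpfSymbols map[uint64]ksymbolEvent

// apply updates the table in place; no /proc/kallsyms parsing is involved.
func (t bpfSymbols) apply(ev ksymbolEvent) {
	if ev.unregister {
		delete(t, ev.addr)
		return
	}
	t[ev.addr] = ev
}

func main() {
	table := bpfSymbols{}
	table.apply(ksymbolEvent{addr: 0xffffffffc0001000, length: 64, name: "bpf_prog_a"})
	table.apply(ksymbolEvent{addr: 0xffffffffc0002000, length: 128, name: "bpf_prog_b"})
	table.apply(ksymbolEvent{addr: 0xffffffffc0001000, unregister: true})

	fmt.Println(len(table), table[0xffffffffc0002000].name) // prints: 1 bpf_prog_b
}
```

Each event costs a single map operation, whereas a full reload has to read and parse every line of /proc/kallsyms.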

See: #1151.

@bobrik bobrik requested review from a team as code owners February 21, 2026 06:44
Comment thread go.mod Outdated
@fabled
Contributor

fabled commented Feb 21, 2026

Could you refactor the bpf symbolizer to be a separate package?

It has nothing to do with kallsyms, and I am really hoping the kallsyms package does not get entangled with any bpf machinery.

Ideally the bpf symbol handling lives in a separate package, and the tracer uses it in parallel with the kallsyms package.

Comment thread support/ebpf/kallsyms.ebpf.c Outdated
@bobrik
Contributor Author

bobrik commented Feb 24, 2026

> Could you refactor the bpf symbolizer to be a separate package?

What's the rough outline of how you see this working? Currently the bpf code depends on /proc/kallsyms to provide the baseline mapping, which is then updated in place by perf events.

@bobrik bobrik force-pushed the ivan/bpf-updates branch 3 times, most recently from 2d0cfda to 5256651 Compare February 24, 2026 22:37
Comment thread kallsyms/bpf.go Fixed
Comment thread kallsyms/bpf.go Fixed
Comment thread kallsyms/kallsyms.go Dismissed
@bobrik
Contributor Author

bobrik commented Feb 24, 2026

I updated the code to separate it a bit from kallsyms, but it's still in the same package. It now also addresses #1199 for bpf symbols.

Production testing shows a nice drop in CPU usage (red line machine has the new code):

image

Flamegraph comparison shows kallsyms parsing going poof (it is also a lot smoother):

image

Comment thread kallsyms/bpf.go Fixed
Comment thread tracer/tracer.go Outdated
Comment thread kallsyms/kallsyms_test.go
Comment thread kallsyms/bpf.go
Comment thread kallsyms/bpf.go Outdated
Comment thread kallsyms/bpf.go
Comment thread kallsyms/bpf.go
continue
}

switch ksymbol := record.(type) {
Member

As there is a <-ctx.Done() case in every case statement, should we have this check maybe before switch ksymbol ... instead?

Contributor Author

But then it will not be in the same select, so it wouldn't be able to break out of a blocked send.

Maybe I misunderstand what you're suggesting.

@fabled
Contributor

fabled commented Feb 25, 2026

> > Could you refactor the bpf symbolizer to be a separate package?

> What's the rough outline of how you see this working? Currently bpf code depends on /proc/kallsyms to provide the baseline mapping, which is updated by perf events in place.

So the perf events are superior because they return the JITted program code length. This allows you to create a mapping of [start-address,stop-address] for each symbol. The length of a symbol is not directly available from kallsyms, so if possible it should not be used as the baseline.

Perhaps, the baseline can be established with:

  • bpf(BPF_PROG_GET_NEXT_ID, ...)
  • fd = bpf(BPF_PROG_GET_FD_BY_ID, ...)
  • bpf(BPF_OBJ_GET_INFO_BY_FD, ...)

and then inspecting the program info data. I believe jited_ksyms in the info struct contains the kernel address and jited_func_lens the corresponding length.

The kallsyms could just completely ignore bpf, and in fact stop reading the kallsyms when bpf is seen (as you report it being really slow).

If both start and end are collected in the baseline and from the perf symbol updates, you can create an independent symbolizer and accurately match the symbols.

Also, since the tracer in startPerfEventMonitor already opens the event channel for all CPUs, could those same event pipes be used to get the symbol updates? This would avoid the resource overhead of opening a separate set. It would mean the bpf symbolizer needs internal methods that use the bpf syscall to establish the baseline when needed, and then relies on events being delivered via a method called by the tracer package.

Does this sound like a feasible approach to you?
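
A minimal sketch of why the code length matters for matching, under stated assumptions: the field names mirror the `bpfSymbol` literal visible in a later review snippet, while the `lookup` helper and the sample addresses are hypothetical. With both start and size known, an address that falls into a gap between programs is rejected instead of being attributed to the preceding symbol.

```go
package main

import (
	"fmt"
	"sort"
)

type bpfSymbol struct {
	address uint64
	size    uint64
	name    string
}

// lookup returns the symbol whose [address, address+size) range contains pc.
// syms must be sorted by address.
func lookup(syms []bpfSymbol, pc uint64) (string, bool) {
	// Index of the first symbol that starts after pc.
	i := sort.Search(len(syms), func(i int) bool { return syms[i].address > pc })
	if i == 0 {
		return "", false // pc is below the first symbol
	}
	s := syms[i-1]
	if pc-s.address >= s.size {
		return "", false // pc falls into a gap after s
	}
	return s.name, true
}

func main() {
	syms := []bpfSymbol{
		{address: 0x1000, size: 0x40, name: "bpf_prog_a"},
		{address: 0x2000, size: 0x80, name: "bpf_prog_b"},
	}
	for _, pc := range []uint64{0x1010, 0x1800, 0x2080} {
		name, ok := lookup(syms, pc)
		fmt.Printf("%#x -> %q %v\n", pc, name, ok)
	}
}
```

Without the size, a start-address-only table would have to attribute 0x1800 to `bpf_prog_a`, which is exactly the incorrect symbolization that length information avoids.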

@bobrik
Contributor Author

bobrik commented Feb 28, 2026

> So the perf events are superior because they return the JITted program code length. This allows you to create a mapping of [start-address,stop-address] for each symbol.

Establishing the baseline as you suggest would be a lot more expensive than just going through /proc/kallsyms. It is a one-time setup, so maybe that's not a huge problem.

In practice, on modern kernels bpf symbols are in a contiguous block, but there's no guarantee that it will stay that way.

> The kallsyms could just completely ignore bpf, and in fact stop reading the kallsyms when bpf is seen (as you report it being really slow).

It's a one-time thing, so I think it's fine to read and skip bpf rather than just stop. I don't think there's any promise that no non-bpf symbols will appear after bpf.

> Also, since the tracer in startPerfEventMonitor already opens the event channel for all CPUs, could those same event pipes be used to get the symbol updates?

probabilisticProfile disables these events and I don't think doing full re-initialization is a good tradeoff vs having separate events for bpf that are constantly open, especially if we make initialization more expensive.

> Does this sound like a feasible approach to you?

I would probably move that effort to a follow-up PR, unless you feel strongly about it.

It would be good to address the existing slowness and #1199 here first.

Comment thread kallsyms/bpf.go Outdated
Comment thread kallsyms/bpf.go Fixed
@bobrik bobrik force-pushed the ivan/bpf-updates branch 2 times, most recently from 711b29e to 422fdb9 Compare March 19, 2026 00:26
Contributor Author

@bobrik bobrik left a comment

I pushed a bunch of commits to resolve issues and added an integration test.

From my end it should be ready to go. I'm not really sure what to do with codeql complaints.

I can squash into one commit once it's approved.

Comment thread kallsyms/bpf.go
Comment thread kallsyms/bpf.go
Comment thread kallsyms/kallsyms_test.go
Comment thread kallsyms/bpf.go
@bobrik
Contributor Author

bobrik commented Mar 19, 2026

Linux v5.4 is giving me a hard time again ☹️

@fabled
Contributor

fabled commented Mar 19, 2026

> > So the perf events are superior because they return the JITted program code length. This allows you to create a mapping of [start-address,stop-address] for each symbol.

> Establishing the baseline as you suggest would be a lot more expensive than just going through /proc/kallsyms. It is a one-time setup, so maybe that's not a huge problem.

More expensive in what sense? The problem you are solving is that reading kallsyms is really slow. And now you argue it's better to read it instead of using a dedicated fast API?

Yes, it's a bit more code and syscalls. But I think it will be much more efficient in CPU usage. Also, you get the JIT code length data, which helps a lot to avoid incorrectly symbolizing random addresses as some bpf symbol.

> In practice, on modern kernels bpf symbols are in a contiguous block, but there's no guarantee that it will stay that way.

Right. Which is another reason why collecting and matching with the bpf code length will help.

> > The kallsyms could just completely ignore bpf, and in fact stop reading the kallsyms when bpf is seen (as you report it being really slow).

> It's a one-time thing, so I think it's fine to read and skip bpf rather than just stop. I don't think there's any promise that no non-bpf symbols will appear after bpf.

Kernel code guarantees that the actual kernel and module symbols come first. After bpf, __builtin__kprobes symbols might still come, though I am not sure those can currently be handled in any sensible way. I'd probably just ignore those at this time.

> > Also, since the tracer in startPerfEventMonitor already opens the event channel for all CPUs, could those same event pipes be used to get the symbol updates?

> probabilisticProfile disables these events and I don't think doing full re-initialization is a good tradeoff vs having separate events for bpf that are constantly open, especially if we make initialization more expensive.

Fair enough. Let's not mix that in at this time.

> > Does this sound like a feasible approach to you?

> I would probably move that effort to a follow-up PR, unless you feel strongly about it.

I would really like to not introduce something we want to change again. This applies mostly to the initial synchronization.

> It would be good to address the existing slowness and #1199 here first.

Fixing #1199 could be a separate, more self-contained PR.

Also, something needs fixing since tests are failing. Are you able to determine and fix the issue?

@bobrik bobrik force-pushed the ivan/bpf-updates branch from 422fdb9 to 2a6279f Compare March 20, 2026 06:21
@bobrik
Contributor Author

bobrik commented Mar 20, 2026

I updated the code to iterate bpf programs instead of parsing kallsyms for the initial pass.

The tests only fail on v5.4. I'm not sure if it's worth worrying about if we're dropping it in #1178.

@christos68k christos68k mentioned this pull request Apr 14, 2026
@github-actions

This PR was marked stale due to lack of activity. It will be closed in 14 days.

@github-actions github-actions Bot added the Stale label Apr 22, 2026
@bobrik
Contributor Author

bobrik commented Apr 27, 2026

@fabled, @florianl, could you have another look?

@github-actions github-actions Bot removed the Stale label Apr 27, 2026
Member

@florianl florianl left a comment

Thanks for the reminder!
Reading and testing the code again, I'm in favor of this approach, as it simplifies things and uses the perf subsystem functionality without the overhead of getting triggered in eBPF space.

Comment thread kallsyms/bpf.go
Comment thread kallsyms/bpf.go
case *perf.LostRecord:
// nil as a sentinel value to indicate lost events
select {
case s.records <- nil:
Member

Can we handle lost records and KSymbolRecord separately? By just reporting nil, we lose the information on how many events were actually lost from LostRecord.Lost.

Contributor Author

What exactly do you have in mind here? A separate struct with the number of lost events as a member or something else?

I'm not sure how useful it is to know how many events were lost. I can see the case for logging the number, but we can do it right here.

Member

My primary concern is the potential data loss regarding the number of dropped events when sending nil over the channel. It seems counterintuitive to signal an occurrence while losing the specific data associated with it.

I suggest we consider one of the following alternatives:

  • Implement a metric to track these lost events, comparable to the lostEventsCount used in startPerfEventMonitor()
  • Simply log the count locally and avoid sending any signal over the channel entirely.

Contributor Author

We send nil to trigger a full re-scan to avoid data loss. Not sending nil would mean data loss.

Member

Sending nil to trigger a full re-scan is fine, but I think this should be documented better. So far we only have "nil as a sentinel value to indicate lost events". And the information on how many events were lost is still lost.

I'm thinking about asking for a dedicated channel to trigger a full re-scan. This could help separate the two cases more cleanly.

Contributor Author

I expanded the comment to make it clearer.

I deliberately avoided having two streams for updates as it makes it harder to reason about correctness. With one stream that can be re-synchronized between updates it's much clearer, as it cannot race with another stream.
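
The single-stream design can be sketched roughly as follows. This is an illustration under assumptions, not the PR's code: the `update` type, the `lostEvents` counter, and the `reportLost` and `consume` helpers are all hypothetical. The lost-event count is recorded locally before the nil sentinel is sent, and the consumer re-synchronizes in stream order when it sees nil.

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// update is a hypothetical symbol update. A nil *update on the channel is
// the sentinel meaning "events were lost, re-scan everything".
type update struct {
	addr   uint64
	name   string
	remove bool
}

// lostEvents is a hypothetical counter that keeps the number of dropped
// events, which the nil sentinel itself cannot carry.
var lostEvents atomic.Uint64

// reportLost records the loss and then asks the consumer to resync.
func reportLost(records chan<- *update, lost uint64) {
	lostEvents.Add(lost)
	records <- nil
}

// consume applies updates in order. rescan rebuilds the table from scratch
// and must return a fresh map on every call.
func consume(records <-chan *update, rescan func() map[uint64]string) map[uint64]string {
	table := rescan()
	for rec := range records {
		switch {
		case rec == nil:
			table = rescan() // resync happens between updates, in stream order
		case rec.remove:
			delete(table, rec.addr)
		default:
			table[rec.addr] = rec.name
		}
	}
	return table
}

func main() {
	records := make(chan *update, 8)
	records <- &update{addr: 0x1000, name: "bpf_prog_a"}
	reportLost(records, 3)
	records <- &update{addr: 0x2000, name: "bpf_prog_b"}
	close(records)

	rescan := func() map[uint64]string { return map[uint64]string{0x3000: "baseline"} }
	table := consume(records, rescan)
	fmt.Println(len(table), lostEvents.Load()) // prints: 2 3
}
```

Because the sentinel travels on the same channel as the updates, the re-scan is ordered relative to them: updates queued before the loss are discarded by the re-scan, and updates queued after it are applied on top of the fresh baseline.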

Comment thread kallsyms/bpf.go
@bobrik bobrik force-pushed the ivan/bpf-updates branch 4 times, most recently from 9e2e2a5 to 594a0f7 Compare April 28, 2026 04:05
@bobrik
Contributor Author

bobrik commented Apr 28, 2026

I rebased and squashed the commits.

CI is seeing weird issues. I've seen this on v6.8 and v6.12 on different runs:

[            ] stdout: === RUN   TestAllTracers
[            ] stderr: time=2026-04-28T04:03:39.850Z level=INFO msg="Using binary analysis (BTF not available: open /sys/kernel/btf/vmlinux: no such file or directory)"
[            ] stdout:     ebpf_integration_test.go:276: 
[            ] stdout:         	Error Trace:	go.opentelemetry.io/ebpf-profiler/tracer/ebpf_integration_test.go:276
[            ] stdout:         	Error:      	Received unexpected error:
[            ] stdout:         	            	failed to load eBPF code: failed to set RODATA variables: failed to determine system configs: tp base not found: 00000000  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
[            ] stdout:         	            	00000010  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
[            ] stdout:         	            	00000020  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
[            ] stdout:         	            	00000030  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
[            ] stdout:         	            	00000040  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
[            ] stdout:         	            	00000050  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
[            ] stdout:         	            	00000060  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
[            ] stdout:         	            	00000070  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
[            ] stdout:         	Test:       	TestAllTracers
[            ] stdout: --- FAIL: TestAllTracers (0.78s)
[            ] stdout: FAIL

The latest re-run does not have it and it feels unrelated to the changes here.

@bobrik bobrik force-pushed the ivan/bpf-updates branch 2 times, most recently from e15a602 to 6f347d6 Compare April 30, 2026 05:17
Contributor

@fabled fabled left a comment

lgtm! thanks!

Just one question about potential GC pressure point. But approving at this point.

Comment thread kallsyms/bpf.go

// Insert the new symbol into the right position to maintain sorting.
newSym := bpfSymbol{address: addr, size: size, name: name}
newSymbols := make([]bpfSymbol, len(oldSymbols)+1)
Contributor

This (and the similar line in next function) always creates a new slice for the bpf symbol table for every individual bpf symbol change. I suspect the size of this slice can be fairly large.

I'm wondering how much this causes GC pressure in your system with large bpf program volatility.

Would it make sense to swap between two buffers and reallocate only if a larger capacity is needed? And when increasing the size, do it in larger increments than +1. Perhaps even use sync.Pool to store the other buffer?
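
For readers without the diff at hand, the pattern under discussion looks roughly like the sketch below. It is an assumption-laden reconstruction from the three-line snippet above, not the PR's actual function: `insertSorted` is a hypothetical name, and how the result is published to readers is not shown in the snippet.

```go
package main

import (
	"fmt"
	"sort"
)

type bpfSymbol struct {
	address uint64
	size    uint64
	name    string
}

// insertSorted returns a new slice with sym inserted at its sorted position.
// The old slice is left untouched, so a reader still holding it sees a
// consistent (if slightly stale) table.
func insertSorted(old []bpfSymbol, sym bpfSymbol) []bpfSymbol {
	i := sort.Search(len(old), func(i int) bool { return old[i].address >= sym.address })
	out := make([]bpfSymbol, len(old)+1) // one allocation per symbol change
	copy(out, old[:i])
	out[i] = sym
	copy(out[i+1:], old[i:])
	return out
}

func main() {
	old := []bpfSymbol{
		{address: 0x1000, size: 0x40, name: "bpf_prog_a"},
		{address: 0x3000, size: 0x40, name: "bpf_prog_c"},
	}
	updated := insertSorted(old, bpfSymbol{address: 0x2000, size: 0x40, name: "bpf_prog_b"})

	fmt.Println(len(old), len(updated), updated[1].name) // prints: 2 3 bpf_prog_b
}
```

The allocation is the price of never mutating a slice that a reader may still hold. Reusing or pooling buffers would reduce allocations, but it would then need some way to know that no reader is still using the buffer being recycled.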

Contributor Author

You need to squint really hard to even notice this stuff (hovering one of the greyed out columns):

Image

For comparison, symbolizeKernelFrames is 60x more expensive.

If you zoom in, most of the time is spent moving things around, not allocating:

Image

It doesn't seem to be worth worrying about this bit too much, there are bigger candidates that I intend to look into once this lands.

@bobrik bobrik force-pushed the ivan/bpf-updates branch from 6f347d6 to a1a5feb Compare May 4, 2026 20:00
BPF programs come and go much more frequently than modules and doing
a full re-parsing of `/proc/kallsyms` is very expensive, comparatively
speaking. Here we subscribe to updates for both additions and removals
of bpf symbols through `PERF_RECORD_KSYMBOL` mechanism of perf events.
Instead of triggering full parsing, we update the pre-existing mapping
for bpf pseudo-module whenever possible.
@bobrik bobrik force-pushed the ivan/bpf-updates branch from a1a5feb to 6ebb576 Compare May 4, 2026 20:05
@bobrik
Contributor Author

bobrik commented May 4, 2026

@bobrik bobrik force-pushed the ivan/bpf-updates branch 2 times, most recently from fe4b243 to a16794a Compare May 4, 2026 22:29
@bobrik bobrik force-pushed the ivan/bpf-updates branch from a16794a to 10615dc Compare May 4, 2026 22:34
@bobrik
Contributor Author

bobrik commented May 4, 2026

I added some error checking code and the error disappeared.

@bobrik bobrik requested a review from florianl May 6, 2026 16:06
Member

@florianl florianl left a comment

Thanks for the update and sorry for the delay, as I was off-desk last week.

@fabled fabled merged commit 9731da9 into open-telemetry:main May 11, 2026
32 checks passed
4 participants