
refactor kernel module symbolization#502

Merged
fabled merged 17 commits into open-telemetry:main from fabled:tt-kallsyms-reload on Jun 13, 2025

Conversation

@fabled (Contributor) commented Jun 3, 2025:

Implement a new kallsyms module for storing kernel symbols more efficiently and allowing reloading of symbols on a per-module basis.

Since the lookup of symbols by name is only ever done for a very few symbols at startup, omit the reverse symbol map for further memory reduction, and do a linear search for these lookups.

Rework the code that uses kernel symbols.

Add a kobject listener to the tracer to invoke re-reading of symbols.

fixes #211

Significant reduction of memory usage for kernel symbols:

  • the kernel has 200k+ symbols; this makes the per-symbol overhead 8 bytes (instead of libpf.Symbol, which is 32 bytes)
  • the reverse symbol map is removed, reducing memory usage further (and reducing allocation pressure even more, since no dynamic map is created at all)
  • if further reduction is needed, we could consider using https://github.com/tmthrgd/shoco or similar fast string compressor which can be trained with kallsyms data

@fabled fabled force-pushed the tt-kallsyms-reload branch 2 times, most recently from 0ea9695 to 22e88d0 Compare June 3, 2025 10:43
@fabled fabled force-pushed the tt-kallsyms-reload branch from 22e88d0 to 8d098ff Compare June 3, 2025 10:46
// isPIDLive checks if a PID belongs to a live process. It will never produce a false negative but
// may produce a false positive (e.g. due to permissions) in which case an error will also be
// returned.
func isPIDLive(pid libpf.PID) (bool, error) {
@fabled (Contributor Author) commented:
This is moved straight from the old proc package. But I wonder why it has two methods?
@athre0z any ideas why it's kill + stat?
I suppose kill is faster, but perhaps it does not work in certain containerized setups?

Member replied:
That's right, kill is assumed to be faster and tried first. stat is a fallback for when kill returns EPERM (should rarely if ever happen).

@fabled fabled marked this pull request as ready for review June 3, 2025 16:50
@fabled fabled requested review from a team as code owners June 3, 2025 16:50
@christos68k (Member) left a comment:
Will do another pass

fabled added 2 commits June 4, 2025 17:33
- add test case for the ftrace cloned init function
- make sure the test suite does not load sysfs data
- fix handling of things if sysfs is not available
- store mtime as int64 to reduce memory usage
@florianl (Member) left a comment:

Works fine here, and local testing showed some improvements in the accuracy of kernel module symbols.

}

// Refresh will reload kernel symbols if they are not yet loaded or invalidated.
func (s *Symbolizer) Refresh() error {
@fabled (Contributor Author) commented:
So currently Refresh is not fully re-entrant. This probably needs to be fixed. Questions:

  1. What should we do if two Refresh calls are concurrent and an update is needed? Should we synchronize them with a Mutex to avoid reading kallsyms concurrently? I suppose yes; reading kallsyms is surprisingly slow. Looking at the kernel code, it is actually fairly CPU-intensive.
  2. What should we do if a reload of symbols fails? Use the old symbols, or start returning errors?
  3. Reloading after a failed reload should probably also be throttled, perhaps to once per minute or so.

@christos68k (Member) commented Jun 5, 2025:
How long does refreshing take on your machine? On mine it's around 300ms with around 110 kernel modules loaded. Before I measured, I was going to say that we should probably do the kallsyms processing asynchronously so that we don't introduce delays into GetModuleByAddress and GetModuleByName (which can use the old version while processing is taking place), but maybe it's not worth the complexity.

If we do asynchronous processing of symbols, then for 1. we won't have concurrent Refresh calls, as there'll only be a single goroutine periodically calling it.

@fabled (Contributor Author) commented:
On my machine it's about 150ms with around 250 kernel modules loaded, and that is CPU time; typically that amount of data takes only a few milliseconds to read. I pinpointed this in the kernel code: reading kallsyms is implemented with the kernel's seq_file mechanism. During the module-symbol reading stage, each line does a linear O(current-module) search for the module, followed by an O(current-symbol-within-module) search. So if you have modules with a large number of symbols at the end of the module chain, it takes a lot of CPU cycles.

Given your note, I think doing the kallsyms read in another goroutine is probably doable without too much complexity. The readers could just RCU-grab the []modules locally and do the search. The only drawback is that if a new module is loaded and immediately starts doing a lot of work, the translation fails until the kallsyms refresh is done.

And for kernel modules it's somewhat important: kallsyms is used to 1) translate the VMA to the fileID+RVA and 2) send the fallback symbol. Point 2 is not too important, but because of point 1 we lose all information for the trace. Then again, the convergence time is only a few hundred milliseconds, so the potentially lost time is not much.

The other option would be to just synchronize symbolization requests with the invalidate call. But that would be equivalent to the current approach of having the first symbolization call do the heavy lifting.

No strong opinion here. Slightly leaning towards keeping the current approach, mostly because the kobject netlink listener cannot do select. So in the current code base we can do multiple Invalidate calls from that goroutine, and all those get batched up until it comes time to do the first symbolization.

@christos68k (Member) commented Jun 5, 2025:
Since we don't really know all the details regarding corner cases of parsing symbols, and since I'm already seeing 300ms locally, it's not far-fetched to expect a worst case of 1s or more. I don't like that GetModuleByAddress and GetModuleByName have non-deterministic runtime and could be delayed accordingly. It introduces runtime complexity: if we want to understand the dynamic behavior of that part of the codebase, we have to propagate that runtime indeterminism to all the callers of GetModuleByAddress and GetModuleByName. Semantically it seems better to have these run as fast as possible and NOT depend on the runtime of processing symbols.

Since most of the work in Symbolizer.updateSymbolsFrom takes place in newly allocated slices and doesn't touch pre-existing data, we can add fine-grained locking (for example, RLock around s.getModuleByAddress and in GetModuleByName).

Then we can call Refresh from a single goroutine as-needed, also doing throttling. For example, the goroutine can check s.valid.Load() once a minute and call Refresh or not. I think we should pick a Refresh frequency we're happy with and use that as the baseline for throttle, meaning we should always throttle not just on errors. Maybe we increase that period on errors.

This scheme will cause GetModuleByAddress and GetModuleByName never to block for indeterminate amounts of time and seems semantically simpler to me as there's only a single place managing all Refresh logic (the goroutine).

If we call Refresh once on agent startup, then possibilities for missing symbols only arise when new modules are loaded and there are associated incoming stack traces. I'm OK with missing the occasional symbol in these cases especially since this is the model the agent works under (eventual consistency). If that code is hot, the agent will see the associated stack traces again after symbols have been extracted.

@fabled (Contributor Author) commented:
Ok. Reworked this a bit differently. kallsyms.Reload now does the work, and gets called from the kobject poller goroutine. Calling this is rate limited and triggered using the netlink socket deadline.

Things are made concurrency-safe: any number of readers can read concurrently, and one Refresh can run concurrently with the readers. All this is done with the atomic/sync.Value mechanism.

The interface was refactored to do the initial kallsyms read in kallsyms.NewSymbolizer, so we are guaranteed to have a valid state.

> I'm OK with missing the occasional symbol in these cases especially since this is the model the agent works under (eventual consistency). If that code is hot, the agent will see the associated stack traces again after symbols have been extracted.

So, yes and no. The new symbols will start to show up once kallsyms is reloaded. However, the traces missed before the reload that contain new symbols (from a newly loaded module) will never show up. This is because the module's fileID is not available, and those temporarily unknown symbols end up mapped to libpf.UnknownKernelFileID. We could potentially improve this by keeping a separate module listing: walk /sys/module and scan the module VMA ranges and fileIDs first in a fast pass, and then do the slow kallsyms reading. But perhaps that is worth a follow-up ticket/PR?

But I guess this is OK, since the gap is expected to be small (kallsyms reading should not really fail), and I agree that making the symbolizer potentially slow is a bad design choice.
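A minimal sketch of the concurrent-read/single-writer pattern described here, using atomic.Pointer in place of sync.Value (both give the same publish-a-snapshot semantics). The types and names are illustrative, not the actual kallsyms structures: readers load an immutable snapshot and never block, while Refresh builds a new snapshot aside and publishes it with one atomic store.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// module is a hypothetical stand-in for the per-module metadata.
type module struct {
	name  string
	start uint64
	end   uint64
}

type symbolizer struct {
	mods      atomic.Pointer[[]module] // current immutable snapshot, swapped wholesale
	refreshMu sync.Mutex               // serializes writers; readers never take it
}

func newSymbolizer(mods []module) *symbolizer {
	s := &symbolizer{}
	s.mods.Store(&mods) // initial load so readers always see a valid state
	return s
}

// getModuleByAddress is a lock-free read against the current snapshot.
func (s *symbolizer) getModuleByAddress(addr uint64) (string, bool) {
	for _, m := range *s.mods.Load() {
		if addr >= m.start && addr < m.end {
			return m.name, true
		}
	}
	return "", false
}

// refresh publishes a new snapshot; readers keep using the old one until
// the atomic store, so they are never blocked or exposed to partial data.
func (s *symbolizer) refresh(newMods []module) {
	s.refreshMu.Lock()
	defer s.refreshMu.Unlock()
	s.mods.Store(&newMods)
}

func main() {
	s := newSymbolizer([]module{{"snd_hda", 0x1000, 0x2000}})
	s.refresh([]module{{"snd_hda", 0x1000, 0x2000}, {"kvm", 0x3000, 0x4000}})
	name, ok := s.getModuleByAddress(0x3500)
	fmt.Println(name, ok)
}
```

The key property is that the slice behind the pointer is never mutated after publication, so readers need no locks at all; only concurrent writers need the mutex.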

@fabled (Contributor Author) commented:
> We could potentially improve this by keeping a separate module listing by walking /sys/module and scanning the module VMA range and fileID first in a fast pass, and then doing the slow kallsyms reading.

I just implemented this in the latest commit. It should be in a pretty good state now.

- move the autoreload logic to the kallsyms package
- first load module metadata to produce faster valid
  fileID:address mappings
- throttle kallsyms reloading
@fabled fabled merged commit a243f85 into open-telemetry:main Jun 13, 2025
25 checks passed
Comment thread kallsyms/kallsyms.go
index := len(m.names)
l := len(name)
// Cap the length to 255 bytes so it fits a byte. Longest seen
// symbol so far is 83 bytes.


Development

Successfully merging this pull request may close these issues.

Frame metadata for kernel modules is only read once at startup

6 participants