refactor kernel module symbolization #502
Conversation
Implement a new kallsyms module for storing kernel symbols more efficiently and allowing symbols to be reloaded on a per-module basis. Since lookups of symbols by name are only ever done for a few symbols at startup, omit the reverse symbol map for a further memory reduction and use a linear search for these lookups. Rework the code that uses kernel symbols. Add a kobject listener to the tracer to trigger re-reading symbols.
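The storage scheme in the description can be sketched roughly as follows. This is an illustrative model, not the PR's actual API: type and method names (`sym`, `symtab`, `byAddr`, `byName`) are made up. Symbols are kept sorted by address so address lookups binary-search, while the rare by-name lookups at startup do a linear scan instead of paying memory for a reverse map.

```go
package main

import (
	"fmt"
	"sort"
)

// sym and symtab are hypothetical names illustrating the idea in the
// PR description; the real package's types differ.
type sym struct {
	addr uint64
	name string
}

type symtab struct {
	syms []sym // sorted by addr
}

func newSymtab(syms []sym) *symtab {
	sort.Slice(syms, func(i, j int) bool { return syms[i].addr < syms[j].addr })
	return &symtab{syms: syms}
}

// byAddr returns the name of the symbol covering addr via binary search.
func (t *symtab) byAddr(addr uint64) (string, bool) {
	i := sort.Search(len(t.syms), func(i int) bool { return t.syms[i].addr > addr })
	if i == 0 {
		return "", false
	}
	return t.syms[i-1].name, true
}

// byName does a linear scan: acceptable for the few startup lookups,
// and it avoids keeping a whole reverse map alive.
func (t *symtab) byName(name string) (uint64, bool) {
	for _, s := range t.syms {
		if s.name == name {
			return s.addr, true
		}
	}
	return 0, false
}

func main() {
	t := newSymtab([]sym{
		{addr: 0xffffffff81001000, name: "do_sys_open"},
		{addr: 0xffffffff81000000, name: "_stext"},
	})
	name, _ := t.byAddr(0xffffffff81001080)
	addr, _ := t.byName("_stext")
	fmt.Printf("%s %#x\n", name, addr) // do_sys_open 0xffffffff81000000
}
```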
// isPIDLive checks if a PID belongs to a live process. It will never produce a false negative but
// may produce a false positive (e.g. due to permissions) in which case an error will also be
// returned.
func isPIDLive(pid libpf.PID) (bool, error) {
This is moved straight from the old proc package. But I wonder why it has two methods?
@athre0z any ideas why it's kill + stat?
I suppose kill is faster, but perhaps does not work in certain containerized setups?
That's right, kill is assumed to be faster and tried first. stat is a fallback for when kill returns EPERM (should rarely if ever happen).
- add test case for the ftrace cloned init function
- make sure the test suite does not load sysfs data
- fix handling of things if sysfs is not available
- store mtime as int64 to reduce memory usage
florianl
left a comment
Works fine here, and local testing showed some improvements in terms of more accurate kernel module symbols.
}

// Refresh will reload kernel symbols if they are not yet loaded or invalidated.
func (s *Symbolizer) Refresh() error {
So currently Refresh is not fully re-entrant. This probably needs to be fixed. Questions:
- What to do if two Refresh calls are concurrent and an update is needed? Should we synchronize them with a Mutex to avoid reading kallsyms concurrently? I suppose yes; reading kallsyms is surprisingly slow. Looking at the kernel code, it actually is fairly CPU intensive.
- What to do if a reload of symbols fails? Use the old symbols, or start returning errors?
- Reloading after a failed reload should probably also be throttled, perhaps to once per minute or so.
How long does refreshing take on your machine? On mine it's around 300ms with around 110 kernel modules loaded. Before I measured, I was going to say that we should probably do the kallsyms processing asynchronously so that we don't introduce delays into GetModuleByAddress and GetModuleByName (which can use the old version while processing takes place), but maybe it's not worth the complexity.
If we do asynchronous processing of symbols, then for 1. we won't have concurrent Refresh calls, as there'll only be a single goroutine periodically calling it.
On my machine it's about 150ms with around 250 kernel modules loaded, and that is CPU time. Typically that amount of data takes only a few milliseconds to read. I pinpointed this in the kernel code: reading kallsyms is implemented with the kernel's seq infrastructure. During the module symbol reading stage, each line does a linear O(current-module) search for the module, followed by an O(current-symbol-within-module) search. So if you have modules with a large number of symbols at the end of the module chain, it takes a lot of CPU cycles.
Given your note, I think doing the kallsyms read in another goroutine is probably doable without too much complexity. The readers could just RCU-grab the []modules locally and do the search. The only drawback is that if a new module is loaded and it immediately starts doing a lot of work, the translation fails until the kallsyms refresh is done.
And for kernel modules it's somewhat important: kallsyms is used to 1) translate the VMA to the fileID+RVA and 2) send the fallback symbol. Point 2 is not too important. But because of point 1, we lose all information about the trace. Then again, the convergence time is only some hundreds of milliseconds, so the potentially lost time is not too much.
The other option would be to just synchronize symbolization requests with the Invalidate call. But that would be equivalent to the current approach of having the first symbolization call do the heavy lifting.
No strong opinion here; slightly leaning towards keeping the current approach, mostly because the kobject netlink listener cannot do a select. So in the current code base we can do multiple Invalidate calls from that goroutine, and all those get batched up until it comes time to do the first symbolization.
Since we don't really know all the details re: corner cases of parsing symbols, and I'm already seeing 300ms locally, it's not far-fetched to expect a worst case of 1s or more. I don't like that GetModuleByAddress and GetModuleByName would have non-deterministic runtime and could be delayed accordingly. It introduces runtime complexity: if we want to understand the dynamic behavior of that part of the codebase, we have to propagate that runtime indeterminism to all callers of GetModuleByAddress and GetModuleByName. Semantically it seems better to have these run as fast as possible and NOT depend on the runtime of symbol processing.
Since most of the work in Symbolizer.updateSymbolsFrom takes place in newly allocated slices and doesn't touch pre-existing data, we can add fine-grained locking (for example, RLock around s.getModuleByAddress and in GetModuleByName).
Then we can call Refresh from a single goroutine as needed, also doing throttling. For example, the goroutine can check s.valid.Load() once a minute and call Refresh or not. I think we should pick a Refresh frequency we're happy with and use that as the throttle baseline, meaning we should always throttle, not just on errors. Maybe we increase that period on errors.
This scheme will cause GetModuleByAddress and GetModuleByName never to block for indeterminate amounts of time and seems semantically simpler to me as there's only a single place managing all Refresh logic (the goroutine).
If we call Refresh once on agent startup, then possibilities for missing symbols only arise when new modules are loaded and there are associated incoming stack traces. I'm OK with missing the occasional symbol in these cases especially since this is the model the agent works under (eventual consistency). If that code is hot, the agent will see the associated stack traces again after symbols have been extracted.
Ok. Reworked this a bit differently: kallsyms.Reload now does the work and gets called from the kobject poller goroutine. Calls are rate-limited and triggered using the netlink socket deadline.
Things are made concurrency-safe: any number of reads can happen concurrently, and one Refresh can run concurrently with the readers. All this is done with the atomic/sync.Value mechanism.
The interface is refactored to do the initial kallsyms read in kallsyms.NewSymbolizer, so we are known to have a valid state.
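The atomic snapshot mechanism described above can be sketched as follows, assuming hypothetical names (`snapshot`, `Lookup`, `Reload`) rather than the PR's real API. Readers atomically load an immutable snapshot, while a single reloader builds a fresh one off to the side and publishes it with one atomic store, so readers never block.

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// snapshot is immutable once published; a reload never mutates it in place.
type snapshot struct {
	symbols map[uint64]string
}

// Symbolizer here is a sketch, not the kallsyms package's actual type.
type Symbolizer struct {
	cur atomic.Pointer[snapshot]
}

func NewSymbolizer() *Symbolizer {
	s := &Symbolizer{}
	s.cur.Store(&snapshot{symbols: map[uint64]string{}})
	return s
}

// Lookup is a lock-free read: it sees either the old or the new snapshot,
// never a half-updated one.
func (s *Symbolizer) Lookup(addr uint64) (string, bool) {
	name, ok := s.cur.Load().symbols[addr]
	return name, ok
}

// Reload is intended to be called from a single goroutine (e.g. the
// kobject poller); publishing is a single atomic pointer store.
func (s *Symbolizer) Reload(fresh map[uint64]string) {
	s.cur.Store(&snapshot{symbols: fresh})
}

func main() {
	s := NewSymbolizer()
	s.Reload(map[uint64]string{0x1000: "do_sys_open"})
	name, ok := s.Lookup(0x1000)
	fmt.Println(name, ok) // do_sys_open true
}
```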
I'm OK with missing the occasional symbol in these cases especially since this is the model the agent works under (eventual consistency). If that code is hot, the agent will see the associated stack traces again after symbols have been extracted.
So, yes and no. The new symbols will start to show up once kallsyms is reloaded. However, the traces missed before the reload that contain new symbols (from a newly loaded module) will never show up. This is because the module's fileID is not available, and those temporarily unknown symbols end up mapped to libpf.UnknownKernelFileID. We could potentially improve this by keeping a separate module listing: walk /sys/modules and scan the module VMA range and fileID first in a fast pass, and then do the slow kallsyms reading. But perhaps that's worth a follow-up ticket/PR?
But I guess this is OK, since the gap is expected to be small (kallsyms reading should not really fail), and I agree that making the symbolizer potentially slow is a bad design choice.
We could potentially improve this by keeping a separate module listing: walk /sys/modules and scan the module VMA range and fileID first in a fast pass, and then do the slow kallsyms reading.
I just implemented this in the latest commit. Should be in pretty good state now.
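The fast metadata pass discussed above needs a module's name, size, and base address before any symbols are read. As an illustration only (not the PR's implementation, which reads sysfs), the sketch below parses a /proc/modules-style line from a hardcoded string; note that real /proc/modules zeroes the address field for unprivileged readers.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// module holds the minimal metadata needed to establish a fileID:address
// mapping ahead of the slow kallsyms read.
type module struct {
	name string
	size uint64
	base uint64
}

// parseModulesLine parses one line in the /proc/modules format:
//   name size refcount deps state address
func parseModulesLine(line string) (module, error) {
	f := strings.Fields(line)
	if len(f) < 6 {
		return module{}, fmt.Errorf("short line: %q", line)
	}
	size, err := strconv.ParseUint(f[1], 10, 64)
	if err != nil {
		return module{}, err
	}
	base, err := strconv.ParseUint(strings.TrimPrefix(f[5], "0x"), 16, 64)
	if err != nil {
		return module{}, err
	}
	return module{name: f[0], size: size, base: base}, nil
}

func main() {
	// Hardcoded example line; a real implementation would read the file.
	m, err := parseModulesLine("nvme 49152 4 - Live 0xffffffffc05f4000")
	fmt.Println(m.name, m.size, err == nil) // nvme 49152 true
}
```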
…ng thread use the deadline mechanism to determine when reload should happen
- move the autoreload logic to the kallsyms package
- first load module metadata to produce faster valid fileID:address mappings
- throttle kallsyms reloading
index := len(m.names)
l := len(name)
// Cap the length to 255 bytes so it fits a byte. Longest seen
// symbol so far is 83 bytes.
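The snippet above caps symbol name lengths at 255 bytes so each length fits a single prefix byte. A packed name table along those lines might look like this; the field and method names are illustrative, not the PR's actual layout.

```go
package main

import "fmt"

// nameTable stores all names back to back in one byte slice with a
// one-byte length prefix each, which is why lengths must fit a byte
// (the longest symbol observed in the PR discussion was 83 bytes).
type nameTable struct {
	buf     []byte // length-prefixed names, concatenated
	offsets []int  // start of each entry's length byte
}

// add appends a name and returns its index.
func (t *nameTable) add(name string) int {
	if len(name) > 255 {
		name = name[:255] // cap so the length fits a byte
	}
	idx := len(t.offsets)
	t.offsets = append(t.offsets, len(t.buf))
	t.buf = append(t.buf, byte(len(name)))
	t.buf = append(t.buf, name...)
	return idx
}

// get returns the name stored at index idx.
func (t *nameTable) get(idx int) string {
	off := t.offsets[idx]
	n := int(t.buf[off])
	return string(t.buf[off+1 : off+1+n])
}

func main() {
	var t nameTable
	i := t.add("do_sys_open")
	j := t.add("_stext")
	fmt.Println(t.get(i), t.get(j)) // do_sys_open _stext
}
```

Compared to one Go string per symbol, this avoids a string header (16 bytes on 64-bit) and a separate allocation per name.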
fixes #211
Significant reduction of memory usage for kernel symbols: