Profiles: simplify profile stack trace representation #645
felixge wants to merge 1 commit into open-telemetry:main
florianl left a comment
Especially mixed reports of profiles, e.g. on- and off-CPU profiles, will benefit from this approach.
In open-telemetry#645 a simplified stack trace representation is proposed. A discussion there asks for a more efficient way to represent stacks that slightly differ at the leaf. A specialized solution proposed there would put the leaf location as a separate field. That seems partial and potentially error-prone. This PR proposes an alternative stack trace representation, shaped as a tree in the form of two arrays. See the code for details and an example. The advantage of this approach is that stacks with the same prefix are encoded more compactly. Also, since we use only two arrays for the encoding, the in-memory representation is also more efficient in terms of the number of allocations required. The main disadvantage is that this approach is more complex semantically and for this reason can be more error-prone.
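To give a feel for what a tree "in the form of two arrays" can look like, here is a minimal sketch. It is not the encoding from that PR (see its code for the real layout); the array names `location` and `parent`, the `ROOT` sentinel, and the integer location indices are all assumptions made for illustration. Each node stores one location index and a link to its parent node, so stacks that share a root prefix share nodes.

```python
from typing import Dict, List, Tuple

ROOT = -1  # hypothetical sentinel meaning "this node has no parent"


class StackTree:
    """Prefix tree of stacks kept in two parallel arrays (illustrative only)."""

    def __init__(self) -> None:
        self.location: List[int] = []  # location index stored at node i
        self.parent: List[int] = []    # parent node of node i, or ROOT
        # Helper index used only while building: (parent, location) -> node.
        self._children: Dict[Tuple[int, int], int] = {}

    def add(self, stack_root_first: List[int]) -> int:
        """Insert a stack given root-first and return the node id of its leaf."""
        node = ROOT
        for loc in stack_root_first:
            key = (node, loc)
            child = self._children.get(key)
            if child is None:
                child = len(self.location)
                self.location.append(loc)
                self.parent.append(node)
                self._children[key] = child
            node = child
        return node

    def decode(self, node: int) -> List[int]:
        """Follow parent links from a node; returns locations leaf-first."""
        out: List[int] = []
        while node != ROOT:
            out.append(self.location[node])
            node = self.parent[node]
        return out


tree = StackTree()
s1 = tree.add([1, 2, 3, 4])  # hypothetical location indices, root first
s2 = tree.add([1, 2, 3, 5])  # shares the prefix 1;2;3 with s1
s3 = tree.add([1, 2, 3])     # is itself a prefix, so it adds no new node
```

A sample would then reference a stack by a single node id. The shared prefix is stored once, at the cost of having to walk parent links to recover a full stack, which is the extra semantic complexity mentioned above.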
See open-telemetry/opentelemetry-ebpf-profiler#524 for benchmarks that also cover this proposal.
- Introduce a first-class Stack message type and lookup table.
- Replace location index range based stack trace encoding on Sample with a single stack_index reference.
- Remove the location_indices lookup table.

The primary motivation is laying the groundwork for [timestamp based profiling][timestamp proposal], where the same stack trace needs to be referenced much more frequently compared to aggregation based on low cardinality attributes.

Timestamp based profiling is also expected to be used with the upcoming [Off-CPU profiling][off-cpu pr] feature in the eBPF profiler. Off-CPU stack traces have a different distribution compared to CPU samples. In particular, stack traces are much more repetitive because they only occur at call sites such as syscalls. For the same reason, it is also uncommon to see a stack trace that is a root-prefix of a previously observed stack trace. We might need to revisit the [previous benchmarks][benchmarks] to confirm these claims.

The secondary motivation is simplicity. Arguably the proposed change here will make it easier to write exporters, processors as well as receivers.

It seems like we had rough consensus around this change in previous SIG meetings, and it seems like a good incremental step to make progress on the timestamp proposal.

[timestamp proposal]: open-telemetry#594
[off-cpu pr]: open-telemetry/opentelemetry-ebpf-profiler#196
[benchmarks]: https://docs.google.com/spreadsheets/d/1Q-6MlegV8xLYdz5WD5iPxQU2tsfodX1-CDV1WeGzyQ0/edit?gid=2069300294#gid=2069300294
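As a rough illustration of the stack table idea described above, the sketch below models it with plain Python dataclasses instead of the generated protobuf types. `Stack`, `stack_table` and `Sample.stack_index` come from the description; the `location_indices` field name on `Stack`, the `timestamp_ns` field, and the `intern_stack` / `add_sample` helpers are assumptions made for the example.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class Stack:
    # Indices into a location table, leaf frame first (field name assumed).
    location_indices: List[int]


@dataclass
class Sample:
    stack_index: int   # index into Profile.stack_table
    timestamp_ns: int  # hypothetical per-sample timestamp


@dataclass
class Profile:
    stack_table: List[Stack] = field(default_factory=list)
    samples: List[Sample] = field(default_factory=list)
    # Builder-side lookup so identical stacks are stored only once.
    _seen: Dict[Tuple[int, ...], int] = field(default_factory=dict, repr=False)

    def intern_stack(self, location_indices: List[int]) -> int:
        """Return the stack_table index for this stack, adding it if new."""
        key = tuple(location_indices)
        index = self._seen.get(key)
        if index is None:
            index = len(self.stack_table)
            self.stack_table.append(Stack(list(location_indices)))
            self._seen[key] = index
        return index

    def add_sample(self, location_indices: List[int], timestamp_ns: int) -> None:
        """Record one timestamped sample that references an interned stack."""
        stack_index = self.intern_stack(location_indices)
        self.samples.append(Sample(stack_index, timestamp_ns))


profile = Profile()
profile.add_sample([3, 2, 1], timestamp_ns=10)
profile.add_sample([3, 2, 1], timestamp_ns=20)  # same stack, reused entry
profile.add_sample([4, 2, 1], timestamp_ns=30)
```

With timestamp based profiling, every observation becomes its own sample, so a repeated stack costs one integer per sample rather than a repeated list of locations.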
52ad266 to d972082
This PR has been rebased and the description has been updated. It's ready for another round of review.
  bool has_inline_frames = 9;
}

// A Stack represents a stack trace as a list of locations. The first location
We should also add this to the helper ASCII graph at the beginning of the file and update that graph accordingly.
Please resolve comments so that I can merge.
  bool has_inline_frames = 9;
}

// A Stack represents a stack trace as a list of locations. The first location
nit: "The first location is the leaf frame" - would it make sense to put this in the field comment rather than message?
  repeated AttributeUnit attribute_units = 7;

  // Stacks referenced by samples via Sample.stack_index.
  repeated Stack stack_table = 8;
Maybe group stack_table with location_table and friends and keep attribute_units last since it's somewhat special in semantics?
We are planning to release 1.8.0. Do you need this merged before that?
Before if possible, otherwise we'll need to trigger another release very soon.
OK, in that case please address / close open comments and resolve the merge conflicts so that this PR can be merged. We can wait a day or so but if it is going to take much longer then it will have to wait until the release after that.
@felixge will probably be busy until the end of this week so I'm guessing we won't be able to resolve conflicts and merge #645, but I've opened #708 which is essentially #645 with updates and hopefully we can get that in before |
If you have a particular reason this change has to go together with some previous ones, we can wait a few more days. If it doesn't need to be released together, then we can release 1.8.0 now and do 1.9.0 immediately after your change is merged. Releases are not hard and we don't time them to any particular cadence; we can release when we want.
There's no special reason other than speeding up our roadmap (and also updating the profiling agent to include all recent protocol changes) but being able to do another |
@christos68k, should this PR be closed given #708 is merged?
Closing, @tigrannajaryan has merged #708 which contains the changes in this PR.
Changes
Motivation
tl;dr: Let's KISS. The stack trace encoding proposed in this PR is competitive with more advanced stack trace encoding formats, but much simpler to implement. See table below or @christos68k's benchmark for details.
The current stack trace encoding is somewhat complex as it tries to optimize for CPU profiling stack traces where it's relatively common to see prefixes of a stack trace.
For example, a program that looks like this ...
... will produce a CPU profile with stack traces like the ones shown below. The numbers indicate samples taken at different locations (program counters) within the same function:
This makes it attractive to optimize away the prefix repetition (a;b;c) in the stack trace dictionary table. But the current format doesn't quite succeed in this, as it would only allow us to get rid of the last stack trace in the table. This probably explains why it doesn't end up performing very differently from the encoding proposed in this PR.
The alternative double array proposal from @aalexand achieves over 12% reduction in profile size before compression, because it can exploit this pattern effectively. However, for some reason the resulting data patterns end up being much harder to compress, and the encoding loses by almost 10% after compression has been applied.
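The limitation of the range-based format can be sketched as follows. This is a simplified model, not the actual exporter logic: each stack is a (start, length) range into one flat array of location indices, stored leaf first, and a stack can reuse an existing range only if it appears there contiguously. The function names, the greedy search, and the four stacks (written with made-up integer location indices) are all assumptions for illustration.

```python
from typing import List, Tuple


def find_subrange(haystack: List[int], needle: List[int]) -> int:
    """Return the first offset where needle occurs contiguously, or -1."""
    n = len(needle)
    for start in range(len(haystack) - n + 1):
        if haystack[start:start + n] == needle:
            return start
    return -1


def encode_ranges(
    stacks_leaf_first: List[List[int]],
) -> Tuple[List[int], List[Tuple[int, int]]]:
    """Encode stacks as (start, length) ranges into one flat index array."""
    location_indices: List[int] = []
    ranges: List[Tuple[int, int]] = []
    for stack in stacks_leaf_first:
        start = find_subrange(location_indices, stack)
        if start < 0:
            # No existing run matches, so the whole stack is appended.
            start = len(location_indices)
            location_indices.extend(stack)
        ranges.append((start, len(stack)))
    return location_indices, ranges


# Hypothetical stacks, leaf first. All share the root frames 2 and 1, but
# only the last one is a contiguous sub-range of an earlier stack.
stacks = [
    [4, 3, 2, 1],
    [5, 2, 1],
    [6, 2, 1],
    [3, 2, 1],
]
location_indices, ranges = encode_ranges(stacks)
```

In this model the flat array ends up with 10 entries for 13 frames in total: the shared root frames 2 and 1 are still stored three times, and only the final stack is deduplicated, by pointing into the first one at offset 1.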
Last but not least, not all profiling data will benefit from prefix-encoding as much as CPU profiles. For example memory allocations and especially Off-CPU stack traces have significantly less prefix repetition. We don't have a good benchmark for this, but I hope this argument is easy to follow regardless.
Given the above, I'm proposing to choose the most trivial stack trace encoding format as outlined in this PR, as it seems to hit a sweet spot between complexity and efficiency. The main task for reviewers is to validate @christos68k's benchmarks and confirm whether they're okay with this direction.
It seems like we had rough consensus around this change in previous SIG meetings.