MVP: Breakdown graphs #70
After a conversation with @roncohen, we're going to change the approach a bit. Up until now, the concept was to base the breakdown on individual transaction documents (or rather the span documents of sampled transactions). The metric-based approach seems more viable: it is more space efficient (transaction documents don't get bigger, and the cost per agent is fixed rather than throughput-dependent) and even more accurate than storing the breakdown data only on sampled transactions. Note that we don't need percentiles/histograms for this use case.

It's still possible to use certain dimensions to drill down. These dimensions are everything that's available in metadata, as well as the transaction name/type and the span type.

Implications for agents:

Update: agents can actually also start with just collecting timing information for sampled spans, although eventually it's better to also track breakdown metrics for all transactions. The reason is that this works better once we have non-percentage-based sampling in place. For example, when there are different sampling rates for different transactions, or if we rate-limit per unique transaction name, the breakdown would be skewed unfairly towards transactions with a lower throughput (because we don't sample as many of the transactions with a higher throughput).

I'm currently working out the details of this approach and will update this issue accordingly.
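For illustration, here is a minimal sketch of how an agent might accumulate such breakdown metrics. All class, field, and metric names (e.g. `span.self_time.sum.us`) are hypothetical placeholders, since the exact metricset format was still being worked out at this point; the sketch only shows why the cost is fixed per agent rather than throughput-dependent.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.LongAdder;

// Hypothetical sketch: accumulate span self-time per dimension combination
// and flush it as metricsets at a fixed interval. Names are illustrative,
// not the agreed-upon spec.
class BreakdownMetrics {

    // Dimensions to drill down by: transaction name/type and span type.
    record Key(String transactionName, String transactionType, String spanType) {}

    static class Timing {
        final LongAdder count = new LongAdder();
        final LongAdder sumUs = new LongAdder(); // accumulated self-time, microseconds
    }

    private final Map<Key, Timing> timings = new ConcurrentHashMap<>();

    // Called whenever a span (or the transaction itself) ends.
    void addSelfTime(Key key, long selfTimeUs) {
        Timing t = timings.computeIfAbsent(key, k -> new Timing());
        t.count.increment();
        t.sumUs.add(selfTimeUs);
    }

    // Called once per metrics interval. The reporting cost is one entry per
    // dimension combination, regardless of how many transactions occurred.
    void flush() {
        for (Map.Entry<Key, Timing> e : timings.entrySet()) {
            // report e.g. span.self_time.count and span.self_time.sum.us
            // as a metricset tagged with the dimensions in e.getKey()
        }
        timings.clear();
    }
}
```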
This is the revised, metric-based concept, which can be copy/pasted into Kibana Dev Tools to try it out with example data:
Both approaches for top labels sound fine to me, but I'd prefer to update the spec.
I understand the use case for wanting to record timings for all spans here, but I'd currently consider this a blocker for the Node.js agent, as I think the overhead of instantiating non-sampled spans would currently be too large. For this to be efficient, the Node.js agent would first have to implement an object-reuse pool similar to the one that the Java agent has. That would be really nice to have in the Node.js agent for other reasons as well, of course, so maybe now is the time to prioritize implementing this.
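For reference, the recycling idea could be as simple as the following sketch (the Java agent's actual pool is more elaborate; this is only an illustration of avoiding per-span allocations, with all names hypothetical):

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.function.Supplier;

// Minimal illustration of an object-reuse pool: span objects are recycled
// instead of allocated per span, so non-sampled spans stay cheap.
class SpanPool<T> {
    interface Resettable { void resetState(); }

    private final BlockingQueue<T> free;
    private final Supplier<T> factory;

    SpanPool(int capacity, Supplier<T> factory) {
        this.free = new ArrayBlockingQueue<>(capacity);
        this.factory = factory;
    }

    T acquire() {
        T t = free.poll();
        return t != null ? t : factory.get(); // allocate only when the pool is empty
    }

    void release(T t) {
        if (t instanceof Resettable r) {
            r.resetState();   // clear state before the object is reused
        }
        free.offer(t);        // silently drop the object if the pool is full
    }
}
```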
I'd prefer to extend the metricset spec. It feels like it'll grow better with age if/when we add other well-defined fields. It would also avoid surprising behaviour when moving from one version of the server to another, where labels that were previously stored as labels would start being stored as some other field. (Perhaps a vanishingly unlikely scenario?) I'm still not too sure how we're going to be able to calculate self_time well in the Go agent, particularly for custom instrumentation. @felixbarny, do you have an idea of what you'll do for custom spans? My current idea is to start out assuming all spans are synchronous, and provide the user with a means of flagging asynchronous spans.
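To illustrate that starting point, here is a minimal sketch (in Java for consistency with the other examples here; the names are hypothetical, not an actual agent API): self-time falls out as a span's own duration minus the durations of its direct children, skipping children the user has flagged as asynchronous.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: assume spans are synchronous unless flagged async;
// self-time is the span's duration minus its synchronous children's durations.
class SimpleSpan {
    long durationUs;
    boolean async; // user-provided flag; false (synchronous) by default
    final List<SimpleSpan> children = new ArrayList<>();

    long selfTimeUs() {
        long self = durationUs;
        for (SimpleSpan child : children) {
            if (!child.async) {
                self -= Math.min(child.durationUs, self); // never go below zero
            }
        }
        return self;
    }
}
```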
Thinking about it again, it's actually not strictly required to also time non-sampled spans. However, the behavior would have to be consistent. I think it's ok that some agents only time sampled transactions and some time all of them. Currently, we don't have a view which mixes data from different agents. But if we want to break down a complete trace, for example, we might have to align on one approach. That is much further out, though: it would require propagating the trace name downstream and the transaction duration upstream, which requires significant agent alignment and might warrant a major version bump anyway. Also, we might still get away with agents choosing different approaches (timing only sampled vs. all transactions), as we could do a breakdown per service, calculate the percentages, and then aggregate the percentages, using the average duration per service and time bucket as a weight.
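That last suggestion could look roughly like the following sketch (an illustration of the idea, not an agreed design): per time bucket, weight each service's breakdown percentage by that service's average transaction duration.

```java
// Sketch: aggregate per-service breakdown percentages into one number,
// weighting each service by its average duration in the time bucket.
class BreakdownAggregation {
    static double weightedBreakdown(double[] percentages, double[] avgDurationsUs) {
        double weightedSum = 0, totalWeight = 0;
        for (int i = 0; i < percentages.length; i++) {
            weightedSum += percentages[i] * avgDurationsUs[i];
            totalWeight += avgDurationsUs[i];
        }
        return totalWeight == 0 ? 0 : weightedSum / totalWeight;
    }
}
```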
If my understanding of
Could we define self-time as the part of a span's duration during which none of its direct children are running?
With this definition, it would not matter if a child is async or not. Example A:
Span s1 has two async children, s1-1 and s1-2, which together cover the middle of s1. The self-time of s1 is [s1.start..s1-1.start] + [s1-2.end..s1.end]. Example B:
In this example, s1-1 outlives its parent, so the self-time of s1 is [s1.start..s1-1.start]. The downside of that definition is that it may wrongly assume that a parent span is idle waiting for its child spans to complete when it actually is not. Consider these scenarios in a reactive environment like Node.js, where there's only one thread and all spans in these example scenarios are async:
I conclude that even for async child spans, we can't assume that the parent span is not blocked waiting for the child. This can be implemented with a "reentrant timer" present on each transaction and span; see the sketch below.
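One possible shape for such a reentrant timer, assuming single-threaded access for simplicity (a real agent would need to handle concurrency and clock sources): self-time accrues only while no direct child is active, overlapping children pause the timer only once, and a child that outlives its parent (Example B) simply leaves the timer paused.

```java
// Sketch of a "reentrant" self-time timer, assuming single-threaded access.
// Self-time accrues only while the span is running and no direct child is active.
class SelfTimeTimer {
    private long selfTime;       // accumulated self-time, in nanoseconds
    private long lastResume;     // timestamp of the last resume
    private int activeChildren;  // number of currently running direct children
    private boolean running;

    void onSpanStart(long now) {
        lastResume = now;
        running = true;
    }

    void onChildStart(long now) {
        // First child became active: pause the timer.
        if (activeChildren++ == 0 && running) {
            selfTime += now - lastResume;
        }
    }

    void onChildEnd(long now) {
        // Last child finished: resume the timer.
        if (--activeChildren == 0 && running) {
            lastResume = now;
        }
    }

    long onSpanEnd(long now) {
        running = false;
        // If a child is still active (Example B), add no trailing time.
        if (activeChildren == 0) {
            selfTime += now - lastResume;
        }
        return selfTime;
    }
}
```

Walking Example A through this sketch yields [s1.start..s1-1.start] + [s1-2.end..s1.end], matching the definition above: the timer pauses when s1-1 starts, stays paused while s1-2 overlaps, and resumes only when the last child ends.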
Another important note is that agents don't have to implement it by 7.1. But the UI and APM Server should be ready to accept and process the data.
Meeting summary from the Node.js agent perspective: For now, we can calculate these metrics based on sampled transactions/spans only. As long as we're only doing random sampling, this should be ok.
Also discussed the RUM agent's possible use cases for the breakdown graph. So far there's nothing backend-specific in our approach, and it is possible to calculate some breakdown values based on the spans currently collected by the RUM agent. However, we need to investigate more to see how useful these graphs are for Real User Monitoring, considering the data available for breakdown graphs.
Edit: update moved to #70 (comment)
Checked the design issue and added links to the two implementation issues for the graph and the KPI stats component.
This meta issue tracks the development of a minimum viable product for breakdown graphs. The goal is to get the PRs merged into the respective repos. As some of the technical details may still change in the course of implementing this MVP, it makes sense to mark these features as incubating and also to create internal feature toggles for them. That will make it easier to test the features in combination, minimize merge conflicts, work in small iterations, and decouple merge from release. But that's only a suggestion.
Concept: https://docs.google.com/document/d/1-_LuC9zhmva0VvLgtI0KcHuLzNztPHbcM0ZdlcPUl64
The MVP phase should at least involve the UI, the APM Server and the Java agent. Other agents are welcome to join the party as well :)