MVP: Breakdown graphs #70

Closed
7 tasks done
felixbarny opened this issue Apr 2, 2019 · 13 comments

felixbarny commented Apr 2, 2019

This meta issue tracks the development of a minimum viable product for breakdown graphs. The goal is to get the PRs merged into the respective repos. As some of the technical details may still change in the course of implementing this MVP, it makes sense to mark these features as incubating and to also create internal feature toggles for them. That will make it easier to test the features in combination, to minimize merge conflicts, to work in small iterations, and to decouple merging from releasing. But that's only a suggestion.

Concept: https://docs.google.com/document/d/1-_LuC9zhmva0VvLgtI0KcHuLzNztPHbcM0ZdlcPUl64

The MVP phase should at least involve the UI, the APM Server and the Java agent. Other agents are welcome to join the party as well :)


felixbarny commented Apr 5, 2019

After a conversation with @roncohen, we're going to change the approach a bit. Up until now, the concept was to base the breakdown on individual transaction documents (or rather nested-mapped breakdown documents within transactions) so that users could use the query bar to drill down by things like user.id and labels/tags. It turns out that this drill-down is not an important use case (source: @makwarth). It's more about getting a quick overview of where time is generally spent, at the service and transaction-group level.

Therefore, the metric-based approach seems more viable as it will be more space-efficient (transaction documents don't get bigger; the cost per agent is fixed as opposed to throughput-dependent) and even more accurate compared to storing the breakdown data only on sampled transactions. Note that we don't need percentiles/histograms for this use case.

It's still possible to use certain dimensions to drill down. These dimensions are everything that's available in metadata, as well as transaction.name and transaction.type. Searching for a particular user.id yields an empty graph - that's ok.
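To make the metric-based approach a bit more tangible, here is a minimal sketch of how an agent could accumulate breakdown timings per dimension combination instead of attaching them to transaction documents. The class and field names are hypothetical; the actual metricset format is defined in the design doc linked above.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.LongAdder;

// Hypothetical sketch: breakdown timings are accumulated per dimension
// combination (transaction.name, transaction.type, span.type) instead of
// being attached to individual transaction documents.
public class BreakdownMetrics {

    // One counter pair (count + total self_time in microseconds) per key.
    private final Map<String, Timing> timings = new ConcurrentHashMap<>();

    public void addSelfTime(String transactionName, String transactionType,
                            String spanType, long selfTimeUs) {
        String key = transactionName + '|' + transactionType + '|' + spanType;
        timings.computeIfAbsent(key, k -> new Timing()).add(selfTimeUs);
    }

    // Called periodically (e.g. every 30s) to turn the accumulated values into
    // metricset documents for the APM Server; the reporting interface here is
    // illustrative only, the real field names are defined in the spec.
    public void reportAndReset(MetricReporter reporter) {
        timings.forEach((key, timing) ->
                reporter.report(key, timing.count.sumThenReset(), timing.sumUs.sumThenReset()));
    }

    static final class Timing {
        final LongAdder count = new LongAdder();
        final LongAdder sumUs = new LongAdder();

        void add(long selfTimeUs) {
            count.increment();
            sumUs.add(selfTimeUs);
        }
    }

    public interface MetricReporter {
        void report(String key, long count, long sumUs);
    }
}
```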

Implications for agents:
It will no longer be possible to have true noop spans. With this approach, every span has to be timed, even spans for non-sampled transactions. Non-sampled spans are still not reported to the APM Server, but metrics will be tracked for them. @elastic/apm-agent-devs If that's a blocker, now is the time to raise your voice. Another note: the MVP excludes the RUM agent, as it does not seem feasible for it to track metrics in the browser. In later stages, the APM Server may track the metrics for RUM, but that's another discussion.

Update: agents can actually also start with just collecting timing information for sampled spans, although eventually it's better to also track breakdown metrics for all transactions. The reason is that this works better once we have non-percentage-based sampling in place. For example, when there are different sampling rates for different transactions, or if we rate-limit per unique transaction name, the breakdown would be skewed unfairly towards transactions with a lower throughput (because proportionally fewer of the high-throughput transactions get sampled).
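To illustrate the skew with made-up numbers: under a per-name rate limit, a high-throughput and a low-throughput transaction group can end up contributing the same number of samples, so a breakdown computed only from sampled data would weight them equally.

```java
// Illustration with made-up numbers: two transaction groups, rate-limited
// to at most 10 sampled transactions per second per unique name.
public class SamplingSkewExample {
    public static void main(String[] args) {
        double throughputA = 1000; // tx/s, high-throughput endpoint
        double throughputB = 10;   // tx/s, low-throughput endpoint
        double rateLimitPerName = 10; // sampled tx/s per unique name

        double sampledA = Math.min(throughputA, rateLimitPerName); // 10
        double sampledB = Math.min(throughputB, rateLimitPerName); // 10

        // True share of the overall load vs. share seen in sampled data only:
        System.out.printf("A: true %.0f%%, sampled %.0f%%%n",
                100 * throughputA / (throughputA + throughputB),   // ~99%
                100 * sampledA / (sampledA + sampledB));           // 50%
        System.out.printf("B: true %.0f%%, sampled %.0f%%%n",
                100 * throughputB / (throughputA + throughputB),   // ~1%
                100 * sampledB / (sampledA + sampledB));           // 50%
    }
}
```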

I'm currently working out the details of this approach and will update this issue accordingly.


felixbarny commented Apr 5, 2019

This is the revised, metric-based concept, which can be copy/pasted into Kibana Dev Tools to try it out with example data:

https://docs.google.com/document/d/1-_LuC9zhmva0VvLgtI0KcHuLzNztPHbcM0ZdlcPUl64#heading=h.nmgmsomgvucn


jalvz commented Apr 8, 2019

Both approaches for top labels sound fine to me, but I'd prefer to update the spec.


watson commented Apr 8, 2019

I understand the use case for wanting to record timings for all spans here, but I'd currently consider this a blocker for the Node.js agent as I think the overhead of instantiating non-sampled spans would currently be too large. For this to be efficient, the Node.js agent would first have to implement an object-reuse pool similar to the one that the Java agent has. That would be really nice to have in the Node.js agent for other reasons as well, of course, so maybe now is the time to prioritize implementing it.
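For context, this is the general idea of an object-reuse pool; just a generic sketch, not the Java agent's actual implementation:

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.function.Consumer;
import java.util.function.Supplier;

// Generic sketch of an object-reuse pool (not the actual agent code):
// recycling span instances avoids per-span allocations, which is what makes
// timing non-sampled spans cheap enough to be acceptable.
public class ObjectPool<T> {

    private final BlockingQueue<T> freeList;
    private final Supplier<T> factory;   // creates a new instance when the pool is empty
    private final Consumer<T> resetter;  // clears an instance's state before reuse

    public ObjectPool(int capacity, Supplier<T> factory, Consumer<T> resetter) {
        this.freeList = new ArrayBlockingQueue<>(capacity);
        this.factory = factory;
        this.resetter = resetter;
    }

    public T acquire() {
        T instance = freeList.poll();
        // Fall back to allocation when the pool is empty.
        return instance != null ? instance : factory.get();
    }

    public void release(T instance) {
        resetter.accept(instance);
        // If the pool is already full, drop the instance and let the GC collect it.
        freeList.offer(instance);
    }
}
```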


axw commented Apr 9, 2019

transaction.name and transaction.type have to be on the top level as opposed to being nested under tags/labels. That is because they have to match the field names for regular transaction documents so that there is unified behavior when using the query bar to filter for a particular transaction type or name. I'm not yet sure how to bring those to the top level. Some ideas:

I'd prefer to extend the metricset spec. Feels like it'll grow better with age, if/when we add other well-defined fields. It would also avoid surprising behaviour when moving from one version of the server to another, causing labels that were previously stored as labels to start being stored as some other field. (Perhaps a vanishingly unlikely scenario?)

I'm still not too sure how we're going to be able to calculate self_time well in the Go agent, particularly for custom instrumentation. @felixbarny do you have an idea of what you'll do for custom spans? My current idea is to start out assuming all spans are synchronous, and provide the user with a means of flagging asynchronous spans.

@felixbarny
Member Author

I understand the use case for wanting to record timings for all spans here, but I'd currently consider this a blocker for the Node.js agent as I think the overhead of instantiating non-sampled spans would currently be too large.

Thinking about it again, it's actually not strictly required to also time non-sampled spans. However, it would have to be consistent: if an agent doesn't time non-sampled spans, it should not track self_time for non-sampled transactions either.

I think it's ok that some agents only time sampled transactions and some all. Currently, we don't have a view which mixes data from different agents.

But if we want to break down a complete trace, for example, we might have to align on one approach. That is much further out, though. It would require propagating the trace name downstream and the transaction duration upstream, which requires significant agent alignment and might warrant a major version bump anyway. Also, we might still get away with agents choosing different approaches (timing only sampled vs. all transactions), as we could do a breakdown per service, calculate the percentages and then aggregate the percentages, using the average duration per service and time bucket as a weight.
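A rough sketch of that weighting idea, with made-up numbers (the durations and percentages are purely illustrative):

```java
// Hypothetical sketch: combine per-service breakdown percentages into a
// trace-level breakdown, weighting each service by its average transaction
// duration in the time bucket (all numbers are made up).
public class WeightedBreakdownExample {
    public static void main(String[] args) {
        // service -> { avg transaction duration in ms, share of time spent in "db" }
        double[][] services = {
                { 200, 0.60 },  // service A: 200ms avg, 60% db
                {  20, 0.10 },  // service B: 20ms avg, 10% db
        };

        double weightedDbShare = 0, totalWeight = 0;
        for (double[] s : services) {
            weightedDbShare += s[0] * s[1];
            totalWeight += s[0];
        }
        // (200*0.6 + 20*0.1) / 220 ≈ 0.55 -> roughly 55% of the trace time is db
        System.out.printf("db share across the trace: %.0f%%%n",
                100 * weightedDbShare / totalWeight);
    }
}
```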


watson commented Apr 9, 2019

If my understanding of self_time is correct, I don't think we can support that in the Node.js agent either, unfortunately.


felixbarny commented Apr 9, 2019

I'm still not too sure how we're going to be able to calculate self_time well in the Go agent, particularly for custom instrumentation.

Could we define self-time as

The time which can't be explained by child spans

With this definition, it would not matter if a child is async or not.

Example A:

[ s1            ]
|--[ s1-1  ]
`----[ s1-2  ]

Span s1 has two async children: s1-1 and s1-2. The self-time of s1 is [s1.start..s1-1.start] + [s1-2.end..s1.end].
As both s1-1 and s1-2 have no children, their self-time is equal to their duration.

Example B:

[ s1            ]
`--[ s1-1             ]

In this example, the self-time of s1 is [s1.start..s1-1.start].
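To make the interval arithmetic concrete, here is the same calculation with made-up timestamps for both examples:

```java
// Made-up timestamps (ms) for the two examples above, just to make the
// interval arithmetic concrete.
public class SelfTimeExamples {
    public static void main(String[] args) {
        // Example A: s1 = [0, 100], s1-1 = [10, 60], s1-2 = [30, 80]
        // The children together cover [10, 80], so
        // self_time(s1) = (10 - 0) + (100 - 80) = 30
        long selfTimeA = (10 - 0) + (100 - 80);

        // Example B: s1 = [0, 100], s1-1 = [20, 150] (ends after its parent)
        // Only [0, 20] is unexplained by children within s1's duration, so
        // self_time(s1) = 20 - 0 = 20
        long selfTimeB = 20 - 0;

        System.out.println("Example A self_time: " + selfTimeA + "ms"); // 30ms
        System.out.println("Example B self_time: " + selfTimeB + "ms"); // 20ms
    }
}
```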

The downside of that definition is that it may wrongly assume that a parent span is idle waiting for its child spans to complete when it actually is not. But I don't think there is a generic way to find that out anyway, even when taking the sync flag into consideration, so it may be a good enough compromise. The problem is that sync=false does not mean the parent is not blocked.

Consider these scenarios in a reactive environment like Node.js, where there's only one thread and all spans in these example scenarios are async:

  • A child span blocks the progress of the parent because it's blocking the event thread by doing a long-running, CPU-heavy operation
  • The child span only starts an external call. When the response comes in, the span is ended and the parent can resume.
    • Scenario a) the parent didn't have anything meaningful to do in the meantime
    • Scenario b) the parent did some CPU-heavy work in the meantime

I conclude that even for async child spans, we can't assume that the parent span is not blocked waiting for the child,
even though it could potentially do useful work (intuitively contributing to self-time) in the meantime.
If that is problematic for a particular use case, users could just create a custom span which has its own self-time.

This can be implemented by a "reentrant timer" present on each transaction and span. It starts with a nesting level of 0. When a child span is started, it notifies its parent about the start event; the parent increases the nesting level and captures the current time. If another (async) child starts while the nesting level is above 0, the nesting level increases but the current time is not captured again. Whenever a child span ends, it notifies its parent, which decreases the nesting level. When the nesting level drops back to 0, or when the span itself ends, the elapsed time since the captured start timestamp is calculated and added to a childDurations instance variable on the span. The self-time of a span is then calculated as duration - childDurations.
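A minimal, single-threaded sketch of such a reentrant timer (the names ChildTimer, onChildStart, onChildEnd and selfTimeNs are hypothetical, not the agent's actual API):

```java
// Minimal, single-threaded sketch of the "reentrant timer" described above.
public class ChildTimer {

    private int nestingLevel = 0;     // number of currently running children
    private long startTimestampNs;    // set when the nesting level goes 0 -> 1
    private long childDurationsNs;    // total time covered by child spans

    // Called by a child span when it starts.
    public void onChildStart(long timestampNs) {
        if (nestingLevel++ == 0) {
            // First (outermost) child: capture the start of the covered interval.
            startTimestampNs = timestampNs;
        }
        // Further overlapping (async) children only bump the nesting level.
    }

    // Called by a child span when it ends.
    public void onChildEnd(long timestampNs) {
        if (--nestingLevel == 0) {
            // All children are done: close the covered interval.
            childDurationsNs += timestampNs - startTimestampNs;
        }
    }

    // Called when the span/transaction that owns this timer ends.
    public long selfTimeNs(long spanStartNs, long spanEndNs) {
        if (nestingLevel > 0) {
            // Some children are still running: only count child time up to the
            // parent's end, as in Example B above.
            childDurationsNs += spanEndNs - startTimestampNs;
            nestingLevel = 0;
        }
        return (spanEndNs - spanStartNs) - childDurationsNs;
    }
}
```

Running Example A through this sketch (parent from 0 to 100, children from 10 to 60 and from 30 to 80) gives childDurations = 70 and a self-time of 30, matching the interval arithmetic above.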

@felixbarny
Member Author

Another important note is that agents don't have to implement it by 7.1. But the UI and APM Server should be ready to accept and process the data.


watson commented Apr 9, 2019

Meeting summary from the Node.js agent perspective: For now, we can calculate these metrics based on sampled transactions/spans only. As long as we're only doing random sampling, this should be ok.


hmdhk commented Apr 10, 2019 via email


felixbarny commented Apr 10, 2019

Edit: update moved to #70 (comment)

@formgeist
Contributor

Checked the design issue and added links to the two implementation issues for the graph and the KPI stats component.
