MVP: Breakdown graphs #70

Closed
7 tasks done
felixbarny opened this issue Apr 2, 2019 · 13 comments

felixbarny commented Apr 2, 2019

This meta issue tracks the development of a minimum viable product for breakdown graphs. The goal is to get the PRs merged into the respective repos. As some of the technical details may still change in the course of implementing this MVP, it makes sense to mark these features as incubating and to also create internal feature toggles for them. That will make it easier to test the features in combination, to minimize merge conflicts, to work in small iterations, and to decouple merging from releasing. But that's only a suggestion.

Concept: https://docs.google.com/document/d/1-_LuC9zhmva0VvLgtI0KcHuLzNztPHbcM0ZdlcPUl64

The MVP phase should at least involve the UI, the APM Server and the Java agent. Other agents are welcome to join the party as well :)


felixbarny commented Apr 5, 2019

After a conversation with @roncohen, we're going to change the approach a bit. Up until now, the concept was to base the breakdown on individual transaction documents (or rather nested-mapped breakdown documents within transactions) so that users could use the query bar to drill down by things like user.id and labels/tags. It turns out that this drill-down is not an important use case (source: @makwarth). It's more about getting a quick overview of where time is generally spent, at the service and transaction-group level.

Therefore, the metric-based approach seems more viable as it will be more space-efficient (transaction documents don't get bigger; the cost per agent is fixed as opposed to throughput-dependent) and even more accurate compared to storing the breakdown data only on sampled transactions. Note that we don't need percentiles/histograms for this use case.

It's still possible to use certain dimensions to drill down. These dimensions are everything that's available in metadata, as well as transaction.name and transaction.type. Searching for a particular user.id yields an empty graph - that's ok.
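To make the metric-based approach a bit more tangible, here is a minimal sketch of how an agent could accumulate breakdown timings per dimension combination instead of attaching them to transaction documents. The class and field names are hypothetical; the actual metricset format is defined in the design doc linked above.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.LongAdder;

// Hypothetical sketch: breakdown timings are accumulated per dimension
// combination (transaction.name, transaction.type, span.type) instead of
// being attached to individual transaction documents.
public class BreakdownMetrics {

    // One counter pair (count + total self_time in microseconds) per key.
    private final Map<String, Timing> timings = new ConcurrentHashMap<>();

    public void addSelfTime(String transactionName, String transactionType,
                            String spanType, long selfTimeUs) {
        String key = transactionName + '|' + transactionType + '|' + spanType;
        timings.computeIfAbsent(key, k -> new Timing()).add(selfTimeUs);
    }

    // Called periodically (e.g. every 30s) to turn the accumulated values into
    // metricset documents for the APM Server; the reporting interface here is
    // illustrative only, the real field names are defined in the spec.
    public void reportAndReset(MetricReporter reporter) {
        timings.forEach((key, timing) ->
                reporter.report(key, timing.count.sumThenReset(), timing.sumUs.sumThenReset()));
    }

    static final class Timing {
        final LongAdder count = new LongAdder();
        final LongAdder sumUs = new LongAdder();

        void add(long selfTimeUs) {
            count.increment();
            sumUs.add(selfTimeUs);
        }
    }

    public interface MetricReporter {
        void report(String key, long count, long sumUs);
    }
}
```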

Implications for agents:
It will no longer be possible to have true noop spans. With this approach, every span has to be timed, even spans for non-sampled transactions. Non-sampled spans are still not reported to the APM Server, but metrics will be tracked for them. @elastic/apm-agent-devs If that's a blocker, now is the time to raise your voice. Another note: the MVP excludes the RUM agent, as it does not seem feasible for it to track metrics in the browser. In later stages, the APM Server may track the metrics for RUM, but that's another discussion.

Update: agents can actually also start with just collecting timing information for sampled spans, although eventually it's better to also track breakdown metrics for all transactions. The reason is that this works better once we have non-percentage-based sampling in place. For example, when there are different sampling rates for different transactions, or if we rate-limit per unique transaction name, the breakdown would be skewed unfairly towards transactions with a lower throughput (because proportionally fewer of the high-throughput transactions get sampled).
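To illustrate the skew with made-up numbers: under a per-name rate limit, a high-throughput and a low-throughput transaction group can end up contributing the same number of samples, so a breakdown computed only from sampled data would weight them equally.

```java
// Illustration with made-up numbers: two transaction groups, rate-limited
// to at most 10 sampled transactions per second per unique name.
public class SamplingSkewExample {
    public static void main(String[] args) {
        double throughputA = 1000; // tx/s, high-throughput endpoint
        double throughputB = 10;   // tx/s, low-throughput endpoint
        double rateLimitPerName = 10; // sampled tx/s per unique name

        double sampledA = Math.min(throughputA, rateLimitPerName); // 10
        double sampledB = Math.min(throughputB, rateLimitPerName); // 10

        // True share of the overall load vs. share seen in sampled data only:
        System.out.printf("A: true %.0f%%, sampled %.0f%%%n",
                100 * throughputA / (throughputA + throughputB),   // ~99%
                100 * sampledA / (sampledA + sampledB));           // 50%
        System.out.printf("B: true %.0f%%, sampled %.0f%%%n",
                100 * throughputB / (throughputA + throughputB),   // ~1%
                100 * sampledB / (sampledA + sampledB));           // 50%
    }
}
```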

I'm currently working out the details of this approach and will update this issue accordingly.


felixbarny commented Apr 5, 2019

This is the revised, metric-based concept, which can be copy/pasted into Kibana Dev Tools to try it out with example data:

https://docs.google.com/document/d/1-_LuC9zhmva0VvLgtI0KcHuLzNztPHbcM0ZdlcPUl64#heading=h.nmgmsomgvucn


jalvz commented Apr 8, 2019

Both approaches for top labels sound fine to me, but I'd prefer to update the spec.


watson commented Apr 8, 2019

I understand the use case for wanting to record timings for all spans here, but I'd currently consider this a blocker for the Node.js agent as I think the overhead of instantiating non-sampled spans would currently be too large. For this to be efficient, the Node.js agent would first have to implement an object-reuse pool similar to the one that the Java agent has. That would be really nice to have in the Node.js agent for other reasons as well, of course, so maybe now is the time to prioritize implementing it.
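For context, this is the general idea of an object-reuse pool; just a generic sketch, not the Java agent's actual implementation:

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.function.Consumer;
import java.util.function.Supplier;

// Generic sketch of an object-reuse pool (not the actual agent code):
// recycling span instances avoids per-span allocations, which is what makes
// timing non-sampled spans cheap enough to be acceptable.
public class ObjectPool<T> {

    private final BlockingQueue<T> freeList;
    private final Supplier<T> factory;   // creates a new instance when the pool is empty
    private final Consumer<T> resetter;  // clears an instance's state before reuse

    public ObjectPool(int capacity, Supplier<T> factory, Consumer<T> resetter) {
        this.freeList = new ArrayBlockingQueue<>(capacity);
        this.factory = factory;
        this.resetter = resetter;
    }

    public T acquire() {
        T instance = freeList.poll();
        // Fall back to allocation when the pool is empty.
        return instance != null ? instance : factory.get();
    }

    public void release(T instance) {
        resetter.accept(instance);
        // If the pool is already full, drop the instance and let the GC collect it.
        freeList.offer(instance);
    }
}
```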


axw commented Apr 9, 2019

transaction.name and transaction.type have to be on the top level as opposed to being nested under tags/labels. That is because they have to match the field names for regular transaction documents so that there is unified behavior when using the query bar to filter for a particular transaction type or name. I'm not yet sure how to bring those to the top level. Some ideas:

I'd prefer to extend the metricset spec. Feels like it'll grow better with age, if/when we add other well-defined fields. It would also avoid surprising behaviour when moving from one version of the server to another, causing labels that were previously stored as labels to start being stored as some other field. (Perhaps a vanishingly unlikely scenario?)

I'm still not too sure how we're going to be able to calculate self_time well in the Go agent, particularly for custom instrumentation. @felixbarny do you have an idea of what you'll do for custom spans? My current idea is to start out assuming all spans are synchronous, and provide the user with a means of flagging asynchronous spans.

@felixbarny
Member Author

I understand the use case for wanting to record timings for all spans here, but I'd currently consider this a blocker for the Node.js agent as I think the overhead of instantiating non-sampled spans would currently be too large.

Thinking about it again, it's actually not strictly required to also time non-sampled spans. However, it would have to be consistent: if an agent doesn't time non-sampled spans, it should not track self_time for non-sampled transactions either.

I think it's ok that some agents only time sampled transactions and some all. Currently, we don't have a view which mixes data from different agents.

But if we want to break down a complete trace, for example, we might have to align on one approach. That is much further out, though. It would require propagating the trace name downstream and the transaction duration upstream, which requires significant agent alignment and might warrant a major version bump anyway. Also, we might still get away with agents choosing different approaches (timing only sampled vs. all transactions), as we could do a breakdown per service, calculate the percentages and then aggregate the percentages, using the average duration per service and time bucket as a weight.
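A rough sketch of that weighting idea, with made-up numbers (the durations and percentages are purely illustrative):

```java
// Hypothetical sketch: combine per-service breakdown percentages into a
// trace-level breakdown, weighting each service by its average transaction
// duration in the time bucket (all numbers are made up).
public class WeightedBreakdownExample {
    public static void main(String[] args) {
        // service -> { avg transaction duration in ms, share of time spent in "db" }
        double[][] services = {
                { 200, 0.60 },  // service A: 200ms avg, 60% db
                {  20, 0.10 },  // service B: 20ms avg, 10% db
        };

        double weightedDbShare = 0, totalWeight = 0;
        for (double[] s : services) {
            weightedDbShare += s[0] * s[1];
            totalWeight += s[0];
        }
        // (200*0.6 + 20*0.1) / 220 ≈ 0.55 -> roughly 55% of the trace time is db
        System.out.printf("db share across the trace: %.0f%%%n",
                100 * weightedDbShare / totalWeight);
    }
}
```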


watson commented Apr 9, 2019

If my understanding of self_time is correct, I don't think we can support that in the Node.js agent either, unfortunately.


felixbarny commented Apr 9, 2019

I'm still not too sure how we're going to be able to calculate self_time well in the Go agent, particularly for custom instrumentation.

Could we define self-time as

The time which can't be explained by child spans

With this definition, it would not matter if a child is async or not.

Example A:

[ s1            ]
|--[ s1-1  ]
`----[ s1-2  ]

Span s1 has two async children: s1-1 and s1-2. The self-time of s1 is [s1.start..s1-1.start] + [s1-2.end..s1.end].
As both s1-1 and s1-2 have no children, their self-time is equal to their duration.

Example B:

[ s1            ]
`--[ s1-1             ]

In this example, the self-time of s1 is [s1.start..s1-1.start].
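To make the interval arithmetic concrete, here is the same calculation with made-up timestamps for both examples:

```java
// Made-up timestamps (ms) for the two examples above, just to make the
// interval arithmetic concrete.
public class SelfTimeExamples {
    public static void main(String[] args) {
        // Example A: s1 = [0, 100], s1-1 = [10, 60], s1-2 = [30, 80]
        // The children together cover [10, 80], so
        // self_time(s1) = (10 - 0) + (100 - 80) = 30
        long selfTimeA = (10 - 0) + (100 - 80);

        // Example B: s1 = [0, 100], s1-1 = [20, 150] (ends after its parent)
        // Only [0, 20] is unexplained by children within s1's duration, so
        // self_time(s1) = 20 - 0 = 20
        long selfTimeB = 20 - 0;

        System.out.println("Example A self_time: " + selfTimeA + "ms"); // 30ms
        System.out.println("Example B self_time: " + selfTimeB + "ms"); // 20ms
    }
}
```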

The downside of that definition is that it may wrongly assume that a parent span is idle waiting for its child spans to complete when it actually is not. But I don't think there is a generic way to find that out anyway, even when taking the sync flag into consideration, so it may be a good enough compromise. The problem is that sync=false does not mean the parent is not blocked.

Consider these scenarios in a reactive environment like Node.js, where there's only one thread and all spans in these example scenarios are async:

  • A child span blocks the progress of the parent because it's blocking the event thread by doing a long-running, CPU-heavy operation
  • The child span only starts an external call. When the response comes in, the span is ended and the parent can resume.
    • Scenario a) the parent didn't have anything meaningful to do in the meantime
    • Scenario b) the parent did some CPU-heavy work in the meantime

I conclude that even for async child spans, we can't assume that the parent span is not blocked waiting for the child,
even though it could potentially do useful work (intuitively contributing to self-time) in the meantime.
If that is problematic for a particular use case, users could just create a custom span which has its own self-time.

This can be implemented by a "reentrant timer" present on each transaction and span. It starts with a nesting level of 0. When a child span is started, it notifies its parent about the start event; the parent increases the nesting level and captures the current time. If another (async) child starts while the nesting level is above 0, the nesting level increases but the current time is not captured again. Whenever a child span ends, it notifies its parent, which decreases the nesting level. When the nesting level drops back to 0, or when the span itself ends, the elapsed time since the captured start timestamp is calculated and added to a childDurations instance variable on the span. The self-time of a span is then calculated as duration - childDurations.
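A minimal, single-threaded sketch of such a reentrant timer (the names ChildTimer, onChildStart, onChildEnd and selfTimeNs are hypothetical, not the agent's actual API):

```java
// Minimal, single-threaded sketch of the "reentrant timer" described above.
public class ChildTimer {

    private int nestingLevel = 0;     // number of currently running children
    private long startTimestampNs;    // set when the nesting level goes 0 -> 1
    private long childDurationsNs;    // total time covered by child spans

    // Called by a child span when it starts.
    public void onChildStart(long timestampNs) {
        if (nestingLevel++ == 0) {
            // First (outermost) child: capture the start of the covered interval.
            startTimestampNs = timestampNs;
        }
        // Further overlapping (async) children only bump the nesting level.
    }

    // Called by a child span when it ends.
    public void onChildEnd(long timestampNs) {
        if (--nestingLevel == 0) {
            // All children are done: close the covered interval.
            childDurationsNs += timestampNs - startTimestampNs;
        }
    }

    // Called when the span/transaction that owns this timer ends.
    public long selfTimeNs(long spanStartNs, long spanEndNs) {
        if (nestingLevel > 0) {
            // Some children are still running: only count child time up to the
            // parent's end, as in Example B above.
            childDurationsNs += spanEndNs - startTimestampNs;
            nestingLevel = 0;
        }
        return (spanEndNs - spanStartNs) - childDurationsNs;
    }
}
```

Running Example A through this sketch (parent from 0 to 100, children from 10 to 60 and from 30 to 80) gives childDurations = 70 and a self-time of 30, matching the interval arithmetic above.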

@felixbarny
Member Author

Another important note is that agents don't have to implement it by 7.1. But the UI and APM Server should be ready to accept and process the data.


watson commented Apr 9, 2019

Meeting summary from the Node.js agent perspective: For now, we can calculate these metrics based on sampled transactions/spans only. As long as we're only doing random sampling, this should be ok.


hmdhk commented Apr 10, 2019 via email


felixbarny commented Apr 10, 2019

Edit: update moved to #70 (comment)

@formgeist
Contributor

Checked the design issue and added links to the two implementation issues for the graph and the KPI stats component.
