Motivation / summary
Elastic APM currently records every transaction as an individual document, including "unsampled transactions". To compute statistics, we use Elasticsearch's aggregations framework to query over these documents in real time.
This is a simple approach, and permits aggregation over arbitrary dimensions (filtering criteria), but it also comes with downsides such as higher storage cost and slower aggregation/query performance.
See also: elastic/apm#104.
We will introduce an option to record pre-aggregated transaction duration histograms, using the histogram field type introduced in Elasticsearch 7.6.
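As a rough illustration of what a pre-aggregated document could look like, the sketch below builds a mapping and a sample document for a `histogram` field. The field path follows the `transaction.duration.histogram` name proposed in this issue; the helper names `histogram_mapping` and `histogram_document` are hypothetical, and the values are made up. The `values`/`counts` layout is the one the Elasticsearch `histogram` field type expects.

```python
def histogram_mapping():
    """Index mapping declaring transaction.duration.histogram as a histogram field."""
    return {
        "mappings": {
            "properties": {
                "transaction": {
                    "properties": {
                        "duration": {
                            "properties": {
                                "histogram": {"type": "histogram"}
                            }
                        }
                    }
                }
            }
        }
    }


def histogram_document(values, counts):
    """Build a document body holding one pre-aggregated histogram.

    `values` are the bucket values (ascending) and `counts` the number of
    observations per bucket; both arrays must have the same length.
    """
    values = list(values)
    counts = list(counts)
    if len(values) != len(counts):
        raise ValueError("values and counts must have the same length")
    if values != sorted(values):
        raise ValueError("values must be in ascending order")
    if any(c < 0 for c in counts):
        raise ValueError("counts must be non-negative")
    return {
        "transaction": {
            "duration": {"histogram": {"values": values, "counts": counts}}
        }
    }


if __name__ == "__main__":
    import json

    # Illustrative numbers only: three buckets of transaction durations.
    print(json.dumps(histogram_document([1000.0, 2000.0, 5000.0], [12, 7, 1])))
```

A single document like this stands in for many individual transaction documents, which is where the query performance benefit would come from.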
Approach
The APM Server will take responsibility for aggregating and producing these histogram metrics. We may later also support agents producing the metrics in order to avoid sending unsampled transactions to the APM Server, but this is initially out of scope. To support that we would have the agents set a flag on events indicating that they have already been aggregated into a histogram.
Enabling transaction histogram metrics will not initially cause unsampled transactions to be dropped, so there will be no storage reduction -- only improved query performance. We will introduce a separate option for dropping unsampled transactions, and in a future major version (e.g. 8.0) we may start dropping unsampled transactions by default.
The histogram field will be used to power most if not all aggregations used in the APM UI, when the search bar is not in use. When the search bar is in use, we will fall back to the existing approach of querying over the individual documents. For the case where the histogram fields are used, the histogram metric documents must also include the context fields used by the default filters in the APM UI.
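The fallback described above could be sketched as follows. This is only an assumption about how the UI might build its requests: the function name `build_latency_request` is hypothetical, `search_filter` stands for whatever Elasticsearch query clause the search bar input is translated into, and `transaction.duration.us` is assumed to be the duration field on individual transaction documents. The `percentiles` aggregation is used because it accepts histogram fields as well as numeric fields.

```python
def build_latency_request(search_filter=None, percents=(50, 95, 99)):
    """Build a search body for latency percentiles.

    Without a search-bar filter, aggregate over the pre-aggregated
    histogram field; with one, fall back to the individual transaction
    documents so that arbitrary filters keep working.
    """
    if search_filter is None:
        field = "transaction.duration.histogram"  # pre-aggregated metrics
    else:
        field = "transaction.duration.us"  # assumed per-transaction field

    body = {
        "size": 0,
        "aggs": {
            "latency": {
                "percentiles": {"field": field, "percents": list(percents)}
            }
        },
    }
    if search_filter is not None:
        body["query"] = search_filter
    return body


if __name__ == "__main__":
    # Search bar empty: histogram-backed request.
    print(build_latency_request())
    # Search bar in use: transaction-document-backed request.
    print(build_latency_request({"term": {"user.id": "42"}}))
```

In practice the two requests would also target different sets of documents (metric documents versus transaction documents); how those are distinguished is not specified here.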
To support identifying trace groups (i.e. root transaction groups), we will flag the documents relating to root transactions with transaction.root: true.
Thus, the APM Server must record histograms for each observed combination of the following fields:
- agent.name
- service.name
- service.version (for deployment annotations)
- service.environment
- transaction.name
- transaction.type
- transaction.result
- transaction.root
- host.hostname
- container.id
- kubernetes.pod.name
The histogram field itself will be called transaction.duration.histogram. The exact algorithm and parameters to be used are TBD, but following suit with Elasticsearch and using HDRHistogram is likely.
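To make the grouping concrete, here is a minimal sketch of aggregating transactions into one histogram per observed combination of the fields listed above. It is not the proposed implementation: instead of HDRHistogram it uses a crude stand-in that truncates durations to two significant digits, events are represented as flat dictionaries with dotted keys, and `transaction.duration.us` is an assumed name for the per-transaction duration in microseconds. All function names are hypothetical.

```python
from collections import Counter, defaultdict

# Aggregation dimensions taken from the list in this proposal.
DIMENSIONS = (
    "agent.name",
    "service.name",
    "service.version",
    "service.environment",
    "transaction.name",
    "transaction.type",
    "transaction.result",
    "transaction.root",
    "host.hostname",
    "container.id",
    "kubernetes.pod.name",
)


def bucket(duration_us, sig_digits=2):
    """Truncate a duration to `sig_digits` significant digits.

    A simple stand-in for HDRHistogram bucketing, not the real algorithm.
    """
    duration_us = max(int(duration_us), 0)
    magnitude = 10 ** max(len(str(duration_us)) - sig_digits, 0)
    return (duration_us // magnitude) * magnitude


def aggregate(transactions):
    """Group transactions by dimension values and count durations per bucket."""
    histograms = defaultdict(Counter)
    for tx in transactions:
        key = tuple(tx.get(dim) for dim in DIMENSIONS)
        histograms[key][bucket(tx["transaction.duration.us"])] += 1
    return histograms


def to_documents(histograms):
    """Turn each (dimensions, histogram) pair into one metric document."""
    docs = []
    for key, counter in histograms.items():
        doc = {dim: val for dim, val in zip(DIMENSIONS, key) if val is not None}
        values = sorted(counter)
        doc["transaction.duration.histogram"] = {
            "values": values,
            "counts": [counter[v] for v in values],
        }
        docs.append(doc)
    return docs
```

Every distinct combination of dimension values produces its own document, so high-cardinality fields such as host.hostname or container.id directly multiply the number of histograms held in memory and written out.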
RUM-specific support
For RUM, the server will also need to perform GeoIP lookup and User-Agent parsing, and include their results to power RUM-specific visualisations. Specifically we would also need these fields:
- client.geo.country_iso_code
- user_agent.name
These might be tackled in a second phase, which would require the RUM visualisations to continue using the existing aggregation approach in the meantime. However, it may be too difficult to switch the UI over to using the metrics before RUM support exists, so we should aim to implement this as soon as possible to avoid delaying the UI implementation.
ML Anomaly Detection support
If one were to drop unsampled transactions, then the existing ML Anomaly Detection jobs would no longer be accurate. We will need to update the jobs to use aggregations based on the histogram field: https://www.elastic.co/guide/en/machine-learning/current/ml-configuring-aggregation.html
Support/ramifications for SIEM
SIEM currently displays APM transactions as "Events", with two visualisations:
- a bar chart of the number of events (presumably, a date_histogram)
- a list of events/transactions
Dropping unsampled transactions will naturally lead to the event list only listing events for sampled transactions. For the chart we could go one of two ways: either have it match up with the list (i.e. count only sampled transactions), or base the aggregation on the histogram field.
Proposal: keep it simple and continue to base SIEM events off transaction documents. This means that if unsampled transactions are dropped, they will not show up in SIEM's event counts or event list.
Configuration
For various reasons, this feature will be opt-in when we introduce it, and in a later major version (e.g. 8.0) we would enable it by default for the default distribution. The reasons for initially making it opt-in are:
- backwards compatibility for older versions of APM UI/Kibana
- the histogram field type is available only under the Elastic license
- using this approach effectively may require deployment changes, such as moving the APM Server to the edge machines, to avoid high-cardinality aggregation dimensions such as hostname, container name, etc.
As mentioned, we will provide separate configuration for dropping unsampled transaction documents. This is separately configurable in order to maintain the ability to use the search bar to search over both sampled and unsampled transactions, and to support the RUM-specific map visualisation.
The exact configuration names are TBD.