Introduce transaction histogram metrics

## Motivation / summary

Elastic APM currently records every transaction as an individual document, including "unsampled" transactions. To compute statistics, we use Elasticsearch's aggregations framework to query over these documents in real time.

This is a simple approach, and permits aggregation over arbitrary dimensions (filtering criteria), but it also comes with downsides: higher storage cost and slower aggregation/query performance.

See also: https://github.com/elastic/apm/issues/104.

We will introduce an option to record pre-aggregated transaction duration histograms, using the [histogram field type](https://www.elastic.co/guide/en/elasticsearch/reference/current/histogram.html) introduced in Elasticsearch 7.6.
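As a rough illustration of what a pre-aggregated document could look like, the sketch below builds the paired `values`/`counts` arrays that the `histogram` field type stores. The field name follows this proposal; the helper name `make_histogram_field`, the sample buckets, and the use of microseconds as the unit are assumptions made only for the example.

```python
# Sketch only: shape of a document using the Elasticsearch `histogram` field type.
# Bucket values are made-up sample data, assumed here to be durations in microseconds.

def make_histogram_field(buckets):
    """Convert {bucket_value: count} into the paired arrays a histogram field expects.

    `values` must be in ascending order and `counts` must line up with it by position.
    """
    ordered = sorted(buckets)
    return {
        "values": [float(v) for v in ordered],
        "counts": [int(buckets[v]) for v in ordered],
    }

# Mapping fragment declaring the field as type "histogram".
mapping = {
    "properties": {
        "transaction": {
            "properties": {
                "duration": {
                    "properties": {"histogram": {"type": "histogram"}}
                }
            }
        }
    }
}

# One pre-aggregated metric document: 16 transactions summarised in 3 buckets.
doc = {
    "transaction": {
        "duration": {
            "histogram": make_histogram_field({5000: 3, 1000: 12, 250000: 1})
        }
    }
}
```

A single document like this stands in for many transaction documents, which is where the query-time savings would come from.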

## Approach

The APM Server will take responsibility for aggregating and producing these histogram metrics. We may later also support agents producing the metrics in order to avoid sending unsampled transactions to the APM Server, but this is initially out of scope. To support that we would have the agents set a flag on events indicating that they have already been aggregated into a histogram.

Enabling transaction histogram metrics will not initially cause unsampled transactions to be dropped, so there will be no storage reduction -- only improved query performance. We will introduce a separate option for dropping unsampled transactions, and in a future major version (e.g. 8.0) we may start dropping unsampled transactions by default.

The histogram field will be used to power most if not all aggregations used in the APM UI, when the search bar is _not_ in use. When the search bar is in use, we will fall back to the existing approach of querying over the individual documents. For the case where the histogram fields are used, the histogram metric documents must also include the context fields used by the default filters in the APM UI.
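The fallback described above might be sketched as follows. This is not the APM UI's implementation: the function name `build_latency_request`, the source labels, the use of `transaction.duration.us` as the per-document duration field, and the `query_string` clause standing in for the search bar query are all assumptions for illustration. The `percentiles` aggregation is used because it can operate on histogram fields as well as numeric fields.

```python
# Sketch only: choose between pre-aggregated histograms and raw transaction documents.

def build_latency_request(search_bar_query=None):
    """Return (source, request_body) for a latency percentile query.

    With no search bar query, aggregate over the histogram metric documents.
    With a search bar query, fall back to the individual transaction documents.
    """
    if search_bar_query:
        source = "transaction_documents"      # hypothetical label
        field = "transaction.duration.us"     # assumed per-document duration field
        query = {"query_string": {"query": search_bar_query}}  # placeholder clause
    else:
        source = "histogram_metrics"          # hypothetical label
        field = "transaction.duration.histogram"
        query = {"match_all": {}}

    body = {
        "size": 0,
        "query": query,
        "aggs": {
            "latency": {"percentiles": {"field": field, "percents": [50, 95, 99]}}
        },
    }
    return source, body
```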

To support identifying trace groups (i.e. root transaction groups), we will flag the documents relating to root transactions with `transaction.root: true`.

Thus, the APM Server must record histograms for each observed combination of the following fields:

 - agent.name
 - service.name
 - service.version (for deployment annotations)
 - service.environment
 - transaction.name
 - transaction.type
 - transaction.result
 - transaction.root
 - host.hostname
 - container.id
 - kubernetes.pod.name
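A minimal sketch of grouping by this field combination is shown below. It assumes transactions arrive as flat dictionaries with dotted keys and a `transaction.duration.us` value, which is a simplification for the example; `aggregation_key` and `aggregate` are hypothetical names, not APM Server functions.

```python
# Sketch only: one duration counter per observed combination of the key fields.
from collections import Counter, defaultdict

KEY_FIELDS = (
    "agent.name",
    "service.name",
    "service.version",
    "service.environment",
    "transaction.name",
    "transaction.type",
    "transaction.result",
    "transaction.root",
    "host.hostname",
    "container.id",
    "kubernetes.pod.name",
)

def aggregation_key(txn):
    """Build the grouping key; fields missing from the event become None."""
    return tuple(txn.get(field) for field in KEY_FIELDS)

def aggregate(transactions):
    """Map each key to a Counter of {duration_us: number_of_transactions}."""
    groups = defaultdict(Counter)
    for txn in transactions:
        groups[aggregation_key(txn)][txn["transaction.duration.us"]] += 1
    return groups
```

Every distinct key yields its own metric document, so high-cardinality fields such as `host.hostname` or `container.id` directly multiply the number of documents produced.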

The histogram field itself will be called `transaction.duration.histogram`. The exact algorithm and parameters to be used are TBD, but following suit with Elasticsearch and using HDRHistogram is likely.
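Since the algorithm is undecided, the following is only a simplified stand-in that shows the general idea of bounding relative error by keeping a fixed number of significant digits, which is the precision parameter HDRHistogram exposes. It is not HDRHistogram's actual bucketing scheme; `bucket_lower_bound`, `build_histogram`, the default of two digits, and the microsecond unit are assumptions for the example.

```python
# Sketch only: decimal significant-digit bucketing, NOT the real HDRHistogram algorithm.
from collections import Counter

def bucket_lower_bound(duration_us, digits=2):
    """Truncate a positive integer duration to `digits` significant decimal digits."""
    if duration_us <= 0:
        return 0
    num_digits = len(str(duration_us))
    if num_digits <= digits:
        return duration_us
    step = 10 ** (num_digits - digits)
    return (duration_us // step) * step

def build_histogram(durations_us, digits=2):
    """Bucket durations and return the paired values/counts arrays."""
    counts = Counter(bucket_lower_bound(d, digits) for d in durations_us)
    ordered = sorted(counts)
    return {
        "values": [float(v) for v in ordered],
        "counts": [counts[v] for v in ordered],
    }

histogram = build_histogram([1234, 1299, 98765, 45, 1200])
# 1234, 1299 and 1200 share the 1200 bucket; 98765 falls in 98000; 45 is kept as-is.
```

Fewer significant digits means fewer buckets per document at the cost of coarser percentiles, which is the trade-off the final parameters would need to settle.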

### RUM-specific support

For RUM, the server will also need to perform GeoIP lookup and User-Agent parsing, and include their results to power RUM-specific visualisations. Specifically we would also need these fields:

 - client.geo.country_iso_code
 - user_agent.name

These might be tackled in a second phase, which would require the RUM visualisations to continue using the existing aggregation approach in the meantime. However, it may be too difficult to switch the UI over to using the metrics before RUM support exists, so we should aim to implement this as soon as possible to avoid delaying the UI implementation.

## ML Anomaly Detection support

If one were to drop unsampled transactions, then the existing ML Anomaly Detection jobs would no longer be accurate. We will need to update the jobs to use aggregations based on the histogram field: https://www.elastic.co/guide/en/machine-learning/current/ml-configuring-aggregation.html

## Support/ramifications for SIEM

SIEM currently displays APM transactions as "Events", with two visualisations:

 - a bar chart of the number of events (presumably, a date_histogram)
 - a list of events/transactions

Dropping unsampled transactions will naturally lead to the event list only listing events for sampled transactions. For the chart we could go one of two ways: either have it match up with the list (i.e. count only sampled transactions), or base the aggregation on the histogram field.

Proposal: keep it simple and continue to base SIEM events off transaction documents. This means that if unsampled transactions are dropped, they will not show up in SIEM's event counts or event list.

## Configuration

For various reasons, this feature will be opt-in when we introduce it, and in a later major version (e.g. 8.0) we would enable it by default for the default distribution. The reasons for initially making it opt-in are:

 - backwards compatibility for older versions of APM UI/Kibana
 - the histogram field type is available only under the Elastic license
 - using this approach effectively may require deployment changes, moving the APM Server to the edge machines to avoid high cardinality aggregation dimensions such as hostname, container name, etc.

As mentioned, we will provide separate configuration for dropping unsampled transaction documents. This is separately configurable in order to maintain the ability to use the search bar to search over both sampled and unsampled transactions, and to support the RUM-specific map visualisation.

The exact configuration names are TBD.
