
[GCP] Metrics are not grouped by dimension #6568

Closed
tommyers-elastic opened this issue Jun 14, 2023 · 35 comments
Labels
Team:Cloud-Monitoring Label for the Cloud Monitoring team

Comments

@tommyers-elastic
Contributor

We have noticed occurrences of metric documents being sent to ES which have the same timestamp and dimensions (and metadata). As far as we can see, each GCP metric document contains only a single metric.

There is code in the GCP metrics module in Metricbeat that is supposed to prevent this: https://github.com/elastic/beats/blob/main/x-pack/metricbeat/module/gcp/metrics/timeseries.go. It appears there is a bug in that logic, and the grouping is not being performed at all.

This was tested on the gcp.loadbalancing metricset, which uses the gcp.metrics metricset under the hood.
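For context, here is a minimal sketch of the kind of grouping the module is expected to perform: time series that share the same timestamp and dimensions should end up in one group, and therefore in one document. This is illustrative only, with made-up type and field names, not the code in timeseries.go.

```go
package main

import (
	"fmt"
	"sort"
)

// timeSeries is a simplified stand-in for the values the GCP metricset
// collects; the real type lives in the beats GCP module.
type timeSeries struct {
	Metric    string            // e.g. "https.request.count"
	Labels    map[string]string // metric + resource labels (the dimensions)
	Timestamp int64
	Value     int64
}

// groupKey builds a key from the timestamp and labels only, so different
// metrics that share the same dimensions fall into the same group and can
// be emitted as a single document.
func groupKey(ts timeSeries) string {
	keys := make([]string, 0, len(ts.Labels))
	for k := range ts.Labels {
		keys = append(keys, k)
	}
	sort.Strings(keys)

	key := fmt.Sprintf("%d", ts.Timestamp)
	for _, k := range keys {
		key += "|" + k + "=" + ts.Labels[k]
	}
	return key
}

func main() {
	series := []timeSeries{
		{Metric: "https.request.count", Labels: map[string]string{"backend_name": "lb-0"}, Timestamp: 1686700000, Value: 3},
		{Metric: "https.request.bytes", Labels: map[string]string{"backend_name": "lb-0"}, Timestamp: 1686700000, Value: 512},
	}

	groups := map[string][]timeSeries{}
	for _, ts := range series {
		k := groupKey(ts)
		groups[k] = append(groups[k], ts)
	}
	fmt.Println(len(groups)) // 1: both metrics share the same timestamp and dimensions
}
```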

@tommyers-elastic tommyers-elastic added the Team:Cloud-Monitoring Label for the Cloud Monitoring team label Jun 14, 2023
@zmoog
Contributor

zmoog commented Jun 16, 2023

Side note after some reading about GCP metrics and TSDB.

I noticed the default value for index.mapping.dimension_fields.limit is 16. Some integrations¹ bumped this value to 32. This is also mentioned in the developer_tsdb_migration_guidelines.md.

Some GCP metrics have a lot of labels. For example, the gcp.loadbalancing_metrics.https.total_latencies.value has 18 labels:

  • 6 gcp.labels.metrics.* labels
  • 12 gcp.labels.resource.* labels

[screenshot: the 18 label fields]

For some reason, there is no gcp.labels.resource.project_id field in Elasticsearch, so I can only find 17 fields there:

[screenshot: fields in Elasticsearch]

Footnotes

  ¹ InfluxDB and CockroachDB

@zmoog
Contributor

zmoog commented Jun 20, 2023

The grouping mechanism works as designed.

The grouping is handled by the timeSeriesGrouped() function. The ID() function defines the criteria used to group the values.

For the loadbalancing metrics, the metricset uses x-pack/metricbeat/module/gcp/timeseries_metadata_collector.go as the implementation of the ID() function.

By default, the ID() function generates ID values like this one:

{
  "metric": {
    "labels": {
      "cache_result": "DISABLED",
      "client_country": "United States",
      "protocol": "HTTP/1.1",
      "proxy_continent": "America",
      "response_code": "200",
      "response_code_class": "200"
    },
    "type": "loadbalancing.googleapis.com/https/request_count" <——— 👀 —————
  },
  "resource": {
    "labels": {
      "backend_name": "giz-2-http-lb-0",
      "backend_scope": "us-central1-a",
      "backend_scope_type": "ZONE",
      "backend_target_name": "giz-2-http-lb",
      "backend_target_type": "BACKEND_SERVICE",
      "backend_type": "INSTANCE_GROUP",
      "forwarding_rule_name": "giz-2-https-lb",
      "matched_url_path_rule": "UNMATCHED",
      "project_id": "elastic-obs-integrations-dev",
      "region": "global",
      "target_proxy_name": "giz-2-https-lb",
      "url_map_name": "giz-2-https-lb"
    },
    "type": "https_lb_rule"
  },
  "timestamp": 1687277100000000000
}

Since the ID value contains the metric type loadbalancing.googleapis.com/https/request_count, two different metrics will never be grouped together.
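To make the effect concrete, here is a small illustrative sketch (not the module's actual ID() code; the helper and field names are made up) of how including the metric type in the key prevents grouping:

```go
package main

import (
	"encoding/json"
	"fmt"
)

// id builds a grouping key from labels, timestamp, and optionally the metric
// type. This only sketches the idea shown in the JSON above.
func id(metricType string, labels map[string]string, timestamp int64, includeType bool) string {
	doc := map[string]any{
		"labels":    labels,
		"timestamp": timestamp,
	}
	if includeType {
		doc["metric_type"] = metricType
	}
	b, _ := json.Marshal(doc) // map keys are marshaled in sorted order, so the key is deterministic
	return string(b)
}

func main() {
	labels := map[string]string{"response_code": "200", "backend_target_name": "giz-2-http-lb"}

	// With the metric type in the key, request_count and request_bytes_count
	// can never share a key, so they never end up in the same document.
	fmt.Println(id("https/request_count", labels, 1687277100, true) ==
		id("https/request_bytes_count", labels, 1687277100, true)) // false

	// Without the metric type, they group together.
	fmt.Println(id("https/request_count", labels, 1687277100, false) ==
		id("https/request_bytes_count", labels, 1687277100, false)) // true
}
```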

As a quick experiment, I commented out the metric.type definition in the ID() function, generating ID values like this one:

{
  "metric": {
    "labels": {
      "cache_result": "DISABLED",
      "client_country": "United States",
      "protocol": "HTTP/1.1",
      "proxy_continent": "America",
      "response_code": "200",
      "response_code_class": "200"
    }
  },
  "resource": {
    "labels": {
      "backend_name": "giz-2-http-lb-0",
      "backend_scope": "us-central1-a",
      "backend_scope_type": "ZONE",
      "backend_target_name": "giz-2-http-lb",
      "backend_target_type": "BACKEND_SERVICE",
      "backend_type": "INSTANCE_GROUP",
      "forwarding_rule_name": "giz-2-https-lb",
      "matched_url_path_rule": "UNMATCHED",
      "project_id": "elastic-obs-integrations-dev",
      "region": "global",
      "target_proxy_name": "giz-2-https-lb",
      "url_map_name": "giz-2-https-lb"
    },
    "type": "https_lb_rule"
  },
  "timestamp": 1687279560000000000
}

With this ID, I expect time series for different metric types that share the same labels and resource to be grouped together.

Here's the final result in the debugger:

[screenshot: debugger output showing the grouped time series]

Now https.request.bytes and https.request.count go together.

@zmoog
Contributor

zmoog commented Jun 20, 2023

Here is the final events list:

[screenshot: final events list]

@tommyers-elastic
Contributor Author

Thanks for the updates @zmoog.

"Works as designed" seems partially accurate. Do you have any idea why the metric type would have been intentionally added to the ID document?

@zmoog
Contributor

zmoog commented Jun 23, 2023

Do you have any idea why the metric type would have been intentionally added to the ID document?

Not yet.

By adding the metric type to the ID used for grouping, the result is that https.request.bytes and https.request.count will never be in the same document.

@constanca-m ran some tests on the load balancing metrics, and removing the metric type resulted in no document overwrites on TSDB.

I see the GCP metricset generates the IDs used for grouping using a component called a metadata collector. There are several implementations of such collectors, depending on the metricset.

The default collector, used by the load balancing metrics, is the richest and the closest to what we need to support TSDB. The other collectors are much simpler.

@zmoog
Contributor

zmoog commented Jun 23, 2023

By looking at the ID value for loadbalancing:

{
  "metric": {
    "labels": {
      "cache_result": "DISABLED",
      "client_country": "United States",
      "protocol": "HTTP/1.1",
      "proxy_continent": "America",
      "response_code": "200",
      "response_code_class": "200"
    }
  },
  "resource": {
    "labels": {
      "backend_name": "giz-2-http-lb-0",
      "backend_scope": "us-central1-a",
      "backend_scope_type": "ZONE",
      "backend_target_name": "giz-2-http-lb",
      "backend_target_type": "BACKEND_SERVICE",
      "backend_type": "INSTANCE_GROUP",
      "forwarding_rule_name": "giz-2-https-lb",
      "matched_url_path_rule": "UNMATCHED",
      "project_id": "elastic-obs-integrations-dev",
      "region": "global",
      "target_proxy_name": "giz-2-https-lb",
      "url_map_name": "giz-2-https-lb"
    },
    "type": "https_lb_rule"
  },
  "timestamp": 1687279560000000000
}

@constanca-m, can we consider the fields metric.labels.* and resource.labels.* as dimensions?

If this is the case, documents like this work on TSDB because they are made of timestamp + dimensions, and we should adjust the other metadata collectors to match this schema.

@tommyers-elastic @constanca-m, please let me know what you think.

@constanca-m
Contributor

@constanca-m, can we consider the fields metric.labels.* and resource.labels.* as dimensions?

I believe so. Every field that is under gcp.dimensions.* works for the ID. But I don't know if there is a specific set of fields that identify each data stream.

@zmoog
Contributor

zmoog commented Jun 23, 2023

I believe so. Every field that is under gcp.dimensions.* works for the ID. But I don't know if there is a specific set of fields that identify each data stream.

Thanks, this is useful: I'll try to align these collectors with loadbalancing.

@agithomas
Contributor

This problem is relevant to compute_metrics as well.

In a zone, I have 2 compute instances. The number of documents created is 9 (because there is no grouping).

Only two documents have metadata that could relate to the compute instance (gcp.labels.metrics.device_name). The other documents contain metric data that cannot be associated with the compute instance.

Even though this problem was identified as part of the TSDB migration, I believe that without the metadata (gcp.labels.metrics.device_name) in all documents (either by grouping or by copying), the documents do not serve their purpose.

@tommyers-elastic
Contributor Author

@agithomas thanks for the heads up on that. Does the metadata issue you refer to relate to this issue at all? As a side conversation, what are your thoughts on the linked issue - should we alias the fields to ECS and leave the originals in place?

@agithomas
Contributor

does the metadata issue you refer to relate to this issue at all?

The issue I mentioned is not related to it. Sorry that I didn't explain it clearly. Assume we have a set of metrics, labels, and their values such as the one below.

{
  metric1 : value1,
  label_metric1: label_value1,
  metric2: value2,
  label_metric2: label_value2,
  metric3: value3,
  label_metric3 : label_value3,
  device : <vm-name>
}

What I observed is that we have three documents.

First:

{
  metric1: value1,
  label_metric1: label_value1,
  device: <vm-name>
}

Second:

{
  metric2: value2,
  label_metric2: label_value2
}

Third:

{
  metric3: value3,
  label_metric3: label_value3
}

When the data is stored as three documents, the device field is missing from documents 2 and 3.

Either we must ensure the metric labels are copied into all documents, or we must group related documents into one document that contains all the labels.

For TSDB enablement, it is important to uniquely separate one resource from another. Most likely, resource labels and metric labels are going to be super useful here. So, if we remove the metric labels, as discussed in this issue, we might lose some important information.
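To illustrate the "group related documents into one" option described above, here is a rough sketch (hypothetical field names, not the Metricbeat code) that tries to merge per-metric documents belonging to the same device into a single document carrying all labels:

```go
package main

import "fmt"

// doc is a flattened metric document; the field names mirror the example
// above and are purely illustrative.
type doc map[string]string

// mergeBy folds documents that share the same value of key into a single
// document carrying every metric and label.
func mergeBy(docs []doc, key string) map[string]doc {
	merged := map[string]doc{}
	for _, d := range docs {
		id := d[key] // empty string when the label is missing
		if merged[id] == nil {
			merged[id] = doc{}
		}
		for k, v := range d {
			merged[id][k] = v
		}
	}
	return merged
}

func main() {
	docs := []doc{
		{"metric1": "value1", "label_metric1": "label_value1", "device": "vm-1"},
		{"metric2": "value2", "label_metric2": "label_value2"}, // device missing
		{"metric3": "value3", "label_metric3": "label_value3"}, // device missing
	}

	merged := mergeBy(docs, "device")
	fmt.Println(len(merged)) // 2: the documents without "device" cannot be attributed to vm-1
	for id, d := range merged {
		fmt.Println(id, d)
	}
}
```

Because documents 2 and 3 arrive without the device label, merging them after the fact is not possible; this is why the grouping (or label copying) has to happen in the metricset, where the resource identity is still known.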

@agithomas
Contributor

it is important to uniquely separate one resource from another.

One resource from another: this holds if we have one document per resource in a time series.

If one resource (or asset) produces multiple documents, the dimensions (labels) must separate the set of documents from the first resource from the set of documents from the second resource. So, grouping data is always the best approach from a TSDB point of view.

@agithomas
Contributor

The alternative option I can consider here is to use the cloud.* fields as dimensions and get unblocked from this issue.

@zmoog
Contributor

zmoog commented Jun 28, 2023

As noted before, the compute, redis, and cloudsql metricsets have custom metadata collectors.

It is relatively easy to change the ID() function to get a different metrics grouping; however, it's not obvious why the dimensions are designed the way they are in the Metadata() function. For example, setting the instance ID and name in cloud.instance instead of using gcp.labels is intentional.

@zmoog
Contributor

zmoog commented Jun 28, 2023

@agithomas, what fields are we considering as dimensions for the compute metricset?

@zmoog
Contributor

zmoog commented Jun 28, 2023

In general, it seems the metadata collectors move some fields out of resource.labels into the equivalent ECS field.

In the previous example, compute moves resource.labels.instance_id into cloud.instance.id.

@agithomas
Contributor

@agithomas, what fields are we considering as dimensions for the compute metricset?

cloud.account.id
cloud.availability_zone
cloud.instance.id
cloud.provider
cloud.region

If needed, a fingerprint processor applied to all gcp.labels, stored as a new field, can be considered as an additional dimension.

@agithomas
Contributor

agithomas commented Jun 30, 2023

Under the gcp.compute metrics, two documents share the same set of labels. Please see the screenshots below. With the above-mentioned cloud.* fields and a fingerprint processor applied to all gcp.labels, these two documents would overlap if TSDB is enabled, leading to the loss of one of the two documents in a time series.

Document 1: This document captures uptime information

[screenshot]

Document 2: This document captures firewall stats.

[screenshot]

@zmoog
Contributor

zmoog commented Jun 30, 2023

cloud.account.id
cloud.availability_zone
cloud.instance.id
cloud.provider
cloud.region

With these fields as dimensions, the only way to avoid overlaps is to group the data that shares these ID values.

To put this in more general terms, for GCP we probably always have:

cloud.account.id
cloud.availability_zone
cloud.provider
cloud.region

and a cloud resource ID; in this case, the cloud resource ID is cloud.instance.id.

This is what we get from the Monitoring API:

[screenshot: Monitoring API response]

@constanca-m
Contributor

Is there any update on this issue @zmoog?

@tommyers-elastic
Contributor Author

@constanca-m, Maurizio is out on PTO - I said I would take over on this. I'm going to write some guidelines for dimensions across the whole package; then, when we have something to check against, we can make sure all the grouping logic is working correctly at the Beats level.

@constanca-m
Contributor

@zmoog or @tommyers-elastic, is this still in progress?

@zmoog
Contributor

zmoog commented Aug 23, 2023

@constanca-m, yes, it is. You can expect an update this week.

@constanca-m constanca-m mentioned this issue Aug 28, 2023
@zmoog
Contributor

zmoog commented Sep 26, 2023

I am updating this issue after a while.

I want to share how the collection, grouping, and events emitting work for GCP metrics.

Overview

The GCP metricset processes data in three main phases:

  • Collect time series data from Monitoring API
  • Group time series data using the Metadata Collector ID() function
  • Emit events

I will describe the behavior using the https.backend_request_bytes_count.value / https.backend_request.bytes metric from the load balancing metrics as an example.

Collection

The GCP metricset collects time series data for each one of the 29 metrics configured.

Here's an excerpt from the serialized version of the API response with the https.backend_request_bytes_count.value metric data only:

| 21 |   2 |                                                                                                      |
|    |     | ---------------------------------------------------------------------------------------------------  |
|    |     | metric:                                                                                              |
|    |     |     type: loadbalancing.googleapis.com/https/backend_request_bytes_count                             |
|    |     |     labels:                                                                                          |
|    |     |         cache_result: DISABLED                                                                       |
|    |     |         proxy_continent: Europe                                                                      |
|    |     |         response_code: "502"                                                                         |
|    |     |         response_code_class: "500"                                                                   |
|    |     | resource:                                                                                            |
|    |     |     type: https_lb_rule                                                                              |
|    |     |     labels:                                                                                          |
|    |     |         backend_name: INVALID_BACKEND                                                                |
|    |     |         backend_scope: INVALID_BACKEND                                                               |
|    |     |         backend_scope_type: INVALID_BACKEND                                                          |
|    |     |         backend_target_name: aman-2-http-lb                                                          |
|    |     |         backend_target_type: BACKEND_SERVICE                                                         |
|    |     |         backend_type: INVALID_BACKEND                                                                |
|    |     |         forwarding_rule_name: aman-2-https-lb                                                        |
|    |     |         matched_url_path_rule: UNMATCHED                                                             |
|    |     |         project_id: elastic-obs-integrations-dev                                                     |
|    |     |         region: global                                                                               |
|    |     |         target_proxy_name: aman-2-https-lb                                                           |
|    |     |         url_map_name: aman-2-https-lb                                                                |
|    |     | metadata: null                                                                                       |
|    |     | metrickind: 2                                                                                        |
|    |     | valuetype: 2                                                                                         |
|    |     | points:                                                                                              |
|    |     |     - interval:                                                                                      |
|    |     |         endtime:                                                                                     |
|    |     |             seconds: 1695760380                                                                      |
|    |     |             nanos: 0                                                                                 |
|    |     |         starttime:                                                                                   |
|    |     |             seconds: 1695760320                                                                      |
|    |     |             nanos: 1000000                                                                           |
|    |     |       value:                                                                                         |
|    |     |         value:                                                                                       |
|    |     |             int64value: 194                                                                          |
|    |     | unit: ""                                                                                             |
|    |     |                                                                                                      |
|    |     | ------------------------------------                                                                 |
|    |     | metric:                                                                                              |
|    |     |     type: loadbalancing.googleapis.com/https/backend_request_bytes_count                             |
|    |     |     labels:                                                                                          |
|    |     |         cache_result: DISABLED                                                                       |
|    |     |         proxy_continent: Europe                                                                      |
|    |     |         response_code: "502"                                                                         |
|    |     |         response_code_class: "500"                                                                   |
|    |     | resource:                                                                                            |
|    |     |     type: https_lb_rule                                                                              |
|    |     |     labels:                                                                                          |
|    |     |         backend_name: INVALID_BACKEND                                                                |
|    |     |         backend_scope: INVALID_BACKEND                                                               |
|    |     |         backend_scope_type: INVALID_BACKEND                                                          |
|    |     |         backend_target_name: giz-2-http-lb                                                           |
|    |     |         backend_target_type: BACKEND_SERVICE                                                         |
|    |     |         backend_type: INVALID_BACKEND                                                                |
|    |     |         forwarding_rule_name: giz-2-https-lb                                                         |
|    |     |         matched_url_path_rule: UNMATCHED                                                             |
|    |     |         project_id: elastic-obs-integrations-dev                                                     |
|    |     |         region: global                                                                               |
|    |     |         target_proxy_name: giz-2-https-lb                                                            |
|    |     |         url_map_name: giz-2-https-lb                                                                 |
|    |     | metadata: null                                                                                       |
|    |     | metrickind: 2                                                                                        |
|    |     | valuetype: 2                                                                                         |
|    |     | points:                                                                                              |
|    |     |     - interval:                                                                                      |
|    |     |         endtime:                                                                                     |
|    |     |             seconds: 1695760380                                                                      |
|    |     |             nanos: 0                                                                                 |
|    |     |         starttime:                                                                                   |
|    |     |             seconds: 1695760320                                                                      |
|    |     |             nanos: 1000000                                                                           |
|    |     |       value:                                                                                         |
|    |     |         value:                                                                                       |
|    |     |             int64value: 300                                                                          |
|    |     | unit: ""                                                                                             |
|    |     |                                                                                                      |
|    |     | ------------------------------------                                                                 |
|    |     |                                                                                                      |

The response contains two time series values for the https/backend_request_bytes_count metric.

Note that each time series value has:

  • one metric field with type and labels.
  • one resource field with type and labels.
  • the values, stored in the points field.

The labels are metadata. We can consider all metric labels as dimensions; see https://cloud.google.com/load-balancing/docs/metrics for details.

In addition to metric labels, many resource labels should also be considered dimensions.

Please consider the resource.labels.backend_target_name field values for the two time series:

  1. resource.labels.backend_target_name: aman-2-http-lb
  2. resource.labels.backend_target_name: giz-2-http-lb

Grouping (current implementation)

By default, the Metadata Collector ID() function uses the content of the metric.labels.* and resource.labels.* fields (and a few others) for grouping. Since the two resource.labels.backend_target_name fields contain different values, the default ID() implementation will not group these two time series values.

We can change the Metadata Collector ID() implementation to match our requirements, but we need to check the resource labels and identify all the labels that are dimensions.

Events

Ultimately, the metricset will create one event for each time series group.

What's next?

We must check the time series data and identify all the dimensions, including the values of resource.labels.* fields.

@zmoog
Contributor

zmoog commented Sep 27, 2023

@constanca-m, @gpop63, and I had a sync over Zoom to share our findings and ideas about the grouping issue with TSDB.

Summary

Let's use a time series example from GCP as a reference:

metric:                                                                                             
    type: loadbalancing.googleapis.com/https/backend_request_bytes_count                            
    labels:                                                                                         
        cache_result: DISABLED                                                                      
        proxy_continent: Europe                                                                     
        response_code: "502"                                                                        
        response_code_class: "500"                                                                  
resource:                                                                                           
    type: https_lb_rule                                                                             
    labels:                                                                                         
        backend_name: INVALID_BACKEND                                                               
        backend_scope: INVALID_BACKEND                                                              
        backend_scope_type: INVALID_BACKEND                                                         
        backend_target_name: giz-2-http-lb                                                          
        backend_target_type: BACKEND_SERVICE                                                        
        backend_type: INVALID_BACKEND                                                               
        forwarding_rule_name: giz-2-https-lb                                                        
        matched_url_path_rule: UNMATCHED                                                            
        project_id: elastic-obs-integrations-dev                                                    
        region: global                                                                              
        target_proxy_name: giz-2-https-lb                                                           
        url_map_name: giz-2-https-lb                                                                
metadata: null                                                                                      
metrickind: 2                                                                                       
valuetype: 2                                                                                        
points:                                                                                             
    - interval:                                                                                     
        endtime:                                                                                    
            seconds: 1695760380                                                                     
            nanos: 0                                                                                
        starttime:                                                                                  
            seconds: 1695760320                                                                     
            nanos: 1000000                                                                          
      value:                                                                                        
        value:                                                                                      
            int64value: 300                                                                         
unit: ""                                                                                      

We consider the following fields as dimensions:

Action Plan

Here are the next steps:

  1. Update the GCP metricset to group time series by labels (metric and resource) and a subset of ECS fields.
  2. Update the integration to fingerprint labels and the subset of ECS fields.

We opt for fingerprinting to avoid having to define every existing label as a dimension and then maintain that list as Google Cloud adds new labels.
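As a rough sketch of the fingerprinting idea, the snippet below hashes a canonical, sorted rendering of the label set into a single value that can be used as a dimension. This is conceptual only: in the integration the hashing would be done by an ingest pipeline fingerprint processor, and the exact algorithm there may differ.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"sort"
)

// fingerprint hashes a sorted, canonical rendering of the label map, so any
// set of labels collapses into one fixed-size dimension value. New labels
// added by Google Cloud change the hash but never require mapping changes.
func fingerprint(labels map[string]string) string {
	keys := make([]string, 0, len(labels))
	for k := range labels {
		keys = append(keys, k)
	}
	sort.Strings(keys)

	h := sha256.New()
	for _, k := range keys {
		h.Write([]byte(k))
		h.Write([]byte{0}) // separator to avoid ambiguous concatenations
		h.Write([]byte(labels[k]))
		h.Write([]byte{0})
	}
	return hex.EncodeToString(h.Sum(nil))
}

func main() {
	labels := map[string]string{
		"gcp.labels.metrics.response_code":        "502",
		"gcp.labels.resource.backend_target_name": "giz-2-http-lb",
	}
	fmt.Println(fingerprint(labels))
}
```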

(1) Update the GCP metricset

@gpop63 has already put together a working implementation in this PR: https://github.com/elastic/beats/pull/36682/files (kudos, Gabriel!)

We'll work on the PR to review and test the changes. We'll build a custom Agent to be able to execute the TSDB-migration-test-kit suite that Constança built.

(2) Update the integration to fingerprint labels and ECS fields

@constanca-m will look into the integration to add the fingerprinting and run the TSDB-migration-test-kit with the custom agent.

@zmoog
Contributor

zmoog commented Sep 27, 2023

I am testing @gpop63's PR using the load balancing metrics.

The metricset collected 44 time series and grouped them into 16 groups.

Here is one group with three time series:

|     11 |             3 | - key: https.backend_request.bytes                             |
|        |               |   value: 110                                                   |
|        |               |   labels:                                                      |
|        |               |     metrics:                                                   |
|        |               |         cache_result: DISABLED                                 |
|        |               |         proxy_continent: America                               |
|        |               |         response_code: "502"                                   |
|        |               |         response_code_class: "500"                             |
|        |               |     resource:                                                  |
|        |               |         backend_name: INVALID_BACKEND                          |
|        |               |         backend_scope: INVALID_BACKEND                         |
|        |               |         backend_scope_type: INVALID_BACKEND                    |
|        |               |         backend_target_name: aman-2-http-lb                    |
|        |               |         backend_target_type: BACKEND_SERVICE                   |
|        |               |         backend_type: INVALID_BACKEND                          |
|        |               |         forwarding_rule_name: aman-2-http-lb                   |
|        |               |         matched_url_path_rule: UNMATCHED                       |
|        |               |         region: global                                         |
|        |               |         target_proxy_name: aman-2-http-lb                      |
|        |               |         url_map_name: aman-2-https-lb                          |
|        |               |   ecs:                                                         |
|        |               |     cloud:                                                     |
|        |               |         account:                                               |
|        |               |             id: elastic-obs-integrations-dev                   |
|        |               |             name: elastic-obs-integrations-dev                 |
|        |               |         provider: gcp                                          |
|        |               |   timestamp: 2023-09-27T15:54:00Z                              |
|        |               | - key: https.backend_response.bytes                            |
|        |               |   value: 258                                                   |
|        |               |   labels:                                                      |
|        |               |     metrics:                                                   |
|        |               |         cache_result: DISABLED                                 |
|        |               |         proxy_continent: America                               |
|        |               |         response_code: "502"                                   |
|        |               |         response_code_class: "500"                             |
|        |               |     resource:                                                  |
|        |               |         backend_name: INVALID_BACKEND                          |
|        |               |         backend_scope: INVALID_BACKEND                         |
|        |               |         backend_scope_type: INVALID_BACKEND                    |
|        |               |         backend_target_name: aman-2-http-lb                    |
|        |               |         backend_target_type: BACKEND_SERVICE                   |
|        |               |         backend_type: INVALID_BACKEND                          |
|        |               |         forwarding_rule_name: aman-2-http-lb                   |
|        |               |         matched_url_path_rule: UNMATCHED                       |
|        |               |         region: global                                         |
|        |               |         target_proxy_name: aman-2-http-lb                      |
|        |               |         url_map_name: aman-2-https-lb                          |
|        |               |   ecs:                                                         |
|        |               |     cloud:                                                     |
|        |               |         account:                                               |
|        |               |             id: elastic-obs-integrations-dev                   |
|        |               |             name: elastic-obs-integrations-dev                 |
|        |               |         provider: gcp                                          |
|        |               |   timestamp: 2023-09-27T15:54:00Z                              |
|        |               | - key: https.backend_request.count                             |
|        |               |   value: 1                                                     |
|        |               |   labels:                                                      |
|        |               |     metrics:                                                   |
|        |               |         cache_result: DISABLED                                 |
|        |               |         proxy_continent: America                               |
|        |               |         response_code: "502"                                   |
|        |               |         response_code_class: "500"                             |
|        |               |     resource:                                                  |
|        |               |         backend_name: INVALID_BACKEND                          |
|        |               |         backend_scope: INVALID_BACKEND                         |
|        |               |         backend_scope_type: INVALID_BACKEND                    |
|        |               |         backend_target_name: aman-2-http-lb                    |
|        |               |         backend_target_type: BACKEND_SERVICE                   |
|        |               |         backend_type: INVALID_BACKEND                          |
|        |               |         forwarding_rule_name: aman-2-http-lb                   |
|        |               |         matched_url_path_rule: UNMATCHED                       |
|        |               |         region: global                                         |
|        |               |         target_proxy_name: aman-2-http-lb                      |
|        |               |         url_map_name: aman-2-https-lb                          |
|        |               |   ecs:                                                         |
|        |               |     cloud:                                                     |
|        |               |         account:                                               |
|        |               |             id: elastic-obs-integrations-dev                   |
|        |               |             name: elastic-obs-integrations-dev                 |
|        |               |         provider: gcp                                          |
|        |               |   timestamp: 2023-09-27T15:54:00Z                              |
|        |               |                                                                |
|        |               |                                                                |

In the last step, the metricset collapsed these three time series into one event:

|      3 | rootfields:                                                        |
|        |     cloud:                                                         |
|        |         account:                                                   |
|        |             id: elastic-obs-integrations-dev                       |
|        |             name: elastic-obs-integrations-dev                     |
|        |         provider: gcp                                              |
|        | modulefields:                                                      |
|        |     labels:                                                        |
|        |         metrics:                                                   |
|        |             cache_result: DISABLED                                 |
|        |             proxy_continent: America                               |
|        |             response_code: "502"                                   |
|        |             response_code_class: "500"                             |
|        |         resource:                                                  |
|        |             backend_name: INVALID_BACKEND                          |
|        |             backend_scope: INVALID_BACKEND                         |
|        |             backend_scope_type: INVALID_BACKEND                    |
|        |             backend_target_name: aman-2-http-lb                    |
|        |             backend_target_type: BACKEND_SERVICE                   |
|        |             backend_type: INVALID_BACKEND                          |
|        |             forwarding_rule_name: aman-2-http-lb                   |
|        |             matched_url_path_rule: UNMATCHED                       |
|        |             region: global                                         |
|        |             target_proxy_name: aman-2-http-lb                      |
|        |             url_map_name: aman-2-https-lb                          |
|        | metricsetfields:                                                   |
|        |     https:                                                         |
|        |         backend_request:                                           |
|        |             bytes: 110                                             |
|        |             count: 1                                               |
|        |         backend_response:                                          |
|        |             bytes: 258                                             |
|        | index: ""                                                          |
|        | id: ""                                                             |
|        | namespace: ""                                                      |
|        | timestamp: 2023-09-27T15:54:00Z                                    |
|        | error: null                                                        |
|        | host: ""                                                           |
|        | service: ""                                                        |
|        | took: 0s                                                           |
|        | period: 0s                                                         |
|        | disabletimeseries: false                                           |

cc @constanca-m @agithomas

@zmoog
Contributor

zmoog commented Sep 29, 2023

After some tests using the GKE metrics, @constanca-m discovered an unexpected behavior of the GCP metricset.

The Problem

The time series grouping worked as expected, but the TSDB-migration-test-kit detected data loss nonetheless. @constanca-m boiled down the problem to an uncomplicated case: she found two documents with identical timestamps and dimensions that were the cause of data loss when TSDB was enabled.

Analysis

The clue to understanding what was happening was that the two documents had the same @timestamp but different event.ingested, with values two minutes apart. This difference suggested that the metricset had not gathered the two documents in the same data collection.

tl;dr — ingest delay is the cause of this behavior.

Due to ingest delay, the metricset collects documents for the same @timestamp across several data collections, so there is no way the metricset can group them. Gathering documents for the same @timestamp over multiple collections is fine for standard data streams, but it is a problem for TSDB-enabled data streams where @timestamp + dimensions combination must be unique.

We learned that GCP metrics individually have an ingest delay; GCP describes ingest delay as:

Ingest Delay: Data points older than this value are guaranteed to be available to be read, excluding data loss due to errors. The delay does not include any time spent waiting until the next sampling period. The delay, if available, appears at the end of the description text in a sentence of the form "After sampling, data is not visible for up to y seconds."
The GCP docs include the above text "After sampling, data is not visible for up to y seconds." for each metric with an ingest delay.

So, during each collection cycle, the metricset collects data on a different time window for each metric.

Here's how it works.

[diagram: per-metric collection windows shifted by ingest delay]

For example, given t as a point in time, the following metrics:

  • container/memory.limit.bytes (no ingest delay) collects data in the "t-1m <—> t" window
  • container/memory/request_bytes (2 min ingest delay) collects data in the "t-3m <—> t-2m" window

If we have one "data point" in the "t-1m <—> t" time window, the metricset will collect container/memory.limit.bytes during "collection 1", and then container/memory/request_bytes two collections (2 minutes) later, during "collection 3".

If the data point in the "t-1m <—> t" time window has a constant timestamp, this would explain the two documents with the same @timestamp and different event.ingested values that @constanca.manteigas found yesterday.
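To make the window arithmetic explicit, here is a small sketch; the helper name and the one-minute period are assumptions for illustration, not the metricset's actual code:

```go
package main

import (
	"fmt"
	"time"
)

// collectionWindow returns the [start, end] interval queried for a metric at
// collection time t, given the collection period and the metric's ingest
// delay: the window is shifted back by the delay so only data that GCP
// guarantees to be readable is requested.
func collectionWindow(t time.Time, period, ingestDelay time.Duration) (start, end time.Time) {
	end = t.Add(-ingestDelay)
	start = end.Add(-period)
	return start, end
}

func main() {
	t := time.Date(2023, 9, 29, 10, 0, 0, 0, time.UTC)
	period := time.Minute

	// container/memory/limit_bytes: no ingest delay -> "t-1m <-> t"
	s1, e1 := collectionWindow(t, period, 0)
	// container/memory/request_bytes: 2m ingest delay -> "t-3m <-> t-2m"
	s2, e2 := collectionWindow(t, period, 2*time.Minute)

	fmt.Println(s1.Format("15:04"), "->", e1.Format("15:04")) // 09:59 -> 10:00
	fmt.Println(s2.Format("15:04"), "->", e2.Format("15:04")) // 09:57 -> 09:58
}
```

Because each metric's window is shifted by its own delay, values for the same underlying @timestamp surface in different collection cycles, which is exactly the behavior described above.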

Solutions

To test this hypothesis, I updated the custom agent image (image id 33ab08e90977) to add a unique event.collection_id field for each collection. This is intended for testing only; better solutions may exist.

We can avoid data loss by adding an event.collection_id field as a dimension.

@constanca-m is testing the GCP metrics integration and custom agent with these changes, and she hasn't detected data losses so far.

@zmoog
Contributor

zmoog commented Oct 4, 2023

The tests with the custom agent image ID 33ab08e90977 have been successful. After defining the event.collection_id field as a dimension, the TSDB-migration-test-kit reported no data loss.

We have moved on to the next phase: getting PR elastic/beats#36682 over the finish line.

Since event.collection_id is a non-standard field, I replaced it with event.created because it:

  • is an ECS field
  • has proper semantics
  • works for our purpose

I pushed a new version of the custom Agent with image ID 92eb8c2b4f30.

@constanca-m, could you replace event.collection_id with event.created as a dimension and run the TSDB-migration-test-kit with the Agent image ID 92eb8c2b4f30?

zmoog added a commit to gpop63/beats that referenced this issue Oct 4, 2023
# Update grouping key

The dimensionsKey contains all the dimension field values we want to use
to group the time series.

We need to add the timestamp to the key, so we only group time
series with the same timestamp.

# Add `event.created` field

We need to add an extra dimension to avoid data loss on TSDB
since GCP metrics with the same @timestamp become visible with
different "ingest delay".

For the full context, read elastic/integrations#6568 (comment)

# Drop ID() function

Remove the `ID()` function from the Metadata Collector.

Since we are unifying the metric grouping logic for all metric types, we
don't need to keep the `ID()` function anymore.

# Renaming

I also renamed some structs, functions, and variables to make their role
and purpose clearer.

We can remove this part if it does not improve clarity.
@zmoog
Contributor

zmoog commented Oct 4, 2023

TSDB does not support the date field type for dimensions, so we reverted to an ID field; this time we picked event.batch_id.

Custom agent image 21c80fd7566b is available for testing; add event.batch_id as a dimension.

@zmoog
Contributor

zmoog commented Oct 16, 2023

A new custom agent build, version 8.10.4 (image 5bbbdae4689a), is available for testing.

In this version, we replaced event.batch_id with event.metric_names as a dimension to avoid data loss from metrics collected with different ingest delays.

See elastic/beats@71c0aa0 for more details.

@zmoog
Contributor

zmoog commented Oct 17, 2023

A new custom agent build, version 8.10.5 (image 0e77fd312f4b), is available for testing.

This is just a build after a rebase from main, with no other changes. The metric names are not hashed yet, but we can handle this in the pipeline until the next build.

@zmoog
Contributor

zmoog commented Oct 19, 2023

A new custom agent build, version 8.10.5 (image 985aaeb1a00f), is available for testing.

Changes:

  • Replace event.metric_names with event.metric_names_hash

@zmoog
Contributor

zmoog commented Oct 24, 2023

A new custom agent build, version 8.10.5 (image 9c9a3e196ba5), is available for testing.

Changes:

zmoog added a commit to gpop63/beats that referenced this issue Nov 2, 2023
@zmoog
Contributor

zmoog commented Nov 2, 2023

@gpop63, a new custom agent build, version 8.10.5 (image 178af9f385a4), is available for testing the latest change we discussed during the PR review.

Changes:

  • Collect metric values after they're all available; the metricset uses the largest ingest delay available.
  • Drop the gcp.metric_names_fingerprint field because it's no longer needed.

This build sets the event.batch_id field. It's for debugging purposes only and is not included in the PR.
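For reference, a sketch of the "collect after the largest ingest delay" approach (simplified; in the real metricset the delays come from the GCP metric descriptors, and the helper below is hypothetical):

```go
package main

import (
	"fmt"
	"time"
)

// sharedWindow shifts one common collection window back by the largest ingest
// delay among the requested metrics, so every metric's data for that window
// is already available and all values for a timestamp arrive in the same
// collection cycle.
func sharedWindow(t time.Time, period time.Duration, ingestDelays []time.Duration) (start, end time.Time) {
	var maxDelay time.Duration
	for _, d := range ingestDelays {
		if d > maxDelay {
			maxDelay = d
		}
	}
	end = t.Add(-maxDelay)
	start = end.Add(-period)
	return start, end
}

func main() {
	t := time.Date(2023, 11, 2, 10, 0, 0, 0, time.UTC)
	delays := []time.Duration{0, 2 * time.Minute, 4 * time.Minute}

	s, e := sharedWindow(t, time.Minute, delays)
	fmt.Println(s.Format("15:04"), "->", e.Format("15:04")) // 09:55 -> 09:56
}
```

Waiting for the slowest metric trades a little collection latency for complete groups, which removes the need for the extra fingerprint dimension.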

@zmoog
Contributor

zmoog commented Nov 14, 2023

We resolved the issue with the merge of PR elastic/beats#36682 and the change will ship in 8.12.0.

@zmoog zmoog closed this as completed Nov 14, 2023