
[GCP] Metrics are not grouped by dimension #6568

Closed
tommyers-elastic opened this issue Jun 14, 2023 · 35 comments
Labels
Team:Cloud-Monitoring Label for the Cloud Monitoring team

Comments

@tommyers-elastic
Contributor

We have noticed occurrences of metric documents being sent to ES which have the same timestamp and dimensions (and metadata). As far as we can see, each GCP metric document contains only a single metric.

There is code in the GCP metrics module in Metricbeat that is supposed to prevent this: https://github.com/elastic/beats/blob/main/x-pack/metricbeat/module/gcp/metrics/timeseries.go. It appears there is a bug in that logic, and the grouping is not being performed at all.

This was tested on the gcp.loadbalancing metricset, which uses the gcp.metrics metricset under the hood.
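For context, here is a minimal sketch of the kind of grouping the module is expected to perform: time series that share the same timestamp and dimensions should end up in one group, and therefore in one document. This is illustrative only, with made-up type and field names, not the code in timeseries.go.

```go
package main

import (
	"fmt"
	"sort"
)

// timeSeries is a simplified stand-in for the values the GCP metricset
// collects; the real type lives in the beats GCP module.
type timeSeries struct {
	Metric    string            // e.g. "https.request.count"
	Labels    map[string]string // metric + resource labels (the dimensions)
	Timestamp int64
	Value     int64
}

// groupKey builds a key from the timestamp and labels only, so different
// metrics that share the same dimensions fall into the same group and can
// be emitted as a single document.
func groupKey(ts timeSeries) string {
	keys := make([]string, 0, len(ts.Labels))
	for k := range ts.Labels {
		keys = append(keys, k)
	}
	sort.Strings(keys)

	key := fmt.Sprintf("%d", ts.Timestamp)
	for _, k := range keys {
		key += "|" + k + "=" + ts.Labels[k]
	}
	return key
}

func main() {
	series := []timeSeries{
		{Metric: "https.request.count", Labels: map[string]string{"backend_name": "lb-0"}, Timestamp: 1686700000, Value: 3},
		{Metric: "https.request.bytes", Labels: map[string]string{"backend_name": "lb-0"}, Timestamp: 1686700000, Value: 512},
	}

	groups := map[string][]timeSeries{}
	for _, ts := range series {
		k := groupKey(ts)
		groups[k] = append(groups[k], ts)
	}
	fmt.Println(len(groups)) // 1: both metrics share the same timestamp and dimensions
}
```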

@tommyers-elastic tommyers-elastic added the Team:Cloud-Monitoring Label for the Cloud Monitoring team label Jun 14, 2023
@zmoog
Contributor

zmoog commented Jun 16, 2023

Side note after some reading about GCP metrics and TSDB.

I noticed the default value for index.mapping.dimension_fields.limit is 16. Some integrations¹ bumped this value to 32. This is also mentioned in the developer_tsdb_migration_guidelines.md.

Some GCP metrics have a lot of labels. For example, the gcp.loadbalancing_metrics.https.total_latencies.value has 18 labels:

  • 6 gcp.labels.metrics.* labels
  • 12 gcp.labels.resource.* labels

[screenshot: the 18 label fields]

For some reason, there is no gcp.labels.resource.project_id field in Elasticsearch, so I can only find 17 fields there:

[screenshot: fields in Elasticsearch]

Footnotes

  ¹ InfluxDB and CockroachDB

@zmoog
Contributor

zmoog commented Jun 20, 2023

The grouping mechanism works as designed.

The grouping is handled by the timeSeriesGrouped() function. The ID() function defines the criteria used to group the values.

For the loadbalancing metrics, the metricset uses x-pack/metricbeat/module/gcp/timeseries_metadata_collector.go as the implementation of the ID() function.

By default, the ID() function generates ID values like this one:

{
  "metric": {
    "labels": {
      "cache_result": "DISABLED",
      "client_country": "United States",
      "protocol": "HTTP/1.1",
      "proxy_continent": "America",
      "response_code": "200",
      "response_code_class": "200"
    },
    "type": "loadbalancing.googleapis.com/https/request_count" <——— 👀 —————
  },
  "resource": {
    "labels": {
      "backend_name": "giz-2-http-lb-0",
      "backend_scope": "us-central1-a",
      "backend_scope_type": "ZONE",
      "backend_target_name": "giz-2-http-lb",
      "backend_target_type": "BACKEND_SERVICE",
      "backend_type": "INSTANCE_GROUP",
      "forwarding_rule_name": "giz-2-https-lb",
      "matched_url_path_rule": "UNMATCHED",
      "project_id": "elastic-obs-integrations-dev",
      "region": "global",
      "target_proxy_name": "giz-2-https-lb",
      "url_map_name": "giz-2-https-lb"
    },
    "type": "https_lb_rule"
  },
  "timestamp": 1687277100000000000
}

Since the ID value contains the metric type loadbalancing.googleapis.com/https/request_count, two different metrics will never be grouped together.
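To make the effect concrete, here is a small illustrative sketch (not the module's actual ID() code; the helper and field names are made up) of how including the metric type in the key prevents grouping:

```go
package main

import (
	"encoding/json"
	"fmt"
)

// id builds a grouping key from labels, timestamp, and optionally the metric
// type. This only sketches the idea shown in the JSON above.
func id(metricType string, labels map[string]string, timestamp int64, includeType bool) string {
	doc := map[string]any{
		"labels":    labels,
		"timestamp": timestamp,
	}
	if includeType {
		doc["metric_type"] = metricType
	}
	b, _ := json.Marshal(doc) // map keys are marshaled in sorted order, so the key is deterministic
	return string(b)
}

func main() {
	labels := map[string]string{"response_code": "200", "backend_target_name": "giz-2-http-lb"}

	// With the metric type in the key, request_count and request_bytes_count
	// can never share a key, so they never end up in the same document.
	fmt.Println(id("https/request_count", labels, 1687277100, true) ==
		id("https/request_bytes_count", labels, 1687277100, true)) // false

	// Without the metric type, they group together.
	fmt.Println(id("https/request_count", labels, 1687277100, false) ==
		id("https/request_bytes_count", labels, 1687277100, false)) // true
}
```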

As a quick experiment, I commented out the metric.type definition in the ID() function, generating ID values like this one:

{
  "metric": {
    "labels": {
      "cache_result": "DISABLED",
      "client_country": "United States",
      "protocol": "HTTP/1.1",
      "proxy_continent": "America",
      "response_code": "200",
      "response_code_class": "200"
    }
  },
  "resource": {
    "labels": {
      "backend_name": "giz-2-http-lb-0",
      "backend_scope": "us-central1-a",
      "backend_scope_type": "ZONE",
      "backend_target_name": "giz-2-http-lb",
      "backend_target_type": "BACKEND_SERVICE",
      "backend_type": "INSTANCE_GROUP",
      "forwarding_rule_name": "giz-2-https-lb",
      "matched_url_path_rule": "UNMATCHED",
      "project_id": "elastic-obs-integrations-dev",
      "region": "global",
      "target_proxy_name": "giz-2-https-lb",
      "url_map_name": "giz-2-https-lb"
    },
    "type": "https_lb_rule"
  },
  "timestamp": 1687279560000000000
}

With this ID, I expect time series for different metric types that share the same labels and resource to be grouped together.

Here's the final result in the debugger:

[screenshot: debugger output showing the grouped time series]

Now https.request.bytes and https.request.count go together.

@zmoog
Contributor

zmoog commented Jun 20, 2023

Here is the final events list:

[screenshot: final events list]

@tommyers-elastic
Contributor Author

Thanks for the updates @zmoog.

"Works as designed" seems partially accurate. Do you have any idea why the metric type would have been intentionally added to the ID document?

@zmoog
Contributor

zmoog commented Jun 23, 2023

Do you have any idea why the metric type would have been intentionally added to the ID document?

Not yet.

By adding the metric type to the ID used for grouping, the result is that https.request.bytes and https.request.count will never be in the same document.

@constanca-m ran some tests on the load balancing metrics, and removing the metric type resulted in no document overwrites on TSDB.

I see the GCP metricset generates the IDs used for grouping using a component called a metadata collector. There are several implementations of such collectors, depending on the metricset.

The default collector, used by the load balancing metrics, is the richest and the closest to what we need to support TSDB. The other collectors are much simpler.

@zmoog
Contributor

zmoog commented Jun 23, 2023

By looking at the ID value for loadbalancing:

{
  "metric": {
    "labels": {
      "cache_result": "DISABLED",
      "client_country": "United States",
      "protocol": "HTTP/1.1",
      "proxy_continent": "America",
      "response_code": "200",
      "response_code_class": "200"
    }
  },
  "resource": {
    "labels": {
      "backend_name": "giz-2-http-lb-0",
      "backend_scope": "us-central1-a",
      "backend_scope_type": "ZONE",
      "backend_target_name": "giz-2-http-lb",
      "backend_target_type": "BACKEND_SERVICE",
      "backend_type": "INSTANCE_GROUP",
      "forwarding_rule_name": "giz-2-https-lb",
      "matched_url_path_rule": "UNMATCHED",
      "project_id": "elastic-obs-integrations-dev",
      "region": "global",
      "target_proxy_name": "giz-2-https-lb",
      "url_map_name": "giz-2-https-lb"
    },
    "type": "https_lb_rule"
  },
  "timestamp": 1687279560000000000
}

@constanca-m, can we consider the fields metric.labels.* and resource.labels.* as dimensions?

If this is the case, documents like this work on TSDB because they are made of timestamp + dimensions, and we should adjust the other metadata collectors to match this schema.

@tommyers-elastic @constanca-m, please let me know what you think.

@constanca-m
Contributor

@constanca-m, can we consider the fields metric.labels.* and resource.labels.* as dimensions?

I believe so. Every field that is under gcp.dimensions.* works for the ID. But I don't know if there is a specific set of fields that identify each data stream.

@zmoog
Contributor

zmoog commented Jun 23, 2023

I believe so. Every field that is under gcp.dimensions.* works for the ID. But I don't know if there is a specific set of fields that identify each data stream.

Thanks, this is useful: I'll try to align these collectors with loadbalancing.

@agithomas
Contributor

This problem is relevant to compute_metrics as well.

In a zone, I have 2 compute instances. The number of documents created is 9 (because there is no grouping).

Only two documents have metadata that could relate to the compute instance (gcp.labels.metrics.device_name). The other documents contain metric data that cannot be associated with the compute instance.

Even though this problem was identified as part of the TSDB migration, I believe that without the metadata (gcp.labels.metrics.device_name) in all documents (either by grouping or by copying), the documents do not serve their purpose.

@tommyers-elastic
Contributor Author

@agithomas thanks for the heads up on that. Does the metadata issue you refer to relate to this issue at all? As a side conversation, what are your thoughts on the linked issue - should we alias the fields to ECS and leave the originals in place?

@agithomas
Contributor

does the metadata issue you refer to relate to this issue at all?

The issue I mentioned is not related to it. Sorry that I didn't explain it clearly. Assume we have a set of metrics, labels, and their values such as the one below.

{
  metric1 : value1,
  label_metric1: label_value1,
  metric2: value2,
  label_metric2: label_value2,
  metric3: value3,
  label_metric3 : label_value3,
  device : <vm-name>
}

What I observed is that we have three documents.

First:

{
  metric1: value1,
  label_metric1: label_value1,
  device: <vm-name>
}

Second:

{
  metric2: value2,
  label_metric2: label_value2
}

Third:

{
  metric3: value3,
  label_metric3: label_value3
}

When the data is stored as three documents, the device field is missing from documents 2 and 3.

Either we must ensure the metric labels are copied into all documents, or we must group related documents into one document that contains all the labels.

For TSDB enablement, it is important to uniquely separate one resource from another. Most likely, resource labels and metric labels are going to be super useful here. So, if we remove the metric labels, as discussed in this issue, we might lose some important information.
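To illustrate the "group related documents into one" option described above, here is a rough sketch (hypothetical field names, not the Metricbeat code) that tries to merge per-metric documents belonging to the same device into a single document carrying all labels:

```go
package main

import "fmt"

// doc is a flattened metric document; the field names mirror the example
// above and are purely illustrative.
type doc map[string]string

// mergeBy folds documents that share the same value of key into a single
// document carrying every metric and label.
func mergeBy(docs []doc, key string) map[string]doc {
	merged := map[string]doc{}
	for _, d := range docs {
		id := d[key] // empty string when the label is missing
		if merged[id] == nil {
			merged[id] = doc{}
		}
		for k, v := range d {
			merged[id][k] = v
		}
	}
	return merged
}

func main() {
	docs := []doc{
		{"metric1": "value1", "label_metric1": "label_value1", "device": "vm-1"},
		{"metric2": "value2", "label_metric2": "label_value2"}, // device missing
		{"metric3": "value3", "label_metric3": "label_value3"}, // device missing
	}

	merged := mergeBy(docs, "device")
	fmt.Println(len(merged)) // 2: the documents without "device" cannot be attributed to vm-1
	for id, d := range merged {
		fmt.Println(id, d)
	}
}
```

Because documents 2 and 3 arrive without the device label, merging them after the fact is not possible; this is why the grouping (or label copying) has to happen in the metricset, where the resource identity is still known.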

@agithomas
Contributor

it is important to uniquely separate one resource from another.

One resource from another: this holds if we have one document per resource in a time series.

If one resource (or asset) produces multiple documents, the dimensions (labels) must separate the set of documents from the first resource from the set of documents from the second resource. So, grouping data is always the best approach from a TSDB point of view.

@agithomas
Contributor

The alternative option I can consider here is to use the cloud.* fields as dimensions and get unblocked from this issue.

@zmoog
Contributor

zmoog commented Jun 28, 2023

As noted before, the compute, redis, and cloudsql metricsets have custom metadata collectors.

It is relatively easy to change the ID() function to get a different metrics grouping; however, it's not obvious why the dimensions are designed the way they are in the Metadata() function. For example, setting the instance ID and name in cloud.instance instead of using gcp.labels is intentional.

@zmoog
Contributor

zmoog commented Jun 28, 2023

@agithomas, what fields are we considering as dimensions for the compute metricset?

@zmoog
Contributor

zmoog commented Jun 28, 2023

In general, it seems the metadata collectors move some fields out of resource.labels into the equivalent ECS field.

In the previous example, compute moves resource.labels.instance_id into cloud.instance.id.

@agithomas
Contributor

@agithomas, what fields are we considering as dimensions for the compute metricset?

cloud.account.id
cloud.availability_zone
cloud.instance.id
cloud.provider
cloud.region

If needed, a fingerprint processor applied to all gcp.labels, stored as a new field, can be considered as an additional dimension.

@agithomas
Contributor

agithomas commented Jun 30, 2023

Under the gcp.compute metrics, two documents share the same set of labels. Please see the screenshots below. With the above-mentioned cloud.* fields and a fingerprint processor applied to all gcp.labels, these two documents would overlap if TSDB is enabled, leading to the loss of one of the two documents in a time series.

Document 1: This document captures uptime information

[screenshot]

Document 2: This document captures firewall stats.

[screenshot]

@zmoog
Contributor

zmoog commented Jun 30, 2023

cloud.account.id
cloud.availability_zone
cloud.instance.id
cloud.provider
cloud.region

With these fields as dimensions, the only way to avoid overlaps is to group the data that shares these ID values.

To put this in more general terms, for GCP we probably always have:

cloud.account.id
cloud.availability_zone
cloud.provider
cloud.region

and a cloud resource ID; in this case, the cloud resource ID is cloud.instance.id.

This is what we get from the Monitoring API:

[screenshot: Monitoring API response]

@constanca-m
Contributor

Is there any update on this issue @zmoog?

@tommyers-elastic
Contributor Author

@constanca-m, Maurizio is out on PTO - I said I would take over on this. I'm going to write some guidelines for dimensions across the whole package; then, when we have something to check against, we can make sure all the grouping logic is working correctly at the Beats level.

@constanca-m
Contributor

@zmoog or @tommyers-elastic, is this still in progress?

@zmoog
Contributor

zmoog commented Aug 23, 2023

@constanca-m, yes, it is. You can expect an update this week.

@constanca-m constanca-m mentioned this issue Aug 28, 2023
@zmoog
Contributor

zmoog commented Sep 26, 2023

I am updating this issue after a while.

I want to share how the collection, grouping, and events emitting work for GCP metrics.

Overview

The GCP metricset processes data in three main phases:

  • Collect time series data from Monitoring API
  • Group time series data using the Metadata Collector ID() function
  • Emit events

I will describe the behavior using the https.backend_request_bytes_count.value / https.backend_request.bytes metric from the load balancing metrics as an example.

Collection

The GCP metricset collects time series data for each one of the 29 metrics configured.

Here's an excerpt from the serialized version of the API response with the https.backend_request_bytes_count.value metric data only:

| 21 |   2 |                                                                                                      |
|    |     | ---------------------------------------------------------------------------------------------------  |
|    |     | metric:                                                                                              |
|    |     |     type: loadbalancing.googleapis.com/https/backend_request_bytes_count                             |
|    |     |     labels:                                                                                          |
|    |     |         cache_result: DISABLED                                                                       |
|    |     |         proxy_continent: Europe                                                                      |
|    |     |         response_code: "502"                                                                         |
|    |     |         response_code_class: "500"                                                                   |
|    |     | resource:                                                                                            |
|    |     |     type: https_lb_rule                                                                              |
|    |     |     labels:                                                                                          |
|    |     |         backend_name: INVALID_BACKEND                                                                |
|    |     |         backend_scope: INVALID_BACKEND                                                               |
|    |     |         backend_scope_type: INVALID_BACKEND                                                          |
|    |     |         backend_target_name: aman-2-http-lb                                                          |
|    |     |         backend_target_type: BACKEND_SERVICE                                                         |
|    |     |         backend_type: INVALID_BACKEND                                                                |
|    |     |         forwarding_rule_name: aman-2-https-lb                                                        |
|    |     |         matched_url_path_rule: UNMATCHED                                                             |
|    |     |         project_id: elastic-obs-integrations-dev                                                     |
|    |     |         region: global                                                                               |
|    |     |         target_proxy_name: aman-2-https-lb                                                           |
|    |     |         url_map_name: aman-2-https-lb                                                                |
|    |     | metadata: null                                                                                       |
|    |     | metrickind: 2                                                                                        |
|    |     | valuetype: 2                                                                                         |
|    |     | points:                                                                                              |
|    |     |     - interval:                                                                                      |
|    |     |         endtime:                                                                                     |
|    |     |             seconds: 1695760380                                                                      |
|    |     |             nanos: 0                                                                                 |
|    |     |         starttime:                                                                                   |
|    |     |             seconds: 1695760320                                                                      |
|    |     |             nanos: 1000000                                                                           |
|    |     |       value:                                                                                         |
|    |     |         value:                                                                                       |
|    |     |             int64value: 194                                                                          |
|    |     | unit: ""                                                                                             |
|    |     |                                                                                                      |
|    |     | ------------------------------------                                                                 |
|    |     | metric:                                                                                              |
|    |     |     type: loadbalancing.googleapis.com/https/backend_request_bytes_count                             |
|    |     |     labels:                                                                                          |
|    |     |         cache_result: DISABLED                                                                       |
|    |     |         proxy_continent: Europe                                                                      |
|    |     |         response_code: "502"                                                                         |
|    |     |         response_code_class: "500"                                                                   |
|    |     | resource:                                                                                            |
|    |     |     type: https_lb_rule                                                                              |
|    |     |     labels:                                                                                          |
|    |     |         backend_name: INVALID_BACKEND                                                                |
|    |     |         backend_scope: INVALID_BACKEND                                                               |
|    |     |         backend_scope_type: INVALID_BACKEND                                                          |
|    |     |         backend_target_name: giz-2-http-lb                                                           |
|    |     |         backend_target_type: BACKEND_SERVICE                                                         |
|    |     |         backend_type: INVALID_BACKEND                                                                |
|    |     |         forwarding_rule_name: giz-2-https-lb                                                         |
|    |     |         matched_url_path_rule: UNMATCHED                                                             |
|    |     |         project_id: elastic-obs-integrations-dev                                                     |
|    |     |         region: global                                                                               |
|    |     |         target_proxy_name: giz-2-https-lb                                                            |
|    |     |         url_map_name: giz-2-https-lb                                                                 |
|    |     | metadata: null                                                                                       |
|    |     | metrickind: 2                                                                                        |
|    |     | valuetype: 2                                                                                         |
|    |     | points:                                                                                              |
|    |     |     - interval:                                                                                      |
|    |     |         endtime:                                                                                     |
|    |     |             seconds: 1695760380                                                                      |
|    |     |             nanos: 0                                                                                 |
|    |     |         starttime:                                                                                   |
|    |     |             seconds: 1695760320                                                                      |
|    |     |             nanos: 1000000                                                                           |
|    |     |       value:                                                                                         |
|    |     |         value:                                                                                       |
|    |     |             int64value: 300                                                                          |
|    |     | unit: ""                                                                                             |
|    |     |                                                                                                      |
|    |     | ------------------------------------                                                                 |
|    |     |                                                                                                      |

The response contains two time series values for the https/backend_request_bytes_count metric.

Note that each time series value has:

  • one metric field with type and labels.
  • one resource field with type and labels.
  • the values, stored in the points field.

The labels are metadata. We can consider all metric labels as dimensions; see https://cloud.google.com/load-balancing/docs/metrics for details.

In addition to metric labels, many resource labels should also be considered dimensions.

Please consider the resource.labels.backend_target_name field values for the two time series:

  1. resource.labels.backend_target_name: aman-2-http-lb
  2. resource.labels.backend_target_name: giz-2-http-lb

Grouping (current implementation)

By default, the Metadata Collector ID() function uses the content of the metric.labels.* and resource.labels.* fields (and a few others) for grouping. Since the two resource.labels.backend_target_name fields contain different values, the default ID() implementation will not group these two time series values.

We can change the Metadata Collector ID() implementation to match our requirements, but we need to check the resource labels and identify all the labels that are dimensions.

Events

Ultimately, the metricset will create one event for each time series group.

What's next?

We must check the time series data and identify all the dimensions, including the values of resource.labels.* fields.

@zmoog
Contributor

zmoog commented Sep 27, 2023

@constanca-m, @gpop63, and I had a sync over Zoom to share our findings and ideas about the grouping issue with TSDB.

Summary

Let's use a time series example from GCP as a reference:

metric:                                                                                             
    type: loadbalancing.googleapis.com/https/backend_request_bytes_count                            
    labels:                                                                                         
        cache_result: DISABLED                                                                      
        proxy_continent: Europe                                                                     
        response_code: "502"                                                                        
        response_code_class: "500"                                                                  
resource:                                                                                           
    type: https_lb_rule                                                                             
    labels:                                                                                         
        backend_name: INVALID_BACKEND                                                               
        backend_scope: INVALID_BACKEND                                                              
        backend_scope_type: INVALID_BACKEND                                                         
        backend_target_name: giz-2-http-lb                                                          
        backend_target_type: BACKEND_SERVICE                                                        
        backend_type: INVALID_BACKEND                                                               
        forwarding_rule_name: giz-2-https-lb                                                        
        matched_url_path_rule: UNMATCHED                                                            
        project_id: elastic-obs-integrations-dev                                                    
        region: global                                                                              
        target_proxy_name: giz-2-https-lb                                                           
        url_map_name: giz-2-https-lb                                                                
metadata: null                                                                                      
metrickind: 2                                                                                       
valuetype: 2                                                                                        
points:                                                                                             
    - interval:                                                                                     
        endtime:                                                                                    
            seconds: 1695760380                                                                     
            nanos: 0                                                                                
        starttime:                                                                                  
            seconds: 1695760320                                                                     
            nanos: 1000000                                                                          
      value:                                                                                        
        value:                                                                                      
            int64value: 300                                                                         
unit: ""                                                                                      

We consider the following fields as dimensions:

Action Plan

Here are the next steps:

  1. Update the GCP metricset to group time series by labels (metric and resource) and a subset of ECS fields.
  2. Update the integration to fingerprint labels and the subset of ECS fields.

We opt for fingerprinting to avoid having to define every existing label as a dimension and then maintain that list as Google Cloud adds new labels.
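As a rough sketch of the fingerprinting idea, the snippet below hashes a canonical, sorted rendering of the label set into a single value that can be used as a dimension. This is conceptual only: in the integration the hashing would be done by an ingest pipeline fingerprint processor, and the exact algorithm there may differ.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"sort"
)

// fingerprint hashes a sorted, canonical rendering of the label map, so any
// set of labels collapses into one fixed-size dimension value. New labels
// added by Google Cloud change the hash but never require mapping changes.
func fingerprint(labels map[string]string) string {
	keys := make([]string, 0, len(labels))
	for k := range labels {
		keys = append(keys, k)
	}
	sort.Strings(keys)

	h := sha256.New()
	for _, k := range keys {
		h.Write([]byte(k))
		h.Write([]byte{0}) // separator to avoid ambiguous concatenations
		h.Write([]byte(labels[k]))
		h.Write([]byte{0})
	}
	return hex.EncodeToString(h.Sum(nil))
}

func main() {
	labels := map[string]string{
		"gcp.labels.metrics.response_code":        "502",
		"gcp.labels.resource.backend_target_name": "giz-2-http-lb",
	}
	fmt.Println(fingerprint(labels))
}
```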

(1) Update the GCP metricset

@gpop63 has already put together a working implementation in this PR: https://github.com/elastic/beats/pull/36682/files (kudos, Gabriel!)

We'll work on the PR to review and test the changes. We'll build a custom Agent to be able to execute the TSDB-migration-test-kit suite that Constança built.

(2) Update the integration to fingerprint labels and ECS fields

@constanca-m will look into the integration to add the fingerprinting and run the TSDB-migration-test-kit with the custom agent.

@zmoog
Contributor

zmoog commented Sep 27, 2023

I am testing @gpop63's PR using the load balancing metrics.

The metricset collected 44 time series and grouped them into 16 groups.

Here is one group with three time series:

|     11 |             3 | - key: https.backend_request.bytes                             |
|        |               |   value: 110                                                   |
|        |               |   labels:                                                      |
|        |               |     metrics:                                                   |
|        |               |         cache_result: DISABLED                                 |
|        |               |         proxy_continent: America                               |
|        |               |         response_code: "502"                                   |
|        |               |         response_code_class: "500"                             |
|        |               |     resource:                                                  |
|        |               |         backend_name: INVALID_BACKEND                          |
|        |               |         backend_scope: INVALID_BACKEND                         |
|        |               |         backend_scope_type: INVALID_BACKEND                    |
|        |               |         backend_target_name: aman-2-http-lb                    |
|        |               |         backend_target_type: BACKEND_SERVICE                   |
|        |               |         backend_type: INVALID_BACKEND                          |
|        |               |         forwarding_rule_name: aman-2-http-lb                   |
|        |               |         matched_url_path_rule: UNMATCHED                       |
|        |               |         region: global                                         |
|        |               |         target_proxy_name: aman-2-http-lb                      |
|        |               |         url_map_name: aman-2-https-lb                          |
|        |               |   ecs:                                                         |
|        |               |     cloud:                                                     |
|        |               |         account:                                               |
|        |               |             id: elastic-obs-integrations-dev                   |
|        |               |             name: elastic-obs-integrations-dev                 |
|        |               |         provider: gcp                                          |
|        |               |   timestamp: 2023-09-27T15:54:00Z                              |
|        |               | - key: https.backend_response.bytes                            |
|        |               |   value: 258                                                   |
|        |               |   labels:                                                      |
|        |               |     metrics:                                                   |
|        |               |         cache_result: DISABLED                                 |
|        |               |         proxy_continent: America                               |
|        |               |         response_code: "502"                                   |
|        |               |         response_code_class: "500"                             |
|        |               |     resource:                                                  |
|        |               |         backend_name: INVALID_BACKEND                          |
|        |               |         backend_scope: INVALID_BACKEND                         |
|        |               |         backend_scope_type: INVALID_BACKEND                    |
|        |               |         backend_target_name: aman-2-http-lb                    |
|        |               |         backend_target_type: BACKEND_SERVICE                   |
|        |               |         backend_type: INVALID_BACKEND                          |
|        |               |         forwarding_rule_name: aman-2-http-lb                   |
|        |               |         matched_url_path_rule: UNMATCHED                       |
|        |               |         region: global                                         |
|        |               |         target_proxy_name: aman-2-http-lb                      |
|        |               |         url_map_name: aman-2-https-lb                          |
|        |               |   ecs:                                                         |
|        |               |     cloud:                                                     |
|        |               |         account:                                               |
|        |               |             id: elastic-obs-integrations-dev                   |
|        |               |             name: elastic-obs-integrations-dev                 |
|        |               |         provider: gcp                                          |
|        |               |   timestamp: 2023-09-27T15:54:00Z                              |
|        |               | - key: https.backend_request.count                             |
|        |               |   value: 1                                                     |
|        |               |   labels:                                                      |
|        |               |     metrics:                                                   |
|        |               |         cache_result: DISABLED                                 |
|        |               |         proxy_continent: America                               |
|        |               |         response_code: "502"                                   |
|        |               |         response_code_class: "500"                             |
|        |               |     resource:                                                  |
|        |               |         backend_name: INVALID_BACKEND                          |
|        |               |         backend_scope: INVALID_BACKEND                         |
|        |               |         backend_scope_type: INVALID_BACKEND                    |
|        |               |         backend_target_name: aman-2-http-lb                    |
|        |               |         backend_target_type: BACKEND_SERVICE                   |
|        |               |         backend_type: INVALID_BACKEND                          |
|        |               |         forwarding_rule_name: aman-2-http-lb                   |
|        |               |         matched_url_path_rule: UNMATCHED                       |
|        |               |         region: global                                         |
|        |               |         target_proxy_name: aman-2-http-lb                      |
|        |               |         url_map_name: aman-2-https-lb                          |
|        |               |   ecs:                                                         |
|        |               |     cloud:                                                     |
|        |               |         account:                                               |
|        |               |             id: elastic-obs-integrations-dev                   |
|        |               |             name: elastic-obs-integrations-dev                 |
|        |               |         provider: gcp                                          |
|        |               |   timestamp: 2023-09-27T15:54:00Z                              |
|        |               |                                                                |
|        |               |                                                                |

In the last step, the metricset collapsed these three time series into one event:

|      3 | rootfields:                                                        |
|        |     cloud:                                                         |
|        |         account:                                                   |
|        |             id: elastic-obs-integrations-dev                       |
|        |             name: elastic-obs-integrations-dev                     |
|        |         provider: gcp                                              |
|        | modulefields:                                                      |
|        |     labels:                                                        |
|        |         metrics:                                                   |
|        |             cache_result: DISABLED                                 |
|        |             proxy_continent: America                               |
|        |             response_code: "502"                                   |
|        |             response_code_class: "500"                             |
|        |         resource:                                                  |
|        |             backend_name: INVALID_BACKEND                          |
|        |             backend_scope: INVALID_BACKEND                         |
|        |             backend_scope_type: INVALID_BACKEND                    |
|        |             backend_target_name: aman-2-http-lb                    |
|        |             backend_target_type: BACKEND_SERVICE                   |
|        |             backend_type: INVALID_BACKEND                          |
|        |             forwarding_rule_name: aman-2-http-lb                   |
|        |             matched_url_path_rule: UNMATCHED                       |
|        |             region: global                                         |
|        |             target_proxy_name: aman-2-http-lb                      |
|        |             url_map_name: aman-2-https-lb                          |
|        | metricsetfields:                                                   |
|        |     https:                                                         |
|        |         backend_request:                                           |
|        |             bytes: 110                                             |
|        |             count: 1                                               |
|        |         backend_response:                                          |
|        |             bytes: 258                                             |
|        | index: ""                                                          |
|        | id: ""                                                             |
|        | namespace: ""                                                      |
|        | timestamp: 2023-09-27T15:54:00Z                                    |
|        | error: null                                                        |
|        | host: ""                                                           |
|        | service: ""                                                        |
|        | took: 0s                                                           |
|        | period: 0s                                                         |
|        | disabletimeseries: false                                           |

cc @constanca-m @agithomas

@zmoog
Contributor

zmoog commented Sep 29, 2023

After some tests using the GKE metrics, @constanca-m discovered an unexpected behavior of the GCP metricset.

The Problem

The time series grouping worked as expected, but the TSDB-migration-test-kit detected data loss nonetheless. @constanca-m boiled down the problem to an uncomplicated case: she found two documents with identical timestamps and dimensions that were the cause of data loss when TSDB was enabled.

Analysis

The clue to understanding what was happening was that the two documents had the same @timestamp but different event.ingested, with values two minutes apart. This difference suggested that the metricset had not gathered the two documents in the same data collection.

tl;dr — ingest delay is the cause of this behavior.

Due to ingest delay, the metricset collects documents for the same @timestamp across several data collections, so there is no way the metricset can group them. Gathering documents for the same @timestamp over multiple collections is fine for standard data streams, but it is a problem for TSDB-enabled data streams where @timestamp + dimensions combination must be unique.

We learned that GCP metrics individually have an ingest delay; GCP describes ingest delay as:

Ingest Delay: Data points older than this value are guaranteed to be available to be read, excluding data loss due to errors. The delay does not include any time spent waiting until the next sampling period. The delay, if available, appears at the end of the description text in a sentence of the form "After sampling, data is not visible for up to y seconds."
The GCP docs include the above text "After sampling, data is not visible for up to y seconds." for each metric with an ingest delay.

So, during each collection cycle, the metricset collects data on a different time window for each metric.

Here's how it works.

[diagram: per-metric collection windows shifted by ingest delay]

For example, given t as a point in time, the following metrics:

  • container/memory.limit.bytes (no ingest delay) collects data in the "t-1m <—> t" window
  • container/memory/request_bytes (2 min ingest delay) collects data in the "t-3m <—> t-2m" window

If we have one "data point" in the "t-1m <—> t" time window, the metricset will collect container/memory.limit.bytes during "collection 1", and then container/memory/request_bytes two collections (2 minutes) later, during "collection 3".

If the data point in the "t-1m <—> t" time window has a constant timestamp, this would explain the two documents with the same @timestamp and different event.ingested values that @constanca.manteigas found yesterday.
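To make the window arithmetic explicit, here is a small sketch; the helper name and the one-minute period are assumptions for illustration, not the metricset's actual code:

```go
package main

import (
	"fmt"
	"time"
)

// collectionWindow returns the [start, end] interval queried for a metric at
// collection time t, given the collection period and the metric's ingest
// delay: the window is shifted back by the delay so only data that GCP
// guarantees to be readable is requested.
func collectionWindow(t time.Time, period, ingestDelay time.Duration) (start, end time.Time) {
	end = t.Add(-ingestDelay)
	start = end.Add(-period)
	return start, end
}

func main() {
	t := time.Date(2023, 9, 29, 10, 0, 0, 0, time.UTC)
	period := time.Minute

	// container/memory/limit_bytes: no ingest delay -> "t-1m <-> t"
	s1, e1 := collectionWindow(t, period, 0)
	// container/memory/request_bytes: 2m ingest delay -> "t-3m <-> t-2m"
	s2, e2 := collectionWindow(t, period, 2*time.Minute)

	fmt.Println(s1.Format("15:04"), "->", e1.Format("15:04")) // 09:59 -> 10:00
	fmt.Println(s2.Format("15:04"), "->", e2.Format("15:04")) // 09:57 -> 09:58
}
```

Because each metric's window is shifted by its own delay, values for the same underlying @timestamp surface in different collection cycles, which is exactly the behavior described above.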

Solutions

To test this hypothesis, I updated the custom agent image (image id 33ab08e90977) to add a unique event.collection_id field for each collection. This is intended for testing only; better solutions may exist.

We can avoid data loss by adding an event.collection_id field as a dimension.

@constanca-m is testing the GCP metrics integration and custom agent with these changes, and she hasn't detected data losses so far.

@zmoog
Contributor

zmoog commented Oct 4, 2023

The tests with the custom agent image ID 33ab08e90977 have been successful. After defining the event.collection_id field as a dimension, the TSDB-migration-test-kit reported no data loss.

We have moved on to the next phase: getting PR elastic/beats#36682 over the finish line.

Since event.collection_id is a non-standard field, I replaced it with event.created because it:

  • is an ECS field
  • has proper semantics
  • works for our purpose

I pushed a new version of the custom Agent with image ID 92eb8c2b4f30.

@constanca-m, could you replace event.collection_id with event.created as a dimension and run the TSDB-migration-test-kit with the Agent image ID 92eb8c2b4f30?

zmoog added a commit to gpop63/beats that referenced this issue Oct 4, 2023
# Update grouping key

The dimensionsKey contains all the dimension field values we want to use
to group the time series.

We need to add the timestamp to the key, so we only group time
series with the same timestamp.

# Add `event.created` field

We need to add an extra dimension to avoid data loss on TSDB
since GCP metrics with the same @timestamp become visible with
different "ingest delay".

For the full context, read elastic/integrations#6568 (comment)

# Drop ID() function

Remove the `ID()` function from the Metadata Collector.

Since we are unifying the metric grouping logic for all metric types, we
don't need to keep the `ID()` function anymore.

# Renaming

I also renamed some structs, functions, and variables to make their role
and purpose clearer.

We can remove this part if it does not improve clarity.
@zmoog
Contributor

zmoog commented Oct 4, 2023

TSDB does not support the date field type for dimensions, so we reverted to an ID field; this time we picked event.batch_id.

Custom agent image 21c80fd7566b is available for testing; add event.batch_id as a dimension.

@zmoog
Contributor

zmoog commented Oct 16, 2023

A new custom agent build, version 8.10.4 (image 5bbbdae4689a), is available for testing.

In this version, we replaced event.batch_id with event.metric_names as a dimension to avoid data loss from metrics collected with different ingest delays.

See elastic/beats@71c0aa0 for more details.

@zmoog
Contributor

zmoog commented Oct 17, 2023

A new custom agent build, version 8.10.5 (image 0e77fd312f4b), is available for testing.

This is just a build after a rebase from main, with no other changes. The metric names are not hashed yet, but we can handle this in the pipeline until the next build.

@zmoog
Contributor

zmoog commented Oct 19, 2023

A new custom agent build, version 8.10.5 (image 985aaeb1a00f), is available for testing.

Changes:

  • Replace event.metric_names with event.metric_names_hash

@zmoog
Contributor

zmoog commented Oct 24, 2023

A new custom agent build, version 8.10.5 (image 9c9a3e196ba5), is available for testing.

Changes:

zmoog added a commit to gpop63/beats that referenced this issue Nov 2, 2023
@zmoog
Contributor

zmoog commented Nov 2, 2023

@gpop63, a new custom agent build, version 8.10.5 (image 178af9f385a4), is available for testing the latest change we discussed during the PR review.

Changes:

  • Collect metric values after they're all available; the metricset uses the largest ingest delay available.
  • Drop the gcp.metric_names_fingerprint field because it's no longer needed.

This build sets the event.batch_id field. It's for debugging purposes only and is not included in the PR.
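For reference, a sketch of the "collect after the largest ingest delay" approach (simplified; in the real metricset the delays come from the GCP metric descriptors, and the helper below is hypothetical):

```go
package main

import (
	"fmt"
	"time"
)

// sharedWindow shifts one common collection window back by the largest ingest
// delay among the requested metrics, so every metric's data for that window
// is already available and all values for a timestamp arrive in the same
// collection cycle.
func sharedWindow(t time.Time, period time.Duration, ingestDelays []time.Duration) (start, end time.Time) {
	var maxDelay time.Duration
	for _, d := range ingestDelays {
		if d > maxDelay {
			maxDelay = d
		}
	}
	end = t.Add(-maxDelay)
	start = end.Add(-period)
	return start, end
}

func main() {
	t := time.Date(2023, 11, 2, 10, 0, 0, 0, time.UTC)
	delays := []time.Duration{0, 2 * time.Minute, 4 * time.Minute}

	s, e := sharedWindow(t, time.Minute, delays)
	fmt.Println(s.Format("15:04"), "->", e.Format("15:04")) // 09:55 -> 09:56
}
```

Waiting for the slowest metric trades a little collection latency for complete groups, which removes the need for the extra fingerprint dimension.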

@zmoog
Contributor

zmoog commented Nov 14, 2023

We resolved the issue with the merge of PR elastic/beats#36682 and the change will ship in 8.12.0.

@zmoog zmoog closed this as completed Nov 14, 2023