# [GCP] Metrics are not grouped by dimension #6568
Side note after some reading about GCP metrics and TSDB: I noticed the default value for `index.mapping.dimension_fields.limit`. Some GCP metrics have a lot of labels.
The grouping mechanism works as designed. The grouping is handled by the `timeSeriesGrouped()` function. The `ID()` function defines the criteria used to group the values. By default, the ID value looks like this:

```json
{
  "metric": {
    "labels": {
      "cache_result": "DISABLED",
      "client_country": "United States",
      "protocol": "HTTP/1.1",
      "proxy_continent": "America",
      "response_code": "200",
      "response_code_class": "200"
    },
    "type": "loadbalancing.googleapis.com/https/request_count" <——— 👀 —————
  },
  "resource": {
    "labels": {
      "backend_name": "giz-2-http-lb-0",
      "backend_scope": "us-central1-a",
      "backend_scope_type": "ZONE",
      "backend_target_name": "giz-2-http-lb",
      "backend_target_type": "BACKEND_SERVICE",
      "backend_type": "INSTANCE_GROUP",
      "forwarding_rule_name": "giz-2-https-lb",
      "matched_url_path_rule": "UNMATCHED",
      "project_id": "elastic-obs-integrations-dev",
      "region": "global",
      "target_proxy_name": "giz-2-https-lb",
      "url_map_name": "giz-2-https-lb"
    },
    "type": "https_lb_rule"
  },
  "timestamp": 1687277100000000000
}
```

Since the ID value contains the metric type, each group only holds time series for a single metric type. As a quick experiment, I commented out the `metric.type` definition in the `ID()` function, generating ID values like this one:

```json
{
  "metric": {
    "labels": {
      "cache_result": "DISABLED",
      "client_country": "United States",
      "protocol": "HTTP/1.1",
      "proxy_continent": "America",
      "response_code": "200",
      "response_code_class": "200"
    }
  },
  "resource": {
    "labels": {
      "backend_name": "giz-2-http-lb-0",
      "backend_scope": "us-central1-a",
      "backend_scope_type": "ZONE",
      "backend_target_name": "giz-2-http-lb",
      "backend_target_type": "BACKEND_SERVICE",
      "backend_type": "INSTANCE_GROUP",
      "forwarding_rule_name": "giz-2-https-lb",
      "matched_url_path_rule": "UNMATCHED",
      "project_id": "elastic-obs-integrations-dev",
      "region": "global",
      "target_proxy_name": "giz-2-https-lb",
      "url_map_name": "giz-2-https-lb"
    },
    "type": "https_lb_rule"
  },
  "timestamp": 1687279560000000000
}
```

With this ID, I expect a group per resource type, with the single metric types going together. Here's the final result in the debugger:
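To make the effect of this experiment concrete, here is a minimal Python sketch (the real implementation is Go code in Metricbeat; the function names here are illustrative, not the actual `ID()`/`timeSeriesGrouped()` code) showing how including `metric.type` in the grouping ID splits time series with identical labels into separate groups:

```python
import json


def group_key(ts, include_metric_type=False):
    """Build a grouping key from a time series dict shaped like the
    Monitoring API excerpts above (metric/resource labels and types)."""
    key = {
        "metric_labels": ts["metric"]["labels"],
        "resource_labels": ts["resource"]["labels"],
        "resource_type": ts["resource"]["type"],
    }
    if include_metric_type:
        key["metric_type"] = ts["metric"]["type"]
    # Serialize with sorted keys so logically equal keys compare equal.
    return json.dumps(key, sort_keys=True)


def group(series, include_metric_type):
    """Bucket time series by their grouping key."""
    groups = {}
    for ts in series:
        groups.setdefault(group_key(ts, include_metric_type), []).append(ts)
    return groups
```

With `include_metric_type=True`, two time series that share every label but carry different metric types land in two groups (and later two events); dropping the metric type from the key merges them into one group.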
Thanks for the updates @zmoog. "Works as designed" seems partially accurate. Do you have any idea why the metric type would have been intentionally added to the ID document?
Not yet. By adding the "metric type" to the ID used for grouping, the result is that each group contains a single metric type. @constanca-m, I ran some tests on the loadbalancing metrics, and removing the metric type resulted in no document overwrites on TSDB. I see the GCP metricset generates the IDs used for grouping using a component called the metadata collector. There are several implementations of such collectors, depending on the metricset.

The default collector, used by the loadbalancing metrics, is the richest and the closest to what we need to support TSDB. The other collectors are much simpler.
By looking at the ID value for loadbalancing:

```json
{
  "metric": {
    "labels": {
      "cache_result": "DISABLED",
      "client_country": "United States",
      "protocol": "HTTP/1.1",
      "proxy_continent": "America",
      "response_code": "200",
      "response_code_class": "200"
    }
  },
  "resource": {
    "labels": {
      "backend_name": "giz-2-http-lb-0",
      "backend_scope": "us-central1-a",
      "backend_scope_type": "ZONE",
      "backend_target_name": "giz-2-http-lb",
      "backend_target_type": "BACKEND_SERVICE",
      "backend_type": "INSTANCE_GROUP",
      "forwarding_rule_name": "giz-2-https-lb",
      "matched_url_path_rule": "UNMATCHED",
      "project_id": "elastic-obs-integrations-dev",
      "region": "global",
      "target_proxy_name": "giz-2-https-lb",
      "url_map_name": "giz-2-https-lb"
    },
    "type": "https_lb_rule"
  },
  "timestamp": 1687279560000000000
}
```

@constanca-m, can we consider the fields above as dimensions? If this is the case, documents like this work on TSDB because they are made of timestamp + dimensions, and we should adjust the other metadata collectors to match this schema. @tommyers-elastic @constanca-m, please let me know what you think.
I believe so. Every field that is under the labels objects should be considered a dimension.
Thanks, this is useful: I'll try to align these collectors with loadbalancing.
This problem is relevant to `compute_metrics` as well. In a zone, I have 2 compute instances. The number of documents created is 9 (because there is no grouping). Only two documents have metadata that could relate to the compute instance (`gcp.labels.metrics.device_name`). The other documents have metric data which cannot be associated with the compute instance. Even though this problem is identified as part of the TSDB migration, I believe that without having the metadata (`gcp.labels.metrics.device_name`) in all documents (either by grouping or by copying), the documents do not serve their purpose.
@agithomas thanks for the heads up on that. Does the metadata issue you refer to relate to this issue at all? As a side conversation, what are your thoughts on the linked issue: should we alias the fields to ECS and leave the originals in place?
The issue I mentioned is not related to it. Sorry that I didn't explain it clearly. Assume we have a set of metrics, labels, and their values.

What I observed is that we have 3 documents: the first with the metric labels, and a second and third without them. When the metrics are stored as 3 documents, the metric label field is present in only one of them. Either we must get the metric labels copied into all documents, or we must group related documents into one with all the labels within it.

For TSDB enablement, it is important to uniquely separate one resource from another. Most likely, resource labels and metric labels are going to be super useful here. So, if we remove the metric labels, going with the details of this issue, we might lose some important information.
Separating one resource from another is straightforward if we have one document per resource in a time series. If one resource (or asset) produces multiple documents, dimensions (labels) must separate the set of documents from the first resource from the set of documents from the second resource. So, grouping data is always the best approach, thinking from a TSDB point of view.
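To see why overlapping documents cause data loss, here is a small Python sketch (a deliberate simplification; real TSDB indexing is more involved) of the last-write-wins behavior when two documents share the same `@timestamp` and dimension values:

```python
def tsdb_index(docs):
    """Deduplicate documents the way a time series index conceptually
    does: the identity of a document is (@timestamp, dimension values),
    and a later document with the same identity replaces the earlier one."""
    index = {}
    for doc in docs:
        key = (doc["@timestamp"], tuple(sorted(doc["dimensions"].items())))
        index[key] = doc  # an earlier doc with the same key is overwritten
    return list(index.values())
```

If two of three ingested documents share the same timestamp and dimensions, only two survive: one document is silently overwritten, which is exactly the overlap the grouping (or extra dimensions) must prevent.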
The alternate option I can consider here is to use a fingerprint of the labels.
As noted before, it is relatively easy to change the metadata collectors.
@agithomas, what fields are we considering as dimensions for the compute metricset?
In general, it seems the metadata collectors move some fields out of the labels.
If needed, a fingerprint processor applied on all the label fields could be used.
With these fields as dimensions, the only way to avoid overlaps is to group the data returned by these IDs. To put this in more general terms, for GCP we probably always have a set of metric labels, a set of resource labels, and a cloud resource ID. This is what we get from the Monitoring API:
Is there any update on this issue @zmoog?
@constanca-m, Maurizio is out on PTO; I said I would take over on this. I'm going to write some guidelines for dimensions across the whole package; then, when we have something to check against, we can make sure all the grouping logic is working correctly at the Beats level.
@zmoog or @tommyers-elastic, is this still in progress?
@constanca-m, yes, it is. You can expect an update this week.
I am updating this issue after a while. I want to share how the collection, grouping, and event emitting work for GCP metrics.

## Overview

The GCP metricset processes data in three main phases: collection, grouping, and events.
I will describe the behavior using the loadbalancing metricset.

## Collection

The GCP metricset collects time series data for each one of the 29 metrics configured. Here's an excerpt from the serialized version of the API response:
The response contains two time series values. Note that each time series value has its own set of labels and points.
The labels are metadata. We can consider all metric labels dimensions; see https://cloud.google.com/load-balancing/docs/metrics for details. In addition to metric labels, many resource labels should also be considered dimensions.
## Grouping (current implementation)

By default, the Metadata Collector builds the ID value used to group the time series. We can change the Metadata Collector to group by dimensions only.

## Events

Ultimately, the metricset will create one event for each time series group.

## What's next?

We must check the time series data and identify all the dimensions.
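The "one event per time series group" step could be sketched like this in Python (the actual implementation is Go code in Metricbeat; the `collapse` function and field names here are illustrative):

```python
def collapse(group):
    """Merge one group of time series into a single event: the shared
    labels become the event metadata, and each member contributes its
    value under its own metric name."""
    event = {
        "labels": group[0]["labels"],  # identical for all members of a group
        "metrics": {},
    }
    for ts in group:
        event["metrics"][ts["metric_name"]] = ts["value"]
    return event
```

So a group of three time series that share the same labels but carry three different metric types yields one event holding three metric fields, instead of three overlapping documents.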
@constanca-m, @gpop63, and I had a sync over Zoom to share our findings and ideas about the grouping issue with TSDB.

## Summary

Let's use a time series example from GCP as a reference:
We consider the following fields as dimensions:
## Action Plan

Here are the next steps:
We opt for fingerprinting to avoid defining every existing label as a dimension and maintaining that list as Google Cloud adds new labels.

### (1) Update the GCP metricset

@gpop63 has already put together a working implementation in this PR: https://github.com/elastic/beats/pull/36682/files (kudos, Gabriel!)

We'll work on the PR to review and test the changes. We'll build a custom Agent to be able to execute the TSDB-migration-test-kit suite that Constança built.

### (2) Update the integration to fingerprint labels and ECS fields

@constanca-m will look into the integration to add the fingerprinting and run the TSDB-migration-test-kit with the custom agent.
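The fingerprinting idea can be sketched in Python as a stable hash over the sorted label key/value pairs. This is only an approximation of the Elasticsearch fingerprint ingest processor (which has its own method and separator options); the choice of SHA-256 and the `k=v` join format here are assumptions for illustration:

```python
import hashlib


def fingerprint(labels, sep="|"):
    """Stable fingerprint of a label set: sort by key so insertion order
    doesn't matter, join into a canonical string, then hash it."""
    payload = sep.join(f"{k}={v}" for k, v in sorted(labels.items()))
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()
```

The fixed-width digest can then serve as a single dimension field, so new labels added by Google Cloud automatically change the fingerprint without anyone maintaining a dimension list.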
I am testing @gpop63's PR using the load balancing metrics. The metricset collected 44 time series and grouped them into 16 groups. Here's one group with three time series:
In the last step, the metricset collapsed these three time series into one event:
After some tests using the GKE metrics, @constanca-m discovered an unexpected behavior of the GCP metricset.

## The Problem

The time series grouping worked as expected, but the TSDB-migration-test-kit detected data loss nonetheless. @constanca-m boiled down the problem to an uncomplicated case: she found two documents with identical timestamps and dimensions that were the cause of data loss when TSDB was enabled.

## Analysis

The clue to understanding what was happening was that the two documents had the same `@timestamp` but different creation times.

tl;dr: ingest delay is the cause of this behavior. Due to ingest delay, the metricset collects documents for the same data point in multiple collection cycles.

We learned that GCP metrics individually have an ingest delay; GCP describes ingest delay as:
So, during each collection cycle, the metricset collects data on a different time window for each metric. Here's how it works:
If we have one "data point" in the "t-1m <—> t" time window, the metricset will collect it in multiple consecutive collection cycles. If the data point in the "t-1m <—> t" time window has a constant timestamp, this would explain the two documents with the same `@timestamp` and different creation times.

## Solutions

To test this hypothesis, I updated the custom agent image. We can avoid data loss by adding an `event.created` field as an extra dimension.

@constanca-m is testing the GCP metrics integration and custom agent with these changes, and she hasn't detected data losses so far.
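The per-metric windowing can be sketched in Python (the function name and the exact offset arithmetic are assumptions for illustration, not the actual Metricbeat code): each collection cycle queries a window shifted back by that metric's ingest delay, so two metrics collected in the same cycle look at different slices of time, and a delayed data point is only visible in a later cycle.

```python
from datetime import datetime, timedelta


def collection_window(now, ingest_delay, period):
    """Return the (start, end) window the metricset queries for one
    metric during a collection cycle: the window ends `ingest_delay`
    in the past and spans one collection `period`."""
    end = now - ingest_delay
    start = end - period
    return start, end
```

With a 1-minute period and a 4-minute ingest delay, a cycle running at 12:00 queries 11:55 to 11:56, while a metric with no delay queries 11:59 to 12:00; the same underlying data point keeps its original timestamp whenever it is collected, which is why `event.created` is needed to tell the resulting documents apart.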
The tests with the custom agent image were successful. We moved to the next phase: getting the PR elastic/beats#36682 over the finish line.
I pushed a new version of the custom Agent. @constanca-m, could you replace the previous image with this one?
# Update grouping key

The `dimensionsKey` contains all the dimension field values we want to use to group the time series. We need to add the timestamp to the key, so we only group time series with the same timestamp.

# Add `event.created` field

We need to add an extra dimension to avoid data loss on TSDB, since GCP metrics with the same @timestamp become visible with different "ingest delay". For the full context, read elastic/integrations#6568 (comment)

# Drop ID() function

Remove the `ID()` function from the Metadata Collector. Since we are unifying the metric grouping logic for all metric types, we don't need to keep the `ID()` function anymore.

# Renaming

I also renamed some structs, functions, and variables with the purpose of making their role and purpose more clear. We can remove this part if it does not improve clarity.
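The grouping-key change described above can be sketched in Python (the real `dimensionsKey` lives in the Go code in Metricbeat; the hashing and formatting here are illustrative assumptions): the key combines the dimension values with the data point timestamp, so only time series sharing both are grouped together.

```python
import hashlib


def grouping_key(dimensions, timestamp):
    """Key used to bucket time series: a stable digest of the sorted
    dimension values, suffixed with the data point timestamp so that
    series from different timestamps never fall into the same group."""
    dims = "|".join(f"{k}={v}" for k, v in sorted(dimensions.items()))
    digest = hashlib.sha256(dims.encode("utf-8")).hexdigest()
    return f"{digest}-{timestamp}"
```

Two series with identical dimensions but different timestamps now get distinct keys, while identical dimensions and the same timestamp still collapse into one group (and one event).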
TSDB does not support this field type, so a new custom agent image is available.
A new custom agent build, version 8.10.4, is available. See elastic/beats@71c0aa0 for details about what we replaced in this version.
A new custom agent build, version 8.10.5, is available. This is just a build after a rebase from main with no other changes. The metric names are not hashed yet, but we can handle this in the pipeline until the next build.
A new custom agent build, version 8.10.5, is available.
@gpop63, a new custom agent build, version 8.10.5, is available. This build sets the metric name hashing.
We resolved the issue with the merge of PR elastic/beats#36682; the change will ship in 8.12.0.
We have noticed occurrences of metric documents being sent to ES which have the same timestamp and dimensions (and metadata). As far as we can see, each GCP metric document contains only a single metric.
There is code in the GCP metrics module in Metricbeat which claims to stop this happening: https://github.com/elastic/beats/blob/main/x-pack/metricbeat/module/gcp/metrics/timeseries.go. It appears there is a bug in that logic, and it is not performing the grouping at all.
This was tested on the `gcp.loadbalancing` metricset, which uses the `gcp.metrics` metricset under the hood.