@pmw-rp
Contributor

@pmw-rp pmw-rp commented Sep 29, 2025

What does this PR do?

This PR brings Redpanda metrics up to date.

Motivation

Completeness, customer demand.

Review checklist

  • PR has a meaningful title or PR has the no-changelog label attached
  • Feature or bugfix has tests
  • Git history is clean

Additional Notes

We will follow with a later PR to add supporting dashboards that demonstrate some of these metrics. The focus here is to enable customers to use metrics that are already available in the product.

This PR replaces the previous PR which was unfortunately closed due to inactivity.

Some of the issues with the previous PR seemed to be related to the DD validation - some help may be required with that.

@dd-dominic dd-dominic changed the title from "Redpanda v2.2.0 (metrics update)" to "[ECOINT-248] Redpanda v2.2.0 (metrics update)" Sep 29, 2025
Collaborator

@dd-dominic dd-dominic left a comment

Please fix validations. Once fixed, we'll begin our review.

@pmw-rp
Contributor Author

pmw-rp commented Sep 29, 2025

@dd-dominic What do I do about the validations like these?

redpanda.cloud.client_lease_duration row has error: Invalid value for metric_type: 'histogram'. Must be one of '['gauge', 'rate', 'count', 'counter']'

According to this, histogram is a supported type

@dd-dominic
Collaborator

> @dd-dominic What do I do about the validations like these?
>
> redpanda.cloud.client_lease_duration row has error: Invalid value for metric_type: 'histogram'. Must be one of '['gauge', 'rate', 'count', 'counter']'
>
> According to this, histogram is a supported type

The in-app type is different. See example here: https://docs.datadoghq.com/metrics/types/?tab=histogram#example.
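
To make the distinction concrete, here is a minimal sketch of the kind of check that produces the error quoted above. It is not ddev's actual implementation. The allowed set is copied from the error message, and the `metric_name` and `metric_type` column names and the two-column sample are assumptions made for illustration.

```python
import csv
import io

# Allowed in-app types, copied from the validation error quoted in this thread.
ALLOWED_METRIC_TYPES = {"gauge", "rate", "count", "counter"}


def find_invalid_metric_types(metadata_csv_text):
    """Return (metric_name, metric_type) pairs whose type is not an allowed in-app type."""
    reader = csv.DictReader(io.StringIO(metadata_csv_text))
    return [
        (row["metric_name"], row["metric_type"])
        for row in reader
        if row["metric_type"] not in ALLOWED_METRIC_TYPES
    ]


# Hypothetical two-column excerpt; a real metadata.csv has more columns.
sample = (
    "metric_name,metric_type\n"
    "redpanda.cloud.client_lease_duration,histogram\n"
    "redpanda.cluster.partitions,gauge\n"
)
invalid = find_invalid_metric_types(sample)
print(invalid)
```

A row typed as histogram is flagged because that is a submission type; the rows in metadata.csv have to carry one of the in-app types instead.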

@pmw-rp
Contributor Author

pmw-rp commented Sep 29, 2025

@dd-dominic Are there restrictions in changing the types listed in metadata.csv? Would we need to publish under a new metric name to change the type?

@dd-dominic
Collaborator

> @dd-dominic Are there restrictions in changing the types listed in metadata.csv? Would we need to publish under a new metric name to change the type?

You cannot change the metric name, but you can safely change the types. If you need to change the metric name, leave the old metric as is and create a new metric.

@pmw-rp
Contributor Author

pmw-rp commented Oct 2, 2025

@dd-dominic I'm now seeing a bunch of lint issues such as the following:

datadog_checks/redpanda/metrics.py:48:121: E501 Line too long (136 > 120)
   |
46 |     'redpanda_cloud_storage_housekeeping_pauses': 'cloud.storage.housekeeping.pauses',
47 |     'redpanda_cloud_storage_housekeeping_resumes': 'cloud.storage.housekeeping.resumes',
48 |     'redpanda_cloud_storage_housekeeping_requests_throttled_average_rate': 'cloud.storage_housekeeping_requests_throttled_average_rate',
   |                                                                                                                         ^^^^^^^^^^^^^^^^ E501
49 |     'redpanda_cloud_storage_housekeeping_rounds': 'cloud.storage.housekeeping.rounds',
50 |     'redpanda_cloud_storage_jobs_cloud_segment_reuploads': 'cloud.storage.jobs.cloud_segment_reuploads',

In the source, the line looks like this:

    'redpanda_cloud_storage_housekeeping_requests_throttled_average_rate': 'cloud.storage_housekeeping_requests_throttled_average_rate',

If I split it over two lines, the linter combines it back into one, so then the lint check fails since the output of the linter doesn't match the input.

What do I need to do here?
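
One general option for a string mapping that cannot be shortened is ruff's per-line suppression comment, `# noqa: E501`. The sketch below uses a hypothetical two-entry excerpt (the name `EXAMPLE_METRIC_MAP` is invented for illustration) and assumes the formatter leaves the long line untouched, as it did in the run shown in this thread. Nothing in the thread confirms that this approach was applied in the PR.

```python
# Hypothetical excerpt of a metric map shaped like the one in metrics.py.
# The trailing "# noqa: E501" asks ruff to skip the line-length rule for that one line only.
EXAMPLE_METRIC_MAP = {
    'redpanda_cloud_storage_housekeeping_resumes': 'cloud.storage.housekeeping.resumes',
    'redpanda_cloud_storage_housekeeping_requests_throttled_average_rate': 'cloud.storage_housekeeping_requests_throttled_average_rate',  # noqa: E501
}

print(len(EXAMPLE_METRIC_MAP))
```

The suppression is scoped to a single line, so the 120-character limit still applies to the rest of the file.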

@dd-dominic
Collaborator

@pmw-rp can you confirm the ddev version you're working with (ddev --version)?

Additionally, did you try running ddev test --fmt?

@dd-dominic dd-dominic requested review from a team and james-eichelbaum and removed request for a team October 2, 2025 21:08
@pmw-rp
Contributor Author

pmw-rp commented Oct 2, 2025

Here you go:

$ ddev --version
ddev, version 11.4.0
$ ddev test redpanda --fmt
──────────────────────────────────────────────────────────── Redpanda ────────────────────────────────────────────────────────────
────────────────────────────────────────────────────────────── lint ──────────────────────────────────────────────────────────────
cmd [1] | black . --config ../pyproject.toml
All done! ✨ 🍰 ✨
10 files left unchanged.
cmd [2] | ruff check --config ../pyproject.toml --fix .
datadog_checks/redpanda/metrics.py:48:121: E501 Line too long (136 > 120)
   |
46 |     'redpanda_cloud_storage_housekeeping_pauses': 'cloud.storage.housekeeping.pauses',
47 |     'redpanda_cloud_storage_housekeeping_resumes': 'cloud.storage.housekeeping.resumes',
48 |     'redpanda_cloud_storage_housekeeping_requests_throttled_average_rate': 'cloud.storage_housekeeping_requests_throttled_average_rate',
   |                                                                                                                         ^^^^^^^^^^^^^^^^ E501
49 |     'redpanda_cloud_storage_housekeeping_rounds': 'cloud.storage.housekeeping.rounds',
50 |     'redpanda_cloud_storage_jobs_cloud_segment_reuploads': 'cloud.storage.jobs.cloud_segment_reuploads',
   |

datadog_checks/redpanda/metrics.py:68:121: E501 Line too long (124 > 120)
   |
66 |     'redpanda_cloud_storage_segments_pending_deletion': 'cloud.storage.segments_pending_deletion',
67 |     'redpanda_cloud_storage_spillover_manifest_uploads': 'cloud.storage_spillover_manifest_uploads',
68 |     'redpanda_cloud_storage_spillover_manifests_materialized_bytes': 'cloud.storage_spillover_manifests_materialized_bytes',
   |                                                                                                                         ^^^^ E501
69 |     'redpanda_cloud_storage_spillover_manifests_materialized_count': 'cloud.storage_spillover_manifests_materialized_count',
70 |     'redpanda_cloud_storage_uploaded_bytes': 'cloud.storage.uploaded_bytes',
   |

datadog_checks/redpanda/metrics.py:69:121: E501 Line too long (124 > 120)
   |
67 |     'redpanda_cloud_storage_spillover_manifest_uploads': 'cloud.storage_spillover_manifest_uploads',
68 |     'redpanda_cloud_storage_spillover_manifests_materialized_bytes': 'cloud.storage_spillover_manifests_materialized_bytes',
69 |     'redpanda_cloud_storage_spillover_manifests_materialized_count': 'cloud.storage_spillover_manifests_materialized_count',
   |                                                                                                                         ^^^^ E501
70 |     'redpanda_cloud_storage_uploaded_bytes': 'cloud.storage.uploaded_bytes',
71 | }
   |

datadog_checks/redpanda/metrics.py:80:121: E501 Line too long (126 > 120)
   |
78 |     'redpanda_cluster_non_homogenous_fips_mode': 'cluster.non_homogenous_fips_mode',
79 |     'redpanda_cluster_partition_num_with_broken_rack_constraint': 'cluster.partition_num_with_broken_rack_constraint',
80 |     'redpanda_cluster_partition_schema_id_validation_records_failed': 'cluster.partition_schema_id_validation_records_failed',
   |                                                                                                                         ^^^^^^ E501
81 |     'redpanda_cluster_partitions': 'cluster.partitions',
82 |     'redpanda_cluster_topics': 'cluster.topics',
   |

datadog_checks/redpanda/metrics.py:89:121: E501 Line too long (126 > 120)
   |
87 |     'redpanda_debug_bundle_failed_generation_count': 'debug_bundle.failed_generation_count',
88 |     'redpanda_debug_bundle_last_failed_bundle_timestamp_seconds': 'debug_bundle.last_failed_bundle_timestamp_seconds',
89 |     'redpanda_debug_bundle_last_successful_bundle_timestamp_seconds': 'debug_bundle.last_successful_bundle_timestamp_seconds',
   |                                                                                                                         ^^^^^^ E501
90 |     'redpanda_debug_bundle_successful_generation_count': 'debug_bundle.successful_generation_count',
91 | }
   |

datadog_checks/redpanda/metrics.py:193:121: E501 Line too long (124 > 120)
    |
191 |     'redpanda_schema_registry_cache_subject_count': 'schema_registry.cache_subject_count',
192 |     'redpanda_schema_registry_cache_subject_version_count': 'schema_registry.cache_subject_version_count',
193 |     'redpanda_schema_registry_inflight_requests_memory_usage_ratio': 'schema_registry.inflight_requests_memory_usage_ratio',
    |                                                                                                                         ^^^^ E501
194 |     'redpanda_schema_registry_inflight_requests_usage_ratio': 'schema_registry.inflight_requests_usage_ratio',
195 |     'redpanda_schema_registry_queued_requests_memory_blocked': 'schema_registry.queued_requests_memory_blocked',
    |

datadog_checks/redpanda/metrics.py:213:121: E501 Line too long (124 > 120)
    |
211 |     'redpanda_iceberg_rest_client_active_puts': 'iceberg.rest_client_active_puts',
212 |     'redpanda_iceberg_rest_client_active_requests': 'iceberg.rest_client_active_requests',
213 |     'redpanda_iceberg_rest_client_num_commit_table_update_requests': 'iceberg.rest_client_num_commit_table_update_requests',
    |                                                                                                                         ^^^^ E501
214 |     'redpanda_iceberg_rest_client_num_commit_table_update_requests_failed': 'iceberg.rest_client_num_commit_table_update_requests_failed',
215 |     'redpanda_iceberg_rest_client_num_create_namespace_requests': 'iceberg.rest_client_num_create_namespace_requests',
    |

datadog_checks/redpanda/metrics.py:214:121: E501 Line too long (138 > 120)
    |
212 |     'redpanda_iceberg_rest_client_active_requests': 'iceberg.rest_client_active_requests',
213 |     'redpanda_iceberg_rest_client_num_commit_table_update_requests': 'iceberg.rest_client_num_commit_table_update_requests',
214 |     'redpanda_iceberg_rest_client_num_commit_table_update_requests_failed': 'iceberg.rest_client_num_commit_table_update_requests_failed',
    |                                                                                                                         ^^^^^^^^^^^^^^^^^^ E501
215 |     'redpanda_iceberg_rest_client_num_create_namespace_requests': 'iceberg.rest_client_num_create_namespace_requests',
216 |     'redpanda_iceberg_rest_client_num_create_namespace_requests_failed': 'iceberg.rest_client_num_create_namespace_requests_failed',
    |

datadog_checks/redpanda/metrics.py:216:121: E501 Line too long (132 > 120)
    |
214 |     'redpanda_iceberg_rest_client_num_commit_table_update_requests_failed': 'iceberg.rest_client_num_commit_table_update_requests_failed',
215 |     'redpanda_iceberg_rest_client_num_create_namespace_requests': 'iceberg.rest_client_num_create_namespace_requests',
216 |     'redpanda_iceberg_rest_client_num_create_namespace_requests_failed': 'iceberg.rest_client_num_create_namespace_requests_failed',
    |                                                                                                                         ^^^^^^^^^^^^ E501
217 |     'redpanda_iceberg_rest_client_num_create_table_requests': 'iceberg.rest_client_num_create_table_requests',
218 |     'redpanda_iceberg_rest_client_num_create_table_requests_failed': 'iceberg.rest_client_num_create_table_requests_failed',
    |

datadog_checks/redpanda/metrics.py:218:121: E501 Line too long (124 > 120)
    |
216 |     'redpanda_iceberg_rest_client_num_create_namespace_requests_failed': 'iceberg.rest_client_num_create_namespace_requests_failed',
217 |     'redpanda_iceberg_rest_client_num_create_table_requests': 'iceberg.rest_client_num_create_table_requests',
218 |     'redpanda_iceberg_rest_client_num_create_table_requests_failed': 'iceberg.rest_client_num_create_table_requests_failed',
    |                                                                                                                         ^^^^ E501
219 |     'redpanda_iceberg_rest_client_num_drop_table_requests': 'iceberg.rest_client_num_drop_table_requests',
220 |     'redpanda_iceberg_rest_client_num_drop_table_requests_failed': 'iceberg.rest_client_num_drop_table_requests_failed',
    |

datadog_checks/redpanda/metrics.py:226:121: E501 Line too long (122 > 120)
    |
224 |     'redpanda_iceberg_rest_client_num_load_table_requests_failed': 'iceberg.rest_client_num_load_table_requests_failed',
225 |     'redpanda_iceberg_rest_client_num_oauth_token_requests': 'iceberg.rest_client_num_oauth_token_requests',
226 |     'redpanda_iceberg_rest_client_num_oauth_token_requests_failed': 'iceberg.rest_client_num_oauth_token_requests_failed',
    |                                                                                                                         ^^ E501
227 |     'redpanda_iceberg_rest_client_num_request_timeouts': 'iceberg.rest_client_num_request_timeouts',
228 |     'redpanda_iceberg_rest_client_num_transport_errors': 'iceberg.rest_client_num_transport_errors',
    |

Found 11 errors.

@pmw-rp
Contributor Author

pmw-rp commented Oct 21, 2025

@dd-dominic Any progress on the CI issue?

@dd-dominic
Collaborator

> @dd-dominic Any progress on the CI issue?

@pmw-rp I learned we can safely ignore this error for now. Let me know when your changes are complete and I will continue the review.

@pmw-rp
Contributor Author

pmw-rp commented Oct 22, 2025

Ok, please continue.

@pmw-rp
Contributor Author

pmw-rp commented Oct 22, 2025

@dd-dominic Just a thought: I could commit local changes to test-target.yml that would 'unpin' the versions, which should unblock the CI problem and allow all of the validations to happen. We'd need to revert that change before merging, but that may make the review process easier?

Let me know if you want that and I'll get it done.

pmw-rp and others added 2 commits October 22, 2025 17:32
Co-authored-by: Dominic Medina <[email protected]>
Co-authored-by: Dominic Medina <[email protected]>
@pmw-rp pmw-rp requested a review from dd-dominic October 23, 2025 12:15
@pmw-rp
Contributor Author

pmw-rp commented Oct 24, 2025

@dd-dominic What's needed to progress here?

@dd-dominic
Collaborator

@pmw-rp Metric names can't be removed from the metadata.csv file. Can you please double check?

Looks like redpanda.cluster.replicas and redpanda.cluster.controller_log_limit_requests_dropped are missing.
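
The rule that names must not disappear from metadata.csv can be checked with a set difference between the old and new name lists. This is a minimal sketch, not the validation Datadog actually runs. The helper name `removed_metric_names` and the short name lists are hypothetical; only the two missing names come from this comment.

```python
def removed_metric_names(old_names, new_names):
    """Return names present in the old metadata but absent from the new one, sorted."""
    return sorted(set(old_names) - set(new_names))


# Illustrative lists: the first two names are the ones reported missing above.
old = [
    "redpanda.cluster.replicas",
    "redpanda.cluster.controller_log_limit_requests_dropped",
    "redpanda.cluster.topics",
]
new = ["redpanda.cluster.topics"]

missing = removed_metric_names(old, new)
print(missing)
```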

@dd-dominic
Collaborator

@pmw-rp let us know when it's ready for another review. Not sure if changes are still being made. As of now, looks like redpanda.controller.log_limit_requests_dropped is missing.

@pmw-rp
Contributor Author

pmw-rp commented Oct 31, 2025

Ahhh - it looks like this has always been wrong.

In metadata.csv, it's referenced as redpanda.cluster.controller_log_limit_requests_dropped, whereas in metrics.py and common.py, it's already referenced as redpanda.controller.log_limit_requests_dropped.

I'll push it all back to redpanda.controller.log_limit_requests_dropped, and add an additional line in metadata.csv so that you're not seeing a missing entry.
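
A mismatch like this one can be caught by comparing the names the check emits with the names listed in metadata.csv. The sketch below assumes the emitted name is the `redpanda` namespace joined to the mapped suffix. The helper name, the raw Prometheus key for the controller metric, and the sample data are hypothetical.

```python
NAMESPACE = "redpanda"


def names_missing_from_metadata(metric_map, metadata_names, namespace=NAMESPACE):
    """Return emitted metric names that have no matching entry in the metadata, sorted."""
    emitted = {f"{namespace}.{suffix}" for suffix in metric_map.values()}
    return sorted(emitted - set(metadata_names))


# Hypothetical excerpt of the mapping; the first raw key is assumed for illustration.
metric_map = {
    "redpanda_controller_log_limit_requests_dropped": "controller.log_limit_requests_dropped",
    "redpanda_cluster_topics": "cluster.topics",
}
# The metadata lists the controller metric under a different name, as described above.
metadata_names = [
    "redpanda.cluster.controller_log_limit_requests_dropped",
    "redpanda.cluster.topics",
]

mismatched = names_missing_from_metadata(metric_map, metadata_names)
print(mismatched)
```

Adding the emitted name as an extra row while keeping the old row, as proposed above, clears this check without removing any existing name.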

@pmw-rp
Contributor Author

pmw-rp commented Nov 3, 2025

@dd-dominic Where are we with this?

@dd-dominic
Collaborator

@pmw-rp working with our team to get this merged this week (hoping today)

@james-eichelbaum james-eichelbaum merged commit c4f7071 into DataDog:master Nov 3, 2025
33 of 34 checks passed
5 participants