Set up Flink to report metrics via Prometheus #678
Walkthrough
This change introduces Prometheus metrics reporting to Flink jobs by updating both the Flink job environment properties and the build system dependencies.
Sequence Diagram(s)
sequenceDiagram
participant User
participant DataprocSubmitter
participant FlinkJob
participant PrometheusReporter
User->>DataprocSubmitter: Submit Flink job
DataprocSubmitter->>FlinkJob: Build Flink job with Prometheus metrics config
FlinkJob->>PrometheusReporter: Initialize metrics reporting
PrometheusReporter-->>FlinkJob: Expose metrics on configured port
Actionable comments posted: 1
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro (Legacy)
📒 Files selected for processing (4)
- cloud_gcp/src/main/scala/ai/chronon/integrations/cloud_gcp/DataprocSubmitter.scala (1 hunks)
- flink/BUILD.bazel (1 hunks)
- maven_install.json (8 hunks)
- tools/build_rules/dependencies/maven_repository.bzl (1 hunks)
⏰ Context from checks skipped due to timeout of 90000ms (19)
- GitHub Check: cloud_gcp_tests
- GitHub Check: cloud_aws_tests
- GitHub Check: service_tests
- GitHub Check: hub_tests
- GitHub Check: service_tests
- GitHub Check: cloud_gcp_tests
- GitHub Check: online_tests
- GitHub Check: service_commons_tests
- GitHub Check: cloud_aws_tests
- GitHub Check: hub_tests
- GitHub Check: api_tests
- GitHub Check: flink_tests
- GitHub Check: online_tests
- GitHub Check: api_tests
- GitHub Check: flink_tests
- GitHub Check: aggregator_tests
- GitHub Check: aggregator_tests
- GitHub Check: scala_compile_fmt_fix
- GitHub Check: bazel_config_tests
🔇 Additional comments (11)
tools/build_rules/dependencies/maven_repository.bzl (1)
167-167: Added Prometheus metrics dependency for Flink.
The dependency addition is correct and matches the version of other Flink components.
flink/BUILD.bazel (1)
31-31: Added Prometheus metrics reporter to Flink library.
This dependency addition enables the Prometheus reporter to be included in the Flink library.
cloud_gcp/src/main/scala/ai/chronon/integrations/cloud_gcp/DataprocSubmitter.scala (1)
148-148: Enabled flamegraph for performance analysis.
The flamegraph feature complements metrics reporting by providing visual performance analysis.
maven_install.json (8)
3-4: Verify artifacts hash bump.
Ensure the new `__INPUT_ARTIFACTS_HASH` and `__RESOLVED_ARTIFACTS_HASH` cover the added Prometheus entries.
2831-2837: Add flink‑metrics‑prometheus artifact.
Version and shasums look correct.
7146-7150: Include Prometheus in module dependencies.
Dependency list updated properly.
13714-13719: Register Prometheus reporter factories.
Service loader entries are correctly added.
24607-24608: Add Prometheus artifact to default set.
Jar and sources included.
26039-26040: Add Prometheus artifact to default set.
Jar and sources included.
27471-27472: Add Prometheus artifact to default set.
Jar and sources included.
29996-30007: Add service loader metadata for Prometheus.
Factory mappings for jar and sources in place.
"metrics.reporters" -> "prom",
"metrics.reporter.prom.factory.class" -> "org.apache.flink.metrics.prometheus.PrometheusReporterFactory",
"metrics.reporter.prom.host" -> "localhost",
"metrics.reporter.prom.port" -> "9250-9260",
"metrics.reporter.statsd.interval" -> "60 SECONDS"
💡 Verification agent
❓ Verification inconclusive
Added Prometheus metrics configuration.
The configuration properly enables Prometheus metrics reporting with appropriate settings for host and port range.
Is the StatsD reporting interval (line 155) intended to be used alongside Prometheus? If so, verify that metrics.reporters should include both reporters:
🏁 Script executed:
#!/bin/bash
# Check if StatsDreporter is imported or used anywhere
rg -A 2 -B 2 "StatsDReporter" --glob "*.{scala,java}"
# Check if metrics.reporters setting includes statsd in any other files
rg "metrics\.reporters.*statsd" --glob "*.{scala,java,properties,yaml,conf}"
Length of output: 132
Unintended StatsD interval remains without StatsD reporter
The "metrics.reporter.statsd.interval" setting is inert because only Prometheus is enabled. Either remove the StatsD interval line or explicitly add and import a StatsD reporter.
• File: cloud_gcp/src/main/scala/ai/chronon/integrations/cloud_gcp/DataprocSubmitter.scala (lines 151–155)
• Action:
- Remove the "metrics.reporter.statsd.interval" entry if you don’t plan to use StatsD,
- OR add "statsd" to the "metrics.reporters" list and import/instantiate the StatsD reporter class.
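The mismatch flagged above can be caught mechanically. The following is a minimal sketch, not code from this PR: the object `ReporterConfigCheck` and its members are hypothetical names introduced for illustration. It assumes the properties are held in a plain `Map[String, String]` and that reporter-scoped keys follow the `metrics.reporter.<name>.<option>` pattern seen in the snippet under review.

```scala
// Hypothetical helper (not part of the PR): flags reporter-scoped keys whose
// reporter name is not listed under "metrics.reporters".
object ReporterConfigCheck {
  // Matches keys such as "metrics.reporter.prom.port" and captures "prom".
  // "metrics.reporters" itself does not match, because no dot follows "reporter".
  private val ReporterKey = """metrics\.reporter\.([^.]+)\..+""".r

  /** Names listed in `metrics.reporters`, or None when the key is absent. */
  def enabledReporters(props: Map[String, String]): Option[Set[String]] =
    props.get("metrics.reporters")
      .map(_.split(",").map(_.trim).filter(_.nonEmpty).toSet)

  /** Reporter-scoped keys that refer to a reporter missing from the enabled list. */
  def orphanedKeys(props: Map[String, String]): Seq[String] =
    enabledReporters(props) match {
      // Assumption: with no explicit list, nothing is treated as orphaned.
      case None => Seq.empty
      case Some(enabled) =>
        props.keys.toSeq.sorted.filter {
          case ReporterKey(name) => !enabled.contains(name)
          case _                 => false
        }
    }

  // The five properties from the diff under review, copied verbatim.
  val exampleProps: Map[String, String] = Map(
    "metrics.reporters" -> "prom",
    "metrics.reporter.prom.factory.class" -> "org.apache.flink.metrics.prometheus.PrometheusReporterFactory",
    "metrics.reporter.prom.host" -> "localhost",
    "metrics.reporter.prom.port" -> "9250-9260",
    "metrics.reporter.statsd.interval" -> "60 SECONDS"
  )
}
```

Calling `ReporterConfigCheck.orphanedKeys(ReporterConfigCheck.exampleProps)` returns `Seq("metrics.reporter.statsd.interval")`, the same entry the review comment asks to remove or pair with a StatsD reporter.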
## Summary

To prepare for production streaming workloads on Flink and Dataproc, we need to publish metrics. [OpsAgent](https://cloud.google.com/stackdriver/docs/solutions/agents/ops-agent) appears to be the paved path for reporting metrics from GCloud hosts, and it supports some of [Flink's metrics reporters](https://nightlies.apache.org/flink/flink-docs-release-1.17/docs/deployment/metric_reporters/). We went with Prometheus because it is supported in our current version and OpsAgent seems to have decent support for it.

This PR can go in independently of the infra PR that does the metrics scraping - zipline-ai/infrastructure#46

Tested manually and confirmed that metrics are making their way to Cloud Monitoring.

## Checklist
- [ ] Added Unit Tests
- [ ] Covered by existing CI
- [X] Integration tested
- [ ] Documentation update

## Summary by CodeRabbit
- **New Features**
  - Enabled Prometheus metrics reporting for Flink jobs, allowing improved monitoring and observability.
- **Chores**
  - Added Prometheus metrics library as a dependency to the build configuration.
Summary
To prepare for production streaming workloads on Flink and Dataproc, we need to publish metrics. OpsAgent appears to be the paved path for reporting metrics from GCloud hosts, and it supports some of Flink's metrics reporters. We went with Prometheus because it is supported in our current version and OpsAgent seems to have decent support for it.
This PR can go independently of the infra PR that does the metrics scraping - https://github.com/zipline-ai/infrastructure/pull/46
Tested manually and was able to confirm that metrics are making their way to cloud monitoring -
