@piyush-zlai piyush-zlai commented Apr 21, 2025

Summary

To get ready to support production streaming workloads on Flink and Dataproc, we need to have metrics published. OpsAgent appears to be the paved path for reporting metrics from GCloud hosts, and it supports some of Flink's metrics reporters. We went with Prometheus, as it is supported in our current version and OpsAgent seems to have decent support for it.

This PR can be merged independently of the infra PR that does the metrics scraping - https://github.com/zipline-ai/infrastructure/pull/46

Tested manually and was able to confirm that metrics are making their way to cloud monitoring -
Screenshot 2025-04-21 at 3 33 55 PM

Checklist

  • [ ] Added Unit Tests
  • [ ] Covered by existing CI
  • [x] Integration tested
  • [ ] Documentation update

Summary by CodeRabbit

  • New Features
    • Enabled Prometheus metrics reporting for Flink jobs, allowing improved monitoring and observability.
  • Chores
    • Added Prometheus metrics library as a dependency to the build configuration.

coderabbitai bot commented Apr 21, 2025

Walkthrough

This change introduces Prometheus metrics reporting to Flink jobs by updating both the Flink job environment properties and the build system dependencies. The DataprocSubmitter.scala file is modified to add Prometheus and StatsD metrics configuration to the Flink job environment. Correspondingly, the build files are updated to include the flink-metrics-prometheus artifact as a dependency, and the Maven artifact metadata is refreshed to register the Prometheus reporter implementations.

Changes

| Files/Paths | Change Summary |
|---|---|
| cloud_gcp/src/main/scala/ai/chronon/integrations/cloud_gcp/DataprocSubmitter.scala | Added Prometheus and StatsD metrics reporting properties to Flink job environment configuration. |
| flink/BUILD.bazel, tools/build_rules/dependencies/maven_repository.bzl | Added org.apache.flink:flink-metrics-prometheus as a build dependency for Flink modules. |
| maven_install.json | Registered the Prometheus metrics artifact, its SHA sums, and service loader entries for reporter factory classes. |

Sequence Diagram(s)

sequenceDiagram
    participant User
    participant DataprocSubmitter
    participant FlinkJob
    participant PrometheusReporter

    User->>DataprocSubmitter: Submit Flink job
    DataprocSubmitter->>FlinkJob: Build Flink job with Prometheus metrics config
    FlinkJob->>PrometheusReporter: Initialize metrics reporting
    PrometheusReporter-->>FlinkJob: Expose metrics on configured port

Suggested reviewers

  • david-zlai
  • nikhil-zlai

Poem

Metrics now flow, Prometheus awake,
Flink jobs reporting, for insight’s sake.
Ports open wide, 9250 to 9260,
StatsD ticks by, every minute’s plenty.
Code and config, together they sing,
Observability—let the dashboards ring!
📊✨

Warning

Review ran into problems

🔥 Problems

GitHub Actions and Pipeline Checks: Resource not accessible by integration - https://docs.github.com/rest/actions/workflow-runs#list-workflow-runs-for-a-repository.

Please grant the required permissions to the CodeRabbit GitHub App under the organization or repository settings.


@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro (Legacy)

📥 Commits

Reviewing files that changed from the base of the PR and between f026d26 and aaffc96.

📒 Files selected for processing (4)
  • cloud_gcp/src/main/scala/ai/chronon/integrations/cloud_gcp/DataprocSubmitter.scala (1 hunks)
  • flink/BUILD.bazel (1 hunks)
  • maven_install.json (8 hunks)
  • tools/build_rules/dependencies/maven_repository.bzl (1 hunks)
⏰ Context from checks skipped due to timeout of 90000ms (19)
  • GitHub Check: cloud_gcp_tests
  • GitHub Check: cloud_aws_tests
  • GitHub Check: service_tests
  • GitHub Check: hub_tests
  • GitHub Check: service_tests
  • GitHub Check: cloud_gcp_tests
  • GitHub Check: online_tests
  • GitHub Check: service_commons_tests
  • GitHub Check: cloud_aws_tests
  • GitHub Check: hub_tests
  • GitHub Check: api_tests
  • GitHub Check: flink_tests
  • GitHub Check: online_tests
  • GitHub Check: api_tests
  • GitHub Check: flink_tests
  • GitHub Check: aggregator_tests
  • GitHub Check: aggregator_tests
  • GitHub Check: scala_compile_fmt_fix
  • GitHub Check: bazel_config_tests
🔇 Additional comments (11)
tools/build_rules/dependencies/maven_repository.bzl (1)

167-167: Added Prometheus metrics dependency for Flink.

The dependency addition is correct and matches the version of other Flink components.

flink/BUILD.bazel (1)

31-31: Added Prometheus metrics reporter to Flink library.

This dependency addition enables the Prometheus reporter to be included in the Flink library.

cloud_gcp/src/main/scala/ai/chronon/integrations/cloud_gcp/DataprocSubmitter.scala (1)

148-148: Enabled flamegraph for performance analysis.

The flamegraph feature complements metrics reporting by providing visual performance analysis.

maven_install.json (8)

3-4: Verify artifacts hash bump.
Ensure the new __INPUT_ARTIFACTS_HASH and __RESOLVED_ARTIFACTS_HASH cover the added Prometheus entries.


2831-2837: Add flink‑metrics‑prometheus artifact.
Version and shasums look correct.


7146-7150: Include Prometheus in module dependencies.
Dependency list updated properly.


13714-13719: Register Prometheus reporter factories.
Service loader entries are correctly added.


24607-24608: Add Prometheus artifact to default set.
Jar and sources included.


26039-26040: Add Prometheus artifact to default set.
Jar and sources included.


27471-27472: Add Prometheus artifact to default set.
Jar and sources included.


29996-30007: Add service loader metadata for Prometheus.
Factory mappings for jar and sources in place.

Comment on lines +151 to +155
"metrics.reporters" -> "prom",
"metrics.reporter.prom.factory.class" -> "org.apache.flink.metrics.prometheus.PrometheusReporterFactory",
"metrics.reporter.prom.host" -> "localhost",
"metrics.reporter.prom.port" -> "9250-9260",
"metrics.reporter.statsd.interval" -> "60 SECONDS"

💡 Verification agent

❓ Verification inconclusive

Added Prometheus metrics configuration.

The configuration properly enables Prometheus metrics reporting with appropriate settings for host and port range.
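
As an illustration of the port-range setting above, here is a minimal Scala sketch of how a value such as "9250-9260" can be read as a list of candidate ports, with the first bindable one chosen. The names `PortRangeSketch`, `candidatePorts`, and `firstFreePort` are hypothetical, and this is not Flink's own implementation; it only shows why a range lets several reporter instances coexist on one host.

```scala
import java.net.ServerSocket
import scala.util.Try

// Hypothetical helper, not part of Flink or this PR.
object PortRangeSketch {
  // Expand "9250-9260" (or a single port such as "9249") into candidate ports.
  def candidatePorts(spec: String): Seq[Int] =
    spec.trim.split("-").map(_.trim.toInt) match {
      case Array(single)     => Seq(single)
      case Array(start, end) => start to end
      case _ =>
        throw new IllegalArgumentException(s"Unrecognized port spec: $spec")
    }

  // Return the first candidate port that can currently be bound on this host,
  // or None if every port in the range is already taken.
  def firstFreePort(spec: String): Option[Int] =
    candidatePorts(spec).find { port =>
      Try(new ServerSocket(port)).map(_.close()).isSuccess
    }
}
```

With a single port, a second process on the same host would find nothing to bind; with a range, each additional process can fall through to the next free port.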

Is the StatsD reporting interval (line 155) intended to be used alongside Prometheus? If so, verify that metrics.reporters should include both reporters:


🏁 Script executed:

#!/bin/bash
# Check if StatsDReporter is imported or used anywhere
rg -A 2 -B 2 "StatsDReporter" --glob "*.{scala,java}"

# Check if metrics.reporters setting includes statsd in any other files
rg "metrics\.reporters.*statsd" --glob "*.{scala,java,properties,yaml,conf}"

Length of output: 132


Unintended StatsD interval remains without StatsD reporter
The "metrics.reporter.statsd.interval" setting is inert because only Prometheus is enabled. Either remove the StatsD interval line or explicitly add and import a StatsD reporter.

• File: cloud_gcp/src/main/scala/ai/chronon/integrations/cloud_gcp/DataprocSubmitter.scala (lines 151–155)
• Action:
- Remove the "metrics.reporter.statsd.interval" entry if you don’t plan to use StatsD,
- OR add "statsd" to the "metrics.reporters" list and configure the StatsD reporter (factory class, host, and port).
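
A minimal sketch of the first option, assuming the properties are held in a plain `Map[String, String]` as in the snippet under review. The object `FlinkMetricsConfigSketch` and the helper `undeclaredReporters` are hypothetical names introduced here for illustration; the helper flags any `metrics.reporter.<name>.*` key whose reporter name is not listed in `metrics.reporters`.

```scala
// Hypothetical sketch, not the actual DataprocSubmitter code.
object FlinkMetricsConfigSketch {
  // Prometheus only: the StatsD interval entry is dropped.
  val prometheusOnly: Map[String, String] = Map(
    "metrics.reporters" -> "prom",
    "metrics.reporter.prom.factory.class" ->
      "org.apache.flink.metrics.prometheus.PrometheusReporterFactory",
    "metrics.reporter.prom.host" -> "localhost",
    "metrics.reporter.prom.port" -> "9250-9260"
  )

  // Reporter names referenced by "metrics.reporter.<name>.*" keys
  // that are missing from the "metrics.reporters" list.
  def undeclaredReporters(props: Map[String, String]): Set[String] = {
    val declared = props
      .getOrElse("metrics.reporters", "")
      .split(",")
      .map(_.trim)
      .filter(_.nonEmpty)
      .toSet
    val prefix = "metrics.reporter."
    val referenced = props.keys.collect {
      case key if key.startsWith(prefix) =>
        key.stripPrefix(prefix).takeWhile(_ != '.')
    }.toSet
    referenced -- declared
  }
}
```

Running `undeclaredReporters` on the map from lines 151 to 155 would report `statsd`, which is the inert entry described above; a check like this could live in a unit test for the submitter.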

@piyush-zlai piyush-zlai merged commit 0cdc57a into main Apr 22, 2025
23 checks passed
@piyush-zlai piyush-zlai deleted the piyush/flink_metrics_reporter branch April 22, 2025 13:47
kumar-zlai pushed a commit that referenced this pull request Apr 25, 2025
## Summary
To get ready to support production streaming workloads
on Flink and Dataproc, we need to have metrics published.
[OpsAgent](https://cloud.google.com/stackdriver/docs/solutions/agents/ops-agent)
appears to be the paved path for reporting metrics from GCloud hosts, and
it supports some of [Flink's metrics
reporters](https://nightlies.apache.org/flink/flink-docs-release-1.17/docs/deployment/metric_reporters/).
We went with Prometheus, as it is supported in our current
version and OpsAgent seems to have decent support for it.

This PR can go independently of the infra PR that does the metrics
scraping - zipline-ai/infrastructure#46

Tested manually and was able to confirm that metrics are making their
way to cloud monitoring -
![Screenshot 2025-04-21 at 3 33
55 PM](https://github.com/user-attachments/assets/463c491f-5d32-4ef1-8a7d-62be093e7e93)

## Checklist
- [ ] Added Unit Tests
- [ ] Covered by existing CI
- [X] Integration tested
- [ ] Documentation update



<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->

## Summary by CodeRabbit

- **New Features**
- Enabled Prometheus metrics reporting for Flink jobs, allowing improved
monitoring and observability.
- **Chores**
- Added Prometheus metrics library as a dependency to the build
configuration.

<!-- end of auto-generated comment: release notes by coderabbit.ai -->
kumar-zlai pushed a commit that referenced this pull request Apr 29, 2025
chewy-zlai pushed a commit that referenced this pull request May 15, 2025
chewy-zlai pushed a commit that referenced this pull request May 15, 2025
chewy-zlai pushed a commit that referenced this pull request May 16, 2025