Add PubSub Flink source #794
Conversation
Walkthrough
This update introduces full support for Google Pub/Sub as a streaming source in Flink jobs. It adds new Scala classes, build targets, Python integration, and deployment scripts for Pub/Sub connectors, refactors source creation to be message-bus agnostic, and updates tests and configuration to enable and validate Pub/Sub ingestion.
Sequence Diagram(s)
```mermaid
sequenceDiagram
    participant Python as GcpRunner (Python)
    participant Submitter as DataprocSubmitter (Scala)
    participant Flink as FlinkJob/FlinkSourceProvider
    participant PubSub as PubSubFlinkSource
    Python->>Submitter: Submit Flink job (with optional Pub/Sub JAR, subscription)
    Submitter->>Flink: Build Flink job (includes Pub/Sub JAR if enabled)
    Flink->>Flink: Use FlinkSourceProvider to select source
    Flink->>PubSub: Instantiate PubSubFlinkSource if message bus is Pub/Sub
    PubSub-->>Flink: Provide Flink DataStream from Pub/Sub
```
Actionable comments posted: 1
🧹 Nitpick comments (7)
flink/src/test/scala/ai/chronon/flink_connectors/pubsub/PubSubFlinkSourceSpec.scala (2)
8-48: Test suite looks good but could use more specific error checks. Using assertThrows[Exception] is too generic. Consider using more specific exception types to ensure correct failure modes.
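For illustration, a minimal ScalaTest sketch of a tighter assertion; the exception type and the helper below are assumptions standing in for the real PubSubFlinkSource setup, not the actual behavior in this PR:
```scala
// Illustrative only: assumes missing required properties surface as IllegalArgumentException.
// buildPubSubSourceMissingProject() is a hypothetical helper standing in for the real test setup.
import org.scalatest.flatspec.AnyFlatSpec
import org.scalatest.matchers.should.Matchers

class PubSubSourceFailureSketch extends AnyFlatSpec with Matchers {
  private def buildPubSubSourceMissingProject(): Unit =
    throw new IllegalArgumentException("GCP project must be set") // placeholder for the real builder call

  "PubSubFlinkSource construction" should "fail with a descriptive error when the GCP project is missing" in {
    val thrown = intercept[IllegalArgumentException] {
      buildPubSubSourceMissingProject()
    }
    thrown.getMessage should include("GCP project")
  }
}
```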
35-36: Add type verification for created source. Verify that src is actually a PubSubFlinkSource instance using shouldBe a [PubSubFlinkSource].
api/python/ai/chronon/repo/gcp.py (1)
78-86: Consider a more descriptive environment variable name. SUBSCRIPTION_NAME is generic. Consider using a more specific name like GCP_PUBSUB_SUBSCRIPTION_NAME to match other GCP variable naming.
flink/src/main/scala/ai/chronon/flink/source/FlinkSourceProvider.scala (1)
29-39: Smart use of reflection for PubSub source. Avoids direct dependency on GCP libraries - good for environments like AWS.
Consider adding error handling for reflection failures:
```diff
 private def loadPubsubSource[T](props: Map[String, String],
                                 deserializationSchema: DeserializationSchema[T],
                                 topicInfo: TopicInfo): FlinkSource[T] = {
-  val cl = Thread.currentThread().getContextClassLoader() // Use Flink's classloader
-  val cls = cl.loadClass("ai.chronon.flink_connectors.pubsub.PubSubFlinkSource")
-  val constructor = cls.getConstructors.apply(0)
-  val onlineImpl = constructor.newInstance(props, deserializationSchema, topicInfo)
-  onlineImpl.asInstanceOf[FlinkSource[T]]
+  try {
+    val cl = Thread.currentThread().getContextClassLoader() // Use Flink's classloader
+    val cls = cl.loadClass("ai.chronon.flink_connectors.pubsub.PubSubFlinkSource")
+    val constructor = cls.getConstructors.apply(0)
+    val onlineImpl = constructor.newInstance(props, deserializationSchema, topicInfo)
+    onlineImpl.asInstanceOf[FlinkSource[T]]
+  } catch {
+    case e: ClassNotFoundException =>
+      throw new IllegalStateException("PubSub connector not found on classpath. Ensure PubSub connector jar is included.", e)
+    case e: Exception =>
+      throw new IllegalStateException(s"Failed to initialize PubSub source: ${e.getMessage}", e)
+  }
 }
```
flink/src/main/scala/ai/chronon/flink_connectors/pubsub/FlinkPubSubItemEventDriver.scala (2)
111-119: Consider alternatives to Thread.sleep. Using Thread.sleep blocks the executing thread. For production, consider using Flink's timer service or rate-limiting operators instead.
44-62: Add a parallelism configuration option. Source parallelism is hardcoded to 1. Consider allowing this to be configurable via command line arguments for handling larger datasets.
flink/src/main/scala/ai/chronon/flink_connectors/pubsub/PubSubFlinkSource.scala (1)
46-53: Watermark strategy documentation needed. The comment about skipping watermarks is valid but could benefit from more context about where/how watermarks are generated downstream.
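For context, a minimal sketch of deferring watermarks past the source, with assignment happening downstream once event timestamps are parsed; the 30-second bound is an arbitrary illustration, not a value from this PR:
```scala
// Sketch: no watermarks at the source; assign them downstream after parsing event timestamps.
import java.time.Duration
import org.apache.flink.api.common.eventtime.WatermarkStrategy

object WatermarkSketch {
  val atSource: WatermarkStrategy[String] = WatermarkStrategy.noWatermarks[String]()
  val downstream: WatermarkStrategy[String] =
    WatermarkStrategy.forBoundedOutOfOrderness[String](Duration.ofSeconds(30)) // tolerance is an assumption
}
```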
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (21)
- api/python/ai/chronon/repo/gcp.py (4 hunks)
- api/python/test/canary/group_bys/gcp/item_event_canary.py (1 hunks)
- api/python/test/canary/teams.py (1 hunks)
- cloud_gcp/src/main/scala/ai/chronon/integrations/cloud_gcp/DataprocSubmitter.scala (4 hunks)
- cloud_gcp/src/test/scala/ai/chronon/integrations/cloud_gcp/DataprocSubmitterTest.scala (5 hunks)
- flink/BUILD.bazel (4 hunks)
- flink/src/main/scala/ai/chronon/flink/FlinkJob.scala (5 hunks)
- flink/src/main/scala/ai/chronon/flink/source/FlinkSource.scala (1 hunks)
- flink/src/main/scala/ai/chronon/flink/source/FlinkSourceProvider.scala (1 hunks)
- flink/src/main/scala/ai/chronon/flink/source/KafkaFlinkSource.scala (2 hunks)
- flink/src/main/scala/ai/chronon/flink/validation/ValidationFlinkJob.scala (5 hunks)
- flink/src/main/scala/ai/chronon/flink_connectors/pubsub/FlinkPubSubItemEventDriver.scala (1 hunks)
- flink/src/main/scala/ai/chronon/flink_connectors/pubsub/PubSubFlinkSource.scala (1 hunks)
- flink/src/test/scala/ai/chronon/flink/test/FlinkTestUtils.scala (2 hunks)
- flink/src/test/scala/ai/chronon/flink/test/source/KafkaFlinkSourceSpec.scala (1 hunks)
- flink/src/test/scala/ai/chronon/flink/validation/ValidationFlinkJobIntegrationTest.scala (2 hunks)
- flink/src/test/scala/ai/chronon/flink_connectors/pubsub/PubSubFlinkSourceSpec.scala (1 hunks)
- maven_install.json (13 hunks)
- scripts/distribution/build_and_upload_artifacts.sh (3 hunks)
- spark/src/main/scala/ai/chronon/spark/submission/JobSubmitter.scala (3 hunks)
- tools/build_rules/dependencies/maven_repository.bzl (1 hunks)
🧰 Additional context used
🧬 Code Graph Analysis (3)
flink/src/test/scala/ai/chronon/flink/validation/ValidationFlinkJobIntegrationTest.scala (2)
flink/src/main/scala/ai/chronon/flink/source/FlinkSource.scala (1)
- FlinkSource (6-21)
flink/src/test/scala/ai/chronon/flink/test/FlinkTestUtils.scala (1)
- FlinkTestUtils (86-156)
api/python/ai/chronon/repo/gcp.py (1)
api/python/ai/chronon/repo/utils.py (1)
- get_environ_arg (49-53)
flink/src/main/scala/ai/chronon/flink/FlinkJob.scala (4)
flink/src/main/scala/ai/chronon/flink/source/FlinkSource.scala (1)
- FlinkSource (6-21)
flink/src/main/scala/ai/chronon/flink/source/FlinkSourceProvider.scala (2)
- FlinkSourceProvider (6-40)
- build (7-18)
flink/src/main/scala/ai/chronon/flink/source/KafkaFlinkSource.scala (2)
- KafkaFlinkSource (12-53)
- KafkaFlinkSource (55-62)
flink/src/main/scala/ai/chronon/flink/validation/ValidationFlinkJob.scala (3)
- ValidationFlinkJob (84-131)
- ValidationFlinkJob (133-183)
- run (134-182)
⏰ Context from checks skipped due to timeout of 90000ms (31)
- GitHub Check: cloud_gcp_tests
- GitHub Check: service_tests
- GitHub Check: cloud_aws_tests
- GitHub Check: streaming_tests
- GitHub Check: online_tests
- GitHub Check: join_tests
- GitHub Check: service_tests
- GitHub Check: cloud_gcp_tests
- GitHub Check: streaming_tests
- GitHub Check: service_commons_tests
- GitHub Check: groupby_tests
- GitHub Check: join_tests
- GitHub Check: cloud_aws_tests
- GitHub Check: online_tests
- GitHub Check: fetcher_tests
- GitHub Check: groupby_tests
- GitHub Check: api_tests
- GitHub Check: api_tests
- GitHub Check: batch_tests
- GitHub Check: fetcher_tests
- GitHub Check: aggregator_tests
- GitHub Check: analyzer_tests
- GitHub Check: analyzer_tests
- GitHub Check: flink_tests
- GitHub Check: flink_tests
- GitHub Check: spark_tests
- GitHub Check: spark_tests
- GitHub Check: scala_compile_fmt_fix
- GitHub Check: aggregator_tests
- GitHub Check: batch_tests
- GitHub Check: python_tests
🔇 Additional comments (59)
api/python/test/canary/teams.py (1)
66-67: PubSub configuration added correctly. The addition of these environment variables enables PubSub functionality and defines the subscription name for GCP team configuration.
tools/build_rules/dependencies/maven_repository.bzl (1)
179-180: Appropriate PubSub connector dependency added. Added the required Flink connector for Google PubSub.
flink/src/test/scala/ai/chronon/flink/validation/ValidationFlinkJobIntegrationTest.scala (2)
4-4: Import correctly updated for new package structure. The import statement properly reflects the new package location of FlinkSource.
23-23: Added required parallelism value. The implicit parallelism parameter is correctly implemented to match the FlinkSource interface changes.
flink/src/main/scala/ai/chronon/flink/source/FlinkSource.scala (2)
1-1: Package reorganization looks good. Moved to a more specific package name that better represents the source functionality.
8-11: Well-documented parallelism field added. The parallelism field is properly documented and marked as implicit. This change supports the PubSub integration, where parallelism configuration is important.
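As a rough sketch of the shape this comment describes; the names and signatures below are assumptions based on the review, not the repo's exact code:
```scala
// Rough sketch only; the actual FlinkSource signature in this repo may differ.
import org.apache.flink.streaming.api.datastream.DataStream
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment

abstract class FlinkSourceSketch[T] extends Serializable {
  // Parallelism this source should run with; implicit so downstream wiring can pick it up.
  implicit val parallelism: Int

  def getDataStream(topic: String, groupByName: String)(env: StreamExecutionEnvironment,
                                                        parallelism: Int): DataStream[T]
}
```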
flink/src/test/scala/ai/chronon/flink_connectors/pubsub/PubSubFlinkSourceSpec.scala (1)
14-16: Avoid passing null as a parameter. The null argument to FlinkSourceProvider.build() should be replaced with a proper value, or an overload without this parameter should be used.
flink/src/test/scala/ai/chronon/flink/test/FlinkTestUtils.scala (2)
14-15: Import change looks good. The update to the FlinkSource import path reflects the package restructuring.
52-53: Good addition of implicit parallelism. Adding a default parallelism value standardizes test source behavior.
scripts/distribution/build_and_upload_artifacts.sh (3)
157-160: Good implementation for PubSub connector build. The changes correctly add the PubSub connector JAR build step.
173-176: Appropriate validation for PubSub JAR. The check ensures the PubSub JAR was successfully built, following the pattern used for other JARs.
204-204: Upload step correctly implemented. The PubSub JAR upload uses consistent metadata and destination path.
api/python/ai/chronon/repo/gcp.py (3)
33-33: Good constant naming for PubSub JAR. Follows the established pattern for JAR file constants.
289-301: Correctly handles PubSub JAR URI conditionally. The implementation only includes the PubSub JAR when enabled, which is proper feature flagging.
343-347: Correctly passes subscription configuration. The PubSub subscription is properly included in user arguments when enabled.
flink/src/test/scala/ai/chronon/flink/test/source/KafkaFlinkSourceSpec.scala (1)
1-43: LGTM! Test coverage for FlinkSourceProvider's Kafka functionality. Good test coverage for bootstrap server resolution from various sources.
spark/src/main/scala/ai/chronon/spark/submission/JobSubmitter.scala (3)
111-111: Added constant for PubSub connector JAR URI. Consistent naming with existing constants.
139-139: Added CLI argument keyword for PubSub connector JAR. Follows existing naming pattern.
164-164: Added PubSub argument to shared internal args set. Ensures proper arg handling in the submission pipeline.
cloud_gcp/src/test/scala/ai/chronon/integrations/cloud_gcp/DataprocSubmitterTest.scala (4)
18-18: Added UUID import for generating unique job IDs. Required for the PubSub driver test.
31-31: Updated existing tests with PubSub connector parameter. Properly accommodates the new optional parameter.
Also applies to: 92-92
101-122: New test validating PubSub connector JAR inclusion. Verifies JARs are properly combined in submission.
879-904: Replaced Kafka test with PubSub equivalent. Both tests are ignored for CI but useful for local testing.
api/python/test/canary/group_bys/gcp/item_event_canary.py (3)
36-53: Good refactoring - extracted reusable function. Improves code organization and reusability.
58-58: Renamed variable for clarity and updated function call. Better variable naming improves readability.
Also applies to: 60-60
62-65: Added PubSub equivalent source with parallelism configuration. Using the 'tasks=4' parameter to set the PubSub parallelism level.
flink/src/main/scala/ai/chronon/flink/validation/ValidationFlinkJob.scala (3)
21-24: Good addition of imports. These imports support the new abstracted source creation model and enable checkpointing.
135-135: Source creation properly abstracted. Changed from direct Kafka source instantiation to FlinkSourceProvider, enabling support for PubSub.
Also applies to: 150-151
170-170: Important checkpoint configuration added. AT_LEAST_ONCE checkpointing mode is critical for PubSub, which acknowledges messages during checkpoints.
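For reference, a minimal sketch of enabling at-least-once checkpointing in a Flink job; the 10-second interval is a placeholder, not the value used in this PR:
```scala
// Minimal illustration of at-least-once checkpointing; the interval is a placeholder value.
import org.apache.flink.streaming.api.CheckpointingMode
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment

object CheckpointingSketch {
  def configure(env: StreamExecutionEnvironment): Unit = {
    // At-least-once keeps checkpoint barriers unaligned (lower latency) and fits Pub/Sub,
    // where messages are acknowledged on checkpoint completion and may be redelivered after failures.
    env.enableCheckpointing(10000L, CheckpointingMode.AT_LEAST_ONCE)
  }
}
```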
flink/src/main/scala/ai/chronon/flink/FlinkJob.scala (5)
10-10: Updated import for source abstraction. Now includes FlinkSourceProvider for source creation abstraction.
270-272: Good consolidation of streaming parameters. Creating a combined properties map provides a unified configuration approach supporting multiple source types.
278-278: Properly updated method call. ValidationFlinkJob.run now correctly receives the combined properties map.
292-292: Method parameter updated consistently. buildFlinkJob now accepts a props map instead of a Kafka-specific bootstrap parameter.
338-339: Source creation properly abstracted. Using FlinkSourceProvider.build enables support for multiple message bus types.
Also applies to: 355-355
flink/src/main/scala/ai/chronon/flink/source/FlinkSourceProvider.scala (2)
6-18: Well-designed factory method. The build method cleanly routes to the appropriate source implementation based on message bus type.
20-27: Good property resolution helper. Properly checks props first, falls back to topic params, and filters empty values.
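A hedged sketch of the two ideas above, bus-based routing plus props/topic-param fallback; the keys, URI scheme, and default are illustrative, not the exact FlinkSourceProvider code:
```scala
// Hedged sketch of bus-based routing plus props/topic-param fallback; names and keys are illustrative.
object SourceRoutingSketch {
  // Check job-level props first, then the topic's params, and treat empty strings as absent.
  def resolve(key: String, props: Map[String, String], topicParams: Map[String, String]): Option[String] =
    props.get(key).orElse(topicParams.get(key)).filter(_.nonEmpty)

  // Route on the message bus scheme of a topic URI, e.g. "kafka://events" vs "pubsub://events".
  def messageBus(topicUri: String): String =
    topicUri.split("://", 2) match {
      case Array(bus, _) => bus.toLowerCase
      case _             => "kafka" // assumed default when no scheme is present
    }
}
```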
flink/src/main/scala/ai/chronon/flink/source/KafkaFlinkSource.scala (4)
12-23: Good refactoring of bootstrap resolution. First checks the props map, then falls back to host/port from topicInfo. Clear error message when bootstrap is missing.
28-30: Efficient lazy parallelism calculation. Properly calculates parallelism only when needed.
34-35: Moved topic check to appropriate location. The topic existence check moved into getDataStream, where it logically belongs.
55-62: Well-structured companion object. Good Scala pattern with constant definition and helper method.
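A hedged sketch of that fallback order; the key and field names are illustrative rather than the actual KafkaFlinkSource helper:
```scala
// Illustrative only; the actual KafkaFlinkSource helper and key names may differ.
object BootstrapResolutionSketch {
  def resolveBootstrap(props: Map[String, String], host: Option[String], port: Option[String]): String =
    props.get("bootstrap").filter(_.nonEmpty)
      .orElse(host.map(h => port.map(p => s"$h:$p").getOrElse(h)))
      .getOrElse(throw new IllegalArgumentException(
        "Kafka bootstrap servers must be provided via props or the topic's host/port"))
}
```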
maven_install.json (13)
3-4: Autogenerated hashes updated.
490-496: Add gRPC Pub/Sub v1 artifact metadata.
2900-2906: Add Flink Pub/Sub connector artifact metadata.
5478-5497: Declare dependencies for grpc-google-cloud-pubsub-v1.
7329-7340: Register transitive deps for Flink GCP Pub/Sub connector.
9891-9893: Map Pub/Sub proto package.
13919-13923: List Flink Pub/Sub connector packages.
24307-24308: Include Pub/Sub artifact sources.
24995-24996: Reference Flink Pub/Sub connector in sources.
25779-25780: Include Pub/Sub gRPC jar and sources.
26467-26468: Add Flink Pub/Sub connector JAR references.
27251-27252: Add Pub/Sub gRPC jar & sources entry.
27939-27940: Include Flink Pub/Sub connector in artifact lists.flink/src/main/scala/ai/chronon/flink_connectors/pubsub/PubSubFlinkSource.scala (2)
33-35: Good validation of required propertiesRequired properties are properly checked and clear error messages are provided.
11-23: Well-documented PubSub differences. Good explanation of key differences between PubSub and Kafka, particularly around topic parallelism and subscription behavior.
flink/BUILD.bazel (2)
95-144: Clean connector target isolation. Good separation of the PubSub connector into dedicated targets, preventing dependency conflicts.
3-3: Appropriate source glob narrowing. Correctly narrowed source file patterns to the specific package structure.
cloud_gcp/src/main/scala/ai/chronon/integrations/cloud_gcp/DataprocSubmitter.scala (2)
291-298: Well-implemented JAR URI handling. Correctly appends the PubSub connector JAR to the job configuration when provided.
397-408: Good optional property handling. Properly extracts the optional PubSub connector URI and conditionally adds it to job properties.
```scala
package ai.chronon.flink_connectors.pubsub

import org.apache.avro.Schema
import org.apache.avro.generic.GenericRecord
import org.apache.flink.api.common.functions.MapFunction
import org.apache.flink.api.common.typeinfo.TypeInformation
import org.apache.flink.core.fs.Path
import org.apache.flink.formats.avro.{AvroInputFormat, AvroSerializationSchema}
import org.apache.flink.formats.avro.typeutils.GenericRecordAvroTypeInfo
import org.apache.flink.formats.avro.utils.AvroKryoSerializerUtils
import org.apache.flink.streaming.api.datastream.DataStream
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment
import org.apache.flink.streaming.connectors.gcp.pubsub.PubSubSink
import org.rogach.scallop.{ScallopConf, ScallopOption, Serialization}

// Canary test app that can point to a source data file and will emit an event to PubSub periodically with an updated timestamp
object FlinkPubSubItemEventDriver {
  // Pull in the Serialization trait to sidestep: https://github.com/scallop/scallop/issues/137
  class JobArgs(args: Seq[String]) extends ScallopConf(args) with Serialization {
    val dataFileName: ScallopOption[String] =
      opt[String](required = true, descr = "Name of the file on GCS to read data from")
    val gcpProject: ScallopOption[String] =
      opt[String](required = true, descr = "Gcp project")
    val topic: ScallopOption[String] = opt[String](required = true, descr = "PubSub topic to write to")
    val parentJobId: ScallopOption[String] =
      opt[String](required = false,
                  descr = "Parent job id that invoked the Flink job. For example, the Dataproc job id.")
    val eventDelayMillis: ScallopOption[Int] =
      opt[Int](required = false,
               descr = "Delay to use between event publishes (dictates the eps)",
               default = Some(1000))

    verify()
  }

  def main(args: Array[String]): Unit = {
    val jobArgs = new JobArgs(args)
    val dataFileName = jobArgs.dataFileName()
    val gcpProject = jobArgs.gcpProject()
    val topic = jobArgs.topic()
    val parentJobId = jobArgs.parentJobId()
    val eventDelayMillis = jobArgs.eventDelayMillis()

    // Configure GCS source
    val avroFormat = new AvroInputFormat[GenericRecord](
      new Path(dataFileName),
      classOf[GenericRecord]
    )

    implicit val typeInfo: TypeInformation[GenericRecord] = new GenericRecordAvroTypeInfo(avroSchema)

    // Set up the streaming execution environment
    val env = StreamExecutionEnvironment.getExecutionEnvironment
    env.getConfig
      .enableForceKryo() // use kryo for complex types that Flink's default ser system doesn't support (e.g case classes)
    env.getConfig.enableGenericTypes() // more permissive type checks
    env.addDefaultKryoSerializer(classOf[Schema], classOf[AvroKryoSerializerUtils.AvroSchemaSerializer])

    val stream = env
      .createInput(avroFormat)
      .setParallelism(1)

    val transformedStream: DataStream[GenericRecord] = stream
      .map(new DelayedSourceTransformFn(eventDelayMillis))
      .setParallelism(stream.getParallelism)

    // Configure PubSub sink
    val serializationSchema = AvroSerializationSchema.forGeneric(avroSchema)

    val pubSubSink = PubSubSink
      .newBuilder()
      .withSerializationSchema(serializationSchema)
      .withProjectName(gcpProject)
      .withTopicName(topic)
      .build()

    // Write to PubSub
    transformedStream
      .addSink(pubSubSink)
      .setParallelism(transformedStream.getParallelism)

    // Execute program
    env.execute("Periodic PubSub Data Producer")
  }
```
🛠️ Refactor suggestion
Add error handling and logging
No error handling for file read failures, transformation errors, or Pub/Sub publishing issues.
🤖 Prompt for AI Agents
In flink/src/main/scala/ai/chronon/flink_connectors/pubsub/FlinkPubSubItemEventDriver.scala lines 1 to 84, the code lacks error handling and logging for critical operations such as reading the input file, transforming the data stream, and publishing to Pub/Sub. Add try-catch blocks or Flink-specific error handling mechanisms around file reading and stream transformations to catch and log exceptions. Also, implement logging for failures in the Pub/Sub sink setup and data publishing steps to ensure any runtime issues are captured and can be diagnosed.
88c8994 to 7552838
Actionable comments posted: 1
♻️ Duplicate comments (4)
flink/src/main/scala/ai/chronon/flink_connectors/pubsub/FlinkPubSubItemEventDriver.scala (4)
16-35: Add error handling and logging. No error handling for file read failures, transformation errors, or Pub/Sub publishing issues.
70-76: 🛠️ Refactor suggestion: Add error handling for the Pub/Sub sink.
Implement proper error handling for Pub/Sub failures.
```diff
 val pubSubSink = PubSubSink
   .newBuilder()
   .withSerializationSchema(serializationSchema)
   .withProjectName(gcpProject)
   .withTopicName(topic)
+  .withFailOnError(false) // Consider setting this based on requirements
+  // Add retry policy if supported by the connector
   .build()
+logger.info(s"Created Pub/Sub sink for topic $topic")
```
1-15: 🛠️ Refactor suggestion: Missing imports for error handling and logging.
Add imports for proper logging and exception handling.
```diff
+import org.slf4j.{Logger, LoggerFactory}
+import scala.util.{Try, Success, Failure}
```
36-84: 🛠️ Refactor suggestion: Implement logging and exception handling.
Add logging and try-catch blocks to handle failures gracefully.
```diff
 def main(args: Array[String]): Unit = {
+  val logger = LoggerFactory.getLogger(getClass)
+  logger.info("Starting Pub/Sub event driver")
+
+  try {
     val jobArgs = new JobArgs(args)
     // ... existing code
+    logger.info(s"Reading from file: $dataFileName")
+    logger.info(s"Publishing to $topic in project $gcpProject")
     // ... existing code
     // Execute program
+    logger.info("Executing Flink job")
     env.execute("Periodic PubSub Data Producer")
+  } catch {
+    case e: Exception =>
+      logger.error(s"Failed to execute Pub/Sub event driver: ${e.getMessage}", e)
+      throw e
+  }
 }
```
🧹 Nitpick comments (2)
flink/src/main/scala/ai/chronon/flink_connectors/pubsub/FlinkPubSubItemEventDriver.scala (2)
59-62: Consider making parallelism configurable. Hardcoded parallelism limits scalability.
```diff
 val stream = env
   .createInput(avroFormat)
-  .setParallelism(1)
+  .setParallelism(jobArgs.parallelism.getOrElse(1))
```
Add to JobArgs:
```scala
val parallelism: ScallopOption[Int] = opt[Int](required = false, default = Some(1))
```
86-108: Document schema fields. Add documentation explaining each field's purpose.
```diff
 lazy val avroSchema: Schema = {
+  // Define schema for event records with the following fields:
+  // - event_type: Type of event (e.g., view, click)
+  // - timestamp: Time when event occurred (in epoch milliseconds)
+  // - visitor_id: Unique identifier for the visitor
+  // - is_primary: Flag indicating primary event
+  // - Additional fields for tracking context and metadata
   new Schema.Parser().parse("""
```
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro (Legacy)
📒 Files selected for processing (21)
- api/python/ai/chronon/repo/gcp.py (4 hunks)
- api/python/test/canary/group_bys/gcp/item_event_canary.py (1 hunks)
- api/python/test/canary/teams.py (1 hunks)
- cloud_gcp/src/main/scala/ai/chronon/integrations/cloud_gcp/DataprocSubmitter.scala (4 hunks)
- cloud_gcp/src/test/scala/ai/chronon/integrations/cloud_gcp/DataprocSubmitterTest.scala (5 hunks)
- flink/BUILD.bazel (4 hunks)
- flink/src/main/scala/ai/chronon/flink/FlinkJob.scala (5 hunks)
- flink/src/main/scala/ai/chronon/flink/source/FlinkSource.scala (1 hunks)
- flink/src/main/scala/ai/chronon/flink/source/FlinkSourceProvider.scala (1 hunks)
- flink/src/main/scala/ai/chronon/flink/source/KafkaFlinkSource.scala (2 hunks)
- flink/src/main/scala/ai/chronon/flink/validation/ValidationFlinkJob.scala (5 hunks)
- flink/src/main/scala/ai/chronon/flink_connectors/pubsub/FlinkPubSubItemEventDriver.scala (1 hunks)
- flink/src/main/scala/ai/chronon/flink_connectors/pubsub/PubSubFlinkSource.scala (1 hunks)
- flink/src/test/scala/ai/chronon/flink/test/FlinkTestUtils.scala (2 hunks)
- flink/src/test/scala/ai/chronon/flink/test/source/KafkaFlinkSourceSpec.scala (1 hunks)
- flink/src/test/scala/ai/chronon/flink/validation/ValidationFlinkJobIntegrationTest.scala (2 hunks)
- flink/src/test/scala/ai/chronon/flink_connectors/pubsub/PubSubFlinkSourceSpec.scala (1 hunks)
- maven_install.json (13 hunks)
- scripts/distribution/build_and_upload_artifacts.sh (3 hunks)
- spark/src/main/scala/ai/chronon/spark/submission/JobSubmitter.scala (3 hunks)
- tools/build_rules/dependencies/maven_repository.bzl (1 hunks)
✅ Files skipped from review due to trivial changes (1)
- tools/build_rules/dependencies/maven_repository.bzl
🚧 Files skipped from review as they are similar to previous changes (19)
- api/python/test/canary/teams.py
- flink/src/test/scala/ai/chronon/flink/validation/ValidationFlinkJobIntegrationTest.scala
- flink/src/main/scala/ai/chronon/flink/source/FlinkSource.scala
- flink/src/test/scala/ai/chronon/flink_connectors/pubsub/PubSubFlinkSourceSpec.scala
- cloud_gcp/src/test/scala/ai/chronon/integrations/cloud_gcp/DataprocSubmitterTest.scala
- scripts/distribution/build_and_upload_artifacts.sh
- spark/src/main/scala/ai/chronon/spark/submission/JobSubmitter.scala
- flink/src/test/scala/ai/chronon/flink/test/source/KafkaFlinkSourceSpec.scala
- api/python/test/canary/group_bys/gcp/item_event_canary.py
- flink/src/main/scala/ai/chronon/flink/validation/ValidationFlinkJob.scala
- api/python/ai/chronon/repo/gcp.py
- flink/src/main/scala/ai/chronon/flink/FlinkJob.scala
- flink/src/test/scala/ai/chronon/flink/test/FlinkTestUtils.scala
- flink/src/main/scala/ai/chronon/flink/source/KafkaFlinkSource.scala
- maven_install.json
- flink/src/main/scala/ai/chronon/flink/source/FlinkSourceProvider.scala
- flink/src/main/scala/ai/chronon/flink_connectors/pubsub/PubSubFlinkSource.scala
- cloud_gcp/src/main/scala/ai/chronon/integrations/cloud_gcp/DataprocSubmitter.scala
- flink/BUILD.bazel
⏰ Context from checks skipped due to timeout of 90000ms (18)
- GitHub Check: service_commons_tests
- GitHub Check: online_tests
- GitHub Check: flink_tests
- GitHub Check: cloud_gcp_tests
- GitHub Check: cloud_gcp_tests
- GitHub Check: cloud_aws_tests
- GitHub Check: cloud_aws_tests
- GitHub Check: api_tests
- GitHub Check: aggregator_tests
- GitHub Check: flink_tests
- GitHub Check: api_tests
- GitHub Check: online_tests
- GitHub Check: aggregator_tests
- GitHub Check: spark_tests
- GitHub Check: fetcher_tests
- GitHub Check: spark_tests
- GitHub Check: fetcher_tests
- GitHub Check: enforce_triggered_workflows
```scala
class DelayedSourceTransformFn(delayMs: Int) extends MapFunction[GenericRecord, GenericRecord] {
  override def map(value: GenericRecord): GenericRecord = {
    val updatedTimestamp = System.currentTimeMillis()
    // Update the timestamp field in the record
    value.put("timestamp", updatedTimestamp)
    Thread.sleep(delayMs)
    value
  }
}
```
🛠️ Refactor suggestion
Avoid Thread.sleep in transformation functions.
Blocking sleep affects performance in distributed processing.
```diff
 class DelayedSourceTransformFn(delayMs: Int) extends MapFunction[GenericRecord, GenericRecord] {
+  private val logger = LoggerFactory.getLogger(getClass)
+
   override def map(value: GenericRecord): GenericRecord = {
+    try {
       val updatedTimestamp = System.currentTimeMillis()
       // Update the timestamp field in the record
       value.put("timestamp", updatedTimestamp)
       Thread.sleep(delayMs)
       value
+    } catch {
+      case e: InterruptedException =>
+        logger.warn("Sleep interrupted", e)
+        Thread.currentThread().interrupt()
+        value
+      case e: Exception =>
+        logger.error(s"Error processing record: ${e.getMessage}", e)
+        throw e
+    }
   }
 }
```
Consider alternative rate-limiting approaches.
Committable suggestion skipped: line range outside the PR's diff.
🤖 Prompt for AI Agents
In flink/src/main/scala/ai/chronon/flink_connectors/pubsub/FlinkPubSubItemEventDriver.scala around lines 111 to 119, the map function uses Thread.sleep to introduce delay, which blocks the thread and degrades performance in distributed processing. Remove the Thread.sleep call and instead implement a non-blocking rate-limiting mechanism, such as using Flink's built-in timers, event time processing, or a custom asynchronous function to control the processing rate without blocking the thread.
```python
        )
    ) + f" --flink-main-jar-uri={flink_jar_uri}"
    if enable_pubsub:
        base_formatted_args += f" --flink-pubsub-jar-uri={flink_pubsub_connector_jar_uri}"
```
btw the batch side of things doesn't adopt this architecture but we do also have the connector pattern going there. We just bring in the deps for cloud_gcp. I guess there are pro's and cons to both, curious whether you think we should converge on the same arch between batch and streaming.
yeah I debated this a bit - ended up going with pulling in the jar optionally to cut the risks of inadvertent class conflicts. This way I don't need to worry at all about any pubsub changes affecting users that have nothing to do with pubsub in their streaming jobs as we're not loading the jars..
```diff
 scala_library(
     name = "lib",
-    srcs = glob(["src/main/**/*.scala"]),
+    srcs = glob(["src/main/scala/ai/chronon/flink/**/*.scala"]),
```
👍
```scala
env.getConfig
  .enableForceKryo() // use kryo for complex types that Flink's default ser system doesn't support (e.g case classes)
env.getConfig.enableGenericTypes() // more permissive type checks
env.addDefaultKryoSerializer(classOf[Schema], classOf[AvroKryoSerializerUtils.AvroSchemaSerializer])
```
curious, but if we don't add this, does serialization break?
yeah iirc, I hit kryo issues without
* This means if the job is down for a while, we'll have a decent sized backlog to catch up on. To start afresh, a new subscription is needed.
To start afresh, a new subscription is needed
oof interesting
868fadb to 1b59a71
Actionable comments posted: 1
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (21)
- api/python/ai/chronon/repo/gcp.py (3 hunks)
- api/python/test/canary/group_bys/gcp/item_event_canary.py (1 hunks)
- api/python/test/canary/teams.py (1 hunks)
- cloud_gcp/src/main/scala/ai/chronon/integrations/cloud_gcp/DataprocSubmitter.scala (4 hunks)
- cloud_gcp/src/test/scala/ai/chronon/integrations/cloud_gcp/DataprocSubmitterTest.scala (5 hunks)
- flink/BUILD.bazel (3 hunks)
- flink/src/main/scala/ai/chronon/flink/FlinkJob.scala (5 hunks)
- flink/src/main/scala/ai/chronon/flink/source/FlinkSource.scala (1 hunks)
- flink/src/main/scala/ai/chronon/flink/source/FlinkSourceProvider.scala (1 hunks)
- flink/src/main/scala/ai/chronon/flink/source/KafkaFlinkSource.scala (2 hunks)
- flink/src/main/scala/ai/chronon/flink/validation/ValidationFlinkJob.scala (5 hunks)
- flink/src/main/scala/ai/chronon/flink_connectors/pubsub/FlinkPubSubItemEventDriver.scala (1 hunks)
- flink/src/main/scala/ai/chronon/flink_connectors/pubsub/PubSubFlinkSource.scala (1 hunks)
- flink/src/test/scala/ai/chronon/flink/test/FlinkTestUtils.scala (2 hunks)
- flink/src/test/scala/ai/chronon/flink/test/source/KafkaFlinkSourceSpec.scala (1 hunks)
- flink/src/test/scala/ai/chronon/flink/validation/ValidationFlinkJobIntegrationTest.scala (2 hunks)
- flink/src/test/scala/ai/chronon/flink_connectors/pubsub/PubSubFlinkSourceSpec.scala (1 hunks)
- maven_install.json (13 hunks)
- scripts/distribution/build_and_upload_artifacts.sh (2 hunks)
- spark/src/main/scala/ai/chronon/spark/submission/JobSubmitter.scala (3 hunks)
- tools/build_rules/dependencies/maven_repository.bzl (1 hunks)
✅ Files skipped from review due to trivial changes (1)
- flink/src/test/scala/ai/chronon/flink/test/source/KafkaFlinkSourceSpec.scala
🚧 Files skipped from review as they are similar to previous changes (19)
- flink/src/main/scala/ai/chronon/flink/source/FlinkSource.scala
- scripts/distribution/build_and_upload_artifacts.sh
- api/python/test/canary/teams.py
- tools/build_rules/dependencies/maven_repository.bzl
- flink/src/test/scala/ai/chronon/flink/validation/ValidationFlinkJobIntegrationTest.scala
- api/python/ai/chronon/repo/gcp.py
- cloud_gcp/src/test/scala/ai/chronon/integrations/cloud_gcp/DataprocSubmitterTest.scala
- flink/src/test/scala/ai/chronon/flink_connectors/pubsub/PubSubFlinkSourceSpec.scala
- flink/src/test/scala/ai/chronon/flink/test/FlinkTestUtils.scala
- flink/src/main/scala/ai/chronon/flink/source/KafkaFlinkSource.scala
- flink/src/main/scala/ai/chronon/flink/FlinkJob.scala
- spark/src/main/scala/ai/chronon/spark/submission/JobSubmitter.scala
- flink/src/main/scala/ai/chronon/flink_connectors/pubsub/PubSubFlinkSource.scala
- flink/src/main/scala/ai/chronon/flink/validation/ValidationFlinkJob.scala
- maven_install.json
- flink/src/main/scala/ai/chronon/flink_connectors/pubsub/FlinkPubSubItemEventDriver.scala
- api/python/test/canary/group_bys/gcp/item_event_canary.py
- flink/BUILD.bazel
- cloud_gcp/src/main/scala/ai/chronon/integrations/cloud_gcp/DataprocSubmitter.scala
⏰ Context from checks skipped due to timeout of 90000ms (18)
- GitHub Check: cloud_gcp_tests
- GitHub Check: service_commons_tests
- GitHub Check: online_tests
- GitHub Check: service_tests
- GitHub Check: api_tests
- GitHub Check: cloud_aws_tests
- GitHub Check: flink_tests
- GitHub Check: aggregator_tests
- GitHub Check: streaming_tests
- GitHub Check: groupby_tests
- GitHub Check: analyzer_tests
- GitHub Check: fetcher_tests
- GitHub Check: join_tests
- GitHub Check: batch_tests
- GitHub Check: python_tests
- GitHub Check: spark_tests
- GitHub Check: scala_compile_fmt_fix
- GitHub Check: enforce_triggered_workflows
🔇 Additional comments (3)
flink/src/main/scala/ai/chronon/flink/source/FlinkSourceProvider.scala (3)
1-5: Clean imports and package structure. Standard setup looks good.
6-18: Factory pattern implementation is solid. Good separation of concerns using pattern matching on message bus type.
20-27: Useful property resolution utility. Nice fallback mechanism from props to topicInfo params.
```scala
// Pubsub source is loaded via reflection as we don't want the Flink module to depend on the PubSub connector
// module as we don't want to pull in Gcp deps in contexts such as running in Aws
private def loadPubsubSource[T](props: Map[String, String],
                                deserializationSchema: DeserializationSchema[T],
                                topicInfo: TopicInfo): FlinkSource[T] = {
  val cl = Thread.currentThread().getContextClassLoader // Use Flink's classloader
  val cls = cl.loadClass("ai.chronon.flink_connectors.pubsub.PubSubFlinkSource")
  val constructor = cls.getConstructors.apply(0)
  val onlineImpl = constructor.newInstance(props, deserializationSchema, topicInfo)
  onlineImpl.asInstanceOf[FlinkSource[T]]
}
```
🛠️ Refactor suggestion
Reflection approach needs error handling.
The reflection loading is clever for dependency isolation but fragile - getConstructors.apply(0) assumes constructor order and lacks exception handling.
```diff
 private def loadPubsubSource[T](props: Map[String, String],
                                 deserializationSchema: DeserializationSchema[T],
                                 topicInfo: TopicInfo): FlinkSource[T] = {
-  val cl = Thread.currentThread().getContextClassLoader // Use Flink's classloader
-  val cls = cl.loadClass("ai.chronon.flink_connectors.pubsub.PubSubFlinkSource")
-  val constructor = cls.getConstructors.apply(0)
-  val onlineImpl = constructor.newInstance(props, deserializationSchema, topicInfo)
-  onlineImpl.asInstanceOf[FlinkSource[T]]
+  try {
+    val cl = Thread.currentThread().getContextClassLoader
+    val cls = cl.loadClass("ai.chronon.flink_connectors.pubsub.PubSubFlinkSource")
+    val constructor = cls.getConstructor(classOf[Map[String, String]], classOf[DeserializationSchema[T]], classOf[TopicInfo])
+    val onlineImpl = constructor.newInstance(props, deserializationSchema, topicInfo)
+    onlineImpl.asInstanceOf[FlinkSource[T]]
+  } catch {
+    case e: Exception => throw new RuntimeException("Failed to load PubSub source", e)
+  }
 }
```
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
```scala
private def loadPubsubSource[T](props: Map[String, String],
                                deserializationSchema: DeserializationSchema[T],
                                topicInfo: TopicInfo): FlinkSource[T] = {
  try {
    val cl = Thread.currentThread().getContextClassLoader
    val cls = cl.loadClass("ai.chronon.flink_connectors.pubsub.PubSubFlinkSource")
    val constructor = cls.getConstructor(
      classOf[Map[String, String]],
      classOf[DeserializationSchema[T]],
      classOf[TopicInfo]
    )
    val onlineImpl = constructor.newInstance(props, deserializationSchema, topicInfo)
    onlineImpl.asInstanceOf[FlinkSource[T]]
  } catch {
    case e: Exception =>
      throw new RuntimeException("Failed to load PubSub source", e)
  }
}
```
🤖 Prompt for AI Agents
In flink/src/main/scala/ai/chronon/flink/source/FlinkSourceProvider.scala around lines 29 to 39, the reflection code uses getConstructors.apply(0) without verifying constructor order and lacks error handling. To fix this, add proper exception handling around the reflection calls to catch and handle ClassNotFoundException, NoSuchMethodException, InstantiationException, IllegalAccessException, and InvocationTargetException. Also, instead of blindly using the first constructor, explicitly find the constructor matching the expected parameter types to avoid relying on constructor order.
Summary
This PR adds a Google Pub/Sub Flink source. Some aspects to call out:
Some charts:
Publishing of messages to PubSub using the event driver Flink app:
Writes to BT from the gcp.item_event_canary.actions_pubsub Flink app:
Checklist
Summary by CodeRabbit
New Features
Bug Fixes
Refactor
Tests
Chores