Move GCP args from DataprocSubmitter to CLI #915
Walkthrough

The changes update the Spark job argument construction in the Python GcpRunner to append explicit GCP flags. The Scala DataprocSubmitter removes these flags from its argument list. A new constant for a configuration path argument is added in another Scala file. The build script drops the old wheel upload. No public API signatures are changed.
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant User
    participant Runner as GcpRunner (Python)
    participant Submitter as DataprocSubmitter (Scala)
    participant Job as Spark Job
    User->>Runner: Submit Spark job
    Runner->>Submitter: Pass args with explicit GCP flags
    Submitter->>Job: Launch with filtered args (no GCP flags)
```
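The flag-appending step above can be sketched roughly as follows. This is a minimal illustration, not the actual GcpRunner code; the flag names and accessor values mirror those quoted later in the review, and the helper name is hypothetical:

```python
# Hypothetical sketch of appending explicit GCP flags on the CLI side,
# so the Scala DataprocSubmitter no longer has to inject them itself.
def append_gcp_flags(base_args: list, project_id: str,
                     bigtable_instance_id: str) -> list:
    # Flags named in the review: --is-gcp, --gcp-project-id,
    # --gcp-bigtable-instance-id.
    return base_args + [
        "--is-gcp",
        f"--gcp-project-id={project_id}",
        f"--gcp-bigtable-instance-id={bigtable_instance_id}",
    ]

args = append_gcp_flags(["run", "--conf=conf/team/job"],
                        "my-project", "my-instance")
print(" ".join(args))
```

Keeping the flags explicit at the call site means the submitter only needs to filter them out, as the diagram shows.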
Actionable comments posted: 1
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (3)
- api/python/ai/chronon/repo/gcp.py (1 hunks)
- cloud_gcp/src/main/scala/ai/chronon/integrations/cloud_gcp/DataprocSubmitter.scala (1 hunks)
- spark/src/main/scala/ai/chronon/spark/submission/JobSubmitter.scala (1 hunks)
🧰 Additional context used
🧠 Learnings (4)
📓 Common learnings
Learnt from: david-zlai
PR: zipline-ai/chronon#222
File: cloud_gcp/src/main/resources/additional-confs.yaml:3-3
Timestamp: 2025-01-15T21:00:35.574Z
Learning: The GCS bucket configuration `spark.chronon.table.gcs.temporary_gcs_bucket: "zl-warehouse"` should remain in the main `additional-confs.yaml` file, not in dev-specific configs.
Learnt from: nikhil-zlai
PR: zipline-ai/chronon#70
File: service/src/main/java/ai/chronon/service/ApiProvider.java:6-6
Timestamp: 2024-12-03T04:04:33.809Z
Learning: The import `scala.util.ScalaVersionSpecificCollectionsConverter` in `service/src/main/java/ai/chronon/service/ApiProvider.java` is correct and should not be flagged in future reviews.
Learnt from: piyush-zlai
PR: zipline-ai/chronon#33
File: online/src/main/scala/ai/chronon/online/Api.scala:46-50
Timestamp: 2024-10-07T15:17:18.494Z
Learning: When adding new parameters with default values to Scala case classes like `GetRequest`, existing usages don't need updating if backward compatibility is intended.
Learnt from: piyush-zlai
PR: zipline-ai/chronon#33
File: online/src/main/scala/ai/chronon/online/Api.scala:46-50
Timestamp: 2024-10-08T16:18:45.669Z
Learning: When adding new parameters with default values to Scala case classes like `GetRequest`, existing usages don't need updating if backward compatibility is intended.
cloud_gcp/src/main/scala/ai/chronon/integrations/cloud_gcp/DataprocSubmitter.scala (4)
Learnt from: nikhil-zlai
PR: zipline-ai/chronon#70
File: service/src/main/java/ai/chronon/service/ApiProvider.java:6-6
Timestamp: 2024-12-03T04:04:33.809Z
Learning: The import `scala.util.ScalaVersionSpecificCollectionsConverter` in `service/src/main/java/ai/chronon/service/ApiProvider.java` is correct and should not be flagged in future reviews.
Learnt from: david-zlai
PR: zipline-ai/chronon#222
File: cloud_gcp/src/main/resources/additional-confs.yaml:3-3
Timestamp: 2025-01-15T21:00:35.574Z
Learning: The GCS bucket configuration `spark.chronon.table.gcs.temporary_gcs_bucket: "zl-warehouse"` should remain in the main `additional-confs.yaml` file, not in dev-specific configs.
Learnt from: david-zlai
PR: zipline-ai/chronon#292
File: distribution/run_zipline_quickstart.sh:72-75
Timestamp: 2025-01-30T19:37:27.646Z
Learning: The `gcloud dataproc jobs wait` command is a blocking call that waits for job completion, so adding an external timeout wrapper is unnecessary.
Learnt from: tchow-zlai
PR: zipline-ai/chronon#263
File: cloud_gcp/src/main/scala/ai/chronon/integrations/cloud_gcp/BigQueryFormat.scala:56-57
Timestamp: 2025-01-24T23:55:40.650Z
Learning: For BigQuery table creation operations in BigQueryFormat.scala, allow exceptions to propagate directly without wrapping them in try-catch blocks, as the original BigQuery exceptions provide sufficient context.
spark/src/main/scala/ai/chronon/spark/submission/JobSubmitter.scala (1)
Learnt from: nikhil-zlai
PR: zipline-ai/chronon#70
File: service/src/main/java/ai/chronon/service/ApiProvider.java:6-6
Timestamp: 2024-12-03T04:04:33.809Z
Learning: The import `scala.util.ScalaVersionSpecificCollectionsConverter` in `service/src/main/java/ai/chronon/service/ApiProvider.java` is correct and should not be flagged in future reviews.
api/python/ai/chronon/repo/gcp.py (2)
Learnt from: david-zlai
PR: zipline-ai/chronon#222
File: cloud_gcp/src/main/resources/additional-confs.yaml:3-3
Timestamp: 2025-01-15T21:00:35.574Z
Learning: The GCS bucket configuration `spark.chronon.table.gcs.temporary_gcs_bucket: "zl-warehouse"` should remain in the main `additional-confs.yaml` file, not in dev-specific configs.
Learnt from: chewy-zlai
PR: zipline-ai/chronon#789
File: api/python/ai/chronon/repo/cluster.py:16-16
Timestamp: 2025-05-23T22:52:27.605Z
Learning: The project uses terraform to provision the required "dataproc@${project_id}.iam.gserviceaccount.com" service account, making hardcoded service account patterns in Dataproc configurations reliable and acceptable.
⏰ Context from checks skipped due to timeout of 90000ms (18)
- GitHub Check: fetcher_tests
- GitHub Check: join_tests
- GitHub Check: batch_tests
- GitHub Check: spark_tests
- GitHub Check: groupby_tests
- GitHub Check: streaming_tests
- GitHub Check: analyzer_tests
- GitHub Check: cloud_gcp_tests
- GitHub Check: scala_compile_fmt_fix
- GitHub Check: online_tests
- GitHub Check: cloud_aws_tests
- GitHub Check: service_commons_tests
- GitHub Check: service_tests
- GitHub Check: api_tests
- GitHub Check: aggregator_tests
- GitHub Check: flink_tests
- GitHub Check: python_tests
- GitHub Check: enforce_triggered_workflows
🔇 Additional comments (2)
cloud_gcp/src/main/scala/ai/chronon/integrations/cloud_gcp/DataprocSubmitter.scala (1)
443-444: Good coordination planning. The TODO aligns with the Python CLI changes that now construct explicit GCP flags.
spark/src/main/scala/ai/chronon/spark/submission/JobSubmitter.scala (1)
175-175: LGTM. The new constant follows established naming conventions.
Actionable comments posted: 0
♻️ Duplicate comments (1)
api/python/ai/chronon/repo/gcp.py (1)
311-311: Fix malformed argument string. The extra `--` after the bigtable instance ID will cause argument parsing issues.

```diff
- f"--gcp-bigtable-instance-id={GcpRunner.get_gcp_bigtable_instance_id()}--",
+ f"--gcp-bigtable-instance-id={GcpRunner.get_gcp_bigtable_instance_id()}",
```
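Why the stray `--` matters can be shown with a small, self-contained demonstration (using Python's `argparse` as a stand-in parser; the actual Scala-side argument parsing may behave differently, but any `--flag=value` parser will swallow the suffix into the value):

```python
import argparse

# Stand-in parser for the CLI flag named in the review.
parser = argparse.ArgumentParser()
parser.add_argument("--gcp-bigtable-instance-id")

# With the stray "--" suffix, the suffix becomes part of the value,
# yielding a wrong instance ID rather than a parse error.
bad = parser.parse_args(["--gcp-bigtable-instance-id=my-instance--"])

good = parser.parse_args(["--gcp-bigtable-instance-id=my-instance"])
```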
🧹 Nitpick comments (1)
api/python/ai/chronon/repo/gcp.py (1)
310-312: Fix indentation alignment. The hanging indent is misaligned.

```diff
      ), "--is-gcp",
-     f"--gcp-project-id={GcpRunner.get_gcp_project_id()}",
-     f"--gcp-bigtable-instance-id={GcpRunner.get_gcp_bigtable_instance_id()}--",
-     f" --files={gcs_file_args}" if gcs_file_args else "",
+     f"--gcp-project-id={GcpRunner.get_gcp_project_id()}",
+     f"--gcp-bigtable-instance-id={GcpRunner.get_gcp_bigtable_instance_id()}",
+     f" --files={gcs_file_args}" if gcs_file_args else "",
```
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
- api/python/ai/chronon/repo/gcp.py (1 hunks)
🪛 Flake8 (7.2.0)
api/python/ai/chronon/repo/gcp.py
[error] 310-310: continuation line unaligned for hanging indent
(E131)
Actionable comments posted: 0
🧹 Nitpick comments (1)
api/python/ai/chronon/repo/gcp.py (1)
301-314: Fix indentation and improve formatting. The logic correctly implements moving GCP args to the CLI, but there's a formatting issue.

```diff
- return " ".join([
-     final_args.format(
-         user_args=user_args,
-         jar_uri=jar_uri,
-         job_type=job_type.value,
-         main_class=main_class,
-         zipline_version=self._version,
-         job_id=self.job_id,
-     ), "--is-gcp",
-     f"--gcp-project-id={GcpRunner.get_gcp_project_id()}",
-     f"--gcp-bigtable-instance-id={GcpRunner.get_gcp_bigtable_instance_id()}",
-     f"--files={gcs_file_args}" if gcs_file_args else "",
- ]
- )
+ return " ".join([
+     final_args.format(
+         user_args=user_args,
+         jar_uri=jar_uri,
+         job_type=job_type.value,
+         main_class=main_class,
+         zipline_version=self._version,
+         job_id=self.job_id,
+     ),
+     "--is-gcp",
+     f"--gcp-project-id={GcpRunner.get_gcp_project_id()}",
+     f"--gcp-bigtable-instance-id={GcpRunner.get_gcp_bigtable_instance_id()}",
+     f"--files={gcs_file_args}" if gcs_file_args else "",
+ ])
```
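One subtlety in the joined argument list worth noting: when `gcs_file_args` is empty, the conditional expression yields `""`, and joining with `" "` then leaves stray whitespace. A small sketch (hypothetical helper with a placeholder project ID, not code from the PR) shows filtering empties before joining:

```python
# Hypothetical illustration of the join pattern reviewed above.
# Filtering falsy entries avoids a trailing/doubled space when the
# optional --files flag is absent.
def build_args(gcs_file_args: str) -> str:
    parts = [
        "--is-gcp",
        "--gcp-project-id=my-project",  # placeholder value
        f"--files={gcs_file_args}" if gcs_file_args else "",
    ]
    return " ".join(p for p in parts if p)

print(build_args(""))
print(build_args("gs://bucket/conf.yaml"))
```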
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (2)
- api/python/ai/chronon/repo/gcp.py (1 hunks)
- scripts/distribution/build_and_upload_artifacts.sh (0 hunks)
💤 Files with no reviewable changes (1)
- scripts/distribution/build_and_upload_artifacts.sh
🧰 Additional context used
🧠 Learnings (2)
📓 Common learnings
Learnt from: david-zlai
PR: zipline-ai/chronon#222
File: cloud_gcp/src/main/resources/additional-confs.yaml:3-3
Timestamp: 2025-01-15T21:00:35.574Z
Learning: The GCS bucket configuration `spark.chronon.table.gcs.temporary_gcs_bucket: "zl-warehouse"` should remain in the main `additional-confs.yaml` file, not in dev-specific configs.
Learnt from: chewy-zlai
PR: zipline-ai/chronon#789
File: api/python/ai/chronon/repo/cluster.py:16-16
Timestamp: 2025-05-23T22:52:27.605Z
Learning: The project uses terraform to provision the required "dataproc@${project_id}.iam.gserviceaccount.com" service account, making hardcoded service account patterns in Dataproc configurations reliable and acceptable.
Learnt from: piyush-zlai
PR: zipline-ai/chronon#33
File: online/src/main/scala/ai/chronon/online/Api.scala:46-50
Timestamp: 2024-10-07T15:17:18.494Z
Learning: When adding new parameters with default values to Scala case classes like `GetRequest`, existing usages don't need updating if backward compatibility is intended.
Learnt from: piyush-zlai
PR: zipline-ai/chronon#33
File: online/src/main/scala/ai/chronon/online/Api.scala:46-50
Timestamp: 2024-10-08T16:18:45.669Z
Learning: When adding new parameters with default values to Scala case classes like `GetRequest`, existing usages don't need updating if backward compatibility is intended.
Learnt from: tchow-zlai
PR: zipline-ai/chronon#393
File: cloud_gcp/BUILD.bazel:99-99
Timestamp: 2025-02-22T20:30:28.381Z
Learning: The jar file "iceberg-bigquery-catalog-1.5.2-1.0.1-beta.jar" in cloud_gcp/BUILD.bazel is a local dependency and should not be replaced with maven_artifact.
api/python/ai/chronon/repo/gcp.py (3)
Learnt from: david-zlai
PR: zipline-ai/chronon#222
File: cloud_gcp/src/main/resources/additional-confs.yaml:3-3
Timestamp: 2025-01-15T21:00:35.574Z
Learning: The GCS bucket configuration `spark.chronon.table.gcs.temporary_gcs_bucket: "zl-warehouse"` should remain in the main `additional-confs.yaml` file, not in dev-specific configs.
Learnt from: chewy-zlai
PR: zipline-ai/chronon#789
File: api/python/ai/chronon/repo/cluster.py:16-16
Timestamp: 2025-05-23T22:52:27.605Z
Learning: The project uses terraform to provision the required "dataproc@${project_id}.iam.gserviceaccount.com" service account, making hardcoded service account patterns in Dataproc configurations reliable and acceptable.
Learnt from: chewy-zlai
PR: zipline-ai/chronon#46
File: docker-init/generate_anomalous_data.py:0-0
Timestamp: 2024-10-15T19:03:19.403Z
Learning: If `generate_anomalous_data.py` contains unintended changes due to accidental commits, disregard reviewing this file unless instructed otherwise.
🪛 Flake8 (7.2.0)
api/python/ai/chronon/repo/gcp.py
[error] 310-310: continuation line unaligned for hanging indent
(E131)