Conversation


@david-zlai david-zlai commented Apr 16, 2025

Summary

^^^

Checklist

  • Added Unit Tests
  • Covered by existing CI
  • Integration tested
  • Documentation update

Summary by CodeRabbit

  • New Features

    • Added CLI options to specify custom artifact and warehouse bucket locations for both AWS and GCP.
    • Environment variables for artifact and warehouse buckets are now supported and can be set for cloud operations.
  • Improvements

    • Enhanced validation for bucket names to ensure correct cloud storage URI prefixes.
    • Centralized and simplified bucket management for cloud runners, reducing configuration errors.
  • Bug Fixes

    • Improved compatibility with cloud APIs by properly handling storage URI prefixes.
  • Documentation

    • Updated CLI help text to reflect new bucket configuration options.
  • Tests

    • Updated test configurations to include new environment variables for bucket locations.
  • Other

    • Disabled two optional fields in Thrift data structures.


coderabbitai bot commented Apr 16, 2025

Walkthrough

This update centralizes the management of cloud storage bucket names for both AWS and GCP runners by introducing new constants and enforcing validation on bucket name prefixes. The runners now store bucket names as instance variables, removing the need for dynamic construction using customer IDs. Method signatures for downloading JARs have been simplified to accept explicit bucket names. The CLI adds options for specifying artifact and warehouse buckets. Test environment configurations and Thrift struct definitions are also updated to reflect these changes.

Changes

File(s) Change Summary
api/python/ai/chronon/repo/aws.py, api/python/ai/chronon/repo/gcp.py Refactored runner constructors to store args, validate bucket name prefixes, and set bucket names as instance variables. Updated JAR download methods to require explicit bucket names. Removed dynamic bucket construction using customer IDs. Adjusted internal logic and error handling accordingly.
api/python/ai/chronon/repo/constants.py Added constants for artifact/warehouse bucket environment variable keys and cloud storage URI prefixes (GCS_PREFIX, S3_PREFIX).
api/python/ai/chronon/repo/default_runner.py Changed Runner constructor to remove jar_path parameter; added logic to initialize bucket names from args or environment. Introduced set_jar_path() method.
api/python/ai/chronon/repo/run.py Added CLI options for specifying artifact and warehouse buckets. Updated main function signature and instantiation logic for runners to align with new bucket management. Refined imports and removed trailing blank line.
api/python/test/canary/teams.py Set new environment variables for artifact and warehouse buckets in test team configurations for both GCP and AWS.
api/thrift/agent.thrift Commented out two optional fields: args in YarnJob and end in DatePartitionRange structs.
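The constants change described above can be sketched roughly as follows. The constant names (GCS_PREFIX, S3_PREFIX, and the bucket env-key constants) come from the PR summary; the env-var string values are assumptions for illustration.

```python
# Hypothetical sketch of the additions to api/python/ai/chronon/repo/constants.py.
# Cloud storage URI prefixes used to validate configured bucket names.
GCS_PREFIX = "gs://"
S3_PREFIX = "s3://"

# Environment-variable keys consulted when the CLI flags are omitted.
# The string values here are assumptions; only the constant names appear in the PR.
ZIPLINE_ARTIFACTS_BUCKET_ENV_KEY = "ZIPLINE_ARTIFACTS_BUCKET"
ZIPLINE_WAREHOUSE_BUCKET_ENV_KEY = "ZIPLINE_WAREHOUSE_BUCKET"
```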

Sequence Diagram(s)

sequenceDiagram
    participant User
    participant CLI
    participant Runner
    participant CloudRunner

    User->>CLI: Provide --zipline-artifacts-bucket and --zipline-warehouse-bucket
    CLI->>Runner: Pass bucket names via args
    Runner->>CloudRunner: Initialize with bucket names
    CloudRunner->>CloudRunner: Validate bucket name prefix
    CloudRunner->>CloudRunner: Set jar path
    CloudRunner->>CloudRunner: Use bucket names directly for JAR/file operations

Suggested reviewers

  • tchow-zlai

Poem

Buckets aligned, with prefixes checked,
Artifacts and warehouses, no longer wrecked.
CLI options bloom,
Old fields swept with a broom,
Now code and clouds connect!
☁️🪣✨

Warning

Review ran into problems

🔥 Problems

GitHub Actions and Pipeline Checks: Resource not accessible by integration - https://docs.github.com/rest/actions/workflow-runs#list-workflow-runs-for-a-repository.

Please grant the required permissions to the CodeRabbit GitHub App under the organization or repository settings.




@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 3

🔭 Outside diff range comments (2)
api/python/ai/chronon/repo/aws.py (1)

61-72: ⚠️ Potential issue

Prefix handling in S3 operations
upload_file, download_file, head_object all need the plain bucket name. Quick fix: strip once at call‑site.

-        obj.upload_file(source_file_name, bucket_name, destination_blob_name)
+        stripped = bucket_name[len(S3_PREFIX):] if bucket_name.startswith(S3_PREFIX) else bucket_name
+        obj.upload_file(source_file_name, stripped, destination_blob_name)

Do the same in download_zipline_aws_jar() and get_s3_file_hash().

api/python/ai/chronon/repo/gcp.py (1)

80-90: ⚠️ Potential issue

download_gcs_blob still passes prefixed bucket
Add the same gs:// stripping logic used elsewhere:

-            bucket = storage_client.bucket(bucket_name)
+            if bucket_name.startswith(GCS_PREFIX):
+                bucket_name = bucket_name[len(GCS_PREFIX):]
+            bucket = storage_client.bucket(bucket_name)
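Both the S3 and GCS fixes above strip the same kind of URI prefix at each call site. A small shared helper would avoid repeating that logic; this is a sketch, not code from the PR, and the function name is illustrative.

```python
def strip_bucket_prefix(bucket_name: str) -> str:
    """Return the bare bucket name expected by boto3 and google-cloud-storage,
    removing a leading "s3://" or "gs://" if present; other names pass through."""
    for prefix in ("s3://", "gs://"):
        if bucket_name.startswith(prefix):
            return bucket_name[len(prefix):]
    return bucket_name
```

Callers such as `download_zipline_aws_jar()` or `download_gcs_blob` would then pass `strip_bucket_prefix(bucket_name)` to the SDK while keeping the prefixed form for validation and display.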
🧹 Nitpick comments (4)
api/thrift/agent.thrift (2)

36-37: Consider documenting the reason for commenting out.

Rather than commenting out the field, add an explanation or consider proper deprecation method.

-    // 10: optional list<string> args
+    // 10: optional list<string> args - Deprecated: args now passed through environment variables and runner configuration

126-127: Consider documenting the reason for commenting out.

Similar to above, add context for why this field is being disabled.

-    // 2: optional string end
+    // 2: optional string end - Deprecated: end date handling refactored
api/python/ai/chronon/repo/run.py (2)

185-190: Add env‑fallback & type guard for new bucket flags
CLI flags are added but not backed by type=str, defaults or set_defaults(). Users omitting the flag but setting the env var will still see None in ctx.params, breaking downstream validation.
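The CLI-or-environment fallback the comment asks for can be expressed as a tiny resolver (click options also support an `envvar=` argument that does this natively, which may be the cleaner fix). This is a sketch under the assumption that the runner resolves buckets itself; the function name is illustrative.

```python
import os

def resolve_bucket(cli_value, env_key):
    """Prefer an explicit CLI value; otherwise fall back to the named
    environment variable. Returns None if neither is set, which callers
    should treat as a configuration error."""
    return cli_value or os.environ.get(env_key)
```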


243-245: Use the canonical setter
Directly mutating default_runner.jar_path bypasses any validation encapsulated in Runner.set_jar_path().

-            default_runner.jar_path = os.path.expanduser(chronon_jar)
+            default_runner.set_jar_path(os.path.expanduser(chronon_jar))
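Routing assignments through the setter pays off once validation or normalization lives there. A minimal sketch of what such a setter could look like (the real `Runner.set_jar_path` may simply assign; folding `expanduser` into it is an assumption):

```python
import os

class Runner:
    def __init__(self):
        # jar_path starts unset and is populated via the setter after construction.
        self.jar_path = None

    def set_jar_path(self, jar_path: str) -> None:
        # Centralizing the assignment lets normalization (e.g. ~ expansion)
        # and future existence checks live in one place.
        self.jar_path = os.path.expanduser(jar_path)
```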
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro (Legacy)

📥 Commits

Reviewing files that changed from the base of the PR and between 3cfd5eb and 345a5c7.

📒 Files selected for processing (7)
  • api/python/ai/chronon/repo/aws.py (5 hunks)
  • api/python/ai/chronon/repo/constants.py (1 hunks)
  • api/python/ai/chronon/repo/default_runner.py (2 hunks)
  • api/python/ai/chronon/repo/gcp.py (7 hunks)
  • api/python/ai/chronon/repo/run.py (4 hunks)
  • api/python/test/canary/teams.py (3 hunks)
  • api/thrift/agent.thrift (2 hunks)
🧰 Additional context used
🧬 Code Graph Analysis (2)
api/python/ai/chronon/repo/default_runner.py (1)
api/python/ai/chronon/repo/constants.py (1)
  • RunMode (4-30)
api/python/ai/chronon/repo/gcp.py (1)
api/python/ai/chronon/repo/default_runner.py (1)
  • set_jar_path (106-107)
🔇 Additional comments (6)
api/python/ai/chronon/repo/constants.py (1)

160-164: Good addition of configurable bucket constants.

The addition of bucket environment keys and URI prefixes supports the PR objective of removing hardcoded bucket references.

api/python/test/canary/teams.py (2)

60-61: Proper bucket configuration for GCP team.

GCS bucket URLs correctly use the "gs://" prefix.


107-108: Proper bucket configuration for AWS team.

S3 bucket URLs correctly use the "s3://" prefix.

api/python/ai/chronon/repo/default_runner.py (3)

15-17: Appropriate constant imports.

New imports support the bucket name configuration.


22-24: Constructor refactored to initialize jar_path as None.

This aligns with the new pattern of setting jar_path after initialization.


106-108: Good addition of setter method.

Setter enables setting jar_path after initialization, supporting refactored construction pattern.

Comment on lines +101 to +105
self.zipline_artifacts_bucket = (args.get("zipline_artifacts_bucket")
or os.environ.get(ZIPLINE_ARTIFACTS_BUCKET_ENV_KEY))
self.zipline_warehouse_bucket = (args.get("zipline_warehouse_bucket")
or os.environ.get(ZIPLINE_WAREHOUSE_BUCKET_ENV_KEY))


🛠️ Refactor suggestion

Add bucket name validation.

Validate that the bucket names have correct prefixes (S3_PREFIX or GCS_PREFIX).

 self.zipline_artifacts_bucket = (args.get("zipline_artifacts_bucket")
                                  or os.environ.get(ZIPLINE_ARTIFACTS_BUCKET_ENV_KEY))
 self.zipline_warehouse_bucket = (args.get("zipline_warehouse_bucket")
                                  or os.environ.get(ZIPLINE_WAREHOUSE_BUCKET_ENV_KEY))
+
+if self.zipline_artifacts_bucket and not (self.zipline_artifacts_bucket.startswith(S3_PREFIX) or 
+                                          self.zipline_artifacts_bucket.startswith(GCS_PREFIX)):
+    raise ValueError(f"Artifacts bucket must start with {S3_PREFIX} or {GCS_PREFIX}")
+
+if self.zipline_warehouse_bucket and not (self.zipline_warehouse_bucket.startswith(S3_PREFIX) or 
+                                          self.zipline_warehouse_bucket.startswith(GCS_PREFIX)):
+    raise ValueError(f"Warehouse bucket must start with {S3_PREFIX} or {GCS_PREFIX}")

Comment on lines +38 to +44
for bucket in [self.zipline_artifacts_bucket, self.zipline_warehouse_bucket]:
if not bucket.startswith(S3_PREFIX):
raise ValueError(
f"Invalid bucket name: {bucket}. "
f"Bucket names must start with '{S3_PREFIX}'."
)


⚠️ Potential issue

Guard None and strip s3:// before AWS SDK calls
bucket may be None, and boto3 expects bare bucket names (no s3://). bucket.startswith() on None raises, and passing the prefixed value to boto3 will fail.

-        for bucket in [self.zipline_artifacts_bucket, self.zipline_warehouse_bucket]:
-            if not bucket.startswith(S3_PREFIX):
+        for bucket in [self.zipline_artifacts_bucket, self.zipline_warehouse_bucket]:
+            if bucket is None:
+                raise ValueError("Zipline bucket not provided.")
+            if not bucket.startswith(S3_PREFIX):
                 raise ValueError(
                     f"Invalid bucket name: {bucket}. "
                     f"Bucket names must start with '{S3_PREFIX}'."
                 )
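The guard above can be pulled into a self-contained helper for testing; this is a sketch (the PR validates inline in the runner constructor, and the function name is illustrative).

```python
S3_PREFIX = "s3://"  # mirrors the constant in api/python/ai/chronon/repo/constants.py

def validate_s3_buckets(*buckets):
    """Fail fast on missing or wrongly-prefixed bucket names.

    Guards None before calling .startswith() so a missing bucket raises a
    clear ValueError instead of an AttributeError.
    """
    for bucket in buckets:
        if bucket is None:
            raise ValueError("Zipline bucket not provided.")
        if not bucket.startswith(S3_PREFIX):
            raise ValueError(
                f"Invalid bucket name: {bucket}. "
                f"Bucket names must start with '{S3_PREFIX}'."
            )
```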

Comment on lines +38 to +44
# Validate bucket names start with "gs://"
for bucket in [self.zipline_artifacts_bucket, self.zipline_warehouse_bucket]:
if not bucket.startswith(GCS_PREFIX):
raise ValueError(
f"Invalid bucket name: {bucket}. "
f"Bucket names must start with '{GCS_PREFIX}'."
)

🛠️ Refactor suggestion

Same None / prefix issue as AWS
Handle missing buckets before .startswith() to avoid AttributeError.

@david-zlai david-zlai closed this Apr 16, 2025
@david-zlai david-zlai deleted the davidhan/refactor_gcp_run branch May 12, 2025 19:36

2 participants