Conversation


@david-zlai david-zlai commented Apr 16, 2025

Summary

^^^

Checklist

  • Added Unit Tests
  • Covered by existing CI
  • Integration tested
  • Documentation update

Summary by CodeRabbit

  • New Features

    • Added CLI options to specify custom artifact and warehouse bucket locations for both AWS and GCP.
    • Environment variables for artifact and warehouse buckets are now supported and can be set for cloud operations.
  • Improvements

    • Enhanced validation for bucket names to ensure correct cloud storage URI prefixes.
    • Centralized and simplified bucket management for cloud runners, reducing configuration errors.
  • Bug Fixes

    • Improved compatibility with cloud APIs by properly handling storage URI prefixes.
  • Documentation

    • Updated CLI help text to reflect new bucket configuration options.
  • Tests

    • Updated test configurations to include new environment variables for bucket locations.
  • Other

    • Disabled two optional fields in Thrift data structures.


coderabbitai bot commented Apr 16, 2025

Walkthrough

This update centralizes the management of cloud storage bucket names for both AWS and GCP runners by introducing new constants and enforcing validation on bucket name prefixes. The runners now store bucket names as instance variables, removing the need for dynamic construction using customer IDs. Method signatures for downloading JARs have been simplified to accept explicit bucket names. The CLI adds options for specifying artifact and warehouse buckets. Test environment configurations and Thrift struct definitions are also updated to reflect these changes.

Changes

File(s) Change Summary
api/python/ai/chronon/repo/aws.py, api/python/ai/chronon/repo/gcp.py Refactored runner constructors to store args, validate bucket name prefixes, and set bucket names as instance variables. Updated JAR download methods to require explicit bucket names. Removed dynamic bucket construction using customer IDs. Adjusted internal logic and error handling accordingly.
api/python/ai/chronon/repo/constants.py Added constants for artifact/warehouse bucket environment variable keys and cloud storage URI prefixes (GCS_PREFIX, S3_PREFIX).
api/python/ai/chronon/repo/default_runner.py Changed Runner constructor to remove jar_path parameter; added logic to initialize bucket names from args or environment. Introduced set_jar_path() method.
api/python/ai/chronon/repo/run.py Added CLI options for specifying artifact and warehouse buckets. Updated main function signature and instantiation logic for runners to align with new bucket management. Refined imports and removed trailing blank line.
api/python/test/canary/teams.py Set new environment variables for artifact and warehouse buckets in test team configurations for both GCP and AWS.
api/thrift/agent.thrift Commented out two optional fields: args in YarnJob and end in DatePartitionRange structs.
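The constants change described above can be sketched roughly as follows. The constant names (GCS_PREFIX, S3_PREFIX, and the bucket env-key constants) come from the PR summary; the env-var string values are assumptions for illustration.

```python
# Hypothetical sketch of the additions to api/python/ai/chronon/repo/constants.py.
# Cloud storage URI prefixes used to validate configured bucket names.
GCS_PREFIX = "gs://"
S3_PREFIX = "s3://"

# Environment-variable keys consulted when the CLI flags are omitted.
# The string values here are assumptions; only the constant names appear in the PR.
ZIPLINE_ARTIFACTS_BUCKET_ENV_KEY = "ZIPLINE_ARTIFACTS_BUCKET"
ZIPLINE_WAREHOUSE_BUCKET_ENV_KEY = "ZIPLINE_WAREHOUSE_BUCKET"
```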

Sequence Diagram(s)

sequenceDiagram
    participant User
    participant CLI
    participant Runner
    participant CloudRunner

    User->>CLI: Provide --zipline-artifacts-bucket and --zipline-warehouse-bucket
    CLI->>Runner: Pass bucket names via args
    Runner->>CloudRunner: Initialize with bucket names
    CloudRunner->>CloudRunner: Validate bucket name prefix
    CloudRunner->>CloudRunner: Set jar path
    CloudRunner->>CloudRunner: Use bucket names directly for JAR/file operations

Suggested reviewers

  • tchow-zlai

Poem

Buckets aligned, with prefixes checked,
Artifacts and warehouses, no longer wrecked.
CLI options bloom,
Old fields swept with a broom,
Now code and clouds connect!
☁️🪣✨

Warning

Review ran into problems

🔥 Problems

GitHub Actions and Pipeline Checks: Resource not accessible by integration - https://docs.github.com/rest/actions/workflow-runs#list-workflow-runs-for-a-repository.

Please grant the required permissions to the CodeRabbit GitHub App under the organization or repository settings.




@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 3

🔭 Outside diff range comments (2)
api/python/ai/chronon/repo/aws.py (1)

61-72: ⚠️ Potential issue

Prefix handling in S3 operations
upload_file, download_file, head_object all need the plain bucket name. Quick fix: strip once at call‑site.

-        obj.upload_file(source_file_name, bucket_name, destination_blob_name)
+        stripped = bucket_name[len(S3_PREFIX):] if bucket_name.startswith(S3_PREFIX) else bucket_name
+        obj.upload_file(source_file_name, stripped, destination_blob_name)

Do the same in download_zipline_aws_jar() and get_s3_file_hash().

api/python/ai/chronon/repo/gcp.py (1)

80-90: ⚠️ Potential issue

download_gcs_blob still passes prefixed bucket
Add the same gs:// stripping logic used elsewhere:

-            bucket = storage_client.bucket(bucket_name)
+            if bucket_name.startswith(GCS_PREFIX):
+                bucket_name = bucket_name[len(GCS_PREFIX):]
+            bucket = storage_client.bucket(bucket_name)
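Both the S3 and GCS fixes above strip the same kind of URI prefix at each call site. A small shared helper would avoid repeating that logic; this is a sketch, not code from the PR, and the function name is illustrative.

```python
def strip_bucket_prefix(bucket_name: str) -> str:
    """Return the bare bucket name expected by boto3 and google-cloud-storage,
    removing a leading "s3://" or "gs://" if present; other names pass through."""
    for prefix in ("s3://", "gs://"):
        if bucket_name.startswith(prefix):
            return bucket_name[len(prefix):]
    return bucket_name
```

Callers such as `download_zipline_aws_jar()` or `download_gcs_blob` would then pass `strip_bucket_prefix(bucket_name)` to the SDK while keeping the prefixed form for validation and display.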
🧹 Nitpick comments (4)
api/thrift/agent.thrift (2)

36-37: Consider documenting the reason for commenting out.

Rather than commenting out the field, add an explanation or consider proper deprecation method.

-    // 10: optional list<string> args
+    // 10: optional list<string> args - Deprecated: args now passed through environment variables and runner configuration

126-127: Consider documenting the reason for commenting out.

Similar to above, add context for why this field is being disabled.

-    // 2: optional string end
+    // 2: optional string end - Deprecated: end date handling refactored
api/python/ai/chronon/repo/run.py (2)

185-190: Add env‑fallback & type guard for new bucket flags
CLI flags are added but not backed by type=str, defaults or set_defaults(). Users omitting the flag but setting the env var will still see None in ctx.params, breaking downstream validation.
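The CLI-or-environment fallback the comment asks for can be expressed as a tiny resolver (click options also support an `envvar=` argument that does this natively, which may be the cleaner fix). This is a sketch under the assumption that the runner resolves buckets itself; the function name is illustrative.

```python
import os

def resolve_bucket(cli_value, env_key):
    """Prefer an explicit CLI value; otherwise fall back to the named
    environment variable. Returns None if neither is set, which callers
    should treat as a configuration error."""
    return cli_value or os.environ.get(env_key)
```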


243-245: Use the canonical setter
Directly mutating default_runner.jar_path bypasses any validation encapsulated in Runner.set_jar_path().

-            default_runner.jar_path = os.path.expanduser(chronon_jar)
+            default_runner.set_jar_path(os.path.expanduser(chronon_jar))
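Routing assignments through the setter pays off once validation or normalization lives there. A minimal sketch of what such a setter could look like (the real `Runner.set_jar_path` may simply assign; folding `expanduser` into it is an assumption):

```python
import os

class Runner:
    def __init__(self):
        # jar_path starts unset and is populated via the setter after construction.
        self.jar_path = None

    def set_jar_path(self, jar_path: str) -> None:
        # Centralizing the assignment lets normalization (e.g. ~ expansion)
        # and future existence checks live in one place.
        self.jar_path = os.path.expanduser(jar_path)
```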
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro (Legacy)

📥 Commits

Reviewing files that changed from the base of the PR and between 3cfd5eb and 345a5c7.

📒 Files selected for processing (7)
  • api/python/ai/chronon/repo/aws.py (5 hunks)
  • api/python/ai/chronon/repo/constants.py (1 hunks)
  • api/python/ai/chronon/repo/default_runner.py (2 hunks)
  • api/python/ai/chronon/repo/gcp.py (7 hunks)
  • api/python/ai/chronon/repo/run.py (4 hunks)
  • api/python/test/canary/teams.py (3 hunks)
  • api/thrift/agent.thrift (2 hunks)
🧰 Additional context used
🧬 Code Graph Analysis (2)
api/python/ai/chronon/repo/default_runner.py (1)
api/python/ai/chronon/repo/constants.py (1)
  • RunMode (4-30)
api/python/ai/chronon/repo/gcp.py (1)
api/python/ai/chronon/repo/default_runner.py (1)
  • set_jar_path (106-107)
🔇 Additional comments (6)
api/python/ai/chronon/repo/constants.py (1)

160-164: Good addition of configurable bucket constants.

The addition of bucket environment keys and URI prefixes supports the PR objective of removing hardcoded bucket references.

api/python/test/canary/teams.py (2)

60-61: Proper bucket configuration for GCP team.

GCS bucket URLs correctly use the "gs://" prefix.


107-108: Proper bucket configuration for AWS team.

S3 bucket URLs correctly use the "s3://" prefix.

api/python/ai/chronon/repo/default_runner.py (3)

15-17: Appropriate constant imports.

New imports support the bucket name configuration.


22-24: Constructor refactored to initialize jar_path as None.

This aligns with the new pattern of setting jar_path after initialization.


106-108: Good addition of setter method.

Setter enables setting jar_path after initialization, supporting refactored construction pattern.

Comment on lines +101 to +105
self.zipline_artifacts_bucket = (args.get("zipline_artifacts_bucket")
or os.environ.get(ZIPLINE_ARTIFACTS_BUCKET_ENV_KEY))
self.zipline_warehouse_bucket = (args.get("zipline_warehouse_bucket")
or os.environ.get(ZIPLINE_WAREHOUSE_BUCKET_ENV_KEY))


🛠️ Refactor suggestion

Add bucket name validation.

Validate that the bucket names have correct prefixes (S3_PREFIX or GCS_PREFIX).

 self.zipline_artifacts_bucket = (args.get("zipline_artifacts_bucket")
                                  or os.environ.get(ZIPLINE_ARTIFACTS_BUCKET_ENV_KEY))
 self.zipline_warehouse_bucket = (args.get("zipline_warehouse_bucket")
                                  or os.environ.get(ZIPLINE_WAREHOUSE_BUCKET_ENV_KEY))
+
+if self.zipline_artifacts_bucket and not (self.zipline_artifacts_bucket.startswith(S3_PREFIX) or 
+                                          self.zipline_artifacts_bucket.startswith(GCS_PREFIX)):
+    raise ValueError(f"Artifacts bucket must start with {S3_PREFIX} or {GCS_PREFIX}")
+
+if self.zipline_warehouse_bucket and not (self.zipline_warehouse_bucket.startswith(S3_PREFIX) or 
+                                          self.zipline_warehouse_bucket.startswith(GCS_PREFIX)):
+    raise ValueError(f"Warehouse bucket must start with {S3_PREFIX} or {GCS_PREFIX}")

Comment on lines +38 to +44
for bucket in [self.zipline_artifacts_bucket, self.zipline_warehouse_bucket]:
if not bucket.startswith(S3_PREFIX):
raise ValueError(
f"Invalid bucket name: {bucket}. "
f"Bucket names must start with '{S3_PREFIX}'."
)


⚠️ Potential issue

Guard None and strip s3:// before AWS SDK calls
bucket may be None, and boto3 expects bare bucket names (no s3://). bucket.startswith() on None raises, and passing the prefixed value to boto3 will fail.

-        for bucket in [self.zipline_artifacts_bucket, self.zipline_warehouse_bucket]:
-            if not bucket.startswith(S3_PREFIX):
+        for bucket in [self.zipline_artifacts_bucket, self.zipline_warehouse_bucket]:
+            if bucket is None:
+                raise ValueError("Zipline bucket not provided.")
+            if not bucket.startswith(S3_PREFIX):
                 raise ValueError(
                     f"Invalid bucket name: {bucket}. "
                     f"Bucket names must start with '{S3_PREFIX}'."
                 )
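The guard above can be pulled into a self-contained helper for testing; this is a sketch (the PR validates inline in the runner constructor, and the function name is illustrative).

```python
S3_PREFIX = "s3://"  # mirrors the constant in api/python/ai/chronon/repo/constants.py

def validate_s3_buckets(*buckets):
    """Fail fast on missing or wrongly-prefixed bucket names.

    Guards None before calling .startswith() so a missing bucket raises a
    clear ValueError instead of an AttributeError.
    """
    for bucket in buckets:
        if bucket is None:
            raise ValueError("Zipline bucket not provided.")
        if not bucket.startswith(S3_PREFIX):
            raise ValueError(
                f"Invalid bucket name: {bucket}. "
                f"Bucket names must start with '{S3_PREFIX}'."
            )
```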

Comment on lines +38 to +44
# Validate bucket names start with "gs://"
for bucket in [self.zipline_artifacts_bucket, self.zipline_warehouse_bucket]:
if not bucket.startswith(GCS_PREFIX):
raise ValueError(
f"Invalid bucket name: {bucket}. "
f"Bucket names must start with '{GCS_PREFIX}'."
)

🛠️ Refactor suggestion

Same None / prefix issue as AWS
Handle missing buckets before .startswith() to avoid AttributeError.

@david-zlai david-zlai closed this Apr 16, 2025
@david-zlai david-zlai deleted the davidhan/refactor_gcp_run branch May 12, 2025 19:36

2 participants