
Conversation


@tchow-zlai tchow-zlai commented May 1, 2025

Summary

Checklist

  • Added Unit Tests
  • Covered by existing CI
  • Integration tested
  • Documentation update

Summary by CodeRabbit

  • New Features

    • Added scripts for launching services, loading data, building and uploading artifacts, publishing Docker images, and running quickstart workflows for both GCP and AWS environments.
    • Introduced a Docker Compose configuration for orchestrating multi-service environments.
    • Added a script for generating synthetic datasets using PySpark.
    • Provided scripts for building documentation, packaging releases, and initializing the Chronon environment.
    • Introduced an interactive gateway script for Spark-based evaluation.
  • Chores

    • Implemented environment validation, error handling, and automation across all new scripts to streamline setup and deployment processes.


coderabbitai bot commented May 1, 2025

Walkthrough

This pull request introduces a suite of new scripts and configuration files to automate and orchestrate various data engineering workflows, including cloud quickstart setups for GCP and AWS, artifact building and publishing, Docker image management, synthetic data generation, environment initialization, and documentation site release processes. The additions span Bash, Python, and YAML files, each encapsulating specific operational tasks such as service launching, data loading, environment validation, and artifact deployment, with robust error handling and environment variable checks throughout.

Changes

| File(s) | Change Summary |
|---------|----------------|
| scripts/cloud_gcp_quickstart/fetcher_launch.sh | New shell script to validate environment variables and launch the Fetcher service via Java. |
| scripts/cloud_gcp_quickstart/gcp-docker-compose.yml | New Docker Compose file defining services for Bigtable emulator, data loading, statsd, and app orchestration. |
| scripts/cloud_gcp_quickstart/load_data.sh | New Bash script to orchestrate data compilation, loading, batch jobs, and metadata upload for GCP pipelines. |
| scripts/datagen/plaid_dataset.py | New Python script to generate synthetic data with PySpark and write to a Hudi dataset on S3. |
| scripts/distribution/build_and_upload_artifacts.sh | New Bash script for building and uploading artifacts for GCP and AWS, with metadata tagging and customer ID support. |
| scripts/distribution/build_wheel.sh | New shell script to generate Python code from Thrift files and build a Python wheel package. |
| scripts/distribution/publish_docker_images.sh | New shell script to build and publish multi-platform Docker images after validating Git state and Bazel builds. |
| scripts/distribution/run_aws_quickstart.sh | New Bash script to automate running Zipline AI backfill jobs on AWS EMR for canary/dev environments. |
| scripts/distribution/run_gcp_quickstart.sh | New Bash script to automate running Zipline AI workflows on GCP for canary/dev environments. |
| scripts/docsite_release/build.sh | New Bash script to build and package Chronon Spark jar, documentation site, and test repo for release. |
| scripts/docsite_release/gcloud_release.sh | New shell script to upload release files to a GCS bucket, with setup instructions. |
| scripts/docsite_release/init.sh | New Bash script to initialize a Chronon environment, set up PYTHONPATH, and ensure Spark installation. |
| scripts/interactive/gateway.sh | New Bash script to validate environment and launch a Java interactive gateway with JVM module opens. |

Sequence Diagram(s)

sequenceDiagram
    participant User
    participant Script
    participant CloudService
    participant JavaApp
    participant Docker
    participant Spark
    participant GCP/AWS

    User->>Script: Run script (e.g., fetcher_launch.sh, run_gcp_quickstart.sh)
    Script->>Script: Validate environment variables & inputs
    alt Docker Compose
        Script->>Docker: Start services (Bigtable, statsd, app, etc.)
        Docker->>CloudService: Launch emulators/services
        Docker->>Script: Run data loading/init scripts
    else Data Loading
        Script->>Spark: Submit job to load/transform data
        Spark->>CloudService: Write to storage (Bigtable/S3/Hudi)
    else Artifact Build/Publish
        Script->>Script: Build JARs/wheels with Bazel/Thrift
        Script->>GCP/AWS: Upload artifacts with metadata
    else Interactive Gateway
        Script->>JavaApp: Launch Java with custom classpath and module opens
    end
    Script->>User: Print status or error messages

Possibly related PRs

  • zipline-ai/chronon#422: Adds a Docker-based fetcher start script and Docker image publishing support, related to the fetcher launch automation in this PR.

Suggested reviewers

  • nikhil-zlai
  • chewy-zlai

Poem

Scripts now bloom in folders new,
Spinning up clouds, building wheels too!
Docker and Spark join the show,
Data flows where devs command it go.
With bashful checks and Python’s might,
Chronon’s pipelines launch just right.
🚀✨

Warning

Review ran into problems

🔥 Problems

GitHub Actions and Pipeline Checks: Resource not accessible by integration - https://docs.github.com/rest/actions/workflow-runs#list-workflow-runs-for-a-repository.

Please grant the required permissions to the CodeRabbit GitHub App under the organization or repository settings.



@tchow-zlai tchow-zlai merged commit 123f1bc into main May 1, 2025
3 of 4 checks passed
@tchow-zlai tchow-zlai deleted the tchow/move-integ-back branch May 1, 2025 17:23

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 12

🔭 Outside diff range comments (1)
scripts/datagen/plaid_dataset.py (1)

215-222: 🛠️ Refactor suggestion

Missing error handling.

Script lacks try/except blocks to handle potential failures.

+try:
     df.write.format("org.apache.hudi").options(**hudi_options).mode(
         "overwrite"
     ).partitionBy("dt").save("s3://zipline-warehouse-canary/data/plaid_raw")
+    print("Data successfully written to Hudi dataset")
+except Exception as e:
+    print(f"Error writing data: {e}")
+    sys.exit(1)
+
+# Don't forget to add import sys at the top
🧹 Nitpick comments (35)
scripts/datagen/plaid_dataset.py (2)

152-161: Random value generation needs more variety.

Values are too simplistic and lack diversity needed for realistic testing.

-    "string": random.choice(["These", "Are", "Random", "Words", "For", "Strings"]),
+    "string": ''.join(random.choices('abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789', k=random.randint(5, 15))),

190-192: Data display is minimal.

Consider adding sample of actual data fields, not just partitions.

 # Show the data
 df.select("dt").show(10, truncate=False)
+# Show sample of full records
+df.show(5, vertical=True, truncate=100)
scripts/distribution/build_wheel.sh (3)

1-1: Add a shebang.
Include #!/usr/bin/env bash at the top to satisfy Shellcheck SC2148.

🧰 Tools
🪛 Shellcheck (0.10.0)

[error] 1-1: Tips depend on target shell and yours is unknown. Add a shebang or a 'shell' directive.

(SC2148)


6-7: Quote the trap command.
Use single quotes and wrap the variable to defer expansion and handle spaces:

trap 'rm -rf "${SCRATCH_DIR}"' EXIT
🧰 Tools
🪛 Shellcheck (0.10.0)

[warning] 7-7: Use single quotes, otherwise this expands now rather than when signalled.

(SC2064)


10-10: Quote the wildcard path.
Wrap the source path in quotes to handle spaces correctly:

- cp ${SCRATCH_DIR}/zipline_ai-${ENV_VERSION}*.whl .
+ cp "${SCRATCH_DIR}/zipline_ai-${ENV_VERSION}"*.whl .
scripts/docsite_release/gcloud_release.sh (1)

1-1: Enable strict mode.
Add set -euo pipefail after the shebang to catch failures and unset variables.

scripts/interactive/gateway.sh (2)

1-1: Add strict mode.
Place set -euo pipefail immediately after the shebang to guard against errors and unset vars.

🧰 Tools
🪛 Shellcheck (0.10.0)

[error] 1-1: Tips depend on target shell and yours is unknown. Add a shebang or a 'shell' directive.

(SC2148)


17-30: Use exec for Java.
Replace the shell invocation with exec java ... to forward signals and avoid an extra process.
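
A minimal sketch of the exec form, assuming the script already resolves the jar, classpath, and main class into variables (the names below are placeholders, not the script's actual ones):

exec java \
  --add-opens=java.base/java.lang=ALL-UNNAMED \
  --add-opens=java.base/java.util=ALL-UNNAMED \
  -cp "$GATEWAY_JAR:$CLASSPATH" \
  "$GATEWAY_MAIN_CLASS" "$@"

With exec, the JVM replaces the shell process, so SIGTERM/SIGINT from the terminal or a container runtime reach Java directly.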

scripts/cloud_gcp_quickstart/fetcher_launch.sh (2)

1-2: Harden error handling.
Switch from set -e to set -euo pipefail to catch unset variables and pipeline failures.


15-19: Quote variables in Java command.
Wrap $FETCHER_JAR and other expansions in quotes to handle paths with spaces:

if ! java -jar "$FETCHER_JAR" run ai.chronon.service.FetcherVerticle \
  -Dserver.port=9000 \
  -Donline.jar="$CLOUD_GCP_JAR" \
  -Dai.chronon.metrics.host="$STATSD_HOST" \
  -Donline.class="$CHRONON_ONLINE_CLASS"; then
scripts/docsite_release/build.sh (3)

27-27: Quote the cleanup path.
Wrap ${DOC_BUILD} in quotes to handle any special characters:

rm -rf "${DOC_BUILD}"

30-33: Use Python venv and quote paths.
Prefer python3 -m venv "${VIRTUAL_ENV}" over virtualenv, and quote all variable expansions.
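
For example (assuming VIRTUAL_ENV is the path variable the script already defines):

python3 -m venv "${VIRTUAL_ENV}"
source "${VIRTUAL_ENV}/bin/activate"
pip install --upgrade pip

venv ships with Python 3, so this also drops the external virtualenv dependency.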


45-46: Avoid parsing ls.
Replace the ls | grep | awk pipeline with a find or glob to select the latest jar, e.g.:

SBT_JAR_12=$(find spark/target/scala-2.12 -maxdepth 1 -name '*assembly*.jar' -print0 | xargs -0 ls -1t | head -n1)
🧰 Tools
🪛 Shellcheck (0.10.0)

[warning] 46-46: Don't use ls | grep. Use a glob or a for loop with a condition to allow non-alphanumeric filenames.

(SC2010)

scripts/distribution/run_gcp_quickstart.sh (7)

23-32: Enforce mutually exclusive env flags
Add a check to error if both --canary and --dev are passed.
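
A small guard like the following would do it, using the script's existing USE_CANARY/USE_DEV flags (sketch only):

if [[ "$USE_CANARY" == true && "$USE_DEV" == true ]]; then
  echo "Error: --canary and --dev are mutually exclusive" >&2
  print_usage
  exit 1
fi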


69-77: Implement Bigtable cleanup
The #TODO: delete bigtable rows item is still unimplemented; add the cleanup step or track it in a follow-up issue.


90-92: Handle pip uninstall failure
pip uninstall zipline-ai errors if not present; consider pip uninstall -y zipline-ai || true.


95-109: Improve Dataproc state check
Use gcloud dataproc jobs describe … --format='value(status.state)' and fix the TODO so failures exit.
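
One possible shape for the check, assuming the job has already finished and that JOB_ID/REGION are extracted earlier in the script (placeholder names):

STATE=$(gcloud dataproc jobs describe "$JOB_ID" --region="$REGION" --format='value(status.state)')
if [[ "$STATE" != "DONE" ]]; then
  echo "Dataproc job $JOB_ID ended in state $STATE" >&2
  exit 1
fi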


111-116: Simplify failure handler
fail_if_bash_failed relies on $?; consider passing exit code as $1 or inline || exit 1.
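
For example, a sketch that takes the exit code as an argument (some_command stands in for whatever the script just ran):

function fail_if_bash_failed() {
  local exit_code=$1
  if [[ $exit_code -ne 0 ]]; then
    echo "Previous command failed with exit code $exit_code" >&2
    exit "$exit_code"
  fi
}

some_command
fail_if_bash_failed $?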


121-121: Remove unused variable
DATAPROC_SUBMITTER_ID_STR is never referenced.

🧰 Tools
🪛 Shellcheck (0.10.0)

[warning] 121-121: DATAPROC_SUBMITTER_ID_STR appears unused. Verify use (or export if used externally).

(SC2034)


171-179: Avoid UUOC with cat
Replace cat tmp_fetch.out | grep with grep directly and drop redundant exit check.
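
e.g. (FETCH_MARKER_STR stands in for whatever marker string the script currently greps for):

FETCH_LINE=$(grep "$FETCH_MARKER_STR" tmp_fetch.out) || { echo "marker not found in fetch output" >&2; exit 1; }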

scripts/cloud_gcp_quickstart/load_data.sh (4)

2-3: Enable pipefail
Add set -o pipefail to catch failures in piped commands.


5-11: Redirect errors & quote expansions
Echo errors with >&2 and wrap ${!var} in quotes to avoid word splitting.
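
A sketch of the validation loop with both fixes applied (the variable list here is illustrative, not the script's exact set):

for var in GCP_PROJECT_ID BIGTABLE_INSTANCE_ID; do
  if [[ -z "${!var}" ]]; then
    echo "Error: required environment variable $var is not set" >&2
    exit 1
  fi
done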


13-17: Ensure compile.py executes reliably
Either add a shebang to compile.py or call python3 compile.py.


36-40: Quote Java classpath
Wrap -cp "$DRIVER_JAR_PATH:$CLASSPATH" to prevent word splitting.

scripts/distribution/publish_docker_images.sh (2)

57-61: Use variables when copying jars
Prefer cp "$SERVICE_JAR" build_output/ etc. over hard-coded paths.


65-72: Quote Git output in Docker tag
Wrap $(git rev-parse --short HEAD) in quotes to avoid word splitting.

🧰 Tools
🪛 Shellcheck (0.10.0)

[warning] 68-68: Quote this to prevent word splitting.

(SC2046)
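
For the quoting fix above, a sketch of the quoted form (the image name is a placeholder, not the script's actual repository):

GIT_SHA="$(git rev-parse --short HEAD)"
docker buildx build --platform linux/amd64,linux/arm64 \
  -t "example/chronon-service:${GIT_SHA}" \
  --push .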

scripts/docsite_release/init.sh (2)

5-5: Enable strict mode
Uncomment set -euxo pipefail for robustness.


6-13: Remove unused color vars
GRAY, GREEN, and MAGENTA are defined but never used.

🧰 Tools
🪛 Shellcheck (0.10.0)

[warning] 6-6: GRAY appears unused. Verify use (or export if used externally).

(SC2034)


[warning] 11-11: GREEN appears unused. Verify use (or export if used externally).

(SC2034)


[warning] 12-12: MAGENTA appears unused. Verify use (or export if used externally).

(SC2034)

scripts/cloud_gcp_quickstart/gcp-docker-compose.yml (2)

1-1: Specify Compose version
Add version: "3.8" (or appropriate) at the top if you target legacy Compose; note that the version field is obsolete under the current Compose Specification, so newer Docker Compose releases will ignore it.


9-25: Parameterize project and instance
Replace hard-coded quickstart-project/quickstart-instance in cbt commands with env vars.
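
For example, the cbt invocations could read (env var names are suggestions, defaulting to the current values):

cbt -project "${GCP_PROJECT_ID:-quickstart-project}" -instance "${BIGTABLE_INSTANCE_ID:-quickstart-instance}" ls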

scripts/distribution/build_and_upload_artifacts.sh (2)

82-84: Quote variable expansions.
Wrap $CHRONON_ROOT_DIR in quotes to handle paths with spaces.

-cd $CHRONON_ROOT_DIR
+cd "$CHRONON_ROOT_DIR"

173-199: Restore original gcloud upload setting.
Unconditionally re-enabling parallel uploads may override user preferences. Capture the original value and restore it after uploads.
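
A sketch, assuming the script flips the storage/parallel_composite_upload_enabled property:

ORIGINAL_SETTING=$(gcloud config get-value storage/parallel_composite_upload_enabled 2>/dev/null)
gcloud config set storage/parallel_composite_upload_enabled True
# ... run the uploads ...
gcloud config set storage/parallel_composite_upload_enabled "${ORIGINAL_SETTING:-False}"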

scripts/distribution/run_aws_quickstart.sh (3)

9-11: Remove duplicate color definitions.
GREEN and RED are defined twice. Consolidate into a single declaration at the top.

Also applies to: 92-93


73-90: Remove stale comments & enhance wait logic.
Commented PLAID cleanup can be removed. Instead of sleep 30, use an AWS CLI wait or polling loop to ensure the Glue table is gone.
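
A polling sketch in place of the fixed sleep (database and table names are placeholders):

for _ in $(seq 1 30); do
  if ! aws glue get-table --database-name "$GLUE_DATABASE" --name "$GLUE_TABLE" >/dev/null 2>&1; then
    break  # table no longer exists
  fi
  sleep 10
done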


118-132: Remove unused GCP helper.
check_dataproc_job_state is never called in this AWS script. Delete it to reduce dead code.

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro (Legacy)

📥 Commits

Reviewing files that changed from the base of the PR and between 7212304 and f3ba9f4.

📒 Files selected for processing (13)
  • scripts/cloud_gcp_quickstart/fetcher_launch.sh (1 hunks)
  • scripts/cloud_gcp_quickstart/gcp-docker-compose.yml (1 hunks)
  • scripts/cloud_gcp_quickstart/load_data.sh (1 hunks)
  • scripts/datagen/plaid_dataset.py (1 hunks)
  • scripts/distribution/build_and_upload_artifacts.sh (1 hunks)
  • scripts/distribution/build_wheel.sh (1 hunks)
  • scripts/distribution/publish_docker_images.sh (1 hunks)
  • scripts/distribution/run_aws_quickstart.sh (1 hunks)
  • scripts/distribution/run_gcp_quickstart.sh (1 hunks)
  • scripts/docsite_release/build.sh (1 hunks)
  • scripts/docsite_release/gcloud_release.sh (1 hunks)
  • scripts/docsite_release/init.sh (1 hunks)
  • scripts/interactive/gateway.sh (1 hunks)
🧰 Additional context used
🧬 Code Graph Analysis (2)
scripts/distribution/run_aws_quickstart.sh (5)
scripts/distribution/build_and_upload_artifacts.sh (1)
  • print_usage (3-11)
scripts/distribution/run_gcp_quickstart.sh (2)
  • print_usage (6-12)
  • check_dataproc_job_state (95-109)
aggregator/src/test/scala/ai/chronon/aggregator/test/RowAggregatorTest.scala (1)
  • set (44-44)
api/python/ai/chronon/repo/zipline.py (1)
  • zipline (42-44)
api/python/ai/chronon/cli/compile/compiler.py (1)
  • compile (33-63)
scripts/distribution/build_and_upload_artifacts.sh (2)
scripts/distribution/run_aws_quickstart.sh (1)
  • print_usage (12-18)
scripts/distribution/run_gcp_quickstart.sh (1)
  • print_usage (6-12)
🪛 Shellcheck (0.10.0)
scripts/interactive/gateway.sh

[error] 1-1: Tips depend on target shell and yours is unknown. Add a shebang or a 'shell' directive.

(SC2148)

scripts/distribution/build_wheel.sh

[error] 1-1: Tips depend on target shell and yours is unknown. Add a shebang or a 'shell' directive.

(SC2148)


[warning] 7-7: Use single quotes, otherwise this expands now rather than when signalled.

(SC2064)

scripts/distribution/run_gcp_quickstart.sh

[warning] 121-121: DATAPROC_SUBMITTER_ID_STR appears unused. Verify use (or export if used externally).

(SC2034)

scripts/distribution/publish_docker_images.sh

[warning] 68-68: Quote this to prevent word splitting.

(SC2046)

scripts/docsite_release/build.sh

[warning] 46-46: Don't use ls | grep. Use a glob or a for loop with a condition to allow non-alphanumeric filenames.

(SC2010)

scripts/docsite_release/init.sh

[warning] 6-6: GRAY appears unused. Verify use (or export if used externally).

(SC2034)


[warning] 11-11: GREEN appears unused. Verify use (or export if used externally).

(SC2034)


[warning] 12-12: MAGENTA appears unused. Verify use (or export if used externally).

(SC2034)

scripts/distribution/run_aws_quickstart.sh

[warning] 115-115: Declare and assign separately to avoid masking return values.

(SC2155)

🔇 Additional comments (12)
scripts/datagen/plaid_dataset.py (4)

1-15: Imports are comprehensive.

Clear structure with appropriate PySpark imports.


141-149: Type mapping looks correct.

Clean mapping between string type names and PySpark types.


176-188: Schema creation and data generation look solid.

Appropriate structure for generating synthetic data with partition dates.


16-139:

❓ Verification inconclusive

Schema includes sensitive fingerprinting fields.

Schema appears to represent browser fingerprinting data with potential PII concerns.


🏁 Script executed:

#!/bin/bash
# Check if this schema aligns with any data governance policies
grep -r "data privacy" --include="*.md" .
grep -r "fingerprint" --include="*.md" .

Length of output: 84


Review Privacy Compliance for Fingerprinting Fields

This schema in scripts/datagen/plaid_dataset.py (lines 16–139) collects extensive fingerprinting and geolocation data (e.g., user_agent_device, fingerprint_pro_data_ip_v4_city, latitude/longitude). These may be considered PII under data-protection regulations.
• Confirm with your data-governance team or privacy policy whether storing these fields is permitted.
• Ensure you have user consent, apply anonymization, or remove unnecessary sensitive attributes.

scripts/distribution/build_and_upload_artifacts.sh (6)

3-56: Trivial: Usage and option parsing
The print_usage, argument parsing, and flag initialization are standard and consistent.


59-75: Trivial: Git state checks
Uncommitted-changes and branch-sync validations look correct.


77-81: Trivial: Setup script environment
Strict mode and path resolution are correctly implemented.


103-164: Trivial: Build and artifact checks
Wheel build, jar builds, and file-existence checks follow the expected pattern.


170-172: Trivial: Default customer IDs
Default arrays for customer IDs are clear.


201-253: Trivial: Upload and cleanup
AWS uploads, final option checks, and cleanup are straightforward.

scripts/distribution/run_aws_quickstart.sh (2)

175-177: Verify CHRONON_ROOT path.
Ensure the hard-coded api/python/test/canary directory matches your project layout.


213-215: Completion message is clear.
The final success echo is straightforward.

Comment on lines +194 to +208
hudi_options = {
"hoodie.table.name": "plaid_raw",
"hoodie.datasource.write.partitionpath.field": "dt",
"hoodie.datasource.write.operation": "upsert",
"hoodie.database.name": "data",
"hoodie.datasource.write.storage.type": "COPY_ON_WRITE",
"hoodie.datasource.write.hive_style_partitioning": "true",
"hoodie.datasource.hive_sync.enable": "true",
"hoodie.datasource.hive_sync.database": "data",
"hoodie.datasource.hive_sync.table": "plaid_raw",
"hoodie.datasource.hive_sync.partition_fields": "dt",
"hoodie.datasource.hive_sync.partition_extractor_class": "org.apache.hudi.hive.MultiPartKeysValueExtractor",
"hoodie.datasource.hive_sync.use_jdbc": "false",
"hoodie.datasource.hive_sync.mode": "hms",
}

🛠️ Refactor suggestion

Hardcoded Hudi configuration.

Configuration has hardcoded values and bucket paths.

 hudi_options = {
-    "hoodie.table.name": "plaid_raw",
+    "hoodie.table.name": os.environ.get("TABLE_NAME", "plaid_raw"),
     "hoodie.datasource.write.partitionpath.field": "dt",
     "hoodie.datasource.write.operation": "upsert",
-    "hoodie.database.name": "data",
+    "hoodie.database.name": os.environ.get("DATABASE_NAME", "data"),

Note that shell-style ${VAR:-default} placeholders would be treated as literal strings by Python, so read the overrides via os.environ.get; this requires import os at the top of the file.

Committable suggestion skipped: line range outside the PR's diff.

Comment on lines +210 to +213
df.write.format("org.apache.hudi").options(**hudi_options).mode(
"overwrite"
).partitionBy("dt").save("s3://zipline-warehouse-canary/data/plaid_raw")


🛠️ Refactor suggestion

⚠️ Potential issue

S3 path is hardcoded.

Destination path should be configurable.

 df.write.format("org.apache.hudi").options(**hudi_options).mode(
     "overwrite"
-).partitionBy("dt").save("s3://zipline-warehouse-canary/data/plaid_raw")
+).partitionBy("dt").save(os.environ.get("OUTPUT_PATH", "s3://zipline-warehouse-canary/data/plaid_raw"))

Note: this requires adding import os at the top of the file.

Committable suggestion skipped: line range outside the PR's diff.

Comment on lines +164 to +174
def rand_row(spark_schema):
vals = [
(
values_map[f.dataType.simpleString()]
if f.name != "dt"
else random.choice(partition_dates)
)
for f in spark_schema.fields
]
return tuple(vals)


🛠️ Refactor suggestion

Improve rand_row implementation.

Function doesn't handle schema field nullability properly.

 def rand_row(spark_schema):
     vals = [
         (
-            values_map[f.dataType.simpleString()]
+            (None if f.nullable and random.random() < 0.1 else values_map[f.dataType.simpleString()])
             if f.name != "dt"
             else random.choice(partition_dates)
         )
         for f in spark_schema.fields
     ]
     return tuple(vals)

The added parentheses keep the None branch scoped to non-dt fields, so the dt partition column always receives a valid date.


SCRATCH_DIR=$(mktemp -d)
trap "rm -rf ${SCRATCH_DIR}" EXIT
ENV_VERSION=$1

🛠️ Refactor suggestion

Validate version argument.
Fail early with a usage message if $1 is empty to avoid unintended builds.
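
For example:

if [[ -z "${1:-}" ]]; then
  echo "Usage: $0 <version>" >&2
  exit 1
fi
ENV_VERSION=$1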

Comment on lines +14 to +17
if [ $# -ne 3 ]; then
print_usage
exit 1
fi

⚠️ Potential issue

Help flag unreachable under strict arg count
The if [ $# -ne 3 ] check blocks -h|--help. Move help handling before or loosen arg count.

Comment on lines +12 to +23
function print_usage() {
echo "Usage: $0 [OPTIONS]"
echo "Options:"
echo " --canary | --dev Must specify the environment (canary or dev)"
echo " --version <version> Must specify the version you want to run"
echo " -h, --help Show this help message"
}

if [ $# -ne 3 ]; then
print_usage
exit 1
fi

⚠️ Potential issue

Allow -h|--help early.
The help flag is unreachable due to the strict if [ $# -ne 3 ] check. Move the help logic above the argument-count guard.
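
One way to keep the flag reachable (sketch):

for arg in "$@"; do
  case "$arg" in
    -h|--help)
      print_usage
      exit 0
      ;;
  esac
done

if [ $# -ne 3 ]; then
  print_usage
  exit 1
fi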

Comment on lines +29 to +58
while [[ $# -gt 0 ]]; do
case $1 in
--canary)
USE_CANARY=true
shift
;;
--dev)
USE_DEV=true
shift
;;
-h|--help)
print_usage
exit 0
;;
--version)
if [[ -z $2 ]]; then
echo "Error: --version requires a value"
print_usage
exit 1
fi
VERSION="$2"
shift 2
;;
*)
echo "Unknown option: $1"
print_usage
exit 1
;;
esac
done

🛠️ Refactor suggestion

Enable --create-cluster.
The create_cluster variable is never set via CLI, so the “create” path is never exercised. Add a --create-cluster option to toggle cluster creation.
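
A sketch of how the flag could slot into the existing parse loop (the catch-all branch stands in for the options the script already handles):

while [[ $# -gt 0 ]]; do
  case $1 in
    --create-cluster)
      create_cluster=true
      shift
      ;;
    *)
      shift  # other options handled by the existing cases
      ;;
  esac
done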

Comment on lines +1 to +8
#!/bin/bash

# default option is to not create a new cluster for testing but use existing cluster
create_cluster=false

# otherwise use the canary cluster id
CANARY_CLUSTER_ID="j-13BASWFP15TLR"


🛠️ Refactor suggestion

Parameterize cluster ID.
Hardcoding CANARY_CLUSTER_ID limits flexibility. Accept a --cluster-id flag or environment variable to override the default.
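
For example, an environment-variable override (ZIPLINE_CLUSTER_ID is an assumed name, defaulting to the current ID):

CANARY_CLUSTER_ID="${ZIPLINE_CLUSTER_ID:-j-13BASWFP15TLR}"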

Comment on lines +99 to +105

# Create a virtualenv to fresh install zipline-ai
VENV_DIR="tmp_chronon"
rm -rf $VENV_DIR
python3 -m venv $VENV_DIR
source $VENV_DIR/bin/activate


⚠️ Potential issue

Make pip uninstall noninteractive.
pip uninstall zipline-ai prompts or errors if not installed. Use:

pip uninstall -y zipline-ai || true

to avoid hangs or failures.

Comment on lines +208 to +210
EMR_SUBMITER_ID_STEP_STR="EMR step id"
STEP_ID=$(cat tmp_backfill.out | grep "$EMR_SUBMITER_ID_STEP_STR" | cut -d " " -f4) # expecting the step id to be the 4th field
check_emr_step_state $STEP_ID $CLUSTER_ID

⚠️ Potential issue

Fix typo in step-ID marker.
The variable EMR_SUBMITER_ID_STEP_STR is misspelled, breaking step-ID extraction. Rename it to:

-EMR_SUBMITER_ID_STEP_STR="EMR step id"
+EMR_SUBMITTER_ID_STEP_STR="EMR step id"

and update the grep accordingly.

Committable suggestion skipped: line range outside the PR's diff.

chewy-zlai pushed a commit that referenced this pull request May 15, 2025
## Summary

## Checklist
- [ ] Added Unit Tests
- [ ] Covered by existing CI
- [ ] Integration tested
- [ ] Documentation update


<!-- av pr metadata
This information is embedded by the av CLI when creating PRs to track
the status of stacks when using Aviator. Please do not delete or edit
this section of the PR.
```
{"parent":"main","parentHead":"","trunk":""}
```
-->

Co-authored-by: thomaschow <[email protected]>
chewy-zlai pushed a commit that referenced this pull request May 15, 2025
## Summary

## Checklist
- [ ] Added Unit Tests
- [ ] Covered by existing CI
- [ ] Integration tested
- [ ] Documentation update


<!-- av pr metadata
This information is embedded by the av CLI when creating PRs to track
the status of stacks when using Aviator. Please do not delete or edit
this section of the PR.
```
{"parent":"main","parentHead":"","trunk":""}
```
-->

Co-authored-by: thomaschow <[email protected]>
chewy-zlai pushed a commit that referenced this pull request May 16, 2025
## Summary

## Checklist
- [ ] Added Unit Tests
- [ ] Covered by existing CI
- [ ] Integration tested
- [ ] Documentation update


<!-- av pr metadata
This information is embedded by the av CLI when creating PRs to track
the status of stacks when using Aviator. Please do not delete or edit
this section of the PR.
```
{"parent":"main","parentHead":"","trunk":""}
```
-->

Co-authored-by: thomaschow <[email protected]>
