Connect fetch with run.py + Driver for GCP #276
Conversation
api/py/ai/chronon/repo/run.py (outdated)

```python
else:
    # Always download the jar for now so that we can pull
    # in any fixes or latest changes
    # if self.dataproc:
```
Is this meant to be commented out?
varant-zlai left a comment
Confirming was able to get this to work locally
## Summary
Manual cherry-pick from https://github.com/airbnb/chronon/pull/909/files

Should be rebased and tested off of #276

## Checklist
- [ ] Added Unit Tests
- [ ] Covered by existing CI
- [ ] Integration tested
- [ ] Documentation update
- [x] Not yet tested, should test after merging 276

## Summary by CodeRabbit
- **New Features**
  - Added a builder pattern for `JavaFetcher` to enhance object creation flexibility.
  - Introduced an optional `ExecutionContext` parameter across multiple classes to improve execution context management.
- **Improvements**
  - Updated constructor signatures in Java and Scala classes to support more configurable execution contexts.
  - Enhanced instantiation process with more structured object creation methods.

---

Co-authored-by: ezvz <[email protected]>
Co-authored-by: Nikhil Simha <[email protected]>
Co-authored-by: Thomas Chow <[email protected]>
Actionable comments posted: 0
🧹 Nitpick comments (1)
api/py/ai/chronon/repo/run.py (1)
415-416: Consider reducing condition complexity.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro (Legacy)
📒 Files selected for processing (3)
- api/py/ai/chronon/repo/run.py (10 hunks)
- api/py/test/sample/teams.json (2 hunks)
- spark/src/main/scala/ai/chronon/spark/Driver.scala (2 hunks)
🔇 Additional comments (13)
api/py/ai/chronon/repo/run.py (9)
208-208: All good.
411-411: Looks fine.
452-458: SPARK_HOME logic looks fine.
474-490: Remove commented-out code if not needed.
491-547: No issues spotted.
Line range hint 548-654: Line 550 is commented out; remove if unneeded.
655-664: Parallel mode looks fine.
665-665: Single call usage is fine.
900-908: Temp directory logic is clear.
spark/src/main/scala/ai/chronon/spark/Driver.scala (2)
693-693: Shifting to args.api is OK.
726-727: Neat alignment of result logging.
api/py/test/sample/teams.json (2)
18-18: Dataproc cluster name updated successfully.
64-64: Namespace change looks fine.
a566189 to 0bae733 (compare)
Actionable comments posted: 2
🧹 Nitpick comments (2)
api/py/ai/chronon/repo/run.py (2)
414-416: Enhance JAR validation logic. Consider extracting the complex condition into a descriptive method for better readability.

```diff
-        if (self.mode in ONLINE_MODES) and (not args["sub_help"] and not self.dataproc) and not valid_jar and (
-                args.get("online_jar_fetch")):
+        def should_download_online_jar():
+            return (self.mode in ONLINE_MODES and
+                    not args["sub_help"] and
+                    not self.dataproc and
+                    not valid_jar and
+                    args.get("online_jar_fetch"))
+
+        if should_download_online_jar():
```
808-819: Enhance error messages in GCP JAR download. Add more context to error messages to help troubleshoot GCP-specific issues.

```diff
 def download_chronon_gcp_jar(destination_dir: str, customer_id: str):
-    print("Downloading chronon gcp jar from GCS...")
     bucket_name = f"zipline-artifacts-{customer_id}"
     file_name = "cloud_gcp-assembly-0.1.0-SNAPSHOT.jar"
     source_blob_name = f"jars/{file_name}"
+    print(f"Downloading chronon GCP JAR from gs://{bucket_name}/{source_blob_name}...")
     chronon_gcp_jar_destination_path = f"{destination_dir}/{file_name}"
     download_gcs_blob(bucket_name, source_blob_name, chronon_gcp_jar_destination_path)
     return chronon_gcp_jar_destination_path
```
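For reference, the `download_gcs_blob` helper called above is typically a thin wrapper over the `google-cloud-storage` client. The sketch below shows that shape under that assumption; the actual helper in run.py may handle credentials, retries, and logging differently.

```python
# Sketch (assumption): a minimal download_gcs_blob built on google-cloud-storage.
# Credential handling and retries in run.py may differ.
from google.cloud import storage


def download_gcs_blob(bucket_name: str, source_blob_name: str, destination_file_name: str) -> None:
    client = storage.Client()
    blob = client.bucket(bucket_name).blob(source_blob_name)
    blob.download_to_filename(destination_file_name)
    print(f"Downloaded storage object {source_blob_name} from bucket {bucket_name} "
          f"to local file {destination_file_name}.")
```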
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro (Legacy)
📒 Files selected for processing (3)
- api/py/ai/chronon/repo/run.py (10 hunks)
- api/py/test/sample/teams.json (2 hunks)
- spark/src/main/scala/ai/chronon/spark/Driver.scala (2 hunks)
🚧 Files skipped from review as they are similar to previous changes (2)
- api/py/test/sample/teams.json
- spark/src/main/scala/ai/chronon/spark/Driver.scala
⏰ Context from checks skipped due to timeout of 90000ms (8)
- GitHub Check: table_utils_delta_format_spark_tests
- GitHub Check: other_spark_tests
- GitHub Check: mutation_spark_tests
- GitHub Check: join_spark_tests
- GitHub Check: fetcher_spark_tests
- GitHub Check: scala_compile_fmt_fix
- GitHub Check: no_spark_scala_tests
- GitHub Check: enforce_triggered_workflows
🔇 Additional comments (2)
api/py/ai/chronon/repo/run.py (2)
474-474: Remove commented code. The commented line appears to be debugging code. Either remove it or add a clear comment explaining its purpose.
453-458: Validate SPARK_HOME path more thoroughly. The current validation only checks if the jars directory exists. Consider validating essential JAR files too.
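One way to tighten this is sketched below, assuming run.py only checks for the jars directory today; the helper name and the specific jar prefixes are illustrative, not from the PR.

```python
# Sketch (assumption): stricter SPARK_HOME validation than a bare directory check.
import glob
import os


def validate_spark_home(spark_home: str) -> str:
    jars_dir = os.path.join(spark_home, "jars")
    if not os.path.isdir(jars_dir):
        raise ValueError(f"SPARK_HOME jars directory not found: {jars_dir}")
    # Spot-check jars that any Spark distribution ships with.
    for prefix in ("spark-core", "spark-sql"):
        if not glob.glob(os.path.join(jars_dir, f"{prefix}*.jar")):
            raise ValueError(f"SPARK_HOME looks incomplete: missing {prefix}*.jar in {jars_dir}")
    return jars_dir
```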
api/py/ai/chronon/repo/run.py (outdated)

```python
# command = f"java -cp {dataproc_jar} {DATAPROC_ENTRY} {dataproc_command}"
command = f"java -cp {self.jar_path} {DATAPROC_ENTRY} {dataproc_command}"
```
🛠️ Refactor suggestion
Eliminate command generation duplication.
The Dataproc command generation is duplicated. Consider extracting it into a helper method.
```diff
+    def _build_dataproc_command(self, dataproc_command: str) -> str:
+        return f"java -cp {self.jar_path} {DATAPROC_ENTRY} {dataproc_command}"
+
-        command = f"java -cp {self.jar_path} {DATAPROC_ENTRY} {dataproc_command}"
+        command = self._build_dataproc_command(dataproc_command)
```

Also applies to: 651-652
```python
running_app_map = {}
for app in running_apps:
    try:
        app_json = json.loads(app.strip())
        app_name = app_json["app_name"].strip()
        if app_name not in running_app_map:
            running_app_map[app_name] = []
        running_app_map[app_name].append(app_json)
    except Exception as ex:
        print("failed to process line into app: " + app)
        print(ex)
```
🛠️ Refactor suggestion
Improve error handling in app processing.
The exception handling swallows errors silently. Consider logging the full stack trace for debugging.
```diff
 running_app_map = {}
 for app in running_apps:
     try:
         app_json = json.loads(app.strip())
         app_name = app_json["app_name"].strip()
         if app_name not in running_app_map:
             running_app_map[app_name] = []
         running_app_map[app_name].append(app_json)
     except Exception as ex:
-        print("failed to process line into app: " + app)
-        print(ex)
+        logging.error("Failed to process app: %s", app, exc_info=True)
```

📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
```python
running_app_map = {}
for app in running_apps:
    try:
        app_json = json.loads(app.strip())
        app_name = app_json["app_name"].strip()
        if app_name not in running_app_map:
            running_app_map[app_name] = []
        running_app_map[app_name].append(app_json)
    except Exception as ex:
        logging.error("Failed to process app: %s", app, exc_info=True)
```
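Note that the suggested `logging.error(..., exc_info=True)` calls assume a configured logger; if run.py currently relies on bare `print` calls, a minimal setup like the sketch below would be needed. This is an assumption about the surrounding code, not something shown in the PR.

```python
# Sketch (assumption): minimal logging configuration so that the suggested
# logging.error(..., exc_info=True) calls emit timestamps and full tracebacks.
import logging

logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s %(levelname)s %(name)s: %(message)s",
)
```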
Actionable comments posted: 0
🧹 Nitpick comments (1)
api/py/ai/chronon/repo/run.py (1)
411-416: Improve JAR download condition readability. The nested conditions make the logic hard to follow.

```diff
-        if (self.mode in ONLINE_MODES) and (not args["sub_help"] and not self.dataproc) and not valid_jar and (
-                args.get("online_jar_fetch")):
+        should_download = (
+            self.mode in ONLINE_MODES
+            and not args["sub_help"]
+            and not self.dataproc
+            and not valid_jar
+            and args.get("online_jar_fetch")
+        )
+        if should_download:
```
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro (Legacy)
📒 Files selected for processing (1)
- api/py/ai/chronon/repo/run.py (10 hunks)
⏰ Context from checks skipped due to timeout of 90000ms (8)
- GitHub Check: table_utils_delta_format_spark_tests
- GitHub Check: mutation_spark_tests
- GitHub Check: fetcher_spark_tests
- GitHub Check: other_spark_tests
- GitHub Check: join_spark_tests
- GitHub Check: no_spark_scala_tests
- GitHub Check: scala_compile_fmt_fix
- GitHub Check: enforce_triggered_workflows
🔇 Additional comments (5)
api/py/ai/chronon/repo/run.py (5)
453-458: LGTM! Good enhancement for SPARK_HOME integration. The code properly checks for SPARK_HOME existence and appends the jars directory to the classpath.
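For context, the classpath assembly this comment refers to looks roughly like the sketch below; the function name is illustrative rather than taken from run.py.

```python
# Sketch (assumption): appending $SPARK_HOME/jars/* to the local fetch classpath.
# The function name is illustrative; run.py's actual structure may differ.
import os


def build_fetch_classpath(chronon_jar_path: str) -> str:
    spark_home = os.environ.get("SPARK_HOME")
    if not spark_home:
        raise EnvironmentError("SPARK_HOME must be set for local fetch mode")
    return f"{chronon_jar_path}:{os.path.join(spark_home, 'jars', '*')}"

# Produces the "-cp <chronon jar>:$SPARK_HOME/jars/*" form seen in the test output.
```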
474-534: Improve error handling in streaming job management. The error handling for app processing is still basic.

```diff
-        print("failed to process line into app: " + app)
-        print(ex)
+        logging.error("Failed to process app: %s", app, exc_info=True)
```
578-584: Eliminate command generation duplication. The Dataproc command generation logic is duplicated.

```diff
+    def _build_dataproc_command(self, user_args: str, local_files: List[str]) -> str:
+        dataproc_command = generate_dataproc_submitter_args(
+            local_files_to_upload_to_gcs=local_files,
+            user_args=user_args
+        )
+        return f"java -cp {self.jar_path} {DATAPROC_ENTRY} {dataproc_command}"
```

Also applies to: 619-634
892-899: LGTM! Good use of temporary directory. Using `tempfile.TemporaryDirectory` ensures proper cleanup.
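For illustration, the pattern being endorsed here looks roughly like the following; a sketch that assumes the `download_chronon_gcp_jar` helper shown earlier, not the exact code in run.py.

```python
# Sketch (assumption): download the GCP jar into a TemporaryDirectory and run the
# command while the directory still exists; everything is cleaned up on exit.
import tempfile

with tempfile.TemporaryDirectory() as tmp_dir:
    jar_path = download_chronon_gcp_jar(tmp_dir, customer_id="canary")
    # ... assemble and execute the "java -cp ..." command here, before the
    #     with-block exits and tmp_dir (including the downloaded jar) is deleted ...
```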
Line range hint 786-812: Add version validation for GCP jar downloads. The hardcoded version "0.1.0-SNAPSHOT" should be validated.
✅ Verification successful
Version "0.1.0-SNAPSHOT" is the project's standard development version
The version is correctly defined in `version.sbt` and consistently used across the project's build system and deployment scripts.
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:

```bash
#!/bin/bash
# Check if the jar version exists in GCS
gsutil ls gs://zipline-artifacts-*/jars/cloud_gcp-assembly-0.1.0-SNAPSHOT.jar
```

Length of output: 178
Script:

```bash
#!/bin/bash
# Search for jar version references
rg -g '!*.jar' "0.1.0-SNAPSHOT"

# Look for build files that might define versions
fd "build.sbt|pom.xml|build.gradle" --type f
```

Length of output: 6627
## Summary

### Setup:
Download Spark and set SPARK_HOME.

> wget https://www.apache.org/dyn/closer.lua/spark/spark-3.5.4/spark-3.5.4-bin-hadoop3.tgz
> tar -xzf spark-3.5.4-bin-hadoop3.tgz

In your ~/.bashrc or ~/.zshrc add the SPARK_HOME env var:

```
export SPARK_HOME=/Users/nikhilsimha/spark-3.5.4-bin-hadoop3 # choose dir where you unpacked spark
```

The reason for requiring Spark here for the moment is the way Driver.scala is set up: it is a big melting pot of Spark subcommands and other commands. Because of those other Spark subcommands, we have to have Spark available here.

### Test:

```
(dev_chronon) davidhan@Davids-MacBook-Pro: ~/zipline/chronon/api/py/test/sample (davidhan/do_fetch_2) $ echo $RUN_PY
/Users/davidhan/zipline/chronon/api/py/ai/chronon/repo/run.py
(dev_chronon) davidhan@Davids-MacBook-Pro: ~/zipline/chronon/api/py/test/sample (davidhan/do_fetch_2) $ echo $SPARK_HOME
/Users/davidhan/zipline/chronon/spark-3.5.4-bin-hadoop3
(dev_chronon) davidhan@Davids-MacBook-Pro: ~/zipline/chronon/api/py/test/sample (davidhan/do_fetch_2) $ python $RUN_PY --mode fetch --type group-by --name quickstart/purchases.v1 -k '{"user_id":"5"}'
Running with args: {'mode': 'fetch', 'conf': None, 'env': 'dev', 'dataproc': False, 'ds': None, 'app_name': None, 'start_ds': None, 'end_ds': None, 'parallelism': None, 'repo': '.', 'online_jar': 'cloud_gcp-assembly-0.1.0-SNAPSHOT.jar', 'online_class': 'ai.chronon.integrations.cloud_gcp.GcpApiImpl', 'version': None, 'spark_version': '2.4.0', 'spark_submit_path': None, 'spark_streaming_submit_path': None, 'online_jar_fetch': None, 'sub_help': False, 'conf_type': None, 'online_args': None, 'chronon_jar': None, 'release_tag': None, 'list_apps': None, 'render_info': None}
Setting env variables:
From <common_env> setting VERSION=latest
From <common_env> setting SPARK_SUBMIT_PATH=[TODO]/path/to/spark-submit
From <common_env> setting JOB_MODE=local[*]
From <common_env> setting HADOOP_DIR=[STREAMING-TODO]/path/to/folder/containing
From <common_env> setting CHRONON_ONLINE_CLASS=[ONLINE-TODO]your.online.class
From <common_env> setting CHRONON_ONLINE_ARGS=[ONLINE-TODO]args prefixed with -Z become constructor map for your implementation of ai.chronon.online.Api, -Zkv-host=<YOUR_HOST> -Zkv-port=<YOUR_PORT>
From <common_env> setting PARTITION_COLUMN=ds
From <common_env> setting PARTITION_FORMAT=yyyy-MM-dd
From <common_env> setting CUSTOMER_ID=canary
From <common_env> setting GCP_PROJECT_ID=canary-443022
From <common_env> setting GCP_REGION=us-central1
From <common_env> setting GCP_DATAPROC_CLUSTER_NAME=zipline-canary-cluster
From <common_env> setting GCP_BIGTABLE_INSTANCE_ID=zipline-canary-instance
From <cli_args> setting APP_NAME=chronon_fetch
From <cli_args> setting CHRONON_ONLINE_JAR=cloud_gcp-assembly-0.1.0-SNAPSHOT.jar
Downloading chronon gcp jar from GCS...
Downloaded storage object jars/cloud_gcp-assembly-0.1.0-SNAPSHOT.jar from bucket zipline-artifacts-canary to local file /var/folders/2p/h5v8s0515xv20cgprdjngttr0000gn/T/tmp9ejwzuc_/cloud_gcp-assembly-0.1.0-SNAPSHOT.jar.
Running command: java -cp /var/folders/2p/h5v8s0515xv20cgprdjngttr0000gn/T/tmp9ejwzuc_/cloud_gcp-assembly-0.1.0-SNAPSHOT.jar:/Users/davidhan/zipline/chronon/spark-3.5.4-bin-hadoop3/jars/* ai.chronon.spark.Driver fetch --online-jar=cloud_gcp-assembly-0.1.0-SNAPSHOT.jar --online-class=ai.chronon.integrations.cloud_gcp.GcpApiImpl --conf-type=group_bys --type group-by --name quickstart/purchases.v1 -k {"user_id":"5"}
WARNING: sun.reflect.Reflection.getCallerClass is not supported. This will impact performance.

--- [FETCHED RESULT] ---
{
  "purchase_price_average_14d" : 72.5,
  "purchase_price_average_30d" : 250.6,
  "purchase_price_average_3d" : null,
  "purchase_price_count_14d" : 2,
  "purchase_price_count_30d" : 5,
  "purchase_price_count_3d" : null,
  "purchase_price_last10" : [ 76, 69, 367, 466, 275 ],
  "purchase_price_sum_14d" : 145,
  "purchase_price_sum_30d" : 1253,
  "purchase_price_sum_3d" : null
}
```

## Checklist
- [ ] Added Unit Tests
- [ ] Covered by existing CI
- [ ] Integration tested
- [ ] Documentation update

## Summary by CodeRabbit
- **Configuration Updates**
  - Updated GCP Dataproc cluster name
  - Modified namespace settings in test configuration
- **Runtime Improvements**
  - Enhanced JAR download and management functionality
  - Improved runtime configuration handling
  - Added new methods for JAR retrieval
- **Fetcher Enhancements**
  - Updated fetcher initialization process
  - Added result printing for better visibility