
Conversation

@david-zlai
Contributor

@david-zlai david-zlai commented Jan 26, 2025

Summary

Setup:

Download Spark and set SPARK_HOME.

wget https://www.apache.org/dyn/closer.lua/spark/spark-3.5.4/spark-3.5.4-bin-hadoop3.tgz
tar -xzf spark-3.5.4-bin-hadoop3.tgz

In your ~/.bashrc or ~/.zshrc, add the SPARK_HOME env var:

export SPARK_HOME=/Users/nikhilsimha/spark-3.5.4-bin-hadoop3                      # choose dir where you unpacked spark

We need Spark here for the moment because of the way Driver.scala is set up: it is one big melting pot of Spark subcommands alongside others, and those Spark subcommands require the Spark jars on the classpath even when we only want to fetch.
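Concretely, the classpath assembly behind the `Running command:` line in the test output below boils down to something like this (a simplified sketch; `build_fetch_command` is a hypothetical name, not the actual run.py helper):

```python
import os

def build_fetch_command(chronon_jar: str, driver_args: str) -> str:
    # Fetch mode runs locally via `java -cp`, so the Spark classes that
    # Driver.scala links against must come from the user's SPARK_HOME.
    spark_home = os.environ["SPARK_HOME"]
    classpath = f"{chronon_jar}:{spark_home}/jars/*"
    return f"java -cp {classpath} ai.chronon.spark.Driver fetch {driver_args}"
```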

Test:

(dev_chronon) davidhan@Davids-MacBook-Pro: ~/zipline/chronon/api/py/test/sample (davidhan/do_fetch_2) $ echo $RUN_PY
/Users/davidhan/zipline/chronon/api/py/ai/chronon/repo/run.py
(dev_chronon) davidhan@Davids-MacBook-Pro: ~/zipline/chronon/api/py/test/sample (davidhan/do_fetch_2) $ echo $SPARK_HOME
/Users/davidhan/zipline/chronon/spark-3.5.4-bin-hadoop3
(dev_chronon) davidhan@Davids-MacBook-Pro: ~/zipline/chronon/api/py/test/sample (davidhan/do_fetch_2) $ python $RUN_PY --mode fetch --type group-by --name quickstart/purchases.v1 -k '{"user_id":"5"}'
Running with args: {'mode': 'fetch', 'conf': None, 'env': 'dev', 'dataproc': False, 'ds': None, 'app_name': None, 'start_ds': None, 'end_ds': None, 'parallelism': None, 'repo': '.', 'online_jar': 'cloud_gcp-assembly-0.1.0-SNAPSHOT.jar', 'online_class': 'ai.chronon.integrations.cloud_gcp.GcpApiImpl', 'version': None, 'spark_version': '2.4.0', 'spark_submit_path': None, 'spark_streaming_submit_path': None, 'online_jar_fetch': None, 'sub_help': False, 'conf_type': None, 'online_args': None, 'chronon_jar': None, 'release_tag': None, 'list_apps': None, 'render_info': None}
Setting env variables:
From <common_env> setting VERSION=latest
From <common_env> setting SPARK_SUBMIT_PATH=[TODO]/path/to/spark-submit
From <common_env> setting JOB_MODE=local[*]
From <common_env> setting HADOOP_DIR=[STREAMING-TODO]/path/to/folder/containing
From <common_env> setting CHRONON_ONLINE_CLASS=[ONLINE-TODO]your.online.class
From <common_env> setting CHRONON_ONLINE_ARGS=[ONLINE-TODO]args prefixed with -Z become constructor map for your implementation of ai.chronon.online.Api, -Zkv-host=<YOUR_HOST> -Zkv-port=<YOUR_PORT>
From <common_env> setting PARTITION_COLUMN=ds
From <common_env> setting PARTITION_FORMAT=yyyy-MM-dd
From <common_env> setting CUSTOMER_ID=canary
From <common_env> setting GCP_PROJECT_ID=canary-443022
From <common_env> setting GCP_REGION=us-central1
From <common_env> setting GCP_DATAPROC_CLUSTER_NAME=zipline-canary-cluster
From <common_env> setting GCP_BIGTABLE_INSTANCE_ID=zipline-canary-instance
From <cli_args> setting APP_NAME=chronon_fetch
From <cli_args> setting CHRONON_ONLINE_JAR=cloud_gcp-assembly-0.1.0-SNAPSHOT.jar
Downloading chronon gcp jar from GCS...
Downloaded storage object jars/cloud_gcp-assembly-0.1.0-SNAPSHOT.jar from bucket zipline-artifacts-canary to local file /var/folders/2p/h5v8s0515xv20cgprdjngttr0000gn/T/tmp9ejwzuc_/cloud_gcp-assembly-0.1.0-SNAPSHOT.jar.
Running command: java -cp /var/folders/2p/h5v8s0515xv20cgprdjngttr0000gn/T/tmp9ejwzuc_/cloud_gcp-assembly-0.1.0-SNAPSHOT.jar:/Users/davidhan/zipline/chronon/spark-3.5.4-bin-hadoop3/jars/* ai.chronon.spark.Driver fetch --online-jar=cloud_gcp-assembly-0.1.0-SNAPSHOT.jar --online-class=ai.chronon.integrations.cloud_gcp.GcpApiImpl  --conf-type=group_bys  --type group-by --name quickstart/purchases.v1 -k {"user_id":"5"}
WARNING: sun.reflect.Reflection.getCallerClass is not supported. This will impact performance.
--- [FETCHED RESULT] ---
{
  "purchase_price_average_14d" : 72.5,
  "purchase_price_average_30d" : 250.6,
  "purchase_price_average_3d" : null,
  "purchase_price_count_14d" : 2,
  "purchase_price_count_30d" : 5,
  "purchase_price_count_3d" : null,
  "purchase_price_last10" : [ 76, 69, 367, 466, 275 ],
  "purchase_price_sum_14d" : 145,
  "purchase_price_sum_30d" : 1253,
  "purchase_price_sum_3d" : null
}

Checklist

  • Added Unit Tests
  • Covered by existing CI
  • Integration tested
  • Documentation update

Summary by CodeRabbit

  • Configuration Updates

    • Updated GCP Dataproc cluster name
    • Modified namespace settings in test configuration
  • Runtime Improvements

    • Enhanced JAR download and management functionality
    • Improved runtime configuration handling
    • Added new methods for JAR retrieval
  • Fetcher Enhancements

    • Updated fetcher initialization process
    • Added result printing for better visibility

@coderabbitai
Contributor

coderabbitai bot commented Jan 26, 2025

Walkthrough

The pull request introduces modifications to multiple files across the Chronon project. In run.py, the changes focus on enhancing JAR download mechanisms, runtime configuration handling, and job management. The teams.json file sees minor configuration updates for GCP Dataproc cluster and namespace settings. In Driver.scala, the fetcher initialization and result reporting process has been refined, with a new approach to building fetchers and improved result visibility.

Changes

File: api/py/ai/chronon/repo/run.py
  • Added conf_type parameter to Runner class
  • Renamed download_dataproc_jar to download_dataproc_submitter_jar
  • Added download_chronon_gcp_jar method
  • Updated JAR download and runtime environment logic

File: api/py/test/sample/teams.json
  • Updated GCP_DATAPROC_CLUSTER_NAME to "zipline-canary-cluster"
  • Changed namespace to "canary-443022.data"

File: spark/src/main/scala/ai/chronon/spark/Driver.scala
  • Modified fetcher initialization in FetcherCli
  • Added formatted result printing
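In effect, run.py now picks which jar to download based on the run mode; roughly (a hypothetical dispatch written for illustration, using the two function names above — the real signatures may differ):

```python
def fetch_jar(dataproc: bool, destination_dir: str, customer_id: str) -> str:
    # Sketch only: Dataproc submissions need the submitter jar, while a
    # local fetch pulls the cloud_gcp assembly jar from GCS.
    if dataproc:
        return download_dataproc_submitter_jar(destination_dir, customer_id)
    return download_chronon_gcp_jar(destination_dir, customer_id)
```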

Suggested Reviewers

  • piyush-zlai
  • nikhil-zlai

Poem

In the realm of code, where bits dance free,
JAR downloads sing a new melody 🎵
Clusters shift, configurations bright,
Chronon's magic takes playful flight! ✨
Refactored paths, a developer's glee 🚀

Warning

Review ran into problems

🔥 Problems

GitHub Actions: Resource not accessible by integration - https://docs.github.com/rest/actions/workflow-runs#list-workflow-runs-for-a-repository.

Please grant the required permissions to the CodeRabbit GitHub App under the organization or repository settings.


else:
    # Always download the jar for now so that we can pull
    # in any fixes or latest changes
    # if self.dataproc:
Collaborator

Is this meant to be commented out?

Collaborator

@varant-zlai varant-zlai left a comment

Confirming I was able to get this to work locally.

varant-zlai added a commit that referenced this pull request Jan 28, 2025
## Summary

Manual cherry-pick from https://github.com/airbnb/chronon/pull/909/files

Should be rebased and tested off of
#276

## Checklist
- [ ] Added Unit Tests
- [ ] Covered by existing CI
- [ ] Integration tested
- [ ] Documentation update
- [x] Not yet tested, should test after merging 276



<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->
## Summary by CodeRabbit

- **New Features**
- Added a builder pattern for `JavaFetcher` to enhance object creation
flexibility.
- Introduced an optional `ExecutionContext` parameter across multiple
classes to improve execution context management.

- **Improvements**
- Updated constructor signatures in Java and Scala classes to support
more configurable execution contexts.
- Enhanced instantiation process with more structured object creation
methods.
<!-- end of auto-generated comment: release notes by coderabbit.ai -->

---------

Co-authored-by: ezvz <[email protected]>
Co-authored-by: Nikhil Simha <[email protected]>
Co-authored-by: Thomas Chow <[email protected]>
@david-zlai david-zlai marked this pull request as ready for review January 30, 2025 18:23
Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 0

🧹 Nitpick comments (1)
api/py/ai/chronon/repo/run.py (1)

415-416: Consider reducing condition complexity.

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro (Legacy)

📥 Commits

Reviewing files that changed from the base of the PR and between c0f1645 and a566189.

📒 Files selected for processing (3)
  • api/py/ai/chronon/repo/run.py (10 hunks)
  • api/py/test/sample/teams.json (2 hunks)
  • spark/src/main/scala/ai/chronon/spark/Driver.scala (2 hunks)
🔇 Additional comments (13)
api/py/ai/chronon/repo/run.py (9)

208-208: All good.


411-411: Looks fine.


452-458: SPARK_HOME logic looks fine.


474-490: Remove commented-out code if not needed.


491-547: No issues spotted.


Line range hint 548-654: Line 550 is commented out; remove if unneeded.


655-664: Parallel mode looks fine.


665-665: Single call usage is fine.


900-908: Temp directory logic is clear.

spark/src/main/scala/ai/chronon/spark/Driver.scala (2)

693-693: Shifting to args.api is OK.


726-727: Neat alignment of result logging.

api/py/test/sample/teams.json (2)

18-18: Dataproc cluster name updated successfully.


64-64: Namespace change looks fine.

Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 2

🧹 Nitpick comments (2)
api/py/ai/chronon/repo/run.py (2)

414-416: Enhance JAR validation logic.

Consider extracting the complex condition into a descriptive method for better readability.

-        if (self.mode in ONLINE_MODES) and (not args["sub_help"] and not self.dataproc) and not valid_jar and (
-                args.get("online_jar_fetch")):
+        def should_download_online_jar():
+            return (self.mode in ONLINE_MODES and 
+                    not args["sub_help"] and 
+                    not self.dataproc and 
+                    not valid_jar and 
+                    args.get("online_jar_fetch"))
+
+        if should_download_online_jar():

808-819: Enhance error messages in GCP JAR download.

Add more context to error messages to help troubleshoot GCP-specific issues. Note the enriched print has to come after bucket_name and source_blob_name are assigned:

 def download_chronon_gcp_jar(destination_dir: str, customer_id: str):
-    print("Downloading chronon gcp jar from GCS...")
     bucket_name = f"zipline-artifacts-{customer_id}"
     file_name = "cloud_gcp-assembly-0.1.0-SNAPSHOT.jar"
     source_blob_name = f"jars/{file_name}"
+    print(f"Downloading chronon GCP JAR from gs://{bucket_name}/{source_blob_name}...")
     chronon_gcp_jar_destination_path = f"{destination_dir}/{file_name}"
     download_gcs_blob(bucket_name, source_blob_name,
                       chronon_gcp_jar_destination_path)
     return chronon_gcp_jar_destination_path
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro (Legacy)

📥 Commits

Reviewing files that changed from the base of the PR and between a566189 and 0bae733.

📒 Files selected for processing (3)
  • api/py/ai/chronon/repo/run.py (10 hunks)
  • api/py/test/sample/teams.json (2 hunks)
  • spark/src/main/scala/ai/chronon/spark/Driver.scala (2 hunks)
🚧 Files skipped from review as they are similar to previous changes (2)
  • api/py/test/sample/teams.json
  • spark/src/main/scala/ai/chronon/spark/Driver.scala
⏰ Context from checks skipped due to timeout of 90000ms (8)
  • GitHub Check: table_utils_delta_format_spark_tests
  • GitHub Check: other_spark_tests
  • GitHub Check: mutation_spark_tests
  • GitHub Check: join_spark_tests
  • GitHub Check: fetcher_spark_tests
  • GitHub Check: scala_compile_fmt_fix
  • GitHub Check: no_spark_scala_tests
  • GitHub Check: enforce_triggered_workflows
🔇 Additional comments (2)
api/py/ai/chronon/repo/run.py (2)

474-474: Remove commented code.

The commented line appears to be debugging code. Either remove it or add a clear comment explaining its purpose.


453-458: Validate SPARK_HOME path more thoroughly.

The current validation only checks if the jars directory exists. Consider validating essential JAR files too.
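A minimal sketch of what that stricter check could look like (the helper name and the spark-core heuristic are assumptions, not code from this PR):

```python
import glob
import os

def validate_spark_home(spark_home: str) -> None:
    jars_dir = os.path.join(spark_home, "jars")
    if not os.path.isdir(jars_dir):
        raise RuntimeError(f"SPARK_HOME={spark_home} has no jars/ directory")
    # Spark distributions bundle spark-core; its absence suggests a bad unpack.
    if not glob.glob(os.path.join(jars_dir, "spark-core_*.jar")):
        raise RuntimeError(f"{jars_dir} does not look like a Spark distribution")
```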

Comment on lines 607 to 608
# command = f"java -cp {dataproc_jar} {DATAPROC_ENTRY} {dataproc_command}"
command = f"java -cp {self.jar_path} {DATAPROC_ENTRY} {dataproc_command}"
Contributor

🛠️ Refactor suggestion

Eliminate command generation duplication.

The Dataproc command generation is duplicated. Consider extracting it into a helper method.

+    def _build_dataproc_command(self, dataproc_command: str) -> str:
+        return f"java -cp {self.jar_path} {DATAPROC_ENTRY} {dataproc_command}"
+
-            command = f"java -cp {self.jar_path} {DATAPROC_ENTRY} {dataproc_command}"
+            command = self._build_dataproc_command(dataproc_command)

Also applies to: 651-652

Comment on lines +506 to +516
running_app_map = {}
for app in running_apps:
    try:
        app_json = json.loads(app.strip())
        app_name = app_json["app_name"].strip()
        if app_name not in running_app_map:
            running_app_map[app_name] = []
        running_app_map[app_name].append(app_json)
    except Exception as ex:
        print("failed to process line into app: " + app)
        print(ex)
Contributor

🛠️ Refactor suggestion

Improve error handling in app processing.

The exception handling swallows errors silently. Consider logging the full stack trace for debugging.

         running_app_map = {}
         for app in running_apps:
             try:
                 app_json = json.loads(app.strip())
                 app_name = app_json["app_name"].strip()
                 if app_name not in running_app_map:
                     running_app_map[app_name] = []
                 running_app_map[app_name].append(app_json)
             except Exception as ex:
-                print("failed to process line into app: " + app)
-                print(ex)
+                logging.error("Failed to process app: %s", app, exc_info=True)

Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 0

🧹 Nitpick comments (1)
api/py/ai/chronon/repo/run.py (1)

411-416: Improve JAR download condition readability.

The nested conditions make the logic hard to follow.

-        if (self.mode in ONLINE_MODES) and (not args["sub_help"] and not self.dataproc) and not valid_jar and (
-                args.get("online_jar_fetch")):
+        should_download = (
+            self.mode in ONLINE_MODES
+            and not args["sub_help"]
+            and not self.dataproc
+            and not valid_jar
+            and args.get("online_jar_fetch")
+        )
+        if should_download:
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro (Legacy)

📥 Commits

Reviewing files that changed from the base of the PR and between 0bae733 and e8a79c2.

📒 Files selected for processing (1)
  • api/py/ai/chronon/repo/run.py (10 hunks)
⏰ Context from checks skipped due to timeout of 90000ms (8)
  • GitHub Check: table_utils_delta_format_spark_tests
  • GitHub Check: mutation_spark_tests
  • GitHub Check: fetcher_spark_tests
  • GitHub Check: other_spark_tests
  • GitHub Check: join_spark_tests
  • GitHub Check: no_spark_scala_tests
  • GitHub Check: scala_compile_fmt_fix
  • GitHub Check: enforce_triggered_workflows
🔇 Additional comments (5)
api/py/ai/chronon/repo/run.py (5)

453-458: LGTM! Good enhancement for SPARK_HOME integration.

The code properly checks for SPARK_HOME existence and appends jars directory to classpath.


474-534: Improve error handling in streaming job management.

The error handling for app processing is still basic.

-                        print("failed to process line into app: " + app)
-                        print(ex)
+                        logging.error("Failed to process app: %s", app, exc_info=True)

578-584: Eliminate command generation duplication.

The Dataproc command generation logic is duplicated.

+    def _build_dataproc_command(self, user_args: str, local_files: List[str]) -> str:
+        dataproc_command = generate_dataproc_submitter_args(
+            local_files_to_upload_to_gcs=local_files,
+            user_args=user_args
+        )
+        return f"java -cp {self.jar_path} {DATAPROC_ENTRY} {dataproc_command}"

Also applies to: 619-634


892-899: LGTM! Good use of temporary directory.

Using tempfile.TemporaryDirectory ensures proper cleanup.
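For reference, the pattern being called out is roughly this (a sketch; `run_driver` is a hypothetical stand-in for building and running the java command):

```python
import tempfile

# The temporary directory, and the jar downloaded into it, are cleaned up
# automatically when the block exits, even if the command fails.
with tempfile.TemporaryDirectory() as tmp_dir:
    jar_path = download_chronon_gcp_jar(tmp_dir, customer_id="canary")
    run_driver(jar_path)  # hypothetical helper
```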


Line range hint 786-812: Add version validation for GCP jar downloads.

The hardcoded version "0.1.0-SNAPSHOT" should be validated.

✅ Verification successful

Version "0.1.0-SNAPSHOT" is the project's standard development version

The version is correctly defined in version.sbt and consistently used across the project's build system and deployment scripts.

🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Check if the jar version exists in GCS
gsutil ls gs://zipline-artifacts-*/jars/cloud_gcp-assembly-0.1.0-SNAPSHOT.jar

Length of output: 178


Script:

#!/bin/bash
# Search for jar version references
rg -g '!*.jar' "0.1.0-SNAPSHOT"

# Look for build files that might define versions
fd "build.sbt|pom.xml|build.gradle" --type f

Length of output: 6627

@david-zlai david-zlai merged commit 6793d35 into main Jan 30, 2025
12 checks passed
@david-zlai david-zlai deleted the davidhan/do_fetch_2 branch January 30, 2025 19:36
nikhil-zlai added a commit that referenced this pull request Feb 4, 2025
nikhil-zlai pushed a commit that referenced this pull request Feb 4, 2025
kumar-zlai pushed a commit that referenced this pull request Apr 25, 2025
kumar-zlai pushed a commit that referenced this pull request Apr 25, 2025
kumar-zlai pushed a commit that referenced this pull request Apr 29, 2025
kumar-zlai pushed a commit that referenced this pull request Apr 29, 2025
chewy-zlai pushed a commit that referenced this pull request May 15, 2025
chewy-zlai pushed a commit that referenced this pull request May 15, 2025
chewy-zlai pushed a commit that referenced this pull request May 15, 2025
chewy-zlai pushed a commit that referenced this pull request May 15, 2025
chewy-zlai pushed a commit that referenced this pull request May 16, 2025
chewy-zlai pushed a commit that referenced this pull request May 16, 2025