Conversation

@varant-zlai (Collaborator) commented Apr 2, 2025

Summary

Adds a flag to turn on "required columns only" mode for the merge job, to optimize cases where the left side has large columns that are not required for the join and that we don't want to carry through each shuffle.
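
For context, here is a minimal Spark sketch of the idea behind the flag (illustrative only; the `merge` helper, `requiredCols`, and the `__row_id` column are assumptions, not the actual merge-job code): project the left side down to the columns the join needs before the shuffle, then re-attach the wide payload columns afterwards via an internal row id.

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.monotonically_increasing_id

object RequiredColumnsOnlyMerge {
  // requiredCols = join keys + columns the join logic actually reads.
  def merge(left: DataFrame,
            right: DataFrame,
            joinKeys: Seq[String],
            requiredCols: Seq[String],
            rowIdCol: String = "__row_id"): DataFrame = {
    // Tag each left row with an internal id so the wide columns can be re-attached later.
    // (monotonically_increasing_id is used here only as a simple illustration.)
    val withId = left.withColumn(rowIdCol, monotonically_increasing_id())

    // Shuffle only the slim projection: required columns plus the internal id.
    val slimLeft = withId.select(rowIdCol, requiredCols: _*)
    val joined   = slimLeft.join(right, joinKeys, "left")

    // Re-attach the large, non-required left columns by the internal id
    // (assumes the wide left column names don't collide with the right side's columns).
    val widePart = withId.drop(requiredCols: _*)
    joined.join(widePart, Seq(rowIdCol)).drop(rowIdCol)
  }
}
```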

Checklist

  • Added Unit Tests
  • Covered by existing CI
  • Integration tested
  • Documentation update

Summary by CodeRabbit

  • New Features

    • Introduced configurable options for managing join operations and internal identifiers, offering greater flexibility.
  • Refactor

    • Enhanced the data merging process with improved handling of required columns and unique identifiers to ensure more robust join operations.
  • Bug Fixes

    • Adjusted error handling for empty data scenarios to log warnings instead of triggering failures (see the sketch after this list).
  • Tests

    • Updated test configurations to validate and reflect the improved data processing and join logic, including new schema elements.
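
A minimal, hypothetical illustration of the "log warnings instead of triggering failures" behavior mentioned in the Bug Fixes bullet above; `warnIfEmpty` and its names are assumptions for illustration, not the actual Chronon code.

```scala
import org.apache.spark.sql.DataFrame
import org.slf4j.LoggerFactory

object EmptyDataHandling {
  private val logger = LoggerFactory.getLogger(getClass)

  // Returns the DataFrame unchanged; an empty result only produces a warning.
  def warnIfEmpty(df: DataFrame, table: String): DataFrame = {
    if (df.isEmpty) {
      // Previously an empty scan could be treated as a hard failure;
      // here we only warn and let the caller proceed.
      logger.warn(s"No rows found for table $table; continuing with an empty DataFrame")
    }
    df
  }
}
```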

david-zlai and others added 30 commits January 31, 2025 15:07
…table. (#312)

…table.

## Summary

## Checklist
- [ ] Added Unit Tests
- [ ] Covered by existing CI
- [ ] Integration tested
- [ ] Documentation update




## Summary by CodeRabbit

- **Logging**
	- Adjusted log levels for table accessibility and partition availability messages from error to info.

## Summary

## Checklist
- [ ] Added Unit Tests
- [ ] Covered by existing CI
- [ ] Integration tested
- [ ] Documentation update





## Summary by CodeRabbit

- **New Features**
	- Added `bazelisk` plugin with version `1.25.0`
	- Configured `bazelisk` plugin repository and commit hash


Co-authored-by: Thomas Chow <[email protected]>
## Summary
^^^

```
davidhan@Mac: ~/zipline/chronon/api/py/test/sample (davidhan/streaming_verb) $ echo $RUN_PY
/Users/davidhan/zipline/chronon/api/py/ai/chronon/repo/run.py

davidhan@Mac: ~/zipline/chronon/api/py/test/sample (davidhan/streaming_verb) $ python $RUN_PY --mode streaming --dataproc --groupby-name=etsy.listing_canary.actions_v1 --kafka-bootstrap=bootstrap.zipline-kafka-cluster.us-central1.managedkafka.canary-443022.cloud.goog:9092

```
Produced job:
https://console.cloud.google.com/dataproc/jobs/db59b95d-adae-4802-8d96-5613b0799c04/monitoring?region=us-central1&project=canary-443022

## Checklist
- [ ] Added Unit Tests
- [ ] Covered by existing CI
- [ ] Integration tested
- [ ] Documentation update



## Summary by CodeRabbit

- **New Features**
	- Added support for Flink streaming jobs alongside existing Spark jobs.
	- Introduced new command-line options for Flink-specific job configurations.
	- Enhanced job submission capabilities for both Spark and Flink job types.

- **Improvements**
	- Refined argument handling for job submissions.
	- Updated build and upload scripts to include Flink JAR artifacts.

- **Technical Updates**
	- Introduced new job type enumeration.
	- Added methods to support Flink streaming job execution.
	- Enhanced clarity and maintainability of argument handling in job submission logic.
## Summary

```
python distribution/run_zipline_quickstart.py
```

This runs the full zipline suite of commands against a test quickstart
groupby.

Example:
```
davidhan@Davids-MacBook-Pro: ~/zipline/chronon (davidhan/do_fetch_test) $ python3 distribution/run_zipline_quickstart.py 
Created temporary directory: /var/folders/2p/h5v8s0515xv20cgprdjngttr0000gn/T/tmpkirssr9l
+ WORKING_DIR=/var/folders/2p/h5v8s0515xv20cgprdjngttr0000gn/T/tmpkirssr9l
+ cd /var/folders/2p/h5v8s0515xv20cgprdjngttr0000gn/T/tmpkirssr9l
+ GREEN='\033[0;32m'
+ RED='\033[0;31m'
+ WHEEL_FILE=zipline_ai-0.1.0.dev0-py3-none-any.whl
+ bq rm -f -t canary-443022:data.quickstart_purchases_v1_test
+ bq rm -f -t canary-443022:data.quickstart_purchases_v1_test_upload
+ '[' -z '' ']'
+ wget https://dlcdn.apache.org/spark/spark-3.5.4/spark-3.5.4-bin-hadoop3.tgz
--2025-01-30 10:16:21--  https://dlcdn.apache.org/spark/spark-3.5.4/spark-3.5.4-bin-hadoop3.tgz
Resolving dlcdn.apache.org (dlcdn.apache.org)... 151.101.2.132
Connecting to dlcdn.apache.org (dlcdn.apache.org)|151.101.2.132|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 400879762 (382M) [application/x-gzip]
Saving to: ‘spark-3.5.4-bin-hadoop3.tgz’

spark-3.5.4-bin-hadoop3.tgz                                 100%[==========================================================================================================================================>] 382.31M  50.2MB/s    in 8.4s    

2025-01-30 10:16:30 (45.5 MB/s) - ‘spark-3.5.4-bin-hadoop3.tgz’ saved [400879762/400879762]

+ tar -xzf spark-3.5.4-bin-hadoop3.tgz
++ pwd
+ export SPARK_HOME=/var/folders/2p/h5v8s0515xv20cgprdjngttr0000gn/T/tmpkirssr9l/spark-3.5.4-bin-hadoop3
+ SPARK_HOME=/var/folders/2p/h5v8s0515xv20cgprdjngttr0000gn/T/tmpkirssr9l/spark-3.5.4-bin-hadoop3
+ git clone [email protected]:zipline-ai/cananry-confs.git
Cloning into 'cananry-confs'...
remote: Enumerating objects: 148, done.
remote: Counting objects: 100% (148/148), done.
remote: Compressing objects: 100% (77/77), done.
remote: Total 148 (delta 63), reused 139 (delta 60), pack-reused 0 (from 0)
Receiving objects: 100% (148/148), 93.28 KiB | 746.00 KiB/s, done.
Resolving deltas: 100% (63/63), done.
+ cd cananry-confs
+ git fetch origin davidhan/canary
From github.com:zipline-ai/cananry-confs
 * branch            davidhan/canary -> FETCH_HEAD
+ git checkout davidhan/canary
branch 'davidhan/canary' set up to track 'origin/davidhan/canary'.
Switched to a new branch 'davidhan/canary'
+ python3 -m venv tmp_chronon
+ source tmp_chronon/bin/activate
++ deactivate nondestructive
++ '[' -n '' ']'
++ '[' -n '' ']'
++ hash -r
++ '[' -n '' ']'
++ unset VIRTUAL_ENV
++ unset VIRTUAL_ENV_PROMPT
++ '[' '!' nondestructive = nondestructive ']'
++ case "$(uname)" in
+++ uname
++ export VIRTUAL_ENV=/private/var/folders/2p/h5v8s0515xv20cgprdjngttr0000gn/T/tmpkirssr9l/cananry-confs/tmp_chronon
++ VIRTUAL_ENV=/private/var/folders/2p/h5v8s0515xv20cgprdjngttr0000gn/T/tmpkirssr9l/cananry-confs/tmp_chronon
++ _OLD_VIRTUAL_PATH=/Users/davidhan/.asdf/plugins/python/shims:/Users/davidhan/.asdf/installs/python/3.13.0/bin:/Users/davidhan/Downloads/google-cloud-sdk/bin:/Users/davidhan/.cargo/bin:/Users/davidhan/.asdf/shims:/Users/davidhan/.asdf/bin:/opt/homebrew/bin:/opt/homebrew/sbin:/usr/local/bin:/System/Cryptexes/App/usr/bin:/usr/bin:/bin:/usr/sbin:/sbin:/var/run/com.apple.security.cryptexd/codex.system/bootstrap/usr/local/bin:/var/run/com.apple.security.cryptexd/codex.system/bootstrap/usr/bin:/var/run/com.apple.security.cryptexd/codex.system/bootstrap/usr/appleinternal/bin
++ PATH=/private/var/folders/2p/h5v8s0515xv20cgprdjngttr0000gn/T/tmpkirssr9l/cananry-confs/tmp_chronon/bin:/Users/davidhan/.asdf/plugins/python/shims:/Users/davidhan/.asdf/installs/python/3.13.0/bin:/Users/davidhan/Downloads/google-cloud-sdk/bin:/Users/davidhan/.cargo/bin:/Users/davidhan/.asdf/shims:/Users/davidhan/.asdf/bin:/opt/homebrew/bin:/opt/homebrew/sbin:/usr/local/bin:/System/Cryptexes/App/usr/bin:/usr/bin:/bin:/usr/sbin:/sbin:/var/run/com.apple.security.cryptexd/codex.system/bootstrap/usr/local/bin:/var/run/com.apple.security.cryptexd/codex.system/bootstrap/usr/bin:/var/run/com.apple.security.cryptexd/codex.system/bootstrap/usr/appleinternal/bin
++ export PATH
++ VIRTUAL_ENV_PROMPT=tmp_chronon
++ export VIRTUAL_ENV_PROMPT
++ '[' -n '' ']'
++ '[' -z '' ']'
++ _OLD_VIRTUAL_PS1=
++ PS1='(tmp_chronon) '
++ export PS1
++ hash -r
+ gcloud storage cp gs://zipline-artifacts-canary/jars/zipline_ai-0.1.0.dev0-py3-none-any.whl .
Copying gs://zipline-artifacts-canary/jars/zipline_ai-0.1.0.dev0-py3-none-any.whl to file://./zipline_ai-0.1.0.dev0-py3-none-any.whl
  Completed files 1/1 | 371.1kiB/371.1kiB                                                                                                                                                                                                     
+ pip uninstall zipline-ai
WARNING: Skipping zipline-ai as it is not installed.
+ pip install --force-reinstall zipline_ai-0.1.0.dev0-py3-none-any.whl
Processing ./zipline_ai-0.1.0.dev0-py3-none-any.whl
Collecting click (from zipline-ai==0.1.0.dev0)
  Using cached click-8.1.8-py3-none-any.whl.metadata (2.3 kB)
Collecting thrift==0.21.0 (from zipline-ai==0.1.0.dev0)
  Using cached thrift-0.21.0-cp313-cp313-macosx_15_0_arm64.whl
Collecting google-cloud-storage==2.19.0 (from zipline-ai==0.1.0.dev0)
  Using cached google_cloud_storage-2.19.0-py2.py3-none-any.whl.metadata (9.1 kB)
Collecting google-auth<3.0dev,>=2.26.1 (from google-cloud-storage==2.19.0->zipline-ai==0.1.0.dev0)
  Using cached google_auth-2.38.0-py2.py3-none-any.whl.metadata (4.8 kB)
Collecting google-api-core<3.0.0dev,>=2.15.0 (from google-cloud-storage==2.19.0->zipline-ai==0.1.0.dev0)
  Using cached google_api_core-2.24.1-py3-none-any.whl.metadata (3.0 kB)
Collecting google-cloud-core<3.0dev,>=2.3.0 (from google-cloud-storage==2.19.0->zipline-ai==0.1.0.dev0)
  Using cached google_cloud_core-2.4.1-py2.py3-none-any.whl.metadata (2.7 kB)
Collecting google-resumable-media>=2.7.2 (from google-cloud-storage==2.19.0->zipline-ai==0.1.0.dev0)
  Using cached google_resumable_media-2.7.2-py2.py3-none-any.whl.metadata (2.2 kB)
Collecting requests<3.0.0dev,>=2.18.0 (from google-cloud-storage==2.19.0->zipline-ai==0.1.0.dev0)
  Using cached requests-2.32.3-py3-none-any.whl.metadata (4.6 kB)
Collecting google-crc32c<2.0dev,>=1.0 (from google-cloud-storage==2.19.0->zipline-ai==0.1.0.dev0)
  Using cached google_crc32c-1.6.0-py3-none-any.whl
Collecting six>=1.7.2 (from thrift==0.21.0->zipline-ai==0.1.0.dev0)
  Using cached six-1.17.0-py2.py3-none-any.whl.metadata (1.7 kB)
Collecting googleapis-common-protos<2.0.dev0,>=1.56.2 (from google-api-core<3.0.0dev,>=2.15.0->google-cloud-storage==2.19.0->zipline-ai==0.1.0.dev0)
  Using cached googleapis_common_protos-1.66.0-py2.py3-none-any.whl.metadata (1.5 kB)
Collecting protobuf!=3.20.0,!=3.20.1,!=4.21.0,!=4.21.1,!=4.21.2,!=4.21.3,!=4.21.4,!=4.21.5,<6.0.0.dev0,>=3.19.5 (from google-api-core<3.0.0dev,>=2.15.0->google-cloud-storage==2.19.0->zipline-ai==0.1.0.dev0)
  Using cached protobuf-5.29.3-cp38-abi3-macosx_10_9_universal2.whl.metadata (592 bytes)
Collecting proto-plus<2.0.0dev,>=1.22.3 (from google-api-core<3.0.0dev,>=2.15.0->google-cloud-storage==2.19.0->zipline-ai==0.1.0.dev0)
  Using cached proto_plus-1.26.0-py3-none-any.whl.metadata (2.2 kB)
Collecting cachetools<6.0,>=2.0.0 (from google-auth<3.0dev,>=2.26.1->google-cloud-storage==2.19.0->zipline-ai==0.1.0.dev0)
  Using cached cachetools-5.5.1-py3-none-any.whl.metadata (5.4 kB)
Collecting pyasn1-modules>=0.2.1 (from google-auth<3.0dev,>=2.26.1->google-cloud-storage==2.19.0->zipline-ai==0.1.0.dev0)
  Using cached pyasn1_modules-0.4.1-py3-none-any.whl.metadata (3.5 kB)
Collecting rsa<5,>=3.1.4 (from google-auth<3.0dev,>=2.26.1->google-cloud-storage==2.19.0->zipline-ai==0.1.0.dev0)
  Using cached rsa-4.9-py3-none-any.whl.metadata (4.2 kB)
Collecting charset-normalizer<4,>=2 (from requests<3.0.0dev,>=2.18.0->google-cloud-storage==2.19.0->zipline-ai==0.1.0.dev0)
  Using cached charset_normalizer-3.4.1-cp313-cp313-macosx_10_13_universal2.whl.metadata (35 kB)
Collecting idna<4,>=2.5 (from requests<3.0.0dev,>=2.18.0->google-cloud-storage==2.19.0->zipline-ai==0.1.0.dev0)
  Using cached idna-3.10-py3-none-any.whl.metadata (10 kB)
Collecting urllib3<3,>=1.21.1 (from requests<3.0.0dev,>=2.18.0->google-cloud-storage==2.19.0->zipline-ai==0.1.0.dev0)
  Using cached urllib3-2.3.0-py3-none-any.whl.metadata (6.5 kB)
Collecting certifi>=2017.4.17 (from requests<3.0.0dev,>=2.18.0->google-cloud-storage==2.19.0->zipline-ai==0.1.0.dev0)
  Using cached certifi-2024.12.14-py3-none-any.whl.metadata (2.3 kB)
Collecting pyasn1<0.7.0,>=0.4.6 (from pyasn1-modules>=0.2.1->google-auth<3.0dev,>=2.26.1->google-cloud-storage==2.19.0->zipline-ai==0.1.0.dev0)
  Using cached pyasn1-0.6.1-py3-none-any.whl.metadata (8.4 kB)
Using cached google_cloud_storage-2.19.0-py2.py3-none-any.whl (131 kB)
Using cached click-8.1.8-py3-none-any.whl (98 kB)
Using cached google_api_core-2.24.1-py3-none-any.whl (160 kB)
Using cached google_auth-2.38.0-py2.py3-none-any.whl (210 kB)
Using cached google_cloud_core-2.4.1-py2.py3-none-any.whl (29 kB)
Using cached google_resumable_media-2.7.2-py2.py3-none-any.whl (81 kB)
Using cached requests-2.32.3-py3-none-any.whl (64 kB)
Using cached six-1.17.0-py2.py3-none-any.whl (11 kB)
Using cached cachetools-5.5.1-py3-none-any.whl (9.5 kB)
Using cached certifi-2024.12.14-py3-none-any.whl (164 kB)
Using cached charset_normalizer-3.4.1-cp313-cp313-macosx_10_13_universal2.whl (195 kB)
Using cached googleapis_common_protos-1.66.0-py2.py3-none-any.whl (221 kB)
Using cached idna-3.10-py3-none-any.whl (70 kB)
Using cached proto_plus-1.26.0-py3-none-any.whl (50 kB)
Using cached protobuf-5.29.3-cp38-abi3-macosx_10_9_universal2.whl (417 kB)
Using cached pyasn1_modules-0.4.1-py3-none-any.whl (181 kB)
Using cached rsa-4.9-py3-none-any.whl (34 kB)
Using cached urllib3-2.3.0-py3-none-any.whl (128 kB)
Using cached pyasn1-0.6.1-py3-none-any.whl (83 kB)
Installing collected packages: urllib3, six, pyasn1, protobuf, idna, google-crc32c, click, charset-normalizer, certifi, cachetools, thrift, rsa, requests, pyasn1-modules, proto-plus, googleapis-common-protos, google-resumable-media, google-auth, google-api-core, google-cloud-core, google-cloud-storage, zipline-ai
Successfully installed cachetools-5.5.1 certifi-2024.12.14 charset-normalizer-3.4.1 click-8.1.8 google-api-core-2.24.1 google-auth-2.38.0 google-cloud-core-2.4.1 google-cloud-storage-2.19.0 google-crc32c-1.6.0 google-resumable-media-2.7.2 googleapis-common-protos-1.66.0 idna-3.10 proto-plus-1.26.0 protobuf-5.29.3 pyasn1-0.6.1 pyasn1-modules-0.4.1 requests-2.32.3 rsa-4.9 six-1.17.0 thrift-0.21.0 urllib3-2.3.0 zipline-ai-0.1.0.dev0

[notice] A new release of pip is available: 24.2 -> 25.0
[notice] To update, run: pip install --upgrade pip
++ pwd
+ export PYTHONPATH=:/var/folders/2p/h5v8s0515xv20cgprdjngttr0000gn/T/tmpkirssr9l/cananry-confs
+ PYTHONPATH=:/var/folders/2p/h5v8s0515xv20cgprdjngttr0000gn/T/tmpkirssr9l/cananry-confs
+ DATAPROC_SUBMITTER_ID_STR='Dataproc submitter job id'
+ echo -e '\033[0;32m<<<<<.....................................COMPILE.....................................>>>>>\033[0m'
<<<<<.....................................COMPILE.....................................>>>>>
+ zipline compile --conf=group_bys/quickstart/purchases.py
  Using chronon root path - /private/var/folders/2p/h5v8s0515xv20cgprdjngttr0000gn/T/tmpkirssr9l/cananry-confs
     Input group_bys from - /private/var/folders/2p/h5v8s0515xv20cgprdjngttr0000gn/T/tmpkirssr9l/cananry-confs/group_bys/quickstart/purchases.py
             GroupBy Team - quickstart
             GroupBy Name - purchases.v1
       Writing GroupBy to - /private/var/folders/2p/h5v8s0515xv20cgprdjngttr0000gn/T/tmpkirssr9l/cananry-confs/production/group_bys/quickstart/purchases.v1
             GroupBy Team - quickstart
             GroupBy Name - purchases.v1_test
       Writing GroupBy to - /private/var/folders/2p/h5v8s0515xv20cgprdjngttr0000gn/T/tmpkirssr9l/cananry-confs/production/group_bys/quickstart/purchases.v1_test
Successfully wrote 2 GroupBy objects to /private/var/folders/2p/h5v8s0515xv20cgprdjngttr0000gn/T/tmpkirssr9l/cananry-confs/production
+ echo -e '\033[0;32m<<<<<.....................................BACKFILL.....................................>>>>>\033[0m'
<<<<<.....................................BACKFILL.....................................>>>>>
+ touch tmp_backfill.out
+ zipline run --conf production/group_bys/quickstart/purchases.v1_test --dataproc
+ tee /dev/tty tmp_backfill.out
Running with args: {'conf': 'production/group_bys/quickstart/purchases.v1_test', 'dataproc': True, 'env': 'dev', 'mode': None, 'ds': None, 'app_name': None, 'start_ds': None, 'end_ds': None, 'parallelism': None, 'repo': '.', 'online_jar': 'cloud_gcp-assembly-0.1.0-SNAPSHOT.jar', 'online_class': 'ai.chronon.integrations.cloud_gcp.GcpApiImpl', 'version': None, 'spark_version': '2.4.0', 'spark_submit_path': None, 'spark_streaming_submit_path': None, 'online_jar_fetch': None, 'sub_help': False, 'conf_type': None, 'online_args': None, 'chronon_jar': None, 'release_tag': None, 'list_apps': None, 'render_info': None}
Running with args: {'conf': 'production/group_bys/quickstart/purchases.v1_test', 'dataproc': True, 'env': 'dev', 'mode': None, 'ds': None, 'app_name': None, 'start_ds': None, 'end_ds': None, 'parallelism': None, 'repo': '.', 'online_jar': 'cloud_gcp-assembly-0.1.0-SNAPSHOT.jar', 'online_class': 'ai.chronon.integrations.cloud_gcp.GcpApiImpl', 'version': None, 'spark_version': '2.4.0', 'spark_submit_path': None, 'spark_streaming_submit_path': None, 'online_jar_fetch': None, 'sub_help': False, 'conf_type': None, 'online_args': None, 'chronon_jar': None, 'release_tag': None, 'list_apps': None, 'render_info': None}
Array(group-by-backfill, --conf-path=purchases.v1_test, --end-date=2025-01-30, --conf-type=group_bys, --additional-conf-path=additional-confs.yaml, --is-gcp, --gcp-project-id=canary-443022, --gcp-bigtable-instance-id=zipline-canary-instance)Array(group-by-backfill, --conf-path=purchases.v1_test, --end-date=2025-01-30, --conf-type=group_bys, --additional-conf-path=additional-confs.yaml, --is-gcp, --gcp-project-id=canary-443022, --gcp-bigtable-instance-id=zipline-canary-instance)

WARNING: sun.reflect.Reflection.getCallerClass is not supported. This will impact performance.
WARNING: sun.reflect.Reflection.getCallerClass is not supported. This will impact performance.
Dataproc submitter job id: 945d836f-20d8-4768-97fb-0889c00ed87bDataproc submitter job id: 945d836f-20d8-4768-97fb-0889c00ed87b

Setting env variables:
From <common_env> setting VERSION=latest
From <common_env> setting SPARK_SUBMIT_PATH=[TODO]/path/to/spark-submit
From <common_env> setting JOB_MODE=local[*]
From <common_env> setting HADOOP_DIR=[STREAMING-TODO]/path/to/folder/containing
From <common_env> setting CHRONON_ONLINE_CLASS=[ONLINE-TODO]your.online.class
From <common_env> setting CHRONON_ONLINE_ARGS=[ONLINE-TODO]args prefixed with -Z become constructor map for your implementation of ai.chronon.online.Api, -Zkv-host=<YOUR_HOST> -Zkv-port=<YOUR_PORT>
From <common_env> setting PARTITION_COLUMN=ds
From <common_env> setting PARTITION_FORMAT=yyyy-MM-dd
From <common_env> setting CUSTOMER_ID=canary
From <common_env> setting GCP_PROJECT_ID=canary-443022
From <common_env> setting GCP_REGION=us-central1
From <common_env> setting GCP_DATAPROC_CLUSTER_NAME=zipline-canary-cluster
From <common_env> setting GCP_BIGTABLE_INSTANCE_ID=zipline-canary-instance
From <cli_args> setting APP_NAME=chronon
From <cli_args> setting CHRONON_ONLINE_JAR=cloud_gcp-assembly-0.1.0-SNAPSHOT.jar
File production/group_bys/quickstart/purchases.v1_test uploaded to metadata/purchases.v1_test in bucket zipline-warehouse-canary.
Running command: java -cp /Users/davidhan/zipline/chronon/cloud_gcp_submitter/target/scala-2.12/cloud_gcp_submitter-assembly-0.1.0-SNAPSHOT.jar:/var/folders/2p/h5v8s0515xv20cgprdjngttr0000gn/T/tmpkirssr9l/spark-3.5.4-bin-hadoop3/jars/* ai.chronon.integrations.cloud_gcp.DataprocSubmitter group-by-backfill --conf-path=purchases.v1_test --end-date=2025-01-30  --conf-type=group_bys    --additional-conf-path=additional-confs.yaml --gcs_files=gs://zipline-warehouse-canary/metadata/purchases.v1_test,gs://zipline-artifacts-canary/confs/additional-confs.yaml --chronon_jar_uri=gs://zipline-artifacts-canary/jars/cloud_gcp-assembly-0.1.0-SNAPSHOT.jar
Setting env variables:
From <common_env> setting VERSION=latest
From <common_env> setting SPARK_SUBMIT_PATH=[TODO]/path/to/spark-submit
From <common_env> setting JOB_MODE=local[*]
From <common_env> setting HADOOP_DIR=[STREAMING-TODO]/path/to/folder/containing
From <common_env> setting CHRONON_ONLINE_CLASS=[ONLINE-TODO]your.online.class
From <common_env> setting CHRONON_ONLINE_ARGS=[ONLINE-TODO]args prefixed with -Z become constructor map for your implementation of ai.chronon.online.Api, -Zkv-host=<YOUR_HOST> -Zkv-port=<YOUR_PORT>
From <common_env> setting PARTITION_COLUMN=ds
From <common_env> setting PARTITION_FORMAT=yyyy-MM-dd
From <common_env> setting CUSTOMER_ID=canary
From <common_env> setting GCP_PROJECT_ID=canary-443022
From <common_env> setting GCP_REGION=us-central1
From <common_env> setting GCP_DATAPROC_CLUSTER_NAME=zipline-canary-cluster
From <common_env> setting GCP_BIGTABLE_INSTANCE_ID=zipline-canary-instance
From <cli_args> setting APP_NAME=chronon
From <cli_args> setting CHRONON_ONLINE_JAR=cloud_gcp-assembly-0.1.0-SNAPSHOT.jar
File production/group_bys/quickstart/purchases.v1_test uploaded to metadata/purchases.v1_test in bucket zipline-warehouse-canary.
Running command: java -cp /Users/davidhan/zipline/chronon/cloud_gcp_submitter/target/scala-2.12/cloud_gcp_submitter-assembly-0.1.0-SNAPSHOT.jar:/var/folders/2p/h5v8s0515xv20cgprdjngttr0000gn/T/tmpkirssr9l/spark-3.5.4-bin-hadoop3/jars/* ai.chronon.integrations.cloud_gcp.DataprocSubmitter group-by-backfill --conf-path=purchases.v1_test --end-date=2025-01-30  --conf-type=group_bys    --additional-conf-path=additional-confs.yaml --gcs_files=gs://zipline-warehouse-canary/metadata/purchases.v1_test,gs://zipline-artifacts-canary/confs/additional-confs.yaml --chronon_jar_uri=gs://zipline-artifacts-canary/jars/cloud_gcp-assembly-0.1.0-SNAPSHOT.jar
++ cat tmp_backfill.out
++ grep 'Dataproc submitter job id'
++ cut -d ' ' -f5
+ BACKFILL_JOB_ID=945d836f-20d8-4768-97fb-0889c00ed87b
+ check_dataproc_job_state 945d836f-20d8-4768-97fb-0889c00ed87b
+ JOB_ID=945d836f-20d8-4768-97fb-0889c00ed87b
+ '[' -z 945d836f-20d8-4768-97fb-0889c00ed87b ']'
+ gcloud dataproc jobs wait 945d836f-20d8-4768-97fb-0889c00ed87b --region=us-central1
Waiting for job output...
25/01/30 18:16:47 WARN SparkConf: The configuration key 'spark.yarn.executor.failuresValidityInterval' has been deprecated as of Spark 3.5 and may be removed in the future. Please use the new key 'spark.executor.failuresValidityInterval' instead.
25/01/30 18:16:47 WARN SparkConf: The configuration key 'spark.yarn.executor.failuresValidityInterval' has been deprecated as of Spark 3.5 and may be removed in the future. Please use the new key 'spark.executor.failuresValidityInterval' instead.
Using warehouse dir: /tmp/945d836f-20d8-4768-97fb-0889c00ed87b/local_warehouse
25/01/30 18:16:50 INFO HiveConf: Found configuration file file:/etc/hive/conf.dist/hive-site.xml
25/01/30 18:16:50 WARN SparkConf: The configuration key 'spark.yarn.executor.failuresValidityInterval' has been deprecated as of Spark 3.5 and may be removed in the future. Please use the new key 'spark.executor.failuresValidityInterval' instead.
25/01/30 18:16:50 INFO SparkEnv: Registering MapOutputTracker
25/01/30 18:16:50 INFO SparkEnv: Registering BlockManagerMaster
25/01/30 18:16:50 INFO SparkEnv: Registering BlockManagerMasterHeartbeat
25/01/30 18:16:50 INFO SparkEnv: Registering OutputCommitCoordinator
25/01/30 18:16:51 INFO DataprocSparkPlugin: Registered 188 driver metrics
25/01/30 18:16:51 INFO DefaultNoHARMFailoverProxyProvider: Connecting to ResourceManager at zipline-canary-cluster-m.us-central1-c.c.canary-443022.internal./10.128.0.17:8032
25/01/30 18:16:51 INFO AHSProxy: Connecting to Application History server at zipline-canary-cluster-m.us-central1-c.c.canary-443022.internal./10.128.0.17:10200
25/01/30 18:16:51 INFO Configuration: resource-types.xml not found
25/01/30 18:16:51 INFO ResourceUtils: Unable to find 'resource-types.xml'.
25/01/30 18:16:52 INFO YarnClientImpl: Submitted application application_1738197659103_0011
25/01/30 18:16:53 WARN SparkConf: The configuration key 'spark.yarn.executor.failuresValidityInterval' has been deprecated as of Spark 3.5 and may be removed in the future. Please use the new key 'spark.executor.failuresValidityInterval' instead.
25/01/30 18:16:53 INFO DefaultNoHARMFailoverProxyProvider: Connecting to ResourceManager at zipline-canary-cluster-m.us-central1-c.c.canary-443022.internal./10.128.0.17:8030
25/01/30 18:16:55 INFO GoogleCloudStorageImpl: Ignoring exception of type GoogleJsonResponseException; verified object already exists with desired state.
25/01/30 18:16:55 INFO GoogleHadoopOutputStream: hflush(): No-op due to rate limit (RateLimiter[stableRate=0.2qps]): readers will *not* yet see flushed data for gs://dataproc-temp-us-central1-703996152583-pqtvfptb/5d9e94ed-7649-4828-8b64-e3d58632a5d0/spark-job-history/application_1738197659103_0011.inprogress [CONTEXT ratelimit_period="1 MINUTES" ]
2025/01/30 18:16:55 INFO  SparkSessionBuilder.scala:76 - Chronon logging system initialized. Overrides spark's configuration
2025/01/30 18:16:58 ERROR TableUtils.scala:188 - Table canary-443022.data.quickstart_purchases_v1_test is not reachable. Returning empty partitions.
2025/01/30 18:17:15 INFO  TableUtils.scala:200 - Found 30, between (2023-11-01, 2023-11-30) partitions for table: data.purchases
2025/01/30 18:17:15 INFO  TableUtils.scala:622 - 
Unfilled range computation:
   Output table: canary-443022.data.quickstart_purchases_v1_test
   Missing output partitions: [2023-11-01,2023-11-02,2023-11-03,2023-11-04,2023-11-05,2023-11-06,2023-11-07,2023-11-08,2023-11-09,2023-11-10,2023-11-11,2023-11-12,2023-11-13,2023-11-14,2023-11-15,2023-11-16,2023-11-17,2023-11-18,2023-11-19,2023-11-20,2023-11-21,2023-11-22,2023-11-23,2023-11-24,2023-11-25,2023-11-26,2023-11-27,2023-11-28,2023-11-29,2023-11-30,2023-12-01,2023-12-02,2023-12-03,2023-12-04,2023-12-05,2023-12-06,2023-12-07,2023-12-08,2023-12-09,2023-12-10,2023-12-11,2023-12-12,2023-12-13,2023-12-14,2023-12-15,2023-12-16,2023-12-17,2023-12-18,2023-12-19,2023-12-20,2023-12-21,2023-12-22,2023-12-23,2023-12-24,2023-12-25,2023-12-26,2023-12-27,2023-12-28,2023-12-29,2023-12-30,2023-12-31,2024-01-01,2024-01-02,2024-01-03,2024-01-04,2024-01-05,2024-01-06,2024-01-07,2024-01-08,2024-01-09,2024-01-10,2024-01-11,2024-01-12,2024-01-13,2024-01-14,2024-01-15,2024-01-16,2024-01-17,2024-01-18,2024-01-19,2024-01-20,2024-01-21,2024-01-22,2024-01-23,2024-01-24,2024-01-25,2024-01-26,2024-01-27,2024-01-28,2024-01-29,2024-01-30,2024-01-31,2024-02-01,2024-02-02,2024-02-03,2024-02-04,2024-02-05,2024-02-06,2024-02-07,2024-02-08,2024-02-09,2024-02-10,2024-02-11,2024-02-12,2024-02-13,2024-02-14,2024-02-15,2024-02-16,2024-02-17,2024-02-18,2024-02-19,2024-02-20,2024-02-21,2024-02-22,2024-02-23,2024-02-24,2024-02-25,2024-02-26,2024-02-27,2024-02-28,2024-02-29,2024-03-01,2024-03-02,2024-03-03,2024-03-04,2024-03-05,2024-03-06,2024-03-07,2024-03-08,2024-03-09,2024-03-10,2024-03-11,2024-03-12,2024-03-13,2024-03-14,2024-03-15,2024-03-16,2024-03-17,2024-03-18,2024-03-19,2024-03-20,2024-03-21,2024-03-22,2024-03-23,2024-03-24,2024-03-25,2024-03-26,2024-03-27,2024-03-28,2024-03-29,2024-03-30,2024-03-31,2024-04-01,2024-04-02,2024-04-03,2024-04-04,2024-04-05,2024-04-06,2024-04-07,2024-04-08,2024-04-09,2024-04-10,2024-04-11,2024-04-12,2024-04-13,2024-04-14,2024-04-15,2024-04-16,2024-04-17,2024-04-18,2024-04-19,2024-04-20,2024-04-21,2024-04-22,2024-04-23,2024-04-24,2024-04-25,2024-04-26,2024-04-27,2024-04-28,2024-04-29,2024-04-30,2024-05-01,2024-05-02,2024-05-03,2024-05-04,2024-05-05,2024-05-06,2024-05-07,2024-05-08,2024-05-09,2024-05-10,2024-05-11,2024-05-12,2024-05-13,2024-05-14,2024-05-15,2024-05-16,2024-05-17,2024-05-18,2024-05-19,2024-05-20,2024-05-21,2024-05-22,2024-05-23,2024-05-24,2024-05-25,2024-05-26,2024-05-27,2024-05-28,2024-05-29,2024-05-30,2024-05-31,2024-06-01,2024-06-02,2024-06-03,2024-06-04,2024-06-05,2024-06-06,2024-06-07,2024-06-08,2024-06-09,2024-06-10,2024-06-11,2024-06-12,2024-06-13,2024-06-14,2024-06-15,2024-06-16,2024-06-17,2024-06-18,2024-06-19,2024-06-20,2024-06-21,2024-06-22,2024-06-23,2024-06-24,2024-06-25,2024-06-26,2024-06-27,2024-06-28,2024-06-29,2024-06-30,2024-07-01,2024-07-02,2024-07-03,2024-07-04,2024-07-05,2024-07-06,2024-07-07,2024-07-08,2024-07-09,2024-07-10,2024-07-11,2024-07-12,2024-07-13,2024-07-14,2024-07-15,2024-07-16,2024-07-17,2024-07-18,2024-07-19,2024-07-20,2024-07-21,2024-07-22,2024-07-23,2024-07-24,2024-07-25,2024-07-26,2024-07-27,2024-07-28,2024-07-29,2024-07-30,2024-07-31,2024-08-01,2024-08-02,2024-08-03,2024-08-04,2024-08-05,2024-08-06,2024-08-07,2024-08-08,2024-08-09,2024-08-10,2024-08-11,2024-08-12,2024-08-13,2024-08-14,2024-08-15,2024-08-16,2024-08-17,2024-08-18,2024-08-19,2024-08-20,2024-08-21,2024-08-22,2024-08-23,2024-08-24,2024-08-25,2024-08-26,2024-08-27,2024-08-28,2024-08-29,2024-08-30,2024-08-31,2024-09-01,2024-09-02,2024-09-03,2024-09-04,2024-09-05,2024-09-06,2024-09-07,2024-09-08,2024-09-09,2024-09-10,2024-09-11,2024-09-12,2024-09-13,2024-09-14,2024-09-15,2024
-09-16,2024-09-17,2024-09-18,2024-09-19,2024-09-20,2024-09-21,2024-09-22,2024-09-23,2024-09-24,2024-09-25,2024-09-26,2024-09-27,2024-09-28,2024-09-29,2024-09-30,2024-10-01,2024-10-02,2024-10-03,2024-10-04,2024-10-05,2024-10-06,2024-10-07,2024-10-08,2024-10-09,2024-10-10,2024-10-11,2024-10-12,2024-10-13,2024-10-14,2024-10-15,2024-10-16,2024-10-17,2024-10-18,2024-10-19,2024-10-20,2024-10-21,2024-10-22,2024-10-23,2024-10-24,2024-10-25,2024-10-26,2024-10-27,2024-10-28,2024-10-29,2024-10-30,2024-10-31,2024-11-01,2024-11-02,2024-11-03,2024-11-04,2024-11-05,2024-11-06,2024-11-07,2024-11-08,2024-11-09,2024-11-10,2024-11-11,2024-11-12,2024-11-13,2024-11-14,2024-11-15,2024-11-16,2024-11-17,2024-11-18,2024-11-19,2024-11-20,2024-11-21,2024-11-22,2024-11-23,2024-11-24,2024-11-25,2024-11-26,2024-11-27,2024-11-28,2024-11-29,2024-11-30,2024-12-01,2024-12-02,2024-12-03,2024-12-04,2024-12-05,2024-12-06,2024-12-07,2024-12-08,2024-12-09,2024-12-10,2024-12-11,2024-12-12,2024-12-13,2024-12-14,2024-12-15,2024-12-16,2024-12-17,2024-12-18,2024-12-19,2024-12-20,2024-12-21,2024-12-22,2024-12-23,2024-12-24,2024-12-25,2024-12-26,2024-12-27,2024-12-28,2024-12-29,2024-12-30,2024-12-31,2025-01-01,2025-01-02,2025-01-03,2025-01-04,2025-01-05,2025-01-06,2025-01-07,2025-01-08,2025-01-09,2025-01-10,2025-01-11,2025-01-12,2025-01-13,2025-01-14,2025-01-15,2025-01-16,2025-01-17,2025-01-18,2025-01-19,2025-01-20,2025-01-21,2025-01-22,2025-01-23,2025-01-24,2025-01-25,2025-01-26,2025-01-27,2025-01-28,2025-01-29,2025-01-30]
   Input tables: data.purchases
   Missing input partitions: [2023-12-01,2023-12-02,2023-12-03,2023-12-04,2023-12-05,2023-12-06,2023-12-07,2023-12-08,2023-12-09,2023-12-10,2023-12-11,2023-12-12,2023-12-13,2023-12-14,2023-12-15,2023-12-16,2023-12-17,2023-12-18,2023-12-19,2023-12-20,2023-12-21,2023-12-22,2023-12-23,2023-12-24,2023-12-25,2023-12-26,2023-12-27,2023-12-28,2023-12-29,2023-12-30,2023-12-31,2024-01-01,2024-01-02,2024-01-03,2024-01-04,2024-01-05,2024-01-06,2024-01-07,2024-01-08,2024-01-09,2024-01-10,2024-01-11,2024-01-12,2024-01-13,2024-01-14,2024-01-15,2024-01-16,2024-01-17,2024-01-18,2024-01-19,2024-01-20,2024-01-21,2024-01-22,2024-01-23,2024-01-24,2024-01-25,2024-01-26,2024-01-27,2024-01-28,2024-01-29,2024-01-30,2024-01-31,2024-02-01,2024-02-02,2024-02-03,2024-02-04,2024-02-05,2024-02-06,2024-02-07,2024-02-08,2024-02-09,2024-02-10,2024-02-11,2024-02-12,2024-02-13,2024-02-14,2024-02-15,2024-02-16,2024-02-17,2024-02-18,2024-02-19,2024-02-20,2024-02-21,2024-02-22,2024-02-23,2024-02-24,2024-02-25,2024-02-26,2024-02-27,2024-02-28,2024-02-29,2024-03-01,2024-03-02,2024-03-03,2024-03-04,2024-03-05,2024-03-06,2024-03-07,2024-03-08,2024-03-09,2024-03-10,2024-03-11,2024-03-12,2024-03-13,2024-03-14,2024-03-15,2024-03-16,2024-03-17,2024-03-18,2024-03-19,2024-03-20,2024-03-21,2024-03-22,2024-03-23,2024-03-24,2024-03-25,2024-03-26,2024-03-27,2024-03-28,2024-03-29,2024-03-30,2024-03-31,2024-04-01,2024-04-02,2024-04-03,2024-04-04,2024-04-05,2024-04-06,2024-04-07,2024-04-08,2024-04-09,2024-04-10,2024-04-11,2024-04-12,2024-04-13,2024-04-14,2024-04-15,2024-04-16,2024-04-17,2024-04-18,2024-04-19,2024-04-20,2024-04-21,2024-04-22,2024-04-23,2024-04-24,2024-04-25,2024-04-26,2024-04-27,2024-04-28,2024-04-29,2024-04-30,2024-05-01,2024-05-02,2024-05-03,2024-05-04,2024-05-05,2024-05-06,2024-05-07,2024-05-08,2024-05-09,2024-05-10,2024-05-11,2024-05-12,2024-05-13,2024-05-14,2024-05-15,2024-05-16,2024-05-17,2024-05-18,2024-05-19,2024-05-20,2024-05-21,2024-05-22,2024-05-23,2024-05-24,2024-05-25,2024-05-26,2024-05-27,2024-05-28,2024-05-29,2024-05-30,2024-05-31,2024-06-01,2024-06-02,2024-06-03,2024-06-04,2024-06-05,2024-06-06,2024-06-07,2024-06-08,2024-06-09,2024-06-10,2024-06-11,2024-06-12,2024-06-13,2024-06-14,2024-06-15,2024-06-16,2024-06-17,2024-06-18,2024-06-19,2024-06-20,2024-06-21,2024-06-22,2024-06-23,2024-06-24,2024-06-25,2024-06-26,2024-06-27,2024-06-28,2024-06-29,2024-06-30,2024-07-01,2024-07-02,2024-07-03,2024-07-04,2024-07-05,2024-07-06,2024-07-07,2024-07-08,2024-07-09,2024-07-10,2024-07-11,2024-07-12,2024-07-13,2024-07-14,2024-07-15,2024-07-16,2024-07-17,2024-07-18,2024-07-19,2024-07-20,2024-07-21,2024-07-22,2024-07-23,2024-07-24,2024-07-25,2024-07-26,2024-07-27,2024-07-28,2024-07-29,2024-07-30,2024-07-31,2024-08-01,2024-08-02,2024-08-03,2024-08-04,2024-08-05,2024-08-06,2024-08-07,2024-08-08,2024-08-09,2024-08-10,2024-08-11,2024-08-12,2024-08-13,2024-08-14,2024-08-15,2024-08-16,2024-08-17,2024-08-18,2024-08-19,2024-08-20,2024-08-21,2024-08-22,2024-08-23,2024-08-24,2024-08-25,2024-08-26,2024-08-27,2024-08-28,2024-08-29,2024-08-30,2024-08-31,2024-09-01,2024-09-02,2024-09-03,2024-09-04,2024-09-05,2024-09-06,2024-09-07,2024-09-08,2024-09-09,2024-09-10,2024-09-11,2024-09-12,2024-09-13,2024-09-14,2024-09-15,2024-09-16,2024-09-17,2024-09-18,2024-09-19,2024-09-20,2024-09-21,2024-09-22,2024-09-23,2024-09-24,2024-09-25,2024-09-26,2024-09-27,2024-09-28,2024-09-29,2024-09-30,2024-10-01,2024-10-02,2024-10-03,2024-10-04,2024-10-05,2024-10-06,2024-10-07,2024-10-08,2024-10-09,2024-10-10,2024-10-11,2024-10-12,2024-10-13,2024-10-14,2024-10-15,2024-
10-16,2024-10-17,2024-10-18,2024-10-19,2024-10-20,2024-10-21,2024-10-22,2024-10-23,2024-10-24,2024-10-25,2024-10-26,2024-10-27,2024-10-28,2024-10-29,2024-10-30,2024-10-31,2024-11-01,2024-11-02,2024-11-03,2024-11-04,2024-11-05,2024-11-06,2024-11-07,2024-11-08,2024-11-09,2024-11-10,2024-11-11,2024-11-12,2024-11-13,2024-11-14,2024-11-15,2024-11-16,2024-11-17,2024-11-18,2024-11-19,2024-11-20,2024-11-21,2024-11-22,2024-11-23,2024-11-24,2024-11-25,2024-11-26,2024-11-27,2024-11-28,2024-11-29,2024-11-30,2024-12-01,2024-12-02,2024-12-03,2024-12-04,2024-12-05,2024-12-06,2024-12-07,2024-12-08,2024-12-09,2024-12-10,2024-12-11,2024-12-12,2024-12-13,2024-12-14,2024-12-15,2024-12-16,2024-12-17,2024-12-18,2024-12-19,2024-12-20,2024-12-21,2024-12-22,2024-12-23,2024-12-24,2024-12-25,2024-12-26,2024-12-27,2024-12-28,2024-12-29,2024-12-30,2024-12-31,2025-01-01,2025-01-02,2025-01-03,2025-01-04,2025-01-05,2025-01-06,2025-01-07,2025-01-08,2025-01-09,2025-01-10,2025-01-11,2025-01-12,2025-01-13,2025-01-14,2025-01-15,2025-01-16,2025-01-17,2025-01-18,2025-01-19,2025-01-20,2025-01-21,2025-01-22,2025-01-23,2025-01-24,2025-01-25,2025-01-26,2025-01-27,2025-01-28,2025-01-29,2025-01-30]
   Unfilled Partitions: [2023-11-01,2023-11-02,2023-11-03,2023-11-04,2023-11-05,2023-11-06,2023-11-07,2023-11-08,2023-11-09,2023-11-10,2023-11-11,2023-11-12,2023-11-13,2023-11-14,2023-11-15,2023-11-16,2023-11-17,2023-11-18,2023-11-19,2023-11-20,2023-11-21,2023-11-22,2023-11-23,2023-11-24,2023-11-25,2023-11-26,2023-11-27,2023-11-28,2023-11-29,2023-11-30]
   Unfilled ranges: [2023-11-01...2023-11-30]

2025/01/30 18:17:15 INFO  GroupBy.scala:733 - group by unfilled ranges: List([2023-11-01...2023-11-30])
2025/01/30 18:17:15 INFO  GroupBy.scala:738 - Group By ranges to compute: 
    [2023-11-01...2023-11-30]

2025/01/30 18:17:15 INFO  GroupBy.scala:743 - Computing group by for range: [2023-11-01...2023-11-30] [1/1]
2025/01/30 18:17:15 INFO  GroupBy.scala:492 - 
----[Processing GroupBy: quickstart.purchases.v1_test]----
2025/01/30 18:17:20 INFO  TableUtils.scala:200 - Found 30, between (2023-11-01, 2023-11-30) partitions for table: data.purchases
2025/01/30 18:17:20 INFO  GroupBy.scala:618 - 
Computing intersected range as:
   query range: [2023-11-01...2023-11-30]
   query window: None
   source table: data.purchases
   source data range: [2023-11-01...2023-11-30]
   source start/end: null/null
   source data model: Events
   queryable data range: [null...2023-11-30]
   intersected range: [2023-11-01...2023-11-30]

2025/01/30 18:17:20 INFO  GroupBy.scala:658 - 
Time Mapping: Some((ts,ts))

2025/01/30 18:17:20 INFO  GroupBy.scala:668 - 
Rendering source query:
   intersected/effective scan range: Some([2023-11-01...2023-11-30])
   partitionConditions: List(ds >= '2023-11-01', ds <= '2023-11-30')
   metaColumns: Map(ds -> null, ts -> ts)

2025/01/30 18:17:20 INFO  TableUtils.scala:759 -  Scanning data:
  table: data.purchases
  options: Map()
  format: Some(bigquery)
  selects:
    `ds`
    `ts`
    `user_id`
    `purchase_price`
  wheres:
    
  partition filters:
    ds >= '2023-11-01',
    ds <= '2023-11-30'

2025/01/30 18:17:20 INFO  HopsAggregator.scala:147 - Left bounds: 1d->unbounded 
minQueryTs = 2023-11-01 00:00:00
2025/01/30 18:17:20 INFO  FastHashing.scala:52 - Generating key builder over keys:
  bigint : user_id

2025/01/30 18:17:22 INFO  TableUtils.scala:459 - Repartitioning before writing...
2025/01/30 18:17:25 INFO  TableUtils.scala:494 - 2416 rows requested to be written into table canary-443022.data.quickstart_purchases_v1_test
2025/01/30 18:17:25 INFO  TableUtils.scala:531 - repartitioning data for table canary-443022.data.quickstart_purchases_v1_test by 300 spark tasks into 30 table partitions and 10 files per partition
2025/01/30 18:17:25 INFO  TableUtils.scala:536 - Sorting within partitions with cols: List(ds)
2025/01/30 18:17:33 INFO  TableUtils.scala:469 - Finished writing to canary-443022.data.quickstart_purchases_v1_test
2025/01/30 18:17:33 INFO  TableUtils.scala:440 - Cleared the dataframe cache after repartition & write to canary-443022.data.quickstart_purchases_v1_test - start @ 2025-01-30 18:17:22 end @ 2025-01-30 18:17:33
2025/01/30 18:17:33 INFO  GroupBy.scala:757 - Wrote to table canary-443022.data.quickstart_purchases_v1_test, into partitions: [2023-11-01...2023-11-30]
2025/01/30 18:17:33 INFO  GroupBy.scala:759 - Wrote to table canary-443022.data.quickstart_purchases_v1_test for range: [2023-11-01...2023-11-30]
Job [945d836f-20d8-4768-97fb-0889c00ed87b] finished successfully.
done: true
driverControlFilesUri: gs://dataproc-staging-us-central1-703996152583-lxespibx/google-cloud-dataproc-metainfo/5d9e94ed-7649-4828-8b64-e3d58632a5d0/jobs/945d836f-20d8-4768-97fb-0889c00ed87b/
driverOutputResourceUri: gs://dataproc-staging-us-central1-703996152583-lxespibx/google-cloud-dataproc-metainfo/5d9e94ed-7649-4828-8b64-e3d58632a5d0/jobs/945d836f-20d8-4768-97fb-0889c00ed87b/driveroutput
jobUuid: 945d836f-20d8-4768-97fb-0889c00ed87b
placement:
  clusterName: zipline-canary-cluster
  clusterUuid: 5d9e94ed-7649-4828-8b64-e3d58632a5d0
reference:
  jobId: 945d836f-20d8-4768-97fb-0889c00ed87b
  projectId: canary-443022
sparkJob:
  args:
  - group-by-backfill
  - --conf-path=purchases.v1_test
  - --end-date=2025-01-30
  - --conf-type=group_bys
  - --additional-conf-path=additional-confs.yaml
  - --is-gcp
  - --gcp-project-id=canary-443022
  - --gcp-bigtable-instance-id=zipline-canary-instance
  fileUris:
  - gs://zipline-warehouse-canary/metadata/purchases.v1_test
  - gs://zipline-artifacts-canary/confs/additional-confs.yaml
  jarFileUris:
  - gs://zipline-artifacts-canary/jars/cloud_gcp-assembly-0.1.0-SNAPSHOT.jar
  mainClass: ai.chronon.spark.Driver
status:
  state: DONE
  stateStartTime: '2025-01-30T18:17:38.722934Z'
statusHistory:
- state: PENDING
  stateStartTime: '2025-01-30T18:16:43.326557Z'
- state: SETUP_DONE
  stateStartTime: '2025-01-30T18:16:43.353624Z'
- details: Agent reported job success
  state: RUNNING
  stateStartTime: '2025-01-30T18:16:43.597231Z'
yarnApplications:
- name: groupBy_quickstart.purchases.v1_test_backfill
  progress: 1.0
  state: FINISHED
  trackingUrl: http://zipline-canary-cluster-m.us-central1-c.c.canary-443022.internal.:8088/proxy/application_1738197659103_0011/
+ echo -e '\033[0;32m <<<<<<<<<<<<<<<<-----------------JOB STATUS----------------->>>>>>>>>>>>>>>>>\033[0m'
 <<<<<<<<<<<<<<<<-----------------JOB STATUS----------------->>>>>>>>>>>>>>>>>
++ gcloud dataproc jobs describe 945d836f-20d8-4768-97fb-0889c00ed87b --region=us-central1 --format=flattened
++ grep status.state:
+ JOB_STATE='status.state:                    DONE'
+ echo status.state: DONE
status.state: DONE
+ '[' -z 'status.state:                    DONE' ']'
+ echo -e '\033[0;32m<<<<<.....................................GROUP-BY-UPLOAD.....................................>>>>>\033[0m'
<<<<<.....................................GROUP-BY-UPLOAD.....................................>>>>>
+ touch tmp_gbu.out
+ zipline run --mode upload --conf production/group_bys/quickstart/purchases.v1_test --ds 2023-12-01 --dataproc
+ tee /dev/tty tmp_gbu.out
Running with args: {'mode': 'upload', 'conf': 'production/group_bys/quickstart/purchases.v1_test', 'ds': '2023-12-01', 'dataproc': True, 'env': 'dev', 'app_name': None, 'start_ds': None, 'end_ds': None, 'parallelism': None, 'repo': '.', 'online_jar': 'cloud_gcp-assembly-0.1.0-SNAPSHOT.jar', 'online_class': 'ai.chronon.integrations.cloud_gcp.GcpApiImpl', 'version': None, 'spark_version': '2.4.0', 'spark_submit_path': None, 'spark_streaming_submit_path': None, 'online_jar_fetch': None, 'sub_help': False, 'conf_type': None, 'online_args': None, 'chronon_jar': None, 'release_tag': None, 'list_apps': None, 'render_info': None}
Running with args: {'mode': 'upload', 'conf': 'production/group_bys/quickstart/purchases.v1_test', 'ds': '2023-12-01', 'dataproc': True, 'env': 'dev', 'app_name': None, 'start_ds': None, 'end_ds': None, 'parallelism': None, 'repo': '.', 'online_jar': 'cloud_gcp-assembly-0.1.0-SNAPSHOT.jar', 'online_class': 'ai.chronon.integrations.cloud_gcp.GcpApiImpl', 'version': None, 'spark_version': '2.4.0', 'spark_submit_path': None, 'spark_streaming_submit_path': None, 'online_jar_fetch': None, 'sub_help': False, 'conf_type': None, 'online_args': None, 'chronon_jar': None, 'release_tag': None, 'list_apps': None, 'render_info': None}
Array(group-by-upload, --conf-path=purchases.v1_test, --end-date=2023-12-01, --conf-type=group_bys, --additional-conf-path=additional-confs.yaml, --is-gcp, --gcp-project-id=canary-443022, --gcp-bigtable-instance-id=zipline-canary-instance)Array(group-by-upload, --conf-path=purchases.v1_test, --end-date=2023-12-01, --conf-type=group_bys, --additional-conf-path=additional-confs.yaml, --is-gcp, --gcp-project-id=canary-443022, --gcp-bigtable-instance-id=zipline-canary-instance)

WARNING: sun.reflect.Reflection.getCallerClass is not supported. This will impact performance.
WARNING: sun.reflect.Reflection.getCallerClass is not supported. This will impact performance.
Dataproc submitter job id: c672008e-7380-4a82-a121-4bb0cb46503f
Dataproc submitter job id: c672008e-7380-4a82-a121-4bb0cb46503f
Setting env variables:
From <default_env> setting EXECUTOR_CORES=1
From <default_env> setting EXECUTOR_MEMORY=8G
From <default_env> setting PARALLELISM=1000
From <default_env> setting MAX_EXECUTORS=1000
From <common_env> setting VERSION=latest
From <common_env> setting SPARK_SUBMIT_PATH=[TODO]/path/to/spark-submit
From <common_env> setting JOB_MODE=local[*]
From <common_env> setting HADOOP_DIR=[STREAMING-TODO]/path/to/folder/containing
From <common_env> setting CHRONON_ONLINE_CLASS=[ONLINE-TODO]your.online.class
From <common_env> setting CHRONON_ONLINE_ARGS=[ONLINE-TODO]args prefixed with -Z become constructor map for your implementation of ai.chronon.online.Api, -Zkv-host=<YOUR_HOST> -Zkv-port=<YOUR_PORT>
From <common_env> setting PARTITION_COLUMN=ds
From <common_env> setting PARTITION_FORMAT=yyyy-MM-dd
From <common_env> setting CUSTOMER_ID=canary
From <common_env> setting GCP_PROJECT_ID=canary-443022
From <common_env> setting GCP_REGION=us-central1
From <common_env> setting GCP_DATAPROC_CLUSTER_NAME=zipline-canary-cluster
From <common_env> setting GCP_BIGTABLE_INSTANCE_ID=zipline-canary-instance
From <cli_args> setting APP_NAME=chronon_group_bys_upload_dev_quickstart.purchases.v1_test
From <cli_args> setting CHRONON_CONF_PATH=./production/group_bys/quickstart/purchases.v1_test
From <cli_args> setting CHRONON_ONLINE_JAR=cloud_gcp-assembly-0.1.0-SNAPSHOT.jar
File production/group_bys/quickstart/purchases.v1_test uploaded to metadata/purchases.v1_test in bucket zipline-warehouse-canary.
Running command: java -cp /Users/davidhan/zipline/chronon/cloud_gcp_submitter/target/scala-2.12/cloud_gcp_submitter-assembly-0.1.0-SNAPSHOT.jar:/var/folders/2p/h5v8s0515xv20cgprdjngttr0000gn/T/tmpkirssr9l/spark-3.5.4-bin-hadoop3/jars/* ai.chronon.integrations.cloud_gcp.DataprocSubmitter group-by-upload --conf-path=purchases.v1_test --end-date=2023-12-01  --conf-type=group_bys    --additional-conf-path=additional-confs.yaml --gcs_files=gs://zipline-warehouse-canary/metadata/purchases.v1_test,gs://zipline-artifacts-canary/confs/additional-confs.yaml --chronon_jar_uri=gs://zipline-artifacts-canary/jars/cloud_gcp-assembly-0.1.0-SNAPSHOT.jar
Setting env variables:
From <default_env> setting EXECUTOR_CORES=1
From <default_env> setting EXECUTOR_MEMORY=8G
From <default_env> setting PARALLELISM=1000
From <default_env> setting MAX_EXECUTORS=1000
From <common_env> setting VERSION=latest
From <common_env> setting SPARK_SUBMIT_PATH=[TODO]/path/to/spark-submit
From <common_env> setting JOB_MODE=local[*]
From <common_env> setting HADOOP_DIR=[STREAMING-TODO]/path/to/folder/containing
From <common_env> setting CHRONON_ONLINE_CLASS=[ONLINE-TODO]your.online.class
From <common_env> setting CHRONON_ONLINE_ARGS=[ONLINE-TODO]args prefixed with -Z become constructor map for your implementation of ai.chronon.online.Api, -Zkv-host=<YOUR_HOST> -Zkv-port=<YOUR_PORT>
From <common_env> setting PARTITION_COLUMN=ds
From <common_env> setting PARTITION_FORMAT=yyyy-MM-dd
From <common_env> setting CUSTOMER_ID=canary
From <common_env> setting GCP_PROJECT_ID=canary-443022
From <common_env> setting GCP_REGION=us-central1
From <common_env> setting GCP_DATAPROC_CLUSTER_NAME=zipline-canary-cluster
From <common_env> setting GCP_BIGTABLE_INSTANCE_ID=zipline-canary-instance
From <cli_args> setting APP_NAME=chronon_group_bys_upload_dev_quickstart.purchases.v1_test
From <cli_args> setting CHRONON_CONF_PATH=./production/group_bys/quickstart/purchases.v1_test
From <cli_args> setting CHRONON_ONLINE_JAR=cloud_gcp-assembly-0.1.0-SNAPSHOT.jar
File production/group_bys/quickstart/purchases.v1_test uploaded to metadata/purchases.v1_test in bucket zipline-warehouse-canary.
Running command: java -cp /Users/davidhan/zipline/chronon/cloud_gcp_submitter/target/scala-2.12/cloud_gcp_submitter-assembly-0.1.0-SNAPSHOT.jar:/var/folders/2p/h5v8s0515xv20cgprdjngttr0000gn/T/tmpkirssr9l/spark-3.5.4-bin-hadoop3/jars/* ai.chronon.integrations.cloud_gcp.DataprocSubmitter group-by-upload --conf-path=purchases.v1_test --end-date=2023-12-01  --conf-type=group_bys    --additional-conf-path=additional-confs.yaml --gcs_files=gs://zipline-warehouse-canary/metadata/purchases.v1_test,gs://zipline-artifacts-canary/confs/additional-confs.yaml --chronon_jar_uri=gs://zipline-artifacts-canary/jars/cloud_gcp-assembly-0.1.0-SNAPSHOT.jar
++ cat tmp_gbu.out
++ grep 'Dataproc submitter job id'
++ cut -d ' ' -f5
+ GBU_JOB_ID=c672008e-7380-4a82-a121-4bb0cb46503f
+ check_dataproc_job_state c672008e-7380-4a82-a121-4bb0cb46503f
+ JOB_ID=c672008e-7380-4a82-a121-4bb0cb46503f
+ '[' -z c672008e-7380-4a82-a121-4bb0cb46503f ']'
+ gcloud dataproc jobs wait c672008e-7380-4a82-a121-4bb0cb46503f --region=us-central1
Waiting for job output...
25/01/30 18:17:48 WARN SparkConf: The configuration key 'spark.yarn.executor.failuresValidityInterval' has been deprecated as of Spark 3.5 and may be removed in the future. Please use the new key 'spark.executor.failuresValidityInterval' instead.
25/01/30 18:17:48 WARN SparkConf: The configuration key 'spark.yarn.executor.failuresValidityInterval' has been deprecated as of Spark 3.5 and may be removed in the future. Please use the new key 'spark.executor.failuresValidityInterval' instead.
Using warehouse dir: /tmp/c672008e-7380-4a82-a121-4bb0cb46503f/local_warehouse
25/01/30 18:17:50 INFO HiveConf: Found configuration file file:/etc/hive/conf.dist/hive-site.xml
25/01/30 18:17:50 WARN SparkConf: The configuration key 'spark.yarn.executor.failuresValidityInterval' has been deprecated as of Spark 3.5 and may be removed in the future. Please use the new key 'spark.executor.failuresValidityInterval' instead.
25/01/30 18:17:51 INFO SparkEnv: Registering MapOutputTracker
25/01/30 18:17:51 INFO SparkEnv: Registering BlockManagerMaster
25/01/30 18:17:51 INFO SparkEnv: Registering BlockManagerMasterHeartbeat
25/01/30 18:17:51 INFO SparkEnv: Registering OutputCommitCoordinator
25/01/30 18:17:51 INFO DataprocSparkPlugin: Registered 188 driver metrics
25/01/30 18:17:51 INFO DefaultNoHARMFailoverProxyProvider: Connecting to ResourceManager at zipline-canary-cluster-m.us-central1-c.c.canary-443022.internal./10.128.0.17:8032
25/01/30 18:17:52 INFO AHSProxy: Connecting to Application History server at zipline-canary-cluster-m.us-central1-c.c.canary-443022.internal./10.128.0.17:10200
25/01/30 18:17:52 INFO Configuration: resource-types.xml not found
25/01/30 18:17:52 INFO ResourceUtils: Unable to find 'resource-types.xml'.
25/01/30 18:17:53 INFO YarnClientImpl: Submitted application application_1738197659103_0012
25/01/30 18:17:54 WARN SparkConf: The configuration key 'spark.yarn.executor.failuresValidityInterval' has been deprecated as of Spark 3.5 and may be removed in the future. Please use the new key 'spark.executor.failuresValidityInterval' instead.
25/01/30 18:17:54 INFO DefaultNoHARMFailoverProxyProvider: Connecting to ResourceManager at zipline-canary-cluster-m.us-central1-c.c.canary-443022.internal./10.128.0.17:8030
25/01/30 18:17:55 INFO GoogleCloudStorageImpl: Ignoring exception of type GoogleJsonResponseException; verified object already exists with desired state.
25/01/30 18:17:56 INFO GoogleHadoopOutputStream: hflush(): No-op due to rate limit (RateLimiter[stableRate=0.2qps]): readers will *not* yet see flushed data for gs://dataproc-temp-us-central1-703996152583-pqtvfptb/5d9e94ed-7649-4828-8b64-e3d58632a5d0/spark-job-history/application_1738197659103_0012.inprogress [CONTEXT ratelimit_period="1 MINUTES" ]
2025/01/30 18:17:56 INFO  SparkSessionBuilder.scala:76 - Chronon logging system initialized. Overrides spark's configuration
2025/01/30 18:17:57 INFO  GroupByUpload.scala:229 - 
GroupBy upload for: quickstart.quickstart.purchases.v1_test
Accuracy: SNAPSHOT
Data Model: Events

2025/01/30 18:17:57 INFO  GroupBy.scala:492 - 
----[Processing GroupBy: quickstart.purchases.v1_test]----
2025/01/30 18:18:14 INFO  TableUtils.scala:200 - Found 30, between (2023-11-01, 2023-11-30) partitions for table: data.purchases
2025/01/30 18:18:14 INFO  GroupBy.scala:618 - 
Computing intersected range as:
   query range: [2023-12-01...2023-12-01]
   query window: None
   source table: data.purchases
   source data range: [2023-11-01...2023-12-01]
   source start/end: null/null
   source data model: Events
   queryable data range: [null...2023-12-01]
   intersected range: [2023-11-01...2023-12-01]

2025/01/30 18:18:14 INFO  GroupBy.scala:658 - 
Time Mapping: Some((ts,ts))

2025/01/30 18:18:14 INFO  GroupBy.scala:668 - 
Rendering source query:
   intersected/effective scan range: Some([2023-11-01...2023-12-01])
   partitionConditions: List(ds >= '2023-11-01', ds <= '2023-12-01')
   metaColumns: Map(ds -> null, ts -> ts)

2025/01/30 18:18:14 INFO  TableUtils.scala:759 -  Scanning data:
  table: data.purchases
  options: Map()
  format: Some(bigquery)
  selects:
    `ds`
    `ts`
    `user_id`
    `purchase_price`
  wheres:
    
  partition filters:
    ds >= '2023-11-01',
    ds <= '2023-12-01'

2025/01/30 18:18:14 INFO  HopsAggregator.scala:147 - Left bounds: 1d->unbounded 
minQueryTs = 2023-12-01 00:00:00
2025/01/30 18:18:14 INFO  FastHashing.scala:52 - Generating key builder over keys:
  bigint : user_id

2025/01/30 18:18:15 INFO  KvRdd.scala:102 - 
key schema:
  {
  "type" : "record",
  "name" : "Key",
  "namespace" : "ai.chronon.data",
  "doc" : "",
  "fields" : [ {
    "name" : "user_id",
    "type" : [ "null", "long" ],
    "doc" : ""
  } ]
}
value schema:
  {
  "type" : "record",
  "name" : "Value",
  "namespace" : "ai.chronon.data",
  "doc" : "",
  "fields" : [ {
    "name" : "purchase_price_sum_3d",
    "type" : [ "null", "long" ],
    "doc" : ""
  }, {
    "name" : "purchase_price_sum_14d",
    "type" : [ "null", "long" ],
    "doc" : ""
  }, {
    "name" : "purchase_price_sum_30d",
    "type" : [ "null", "long" ],
    "doc" : ""
  }, {
    "name" : "purchase_price_count_3d",
    "type" : [ "null", "long" ],
    "doc" : ""
  }, {
    "name" : "purchase_price_count_14d",
    "type" : [ "null", "long" ],
    "doc" : ""
  }, {
    "name" : "purchase_price_count_30d",
    "type" : [ "null", "long" ],
    "doc" : ""
  }, {
    "name" : "purchase_price_average_3d",
    "type" : [ "null", "double" ],
    "doc" : ""
  }, {
    "name" : "purchase_price_average_14d",
    "type" : [ "null", "double" ],
    "doc" : ""
  }, {
    "name" : "purchase_price_average_30d",
    "type" : [ "null", "double" ],
    "doc" : ""
  }, {
    "name" : "purchase_price_last10",
    "type" : [ "null", {
      "type" : "array",
      "items" : "long"
    } ],
    "doc" : ""
  } ]
}

2025/01/30 18:18:15 INFO  GroupBy.scala:492 - 
----[Processing GroupBy: quickstart.purchases.v1_test]----
2025/01/30 18:18:19 INFO  TableUtils.scala:200 - Found 30, between (2023-11-01, 2023-11-30) partitions for table: data.purchases
2025/01/30 18:18:19 INFO  GroupBy.scala:618 - 
Computing intersected range as:
   query range: [2023-12-01...2023-12-01]
   query window: None
   source table: data.purchases
   source data range: [2023-11-01...2023-12-01]
   source start/end: null/null
   source data model: Events
   queryable data range: [null...2023-12-01]
   intersected range: [2023-11-01...2023-12-01]

2025/01/30 18:18:19 INFO  GroupBy.scala:658 - 
Time Mapping: Some((ts,ts))

2025/01/30 18:18:19 INFO  GroupBy.scala:668 - 
Rendering source query:
   intersected/effective scan range: Some([2023-11-01...2023-12-01])
   partitionConditions: List(ds >= '2023-11-01', ds <= '2023-12-01')
   metaColumns: Map(ds -> null, ts -> ts)

2025/01/30 18:18:20 INFO  TableUtils.scala:759 -  Scanning data:
  table: data.purchases
  options: Map()
  format: Some(bigquery)
  selects:
    `ds`
    `ts`
    `user_id`
    `purchase_price`
  wheres:
    
  partition filters:
    ds >= '2023-11-01',
    ds <= '2023-12-01'

2025/01/30 18:18:20 INFO  GroupByUpload.scala:175 - Not setting InputAvroSchema to GroupByServingInfo as there is no streaming source defined.
2025/01/30 18:18:20 INFO  GroupByUpload.scala:188 - 
Built GroupByServingInfo for quickstart.purchases.v1_test:
table: data.purchases / data-model: Events
     keySchema: Success(struct<user_id:bigint>)
   valueSchema: Success(struct<purchase_price:bigint>)
mutationSchema: Failure(java.lang.NullPointerException)
   inputSchema: Failure(java.lang.NullPointerException)
selectedSchema: Success(struct<purchase_price:bigint>)
  streamSchema: Failure(java.lang.NullPointerException)

2025/01/30 18:18:20 INFO  TableUtils.scala:459 - Repartitioning before writing...
2025/01/30 18:18:24 INFO  TableUtils.scala:494 - 102 rows requested to be written into table canary-443022.data.quickstart_purchases_v1_test_upload
2025/01/30 18:18:24 INFO  TableUtils.scala:531 - repartitioning data for table canary-443022.data.quickstart_purchases_v1_test_upload by 200 spark tasks into 1 table partitions and 10 files per partition
2025/01/30 18:18:24 INFO  TableUtils.scala:536 - Sorting within partitions with cols: List(ds)
2025/01/30 18:18:30 INFO  TableUtils.scala:469 - Finished writing to canary-443022.data.quickstart_purchases_v1_test_upload
2025/01/30 18:18:30 INFO  TableUtils.scala:440 - Cleared the dataframe cache after repartition & write to canary-443022.data.quickstart_purchases_v1_test_upload - start @ 2025-01-30 18:18:20 end @ 2025-01-30 18:18:30
Job [c672008e-7380-4a82-a121-4bb0cb46503f] finished successfully.
done: true
driverControlFilesUri: gs://dataproc-staging-us-central1-703996152583-lxespibx/google-cloud-dataproc-metainfo/5d9e94ed-7649-4828-8b64-e3d58632a5d0/jobs/c672008e-7380-4a82-a121-4bb0cb46503f/
driverOutputResourceUri: gs://dataproc-staging-us-central1-703996152583-lxespibx/google-cloud-dataproc-metainfo/5d9e94ed-7649-4828-8b64-e3d58632a5d0/jobs/c672008e-7380-4a82-a121-4bb0cb46503f/driveroutput
jobUuid: c672008e-7380-4a82-a121-4bb0cb46503f
placement:
  clusterName: zipline-canary-cluster
  clusterUuid: 5d9e94ed-7649-4828-8b64-e3d58632a5d0
reference:
  jobId: c672008e-7380-4a82-a121-4bb0cb46503f
  projectId: canary-443022
sparkJob:
  args:
  - group-by-upload
  - --conf-path=purchases.v1_test
  - --end-date=2023-12-01
  - --conf-type=group_bys
  - --additional-conf-path=additional-confs.yaml
  - --is-gcp
  - --gcp-project-id=canary-443022
  - --gcp-bigtable-instance-id=zipline-canary-instance
  fileUris:
  - gs://zipline-warehouse-canary/metadata/purchases.v1_test
  - gs://zipline-artifacts-canary/confs/additional-confs.yaml
  jarFileUris:
  - gs://zipline-artifacts-canary/jars/cloud_gcp-assembly-0.1.0-SNAPSHOT.jar
  mainClass: ai.chronon.spark.Driver
status:
  state: DONE
  stateStartTime: '2025-01-30T18:18:33.742458Z'
statusHistory:
- state: PENDING
  stateStartTime: '2025-01-30T18:17:44.197477Z'
- state: SETUP_DONE
  stateStartTime: '2025-01-30T18:17:44.223246Z'
- details: Agent reported job success
  state: RUNNING
  stateStartTime: '2025-01-30T18:17:44.438240Z'
yarnApplications:
- name: group-by-upload
  progress: 1.0
  state: FINISHED
  trackingUrl: http://zipline-canary-cluster-m.us-central1-c.c.canary-443022.internal.:8088/proxy/application_1738197659103_0012/
+ echo -e '\033[0;32m <<<<<<<<<<<<<<<<-----------------JOB STATUS----------------->>>>>>>>>>>>>>>>>\033[0m'
 <<<<<<<<<<<<<<<<-----------------JOB STATUS----------------->>>>>>>>>>>>>>>>>
++ gcloud dataproc jobs describe c672008e-7380-4a82-a121-4bb0cb46503f --region=us-central1 --format=flattened
++ grep status.state:
+ JOB_STATE='status.state:                    DONE'
+ echo status.state: DONE
status.state: DONE
+ '[' -z 'status.state:                    DONE' ']'
+ echo -e '\033[0;32m<<<<<.....................................UPLOAD-TO-KV.....................................>>>>>\033[0m'
<<<<<.....................................UPLOAD-TO-KV.....................................>>>>>
+ touch tmp_upload_to_kv.out
+ zipline run --mode upload-to-kv --conf production/group_bys/quickstart/purchases.v1_test --partition-string=2023-12-01 --dataproc
+ tee /dev/tty tmp_upload_to_kv.out
Running with args: {'mode': 'upload-to-kv', 'conf': 'production/group_bys/quickstart/purchases.v1_test', 'dataproc': True, 'env': 'dev', 'ds': None, 'app_name': None, 'start_ds': None, 'end_ds': None, 'parallelism': None, 'repo': '.', 'online_jar': 'cloud_gcp-assembly-0.1.0-SNAPSHOT.jar', 'online_class': 'ai.chronon.integrations.cloud_gcp.GcpApiImpl', 'version': None, 'spark_version': '2.4.0', 'spark_submit_path': None, 'spark_streaming_submit_path': None, 'online_jar_fetch': None, 'sub_help': False, 'conf_type': None, 'online_args': None, 'chronon_jar': None, 'release_tag': None, 'list_apps': None, 'render_info': None}
Running with args: {'mode': 'upload-to-kv', 'conf': 'production/group_bys/quickstart/purchases.v1_test', 'dataproc': True, 'env': 'dev', 'ds': None, 'app_name': None, 'start_ds': None, 'end_ds': None, 'parallelism': None, 'repo': '.', 'online_jar': 'cloud_gcp-assembly-0.1.0-SNAPSHOT.jar', 'online_class': 'ai.chronon.integrations.cloud_gcp.GcpApiImpl', 'version': None, 'spark_version': '2.4.0', 'spark_submit_path': None, 'spark_streaming_submit_path': None, 'online_jar_fetch': None, 'sub_help': False, 'conf_type': None, 'online_args': None, 'chronon_jar': None, 'release_tag': None, 'list_apps': None, 'render_info': None}
Array(groupby-upload-bulk-load, --conf-path=purchases.v1_test, --online-jar=cloud_gcp-assembly-0.1.0-SNAPSHOT.jar, --online-class=ai.chronon.integrations.cloud_gcp.GcpApiImpl, --conf-type=group_bys, --partition-string=2023-12-01, --additional-conf-path=additional-confs.yaml, --is-gcp, --gcp-project-id=canary-443022, --gcp-bigtable-instance-id=zipline-canary-instance)Array(groupby-upload-bulk-load, --conf-path=purchases.v1_test, --online-jar=cloud_gcp-assembly-0.1.0-SNAPSHOT.jar, --online-class=ai.chronon.integrations.cloud_gcp.GcpApiImpl, --conf-type=group_bys, --partition-string=2023-12-01, --additional-conf-path=additional-confs.yaml, --is-gcp, --gcp-project-id=canary-443022, --gcp-bigtable-instance-id=zipline-canary-instance)

WARNING: sun.reflect.Reflection.getCallerClass is not supported. This will impact performance.
WARNING: sun.reflect.Reflection.getCallerClass is not supported. This will impact performance.
Dataproc submitter job id: c29097e9-b845-4ad7-843a-c89b622c5cfe
Dataproc submitter job id: c29097e9-b845-4ad7-843a-c89b622c5cfe
Setting env variables:
From <common_env> setting VERSION=latest
From <common_env> setting SPARK_SUBMIT_PATH=[TODO]/path/to/spark-submit
From <common_env> setting JOB_MODE=local[*]
From <common_env> setting HADOOP_DIR=[STREAMING-TODO]/path/to/folder/containing
From <common_env> setting CHRONON_ONLINE_CLASS=[ONLINE-TODO]your.online.class
From <common_env> setting CHRONON_ONLINE_ARGS=[ONLINE-TODO]args prefixed with -Z become constructor map for your implementation of ai.chronon.online.Api, -Zkv-host=<YOUR_HOST> -Zkv-port=<YOUR_PORT>
From <common_env> setting PARTITION_COLUMN=ds
From <common_env> setting PARTITION_FORMAT=yyyy-MM-dd
From <common_env> setting CUSTOMER_ID=canary
From <common_env> setting GCP_PROJECT_ID=canary-443022
From <common_env> setting GCP_REGION=us-central1
From <common_env> setting GCP_DATAPROC_CLUSTER_NAME=zipline-canary-cluster
From <common_env> setting GCP_BIGTABLE_INSTANCE_ID=zipline-canary-instance
From <cli_args> setting APP_NAME=chronon_group_bys_upload-to-kv_dev_quickstart.purchases.v1_test
From <cli_args> setting CHRONON_CONF_PATH=./production/group_bys/quickstart/purchases.v1_test
From <cli_args> setting CHRONON_ONLINE_JAR=cloud_gcp-assembly-0.1.0-SNAPSHOT.jar
File production/group_bys/quickstart/purchases.v1_test uploaded to metadata/purchases.v1_test in bucket zipline-warehouse-canary.
Running command: java -cp /Users/davidhan/zipline/chronon/cloud_gcp_submitter/target/scala-2.12/cloud_gcp_submitter-assembly-0.1.0-SNAPSHOT.jar:/var/folders/2p/h5v8s0515xv20cgprdjngttr0000gn/T/tmpkirssr9l/spark-3.5.4-bin-hadoop3/jars/* ai.chronon.integrations.cloud_gcp.DataprocSubmitter groupby-upload-bulk-load --conf-path=purchases.v1_test --online-jar=cloud_gcp-assembly-0.1.0-SNAPSHOT.jar --online-class=ai.chronon.integrations.cloud_gcp.GcpApiImpl  --conf-type=group_bys  --partition-string=2023-12-01  --additional-conf-path=additional-confs.yaml --gcs_files=gs://zipline-warehouse-canary/metadata/purchases.v1_test,gs://zipline-artifacts-canary/confs/additional-confs.yaml --chronon_jar_uri=gs://zipline-artifacts-canary/jars/cloud_gcp-assembly-0.1.0-SNAPSHOT.jar
Setting env variables:
From <common_env> setting VERSION=latest
From <common_env> setting SPARK_SUBMIT_PATH=[TODO]/path/to/spark-submit
From <common_env> setting JOB_MODE=local[*]
From <common_env> setting HADOOP_DIR=[STREAMING-TODO]/path/to/folder/containing
From <common_env> setting CHRONON_ONLINE_CLASS=[ONLINE-TODO]your.online.class
From <common_env> setting CHRONON_ONLINE_ARGS=[ONLINE-TODO]args prefixed with -Z become constructor map for your implementation of ai.chronon.online.Api, -Zkv-host=<YOUR_HOST> -Zkv-port=<YOUR_PORT>
From <common_env> setting PARTITION_COLUMN=ds
From <common_env> setting PARTITION_FORMAT=yyyy-MM-dd
From <common_env> setting CUSTOMER_ID=canary
From <common_env> setting GCP_PROJECT_ID=canary-443022
From <common_env> setting GCP_REGION=us-central1
From <common_env> setting GCP_DATAPROC_CLUSTER_NAME=zipline-canary-cluster
From <common_env> setting GCP_BIGTABLE_INSTANCE_ID=zipline-canary-instance
From <cli_args> setting APP_NAME=chronon_group_bys_upload-to-kv_dev_quickstart.purchases.v1_test
From <cli_args> setting CHRONON_CONF_PATH=./production/group_bys/quickstart/purchases.v1_test
From <cli_args> setting CHRONON_ONLINE_JAR=cloud_gcp-assembly-0.1.0-SNAPSHOT.jar
File production/group_bys/quickstart/purchases.v1_test uploaded to metadata/purchases.v1_test in bucket zipline-warehouse-canary.
Running command: java -cp /Users/davidhan/zipline/chronon/cloud_gcp_submitter/target/scala-2.12/cloud_gcp_submitter-assembly-0.1.0-SNAPSHOT.jar:/var/folders/2p/h5v8s0515xv20cgprdjngttr0000gn/T/tmpkirssr9l/spark-3.5.4-bin-hadoop3/jars/* ai.chronon.integrations.cloud_gcp.DataprocSubmitter groupby-upload-bulk-load --conf-path=purchases.v1_test --online-jar=cloud_gcp-assembly-0.1.0-SNAPSHOT.jar --online-class=ai.chronon.integrations.cloud_gcp.GcpApiImpl  --conf-type=group_bys  --partition-string=2023-12-01  --additional-conf-path=additional-confs.yaml --gcs_files=gs://zipline-warehouse-canary/metadata/purchases.v1_test,gs://zipline-artifacts-canary/confs/additional-confs.yaml --chronon_jar_uri=gs://zipline-artifacts-canary/jars/cloud_gcp-assembly-0.1.0-SNAPSHOT.jar
++ cat tmp_upload_to_kv.out
++ grep 'Dataproc submitter job id'
++ cut -d ' ' -f5
+ UPLOAD_TO_KV_JOB_ID=c29097e9-b845-4ad7-843a-c89b622c5cfe
+ check_dataproc_job_state c29097e9-b845-4ad7-843a-c89b622c5cfe
+ JOB_ID=c29097e9-b845-4ad7-843a-c89b622c5cfe
+ '[' -z c29097e9-b845-4ad7-843a-c89b622c5cfe ']'
+ gcloud dataproc jobs wait c29097e9-b845-4ad7-843a-c89b622c5cfe --region=us-central1
Waiting for job output...
25/01/30 18:18:42 WARN SparkConf: The configuration key 'spark.yarn.executor.failuresValidityInterval' has been deprecated as of Spark 3.5 and may be removed in the future. Please use the new key 'spark.executor.failuresValidityInterval' instead.
25/01/30 18:18:42 WARN SparkConf: The configuration key 'spark.yarn.executor.failuresValidityInterval' has been deprecated as of Spark 3.5 and may be removed in the future. Please use the new key 'spark.executor.failuresValidityInterval' instead.
25/01/30 18:18:45 INFO Driver$GroupByUploadToKVBulkLoad$: Triggering bulk load for GroupBy: quickstart.purchases.v1_test for partition: 2023-12-01 from table: canary-443022.data.quickstart_purchases_v1_test_upload
25/01/30 18:18:47 INFO BigTableKVStoreImpl: Kicking off bulkLoad with query:

EXPORT DATA OPTIONS (
  format='CLOUD_BIGTABLE',
  overwrite=true,
  uri="https://bigtable.googleapis.com/projects/canary-443022/instances/zipline-canary-instance/appProfiles/GROUPBY_INGEST/tables/GROUPBY_BATCH",
  bigtable_options='''{
   "columnFamilies" : [
      {
        "familyId": "cf",
        "encoding": "BINARY",
        "columns": [
           {"qualifierString": "value", "fieldName": ""}
        ]
      }
   ]
}'''
) AS
SELECT
  CONCAT(CAST(CONCAT('QUICKSTART_PURCHASES_V1_TEST_BATCH', '#') AS BYTES), key_bytes) as rowkey,
  value_bytes as cf,
  TIMESTAMP_MILLIS(1701475200000) as _CHANGE_TIMESTAMP
FROM canary-443022.data.quickstart_purchases_v1_test_upload
WHERE ds = '2023-12-01'

25/01/30 18:18:48 INFO BigTableKVStoreImpl: Export job started with Id: JobId{project=canary-443022, job=export_canary_443022_data_quickstart_purchases_v1_test_upload_to_bigtable_2023-12-01_1738261127353, location=null} and link: https://bigquery.googleapis.com/bigquery/v2/projects/canary-443022/jobs/export_canary_443022_data_quickstart_purchases_v1_test_upload_to_bigtable_2023-12-01_1738261127353?location=us-central1
25/01/30 18:18:48 INFO BigTableKVStoreImpl: …
## Summary
Fixed all the test failures after the Bazel migration. The remaining known failures are 3 Spark tests that deal with resource loading of test data, so we plan to temporarily disable them for now; we also have a path to fix them after killing sbt.

## Checklist
- [ ] Added Unit Tests
- [x] Covered by existing CI
- [ ] Integration tested
- [ ] Documentation update



<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->
## Summary by CodeRabbit

Based on the comprehensive summary of changes, here are the concise
release notes:

## Release Notes

- **New Features**
  - Added support for Bazelisk installation in Docker environment
  - Enhanced Scala and Java dependency management
  - Expanded Protobuf and gRPC support
  - Introduced new testing utilities and frameworks

- **Dependency Updates**
  - Updated Flink, Spark, and Kafka-related dependencies
  - Added new Maven artifacts for Avro, Thrift, and testing libraries
  - Upgraded various library versions

- **Testing Improvements**
  - Introduced `scala_junit_suite` for more flexible test execution
  - Added new test resources and configurations
  - Enhanced test coverage and dependency management

- **Build System**
  - Updated Bazel build configurations
  - Improved dependency resolution and repository management
  - Added new build rules and scripts

- **Code Quality**
  - Refactored package imports and type conversions
  - Improved code formatting and readability
  - Streamlined dependency handling across modules
<!-- end of auto-generated comment: release notes by coderabbit.ai -->

---------

Co-authored-by: nikhil-zlai <[email protected]>
## Summary
Some updates to get our Flink jobs running on the Etsy side:
* Configure schema registry via host/port/scheme instead of URL 
* Explicitly set the task slots per task manager
* Configure checkpoint directory based on teams.json

## Checklist
- [ ] Added Unit Tests
- [ ] Covered by existing CI
- [X] Integration tested - Kicked off the job on the Etsy cluster and
confirmed it's up and running
- [ ] Documentation update



<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->

## Summary by CodeRabbit

- **New Features**
- Introduced an enhanced configuration option for Flink job submissions
by adding a state URI parameter for improved job state management.
- Expanded schema registry configuration, enabling greater flexibility
with host, port, and scheme settings.

- **Chores**
- Adjusted logging levels and refined error messaging to support better
troubleshooting.

- **Documentation**
- Updated configuration guidance to aid in setting up schema registry
integration.

<!-- end of auto-generated comment: release notes by coderabbit.ai -->
## Summary
- This migrates over our artifact upload + run.py to leverage
bazel-built jars. Only for the batch side for now, streaming will
follow.

## Checklist
- [ ] Added Unit Tests
- [ ] Covered by existing CI
- [ ] Integration tested
- [ ] Documentation update
<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->
## Summary by CodeRabbit

- **Chores**
	- Updated default JAR file names for online and Dataproc submissions.
- Migrated build process from `sbt` to `bazel` for GCP artifact
generation.
	- Added new `submitter` binary target for Dataproc submission.
- Added dependency for Scala-specific features of the Jackson library in
the online library.
<!-- end of auto-generated comment: release notes by coderabbit.ai -->

<!-- av pr metadata
This information is embedded by the av CLI when creating PRs to track
the status of stacks when using Aviator. Please do not delete or edit
this section of the PR.
```
{"parent":"main","parentHead":"","trunk":"main"}
```
-->

---------

Co-authored-by: Thomas Chow <[email protected]>
## Summary

## Checklist
- [ ] Added Unit Tests
- [ ] Covered by existing CI
- [ ] Integration tested
- [ ] Documentation update


<!-- av pr metadata
This information is embedded by the av CLI when creating PRs to track
the status of stacks when using Aviator. Please do not delete or edit
this section of the PR.
```
{"parent":"main","parentHead":"","trunk":"main"}
```
-->


<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->

## Summary by CodeRabbit

- **Chores**
- Streamlined the build process with more consistent naming conventions
for cloud deployment artifacts.
- **New Features**
- Enhanced support for macOS environments by introducing
platform-specific handling during the build, ensuring improved
compatibility.

<!-- end of auto-generated comment: release notes by coderabbit.ai -->

Co-authored-by: Thomas Chow <[email protected]>
## Summary
Adds the missing dependencies for the Flink module coming from our recent changes, to keep it in sync with sbt.

Tested locally by running Flink jobs from DataprocSubmitterTest

## Checklist
- [ ] Added Unit Tests
- [x] Covered by existing CI
- [ ] Integration tested
- [ ] Documentation update



<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->

## Summary by CodeRabbit

- **New Features**
- Expanded Kafka integration with enhanced authentication, client
functionality, and Protobuf serialization support.
  - Improved JSON processing support for Scala-based operations.
- Adjusted dependency versions to ensure better compatibility and
stability with Kafka and cloud services.

<!-- end of auto-generated comment: release notes by coderabbit.ai -->
## Summary

A newer version with better ergonomics and a more maintainable codebase: smaller files, simpler logic, little indirection, and user-facing simplification.

- [x] always compile the whole repo
- [x] teams thrift and teams python
- [x] compile context as its own object
- [ ] integration test on sample
- [ ] progress bar for compile
- [ ] sync to remote w/ progress bar
- [ ] display workflow
- [ ] display workflow progress

## Checklist
- [x] Added Unit Tests
- [ ] Covered by existing CI
- [x] Integration tested
- [ ] Documentation update



<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->
## Summary by CodeRabbit

- **New CLI Commands**
- Introduced streamlined commands for synchronizing, backfilling, and
deploying operations for easier management.

- **Enhanced Logging**
- Improved colored, structured log outputs for clearer real-time
monitoring and debugging.

- **Configuration & Validation Upgrades**
- Strengthened configuration management and validation processes to
ensure reliable operations.
- Added a comprehensive validation framework for Chronon API thrift
objects.

- **Build & Infrastructure Improvements**
- Transitioned to a new container base and modernized testing/build
systems for better performance and stability.

- **Team & Utility Enhancements**
- Expanded team configuration options and refined utility processes to
streamline overall workflows.
- Introduced new data classes and methods for improved configuration and
compilation context management.
- Enhanced templating functionality for dynamic replacements based on
object properties.
- Improved handling of Git operations and enhanced error logging for
better traceability.
<!-- end of auto-generated comment: release notes by coderabbit.ai -->

---------

Co-authored-by: Ken Morton <[email protected]>
Co-authored-by: Kumar Teja Chippala <[email protected]>
Co-authored-by: Kumar Teja Chippala <[email protected]>
## Summary
Update the Flink job code on the tiling path to use the TileKey. I
haven't wired up the KV store side of things yet (the write and read
sides of the KV store can be done collaboratively with Thomas, as they
need to go together to keep the tests happy).
The tiling version of the Flink job isn't in use, so these changes should
be safe to land and keep things incremental.

## Checklist
- [ ] Added Unit Tests
- [X] Covered by existing CI
- [ ] Integration tested
- [ ] Documentation update



<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->
## Summary by CodeRabbit

- **New Features**
- Added a utility to determine the start timestamp for a defined time
window.

- **Refactor/Enhancements**
- Streamlined time window handling by providing a default one-day
resolution when none is specified.
- Improved tiled data processing with consistent tiling window sizing
and enriched metadata management.

- **Tests**
- Updated integration tests to validate the new tile processing and time
window behavior.
<!-- end of auto-generated comment: release notes by coderabbit.ai -->
## Summary
Updated dev notes with instructions for Bazel setup and some useful
commands. Also updated Bazel target names so the intermediate/uber jar
names don't conflict.

## Checklist
- [ ] Added Unit Tests
- [x] Covered by existing CI
- [ ] Integration tested
- [ ] Documentation update



<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->
## Summary by CodeRabbit

- **Refactor**
- Streamlined naming conventions and updated dependency references
across multiple project modules for improved consistency.
- **Documentation**
- Expanded build documentation with a new, comprehensive Bazel Setup
section detailing configuration, caching, testing, and deployment
instructions.
- **Build System Enhancements**
- Introduced updated source-generation rules to support future
multi-language integration and more robust build workflows.
- **Workflow Updates**
- Modified test target names in CI workflows to reflect updated naming
conventions, enhancing clarity and consistency in test execution.
<!-- end of auto-generated comment: release notes by coderabbit.ai -->
… fix (#322)

## Summary
Updated dev notes with Bazel installation instructions and a Java error
fix.

## Checklist
- [ ] Added Unit Tests
- [ ] Covered by existing CI
- [ ] Integration tested
- [ ] Documentation update



<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->
## Summary by CodeRabbit

- **Documentation**
- Enhanced setup guidance for Bazel installation on both Mac and Linux.
- Provided clear instructions for resolving Java-related issues on Mac.
- Updated testing procedures by replacing previous instructions with
streamlined Bazel commands.
<!-- end of auto-generated comment: release notes by coderabbit.ai -->
## Summary
Increased timeout and java heap size for spark tests to avoid flaky test
failures

## Checklist
- [ ] Added Unit Tests
- [x] Covered by existing CI
- [ ] Integration tested
- [ ] Documentation update



<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->
## Summary by CodeRabbit

- **Tests**
- Extended test timeout settings to 900 seconds for enhanced testing
robustness.
- Updated job names and workflow references for better clarity and
consistency in testing workflows.
<!-- end of auto-generated comment: release notes by coderabbit.ai -->
## Summary

Add jvm_binary targets for service and hub modules to build final
assembly jars for deployment

## Checklist
- [ ] Added Unit Tests
- [ ] Covered by existing CI
- [ ] Integration tested
- [ ] Documentation update
## Summary

## Checklist
- [ ] Added Unit Tests
- [ ] Covered by existing CI
- [ ] Integration tested
- [ ] Documentation update



<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->

## Summary by CodeRabbit

- **Chores**
- Updated the dependency supporting Apache Spark functionality to boost
backend data processing efficiency.

<!-- end of auto-generated comment: release notes by coderabbit.ai -->
## Summary

Remove the flink-streaming-scala dependency, as we no longer need it;
otherwise we run into a runtime error saying the flink-shaded-guava
package is not found.

## Checklist
- [ ] Added Unit Tests
- [x] Covered by existing CI
- [ ] Integration tested
- [ ] Documentation update



<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->

## Summary by CodeRabbit

- **Bug Fixes**
- Improved dependency handling to reduce the risk of runtime errors such
as class incompatibilities.

<!-- end of auto-generated comment: release notes by coderabbit.ai -->
## Summary
Solves sync failures

## Checklist
- [ ] Added Unit Tests
- [ ] Covered by existing CI
- [ ] Integration tested
- [ ] Documentation update



<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->

## Summary by CodeRabbit

- **Chores**
- Streamlined the Java build configuration by removing legacy
integration and testing support.

<!-- end of auto-generated comment: release notes by coderabbit.ai -->
## Summary
Hit some errors because our Spark deps pull in rocksdbjni 8.3.2, whereas
Flink expects an older version (6.20.3-ververica-2.0). Since we rely on
user-class-first loading, the newer version takes priority, and when
Flink is closing tiles we hit an error -
```
2025-02-05 21:14:53,614 INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph       [] - Tiling for etsy.listing_canary.actions_v1 -> (Tiling Side Output Late Data for etsy.listing_canary.actions_v1, Avro conversion for etsy.listing_canary.actions_v1 -> async kvstore writes for etsy.listing_canary.actions_v1 -> Sink: Metrics Sink for etsy.listing_canary.actions_v1) (2/3) (a107444db4dad3eb79d9d02631d8696e_5627cd3c4e8c9c02fa4f114c4b3607f4_1_56) switched from RUNNING to FAILED on container_1738197659103_0039_01_000004 @ zipline-canary-cluster-w-1.us-central1-c.c.canary-443022.internal (dataPort=33465).
java.lang.NoSuchMethodError: 'void org.rocksdb.WriteBatch.remove(org.rocksdb.ColumnFamilyHandle, byte[])'
	at org.apache.flink.contrib.streaming.state.RocksDBWriteBatchWrapper.remove(RocksDBWriteBatchWrapper.java:105)
```

Yanked it out of the two jars and confirmed that the Flink job runs
fine and crosses over across hours (and hence tile closures).
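
For illustration only, an sbt-style exclusion would look like the sketch below; the actual fix was made at the assembly-jar level, and the coordinates here are examples rather than the repo's real dependency list.

```scala
// build.sbt sketch (illustrative coordinates): exclude the newer rocksdbjni pulled in by
// the Spark side so Flink's expected 6.20.3-ververica-2.0 version wins at runtime.
libraryDependencies += ("org.apache.spark" %% "spark-sql" % "3.5.4")
  .exclude("org.rocksdb", "rocksdbjni")
```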

## Checklist
- [ ] Added Unit Tests
- [ ] Covered by existing CI
- [X] Integration tested
- [ ] Documentation update



<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->

## Summary by CodeRabbit

- Chores
- Made internal adjustments to dependency management to improve
compatibility between libraries and enhance overall application
stability.

<!-- end of auto-generated comment: release notes by coderabbit.ai -->
Reverts #330

It's breaking our build as we are using `java_test_suite` for
service_commons module

<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->

## Summary by CodeRabbit

- **New Features**
	- Enhanced configuration to support improved Java testing capabilities.
- Expanded build system functionality with additional integrations for
JVM-based test suites.

<!-- end of auto-generated comment: release notes by coderabbit.ai -->
## Summary

Some of the benefits using LayerChart over echarts/uplot

- Broad support of chart types (cartesian, polar, hierarchy, graph,
force, geo)
- Simplicity in setup and customization (composable chart components)
- Responsive charts, both for viewport/container, and also theming
(light/dark, etc)
- Flexibility in design/styling (CSS variables, classes, color scales,
etc.), including transitions
- Ability to opt into canvas or svg rendering context as the use case
requires. LayerChart's canvas support also has CSS variable/styling
support as well (which is unique as far as I'm aware). Html layers are
also available, which are great for multiline text (with truncation).

## Checklist
- [ ] Added Unit Tests
- [ ] Covered by existing CI
- [ ] Integration tested
- [ ] Documentation update
<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->
## Summary by CodeRabbit

- **New Features**
- Introduced new interactive charts, including `FeaturesLineChart` and
`PercentileLineChart`, enhancing data visualization with detailed
tooltips.
- **Refactor**
- Replaced legacy ECharts components with LayerChart components,
streamlining chart interactions and state management.
- **Chores**
- Updated dependency configurations and Tailwind CSS settings for
improved styling and performance.
- Removed unused ECharts-related components and functions to simplify
the codebase.
<!-- end of auto-generated comment: release notes by coderabbit.ai -->
---
- To see the specific tasks where the Asana app for GitHub is being
used, see below:
  - https://app.asana.com/0/0/1209163154826936

<!-- av pr metadata
This information is embedded by the av CLI when creating PRs to track
the status of stacks when using Aviator. Please do not delete or edit
this section of the PR.
```
{"parent":"main","parentHead":"","trunk":"main"}
```
-->

---------

Co-authored-by: Sean Lynch <[email protected]>
## Summary

Allows connecting to `app` docker container on `localhost:5005` via
IntelliJ. See
[tutorial](https://www.jetbrains.com/help/idea/tutorial-remote-debug.html)
for more details, but the summary is

- Recreate the docker containers (`docker-init/build.sh --all`)
- This will pass the
`-agentlib:jdwp=transport=dt_socket,server=y,suspend=n,address=*:5005`
CLI arguments and expose `5005` on the host (i.e. your laptop)
- You will see `Listening for transport dt_socket at address: 5005` in
the container logs


![image](https://github.com/user-attachments/assets/b18c3965-43d2-4df9-8c61-302a19eecdbe)

- In IntelliJ, add a new `Remote JVM Debug` config with default settings


![image](https://github.com/user-attachments/assets/13f102c0-97b6-4e3d-a957-92854c994083)

- Set a breakpoint and then trigger it (from the frontend or calling
endpoints directly)


![image](https://github.com/user-attachments/assets/658bc0b5-9f41-4cc3-89a2-b74a62fa43fc)

This has been useful to understand the
[java.lang.NullPointerException](https://app.asana.com/0/home/1208932362205799/1209321714844239)
error when fetching summaries for some columns/features (ex.
`dim_merchant_account_type`, `dim_merchant_country`)


![image](https://github.com/user-attachments/assets/3def9c86-98d7-42b4-a7e3-74b89a84d81c)

## Checklist
- [ ] Added Unit Tests
- [ ] Covered by existing CI
- [ ] Integration tested
- [ ] Documentation update


<!-- av pr metadata
This information is embedded by the av CLI when creating PRs to track
the status of stacks when using Aviator. Please do not delete or edit
this section of the PR.
```
{"parent":"main","parentHead":"","trunk":"main"}
```
-->

Co-authored-by: Sean Lynch <[email protected]>
## Summary

Added a new workflow to verify the Bazel config setup. It essentially
validates all our Bazel config by pulling the necessary dependencies
for all targets without actually building them, similar to the
`Sync project` option in IntelliJ. This should help us catch errors in
our Bazel config setup in CI.

## Checklist
- [ ] Added Unit Tests
- [ ] Covered by existing CI
- [ ] Integration tested
- [ ] Documentation update



<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->
## Summary by CodeRabbit

- Chores
- Added a new automated CI workflow to run configuration tests with
improved concurrency management.
- Expanded CI triggers to include additional modules for broader testing
coverage.
- Tests
- Temporarily disabled a dynamic class loading test in the cloud
integrations to improve overall test stability.
<!-- end of auto-generated comment: release notes by coderabbit.ai -->
## Summary

The codebase previously operated under the assumption that partition
listing is a cheap operation. It is not the case for GCS Format -
partition listing is expensive on GCS external tables.

## Checklist
- [ ] Added Unit Tests
- [x] Covered by existing CI
- [ ] Integration tested
- [ ] Documentation update



<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->
## Summary by CodeRabbit

- **Bug Fixes**
- Improved error handling to provide clearer messaging when data
retrieval issues occur.

- **Refactor**
- Streamlined internal processing for data partitions, resulting in more
consistent behavior.
  - Enhanced logging during data scanning for easier troubleshooting.
- Simplified logic for handling intersected ranges, ensuring consistent
definitions.
- Reduced the volume of test data for improved performance and resource
utilization during tests.
<!-- end of auto-generated comment: release notes by coderabbit.ai -->
## Summary
The default column reader batch size is 4096, which reads that many rows
into the memory buffer at once. That causes OOMs on large columns; for
Catalyst we only need to read one row at a time, and for interactive use
we set the limit to 16.

Tested on Etsy data.
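
As a rough sketch of the kind of session-level tuning described above (the config key is Spark's standard `spark.sql.parquet.columnarReaderBatchSize`; the app name and the way the 1-vs-16 choice is wired are illustrative, not Chronon's actual code):

```scala
import org.apache.spark.sql.SparkSession

// Minimal sketch of tuning the Parquet vectorized-reader batch size described above.
// "spark.sql.parquet.columnarReaderBatchSize" is the standard Spark config (default 4096);
// the 1-vs-16 choice here just mirrors the description.
object ParquetBatchSizeSketch {
  def build(interactive: Boolean): SparkSession =
    SparkSession
      .builder()
      .appName("parquet-batch-size-sketch")
      .master("local[*]")
      .config("spark.sql.parquet.columnarReaderBatchSize", (if (interactive) 16 else 1).toString)
      .getOrCreate()
}
```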
 
## Checklist
- [ ] Added Unit Tests
- [x] Covered by existing CI
- [ ] Integration tested
- [ ] Documentation update



<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->
## Summary by CodeRabbit

- **New Features**
- Enhanced data processing performance by adding an optimized
configuration for reading Parquet files in Spark sessions.
<!-- end of auto-generated comment: release notes by coderabbit.ai -->
## Summary
regarding https://app.asana.com/0/1208277377735902/1209321714844239

Nulls are no longer put into the histogram array; instead we use
`Constants.magicNullLong`. We will have to filter this out in the FE and
treat it as an empty value in the charts.
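
A minimal sketch of the substitution, assuming a stand-in sentinel value; the real constant is Chronon's `Constants.magicNullLong`:

```scala
// Hypothetical sketch: nulls in the raw values are replaced with a magic sentinel before
// the histogram is built, and the frontend filters the sentinel out and shows it as empty.
// The sentinel value below is a stand-in for Chronon's Constants.magicNullLong.
object HistogramNullsSketch {
  val magicNullLong: java.lang.Long = java.lang.Long.MIN_VALUE

  def toHistogramValues(raw: Seq[java.lang.Long]): Seq[java.lang.Long] =
    raw.map(v => if (v == null) magicNullLong else v)

  // e.g. toHistogramValues(Seq[java.lang.Long](3L, null, 7L)) == Seq[java.lang.Long](3L, magicNullLong, 7L)
}
```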

## Checklist
- [ ] Added Unit Tests
- [ ] Covered by existing CI
- [ ] Integration tested
- [ ] Documentation update



<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->
## Summary by CodeRabbit

- **New Features**
- Introduced an additional summary retrieval for merchant account types,
providing enhanced insights.
  
- **Bug Fixes**
- Improved data processing reliability by substituting missing values
with default placeholders, ensuring more consistent results.
- Enhanced handling of null values in histogram data for more accurate
representation.
<!-- end of auto-generated comment: release notes by coderabbit.ai -->
## Summary

## Checklist
- [ ] Added Unit Tests
- [ ] Covered by existing CI
- [ ] Integration tested
- [ ] Documentation update
<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->
## Summary by CodeRabbit

- **Documentation**
- Added a new section for connecting remotely to a Java process in
Docker using IntelliJ's remote debugging feature.
- Clarified commit message formatting instructions for better
consistency.
<!-- end of auto-generated comment: release notes by coderabbit.ai -->

<!-- av pr metadata
This information is embedded by the av CLI when creating PRs to track
the status of stacks when using Aviator. Please do not delete or edit
this section of the PR.
```
{"parent":"main","parentHead":"","trunk":"main"}
```
-->

---------

Co-authored-by: Sean Lynch <[email protected]>
## Summary
We need to make sure no nulls get passed through the transpose method.
This should wrap up this ticket:
https://app.asana.com/0/1208277377735902/1209343355165466

## Checklist
- [ ] Added Unit Tests
- [ ] Covered by existing CI
- [ ] Integration tested
- [ ] Documentation update



<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->

## Summary by CodeRabbit

- **Bug Fixes**
- Enhanced error handling ensures that invalid or missing values are
consistently processed, leading to more reliable data outcomes.
- **Tests**
- Added test cases to validate the updated handling of null and
non-numeric values, ensuring robustness in data processing.

<!-- end of auto-generated comment: release notes by coderabbit.ai -->
## Summary

## Checklist
- [ ] Added Unit Tests
- [ ] Covered by existing CI
- [ ] Integration tested
- [ ] Documentation update
<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->
## Summary by CodeRabbit


- **New Features**
- Introduced an enhanced drift metric toggle for selecting and exploring
drift data.
- Updated chart visualizations now accurately display drift column data.
- Join page headers have been refreshed to highlight drift series
details.

- **Bug Fixes**
- Resolved issues with data fetching and display related to drift
metrics.

- **Refactor**
- Consolidated data endpoints to focus on drift metrics, replacing older
timeseries methods.
- Improved sorting and data processing, resulting in a more consistent
and reliable display of data distributions.

<!-- end of auto-generated comment: release notes by coderabbit.ai -->

<!-- av pr metadata
This information is embedded by the av CLI when creating PRs to track
the status of stacks when using Aviator. Please do not delete or edit
this section of the PR.
```
{"parent":"main","parentHead":"","trunk":"main"}
```
-->

---------

Co-authored-by: Sean Lynch <[email protected]>
## Summary

## Checklist
- [ ] Added Unit Tests
- [ ] Covered by existing CI
- [ ] Integration tested
- [ ] Documentation update



<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->

## Summary by CodeRabbit

- New Features  
- Introduced the “Inter” font for enhanced typography, now available in
both regular and italic styles. The default sans-serif stack was updated
to incorporate Inter for a more modern visual experience.

- Documentation  
- Provided comprehensive documentation that includes installation
instructions and licensing details for the new font, ensuring clarity on
usage rights and guidelines.

<!-- end of auto-generated comment: release notes by coderabbit.ai -->
varant-zlai and others added 7 commits April 1, 2025 22:17
…ant cols from left (#577)

## Summary

spark optimizations

## Checklist
- [ ] Added Unit Tests
- [ ] Covered by existing CI
- [ ] Integration tested
- [ ] Documentation update



<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->
## Summary by CodeRabbit

- **New Features**
- Introduced a new configuration option that lets users control time
range validation during join backfill operations, providing more
consistent data processing.
- Added a method for converting partition ranges to time ranges,
enhancing flexibility in time range handling.
- Added a new test case for data generation and validation during join
operations.

- **Refactor**
- Optimized the way data merging queries are constructed and how time
ranges are applied during join operations, ensuring enhanced precision
and performance across event data processing.
- Simplified method signatures by removing unnecessary parameters,
streamlining the process of generating Bloom filters.
<!-- end of auto-generated comment: release notes by coderabbit.ai -->

---------

Co-authored-by: ezvz <[email protected]>
## Summary

Adding step days of 1 to source job

## Checklist
- [ ] Added Unit Tests
- [ ] Covered by existing CI
- [ ] Integration tested
- [ ] Documentation update



<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->
## Summary by CodeRabbit

- **New Features**
- Data processing is now handled in daily segments, providing more
precise and timely results.
- **Bug Fixes**
- Error messages have been refined to clearly indicate the specific day
when a query yields no results, improving clarity during
troubleshooting.
<!-- end of auto-generated comment: release notes by coderabbit.ai -->

---------

Co-authored-by: ezvz <[email protected]>
## Summary
Using tableUtils.partitionColumn rather than hardcoded ds

## Checklist
- [ ] Added Unit Tests
- [ ] Covered by existing CI
- [ ] Integration tested
- [ ] Documentation update

Co-authored-by: ezvz <[email protected]>
## Summary

Implements the simple label join logic to create a materialized table by
joining the forward-looking partitions of the labelJoinParts' snapshot
tables back onto the join output.
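
Conceptually, the join looks something like the sketch below; the table shapes, key/column names, and offset handling are placeholders to illustrate the idea, not the actual LabelJoinV2 implementation:

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{col, date_add, date_format}

// Conceptual sketch only, not the real LabelJoinV2 code.
object LabelJoinSketch {
  def labelJoin(joinOutput: DataFrame,     // backfilled join output, partitioned by "ds"
                labelSnapshot: DataFrame,  // snapshot table of a label join part
                keys: Seq[String],
                labelOffsetDays: Int): DataFrame = {
    // A row dated ds picks up its label from the forward-looking snapshot partition at
    // ds + labelOffsetDays; shifting the snapshot's ds backwards lines the two up.
    val shiftedLabels = labelSnapshot.withColumn(
      "ds",
      date_format(date_add(col("ds"), -labelOffsetDays), "yyyy-MM-dd"))

    joinOutput.join(shiftedLabels, keys :+ "ds", "left")
  }
}
```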

## Checklist
- [ ] Added Unit Tests
- [ ] Covered by existing CI
- [ ] Integration tested
- [ ] Documentation update



<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->
## Summary by CodeRabbit

- **New Features**
- Introduced an updated mechanism for formatting and standardizing label
outputs with the addition of `outputLabelTableV2`.
- Added a new distributed label join operation with the `LabelJoinV2`
class, featuring robust validations, comprehensive error handling, and
detailed logging for improved data integration.
- Implemented a comprehensive test suite for the `LabelJoinV2`
functionality to ensure accuracy and reliability of label joins.

- **Updates**
- Replaced the existing `LabelJoin` class with the new `LabelJoinV2`
class, enhancing the label join process.
<!-- end of auto-generated comment: release notes by coderabbit.ai -->

---------

Co-authored-by: ezvz <[email protected]>
## Summary
Right now, the gcp integration tests do not recognize failures from
dataproc and continue running the commands. This has led to the
group_by_upload job hanging indefinitely. This change ensures that if
the dataproc job state is "ERROR" we fail immediately.

## Checklist
- [ ] Added Unit Tests
- [ ] Covered by existing CI
- [ ] Integration tested
- [ ] Documentation update



<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->

## Summary by CodeRabbit

- **Bug Fixes**
- Enhanced job status checking by ensuring that both empty and
explicitly error-indicating states trigger consistent failure responses,
resulting in more robust job evaluation.

<!-- end of auto-generated comment: release notes by coderabbit.ai -->
@coderabbitai

coderabbitai bot commented Apr 2, 2025

Walkthrough

This pull request enhances DataFrame processing in Spark jobs. The MergeJob class is updated with improved join operations, including a unique ID column and new configuration variables. The SourceJob class now logs warnings for empty DataFrames instead of throwing exceptions. Additionally, tests are updated to include extra configuration and a new schema column.
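
The optimization the walkthrough describes can be condensed into a rough sketch like the one below; the names (`rowIdCol`, `mergeWithRequiredColsOnly`, the flag variable) and the `Seq[DataFrame => DataFrame]` shape of the join parts are illustrative stand-ins, not the actual MergeJob/TableUtils API:

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{col, monotonically_increasing_id}

// Sketch only: the names below are stand-ins, not Chronon's exact API.
object RequiredColsOnlySketch {
  val rowIdCol = "__zipline_internal_row_id"  // plays the role of ziplineInternalRowIdCol
  val carryOnlyRequiredCols = true            // plays the role of carryOnlyRequiredColsFromLeftInJoin

  def mergeWithRequiredColsOnly(leftDf: DataFrame,
                                requiredColumns: Seq[String],
                                joinParts: Seq[DataFrame => DataFrame]): DataFrame = {
    // Tag every left row with a unique id so the wide, non-required columns can be re-attached later.
    val leftWithId = leftDf.withColumn(rowIdCol, monotonically_increasing_id())

    // Carry only the join keys / time columns (plus the id) through each shuffle.
    val slimLeft =
      if (carryOnlyRequiredCols) leftWithId.select((requiredColumns :+ rowIdCol).map(col): _*)
      else leftWithId

    // Fold the group-by join parts over the slim left side.
    val joined = joinParts.foldLeft(slimLeft)((acc, part) => part(acc))

    // Re-attach the non-required columns via the internal id, then drop the id.
    if (carryOnlyRequiredCols)
      joined.join(leftWithId.drop(requiredColumns: _*), Seq(rowIdCol)).drop(rowIdCol)
    else
      joined.drop(rowIdCol)
  }
}
```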

Changes

  • spark/src/main/scala/ai/chronon/spark/MergeJob.scala: Enhanced merge logic; updated imports, added leftSourceTable, requiredColumns, and leftDfWithUUID; modified the join foldLeft and updated method signatures for processJoinedDf and padGroupByFields.
  • spark/src/main/scala/ai/chronon/spark/SourceJob.scala: Updated import to include logger; added producedSomeData flag; changed empty DataFrame handling to log a warning and assert data presence.
  • spark/src/main/scala/ai/chronon/spark/TableUtils.scala: Introduced new configuration parameters carryOnlyRequiredColsFromLeftInJoin and ziplineInternalRowIdCol for join operations.
  • spark/src/test/scala/ai/chronon/spark/test/join/ModularJoinTest.scala: Adjusted SparkSession initialization with extra config; updated schema to add some_col and modified SQL queries and grouping accordingly.

Possibly related PRs

  • Adding verbs for modular join flow to driver #575: The changes in the main PR for MergeJob.scala are related to the new MergeJobRun object introduced in the retrieved PR, as both involve modifications to the MergeJob functionality and its execution context.
  • fix: thread the table props through properly #576: The changes in the main PR for MergeJob.scala involve significant modifications to how the MergeJob class handles data merging and required columns, while the retrieved PR focuses on restructuring import statements and modifying how table properties are accessed in the same MergeJob.scala file, indicating a direct relationship in terms of code changes.

Suggested reviewers

  • varant-zlai
  • nikhil-zlai
  • piyush-zlai
  • david-zlai

Poem

In code we trust, our logic set tight,
Joins now dance with UUIDs in flight.
Warnings whisper where emptiness hides,
Tests affirm schema as change abides.
Sparked by change, our efforts shine bright!
🚀 Happy merging under CodeRabbit’s light!

Warning

Review ran into problems

🔥 Problems

GitHub Actions and Pipeline Checks: Resource not accessible by integration - https://docs.github.com/rest/actions/workflow-runs#list-workflow-runs-for-a-repository.

Please grant the required permissions to the CodeRabbit GitHub App under the organization or repository settings.


📜 Recent review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro (Legacy)

📥 Commits

Reviewing files that changed from the base of the PR and between 0a9d7e6 and dadb417.

📒 Files selected for processing (1)
  • spark/src/main/scala/ai/chronon/spark/MergeJob.scala (7 hunks)
🧰 Additional context used
🧬 Code Definitions (1)
spark/src/main/scala/ai/chronon/spark/MergeJob.scala (1)
spark/src/main/scala/ai/chronon/spark/JoinUtils.scala (3)
  • JoinUtils (37-618)
  • computeLeftSourceTableName (604-617)
  • leftDf (68-95)
⏰ Context from checks skipped due to timeout of 90000ms (14)
  • GitHub Check: streaming_tests
  • GitHub Check: join_tests
  • GitHub Check: groupby_tests
  • GitHub Check: streaming_tests
  • GitHub Check: fetcher_tests
  • GitHub Check: join_tests
  • GitHub Check: analyzer_tests
  • GitHub Check: analyzer_tests
  • GitHub Check: spark_tests
  • GitHub Check: groupby_tests
  • GitHub Check: spark_tests
  • GitHub Check: fetcher_tests
  • GitHub Check: scala_compile_fmt_fix
  • GitHub Check: enforce_triggered_workflows
🔇 Additional comments (13)
spark/src/main/scala/ai/chronon/spark/MergeJob.scala (13)

3-3: Added DataModel import for Events type detection.

Added import enables the data model-specific column selection.


32-32: Added monotonically_increasing_id import for row tracking.

Needed for UUID column generation in the optimization.


51-52: Good refactoring of leftSourceTable.

Extracted variable improves readability.


69-75: Efficient required columns computation.

Smart calculation of minimal column set based on join keys and data model.


84-88: Added row ID for optimization.

Using monotonically_increasing_id creates unique identifiers to rejoin data later.


89-104: Smart optimization flag handling.

Only applies optimization when beneficial. Good logging.


106-114: Conditional DataFrame selection.

Uses minimal columns when beneficial, full dataset otherwise.


119-119: Updated fold operation with optimized DataFrame.

Uses the filtered DataFrame in join operations.


130-135: Pass column names instead of DataFrame.

Modified to work with column selection optimization.


136-146: Smart rejoining of non-required columns.

Efficiently brings back columns filtered out during join process.


147-148: Updated save to use optimized DataFrame.

Ensures final output includes all columns.


239-248: Updated padGroupByFields to handle column filtering.

Added parameter prevents padding columns that will be rejoined later.


269-276: Updated processJoinedDf signature and implementation.

Takes column names instead of DataFrame, maintaining functionality while supporting optimization.


🪧 Tips

Chat

There are 3 ways to chat with CodeRabbit:

  • Review comments: Directly reply to a review comment made by CodeRabbit. Example:
    • I pushed a fix in commit <commit_id>, please review it.
    • Generate unit testing code for this file.
    • Open a follow-up GitHub issue for this discussion.
  • Files and specific lines of code (under the "Files changed" tab): Tag @coderabbitai in a new review comment at the desired location with your query. Examples:
    • @coderabbitai generate unit testing code for this file.
    • @coderabbitai modularize this function.
  • PR comments: Tag @coderabbitai in a new PR comment to ask questions about the PR branch. For the best results, please provide a very specific query, as very limited context is provided in this mode. Examples:
    • @coderabbitai gather interesting stats about this repository and render them as a table. Additionally, render a pie chart showing the language distribution in the codebase.
    • @coderabbitai read src/utils.ts and generate unit testing code.
    • @coderabbitai read the files in the src/scheduler package and generate a class diagram using mermaid and a README in the markdown format.
    • @coderabbitai help me debug CodeRabbit configuration file.

Note: Be mindful of the bot's finite context window. It's strongly recommended to break down tasks such as reading entire modules into smaller chunks. For a focused discussion, use review comments to chat about specific files and their changes, instead of using the PR comments.

CodeRabbit Commands (Invoked using PR comments)

  • @coderabbitai pause to pause the reviews on a PR.
  • @coderabbitai resume to resume the paused reviews.
  • @coderabbitai review to trigger an incremental review. This is useful when automatic reviews are disabled for the repository.
  • @coderabbitai full review to do a full review from scratch and review all the files again.
  • @coderabbitai summary to regenerate the summary of the PR.
  • @coderabbitai resolve resolve all the CodeRabbit review comments.
  • @coderabbitai plan to trigger planning for file edits and PR creation.
  • @coderabbitai configuration to show the current CodeRabbit configuration for the repository.
  • @coderabbitai help to get help.

Other keywords and placeholders

  • Add @coderabbitai ignore anywhere in the PR description to prevent this PR from being reviewed.
  • Add @coderabbitai summary to generate the high-level summary at a specific location in the PR description.
  • Add @coderabbitai anywhere in the PR title to generate the title automatically.

CodeRabbit Configuration File (.coderabbit.yaml)

  • You can programmatically configure CodeRabbit by adding a .coderabbit.yaml file to the root of your repository.
  • Please see the configuration documentation for more information.
  • If your editor has YAML language server enabled, you can add the path at the top of this file to enable auto-completion and validation: # yaml-language-server: $schema=https://coderabbit.ai/integrations/schema.v2.json

Documentation and Community

  • Visit our Documentation for detailed information on how to use CodeRabbit.
  • Join our Discord Community to get help, request features, and share feedback.
  • Follow us on X/Twitter for updates and announcements.

```scala
      requiredColumnsDf
    } else {
      // Else we just need the base DF without the UUID
      leftDfWithUUID
```
doesn't this include the UUID?

tchow-zlai and others added 10 commits April 2, 2025 17:10
Co-authored-by: Thomas Chow <[email protected]>
## Summary
Right now, the gcp integration tests do not recognize failures from
dataproc and continue running the commands. This has led to the
group_by_upload job hanging indefinitely. This change ensures that if
the dataproc job state is "ERROR" we fail immediately.

## Checklist
- [ ] Added Unit Tests
- [ ] Covered by existing CI
- [ ] Integration tested
- [ ] Documentation update



<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->

## Summary by CodeRabbit

- **Bug Fixes**
- Enhanced job status checking by ensuring that both empty and
explicitly error-indicating states trigger consistent failure responses,
resulting in more robust job evaluation.

<!-- end of auto-generated comment: release notes by coderabbit.ai -->
Co-authored-by: Thomas Chow <[email protected]>
Co-authored-by: Thomas Chow <[email protected]>

Co-authored-by: Thomas Chow <[email protected]>
Co-authored-by: Thomas Chow <[email protected]>
```scala
def run(): Unit = {
  val leftDfForSchema = tableUtils.scanDf(query = null, table = leftInputTable, range = Some(dateRange))
  val leftSchema = leftDfForSchema.schema
  val bootstrapInfo =
```
@tchow-zlai tchow-zlai Apr 3, 2025

btw, can bootstrapInfo be computed per stepRange? I'm wondering if we should just push this logic into the dateRange.steps iteration below.

Co-authored-by: Thomas Chow <[email protected]>
@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 0

🧹 Nitpick comments (3)
spark/src/main/scala/ai/chronon/spark/MergeJob.scala (3)

79-82: Bootstrap cost might be high.


91-107: Limit log size if sets are large.


121-121: Consider caching partialDf.

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro (Legacy)

📥 Commits

Reviewing files that changed from the base of the PR and between 785d777 and 0a9d7e6.

📒 Files selected for processing (1)
  • spark/src/main/scala/ai/chronon/spark/MergeJob.scala (6 hunks)
🧰 Additional context used
🧬 Code Definitions (1)
spark/src/main/scala/ai/chronon/spark/MergeJob.scala (1)
spark/src/main/scala/ai/chronon/spark/BootstrapInfo.scala (3)
  • BootstrapInfo (51-74)
  • BootstrapInfo (76-346)
  • from (80-345)
⏰ Context from checks skipped due to timeout of 90000ms (14)
  • GitHub Check: streaming_tests
  • GitHub Check: streaming_tests
  • GitHub Check: spark_tests
  • GitHub Check: join_tests
  • GitHub Check: analyzer_tests
  • GitHub Check: groupby_tests
  • GitHub Check: join_tests
  • GitHub Check: fetcher_tests
  • GitHub Check: groupby_tests
  • GitHub Check: analyzer_tests
  • GitHub Check: spark_tests
  • GitHub Check: fetcher_tests
  • GitHub Check: scala_compile_fmt_fix
  • GitHub Check: enforce_triggered_workflows
🔇 Additional comments (14)
spark/src/main/scala/ai/chronon/spark/MergeJob.scala (14)

3-3: No conflicts observed.


32-32: No issues found.


51-56: Logic is clear.


65-71: Confirm entity vs. event columns.


86-86: UUID could be expensive.


88-90: Clean approach to limit columns.


108-112: Implementation looks solid.


132-134: No concerns.


136-148: Inner join drops unmatched rows; verify.


241-243: All good here.


244-244: No problems noted.


247-249: Verify no user fields are skipped.


271-271: Simplified signature is good.


274-277: Confirm custom filtering.

chewy-zlai and others added 6 commits April 3, 2025 14:33
## Summary
Update the integration tests to notify aws-crew and gcp-crew on
integration test failures.

## Checklist
- [ ] Added Unit Tests
- [ ] Covered by existing CI
- [ ] Integration tested
- [ ] Documentation update



<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->

## Summary by CodeRabbit

- **New Features**
- Added automated Slack notifications that alert the team immediately
upon integration test failures in both AWS and GCP environments.
  
- **Chores**
- Updated notification settings to ensure alerts are routed to the
correct Slack channel for effective troubleshooting.

<!-- end of auto-generated comment: release notes by coderabbit.ai -->
… submission logic (#540)

## Summary
Created Pub/Sub interfaces with a GCP implementation. Updated the
NodeExecutionActivity implementation so job submission uses the Pub/Sub
publisher to publish messages for the agent to pick up. Added the
necessary unit and integration tests.
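
As a rough sketch of what the publisher side could look like with the Google Cloud Pub/Sub Java client (project/topic names and the payload shape are made up for illustration; this is not the PR's code):

```scala
import com.google.cloud.pubsub.v1.Publisher
import com.google.protobuf.ByteString
import com.google.pubsub.v1.{PubsubMessage, TopicName}

// Sketch of the publisher side under the design described above: the orchestration activity
// publishes a job-submission message that the agent later picks up.
object JobSubmissionPublisherSketch {
  def publishJobSubmission(projectId: String, topicId: String, payloadJson: String): String = {
    val publisher = Publisher.newBuilder(TopicName.of(projectId, topicId)).build()
    try {
      val message = PubsubMessage
        .newBuilder()
        .setData(ByteString.copyFromUtf8(payloadJson))
        .build()
      publisher.publish(message).get() // blocks for the message id; real code would stay async
    } finally {
      publisher.shutdown()
    }
  }
}
```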

## Checklist
- [x] Added Unit Tests
- [ ] Covered by existing CI
- [ ] Integration tested
- [ ] Documentation update



<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->
## Summary by CodeRabbit

- **New Features**
- Introduced a robust Pub/Sub module with administration, configuration,
publisher, subscriber, and manager components. The module now supports
asynchronous job submission with improved error handling and offers
flexible production/emulator configurations.
- **Documentation**
- Added comprehensive usage guidance with examples for the Pub/Sub
integration.
- **Tests**
- Expanded unit and integration tests to cover all Pub/Sub operations,
including message publishing, retrieval, and error handling scenarios.
- **Chores**
- Updated dependency versions and build configurations, including new
Maven artifacts for enhanced Google Cloud API support. Updated git
ignore rules to exclude additional directories.
- **Refactor**
- Streamlined the persistence layer by removing obsolete key
definitions.
<!-- end of auto-generated comment: release notes by coderabbit.ai -->
## Summary

^^^

Ran both AWS and GCP integration tests.

Following this PR, I'll work on rebasing
#549, which applies the confs
to the submitter interface.

Previously, to compile it would be:
```
zipline compile --conf=group_bys/aws/purchases.py
```

but now `--conf` or `--input-path` is not required, just:
```
zipline compile
```
and this compiles the entire repo.

In addition, the compiled files get persisted in the `compiled` folder.
Compiled files are no longer persisted per environment, e.g. under
`production/...`.

## Checklist
- [ ] Added Unit Tests
- [x] Covered by existing CI
- [x] Integration tested
- [ ] Documentation update

GCP and AWS integration tests pass

```
<<<<<.....................................COMPILE.....................................>>>>>
+ zipline compile --chronon_root=/Users/davidhan/zipline/chronon/api/python/test/canary
17:49:15 INFO parse_teams.py:56 - Processing teams from path /Users/davidhan/zipline/chronon/api/python/test/canary/teams.py
INFO:ai.chronon.cli.logger:Processing teams from path /Users/davidhan/zipline/chronon/api/python/test/canary/teams.py

GroupBy-s:
  Compiled 5 objects from 3 files.
  Added aws.plaid_fv.v1
  Added aws.purchases.v1_dev
  Added aws.purchases.v1_test
  Added gcp.purchases.v1_dev
  Added gcp.purchases.v1_test

....
 <<<<<<<<<<<<<<<<-----------------JOB STATUS----------------->>>>>>>>>>>>>>>>>
+ aws emr wait step-complete --cluster-id j-13BASWFP15TLR --step-id s-02878283D08B3510ZGCZ
++ aws emr describe-step --cluster-id j-13BASWFP15TLR --step-id s-02878283D08B3510ZGCZ --query Step.Status.State
++ tr -d '"'
+ STEP_STATE=COMPLETED
+ '[' COMPLETED '!=' COMPLETED ']'
+ echo succeeded
succeeded
+ echo -e '\033[0;32m<<<<<.....................................SUCCEEDED!!!.....................................>>>>>\033[0m'
<<<<<.....................................SUCCEEDED!!!.....................................>>>>>


```


<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->
## Summary by CodeRabbit

- **New Features**
- Introduced an enhanced CLI compile command with an option to specify a
custom configuration root.
- Enabled dynamic loading of team configurations from both JSON and
Python formats.
  - Added support for multiple error reporting in various components.

- **Improvements**
- Streamlined error and status messages for clearer, simplified
feedback.
  - Enhanced error handling mechanisms across the application.
- Updated quickstart scripts for AWS and GCP to utilize a unified,
precompiled configuration set.
- Modified the way environment variables are structured for improved
clarity and management.

- **Dependency Updates**
  - Added new dependencies to improve functionality.
- Upgraded several existing dependencies to boost performance and
enhance output formatting.
<!-- end of auto-generated comment: release notes by coderabbit.ai -->

---------

Co-authored-by: Nikhil Simha <[email protected]>
Co-authored-by: Thomas Chow <[email protected]>
Co-authored-by: Thomas Chow <[email protected]>