Spark opt: only carry required columns for join #581
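The optimization named in the title is about pruning each join input down to just the key and value columns the join actually consumes, rather than carrying every source column through the shuffle. A minimal, self-contained Spark sketch of that idea (illustrative only; the table, column, and object names are made up and this is not the PR's actual diff):

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

object RequiredColumnsJoinSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("join-pruning-sketch")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    // Hypothetical inputs with extra columns the join does not need.
    val left  = Seq((1L, "2023-11-01", 10.0, "unused_a")).toDF("user_id", "ds", "purchase_price", "extra_left")
    val right = Seq((1L, "2023-11-01", "US", "unused_b")).toDF("user_id", "ds", "country", "extra_right")

    // Keep only the join keys plus the values downstream computation needs,
    // instead of carrying every column through the shuffle.
    val requiredLeft  = Seq("user_id", "ds", "purchase_price")
    val requiredRight = Seq("user_id", "ds", "country")

    val joined: DataFrame = left
      .select(requiredLeft.map(left(_)): _*)
      .join(right.select(requiredRight.map(right(_)): _*), Seq("user_id", "ds"))

    joined.show()
    spark.stop()
  }
}
```

Spark's optimizer can sometimes push this pruning down on its own, but selecting the required columns explicitly before the join keeps the shuffled rows narrow regardless of how the plan gets optimized.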
Conversation
…table. (#312)

## Summary

## Checklist
- [ ] Added Unit Tests
- [ ] Covered by existing CI
- [ ] Integration tested
- [ ] Documentation update

<!-- This is an auto-generated comment: release notes by coderabbit.ai -->
## Summary by CodeRabbit
- **Logging**
  - Adjusted log levels for table accessibility and partition availability messages from error to info.
<!-- end of auto-generated comment: release notes by coderabbit.ai -->
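The log-level note above corresponds to messages like the `TableUtils` lines visible in the job output further down: an unreachable output table or missing partitions before the first backfill is expected, not an error. A minimal sketch of that kind of change, assuming an SLF4J-style logger; the method and message here are illustrative, not the actual `TableUtils` code:

```scala
import org.slf4j.LoggerFactory

object TableLoggingSketch {
  private val logger = LoggerFactory.getLogger(getClass)

  // Hypothetical stand-in for the real reachability check in TableUtils.
  private def tableReachable(tableName: String): Boolean = false

  def partitions(tableName: String): Seq[String] = {
    if (!tableReachable(tableName)) {
      // Previously logged at ERROR; an absent output table is normal before the first backfill.
      logger.info(s"Table $tableName is not reachable. Returning empty partitions.")
      Seq.empty
    } else {
      Seq("2023-11-01", "2023-11-02") // placeholder partition values
    }
  }

  def main(args: Array[String]): Unit =
    println(partitions("canary-443022.data.quickstart_purchases_v1_test"))
}
```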
## Summary
## Checklist
- [ ] Added Unit Tests
- [ ] Covered by existing CI
- [ ] Integration tested
- [ ] Documentation update
<!-- av pr metadata
This information is embedded by the av CLI when creating PRs to track
the status of stacks when using Aviator. Please do not delete or edit
this section of the PR.
```
{"parent":"main","parentHead":"","trunk":"main"}
```
-->
<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->
## Summary by CodeRabbit
- **New Features**
- Added `bazelisk` plugin with version `1.25.0`
- Configured `bazelisk` plugin repository and commit hash
<!-- end of auto-generated comment: release notes by coderabbit.ai -->
Co-authored-by: Thomas Chow <[email protected]>
## Summary
^^^
```
davidhan@Mac: ~/zipline/chronon/api/py/test/sample (davidhan/streaming_verb) $ echo $RUN_PY
/Users/davidhan/zipline/chronon/api/py/ai/chronon/repo/run.py
davidhan@Mac: ~/zipline/chronon/api/py/test/sample (davidhan/streaming_verb) $ python $RUN_PY --mode streaming --dataproc --groupby-name=etsy.listing_canary.actions_v1 --kafka-bootstrap=bootstrap.zipline-kafka-cluster.us-central1.managedkafka.canary-443022.cloud.goog:9092
```
Produced job: https://console.cloud.google.com/dataproc/jobs/db59b95d-adae-4802-8d96-5613b0799c04/monitoring?region=us-central1&project=canary-443022

## Checklist
- [ ] Added Unit Tests
- [ ] Covered by existing CI
- [ ] Integration tested
- [ ] Documentation update

<!-- This is an auto-generated comment: release notes by coderabbit.ai -->
## Summary by CodeRabbit
- **New Features**
  - Added support for Flink streaming jobs alongside existing Spark jobs.
  - Introduced new command-line options for Flink-specific job configurations.
  - Enhanced job submission capabilities for both Spark and Flink job types.
- **Improvements**
  - Refined argument handling for job submissions.
  - Updated build and upload scripts to include Flink JAR artifacts.
- **Technical Updates**
  - Introduced new job type enumeration.
  - Added methods to support Flink streaming job execution.
  - Enhanced clarity and maintainability of argument handling in job submission logic.
<!-- end of auto-generated comment: release notes by coderabbit.ai -->
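The "Technical Updates" above mention a new job type enumeration and Flink-specific submission methods. A rough Scala sketch of what such a dispatch can look like, with entirely hypothetical names (the real enumeration and the `DataprocSubmitter` API in this repo may be shaped differently):

```scala
// Hypothetical sketch of a job-type dispatch; not the actual DataprocSubmitter API.
sealed trait JobType
case object SparkJob extends JobType
case object FlinkJob extends JobType

object SubmitterSketch {
  // Builds a submission command string for the given runtime.
  def submit(jobType: JobType, mainClass: String, args: Seq[String]): String = jobType match {
    case SparkJob => s"spark-submit --class $mainClass ${args.mkString(" ")}"
    case FlinkJob => s"flink run --class $mainClass ${args.mkString(" ")}"
  }

  def main(argv: Array[String]): Unit =
    // Placeholder main class and args, purely for illustration.
    println(submit(FlinkJob, "com.example.StreamingMain", Seq("--groupby-name=etsy.listing_canary.actions_v1")))
}
```

Keeping both runtimes behind one enumeration is one way the CLI flags described above could map onto either submission path without duplicating the argument-handling logic.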
## Summary ``` python distribution/run_zipline_quickstart.py ``` This runs the full zipline suite of commands against a test quickstart groupby. Example: ``` davidhan@Davids-MacBook-Pro: ~/zipline/chronon (davidhan/do_fetch_test) $ python3 distribution/run_zipline_quickstart.py Created temporary directory: /var/folders/2p/h5v8s0515xv20cgprdjngttr0000gn/T/tmpkirssr9l + WORKING_DIR=/var/folders/2p/h5v8s0515xv20cgprdjngttr0000gn/T/tmpkirssr9l + cd /var/folders/2p/h5v8s0515xv20cgprdjngttr0000gn/T/tmpkirssr9l + GREEN='\033[0;32m' + RED='\033[0;31m' + WHEEL_FILE=zipline_ai-0.1.0.dev0-py3-none-any.whl + bq rm -f -t canary-443022:data.quickstart_purchases_v1_test + bq rm -f -t canary-443022:data.quickstart_purchases_v1_test_upload + '[' -z '' ']' + wget https://dlcdn.apache.org/spark/spark-3.5.4/spark-3.5.4-bin-hadoop3.tgz --2025-01-30 10:16:21-- https://dlcdn.apache.org/spark/spark-3.5.4/spark-3.5.4-bin-hadoop3.tgz Resolving dlcdn.apache.org (dlcdn.apache.org)... 151.101.2.132 Connecting to dlcdn.apache.org (dlcdn.apache.org)|151.101.2.132|:443... connected. HTTP request sent, awaiting response... 200 OK Length: 400879762 (382M) [application/x-gzip] Saving to: ‘spark-3.5.4-bin-hadoop3.tgz’ spark-3.5.4-bin-hadoop3.tgz 100%[==========================================================================================================================================>] 382.31M 50.2MB/s in 8.4s 2025-01-30 10:16:30 (45.5 MB/s) - ‘spark-3.5.4-bin-hadoop3.tgz’ saved [400879762/400879762] + tar -xzf spark-3.5.4-bin-hadoop3.tgz ++ pwd + export SPARK_HOME=/var/folders/2p/h5v8s0515xv20cgprdjngttr0000gn/T/tmpkirssr9l/spark-3.5.4-bin-hadoop3 + SPARK_HOME=/var/folders/2p/h5v8s0515xv20cgprdjngttr0000gn/T/tmpkirssr9l/spark-3.5.4-bin-hadoop3 + git clone [email protected]:zipline-ai/cananry-confs.git Cloning into 'cananry-confs'... remote: Enumerating objects: 148, done. remote: Counting objects: 100% (148/148), done. remote: Compressing objects: 100% (77/77), done. remote: Total 148 (delta 63), reused 139 (delta 60), pack-reused 0 (from 0) Receiving objects: 100% (148/148), 93.28 KiB | 746.00 KiB/s, done. Resolving deltas: 100% (63/63), done. + cd cananry-confs + git fetch origin davidhan/canary From github.com:zipline-ai/cananry-confs * branch davidhan/canary -> FETCH_HEAD + git checkout davidhan/canary branch 'davidhan/canary' set up to track 'origin/davidhan/canary'. Switched to a new branch 'davidhan/canary' + python3 -m venv tmp_chronon + source tmp_chronon/bin/activate ++ deactivate nondestructive ++ '[' -n '' ']' ++ '[' -n '' ']' ++ hash -r ++ '[' -n '' ']' ++ unset VIRTUAL_ENV ++ unset VIRTUAL_ENV_PROMPT ++ '[' '!' 
nondestructive = nondestructive ']' ++ case "$(uname)" in +++ uname ++ export VIRTUAL_ENV=/private/var/folders/2p/h5v8s0515xv20cgprdjngttr0000gn/T/tmpkirssr9l/cananry-confs/tmp_chronon ++ VIRTUAL_ENV=/private/var/folders/2p/h5v8s0515xv20cgprdjngttr0000gn/T/tmpkirssr9l/cananry-confs/tmp_chronon ++ _OLD_VIRTUAL_PATH=/Users/davidhan/.asdf/plugins/python/shims:/Users/davidhan/.asdf/installs/python/3.13.0/bin:/Users/davidhan/Downloads/google-cloud-sdk/bin:/Users/davidhan/.cargo/bin:/Users/davidhan/.asdf/shims:/Users/davidhan/.asdf/bin:/opt/homebrew/bin:/opt/homebrew/sbin:/usr/local/bin:/System/Cryptexes/App/usr/bin:/usr/bin:/bin:/usr/sbin:/sbin:/var/run/com.apple.security.cryptexd/codex.system/bootstrap/usr/local/bin:/var/run/com.apple.security.cryptexd/codex.system/bootstrap/usr/bin:/var/run/com.apple.security.cryptexd/codex.system/bootstrap/usr/appleinternal/bin ++ PATH=/private/var/folders/2p/h5v8s0515xv20cgprdjngttr0000gn/T/tmpkirssr9l/cananry-confs/tmp_chronon/bin:/Users/davidhan/.asdf/plugins/python/shims:/Users/davidhan/.asdf/installs/python/3.13.0/bin:/Users/davidhan/Downloads/google-cloud-sdk/bin:/Users/davidhan/.cargo/bin:/Users/davidhan/.asdf/shims:/Users/davidhan/.asdf/bin:/opt/homebrew/bin:/opt/homebrew/sbin:/usr/local/bin:/System/Cryptexes/App/usr/bin:/usr/bin:/bin:/usr/sbin:/sbin:/var/run/com.apple.security.cryptexd/codex.system/bootstrap/usr/local/bin:/var/run/com.apple.security.cryptexd/codex.system/bootstrap/usr/bin:/var/run/com.apple.security.cryptexd/codex.system/bootstrap/usr/appleinternal/bin ++ export PATH ++ VIRTUAL_ENV_PROMPT=tmp_chronon ++ export VIRTUAL_ENV_PROMPT ++ '[' -n '' ']' ++ '[' -z '' ']' ++ _OLD_VIRTUAL_PS1= ++ PS1='(tmp_chronon) ' ++ export PS1 ++ hash -r + gcloud storage cp gs://zipline-artifacts-canary/jars/zipline_ai-0.1.0.dev0-py3-none-any.whl . Copying gs://zipline-artifacts-canary/jars/zipline_ai-0.1.0.dev0-py3-none-any.whl to file://./zipline_ai-0.1.0.dev0-py3-none-any.whl Completed files 1/1 | 371.1kiB/371.1kiB + pip uninstall zipline-ai WARNING: Skipping zipline-ai as it is not installed. 
+ pip install --force-reinstall zipline_ai-0.1.0.dev0-py3-none-any.whl Processing ./zipline_ai-0.1.0.dev0-py3-none-any.whl Collecting click (from zipline-ai==0.1.0.dev0) Using cached click-8.1.8-py3-none-any.whl.metadata (2.3 kB) Collecting thrift==0.21.0 (from zipline-ai==0.1.0.dev0) Using cached thrift-0.21.0-cp313-cp313-macosx_15_0_arm64.whl Collecting google-cloud-storage==2.19.0 (from zipline-ai==0.1.0.dev0) Using cached google_cloud_storage-2.19.0-py2.py3-none-any.whl.metadata (9.1 kB) Collecting google-auth<3.0dev,>=2.26.1 (from google-cloud-storage==2.19.0->zipline-ai==0.1.0.dev0) Using cached google_auth-2.38.0-py2.py3-none-any.whl.metadata (4.8 kB) Collecting google-api-core<3.0.0dev,>=2.15.0 (from google-cloud-storage==2.19.0->zipline-ai==0.1.0.dev0) Using cached google_api_core-2.24.1-py3-none-any.whl.metadata (3.0 kB) Collecting google-cloud-core<3.0dev,>=2.3.0 (from google-cloud-storage==2.19.0->zipline-ai==0.1.0.dev0) Using cached google_cloud_core-2.4.1-py2.py3-none-any.whl.metadata (2.7 kB) Collecting google-resumable-media>=2.7.2 (from google-cloud-storage==2.19.0->zipline-ai==0.1.0.dev0) Using cached google_resumable_media-2.7.2-py2.py3-none-any.whl.metadata (2.2 kB) Collecting requests<3.0.0dev,>=2.18.0 (from google-cloud-storage==2.19.0->zipline-ai==0.1.0.dev0) Using cached requests-2.32.3-py3-none-any.whl.metadata (4.6 kB) Collecting google-crc32c<2.0dev,>=1.0 (from google-cloud-storage==2.19.0->zipline-ai==0.1.0.dev0) Using cached google_crc32c-1.6.0-py3-none-any.whl Collecting six>=1.7.2 (from thrift==0.21.0->zipline-ai==0.1.0.dev0) Using cached six-1.17.0-py2.py3-none-any.whl.metadata (1.7 kB) Collecting googleapis-common-protos<2.0.dev0,>=1.56.2 (from google-api-core<3.0.0dev,>=2.15.0->google-cloud-storage==2.19.0->zipline-ai==0.1.0.dev0) Using cached googleapis_common_protos-1.66.0-py2.py3-none-any.whl.metadata (1.5 kB) Collecting protobuf!=3.20.0,!=3.20.1,!=4.21.0,!=4.21.1,!=4.21.2,!=4.21.3,!=4.21.4,!=4.21.5,<6.0.0.dev0,>=3.19.5 (from google-api-core<3.0.0dev,>=2.15.0->google-cloud-storage==2.19.0->zipline-ai==0.1.0.dev0) Using cached protobuf-5.29.3-cp38-abi3-macosx_10_9_universal2.whl.metadata (592 bytes) Collecting proto-plus<2.0.0dev,>=1.22.3 (from google-api-core<3.0.0dev,>=2.15.0->google-cloud-storage==2.19.0->zipline-ai==0.1.0.dev0) Using cached proto_plus-1.26.0-py3-none-any.whl.metadata (2.2 kB) Collecting cachetools<6.0,>=2.0.0 (from google-auth<3.0dev,>=2.26.1->google-cloud-storage==2.19.0->zipline-ai==0.1.0.dev0) Using cached cachetools-5.5.1-py3-none-any.whl.metadata (5.4 kB) Collecting pyasn1-modules>=0.2.1 (from google-auth<3.0dev,>=2.26.1->google-cloud-storage==2.19.0->zipline-ai==0.1.0.dev0) Using cached pyasn1_modules-0.4.1-py3-none-any.whl.metadata (3.5 kB) Collecting rsa<5,>=3.1.4 (from google-auth<3.0dev,>=2.26.1->google-cloud-storage==2.19.0->zipline-ai==0.1.0.dev0) Using cached rsa-4.9-py3-none-any.whl.metadata (4.2 kB) Collecting charset-normalizer<4,>=2 (from requests<3.0.0dev,>=2.18.0->google-cloud-storage==2.19.0->zipline-ai==0.1.0.dev0) Using cached charset_normalizer-3.4.1-cp313-cp313-macosx_10_13_universal2.whl.metadata (35 kB) Collecting idna<4,>=2.5 (from requests<3.0.0dev,>=2.18.0->google-cloud-storage==2.19.0->zipline-ai==0.1.0.dev0) Using cached idna-3.10-py3-none-any.whl.metadata (10 kB) Collecting urllib3<3,>=1.21.1 (from requests<3.0.0dev,>=2.18.0->google-cloud-storage==2.19.0->zipline-ai==0.1.0.dev0) Using cached urllib3-2.3.0-py3-none-any.whl.metadata (6.5 kB) Collecting certifi>=2017.4.17 (from 
requests<3.0.0dev,>=2.18.0->google-cloud-storage==2.19.0->zipline-ai==0.1.0.dev0) Using cached certifi-2024.12.14-py3-none-any.whl.metadata (2.3 kB) Collecting pyasn1<0.7.0,>=0.4.6 (from pyasn1-modules>=0.2.1->google-auth<3.0dev,>=2.26.1->google-cloud-storage==2.19.0->zipline-ai==0.1.0.dev0) Using cached pyasn1-0.6.1-py3-none-any.whl.metadata (8.4 kB) Using cached google_cloud_storage-2.19.0-py2.py3-none-any.whl (131 kB) Using cached click-8.1.8-py3-none-any.whl (98 kB) Using cached google_api_core-2.24.1-py3-none-any.whl (160 kB) Using cached google_auth-2.38.0-py2.py3-none-any.whl (210 kB) Using cached google_cloud_core-2.4.1-py2.py3-none-any.whl (29 kB) Using cached google_resumable_media-2.7.2-py2.py3-none-any.whl (81 kB) Using cached requests-2.32.3-py3-none-any.whl (64 kB) Using cached six-1.17.0-py2.py3-none-any.whl (11 kB) Using cached cachetools-5.5.1-py3-none-any.whl (9.5 kB) Using cached certifi-2024.12.14-py3-none-any.whl (164 kB) Using cached charset_normalizer-3.4.1-cp313-cp313-macosx_10_13_universal2.whl (195 kB) Using cached googleapis_common_protos-1.66.0-py2.py3-none-any.whl (221 kB) Using cached idna-3.10-py3-none-any.whl (70 kB) Using cached proto_plus-1.26.0-py3-none-any.whl (50 kB) Using cached protobuf-5.29.3-cp38-abi3-macosx_10_9_universal2.whl (417 kB) Using cached pyasn1_modules-0.4.1-py3-none-any.whl (181 kB) Using cached rsa-4.9-py3-none-any.whl (34 kB) Using cached urllib3-2.3.0-py3-none-any.whl (128 kB) Using cached pyasn1-0.6.1-py3-none-any.whl (83 kB) Installing collected packages: urllib3, six, pyasn1, protobuf, idna, google-crc32c, click, charset-normalizer, certifi, cachetools, thrift, rsa, requests, pyasn1-modules, proto-plus, googleapis-common-protos, google-resumable-media, google-auth, google-api-core, google-cloud-core, google-cloud-storage, zipline-ai Successfully installed cachetools-5.5.1 certifi-2024.12.14 charset-normalizer-3.4.1 click-8.1.8 google-api-core-2.24.1 google-auth-2.38.0 google-cloud-core-2.4.1 google-cloud-storage-2.19.0 google-crc32c-1.6.0 google-resumable-media-2.7.2 googleapis-common-protos-1.66.0 idna-3.10 proto-plus-1.26.0 protobuf-5.29.3 pyasn1-0.6.1 pyasn1-modules-0.4.1 requests-2.32.3 rsa-4.9 six-1.17.0 thrift-0.21.0 urllib3-2.3.0 zipline-ai-0.1.0.dev0 [notice] A new release of pip is available: 24.2 -> 25.0 [notice] To update, run: pip install --upgrade pip ++ pwd + export PYTHONPATH=:/var/folders/2p/h5v8s0515xv20cgprdjngttr0000gn/T/tmpkirssr9l/cananry-confs + PYTHONPATH=:/var/folders/2p/h5v8s0515xv20cgprdjngttr0000gn/T/tmpkirssr9l/cananry-confs + DATAPROC_SUBMITTER_ID_STR='Dataproc submitter job id' + echo -e '\033[0;32m<<<<<.....................................COMPILE.....................................>>>>>\033[0m' <<<<<.....................................COMPILE.....................................>>>>> + zipline compile --conf=group_bys/quickstart/purchases.py Using chronon root path - /private/var/folders/2p/h5v8s0515xv20cgprdjngttr0000gn/T/tmpkirssr9l/cananry-confs Input group_bys from - /private/var/folders/2p/h5v8s0515xv20cgprdjngttr0000gn/T/tmpkirssr9l/cananry-confs/group_bys/quickstart/purchases.py GroupBy Team - quickstart GroupBy Name - purchases.v1 Writing GroupBy to - /private/var/folders/2p/h5v8s0515xv20cgprdjngttr0000gn/T/tmpkirssr9l/cananry-confs/production/group_bys/quickstart/purchases.v1 GroupBy Team - quickstart GroupBy Name - purchases.v1_test Writing GroupBy to - /private/var/folders/2p/h5v8s0515xv20cgprdjngttr0000gn/T/tmpkirssr9l/cananry-confs/production/group_bys/quickstart/purchases.v1_test 
Successfully wrote 2 GroupBy objects to /private/var/folders/2p/h5v8s0515xv20cgprdjngttr0000gn/T/tmpkirssr9l/cananry-confs/production + echo -e '\033[0;32m<<<<<.....................................BACKFILL.....................................>>>>>\033[0m' <<<<<.....................................BACKFILL.....................................>>>>> + touch tmp_backfill.out + zipline run --conf production/group_bys/quickstart/purchases.v1_test --dataproc + tee /dev/tty tmp_backfill.out Running with args: {'conf': 'production/group_bys/quickstart/purchases.v1_test', 'dataproc': True, 'env': 'dev', 'mode': None, 'ds': None, 'app_name': None, 'start_ds': None, 'end_ds': None, 'parallelism': None, 'repo': '.', 'online_jar': 'cloud_gcp-assembly-0.1.0-SNAPSHOT.jar', 'online_class': 'ai.chronon.integrations.cloud_gcp.GcpApiImpl', 'version': None, 'spark_version': '2.4.0', 'spark_submit_path': None, 'spark_streaming_submit_path': None, 'online_jar_fetch': None, 'sub_help': False, 'conf_type': None, 'online_args': None, 'chronon_jar': None, 'release_tag': None, 'list_apps': None, 'render_info': None} Running with args: {'conf': 'production/group_bys/quickstart/purchases.v1_test', 'dataproc': True, 'env': 'dev', 'mode': None, 'ds': None, 'app_name': None, 'start_ds': None, 'end_ds': None, 'parallelism': None, 'repo': '.', 'online_jar': 'cloud_gcp-assembly-0.1.0-SNAPSHOT.jar', 'online_class': 'ai.chronon.integrations.cloud_gcp.GcpApiImpl', 'version': None, 'spark_version': '2.4.0', 'spark_submit_path': None, 'spark_streaming_submit_path': None, 'online_jar_fetch': None, 'sub_help': False, 'conf_type': None, 'online_args': None, 'chronon_jar': None, 'release_tag': None, 'list_apps': None, 'render_info': None} Array(group-by-backfill, --conf-path=purchases.v1_test, --end-date=2025-01-30, --conf-type=group_bys, --additional-conf-path=additional-confs.yaml, --is-gcp, --gcp-project-id=canary-443022, --gcp-bigtable-instance-id=zipline-canary-instance)Array(group-by-backfill, --conf-path=purchases.v1_test, --end-date=2025-01-30, --conf-type=group_bys, --additional-conf-path=additional-confs.yaml, --is-gcp, --gcp-project-id=canary-443022, --gcp-bigtable-instance-id=zipline-canary-instance) WARNING: sun.reflect.Reflection.getCallerClass is not supported. This will impact performance. WARNING: sun.reflect.Reflection.getCallerClass is not supported. This will impact performance. 
Dataproc submitter job id: 945d836f-20d8-4768-97fb-0889c00ed87bDataproc submitter job id: 945d836f-20d8-4768-97fb-0889c00ed87b Setting env variables: From <common_env> setting VERSION=latest From <common_env> setting SPARK_SUBMIT_PATH=[TODO]/path/to/spark-submit From <common_env> setting JOB_MODE=local[*] From <common_env> setting HADOOP_DIR=[STREAMING-TODO]/path/to/folder/containing From <common_env> setting CHRONON_ONLINE_CLASS=[ONLINE-TODO]your.online.class From <common_env> setting CHRONON_ONLINE_ARGS=[ONLINE-TODO]args prefixed with -Z become constructor map for your implementation of ai.chronon.online.Api, -Zkv-host=<YOUR_HOST> -Zkv-port=<YOUR_PORT> From <common_env> setting PARTITION_COLUMN=ds From <common_env> setting PARTITION_FORMAT=yyyy-MM-dd From <common_env> setting CUSTOMER_ID=canary From <common_env> setting GCP_PROJECT_ID=canary-443022 From <common_env> setting GCP_REGION=us-central1 From <common_env> setting GCP_DATAPROC_CLUSTER_NAME=zipline-canary-cluster From <common_env> setting GCP_BIGTABLE_INSTANCE_ID=zipline-canary-instance From <cli_args> setting APP_NAME=chronon From <cli_args> setting CHRONON_ONLINE_JAR=cloud_gcp-assembly-0.1.0-SNAPSHOT.jar File production/group_bys/quickstart/purchases.v1_test uploaded to metadata/purchases.v1_test in bucket zipline-warehouse-canary. Running command: java -cp /Users/davidhan/zipline/chronon/cloud_gcp_submitter/target/scala-2.12/cloud_gcp_submitter-assembly-0.1.0-SNAPSHOT.jar:/var/folders/2p/h5v8s0515xv20cgprdjngttr0000gn/T/tmpkirssr9l/spark-3.5.4-bin-hadoop3/jars/* ai.chronon.integrations.cloud_gcp.DataprocSubmitter group-by-backfill --conf-path=purchases.v1_test --end-date=2025-01-30 --conf-type=group_bys --additional-conf-path=additional-confs.yaml --gcs_files=gs://zipline-warehouse-canary/metadata/purchases.v1_test,gs://zipline-artifacts-canary/confs/additional-confs.yaml --chronon_jar_uri=gs://zipline-artifacts-canary/jars/cloud_gcp-assembly-0.1.0-SNAPSHOT.jar Setting env variables: From <common_env> setting VERSION=latest From <common_env> setting SPARK_SUBMIT_PATH=[TODO]/path/to/spark-submit From <common_env> setting JOB_MODE=local[*] From <common_env> setting HADOOP_DIR=[STREAMING-TODO]/path/to/folder/containing From <common_env> setting CHRONON_ONLINE_CLASS=[ONLINE-TODO]your.online.class From <common_env> setting CHRONON_ONLINE_ARGS=[ONLINE-TODO]args prefixed with -Z become constructor map for your implementation of ai.chronon.online.Api, -Zkv-host=<YOUR_HOST> -Zkv-port=<YOUR_PORT> From <common_env> setting PARTITION_COLUMN=ds From <common_env> setting PARTITION_FORMAT=yyyy-MM-dd From <common_env> setting CUSTOMER_ID=canary From <common_env> setting GCP_PROJECT_ID=canary-443022 From <common_env> setting GCP_REGION=us-central1 From <common_env> setting GCP_DATAPROC_CLUSTER_NAME=zipline-canary-cluster From <common_env> setting GCP_BIGTABLE_INSTANCE_ID=zipline-canary-instance From <cli_args> setting APP_NAME=chronon From <cli_args> setting CHRONON_ONLINE_JAR=cloud_gcp-assembly-0.1.0-SNAPSHOT.jar File production/group_bys/quickstart/purchases.v1_test uploaded to metadata/purchases.v1_test in bucket zipline-warehouse-canary. 
Running command: java -cp /Users/davidhan/zipline/chronon/cloud_gcp_submitter/target/scala-2.12/cloud_gcp_submitter-assembly-0.1.0-SNAPSHOT.jar:/var/folders/2p/h5v8s0515xv20cgprdjngttr0000gn/T/tmpkirssr9l/spark-3.5.4-bin-hadoop3/jars/* ai.chronon.integrations.cloud_gcp.DataprocSubmitter group-by-backfill --conf-path=purchases.v1_test --end-date=2025-01-30 --conf-type=group_bys --additional-conf-path=additional-confs.yaml --gcs_files=gs://zipline-warehouse-canary/metadata/purchases.v1_test,gs://zipline-artifacts-canary/confs/additional-confs.yaml --chronon_jar_uri=gs://zipline-artifacts-canary/jars/cloud_gcp-assembly-0.1.0-SNAPSHOT.jar ++ cat tmp_backfill.out ++ grep 'Dataproc submitter job id' ++ cut -d ' ' -f5 + BACKFILL_JOB_ID=945d836f-20d8-4768-97fb-0889c00ed87b + check_dataproc_job_state 945d836f-20d8-4768-97fb-0889c00ed87b + JOB_ID=945d836f-20d8-4768-97fb-0889c00ed87b + '[' -z 945d836f-20d8-4768-97fb-0889c00ed87b ']' + gcloud dataproc jobs wait 945d836f-20d8-4768-97fb-0889c00ed87b --region=us-central1 Waiting for job output... 25/01/30 18:16:47 WARN SparkConf: The configuration key 'spark.yarn.executor.failuresValidityInterval' has been deprecated as of Spark 3.5 and may be removed in the future. Please use the new key 'spark.executor.failuresValidityInterval' instead. 25/01/30 18:16:47 WARN SparkConf: The configuration key 'spark.yarn.executor.failuresValidityInterval' has been deprecated as of Spark 3.5 and may be removed in the future. Please use the new key 'spark.executor.failuresValidityInterval' instead. Using warehouse dir: /tmp/945d836f-20d8-4768-97fb-0889c00ed87b/local_warehouse 25/01/30 18:16:50 INFO HiveConf: Found configuration file file:/etc/hive/conf.dist/hive-site.xml 25/01/30 18:16:50 WARN SparkConf: The configuration key 'spark.yarn.executor.failuresValidityInterval' has been deprecated as of Spark 3.5 and may be removed in the future. Please use the new key 'spark.executor.failuresValidityInterval' instead. 25/01/30 18:16:50 INFO SparkEnv: Registering MapOutputTracker 25/01/30 18:16:50 INFO SparkEnv: Registering BlockManagerMaster 25/01/30 18:16:50 INFO SparkEnv: Registering BlockManagerMasterHeartbeat 25/01/30 18:16:50 INFO SparkEnv: Registering OutputCommitCoordinator 25/01/30 18:16:51 INFO DataprocSparkPlugin: Registered 188 driver metrics 25/01/30 18:16:51 INFO DefaultNoHARMFailoverProxyProvider: Connecting to ResourceManager at zipline-canary-cluster-m.us-central1-c.c.canary-443022.internal./10.128.0.17:8032 25/01/30 18:16:51 INFO AHSProxy: Connecting to Application History server at zipline-canary-cluster-m.us-central1-c.c.canary-443022.internal./10.128.0.17:10200 25/01/30 18:16:51 INFO Configuration: resource-types.xml not found 25/01/30 18:16:51 INFO ResourceUtils: Unable to find 'resource-types.xml'. 25/01/30 18:16:52 INFO YarnClientImpl: Submitted application application_1738197659103_0011 25/01/30 18:16:53 WARN SparkConf: The configuration key 'spark.yarn.executor.failuresValidityInterval' has been deprecated as of Spark 3.5 and may be removed in the future. Please use the new key 'spark.executor.failuresValidityInterval' instead. 25/01/30 18:16:53 INFO DefaultNoHARMFailoverProxyProvider: Connecting to ResourceManager at zipline-canary-cluster-m.us-central1-c.c.canary-443022.internal./10.128.0.17:8030 25/01/30 18:16:55 INFO GoogleCloudStorageImpl: Ignoring exception of type GoogleJsonResponseException; verified object already exists with desired state. 
25/01/30 18:16:55 INFO GoogleHadoopOutputStream: hflush(): No-op due to rate limit (RateLimiter[stableRate=0.2qps]): readers will *not* yet see flushed data for gs://dataproc-temp-us-central1-703996152583-pqtvfptb/5d9e94ed-7649-4828-8b64-e3d58632a5d0/spark-job-history/application_1738197659103_0011.inprogress [CONTEXT ratelimit_period="1 MINUTES" ] 2025/01/30 18:16:55 INFO SparkSessionBuilder.scala:76 - Chronon logging system initialized. Overrides spark's configuration 2025/01/30 18:16:58 ERROR TableUtils.scala:188 - Table canary-443022.data.quickstart_purchases_v1_test is not reachable. Returning empty partitions. 2025/01/30 18:17:15 INFO TableUtils.scala:200 - Found 30, between (2023-11-01, 2023-11-30) partitions for table: data.purchases 2025/01/30 18:17:15 INFO TableUtils.scala:622 - Unfilled range computation: Output table: canary-443022.data.quickstart_purchases_v1_test Missing output partitions: [2023-11-01,2023-11-02,2023-11-03,2023-11-04,2023-11-05,2023-11-06,2023-11-07,2023-11-08,2023-11-09,2023-11-10,2023-11-11,2023-11-12,2023-11-13,2023-11-14,2023-11-15,2023-11-16,2023-11-17,2023-11-18,2023-11-19,2023-11-20,2023-11-21,2023-11-22,2023-11-23,2023-11-24,2023-11-25,2023-11-26,2023-11-27,2023-11-28,2023-11-29,2023-11-30,2023-12-01,2023-12-02,2023-12-03,2023-12-04,2023-12-05,2023-12-06,2023-12-07,2023-12-08,2023-12-09,2023-12-10,2023-12-11,2023-12-12,2023-12-13,2023-12-14,2023-12-15,2023-12-16,2023-12-17,2023-12-18,2023-12-19,2023-12-20,2023-12-21,2023-12-22,2023-12-23,2023-12-24,2023-12-25,2023-12-26,2023-12-27,2023-12-28,2023-12-29,2023-12-30,2023-12-31,2024-01-01,2024-01-02,2024-01-03,2024-01-04,2024-01-05,2024-01-06,2024-01-07,2024-01-08,2024-01-09,2024-01-10,2024-01-11,2024-01-12,2024-01-13,2024-01-14,2024-01-15,2024-01-16,2024-01-17,2024-01-18,2024-01-19,2024-01-20,2024-01-21,2024-01-22,2024-01-23,2024-01-24,2024-01-25,2024-01-26,2024-01-27,2024-01-28,2024-01-29,2024-01-30,2024-01-31,2024-02-01,2024-02-02,2024-02-03,2024-02-04,2024-02-05,2024-02-06,2024-02-07,2024-02-08,2024-02-09,2024-02-10,2024-02-11,2024-02-12,2024-02-13,2024-02-14,2024-02-15,2024-02-16,2024-02-17,2024-02-18,2024-02-19,2024-02-20,2024-02-21,2024-02-22,2024-02-23,2024-02-24,2024-02-25,2024-02-26,2024-02-27,2024-02-28,2024-02-29,2024-03-01,2024-03-02,2024-03-03,2024-03-04,2024-03-05,2024-03-06,2024-03-07,2024-03-08,2024-03-09,2024-03-10,2024-03-11,2024-03-12,2024-03-13,2024-03-14,2024-03-15,2024-03-16,2024-03-17,2024-03-18,2024-03-19,2024-03-20,2024-03-21,2024-03-22,2024-03-23,2024-03-24,2024-03-25,2024-03-26,2024-03-27,2024-03-28,2024-03-29,2024-03-30,2024-03-31,2024-04-01,2024-04-02,2024-04-03,2024-04-04,2024-04-05,2024-04-06,2024-04-07,2024-04-08,2024-04-09,2024-04-10,2024-04-11,2024-04-12,2024-04-13,2024-04-14,2024-04-15,2024-04-16,2024-04-17,2024-04-18,2024-04-19,2024-04-20,2024-04-21,2024-04-22,2024-04-23,2024-04-24,2024-04-25,2024-04-26,2024-04-27,2024-04-28,2024-04-29,2024-04-30,2024-05-01,2024-05-02,2024-05-03,2024-05-04,2024-05-05,2024-05-06,2024-05-07,2024-05-08,2024-05-09,2024-05-10,2024-05-11,2024-05-12,2024-05-13,2024-05-14,2024-05-15,2024-05-16,2024-05-17,2024-05-18,2024-05-19,2024-05-20,2024-05-21,2024-05-22,2024-05-23,2024-05-24,2024-05-25,2024-05-26,2024-05-27,2024-05-28,2024-05-29,2024-05-30,2024-05-31,2024-06-01,2024-06-02,2024-06-03,2024-06-04,2024-06-05,2024-06-06,2024-06-07,2024-06-08,2024-06-09,2024-06-10,2024-06-11,2024-06-12,2024-06-13,2024-06-14,2024-06-15,2024-06-16,2024-06-17,2024-06-18,2024-06-19,2024-06-20,2024-06-21,2024-06-22,2024-06-23,2024-06-24,2024-06-25,2024-06-26,2024-06-
27,2024-06-28,2024-06-29,2024-06-30,2024-07-01,2024-07-02,2024-07-03,2024-07-04,2024-07-05,2024-07-06,2024-07-07,2024-07-08,2024-07-09,2024-07-10,2024-07-11,2024-07-12,2024-07-13,2024-07-14,2024-07-15,2024-07-16,2024-07-17,2024-07-18,2024-07-19,2024-07-20,2024-07-21,2024-07-22,2024-07-23,2024-07-24,2024-07-25,2024-07-26,2024-07-27,2024-07-28,2024-07-29,2024-07-30,2024-07-31,2024-08-01,2024-08-02,2024-08-03,2024-08-04,2024-08-05,2024-08-06,2024-08-07,2024-08-08,2024-08-09,2024-08-10,2024-08-11,2024-08-12,2024-08-13,2024-08-14,2024-08-15,2024-08-16,2024-08-17,2024-08-18,2024-08-19,2024-08-20,2024-08-21,2024-08-22,2024-08-23,2024-08-24,2024-08-25,2024-08-26,2024-08-27,2024-08-28,2024-08-29,2024-08-30,2024-08-31,2024-09-01,2024-09-02,2024-09-03,2024-09-04,2024-09-05,2024-09-06,2024-09-07,2024-09-08,2024-09-09,2024-09-10,2024-09-11,2024-09-12,2024-09-13,2024-09-14,2024-09-15,2024-09-16,2024-09-17,2024-09-18,2024-09-19,2024-09-20,2024-09-21,2024-09-22,2024-09-23,2024-09-24,2024-09-25,2024-09-26,2024-09-27,2024-09-28,2024-09-29,2024-09-30,2024-10-01,2024-10-02,2024-10-03,2024-10-04,2024-10-05,2024-10-06,2024-10-07,2024-10-08,2024-10-09,2024-10-10,2024-10-11,2024-10-12,2024-10-13,2024-10-14,2024-10-15,2024-10-16,2024-10-17,2024-10-18,2024-10-19,2024-10-20,2024-10-21,2024-10-22,2024-10-23,2024-10-24,2024-10-25,2024-10-26,2024-10-27,2024-10-28,2024-10-29,2024-10-30,2024-10-31,2024-11-01,2024-11-02,2024-11-03,2024-11-04,2024-11-05,2024-11-06,2024-11-07,2024-11-08,2024-11-09,2024-11-10,2024-11-11,2024-11-12,2024-11-13,2024-11-14,2024-11-15,2024-11-16,2024-11-17,2024-11-18,2024-11-19,2024-11-20,2024-11-21,2024-11-22,2024-11-23,2024-11-24,2024-11-25,2024-11-26,2024-11-27,2024-11-28,2024-11-29,2024-11-30,2024-12-01,2024-12-02,2024-12-03,2024-12-04,2024-12-05,2024-12-06,2024-12-07,2024-12-08,2024-12-09,2024-12-10,2024-12-11,2024-12-12,2024-12-13,2024-12-14,2024-12-15,2024-12-16,2024-12-17,2024-12-18,2024-12-19,2024-12-20,2024-12-21,2024-12-22,2024-12-23,2024-12-24,2024-12-25,2024-12-26,2024-12-27,2024-12-28,2024-12-29,2024-12-30,2024-12-31,2025-01-01,2025-01-02,2025-01-03,2025-01-04,2025-01-05,2025-01-06,2025-01-07,2025-01-08,2025-01-09,2025-01-10,2025-01-11,2025-01-12,2025-01-13,2025-01-14,2025-01-15,2025-01-16,2025-01-17,2025-01-18,2025-01-19,2025-01-20,2025-01-21,2025-01-22,2025-01-23,2025-01-24,2025-01-25,2025-01-26,2025-01-27,2025-01-28,2025-01-29,2025-01-30] Input tables: data.purchases Missing input partitions: 
[2023-12-01,2023-12-02,2023-12-03,2023-12-04,2023-12-05,2023-12-06,2023-12-07,2023-12-08,2023-12-09,2023-12-10,2023-12-11,2023-12-12,2023-12-13,2023-12-14,2023-12-15,2023-12-16,2023-12-17,2023-12-18,2023-12-19,2023-12-20,2023-12-21,2023-12-22,2023-12-23,2023-12-24,2023-12-25,2023-12-26,2023-12-27,2023-12-28,2023-12-29,2023-12-30,2023-12-31,2024-01-01,2024-01-02,2024-01-03,2024-01-04,2024-01-05,2024-01-06,2024-01-07,2024-01-08,2024-01-09,2024-01-10,2024-01-11,2024-01-12,2024-01-13,2024-01-14,2024-01-15,2024-01-16,2024-01-17,2024-01-18,2024-01-19,2024-01-20,2024-01-21,2024-01-22,2024-01-23,2024-01-24,2024-01-25,2024-01-26,2024-01-27,2024-01-28,2024-01-29,2024-01-30,2024-01-31,2024-02-01,2024-02-02,2024-02-03,2024-02-04,2024-02-05,2024-02-06,2024-02-07,2024-02-08,2024-02-09,2024-02-10,2024-02-11,2024-02-12,2024-02-13,2024-02-14,2024-02-15,2024-02-16,2024-02-17,2024-02-18,2024-02-19,2024-02-20,2024-02-21,2024-02-22,2024-02-23,2024-02-24,2024-02-25,2024-02-26,2024-02-27,2024-02-28,2024-02-29,2024-03-01,2024-03-02,2024-03-03,2024-03-04,2024-03-05,2024-03-06,2024-03-07,2024-03-08,2024-03-09,2024-03-10,2024-03-11,2024-03-12,2024-03-13,2024-03-14,2024-03-15,2024-03-16,2024-03-17,2024-03-18,2024-03-19,2024-03-20,2024-03-21,2024-03-22,2024-03-23,2024-03-24,2024-03-25,2024-03-26,2024-03-27,2024-03-28,2024-03-29,2024-03-30,2024-03-31,2024-04-01,2024-04-02,2024-04-03,2024-04-04,2024-04-05,2024-04-06,2024-04-07,2024-04-08,2024-04-09,2024-04-10,2024-04-11,2024-04-12,2024-04-13,2024-04-14,2024-04-15,2024-04-16,2024-04-17,2024-04-18,2024-04-19,2024-04-20,2024-04-21,2024-04-22,2024-04-23,2024-04-24,2024-04-25,2024-04-26,2024-04-27,2024-04-28,2024-04-29,2024-04-30,2024-05-01,2024-05-02,2024-05-03,2024-05-04,2024-05-05,2024-05-06,2024-05-07,2024-05-08,2024-05-09,2024-05-10,2024-05-11,2024-05-12,2024-05-13,2024-05-14,2024-05-15,2024-05-16,2024-05-17,2024-05-18,2024-05-19,2024-05-20,2024-05-21,2024-05-22,2024-05-23,2024-05-24,2024-05-25,2024-05-26,2024-05-27,2024-05-28,2024-05-29,2024-05-30,2024-05-31,2024-06-01,2024-06-02,2024-06-03,2024-06-04,2024-06-05,2024-06-06,2024-06-07,2024-06-08,2024-06-09,2024-06-10,2024-06-11,2024-06-12,2024-06-13,2024-06-14,2024-06-15,2024-06-16,2024-06-17,2024-06-18,2024-06-19,2024-06-20,2024-06-21,2024-06-22,2024-06-23,2024-06-24,2024-06-25,2024-06-26,2024-06-27,2024-06-28,2024-06-29,2024-06-30,2024-07-01,2024-07-02,2024-07-03,2024-07-04,2024-07-05,2024-07-06,2024-07-07,2024-07-08,2024-07-09,2024-07-10,2024-07-11,2024-07-12,2024-07-13,2024-07-14,2024-07-15,2024-07-16,2024-07-17,2024-07-18,2024-07-19,2024-07-20,2024-07-21,2024-07-22,2024-07-23,2024-07-24,2024-07-25,2024-07-26,2024-07-27,2024-07-28,2024-07-29,2024-07-30,2024-07-31,2024-08-01,2024-08-02,2024-08-03,2024-08-04,2024-08-05,2024-08-06,2024-08-07,2024-08-08,2024-08-09,2024-08-10,2024-08-11,2024-08-12,2024-08-13,2024-08-14,2024-08-15,2024-08-16,2024-08-17,2024-08-18,2024-08-19,2024-08-20,2024-08-21,2024-08-22,2024-08-23,2024-08-24,2024-08-25,2024-08-26,2024-08-27,2024-08-28,2024-08-29,2024-08-30,2024-08-31,2024-09-01,2024-09-02,2024-09-03,2024-09-04,2024-09-05,2024-09-06,2024-09-07,2024-09-08,2024-09-09,2024-09-10,2024-09-11,2024-09-12,2024-09-13,2024-09-14,2024-09-15,2024-09-16,2024-09-17,2024-09-18,2024-09-19,2024-09-20,2024-09-21,2024-09-22,2024-09-23,2024-09-24,2024-09-25,2024-09-26,2024-09-27,2024-09-28,2024-09-29,2024-09-30,2024-10-01,2024-10-02,2024-10-03,2024-10-04,2024-10-05,2024-10-06,2024-10-07,2024-10-08,2024-10-09,2024-10-10,2024-10-11,2024-10-12,2024-10-13,2024-10-14,2024-10-15,2024-10-16,2024-10-17,2024-10-18,2
024-10-19,2024-10-20,2024-10-21,2024-10-22,2024-10-23,2024-10-24,2024-10-25,2024-10-26,2024-10-27,2024-10-28,2024-10-29,2024-10-30,2024-10-31,2024-11-01,2024-11-02,2024-11-03,2024-11-04,2024-11-05,2024-11-06,2024-11-07,2024-11-08,2024-11-09,2024-11-10,2024-11-11,2024-11-12,2024-11-13,2024-11-14,2024-11-15,2024-11-16,2024-11-17,2024-11-18,2024-11-19,2024-11-20,2024-11-21,2024-11-22,2024-11-23,2024-11-24,2024-11-25,2024-11-26,2024-11-27,2024-11-28,2024-11-29,2024-11-30,2024-12-01,2024-12-02,2024-12-03,2024-12-04,2024-12-05,2024-12-06,2024-12-07,2024-12-08,2024-12-09,2024-12-10,2024-12-11,2024-12-12,2024-12-13,2024-12-14,2024-12-15,2024-12-16,2024-12-17,2024-12-18,2024-12-19,2024-12-20,2024-12-21,2024-12-22,2024-12-23,2024-12-24,2024-12-25,2024-12-26,2024-12-27,2024-12-28,2024-12-29,2024-12-30,2024-12-31,2025-01-01,2025-01-02,2025-01-03,2025-01-04,2025-01-05,2025-01-06,2025-01-07,2025-01-08,2025-01-09,2025-01-10,2025-01-11,2025-01-12,2025-01-13,2025-01-14,2025-01-15,2025-01-16,2025-01-17,2025-01-18,2025-01-19,2025-01-20,2025-01-21,2025-01-22,2025-01-23,2025-01-24,2025-01-25,2025-01-26,2025-01-27,2025-01-28,2025-01-29,2025-01-30] Unfilled Partitions: [2023-11-01,2023-11-02,2023-11-03,2023-11-04,2023-11-05,2023-11-06,2023-11-07,2023-11-08,2023-11-09,2023-11-10,2023-11-11,2023-11-12,2023-11-13,2023-11-14,2023-11-15,2023-11-16,2023-11-17,2023-11-18,2023-11-19,2023-11-20,2023-11-21,2023-11-22,2023-11-23,2023-11-24,2023-11-25,2023-11-26,2023-11-27,2023-11-28,2023-11-29,2023-11-30] Unfilled ranges: [2023-11-01...2023-11-30] 2025/01/30 18:17:15 INFO GroupBy.scala:733 - group by unfilled ranges: List([2023-11-01...2023-11-30]) 2025/01/30 18:17:15 INFO GroupBy.scala:738 - Group By ranges to compute: [2023-11-01...2023-11-30] 2025/01/30 18:17:15 INFO GroupBy.scala:743 - Computing group by for range: [2023-11-01...2023-11-30] [1/1] 2025/01/30 18:17:15 INFO GroupBy.scala:492 - ----[Processing GroupBy: quickstart.purchases.v1_test]---- 2025/01/30 18:17:20 INFO TableUtils.scala:200 - Found 30, between (2023-11-01, 2023-11-30) partitions for table: data.purchases 2025/01/30 18:17:20 INFO GroupBy.scala:618 - Computing intersected range as: query range: [2023-11-01...2023-11-30] query window: None source table: data.purchases source data range: [2023-11-01...2023-11-30] source start/end: null/null source data model: Events queryable data range: [null...2023-11-30] intersected range: [2023-11-01...2023-11-30] 2025/01/30 18:17:20 INFO GroupBy.scala:658 - Time Mapping: Some((ts,ts)) 2025/01/30 18:17:20 INFO GroupBy.scala:668 - Rendering source query: intersected/effective scan range: Some([2023-11-01...2023-11-30]) partitionConditions: List(ds >= '2023-11-01', ds <= '2023-11-30') metaColumns: Map(ds -> null, ts -> ts) 2025/01/30 18:17:20 INFO TableUtils.scala:759 - Scanning data: table: data.purchases options: Map() format: Some(bigquery) selects: `ds` `ts` `user_id` `purchase_price` wheres: partition filters: ds >= '2023-11-01', ds <= '2023-11-30' 2025/01/30 18:17:20 INFO HopsAggregator.scala:147 - Left bounds: 1d->unbounded minQueryTs = 2023-11-01 00:00:00 2025/01/30 18:17:20 INFO FastHashing.scala:52 - Generating key builder over keys: bigint : user_id 2025/01/30 18:17:22 INFO TableUtils.scala:459 - Repartitioning before writing... 
2025/01/30 18:17:25 INFO TableUtils.scala:494 - 2416 rows requested to be written into table canary-443022.data.quickstart_purchases_v1_test 2025/01/30 18:17:25 INFO TableUtils.scala:531 - repartitioning data for table canary-443022.data.quickstart_purchases_v1_test by 300 spark tasks into 30 table partitions and 10 files per partition 2025/01/30 18:17:25 INFO TableUtils.scala:536 - Sorting within partitions with cols: List(ds) 2025/01/30 18:17:33 INFO TableUtils.scala:469 - Finished writing to canary-443022.data.quickstart_purchases_v1_test 2025/01/30 18:17:33 INFO TableUtils.scala:440 - Cleared the dataframe cache after repartition & write to canary-443022.data.quickstart_purchases_v1_test - start @ 2025-01-30 18:17:22 end @ 2025-01-30 18:17:33 2025/01/30 18:17:33 INFO GroupBy.scala:757 - Wrote to table canary-443022.data.quickstart_purchases_v1_test, into partitions: [2023-11-01...2023-11-30] 2025/01/30 18:17:33 INFO GroupBy.scala:759 - Wrote to table canary-443022.data.quickstart_purchases_v1_test for range: [2023-11-01...2023-11-30] Job [945d836f-20d8-4768-97fb-0889c00ed87b] finished successfully. done: true driverControlFilesUri: gs://dataproc-staging-us-central1-703996152583-lxespibx/google-cloud-dataproc-metainfo/5d9e94ed-7649-4828-8b64-e3d58632a5d0/jobs/945d836f-20d8-4768-97fb-0889c00ed87b/ driverOutputResourceUri: gs://dataproc-staging-us-central1-703996152583-lxespibx/google-cloud-dataproc-metainfo/5d9e94ed-7649-4828-8b64-e3d58632a5d0/jobs/945d836f-20d8-4768-97fb-0889c00ed87b/driveroutput jobUuid: 945d836f-20d8-4768-97fb-0889c00ed87b placement: clusterName: zipline-canary-cluster clusterUuid: 5d9e94ed-7649-4828-8b64-e3d58632a5d0 reference: jobId: 945d836f-20d8-4768-97fb-0889c00ed87b projectId: canary-443022 sparkJob: args: - group-by-backfill - --conf-path=purchases.v1_test - --end-date=2025-01-30 - --conf-type=group_bys - --additional-conf-path=additional-confs.yaml - --is-gcp - --gcp-project-id=canary-443022 - --gcp-bigtable-instance-id=zipline-canary-instance fileUris: - gs://zipline-warehouse-canary/metadata/purchases.v1_test - gs://zipline-artifacts-canary/confs/additional-confs.yaml jarFileUris: - gs://zipline-artifacts-canary/jars/cloud_gcp-assembly-0.1.0-SNAPSHOT.jar mainClass: ai.chronon.spark.Driver status: state: DONE stateStartTime: '2025-01-30T18:17:38.722934Z' statusHistory: - state: PENDING stateStartTime: '2025-01-30T18:16:43.326557Z' - state: SETUP_DONE stateStartTime: '2025-01-30T18:16:43.353624Z' - details: Agent reported job success state: RUNNING stateStartTime: '2025-01-30T18:16:43.597231Z' yarnApplications: - name: groupBy_quickstart.purchases.v1_test_backfill progress: 1.0 state: FINISHED trackingUrl: http://zipline-canary-cluster-m.us-central1-c.c.canary-443022.internal.:8088/proxy/application_1738197659103_0011/ + echo -e '\033[0;32m <<<<<<<<<<<<<<<<-----------------JOB STATUS----------------->>>>>>>>>>>>>>>>>\033[0m' <<<<<<<<<<<<<<<<-----------------JOB STATUS----------------->>>>>>>>>>>>>>>>> ++ gcloud dataproc jobs describe 945d836f-20d8-4768-97fb-0889c00ed87b --region=us-central1 --format=flattened ++ grep status.state: + JOB_STATE='status.state: DONE' + echo status.state: DONE status.state: DONE + '[' -z 'status.state: DONE' ']' + echo -e '\033[0;32m<<<<<.....................................GROUP-BY-UPLOAD.....................................>>>>>\033[0m' <<<<<.....................................GROUP-BY-UPLOAD.....................................>>>>> + touch tmp_gbu.out + zipline run --mode upload --conf 
production/group_bys/quickstart/purchases.v1_test --ds 2023-12-01 --dataproc + tee /dev/tty tmp_gbu.out Running with args: {'mode': 'upload', 'conf': 'production/group_bys/quickstart/purchases.v1_test', 'ds': '2023-12-01', 'dataproc': True, 'env': 'dev', 'app_name': None, 'start_ds': None, 'end_ds': None, 'parallelism': None, 'repo': '.', 'online_jar': 'cloud_gcp-assembly-0.1.0-SNAPSHOT.jar', 'online_class': 'ai.chronon.integrations.cloud_gcp.GcpApiImpl', 'version': None, 'spark_version': '2.4.0', 'spark_submit_path': None, 'spark_streaming_submit_path': None, 'online_jar_fetch': None, 'sub_help': False, 'conf_type': None, 'online_args': None, 'chronon_jar': None, 'release_tag': None, 'list_apps': None, 'render_info': None} Running with args: {'mode': 'upload', 'conf': 'production/group_bys/quickstart/purchases.v1_test', 'ds': '2023-12-01', 'dataproc': True, 'env': 'dev', 'app_name': None, 'start_ds': None, 'end_ds': None, 'parallelism': None, 'repo': '.', 'online_jar': 'cloud_gcp-assembly-0.1.0-SNAPSHOT.jar', 'online_class': 'ai.chronon.integrations.cloud_gcp.GcpApiImpl', 'version': None, 'spark_version': '2.4.0', 'spark_submit_path': None, 'spark_streaming_submit_path': None, 'online_jar_fetch': None, 'sub_help': False, 'conf_type': None, 'online_args': None, 'chronon_jar': None, 'release_tag': None, 'list_apps': None, 'render_info': None} Array(group-by-upload, --conf-path=purchases.v1_test, --end-date=2023-12-01, --conf-type=group_bys, --additional-conf-path=additional-confs.yaml, --is-gcp, --gcp-project-id=canary-443022, --gcp-bigtable-instance-id=zipline-canary-instance)Array(group-by-upload, --conf-path=purchases.v1_test, --end-date=2023-12-01, --conf-type=group_bys, --additional-conf-path=additional-confs.yaml, --is-gcp, --gcp-project-id=canary-443022, --gcp-bigtable-instance-id=zipline-canary-instance) WARNING: sun.reflect.Reflection.getCallerClass is not supported. This will impact performance. WARNING: sun.reflect.Reflection.getCallerClass is not supported. This will impact performance. 
Dataproc submitter job id: c672008e-7380-4a82-a121-4bb0cb46503f Dataproc submitter job id: c672008e-7380-4a82-a121-4bb0cb46503f Setting env variables: From <default_env> setting EXECUTOR_CORES=1 From <default_env> setting EXECUTOR_MEMORY=8G From <default_env> setting PARALLELISM=1000 From <default_env> setting MAX_EXECUTORS=1000 From <common_env> setting VERSION=latest From <common_env> setting SPARK_SUBMIT_PATH=[TODO]/path/to/spark-submit From <common_env> setting JOB_MODE=local[*] From <common_env> setting HADOOP_DIR=[STREAMING-TODO]/path/to/folder/containing From <common_env> setting CHRONON_ONLINE_CLASS=[ONLINE-TODO]your.online.class From <common_env> setting CHRONON_ONLINE_ARGS=[ONLINE-TODO]args prefixed with -Z become constructor map for your implementation of ai.chronon.online.Api, -Zkv-host=<YOUR_HOST> -Zkv-port=<YOUR_PORT> From <common_env> setting PARTITION_COLUMN=ds From <common_env> setting PARTITION_FORMAT=yyyy-MM-dd From <common_env> setting CUSTOMER_ID=canary From <common_env> setting GCP_PROJECT_ID=canary-443022 From <common_env> setting GCP_REGION=us-central1 From <common_env> setting GCP_DATAPROC_CLUSTER_NAME=zipline-canary-cluster From <common_env> setting GCP_BIGTABLE_INSTANCE_ID=zipline-canary-instance From <cli_args> setting APP_NAME=chronon_group_bys_upload_dev_quickstart.purchases.v1_test From <cli_args> setting CHRONON_CONF_PATH=./production/group_bys/quickstart/purchases.v1_test From <cli_args> setting CHRONON_ONLINE_JAR=cloud_gcp-assembly-0.1.0-SNAPSHOT.jar File production/group_bys/quickstart/purchases.v1_test uploaded to metadata/purchases.v1_test in bucket zipline-warehouse-canary. Running command: java -cp /Users/davidhan/zipline/chronon/cloud_gcp_submitter/target/scala-2.12/cloud_gcp_submitter-assembly-0.1.0-SNAPSHOT.jar:/var/folders/2p/h5v8s0515xv20cgprdjngttr0000gn/T/tmpkirssr9l/spark-3.5.4-bin-hadoop3/jars/* ai.chronon.integrations.cloud_gcp.DataprocSubmitter group-by-upload --conf-path=purchases.v1_test --end-date=2023-12-01 --conf-type=group_bys --additional-conf-path=additional-confs.yaml --gcs_files=gs://zipline-warehouse-canary/metadata/purchases.v1_test,gs://zipline-artifacts-canary/confs/additional-confs.yaml --chronon_jar_uri=gs://zipline-artifacts-canary/jars/cloud_gcp-assembly-0.1.0-SNAPSHOT.jar Setting env variables: From <default_env> setting EXECUTOR_CORES=1 From <default_env> setting EXECUTOR_MEMORY=8G From <default_env> setting PARALLELISM=1000 From <default_env> setting MAX_EXECUTORS=1000 From <common_env> setting VERSION=latest From <common_env> setting SPARK_SUBMIT_PATH=[TODO]/path/to/spark-submit From <common_env> setting JOB_MODE=local[*] From <common_env> setting HADOOP_DIR=[STREAMING-TODO]/path/to/folder/containing From <common_env> setting CHRONON_ONLINE_CLASS=[ONLINE-TODO]your.online.class From <common_env> setting CHRONON_ONLINE_ARGS=[ONLINE-TODO]args prefixed with -Z become constructor map for your implementation of ai.chronon.online.Api, -Zkv-host=<YOUR_HOST> -Zkv-port=<YOUR_PORT> From <common_env> setting PARTITION_COLUMN=ds From <common_env> setting PARTITION_FORMAT=yyyy-MM-dd From <common_env> setting CUSTOMER_ID=canary From <common_env> setting GCP_PROJECT_ID=canary-443022 From <common_env> setting GCP_REGION=us-central1 From <common_env> setting GCP_DATAPROC_CLUSTER_NAME=zipline-canary-cluster From <common_env> setting GCP_BIGTABLE_INSTANCE_ID=zipline-canary-instance From <cli_args> setting APP_NAME=chronon_group_bys_upload_dev_quickstart.purchases.v1_test From <cli_args> setting 
CHRONON_CONF_PATH=./production/group_bys/quickstart/purchases.v1_test From <cli_args> setting CHRONON_ONLINE_JAR=cloud_gcp-assembly-0.1.0-SNAPSHOT.jar File production/group_bys/quickstart/purchases.v1_test uploaded to metadata/purchases.v1_test in bucket zipline-warehouse-canary. Running command: java -cp /Users/davidhan/zipline/chronon/cloud_gcp_submitter/target/scala-2.12/cloud_gcp_submitter-assembly-0.1.0-SNAPSHOT.jar:/var/folders/2p/h5v8s0515xv20cgprdjngttr0000gn/T/tmpkirssr9l/spark-3.5.4-bin-hadoop3/jars/* ai.chronon.integrations.cloud_gcp.DataprocSubmitter group-by-upload --conf-path=purchases.v1_test --end-date=2023-12-01 --conf-type=group_bys --additional-conf-path=additional-confs.yaml --gcs_files=gs://zipline-warehouse-canary/metadata/purchases.v1_test,gs://zipline-artifacts-canary/confs/additional-confs.yaml --chronon_jar_uri=gs://zipline-artifacts-canary/jars/cloud_gcp-assembly-0.1.0-SNAPSHOT.jar ++ cat tmp_gbu.out ++ grep 'Dataproc submitter job id' ++ cut -d ' ' -f5 + GBU_JOB_ID=c672008e-7380-4a82-a121-4bb0cb46503f + check_dataproc_job_state c672008e-7380-4a82-a121-4bb0cb46503f + JOB_ID=c672008e-7380-4a82-a121-4bb0cb46503f + '[' -z c672008e-7380-4a82-a121-4bb0cb46503f ']' + gcloud dataproc jobs wait c672008e-7380-4a82-a121-4bb0cb46503f --region=us-central1 Waiting for job output... 25/01/30 18:17:48 WARN SparkConf: The configuration key 'spark.yarn.executor.failuresValidityInterval' has been deprecated as of Spark 3.5 and may be removed in the future. Please use the new key 'spark.executor.failuresValidityInterval' instead. 25/01/30 18:17:48 WARN SparkConf: The configuration key 'spark.yarn.executor.failuresValidityInterval' has been deprecated as of Spark 3.5 and may be removed in the future. Please use the new key 'spark.executor.failuresValidityInterval' instead. Using warehouse dir: /tmp/c672008e-7380-4a82-a121-4bb0cb46503f/local_warehouse 25/01/30 18:17:50 INFO HiveConf: Found configuration file file:/etc/hive/conf.dist/hive-site.xml 25/01/30 18:17:50 WARN SparkConf: The configuration key 'spark.yarn.executor.failuresValidityInterval' has been deprecated as of Spark 3.5 and may be removed in the future. Please use the new key 'spark.executor.failuresValidityInterval' instead. 25/01/30 18:17:51 INFO SparkEnv: Registering MapOutputTracker 25/01/30 18:17:51 INFO SparkEnv: Registering BlockManagerMaster 25/01/30 18:17:51 INFO SparkEnv: Registering BlockManagerMasterHeartbeat 25/01/30 18:17:51 INFO SparkEnv: Registering OutputCommitCoordinator 25/01/30 18:17:51 INFO DataprocSparkPlugin: Registered 188 driver metrics 25/01/30 18:17:51 INFO DefaultNoHARMFailoverProxyProvider: Connecting to ResourceManager at zipline-canary-cluster-m.us-central1-c.c.canary-443022.internal./10.128.0.17:8032 25/01/30 18:17:52 INFO AHSProxy: Connecting to Application History server at zipline-canary-cluster-m.us-central1-c.c.canary-443022.internal./10.128.0.17:10200 25/01/30 18:17:52 INFO Configuration: resource-types.xml not found 25/01/30 18:17:52 INFO ResourceUtils: Unable to find 'resource-types.xml'. 25/01/30 18:17:53 INFO YarnClientImpl: Submitted application application_1738197659103_0012 25/01/30 18:17:54 WARN SparkConf: The configuration key 'spark.yarn.executor.failuresValidityInterval' has been deprecated as of Spark 3.5 and may be removed in the future. Please use the new key 'spark.executor.failuresValidityInterval' instead. 
25/01/30 18:17:54 INFO DefaultNoHARMFailoverProxyProvider: Connecting to ResourceManager at zipline-canary-cluster-m.us-central1-c.c.canary-443022.internal./10.128.0.17:8030 25/01/30 18:17:55 INFO GoogleCloudStorageImpl: Ignoring exception of type GoogleJsonResponseException; verified object already exists with desired state. 25/01/30 18:17:56 INFO GoogleHadoopOutputStream: hflush(): No-op due to rate limit (RateLimiter[stableRate=0.2qps]): readers will *not* yet see flushed data for gs://dataproc-temp-us-central1-703996152583-pqtvfptb/5d9e94ed-7649-4828-8b64-e3d58632a5d0/spark-job-history/application_1738197659103_0012.inprogress [CONTEXT ratelimit_period="1 MINUTES" ] 2025/01/30 18:17:56 INFO SparkSessionBuilder.scala:76 - Chronon logging system initialized. Overrides spark's configuration 2025/01/30 18:17:57 INFO GroupByUpload.scala:229 - GroupBy upload for: quickstart.quickstart.purchases.v1_test Accuracy: SNAPSHOT Data Model: Events 2025/01/30 18:17:57 INFO GroupBy.scala:492 - ----[Processing GroupBy: quickstart.purchases.v1_test]---- 2025/01/30 18:18:14 INFO TableUtils.scala:200 - Found 30, between (2023-11-01, 2023-11-30) partitions for table: data.purchases 2025/01/30 18:18:14 INFO GroupBy.scala:618 - Computing intersected range as: query range: [2023-12-01...2023-12-01] query window: None source table: data.purchases source data range: [2023-11-01...2023-12-01] source start/end: null/null source data model: Events queryable data range: [null...2023-12-01] intersected range: [2023-11-01...2023-12-01] 2025/01/30 18:18:14 INFO GroupBy.scala:658 - Time Mapping: Some((ts,ts)) 2025/01/30 18:18:14 INFO GroupBy.scala:668 - Rendering source query: intersected/effective scan range: Some([2023-11-01...2023-12-01]) partitionConditions: List(ds >= '2023-11-01', ds <= '2023-12-01') metaColumns: Map(ds -> null, ts -> ts) 2025/01/30 18:18:14 INFO TableUtils.scala:759 - Scanning data: table: data.purchases options: Map() format: Some(bigquery) selects: `ds` `ts` `user_id` `purchase_price` wheres: partition filters: ds >= '2023-11-01', ds <= '2023-12-01' 2025/01/30 18:18:14 INFO HopsAggregator.scala:147 - Left bounds: 1d->unbounded minQueryTs = 2023-12-01 00:00:00 2025/01/30 18:18:14 INFO FastHashing.scala:52 - Generating key builder over keys: bigint : user_id 2025/01/30 18:18:15 INFO KvRdd.scala:102 - key schema: { "type" : "record", "name" : "Key", "namespace" : "ai.chronon.data", "doc" : "", "fields" : [ { "name" : "user_id", "type" : [ "null", "long" ], "doc" : "" } ] } value schema: { "type" : "record", "name" : "Value", "namespace" : "ai.chronon.data", "doc" : "", "fields" : [ { "name" : "purchase_price_sum_3d", "type" : [ "null", "long" ], "doc" : "" }, { "name" : "purchase_price_sum_14d", "type" : [ "null", "long" ], "doc" : "" }, { "name" : "purchase_price_sum_30d", "type" : [ "null", "long" ], "doc" : "" }, { "name" : "purchase_price_count_3d", "type" : [ "null", "long" ], "doc" : "" }, { "name" : "purchase_price_count_14d", "type" : [ "null", "long" ], "doc" : "" }, { "name" : "purchase_price_count_30d", "type" : [ "null", "long" ], "doc" : "" }, { "name" : "purchase_price_average_3d", "type" : [ "null", "double" ], "doc" : "" }, { "name" : "purchase_price_average_14d", "type" : [ "null", "double" ], "doc" : "" }, { "name" : "purchase_price_average_30d", "type" : [ "null", "double" ], "doc" : "" }, { "name" : "purchase_price_last10", "type" : [ "null", { "type" : "array", "items" : "long" } ], "doc" : "" } ] } 2025/01/30 18:18:15 INFO GroupBy.scala:492 - ----[Processing GroupBy: 
quickstart.purchases.v1_test]---- 2025/01/30 18:18:19 INFO TableUtils.scala:200 - Found 30, between (2023-11-01, 2023-11-30) partitions for table: data.purchases 2025/01/30 18:18:19 INFO GroupBy.scala:618 - Computing intersected range as: query range: [2023-12-01...2023-12-01] query window: None source table: data.purchases source data range: [2023-11-01...2023-12-01] source start/end: null/null source data model: Events queryable data range: [null...2023-12-01] intersected range: [2023-11-01...2023-12-01] 2025/01/30 18:18:19 INFO GroupBy.scala:658 - Time Mapping: Some((ts,ts)) 2025/01/30 18:18:19 INFO GroupBy.scala:668 - Rendering source query: intersected/effective scan range: Some([2023-11-01...2023-12-01]) partitionConditions: List(ds >= '2023-11-01', ds <= '2023-12-01') metaColumns: Map(ds -> null, ts -> ts) 2025/01/30 18:18:20 INFO TableUtils.scala:759 - Scanning data: table: data.purchases options: Map() format: Some(bigquery) selects: `ds` `ts` `user_id` `purchase_price` wheres: partition filters: ds >= '2023-11-01', ds <= '2023-12-01' 2025/01/30 18:18:20 INFO GroupByUpload.scala:175 - Not setting InputAvroSchema to GroupByServingInfo as there is no streaming source defined. 2025/01/30 18:18:20 INFO GroupByUpload.scala:188 - Built GroupByServingInfo for quickstart.purchases.v1_test: table: data.purchases / data-model: Events keySchema: Success(struct<user_id:bigint>) valueSchema: Success(struct<purchase_price:bigint>) mutationSchema: Failure(java.lang.NullPointerException) inputSchema: Failure(java.lang.NullPointerException) selectedSchema: Success(struct<purchase_price:bigint>) streamSchema: Failure(java.lang.NullPointerException) 2025/01/30 18:18:20 INFO TableUtils.scala:459 - Repartitioning before writing... 2025/01/30 18:18:24 INFO TableUtils.scala:494 - 102 rows requested to be written into table canary-443022.data.quickstart_purchases_v1_test_upload 2025/01/30 18:18:24 INFO TableUtils.scala:531 - repartitioning data for table canary-443022.data.quickstart_purchases_v1_test_upload by 200 spark tasks into 1 table partitions and 10 files per partition 2025/01/30 18:18:24 INFO TableUtils.scala:536 - Sorting within partitions with cols: List(ds) 2025/01/30 18:18:30 INFO TableUtils.scala:469 - Finished writing to canary-443022.data.quickstart_purchases_v1_test_upload 2025/01/30 18:18:30 INFO TableUtils.scala:440 - Cleared the dataframe cache after repartition & write to canary-443022.data.quickstart_purchases_v1_test_upload - start @ 2025-01-30 18:18:20 end @ 2025-01-30 18:18:30 Job [c672008e-7380-4a82-a121-4bb0cb46503f] finished successfully. 
done: true driverControlFilesUri: gs://dataproc-staging-us-central1-703996152583-lxespibx/google-cloud-dataproc-metainfo/5d9e94ed-7649-4828-8b64-e3d58632a5d0/jobs/c672008e-7380-4a82-a121-4bb0cb46503f/ driverOutputResourceUri: gs://dataproc-staging-us-central1-703996152583-lxespibx/google-cloud-dataproc-metainfo/5d9e94ed-7649-4828-8b64-e3d58632a5d0/jobs/c672008e-7380-4a82-a121-4bb0cb46503f/driveroutput jobUuid: c672008e-7380-4a82-a121-4bb0cb46503f placement: clusterName: zipline-canary-cluster clusterUuid: 5d9e94ed-7649-4828-8b64-e3d58632a5d0 reference: jobId: c672008e-7380-4a82-a121-4bb0cb46503f projectId: canary-443022 sparkJob: args: - group-by-upload - --conf-path=purchases.v1_test - --end-date=2023-12-01 - --conf-type=group_bys - --additional-conf-path=additional-confs.yaml - --is-gcp - --gcp-project-id=canary-443022 - --gcp-bigtable-instance-id=zipline-canary-instance fileUris: - gs://zipline-warehouse-canary/metadata/purchases.v1_test - gs://zipline-artifacts-canary/confs/additional-confs.yaml jarFileUris: - gs://zipline-artifacts-canary/jars/cloud_gcp-assembly-0.1.0-SNAPSHOT.jar mainClass: ai.chronon.spark.Driver status: state: DONE stateStartTime: '2025-01-30T18:18:33.742458Z' statusHistory: - state: PENDING stateStartTime: '2025-01-30T18:17:44.197477Z' - state: SETUP_DONE stateStartTime: '2025-01-30T18:17:44.223246Z' - details: Agent reported job success state: RUNNING stateStartTime: '2025-01-30T18:17:44.438240Z' yarnApplications: - name: group-by-upload progress: 1.0 state: FINISHED trackingUrl: http://zipline-canary-cluster-m.us-central1-c.c.canary-443022.internal.:8088/proxy/application_1738197659103_0012/ + echo -e '\033[0;32m <<<<<<<<<<<<<<<<-----------------JOB STATUS----------------->>>>>>>>>>>>>>>>>\033[0m' <<<<<<<<<<<<<<<<-----------------JOB STATUS----------------->>>>>>>>>>>>>>>>> ++ gcloud dataproc jobs describe c672008e-7380-4a82-a121-4bb0cb46503f --region=us-central1 --format=flattened ++ grep status.state: + JOB_STATE='status.state: DONE' + echo status.state: DONE status.state: DONE + '[' -z 'status.state: DONE' ']' + echo -e '\033[0;32m<<<<<.....................................UPLOAD-TO-KV.....................................>>>>>\033[0m' <<<<<.....................................UPLOAD-TO-KV.....................................>>>>> + touch tmp_upload_to_kv.out + zipline run --mode upload-to-kv --conf production/group_bys/quickstart/purchases.v1_test --partition-string=2023-12-01 --dataproc + tee /dev/tty tmp_upload_to_kv.out Running with args: {'mode': 'upload-to-kv', 'conf': 'production/group_bys/quickstart/purchases.v1_test', 'dataproc': True, 'env': 'dev', 'ds': None, 'app_name': None, 'start_ds': None, 'end_ds': None, 'parallelism': None, 'repo': '.', 'online_jar': 'cloud_gcp-assembly-0.1.0-SNAPSHOT.jar', 'online_class': 'ai.chronon.integrations.cloud_gcp.GcpApiImpl', 'version': None, 'spark_version': '2.4.0', 'spark_submit_path': None, 'spark_streaming_submit_path': None, 'online_jar_fetch': None, 'sub_help': False, 'conf_type': None, 'online_args': None, 'chronon_jar': None, 'release_tag': None, 'list_apps': None, 'render_info': None} Running with args: {'mode': 'upload-to-kv', 'conf': 'production/group_bys/quickstart/purchases.v1_test', 'dataproc': True, 'env': 'dev', 'ds': None, 'app_name': None, 'start_ds': None, 'end_ds': None, 'parallelism': None, 'repo': '.', 'online_jar': 'cloud_gcp-assembly-0.1.0-SNAPSHOT.jar', 'online_class': 'ai.chronon.integrations.cloud_gcp.GcpApiImpl', 'version': None, 'spark_version': '2.4.0', 'spark_submit_path': None, 
'spark_streaming_submit_path': None, 'online_jar_fetch': None, 'sub_help': False, 'conf_type': None, 'online_args': None, 'chronon_jar': None, 'release_tag': None, 'list_apps': None, 'render_info': None} Array(groupby-upload-bulk-load, --conf-path=purchases.v1_test, --online-jar=cloud_gcp-assembly-0.1.0-SNAPSHOT.jar, --online-class=ai.chronon.integrations.cloud_gcp.GcpApiImpl, --conf-type=group_bys, --partition-string=2023-12-01, --additional-conf-path=additional-confs.yaml, --is-gcp, --gcp-project-id=canary-443022, --gcp-bigtable-instance-id=zipline-canary-instance)Array(groupby-upload-bulk-load, --conf-path=purchases.v1_test, --online-jar=cloud_gcp-assembly-0.1.0-SNAPSHOT.jar, --online-class=ai.chronon.integrations.cloud_gcp.GcpApiImpl, --conf-type=group_bys, --partition-string=2023-12-01, --additional-conf-path=additional-confs.yaml, --is-gcp, --gcp-project-id=canary-443022, --gcp-bigtable-instance-id=zipline-canary-instance) WARNING: sun.reflect.Reflection.getCallerClass is not supported. This will impact performance. WARNING: sun.reflect.Reflection.getCallerClass is not supported. This will impact performance. Dataproc submitter job id: c29097e9-b845-4ad7-843a-c89b622c5cfe Dataproc submitter job id: c29097e9-b845-4ad7-843a-c89b622c5cfe Setting env variables: From <common_env> setting VERSION=latest From <common_env> setting SPARK_SUBMIT_PATH=[TODO]/path/to/spark-submit From <common_env> setting JOB_MODE=local[*] From <common_env> setting HADOOP_DIR=[STREAMING-TODO]/path/to/folder/containing From <common_env> setting CHRONON_ONLINE_CLASS=[ONLINE-TODO]your.online.class From <common_env> setting CHRONON_ONLINE_ARGS=[ONLINE-TODO]args prefixed with -Z become constructor map for your implementation of ai.chronon.online.Api, -Zkv-host=<YOUR_HOST> -Zkv-port=<YOUR_PORT> From <common_env> setting PARTITION_COLUMN=ds From <common_env> setting PARTITION_FORMAT=yyyy-MM-dd From <common_env> setting CUSTOMER_ID=canary From <common_env> setting GCP_PROJECT_ID=canary-443022 From <common_env> setting GCP_REGION=us-central1 From <common_env> setting GCP_DATAPROC_CLUSTER_NAME=zipline-canary-cluster From <common_env> setting GCP_BIGTABLE_INSTANCE_ID=zipline-canary-instance From <cli_args> setting APP_NAME=chronon_group_bys_upload-to-kv_dev_quickstart.purchases.v1_test From <cli_args> setting CHRONON_CONF_PATH=./production/group_bys/quickstart/purchases.v1_test From <cli_args> setting CHRONON_ONLINE_JAR=cloud_gcp-assembly-0.1.0-SNAPSHOT.jar File production/group_bys/quickstart/purchases.v1_test uploaded to metadata/purchases.v1_test in bucket zipline-warehouse-canary. 
Running command: java -cp /Users/davidhan/zipline/chronon/cloud_gcp_submitter/target/scala-2.12/cloud_gcp_submitter-assembly-0.1.0-SNAPSHOT.jar:/var/folders/2p/h5v8s0515xv20cgprdjngttr0000gn/T/tmpkirssr9l/spark-3.5.4-bin-hadoop3/jars/* ai.chronon.integrations.cloud_gcp.DataprocSubmitter groupby-upload-bulk-load --conf-path=purchases.v1_test --online-jar=cloud_gcp-assembly-0.1.0-SNAPSHOT.jar --online-class=ai.chronon.integrations.cloud_gcp.GcpApiImpl --conf-type=group_bys --partition-string=2023-12-01 --additional-conf-path=additional-confs.yaml --gcs_files=gs://zipline-warehouse-canary/metadata/purchases.v1_test,gs://zipline-artifacts-canary/confs/additional-confs.yaml --chronon_jar_uri=gs://zipline-artifacts-canary/jars/cloud_gcp-assembly-0.1.0-SNAPSHOT.jar Setting env variables: From <common_env> setting VERSION=latest From <common_env> setting SPARK_SUBMIT_PATH=[TODO]/path/to/spark-submit From <common_env> setting JOB_MODE=local[*] From <common_env> setting HADOOP_DIR=[STREAMING-TODO]/path/to/folder/containing From <common_env> setting CHRONON_ONLINE_CLASS=[ONLINE-TODO]your.online.class From <common_env> setting CHRONON_ONLINE_ARGS=[ONLINE-TODO]args prefixed with -Z become constructor map for your implementation of ai.chronon.online.Api, -Zkv-host=<YOUR_HOST> -Zkv-port=<YOUR_PORT> From <common_env> setting PARTITION_COLUMN=ds From <common_env> setting PARTITION_FORMAT=yyyy-MM-dd From <common_env> setting CUSTOMER_ID=canary From <common_env> setting GCP_PROJECT_ID=canary-443022 From <common_env> setting GCP_REGION=us-central1 From <common_env> setting GCP_DATAPROC_CLUSTER_NAME=zipline-canary-cluster From <common_env> setting GCP_BIGTABLE_INSTANCE_ID=zipline-canary-instance From <cli_args> setting APP_NAME=chronon_group_bys_upload-to-kv_dev_quickstart.purchases.v1_test From <cli_args> setting CHRONON_CONF_PATH=./production/group_bys/quickstart/purchases.v1_test From <cli_args> setting CHRONON_ONLINE_JAR=cloud_gcp-assembly-0.1.0-SNAPSHOT.jar File production/group_bys/quickstart/purchases.v1_test uploaded to metadata/purchases.v1_test in bucket zipline-warehouse-canary. Running command: java -cp /Users/davidhan/zipline/chronon/cloud_gcp_submitter/target/scala-2.12/cloud_gcp_submitter-assembly-0.1.0-SNAPSHOT.jar:/var/folders/2p/h5v8s0515xv20cgprdjngttr0000gn/T/tmpkirssr9l/spark-3.5.4-bin-hadoop3/jars/* ai.chronon.integrations.cloud_gcp.DataprocSubmitter groupby-upload-bulk-load --conf-path=purchases.v1_test --online-jar=cloud_gcp-assembly-0.1.0-SNAPSHOT.jar --online-class=ai.chronon.integrations.cloud_gcp.GcpApiImpl --conf-type=group_bys --partition-string=2023-12-01 --additional-conf-path=additional-confs.yaml --gcs_files=gs://zipline-warehouse-canary/metadata/purchases.v1_test,gs://zipline-artifacts-canary/confs/additional-confs.yaml --chronon_jar_uri=gs://zipline-artifacts-canary/jars/cloud_gcp-assembly-0.1.0-SNAPSHOT.jar ++ cat tmp_upload_to_kv.out ++ grep 'Dataproc submitter job id' ++ cut -d ' ' -f5 + UPLOAD_TO_KV_JOB_ID=c29097e9-b845-4ad7-843a-c89b622c5cfe + check_dataproc_job_state c29097e9-b845-4ad7-843a-c89b622c5cfe + JOB_ID=c29097e9-b845-4ad7-843a-c89b622c5cfe + '[' -z c29097e9-b845-4ad7-843a-c89b622c5cfe ']' + gcloud dataproc jobs wait c29097e9-b845-4ad7-843a-c89b622c5cfe --region=us-central1 Waiting for job output... 25/01/30 18:18:42 WARN SparkConf: The configuration key 'spark.yarn.executor.failuresValidityInterval' has been deprecated as of Spark 3.5 and may be removed in the future. Please use the new key 'spark.executor.failuresValidityInterval' instead. 
25/01/30 18:18:42 WARN SparkConf: The configuration key 'spark.yarn.executor.failuresValidityInterval' has been deprecated as of Spark 3.5 and may be removed in the future. Please use the new key 'spark.executor.failuresValidityInterval' instead. 25/01/30 18:18:45 INFO Driver$GroupByUploadToKVBulkLoad$: Triggering bulk load for GroupBy: quickstart.purchases.v1_test for partition: 2023-12-01 from table: canary-443022.data.quickstart_purchases_v1_test_upload 25/01/30 18:18:47 INFO BigTableKVStoreImpl: Kicking off bulkLoad with query: EXPORT DATA OPTIONS ( format='CLOUD_BIGTABLE', overwrite=true, uri="https://bigtable.googleapis.com/projects/canary-443022/instances/zipline-canary-instance/appProfiles/GROUPBY_INGEST/tables/GROUPBY_BATCH", bigtable_options='''{ "columnFamilies" : [ { "familyId": "cf", "encoding": "BINARY", "columns": [ {"qualifierString": "value", "fieldName": ""} ] } ] }''' ) AS SELECT CONCAT(CAST(CONCAT('QUICKSTART_PURCHASES_V1_TEST_BATCH', '#') AS BYTES), key_bytes) as rowkey, value_bytes as cf, TIMESTAMP_MILLIS(1701475200000) as _CHANGE_TIMESTAMP FROM canary-443022.data.quickstart_purchases_v1_test_upload WHERE ds = '2023-12-01' 25/01/30 18:18:48 INFO BigTableKVStoreImpl: Export job started with Id: JobId{project=canary-443022, job=export_canary_443022_data_quickstart_purchases_v1_test_upload_to_bigtable_2023-12-01_1738261127353, location=null} and link: https://bigquery.googleapis.com/bigquery/v2/projects/canary-443022/jobs/export_canary_443022_data_quickstart_purchases_v1_test_upload_to_bigtable_2023-12-01_1738261127353?location=us-central1 25/01/30 18:18:48 INFO BigTableKVStoreImpl: …
## Summary Fixed all the test failures after bazel migration. Known test failures are 3 tests in Spark which deal with Resource loading of test data so we plan to temporarily disable them for now, also we have a way out to fix them after killing sbt. ## Checklist - [ ] Added Unit Tests - [x] Covered by existing CI - [ ] Integration tested - [ ] Documentation update <!-- This is an auto-generated comment: release notes by coderabbit.ai --> ## Summary by CodeRabbit Based on the comprehensive summary of changes, here are the concise release notes: ## Release Notes - **New Features** - Added support for Bazelisk installation in Docker environment - Enhanced Scala and Java dependency management - Expanded Protobuf and gRPC support - Introduced new testing utilities and frameworks - **Dependency Updates** - Updated Flink, Spark, and Kafka-related dependencies - Added new Maven artifacts for Avro, Thrift, and testing libraries - Upgraded various library versions - **Testing Improvements** - Introduced `scala_junit_suite` for more flexible test execution - Added new test resources and configurations - Enhanced test coverage and dependency management - **Build System** - Updated Bazel build configurations - Improved dependency resolution and repository management - Added new build rules and scripts - **Code Quality** - Refactored package imports and type conversions - Improved code formatting and readability - Streamlined dependency handling across modules <!-- end of auto-generated comment: release notes by coderabbit.ai --> --------- Co-authored-by: nikhil-zlai <[email protected]>
## Summary Some updates to get our Flink jobs running on the Etsy side: * Configure schema registry via host/port/scheme instead of URL * Explicitly set the task slots per task manager * Configure checkpoint directory based on teams.json ## Checklist - [ ] Added Unit Tests - [ ] Covered by existing CI - [X] Integration tested - Kicked off the job on the Etsy cluster and confirmed it's up and running - [ ] Documentation update <!-- This is an auto-generated comment: release notes by coderabbit.ai --> ## Summary by CodeRabbit - **New Features** - Introduced an enhanced configuration option for Flink job submissions by adding a state URI parameter for improved job state management. - Expanded schema registry configuration, enabling greater flexibility with host, port, and scheme settings. - **Chores** - Adjusted logging levels and refined error messaging to support better troubleshooting. - **Documentation** - Updated configuration guidance to aid in setting up schema registry integration. <!-- end of auto-generated comment: release notes by coderabbit.ai -->
## Summary
- This migrates over our artifact upload + run.py to leverage
bazel-built jars. Only for the batch side for now, streaming will
follow.
## Checklist
- [ ] Added Unit Tests
- [ ] Covered by existing CI
- [ ] Integration tested
- [ ] Documentation update
<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->
## Summary by CodeRabbit
- **Chores**
- Updated default JAR file names for online and Dataproc submissions.
- Migrated build process from `sbt` to `bazel` for GCP artifact
generation.
- Added new `submitter` binary target for Dataproc submission.
- Added dependency for Scala-specific features of the Jackson library in
the online library.
<!-- end of auto-generated comment: release notes by coderabbit.ai -->
<!-- av pr metadata
This information is embedded by the av CLI when creating PRs to track
the status of stacks when using Aviator. Please do not delete or edit
this section of the PR.
```
{"parent":"main","parentHead":"","trunk":"main"}
```
-->
---------
Co-authored-by: Thomas Chow <[email protected]>
## Summary
## Checklist
- [ ] Added Unit Tests
- [ ] Covered by existing CI
- [ ] Integration tested
- [ ] Documentation update
<!-- av pr metadata
This information is embedded by the av CLI when creating PRs to track
the status of stacks when using Aviator. Please do not delete or edit
this section of the PR.
```
{"parent":"main","parentHead":"","trunk":"main"}
```
-->
<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->
## Summary by CodeRabbit
- **Chores**
- Streamlined the build process with more consistent naming conventions
for cloud deployment artifacts.
- **New Features**
- Enhanced support for macOS environments by introducing
platform-specific handling during the build, ensuring improved
compatibility.
<!-- end of auto-generated comment: release notes by coderabbit.ai -->
Co-authored-by: Thomas Chow <[email protected]>
## Summary To add missing dependencies for flink module coming from our recent changes to keep it in sync with sbt Tested locally by running Flink jobs from DataprocSubmitterTest ## Checklist - [ ] Added Unit Tests - [x] Covered by existing CI - [ ] Integration tested - [ ] Documentation update <!-- This is an auto-generated comment: release notes by coderabbit.ai --> ## Summary by CodeRabbit - **New Features** - Expanded Kafka integration with enhanced authentication, client functionality, and Protobuf serialization support. - Improved JSON processing support for Scala-based operations. - Adjusted dependency versions to ensure better compatibility and stability with Kafka and cloud services. <!-- end of auto-generated comment: release notes by coderabbit.ai -->
## Summary Newer version with better ergonomics, and more maintainable codebase - smaller files, simpler logic - with little indirection etc with user facing simplification - [x] always compile the whole repo - [x] teams thrift and teams python - [x] compile context as its own object - [ ] integration test on sample - [ ] progress bar for compile - [ ] sync to remote w/ progress bar - [ ] display workflow - [ ] display workflow progress ## Checklist - [x] Added Unit Tests - [ ] Covered by existing CI - [x] Integration tested - [ ] Documentation update <!-- This is an auto-generated comment: release notes by coderabbit.ai --> ## Summary by CodeRabbit - **New CLI Commands** - Introduced streamlined commands for synchronizing, backfilling, and deploying operations for easier management. - **Enhanced Logging** - Improved colored, structured log outputs for clearer real-time monitoring and debugging. - **Configuration & Validation Upgrades** - Strengthened configuration management and validation processes to ensure reliable operations. - Added a comprehensive validation framework for Chronon API thrift objects. - **Build & Infrastructure Improvements** - Transitioned to a new container base and modernized testing/build systems for better performance and stability. - **Team & Utility Enhancements** - Expanded team configuration options and refined utility processes to streamline overall workflows. - Introduced new data classes and methods for improved configuration and compilation context management. - Enhanced templating functionality for dynamic replacements based on object properties. - Improved handling of Git operations and enhanced error logging for better traceability. <!-- end of auto-generated comment: release notes by coderabbit.ai --> --------- Co-authored-by: Ken Morton <[email protected]> Co-authored-by: Kumar Teja Chippala <[email protected]> Co-authored-by: Kumar Teja Chippala <[email protected]>
## Summary Update the Flink job code on the tiling path to use the TileKey. I haven't wired up the KV store side of things yet (can do the write and read side of the KV store collaboratively with Thomas as they need to go together to keep the tests happy). The tiling version of the Flink job isn't in use so these changes should be safe to go and keeps things incremental. ## Checklist - [ ] Added Unit Tests - [X] Covered by existing CI - [ ] Integration tested - [ ] Documentation update <!-- This is an auto-generated comment: release notes by coderabbit.ai --> ## Summary by CodeRabbit - **New Features** - Added a utility to determine the start timestamp for a defined time window. - **Refactor/Enhancements** - Streamlined time window handling by providing a default one-day resolution when none is specified. - Improved tiled data processing with consistent tiling window sizing and enriched metadata management. - **Tests** - Updated integration tests to validate the new tile processing and time window behavior. <!-- end of auto-generated comment: release notes by coderabbit.ai -->
## Summary Updated dev notes with instructions for Bazel setup and some useful commands. Also updated bazel target names so the intemediate/uber jar names don't conflict ## Checklist - [ ] Added Unit Tests - [x] Covered by existing CI - [ ] Integration tested - [ ] Documentation update <!-- This is an auto-generated comment: release notes by coderabbit.ai --> ## Summary by CodeRabbit - **Refactor** - Streamlined naming conventions and updated dependency references across multiple project modules for improved consistency. - **Documentation** - Expanded build documentation with a new, comprehensive Bazel Setup section detailing configuration, caching, testing, and deployment instructions. - **Build System Enhancements** - Introduced updated source-generation rules to support future multi-language integration and more robust build workflows. - **Workflow Updates** - Modified test target names in CI workflows to reflect updated naming conventions, enhancing clarity and consistency in test execution. <!-- end of auto-generated comment: release notes by coderabbit.ai -->
… fix (#322) ## Summary Updated dev notes with Bazel installation instructions and java error fix ## Checklist - [ ] Added Unit Tests - [ ] Covered by existing CI - [ ] Integration tested - [ ] Documentation update <!-- This is an auto-generated comment: release notes by coderabbit.ai --> ## Summary by CodeRabbit - **Documentation** - Enhanced setup guidance for Bazel installation on both Mac and Linux. - Provided clear instructions for resolving Java-related issues on Mac. - Updated testing procedures by replacing previous instructions with streamlined Bazel commands. <!-- end of auto-generated comment: release notes by coderabbit.ai -->
## Summary Increased timeout and java heap size for spark tests to avoid flaky test failures ## Checklist - [ ] Added Unit Tests - [x] Covered by existing CI - [ ] Integration tested - [ ] Documentation update <!-- This is an auto-generated comment: release notes by coderabbit.ai --> ## Summary by CodeRabbit - **Tests** - Extended test timeout settings to 900 seconds for enhanced testing robustness. - Updated job names and workflow references for better clarity and consistency in testing workflows. <!-- end of auto-generated comment: release notes by coderabbit.ai -->
## Summary Add jvm_binary targets for service and hub modules to build final assembly jars for deployment ## Checklist - [ ] Added Unit Tests - [ ] Covered by existing CI - [ ] Integration tested - [ ] Documentation update
## Summary ## Checklist - [ ] Added Unit Tests - [ ] Covered by existing CI - [ ] Integration tested - [ ] Documentation update <!-- This is an auto-generated comment: release notes by coderabbit.ai --> ## Summary by CodeRabbit - **Chores** - Updated the dependency supporting Apache Spark functionality to boost backend data processing efficiency. <!-- end of auto-generated comment: release notes by coderabbit.ai -->
## Summary Remove flink streaming scala dependency as we no longer need it otherwise we will run into runtime error saying flink-shaded-guava package not found ## Checklist - [ ] Added Unit Tests - [x] Covered by existing CI - [ ] Integration tested - [ ] Documentation update <!-- This is an auto-generated comment: release notes by coderabbit.ai --> ## Summary by CodeRabbit - **Bug Fixes** - Improved dependency handling to reduce the risk of runtime errors such as class incompatibilities. <!-- end of auto-generated comment: release notes by coderabbit.ai -->
## Summary Solves sync failures ## Checklist - [ ] Added Unit Tests - [ ] Covered by existing CI - [ ] Integration tested - [ ] Documentation update <!-- This is an auto-generated comment: release notes by coderabbit.ai --> ## Summary by CodeRabbit - **Chores** - Streamlined the Java build configuration by removing legacy integration and testing support. <!-- end of auto-generated comment: release notes by coderabbit.ai -->
## Summary Hit some errors as our Spark deps pull in rocksdbjni 8.3.2 whereas we expect an older version in Flink (6.20.3-ververica-2.0). As we rely on user class first it seems like this newer version gets priority and when Flink is closing tiles we hit an error - ``` 2025-02-05 21:14:53,614 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph [] - Tiling for etsy.listing_canary.actions_v1 -> (Tiling Side Output Late Data for etsy.listing_canary.actions_v1, Avro conversion for etsy.listing_canary.actions_v1 -> async kvstore writes for etsy.listing_canary.actions_v1 -> Sink: Metrics Sink for etsy.listing_canary.actions_v1) (2/3) (a107444db4dad3eb79d9d02631d8696e_5627cd3c4e8c9c02fa4f114c4b3607f4_1_56) switched from RUNNING to FAILED on container_1738197659103_0039_01_000004 @ zipline-canary-cluster-w-1.us-central1-c.c.canary-443022.internal (dataPort=33465). java.lang.NoSuchMethodError: 'void org.rocksdb.WriteBatch.remove(org.rocksdb.ColumnFamilyHandle, byte[])' at org.apache.flink.contrib.streaming.state.RocksDBWriteBatchWrapper.remove(RocksDBWriteBatchWrapper.java:105) ``` Yanked it out from the two jars and confirmed that the Flink job seems to be running fine + crossing over across hours (and hence tile closures). ## Checklist - [ ] Added Unit Tests - [ ] Covered by existing CI - [X] Integration tested - [ ] Documentation update <!-- This is an auto-generated comment: release notes by coderabbit.ai --> ## Summary by CodeRabbit - Chores - Made internal adjustments to dependency management to improve compatibility between libraries and enhance overall application stability. <!-- end of auto-generated comment: release notes by coderabbit.ai -->
Reverts #330 It's breaking our build as we are using `java_test_suite` for service_commons module <!-- This is an auto-generated comment: release notes by coderabbit.ai --> ## Summary by CodeRabbit - **New Features** - Enhanced configuration to support improved Java testing capabilities. - Expanded build system functionality with additional integrations for JVM-based test suites. <!-- end of auto-generated comment: release notes by coderabbit.ai -->
## Summary Some of the benefits using LayerChart over echarts/uplot - Broad support of chart types (cartesian, polar, hierarchy, graph, force, geo) - Simplicity in setup and customization (composable chart components) - Responsive charts, both for viewport/container, and also theming (light/dark, etc) - Flexibility in design/stying (CSS variables, classes, color scales, etc) including transitions - Ability to opt into canvas or svg rendering context as the use case requires. LayerChart's canvas support also has CSS variable/styling support as well (which is unique as far as I'm aware). Html layers are also available, which are great for multiline text (with truncation). ## Checklist - [ ] Added Unit Tests - [ ] Covered by existing CI - [ ] Integration tested - [ ] Documentation update <!-- This is an auto-generated comment: release notes by coderabbit.ai --> ## Summary by CodeRabbit - **New Features** - Introduced new interactive charts, including `FeaturesLineChart` and `PercentileLineChart`, enhancing data visualization with detailed tooltips. - **Refactor** - Replaced legacy ECharts components with LayerChart components, streamlining chart interactions and state management. - **Chores** - Updated dependency configurations and Tailwind CSS settings for improved styling and performance. - Removed unused ECharts-related components and functions to simplify the codebase. <!-- end of auto-generated comment: release notes by coderabbit.ai --> --- - To see the specific tasks where the Asana app for GitHub is being used, see below: - https://app.asana.com/0/0/1209163154826936 <!-- av pr metadata This information is embedded by the av CLI when creating PRs to track the status of stacks when using Aviator. Please do not delete or edit this section of the PR. ``` {"parent":"main","parentHead":"","trunk":"main"} ``` --> --------- Co-authored-by: Sean Lynch <[email protected]>
## Summary Allows connecting to `app` docker container on `localhost:5005` via IntelliJ. See [tutorial](https://www.jetbrains.com/help/idea/tutorial-remote-debug.html) for more details, but the summary is - Recreate the docker containers (`docker-init/build.sh --all`) - This will pass the `-agentlib:jdwp=transport=dt_socket,server=y,suspend=n,address=*:5005` CLI arguments and expose `5005` on the host (i.e. your laptop) - You will see `Listening for transport dt_socket at address: 5005` in the container logs  - In IntelliJ, add a new `Remote JVM Debug` config with default settings  - Set a breakpoint and then trigger it (from the frontend or calling endpoints directly)  This has been useful to understand the [java.lang.NullPointerException](https://app.asana.com/0/home/1208932362205799/1209321714844239) error when fetching summaries for some columns/features (ex. `dim_merchant_account_type`, `dim_merchant_country`)  ## Checklist - [ ] Added Unit Tests - [ ] Covered by existing CI - [ ] Integration tested - [ ] Documentation update <!-- av pr metadata This information is embedded by the av CLI when creating PRs to track the status of stacks when using Aviator. Please do not delete or edit this section of the PR. ``` {"parent":"main","parentHead":"","trunk":"main"} ``` --> Co-authored-by: Sean Lynch <[email protected]>
## Summary Added a new workflow to verify the bazel config setup. This essentially validates all our bazel config by pulling all the necessary dependencies for all targets with out actually building them which is similar to the `Sync project` option in IntelliJ. This should help us capture errors in our bazel config setup for the CI. ## Checklist - [ ] Added Unit Tests - [ ] Covered by existing CI - [ ] Integration tested - [ ] Documentation update <!-- This is an auto-generated comment: release notes by coderabbit.ai --> ## Summary by CodeRabbit - Chores - Added a new automated CI workflow to run configuration tests with improved concurrency management. - Expanded CI triggers to include additional modules for broader testing coverage. - Tests - Temporarily disabled a dynamic class loading test in the cloud integrations to improve overall test stability. <!-- end of auto-generated comment: release notes by coderabbit.ai -->
## Summary The codebase previously operated under the assumption that partition listing is a cheap operation. It is not the case for GCS Format - partition listing is expensive on GCS external tables. ## Checklist - [ ] Added Unit Tests - [x] Covered by existing CI - [ ] Integration tested - [ ] Documentation update <!-- This is an auto-generated comment: release notes by coderabbit.ai --> ## Summary by CodeRabbit - **Bug Fixes** - Improved error handling to provide clearer messaging when data retrieval issues occur. - **Refactor** - Streamlined internal processing for data partitions, resulting in more consistent behavior. - Enhanced logging during data scanning for easier troubleshooting. - Simplified logic for handling intersected ranges, ensuring consistent definitions. - Reduced the volume of test data for improved performance and resource utilization during tests. <!-- end of auto-generated comment: release notes by coderabbit.ai -->
## Summary the default column reader batch size is 4096 - reads that many rows into memory buffer at once. that causes ooms on large columns, for catalyst we only need to read one row at a time. for interactive we set the limit to 16. tested on etsy data. ## Checklist - [ ] Added Unit Tests - [x] Covered by existing CI - [ ] Integration tested - [ ] Documentation update <!-- This is an auto-generated comment: release notes by coderabbit.ai --> ## Summary by CodeRabbit - **New Features** - Enhanced data processing performance by adding an optimized configuration for reading Parquet files in Spark sessions. <!-- end of auto-generated comment: release notes by coderabbit.ai -->
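For reference, a minimal sketch of what this tuning looks like when constructing a Spark session; `spark.sql.parquet.columnarReaderBatchSize` is the standard Spark SQL property for the vectorized Parquet reader batch size, and the value of 16 mirrors the interactive limit described above. The session-builder usage here is illustrative, not the repo's actual setup code.

```scala
import org.apache.spark.sql.SparkSession

// Sketch: shrink the vectorized Parquet reader batch size so very wide or
// very large columns are not buffered 4096 rows at a time (Spark's default),
// which is what caused the OOMs described above.
val spark = SparkSession
  .builder()
  .appName("interactive-read-example") // hypothetical app name
  .config("spark.sql.parquet.columnarReaderBatchSize", "16")
  .getOrCreate()
```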
## Summary regarding https://app.asana.com/0/1208277377735902/1209321714844239 No longer are nulls put into the histogram array, instead we use the `Constants.magicNullLong`. We will have to filter this out from the charts as empty value in the FE. ## Checklist - [ ] Added Unit Tests - [ ] Covered by existing CI - [ ] Integration tested - [ ] Documentation update <!-- This is an auto-generated comment: release notes by coderabbit.ai --> ## Summary by CodeRabbit - **New Features** - Introduced an additional summary retrieval for merchant account types, providing enhanced insights. - **Bug Fixes** - Improved data processing reliability by substituting missing values with default placeholders, ensuring more consistent results. - Enhanced handling of null values in histogram data for more accurate representation. <!-- end of auto-generated comment: release notes by coderabbit.ai -->
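As a rough illustration of the sentinel substitution described above (a sketch only: the real sentinel is Chronon's `Constants.magicNullLong`, whose actual value is not shown here, and the helper name is made up):

```scala
// Sketch: replace nulls in a histogram array with a sentinel long so the
// frontend can later filter the sentinel out as an empty value.
val magicNullLong: java.lang.Long = java.lang.Long.MIN_VALUE // stand-in value

def sanitizeHistogram(values: Seq[java.lang.Long]): Seq[java.lang.Long] =
  values.map(v => if (v == null) magicNullLong else v)

// sanitizeHistogram(Seq(1L, null, 3L)) => Seq(1, <sentinel>, 3)
```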
## Summary
## Checklist
- [ ] Added Unit Tests
- [ ] Covered by existing CI
- [ ] Integration tested
- [ ] Documentation update
<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->
## Summary by CodeRabbit
- **Documentation**
- Added a new section for connecting remotely to a Java process in
Docker using IntelliJ's remote debugging feature.
- Clarified commit message formatting instructions for better
consistency.
<!-- end of auto-generated comment: release notes by coderabbit.ai -->
<!-- av pr metadata
This information is embedded by the av CLI when creating PRs to track
the status of stacks when using Aviator. Please do not delete or edit
this section of the PR.
```
{"parent":"main","parentHead":"","trunk":"main"}
```
-->
---------
Co-authored-by: Sean Lynch <[email protected]>
## Summary We need to make sure no nulls get passed thru the transpose method. should wrap up this ticket: https://app.asana.com/0/1208277377735902/1209343355165466 ## Checklist - [ ] Added Unit Tests - [ ] Covered by existing CI - [ ] Integration tested - [ ] Documentation update <!-- This is an auto-generated comment: release notes by coderabbit.ai --> ## Summary by CodeRabbit - **Bug Fixes** - Enhanced error handling ensures that invalid or missing values are consistently processed, leading to more reliable data outcomes. - **Tests** - Added test cases to validate the updated handling of null and non-numeric values, ensuring robustness in data processing. <!-- end of auto-generated comment: release notes by coderabbit.ai -->
## Summary
## Checklist
- [ ] Added Unit Tests
- [ ] Covered by existing CI
- [ ] Integration tested
- [ ] Documentation update
<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->
## Summary by CodeRabbit
- **New Features**
- Introduced an enhanced drift metric toggle for selecting and exploring
drift data.
- Updated chart visualizations now accurately display drift column data.
- Join page headers have been refreshed to highlight drift series
details.
- **Bug Fixes**
- Resolved issues with data fetching and display related to drift
metrics.
- **Refactor**
- Consolidated data endpoints to focus on drift metrics, replacing older
timeseries methods.
- Improved sorting and data processing, resulting in a more consistent
and reliable display of data distributions.
<!-- end of auto-generated comment: release notes by coderabbit.ai -->
<!-- av pr metadata
This information is embedded by the av CLI when creating PRs to track
the status of stacks when using Aviator. Please do not delete or edit
this section of the PR.
```
{"parent":"main","parentHead":"","trunk":"main"}
```
-->
---------
Co-authored-by: Sean Lynch <[email protected]>
## Summary ## Checklist - [ ] Added Unit Tests - [ ] Covered by existing CI - [ ] Integration tested - [ ] Documentation update <!-- This is an auto-generated comment: release notes by coderabbit.ai --> ## Summary by CodeRabbit - New Features - Introduced the “Inter” font for enhanced typography, now available in both regular and italic styles. The default sans-serif stack was updated to incorporate Inter for a more modern visual experience. - Documentation - Provided comprehensive documentation that includes installation instructions and licensing details for the new font, ensuring clarity on usage rights and guidelines. <!-- end of auto-generated comment: release notes by coderabbit.ai -->
…ant cols from left (#577) ## Summary spark optimizations ## Checklist - [ ] Added Unit Tests - [ ] Covered by existing CI - [ ] Integration tested - [ ] Documentation update <!-- This is an auto-generated comment: release notes by coderabbit.ai --> ## Summary by CodeRabbit - **New Features** - Introduced a new configuration option that lets users control time range validation during join backfill operations, providing more consistent data processing. - Added a method for converting partition ranges to time ranges, enhancing flexibility in time range handling. - Added a new test case for data generation and validation during join operations. - **Refactor** - Optimized the way data merging queries are constructed and how time ranges are applied during join operations, ensuring enhanced precision and performance across event data processing. - Simplified method signatures by removing unnecessary parameters, streamlining the process of generating Bloom filters. <!-- end of auto-generated comment: release notes by coderabbit.ai --> --------- Co-authored-by: ezvz <[email protected]>
## Summary Adding step days of 1 to source job ## Checklist - [ ] Added Unit Tests - [ ] Covered by existing CI - [ ] Integration tested - [ ] Documentation update <!-- This is an auto-generated comment: release notes by coderabbit.ai --> ## Summary by CodeRabbit - **New Features** - Data processing is now handled in daily segments, providing more precise and timely results. - **Bug Fixes** - Error messages have been refined to clearly indicate the specific day when a query yields no results, improving clarity during troubleshooting. <!-- end of auto-generated comment: release notes by coderabbit.ai --> --------- Co-authored-by: ezvz <[email protected]>
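To make the stepping concrete, here is a small, self-contained sketch of splitting an inclusive date range into single-day steps; the helper and types below are illustrative and are not the codebase's actual `PartitionRange` API.

```scala
import java.time.LocalDate

// Sketch: break an inclusive [start, end] range into one-day steps, so each
// day runs as its own source query and an empty result can be reported for
// the specific day that produced it.
def dailySteps(start: LocalDate, end: LocalDate): Seq[(LocalDate, LocalDate)] =
  Iterator
    .iterate(start)(_.plusDays(1))
    .takeWhile(!_.isAfter(end))
    .map(day => (day, day))
    .toSeq

// dailySteps(LocalDate.parse("2023-11-01"), LocalDate.parse("2023-11-03"))
//   => Seq((2023-11-01, 2023-11-01), (2023-11-02, 2023-11-02), (2023-11-03, 2023-11-03))
```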
## Summary Using tableUtils.partitionColumn rather than hardcoded ds ## Checklist - [ ] Added Unit Tests - [ ] Covered by existing CI - [ ] Integration tested - [ ] Documentation update Co-authored-by: ezvz <[email protected]>
## Summary Implements the simple label join logic to create a materialized table by joining forward looking partitions from the snapshot table back of labelJoinParts back to join output. ## Checklist - [ ] Added Unit Tests - [ ] Covered by existing CI - [ ] Integration tested - [ ] Documentation update <!-- This is an auto-generated comment: release notes by coderabbit.ai --> ## Summary by CodeRabbit - **New Features** - Introduced an updated mechanism for formatting and standardizing label outputs with the addition of `outputLabelTableV2`. - Added a new distributed label join operation with the `LabelJoinV2` class, featuring robust validations, comprehensive error handling, and detailed logging for improved data integration. - Implemented a comprehensive test suite for the `LabelJoinV2` functionality to ensure accuracy and reliability of label joins. - **Updates** - Replaced the existing `LabelJoin` class with the new `LabelJoinV2` class, enhancing the label join process. <!-- end of auto-generated comment: release notes by coderabbit.ai --> --------- Co-authored-by: ezvz <[email protected]>
## Summary Right now, the gcp integration tests do not recognize failures from dataproc and continue running the commands. This has led to the group_by_upload job hanging indefinitely. This change ensures that if the dataproc job state is "ERROR" we fail immediately. ## Checklist - [ ] Added Unit Tests - [ ] Covered by existing CI - [ ] Integration tested - [ ] Documentation update <!-- This is an auto-generated comment: release notes by coderabbit.ai --> ## Summary by CodeRabbit - **Bug Fixes** - Enhanced job status checking by ensuring that both empty and explicitly error-indicating states trigger consistent failure responses, resulting in more robust job evaluation. <!-- end of auto-generated comment: release notes by coderabbit.ai -->
Walkthrough

This pull request enhances DataFrame processing in Spark jobs.
```scala
  requiredColumnsDf
} else {
  // Else we just need the base DF without the UUID
  leftDfWithUUID
```
doesn't this include the UUID?
Co-authored-by: Thomas Chow <[email protected]>
## Summary Right now, the gcp integration tests do not recognize failures from dataproc and continue running the commands. This has led to the group_by_upload job hanging indefinitely. This change ensures that if the dataproc job state is "ERROR" we fail immediately. ## Checklist - [ ] Added Unit Tests - [ ] Covered by existing CI - [ ] Integration tested - [ ] Documentation update <!-- This is an auto-generated comment: release notes by coderabbit.ai --> ## Summary by CodeRabbit - **Bug Fixes** - Enhanced job status checking by ensuring that both empty and explicitly error-indicating states trigger consistent failure responses, resulting in more robust job evaluation. <!-- end of auto-generated comment: release notes by coderabbit.ai -->
Co-authored-by: Thomas Chow <[email protected]>
Co-authored-by: Thomas Chow <[email protected]> Co-authored-by: Thomas Chow <[email protected]>
Co-authored-by: Thomas Chow <[email protected]>
```scala
def run(): Unit = {
  val leftDfForSchema = tableUtils.scanDf(query = null, table = leftInputTable, range = Some(dateRange))
  val leftSchema = leftDfForSchema.schema
  val bootstrapInfo =
```
btw, can bootstrapInfo be computed per stepRange? I'm wondering if we should just push this logic into the dateRange.steps iteration below.
Co-authored-by: Thomas Chow <[email protected]>
Actionable comments posted: 0
🧹 Nitpick comments (3)
spark/src/main/scala/ai/chronon/spark/MergeJob.scala (3)
79-82: Bootstrap cost might be high.
91-107: Limit log size if sets are large.
121-121: Consider caching partialDf.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro (Legacy)
📒 Files selected for processing (1)
spark/src/main/scala/ai/chronon/spark/MergeJob.scala (6 hunks)
🧰 Additional context used
🧬 Code Definitions (1)
spark/src/main/scala/ai/chronon/spark/MergeJob.scala (1)
spark/src/main/scala/ai/chronon/spark/BootstrapInfo.scala (3)
BootstrapInfo (51-74), BootstrapInfo (76-346), from (80-345)
⏰ Context from checks skipped due to timeout of 90000ms (14)
- GitHub Check: streaming_tests
- GitHub Check: streaming_tests
- GitHub Check: spark_tests
- GitHub Check: join_tests
- GitHub Check: analyzer_tests
- GitHub Check: groupby_tests
- GitHub Check: join_tests
- GitHub Check: fetcher_tests
- GitHub Check: groupby_tests
- GitHub Check: analyzer_tests
- GitHub Check: spark_tests
- GitHub Check: fetcher_tests
- GitHub Check: scala_compile_fmt_fix
- GitHub Check: enforce_triggered_workflows
🔇 Additional comments (14)
spark/src/main/scala/ai/chronon/spark/MergeJob.scala (14)
3-3: No conflicts observed.
32-32: No issues found.
51-56: Logic is clear.
65-71: Confirm entity vs. event columns.
86-86: UUID could be expensive.
88-90: Clean approach to limit columns.
108-112: Implementation looks solid.
132-134: No concerns.
136-148: Inner join drops unmatched rows; verify.
241-243: All good here.
244-244: No problems noted.
247-249: Verify no user fields are skipped.
271-271: Simplified signature is good.
274-277: Confirm custom filtering.
## Summary Update Integration Tests to update aws-crew and gcp-crew for integration test failures. ## Checklist - [ ] Added Unit Tests - [ ] Covered by existing CI - [ ] Integration tested - [ ] Documentation update <!-- This is an auto-generated comment: release notes by coderabbit.ai --> ## Summary by CodeRabbit - **New Features** - Added automated Slack notifications that alert the team immediately upon integration test failures in both AWS and GCP environments. - **Chores** - Updated notification settings to ensure alerts are routed to the correct Slack channel for effective troubleshooting. <!-- end of auto-generated comment: release notes by coderabbit.ai -->
… submission logic (#540) ## Summary Created PubSub interfaces with GCP implementation. Updated NodeExecutionActivity implementation for job submission using PubSub publisher to publish messages for agent to pick up. Added necessary unit and integration tests ## Checklist - [x] Added Unit Tests - [ ] Covered by existing CI - [ ] Integration tested - [ ] Documentation update <!-- This is an auto-generated comment: release notes by coderabbit.ai --> ## Summary by CodeRabbit - **New Features** - Introduced a robust Pub/Sub module with administration, configuration, publisher, subscriber, and manager components. The module now supports asynchronous job submission with improved error handling and offers flexible production/emulator configurations. - **Documentation** - Added comprehensive usage guidance with examples for the Pub/Sub integration. - **Tests** - Expanded unit and integration tests to cover all Pub/Sub operations, including message publishing, retrieval, and error handling scenarios. - **Chores** - Updated dependency versions and build configurations, including new Maven artifacts for enhanced Google Cloud API support. Updated git ignore rules to exclude additional directories. - **Refactor** - Streamlined the persistence layer by removing obsolete key definitions. <!-- end of auto-generated comment: release notes by coderabbit.ai -->
## Summary ^^^ Ran both AWS and GCP integration tests. Following this PR, i'll work on rebasing #549 which applies the confs to the submitter interface. Previously, to compile it would be: ``` zipline compile --conf=group_bys/aws/purchases.py ``` but now `--conf` or `--input-path` is not required, just: ``` zipline compile ``` and this compiles the entire repo. In addition, the compiled files gets persisted in the `compiled` folder. No longer are compiled files persisted by environment like `production/...` ## Checklist - [ ] Added Unit Tests - [x] Covered by existing CI - [x] Integration tested - [ ] Documentation update GCP and AWS integration tests pass ``` <<<<<.....................................COMPILE.....................................>>>>> + zipline compile --chronon_root=/Users/davidhan/zipline/chronon/api/python/test/canary 17:49:15 INFO parse_teams.py:56 - Processing teams from path /Users/davidhan/zipline/chronon/api/python/test/canary/teams.py INFO:ai.chronon.cli.logger:Processing teams from path /Users/davidhan/zipline/chronon/api/python/test/canary/teams.py GroupBy-s: Compiled 5 objects from 3 files. Added aws.plaid_fv.v1 Added aws.purchases.v1_dev Added aws.purchases.v1_test Added gcp.purchases.v1_dev Added gcp.purchases.v1_test .... <<<<<<<<<<<<<<<<-----------------JOB STATUS----------------->>>>>>>>>>>>>>>>> + aws emr wait step-complete --cluster-id j-13BASWFP15TLR --step-id s-02878283D08B3510ZGCZ ++ aws emr describe-step --cluster-id j-13BASWFP15TLR --step-id s-02878283D08B3510ZGCZ --query Step.Status.State ++ tr -d '"' + STEP_STATE=COMPLETED + '[' COMPLETED '!=' COMPLETED ']' + echo succeeded succeeded + echo -e '\033[0;32m<<<<<.....................................SUCCEEDED!!!.....................................>>>>>\033[0m' <<<<<.....................................SUCCEEDED!!!.....................................>>>>> ``` <!-- This is an auto-generated comment: release notes by coderabbit.ai --> ## Summary by CodeRabbit ## Summary by CodeRabbit - **New Features** - Introduced an enhanced CLI compile command with an option to specify a custom configuration root. - Enabled dynamic loading of team configurations from both JSON and Python formats. - Added support for multiple error reporting in various components. - **Improvements** - Streamlined error and status messages for clearer, simplified feedback. - Enhanced error handling mechanisms across the application. - Updated quickstart scripts for AWS and GCP to utilize a unified, precompiled configuration set. - Modified the way environment variables are structured for improved clarity and management. - **Dependency Updates** - Added new dependencies to improve functionality. - Upgraded several existing dependencies to boost performance and enhance output formatting. <!-- end of auto-generated comment: release notes by coderabbit.ai --> --------- Co-authored-by: Nikhil Simha <[email protected]>
Co-authored-by: Thomas Chow <[email protected]>
Co-authored-by: Thomas Chow <[email protected]>
## Summary
Adds a flag to turn on "required columns only" for the merge job, to optimize cases where the left side has large columns that are not required for the join and that we don't want to carry through every shuffle.
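The rough shape of the optimization, as a hedged sketch rather than the actual `MergeJob` implementation (column and helper names below are illustrative): tag each left row with a UUID, push only the join keys plus that UUID through the shuffles, and join the wide left columns back once at the end.

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{col, expr}

// Sketch of "required columns only": shuffle just the keys + a row id, then
// recover the wide, non-required left columns with one final join.
def mergeRequiredColumnsOnly(leftDf: DataFrame, rightDf: DataFrame, keys: Seq[String]): DataFrame = {
  val rowId = "left_row_uuid" // hypothetical column name
  val leftDfWithUUID = leftDf.withColumn(rowId, expr("uuid()"))

  // Only the join keys and the row id ride through the (possibly repeated) shuffles.
  val requiredColumnsDf = leftDfWithUUID.select((keys :+ rowId).map(col): _*)

  val joined = requiredColumnsDf.join(rightDf, keys)

  // Bring back all original left columns with a single join on the row id.
  leftDfWithUUID.join(joined.drop(keys: _*), Seq(rowId)).drop(rowId)
}
```

Whether the UUID tag and the extra final join pay off depends on how wide the non-required left columns are relative to the number of shuffles, which is presumably why the behavior is gated behind a flag.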
## Checklist
## Summary by CodeRabbit
- **New Features**
- **Refactor**
- **Bug Fixes**
- **Tests**