fix: use tiling flag from groupby + code re-use #538
## Summary This PR aligns the UI with the Figma design (~95% match), allowing for minor padding, color, and typography differences. ### Key Updates - **Job Status Colors:** Updated to match Figma; defined in `app.css`. - **Zipline Logo:** Replaced with the new official logo. - **Reusable Metadata Table:** Added `MetadataTable.svelte` component for flexible use. - **Support Buttons:** Adjusted behavior in the bottom-left navigation bar. - **Route Change:** Renamed `job-tracker` to `job-tracking`. ## Checklist - [ ] Added Unit Tests - [ ] Covered by existing CI - [ ] Integration tested - [ ] Documentation update <!-- This is an auto-generated comment: release notes by coderabbit.ai --> ## Summary by CodeRabbit ## Release Notes - **New Features** - Added a new `MetadataTable` component for structured data display. - Introduced new icons for improved visual navigation. - Enhanced job tracking and observability pages with modular components. - **UI/UX Improvements** - Updated favicon from PNG to SVG. - Refined styling for buttons, tabs, and status indicators. - Improved navigation bar with new feedback and support buttons. - **Styling Changes** - Revamped job status color and border styling. - Adjusted component padding and margins. - Updated color variables in CSS and Tailwind configuration. - **Component Enhancements** - Added flexibility to existing components with new props. - Removed `LogsDownloadButton` and replaced with a more generic button. - Improved metadata and job tracking displays. - Enhanced collapsible sections for better consistency. - Introduced new properties for better customization in components. <!-- end of auto-generated comment: release notes by coderabbit.ai --> --------- Co-authored-by: Sean Lynch <[email protected]>
## Summary Accept `1h`, `30d`, etc. as valid entries for the `windows` arg of Aggregation. ## Checklist - [x] Added Unit Tests - [x] Covered by existing CI - [x] Integration tested - [x] Documentation update <!-- This is an auto-generated comment: release notes by coderabbit.ai --> ## Summary by CodeRabbit ## Release Notes - **New Features** - Enhanced window specification in GroupBy functionality, now supporting string-based time window definitions (e.g., "7d", "1h"). - **Improvements** - Simplified window configuration syntax across multiple files. - Removed deprecated `Window` and `TimeUnit` class imports. - Improved code readability in various sample and test files. - **Tests** - Added new test case to validate string-based window specifications. - **Documentation** - Updated documentation examples to reflect new window specification method. <!-- end of auto-generated comment: release notes by coderabbit.ai -->
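A minimal sketch of how such shorthand window strings can be normalized into a (length, unit) pair. This is an illustrative parser only, assuming `h`/`d` suffixes; it is not the project's actual implementation.

```python
import re

# Map shorthand suffixes to time units; the unit names here are assumptions.
_UNITS = {"h": "HOURS", "d": "DAYS"}

def parse_window(spec: str):
    """Parse shorthand window specs such as '1h' or '30d' into (length, unit)."""
    match = re.fullmatch(r"(\d+)([hd])", spec.strip().lower())
    if not match:
        raise ValueError(f"Invalid window spec: {spec!r}")
    length, suffix = match.groups()
    return int(length), _UNITS[suffix]

print(parse_window("7d"))   # (7, 'DAYS')
print(parse_window("1h"))   # (1, 'HOURS')
```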
## Summary Switched over to using the layerstack format function. This should preserve functionality (and treat dates in the job tracker "as is" without converting for timezone). ## Checklist - [ ] Added Unit Tests - [ ] Covered by existing CI - [ ] Integration tested - [ ] Documentation update <!-- This is an auto-generated comment: release notes by coderabbit.ai --> ## Summary by CodeRabbit - **Changes in Date Formatting** - Removed `formatDate` function from multiple components. - Updated date formatting approach to utilize a new utility function for consistency. - Simplified date display logic, enhancing clarity in job run information. - **Impact on User Experience** - Date representations may appear slightly different. - Consistent date formatting across the application. - Minor visual changes in job tracking and observability views. <!-- end of auto-generated comment: release notes by coderabbit.ai -->
## Summary Migrated `Flink` module to Bazel from sbt. ## Checklist - [ ] Added Unit Tests - [x] Covered by existing CI - [ ] Integration tested - [ ] Documentation update <!-- This is an auto-generated comment: release notes by coderabbit.ai --> ## Summary by CodeRabbit - **New Features** - Added Scala library definitions for the main application and testing. - Introduced new Apache Flink dependencies for streaming, metrics, and client interactions. - **Chores** - Updated build configuration to support a modular Scala project structure. - Enhanced Maven repository with additional Flink-related libraries. <!-- end of auto-generated comment: release notes by coderabbit.ai -->
## Summary Eugene changed some of the job tracker colors; they are updated here ([link, they are below this view](https://www.figma.com/design/zM3rfODJTgJ6iu5zbXIPqk/%5BCS%5D-ZIpline-AI-Platform?node-id=1108-7802&m=dev)). I also noticed that Tailwind was ignoring the hover styles (weird, since it worked for me yesterday; must be a bug), so I added a regex to catch those. ## Checklist - [ ] Added Unit Tests - [ ] Covered by existing CI - [ ] Integration tested - [ ] Documentation update <!-- This is an auto-generated comment: release notes by coderabbit.ai --> ## Summary by CodeRabbit - **Style** - Updated color values for job state styles in the application - Added new CSS custom properties for job states, including waiting and invalid states - Enhanced visual representation of job status colors - **Chores** - Modified Tailwind CSS configuration to support dynamic job status border styles <!-- end of auto-generated comment: release notes by coderabbit.ai -->
… partitioned tables. (#257) … partitioned tables. ## Summary ## Checklist - [ ] Added Unit Tests - [ ] Covered by existing CI - [ ] Integration tested - [ ] Documentation update <!-- This is an auto-generated comment: release notes by coderabbit.ai --> ## Summary by CodeRabbit - **Bug Fixes** - Improved BigQuery data writing method to support partitioned tables by changing the write method from "direct" to "indirect". <!-- end of auto-generated comment: release notes by coderabbit.ai -->
## Summary Migrated the `cloud_gcp` module to Bazel from sbt. The items below are still pending and will be fixed in a separate PR: * A few tests are failing; this is actually a general problem across the project and is being worked on in parallel * The cloud_gcp_submitter target logic also needs to be migrated ## Checklist - [ ] Added Unit Tests - [x] Covered by existing CI - [ ] Integration tested - [ ] Documentation update <!-- This is an auto-generated comment: release notes by coderabbit.ai --> ## Summary by CodeRabbit ## Release Notes - **Dependency Updates** - Downgraded `rules_jvm_external` from version 6.6 to 4.5 - Updated Mockito version to 5.12.0 - Added new Google Cloud service libraries (BigQuery, Bigtable, PubSub, Dataproc) - Added Apache Curator artifact - **Build Configuration** - Updated library visibility settings - Modified logging dependencies in Spark library - Enhanced Maven repository with additional artifacts - **Development Improvements** - Expanded testing and dependency management capabilities <!-- end of auto-generated comment: release notes by coderabbit.ai -->
## Summary Migrated `service` module to Bazel from sbt. ## Checklist - [ ] Added Unit Tests - [x] Covered by existing CI - [ ] Integration tested - [ ] Documentation update <!-- This is an auto-generated comment: release notes by coderabbit.ai --> ## Summary by CodeRabbit - **Build Configuration** - Added new Bazel targets for Java library and test suite - Configured dependencies for library and testing - **Test Improvements** - Updated test code to use more concise collection initialization methods - Simplified test code structure without changing functionality - **Dependency Management** - Added Vert.x Unit testing library version 4.5.10 <!-- end of auto-generated comment: release notes by coderabbit.ai --> --------- Co-authored-by: nikhil-zlai <[email protected]> Co-authored-by: Kumar Teja Chippala <[email protected]>
## Summary Migrated `hub` module to Bazel from sbt. ## Checklist - [ ] Added Unit Tests - [x] Covered by existing CI - [ ] Integration tested - [ ] Documentation update <!-- This is an auto-generated comment: release notes by coderabbit.ai --> ## Summary by CodeRabbit ## Release Notes - **Dependency Management** - Added new dependency lists for Flink, Vert.x, and Circe libraries - Consolidated and standardized dependency management across multiple projects - Introduced new Maven artifact dependencies for Circe and Vert.x - **Build Configuration** - Refactored dependency references in multiple `BUILD.bazel` files - Updated build rules to use centralized dependency variables - Simplified library and test suite configurations - **Library Updates** - Added Scala libraries for hub project - Updated service commons library dependencies <!-- end of auto-generated comment: release notes by coderabbit.ai --> --------- Co-authored-by: nikhil-zlai <[email protected]> Co-authored-by: Kumar Teja Chippala <[email protected]>
## Summary Migrated `orchestration` module to Bazel from sbt. ## Checklist - [ ] Added Unit Tests - [x] Covered by existing CI - [ ] Integration tested - [ ] Documentation update <!-- This is an auto-generated comment: release notes by coderabbit.ai --> ## Summary by CodeRabbit - **New Features** - Added Scala library configurations for main and test code. - Introduced new Log4j dependencies for enhanced logging capabilities. - **Chores** - Updated build configuration to support Scala project structure. - Expanded Maven repository with additional logging libraries. <!-- end of auto-generated comment: release notes by coderabbit.ai -->
## Summary
eval("select ... from ... where ...") is now supported. easier for
exploration.
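A rough sketch of what a limited, error-handled SQL eval could look like on top of a SparkSession. The `eval_sql` helper, its default limit, and the temp view are illustrative assumptions, not the actual implementation.

```python
from pyspark.sql import SparkSession
from pyspark.sql.utils import AnalysisException

def eval_sql(spark: SparkSession, query: str, limit: int = 100):
    """Run an ad-hoc SQL query, returning at most `limit` rows and surfacing analysis errors."""
    try:
        return spark.sql(query).limit(limit).collect()
    except AnalysisException as e:
        print(f"Query failed to analyze: {e}")
        return []

if __name__ == "__main__":
    spark = SparkSession.builder.master("local[*]").getOrCreate()
    spark.range(5).createOrReplaceTempView("numbers")
    for row in eval_sql(spark, "select id from numbers where id > 2"):
        print(row)
```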
## Checklist
- [ ] Added Unit Tests
- [ ] Covered by existing CI
- [x] Integration tested
- [x] Documentation update
<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->
## Summary by CodeRabbit
- **Documentation**
- Updated setup instructions for Chronon Python interactive runner
- Modified environment variable paths and gateway service instructions
- **New Features**
- Added ability to specify result limit when evaluating queries
- Enhanced SQL query evaluation with error handling
- **Bug Fixes**
- Simplified interactive usage example
- Updated Spark session method in DataFrame creation
<!-- end of auto-generated comment: release notes by coderabbit.ai -->
## Summary ^^^ - Did have to add `bigtables.tables.create` permission to the dataproc service account `[email protected]` since one of the tables `CHRONON_ENTITY_BY_TEAM` didn't exist at the time. See job: https://console.cloud.google.com/dataproc/jobs/a04f9ba8-583c-475e-8956-9b53d28f3ed6/monitoring?region=us-central1&inv=1&invt=AbnIoA&project=canary-443022 cc @chewy-zlai ## Checklist - [ ] Added Unit Tests - [ ] Covered by existing CI - [ ] Integration tested - [ ] Documentation update Successful test in Dataproc: https://console.cloud.google.com/dataproc/jobs/04ff550c-4df6-47ce-a36b-3238c1cb63a1/monitoring?region=us-central1&inv=1&invt=AbnIzw&project=canary-443022 See: ``` (dev_chronon) davidhan@Davids-MacBook-Pro: ~/zipline/chronon/api/py/test/sample (davidhan/metadata_upload) $ cbt -project=canary-443022 -instance=zipline-canary-instance read CHRONON_METADATA 2025/01/17 21:44:34 -creds flag unset, will use gcloud credential ---------------------------------------- CHRONON_METADATA#purchases.v1 cf:value @ 2025/01/17-21:30:36.130000 "{\"metaData\":{\"name\":\"quickstart.purchases.v1\",\"online\":1,\"customJson\":\"{\\\"lag\\\": 0, \\\"groupby_tags\\\": null, \\\"column_tags\\\": {}}\",\"dependencies\":[\"{\\\"name\\\": \\\"wait_for_data.purchases_external_ds\\\", \\\"spec\\\": \\\"data.purchases_external/ds={{ ds }}\\\", \\\"start\\\": null, \\\"end\\\": null}\"],\"outputNamespace\":\"data\",\"team\":\"quickstart\",\"offlineSchedule\":\"@daily\"},\"sources\":[{\"events\":{\"table\":\"data.purchases_external\",\"query\":{\"selects\":{\"user_id\":\"user_id\",\"purchase_price\":\"purchase_price\"},\"timeColumn\":\"ts\",\"setups\":[]}}}],\"keyColumns\":[\"user_id\"],\"aggregations\":[{\"inputColumn\":\"purchase_price\",\"operation\":7,\"argMap\":{},\"windows\":[{\"length\":3,\"timeUnit\":1},{\"length\":14,\"timeUnit\":1},{\"length\":30,\"timeUnit\":1}]},{\"inputColumn\":\"purchase_price\",\"operation\":6,\"argMap\":{},\"windows\":[{\"length\":3,\"timeUnit\":1},{\"length\":14,\"timeUnit\":1},{\"length\":30,\"timeUnit\":1}]},{\"inputColumn\":\"purchase_price\",\"operation\":8,\"argMap\":{},\"windows\":[{\"length\":3,\"timeUnit\":1},{\"length\":14,\"timeUnit\":1},{\"length\":30,\"timeUnit\":1}]},{\"inputColumn\":\"purchase_price\",\"operation\":13,\"argMap\":{\"k\":\"10\"}}],\"backfillStartDate\":\"2023-11-20\"}" cf:value @ 2025/01/17-21:28:45.503000 "{\"metaData\":{\"name\":\"quickstart.purchases.v1\",\"online\":1,\"customJson\":\"{\\\"lag\\\": 0, \\\"groupby_tags\\\": null, \\\"column_tags\\\": {}}\",\"dependencies\":[\"{\\\"name\\\": \\\"wait_for_data.purchases_external_ds\\\", \\\"spec\\\": \\\"data.purchases_external/ds={{ ds }}\\\", \\\"start\\\": null, \\\"end\\\": 
null}\"],\"outputNamespace\":\"data\",\"team\":\"quickstart\",\"offlineSchedule\":\"@daily\"},\"sources\":[{\"events\":{\"table\":\"data.purchases_external\",\"query\":{\"selects\":{\"user_id\":\"user_id\",\"purchase_price\":\"purchase_price\"},\"timeColumn\":\"ts\",\"setups\":[]}}}],\"keyColumns\":[\"user_id\"],\"aggregations\":[{\"inputColumn\":\"purchase_price\",\"operation\":7,\"argMap\":{},\"windows\":[{\"length\":3,\"timeUnit\":1},{\"length\":14,\"timeUnit\":1},{\"length\":30,\"timeUnit\":1}]},{\"inputColumn\":\"purchase_price\",\"operation\":6,\"argMap\":{},\"windows\":[{\"length\":3,\"timeUnit\":1},{\"length\":14,\"timeUnit\":1},{\"length\":30,\"timeUnit\":1}]},{\"inputColumn\":\"purchase_price\",\"operation\":8,\"argMap\":{},\"windows\":[{\"length\":3,\"timeUnit\":1},{\"length\":14,\"timeUnit\":1},{\"length\":30,\"timeUnit\":1}]},{\"inputCol ``` ``` (dev_chronon) davidhan@Davids-MacBook-Pro: ~/zipline/chronon/api/py/test/sample (davidhan/metadata_upload) $ cbt -project=canary-443022 -instance=zipline-canary-instance read CHRONON_ENTITY_BY_TEAM 2025/01/17 21:45:43 -creds flag unset, will use gcloud credential ---------------------------------------- CHRONON_ENTITY_BY_TEAM#group_bys/quickstart cf:value @ 2025/01/17-21:30:36.621000 "cHVyY2hhc2VzLnYx" ``` <!-- This is an auto-generated comment: release notes by coderabbit.ai --> ## Summary by CodeRabbit ## Release Notes - **New Features** - Added configuration type specification option for command-line interfaces. - Enhanced metadata processing with more flexible configuration handling. - **Improvements** - Refined command execution logic for better control flow. - Updated metadata parsing to support additional configuration types. - **Documentation** - Added clarifying comments about asynchronous task behaviors in dataset and table creation processes. The release introduces more flexible configuration management and improved command-line argument handling across various components. <!-- end of auto-generated comment: release notes by coderabbit.ai -->
## Summary Input/out schema for the Job tracker and lineage API on ZiplineHub. ## Checklist - [ ] Added Unit Tests - [x] Covered by existing CI - [ ] Integration tested - [ ] Documentation update <!-- This is an auto-generated comment: release notes by coderabbit.ai --> ## Summary by CodeRabbit - **New Features** - Introduced a comprehensive job tracking system with support for tracking job execution, status, and metadata. - Added a new API endpoint for handling job type requests. - Defined structures for job tracking requests and responses, including job metadata. - Introduced enumerations for job processing direction, execution modes, and statuses. - Added a new handler for processing job tracking requests. - **Bug Fixes** - Corrected a typographical error in a comment within the orchestration Thrift file. <!-- end of auto-generated comment: release notes by coderabbit.ai --> --------- Co-authored-by: ezvz <[email protected]> Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>
## Summary
- Should be using the same reader as every other case.
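A sketch of the general idea in PySpark terms: route every table load through one helper so DataFrameReader options are applied uniformly across formats. The helper name and options are illustrative, not the actual `TableUtils` code.

```python
from typing import Optional

from pyspark.sql import DataFrame, SparkSession

# Options applied to every table load, regardless of format (Hive, Delta, Iceberg).
DEFAULT_READ_OPTIONS = {"mergeSchema": "false"}

def load_table(spark: SparkSession, table_name: str,
               extra_options: Optional[dict] = None) -> DataFrame:
    """Load a table through a single code path so reader options are always applied."""
    options = {**DEFAULT_READ_OPTIONS, **(extra_options or {})}
    return spark.read.options(**options).table(table_name)
```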
## Checklist
- [ ] Added Unit Tests
- [ ] Covered by existing CI
- [ ] Integration tested
- [ ] Documentation update
<!-- av pr metadata
This information is embedded by the av CLI when creating PRs to track
the status of stacks when using Aviator. Please do not delete or edit
this section of the PR.
```
{"parent":"main","parentHead":"","trunk":"main"}
```
-->
<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->
## Summary by CodeRabbit
- **Improvements**
- Enhanced data loading consistency for Hive, Delta, and Iceberg table
formats
- Ensured that DataFrameReader options are consistently applied when
loading tables
<!-- end of auto-generated comment: release notes by coderabbit.ai -->
Co-authored-by: Thomas Chow <[email protected]>
## Summary
- Remove the implicit dependency on FormatProvider from CreationUtils.
Step 1 toward supporting format-specific table creation.
## Checklist
- [ ] Added Unit Tests
- [ ] Covered by existing CI
- [ ] Integration tested
- [ ] Documentation update
<!-- av pr metadata
This information is embedded by the av CLI when creating PRs to track
the status of stacks when using Aviator. Please do not delete or edit
this section of the PR.
```
{"parent":"main","parentHead":"","trunk":"main"}
```
-->
<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->
## Summary by CodeRabbit
- **Code Improvements**
- Updated method signatures in `TableUtils` and `CreationUtils` to
enhance clarity and flexibility of table creation and partition
insertion processes
- Refined how file formats and table types are specified during table
operations
- Streamlined parameter handling for table creation methods
<!-- end of auto-generated comment: release notes by coderabbit.ai -->
Co-authored-by: Thomas Chow <[email protected]>
## Summary
- Set writeOptions instead of the global conf to avoid mutation and races
between threads
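An illustrative PySpark contrast of the two approaches; the option name is a placeholder, not the exact BigQuery setting touched in this change, and the BigQuery write assumes the spark-bigquery connector is on the classpath.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[*]").getOrCreate()
df = spark.createDataFrame([(1, "a")], ["id", "name"])

# Before (sketch): mutating the shared session conf, which can race across threads.
# spark.conf.set("some.connector.setting", "value")

# After (sketch): pass settings as per-write options so nothing global is mutated.
(df.write
   .format("bigquery")
   .option("some.connector.setting", "value")  # scoped to this write only
   .mode("append")
   .save("dataset.table"))
```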
## Checklist
- [ ] Added Unit Tests
- [ ] Covered by existing CI
- [ ] Integration tested
- [ ] Documentation update
<!-- av pr metadata
This information is embedded by the av CLI when creating PRs to track
the status of stacks when using Aviator. Please do not delete or edit
this section of the PR.
```
{"parent":"main","parentHead":"","trunk":"main"}
```
-->
<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->
## Summary by CodeRabbit
- **Configuration Management**
- Simplified BigQuery configuration handling in Spark session
- Removed configuration restoration logic after method execution
Note: The changes are primarily internal and do not directly impact
end-user functionality.
<!-- end of auto-generated comment: release notes by coderabbit.ai -->
Co-authored-by: Thomas Chow <[email protected]>
## Summary - Initial implementation of Python `compile` -- compiles the whole repo - Placeholder implementation of `backfill` -- next step will be to wire up to the orchestrator for implementation - Documentation for the broader user API surface area and how it should behave (should maybe move this to pydocs) Currently, the interaction looks like this: <img width="729" alt="image" src="https://github.com/user-attachments/assets/a52cf582-4394-449a-8567-32c0f108d2ed" /> Next steps: - Connect with ZiplineHub for lineage, and backfill execution ## Checklist - [ ] Added Unit Tests - [x] Covered by existing CI - [ ] Integration tested - [x] Documentation update <!-- This is an auto-generated comment: release notes by coderabbit.ai --> ## Summary by CodeRabbit - **Documentation** - Expanded CLI documentation with detailed descriptions of `plan`, `backfill`, `deploy`, and `fetch` commands. - Added comprehensive usage instructions and examples for each command. - **New Features** - Enhanced repository management functions for Chronon objects. - Added methods for compiling, uploading, and managing configuration details. - Introduced new utility functions for module and variable name retrieval. - Added functionality to compute and upload differences in a local repository. - New function to convert JSON to Thrift binary format. - **Improvements** - Streamlined function signatures in extraction and compilation modules. - Added error handling and logging improvements. - Enhanced object serialization and diff computation capabilities. - Dynamic method bindings for `backfill`, `deploy`, and `info` functions in `GroupBy` and `Join`. <!-- end of auto-generated comment: release notes by coderabbit.ai --> --------- Co-authored-by: ezvz <[email protected]> Co-authored-by: Nikhil Simha <[email protected]>
## Summary
## Checklist
- [ ] Added Unit Tests
- [ ] Covered by existing CI
- [ ] Integration tested
- [ ] Documentation update
<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->
## Summary by CodeRabbit
- **Bug Fixes**
- Improved BigQuery partition loading mechanism to use more direct
method calls when retrieving partition columns and values.
Note: The changes appear technical and primarily affect internal data
loading processes, with no direct end-user visible impact.
<!-- end of auto-generated comment: release notes by coderabbit.ai -->
<!-- av pr metadata
This information is embedded by the av CLI when creating PRs to track
the status of stacks when using Aviator. Please do not delete or edit
this section of the PR.
```
{"parent":"main","parentHead":"","trunk":"main"}
```
-->
---------
Co-authored-by: Thomas Chow <[email protected]>
## Summary
## Checklist
- [ ] Added Unit Tests
- [ ] Covered by existing CI
- [ ] Integration tested
- [ ] Documentation update
<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->
## Summary by CodeRabbit
- **Chores**
- Updated Java version to Corretto 11.0.25.9.1
- Upgraded Google Cloud SDK (gcloud) from version 504.0.1 to 507.0.0
<!-- end of auto-generated comment: release notes by coderabbit.ai -->
<!-- av pr metadata
This information is embedded by the av CLI when creating PRs to track
the status of stacks when using Aviator. Please do not delete or edit
this section of the PR.
```
{"parent":"main","parentHead":"","trunk":"main"}
```
-->
---------
Co-authored-by: Thomas Chow <[email protected]>
## Summary We have a strict dependency on Java 11 for all the Dataproc stuff, so it's good to be consistent across our project. Currently only the service_commons package has a strict dependency on Java 17, so changes were made to make it compatible with Java 11. I was able to successfully build using both sbt and Bazel. ## Checklist - [ ] Added Unit Tests - [x] Covered by existing CI - [ ] Integration tested - [ ] Documentation update <!-- This is an auto-generated comment: release notes by coderabbit.ai --> ## Summary by CodeRabbit - **Configuration** - Downgraded Java language and runtime versions from 17 to 11 in Bazel configuration - **Code Improvement** - Updated type handling in `RouteHandlerWrapper` method signature for enhanced type safety <!-- end of auto-generated comment: release notes by coderabbit.ai -->
## Summary Set up a Flink job that can take beacon data as avro (configured in gcs) and emit it at a configurable rate to Kafka. We can use this stream in our GB streaming jobs ## Checklist - [ ] Added Unit Tests - [ ] Covered by existing CI - [X] Integration tested Kicked off the job and you can see events flowing in topic [test-beacon-main](https://console.cloud.google.com/managedkafka/us-central1/clusters/zipline-kafka-cluster/topics/test-beacon-main?hl=en&invt=Abnpeg&project=canary-443022) - [ ] Documentation update <!-- This is an auto-generated comment: release notes by coderabbit.ai --> ## Summary by CodeRabbit ## Release Notes - **New Features** - Added Kafka data ingestion capabilities using Apache Flink. - Introduced a new driver for streaming events from GCS to Kafka with configurable delay. - **Dependencies** - Added Apache Flink connectors for Kafka, Avro, and file integration. - Integrated managed Kafka authentication handler for cloud environments. - **Infrastructure** - Created new project configuration for Kafka data ingestion. - Updated build settings to support advanced streaming workflows. - Updated cluster name configuration for Dataproc submitter. <!-- end of auto-generated comment: release notes by coderabbit.ai -->
## Summary
- The way tables are created can differ depending on the underlying
storage catalog we are using. In the case of BigQuery, we cannot issue a
Spark SQL statement to do this operation. Let's abstract that out into
the `Format` layer for now, but eventually we will need a `Catalog`
abstraction that supports this.
- Try to remove the dependency on a SparkSession in `Format`; invert the
dependency with a higher-order function (HOF).
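A small Python sketch of that inversion: instead of `Format` holding a SparkSession, the caller passes in a function that executes SQL. All class and method names here are illustrative stand-ins for the Scala code.

```python
from typing import Callable, List

class Format:
    """Base format: builds/receives DDL but delegates execution to a caller-supplied function."""

    def create_table(self, ddl: str, run_sql: Callable[[str], None]) -> None:
        run_sql(ddl)

class BigQueryFormat(Format):
    """BigQuery cannot be driven by Spark SQL DDL, so it would use a client call instead."""

    def create_table(self, ddl: str, run_sql: Callable[[str], None]) -> None:
        # Placeholder: a real implementation would call the BigQuery client API here.
        print(f"Creating table via BigQuery client, ignoring Spark SQL: {ddl!r}")

if __name__ == "__main__":
    executed: List[str] = []
    Format().create_table("CREATE TABLE db.t (id BIGINT)", executed.append)
    BigQueryFormat().create_table("CREATE TABLE db.t (id BIGINT)", executed.append)
    print(executed)  # only the base format ran SQL through the injected function
```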
## Checklist
- [ ] Added Unit Tests
- [ ] Covered by existing CI
- [ ] Integration tested
- [ ] Documentation update
<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->
## Summary by CodeRabbit
- **New Features**
- Enhanced BigQuery integration with new client parameter
- Added support for more flexible table creation methods
- Improved logging capabilities for table operations
- **Bug Fixes**
- Standardized table creation process across different formats
- Removed unsupported BigQuery table creation operations
- **Refactor**
- Simplified table creation and partition insertion logic
- Updated method signatures to support more comprehensive table
management
<!-- end of auto-generated comment: release notes by coderabbit.ai -->
<!-- av pr metadata
This information is embedded by the av CLI when creating PRs to track
the status of stacks when using Aviator. Please do not delete or edit
this section of the PR.
```
{"parent":"main","parentHead":"","trunk":"main"}
```
-->
---------
Co-authored-by: Thomas Chow <[email protected]>
## Summary - https://app.asana.com/0/1208949807589885/1209206040434612/f - Support explicit bigquery table creation. ## Checklist - [ ] Added Unit Tests - [ ] Covered by existing CI - [ ] Integration tested - [ ] Documentation update --- - To see the specific tasks where the Asana app for GitHub is being used, see below: - https://app.asana.com/0/0/1209206040434612 <!-- This is an auto-generated comment: release notes by coderabbit.ai --> ## Summary by CodeRabbit - **New Features** - Enhanced BigQuery table creation functionality with improved schema and partitioning support. - Streamlined table creation process in Spark's TableUtils. - **Refactor** - Simplified table existence checking logic. - Consolidated import statements for better readability. - Removed unused import in BigQuery catalog test. - Updated import statement in GcpFormatProviderTest for better integration with Spark BigQuery connector. - **Bug Fixes** - Improved error handling for table creation scenarios. <!-- end of auto-generated comment: release notes by coderabbit.ai --> <!-- av pr metadata This information is embedded by the av CLI when creating PRs to track the status of stacks when using Aviator. Please do not delete or edit this section of the PR. ``` {"parent":"main","parentHead":"","trunk":"main"} ``` --> --------- Co-authored-by: Thomas Chow <[email protected]>
…obs (#274) ## Summary Changes to avoid runtime errors while running Spark and Flink jobs on the cluster using Dataproc submit for the newly built bazel assembly jars. Successfully ran the DataprocSubmitterTest jobs using the newly built bazel jars for testing. Testing Steps 1. Build the assembly jars ``` bazel build //cloud_gcp:lib_deploy.jar bazel build //flink:assembly_deploy.jar bazel build //flink:kafka-assembly_deploy.jar ``` 2. Copy the jars to the GCP account used by our jobs ``` gsutil cp bazel-bin/cloud_gcp/lib_deploy.jar gs://zipline-jars/bazel-cloud-gcp.jar gsutil cp bazel-bin/flink/assembly_deploy.jar gs://zipline-jars/bazel-flink.jar gsutil cp bazel-bin/flink/kafka-assembly_deploy.jar gs://zipline-jars/bazel-flink-kafka.jar ``` 3. Modify the jar paths in the DataprocSubmitterTest file and run the tests ## Checklist - [ ] Added Unit Tests - [x] Covered by existing CI - [ ] Integration tested - [ ] Documentation update <!-- This is an auto-generated comment: release notes by coderabbit.ai --> ## Summary by CodeRabbit ## Release Notes - **New Features** - Added Kafka and Avro support for Flink. - Introduced new Flink assembly and Kafka assembly binaries. - **Dependency Management** - Updated Maven artifact dependencies, including Kafka and Avro. - Excluded specific Hadoop and Spark-related artifacts to prevent runtime conflicts. - **Build Configuration** - Enhanced build rules for Flink and Spark environments. - Improved dependency management to prevent class compatibility issues. <!-- end of auto-generated comment: release notes by coderabbit.ai -->
## Summary
The string formatting here was breaking zipline commands when I built a
new wheel:
```
File "/Users/davidhan/zipline/chronon/dev_chronon/lib/python3.11/site-packages/ai/chronon/repo/hub_uploader.py", line 73
print(f"\n\nUploading:\n {"\n".join(diffed_entities.keys())}")
^
SyntaxError: unexpected character after line continuation character
```
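On Python versions before 3.12, an f-string cannot reuse the same quote character inside its expression, which is what triggers the error above. One fix that works (the actual patch may differ) is to build the joined string first:

```python
diffed_entities = {"team.group_by.v1": {}, "team.join.v1": {}}  # example input

# Joining outside the f-string avoids nesting double quotes inside a double-quoted f-string.
entity_names = "\n".join(diffed_entities.keys())
print(f"\n\nUploading:\n {entity_names}")
```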
## Checklist
- [ ] Added Unit Tests
- [ ] Covered by existing CI
- [ ] Integration tested
- [ ] Documentation update
<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->
## Summary by CodeRabbit
- **Style**
- Improved string formatting and error message construction for better
code readability
- Enhanced log message clarity in upload process
Note: These changes are internal improvements that do not impact
end-user functionality.
<!-- end of auto-generated comment: release notes by coderabbit.ai -->
## Summary - With #263 we control table creation ourselves. We don't need to rely on indirect writes to then do the table creation (and partitioning) for us, we just simply use the storage API to write directly into the table we created. This should be much more performant and preferred over indirect writes because we don't need to stage data, then load as a temp BQ table, and it uses the BigQuery storage API directly. - Remove configs that are used only for indirect writes ## Checklist - [ ] Added Unit Tests - [ ] Covered by existing CI - [ ] Integration tested - [ ] Documentation update <!-- This is an auto-generated comment: release notes by coderabbit.ai --> ## Summary by CodeRabbit ## Release Notes - **Improvements** - Enhanced BigQuery data writing process with more precise configuration options. - Simplified table creation and partition insertion logic. - Improved handling of DataFrame column arrangements during data operations. - **Changes** - Updated BigQuery write method to use a direct writing approach. - Introduced a new option to prevent table creation if it does not exist. - Modified table creation process to be more format-aware. - Streamlined partition insertion mechanism. These updates improve data management and writing efficiency in cloud data processing workflows. <!-- end of auto-generated comment: release notes by coderabbit.ai --> <!-- av pr metadata This information is embedded by the av CLI when creating PRs to track the status of stacks when using Aviator. Please do not delete or edit this section of the PR. ``` {"parent":"main","parentHead":"","trunk":"main"} ``` --> --------- Co-authored-by: Thomas Chow <[email protected]>
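For reference, the spark-bigquery connector selects the write path via its `writeMethod` option. A hedged PySpark sketch of a direct write into a pre-created table; the dataset/table names are placeholders and `df` is assumed to be an existing DataFrame with the connector on the classpath.

```python
(df.write
   .format("bigquery")
   .option("writeMethod", "direct")              # write through the BigQuery Storage API
   .option("createDisposition", "CREATE_NEVER")  # table was already created explicitly
   .mode("append")
   .save("my_dataset.my_table"))
```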
## Summary
- Pull alter table properties functionality into `Format`
## Checklist
- [ ] Added Unit Tests
- [ ] Covered by existing CI
- [ ] Integration tested
- [ ] Documentation update
<!-- av pr metadata
This information is embedded by the av CLI when creating PRs to track
the status of stacks when using Aviator. Please do not delete or edit
this section of the PR.
```
{"parent":"main","parentHead":"","trunk":"main"}
```
-->
<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->
## Summary by CodeRabbit
- **New Features**
- Added a placeholder method for altering table properties in BigQuery
format
- Introduced a new method to modify table properties across different
Spark formats
- Enhanced table creation utility to use format-specific property
alteration methods
- **Refactor**
- Improved table creation process by abstracting table property
modifications
- Standardized approach to handling table property changes across
different formats
<!-- end of auto-generated comment: release notes by coderabbit.ai -->
---------
Co-authored-by: Thomas Chow <[email protected]>
## Summary
## Checklist
- [ ] Added Unit Tests
- [ ] Covered by existing CI
- [ ] Integration tested
- [ ] Documentation update
<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->
## Summary by CodeRabbit
- **New Features**
- Added configurable repartitioning option for DataFrame writes.
- Introduced a new configuration setting to control repartitioning
behavior.
- Enhanced test suite with functionality to handle empty DataFrames.
- **Chores**
- Improved code formatting and logging for DataFrame writing process.
<!-- end of auto-generated comment: release notes by coderabbit.ai -->
<!-- av pr metadata
This information is embedded by the av CLI when creating PRs to track
the status of stacks when using Aviator. Please do not delete or edit
this section of the PR.
```
{"parent":"main","parentHead":"","trunk":"main"}
```
-->
---------
Co-authored-by: tchow-zlai <[email protected]>
Co-authored-by: Thomas Chow <[email protected]>
) ## Summary we cannot represent absent data in time series with null values because the thrift serializer doesn't allow nulls in list<double> - so we create a magic double value (based on string "chronon") to represent nulls ## Checklist - [x] Added Unit Tests - [ ] Covered by existing CI - [ ] Integration tested - [ ] Documentation update <!-- This is an auto-generated comment: release notes by coderabbit.ai --> ## Summary by CodeRabbit ## Release Notes - **New Features** - Introduced a new constant `magicNullDouble` for handling null value representations - **Bug Fixes** - Improved null value handling in serialization and data processing workflows - **Tests** - Added new test case for `TileDriftSeries` serialization to validate null value management The changes enhance data processing consistency and provide a standardized approach to managing null values across the system. <!-- end of auto-generated comment: release notes by coderabbit.ai -->
## Summary This fixes an issue where Infinity/NaN values in drift calculations were causing JSON parse errors in the frontend. These special values are now converted to our standard magic null value (-2.7980863399423856E16) before serialization. ## Checklist - [x] Added Unit Tests - [ ] Covered by existing CI - [ ] Integration tested - [ ] Documentation update <!-- This is an auto-generated comment: release notes by coderabbit.ai --> ## Summary by CodeRabbit - **Bug Fixes** - Enhanced error handling for special double values (infinity and NaN) in data processing - Improved serialization test case to handle null and special values more robustly - **Tests** - Updated test case for `TileDriftSeries` serialization to cover edge cases with special double values <!-- end of auto-generated comment: release notes by coderabbit.ai -->
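A simplified Python sketch of the sanitization step, using the magic value quoted above; the real logic lives on the Scala side and may differ in detail.

```python
import math

MAGIC_NULL_DOUBLE = -2.7980863399423856e16  # sentinel standing in for absent values

def sanitize(values):
    """Replace None, NaN, and +/-Infinity with the sentinel so the series serializes cleanly."""
    return [
        MAGIC_NULL_DOUBLE if v is None or math.isnan(v) or math.isinf(v) else v
        for v in values
    ]

print(sanitize([1.5, None, float("nan"), float("inf"), 2.0]))
```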
## Summary ## Checklist - [ ] Added Unit Tests - [ ] Covered by existing CI - [ ] Integration tested - [ ] Documentation update <!-- This is an auto-generated comment: release notes by coderabbit.ai --> ## Summary by CodeRabbit - **Chores** - Revised the upload process so that release artifacts are now organized into distinct directories based on their type. This update separates application packages into dedicated folders, ensuring a more orderly and reliable distribution of the latest releases. <!-- end of auto-generated comment: release notes by coderabbit.ai -->
## Summary ## Checklist - [ ] Added Unit Tests - [ ] Covered by existing CI - [ ] Integration tested - [ ] Documentation update <!-- This is an auto-generated comment: release notes by coderabbit.ai --> ## Summary by CodeRabbit - **New Features** - Improved cloud artifact deployment now ensures that release items are available on both legacy and updated storage paths for enhanced reliability. - Enhanced data processing now dynamically determines partition configurations and provides clear notifications when sub-partition filtering isn’t supported. <!-- end of auto-generated comment: release notes by coderabbit.ai -->
## Summary
```bash
➜ chronon git:(tchow/zipline-init) ✗ zipline init
Warning: /Users/thomaschow/zipline-ai/chronon/zipline is not empty. Proceed? [y/N]: Y
Generating scaffolding at /Users/thomaschow/zipline-ai/chronon/zipline...
Project scaffolding created successfully! 🎉
```
## Checklist
- [ ] Added Unit Tests
- [ ] Covered by existing CI
- [ ] Integration tested
- [ ] Documentation update
<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->
## Summary by CodeRabbit
- **New Features**
- Introduced a new CLI tool for safely initializing and scaffolding
projects with user confirmation to avoid accidental overwrites.
- Integrated the new command into the main interface for streamlined
project setup.
- Added a configuration file for custom sample team join and aggregation
settings.
- **Chores**
- Updated dependency management by reintroducing essential packages and
adding new ones.
- Revised package metadata and version requirements to support a newer
Python version.
<!-- end of auto-generated comment: release notes by coderabbit.ai -->
<!-- av pr metadata
This information is embedded by the av CLI when creating PRs to track
the status of stacks when using Aviator. Please do not delete or edit
this section of the PR.
```
{"parent":"main","parentHead":"","trunk":"main"}
```
-->
---------
Co-authored-by: Thomas Chow <[email protected]>
## Summary
- Do not use the `DelegatingBigQueryMetastoreCatalog` as the session
catalog; just have it be a custom catalog.
This will change the configuration set from:
```bash
spark.sql.catalog.spark_catalog.warehouse: "gs://zipline-warehouse-etsy/data/tables/"
spark.sql.catalog.spark_catalog.gcp_location: "us"
spark.sql.catalog.spark_catalog.gcp_project: "etsy-zipline-dev"
spark.sql.catalog.spark_catalog.catalog-impl: org.apache.iceberg.gcp.bigquery.BigQueryMetastoreCatalog
spark.sql.catalog.spark_catalog: ai.chronon.integrations.cloud_gcp.DelegatingBigQueryMetastoreCatalog
spark.sql.catalog.spark_catalog.io-impl: org.apache.iceberg.io.ResolvingFileIO
spark.sql.catalog.default_iceberg: ai.chronon.integrations.cloud_gcp.DelegatingBigQueryMetastoreCatalog
spark.sql.catalog.default_iceberg.catalog-impl: org.apache.iceberg.gcp.bigquery.BigQueryMetastoreCatalog
spark.sql.catalog.default_iceberg.io-impl: org.apache.iceberg.io.ResolvingFileIO
spark.sql.catalog.default_iceberg.warehouse: "gs://zipline-warehouse-etsy/data/tables/"
spark.sql.catalog.default_iceberg.gcp_location: "us"
spark.sql.catalog.default_iceberg.gcp_project: "etsy-zipline-dev"
spark.sql.defaultUrlStreamHandlerFactory.enabled: "false"
spark.kryo.registrator: "ai.chronon.integrations.cloud_gcp.ChrononIcebergKryoRegistrator"
```
to:
```bash
spark.sql.defaultCatalog: "default_iceberg"
spark.sql.catalog.default_iceberg: "ai.chronon.integrations.cloud_gcp.DelegatingBigQueryMetastoreCatalog"
spark.sql.catalog.default_iceberg.catalog-impl: "org.apache.iceberg.gcp.bigquery.BigQueryMetastoreCatalog"
spark.sql.catalog.default_iceberg.io-impl: "org.apache.iceberg.io.ResolvingFileIO"
spark.sql.catalog.default_iceberg.warehouse: "gs://zipline-warehouse-etsy/data/tables/"
spark.sql.catalog.default_iceberg.gcp_location: "us"
spark.sql.catalog.default_iceberg.gcp_project: "etsy-zipline-dev"
spark.sql.defaultUrlStreamHandlerFactory.enabled: "false"
spark.kryo.registrator: "ai.chronon.integrations.cloud_gcp.ChrononIcebergKryoRegistrator"
spark.sql.catalog.default_bigquery: "ai.chronon.integrations.cloud_gcp.DelegatingBigQueryMetastoreCatalog"
```
## Checklist
- [ ] Added Unit Tests
- [ ] Covered by existing CI
- [ ] Integration tested
- [ ] Documentation update
<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->
## Summary by CodeRabbit
- **Refactor**
- Improved internal table processing by restructuring class integrations
and enhancing error messaging when a table isn’t found.
- **Tests**
- Updated integration settings and adjusted reference parameters to
ensure validations remain aligned with the new catalog implementation.
<!-- end of auto-generated comment: release notes by coderabbit.ai -->
<!-- av pr metadata
This information is embedded by the av CLI when creating PRs to track
the status of stacks when using Aviator. Please do not delete or edit
this section of the PR.
```
{"parent":"main","parentHead":"","trunk":"main"}
```
-->
---------
Co-authored-by: Thomas Chow <[email protected]>
…493) ## Summary Adding function stubs and thrift schemas for orchestration with column level lineage. ## Checklist - [ ] Added Unit Tests - [ ] Covered by existing CI - [ ] Integration tested - [ ] Documentation update <!-- This is an auto-generated comment: release notes by coderabbit.ai --> ## Summary by CodeRabbit - **New Features** - Streamlined CLI operations with a simplified backfill command and removal of an obsolete synchronization command. - Enhanced configuration processing that now automatically compiles and returns results. - Expanded physical planning capabilities with new mechanisms for managing computation graphs. - Extended tracking of data transformations through improved column lineage support in metadata. - Introduction of new classes and methods for managing physical nodes and graphs, enhancing data processing workflows. - New interface for orchestrators to handle configuration fetching and uploading. - New structures for detailed column lineage tracking throughout data processing stages. - Added support for generating Python code from new Thrift definitions in the testing workflow. - **Bug Fixes** - Improved documentation for clarity on the functionality and logic of certain fields. <!-- end of auto-generated comment: release notes by coderabbit.ai --> --------- Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com> Co-authored-by: ezvz <[email protected]>
## Summary Initial skeleton implementation for the execution layer with the following 1. Custom Thrift payload converter for our inputs/outputs 2. NodeExecutionWorkflow implementation which utilizes NodeRunnerActivity implementation to recursively trigger and wait for dependencies before submitting the job for the node 3. NodeExecutionWorker and NodeExecutionWorkflowRunner implementation for local testing (this can be refactored and modified for the actual production deployment) 4. Added basic unit tests for the DagExecution workflow and will add more in future iterations ## Checklist - [x] Added Unit Tests - [ ] Covered by existing CI - [ ] Integration tested - [ ] Documentation update <!-- This is an auto-generated comment: release notes by coderabbit.ai --> ## Summary by CodeRabbit - **New Features** - Introduced a new orchestration system for executing node-based operations with dependency management, including enhanced worker and runner components. - Added a new `DummyNode` structure for improved data representation in workflows. - Implemented a `ThriftPayloadConverter` for custom serialization and deserialization of Thrift objects in workflows. - Developed a `NodeExecutionWorkflow` for managing the execution of nodes within a Directed Acyclic Graph (DAG). - Created a `NodeExecutionActivity` interface for handling node dependencies and job submissions. - Added a `TaskQueue` trait to facilitate load balancing and independent scaling of task queues. - Introduced a `WorkflowOperations` interface for managing workflow operations and statuses. - **Documentation** - Added a README.md file detailing the orchestration module's functionalities and usage instructions. - **Tests** - Expanded test coverage for workflow executions, activity handling, and serialization, ensuring improved reliability. - Introduced new unit and integration tests for `NodeExecutionWorkflow`, `NodeExecutionActivity`, and `ThriftPayloadConverter`. - **Chores** - Streamlined build configurations and updated dependency management for enhanced performance and maintainability. <!-- end of auto-generated comment: release notes by coderabbit.ai --> --------- Co-authored-by: david-zlai <[email protected]> Co-authored-by: chewy-zlai <[email protected]> Co-authored-by: Nikhil Simha <[email protected]> Co-authored-by: Ken Morton <[email protected]> Co-authored-by: Sean Lynch <[email protected]> Co-authored-by: Sean Lynch <[email protected]> Co-authored-by: Piyush Narang <[email protected]> Co-authored-by: Thomas Chow <[email protected]> Co-authored-by: Thomas Chow <[email protected]>
…#489) ## Summary ## Checklist - [ ] Added Unit Tests - [ ] Covered by existing CI - [ ] Integration tested - [ ] Documentation update <!-- This is an auto-generated comment: release notes by coderabbit.ai --> ## Summary by CodeRabbit - **Refactor** - Updated the method for loading data in tests to enhance consistency and abstraction across various cases. - **Style** - Reorganized import statements within tests for improved clarity and maintainability. *Note: These behind‑the‑scenes improvements bolster overall system quality without affecting end‑user functionality.* <!-- end of auto-generated comment: release notes by coderabbit.ai --> <!-- av pr metadata This information is embedded by the av CLI when creating PRs to track the status of stacks when using Aviator. Please do not delete or edit this section of the PR. ``` {"parent":"main","parentHead":"","trunk":"main"} ``` --> Co-authored-by: Thomas Chow <[email protected]>
… exec nodes (#515) ## Summary The updated Etsy listing actions GroupBy consists of a select which has "explode(transform(...))". When we run that through our Validation Flink job we see no matches. This is due to the explode + transform resulting in spark splitting up the whole stage code gen into multiple stages (and one of the stages being a 'GenerateExec' node for the explode). This PR reworks our CU code quite a bit to support this. * With these changes we consistently do see the validation upfront succeeding * Added a set of tests that mimic the listing actions behavior in terms of avro shape, mock data examples and confirm they pass * There are some perf issues still - we see that we need higher //ism to keep up with the kafka stream (need 96). This is something we can tune in a follow up. ## Checklist - [X] Added Unit Tests - [ ] Covered by existing CI - [X] Integration tested - [ ] Documentation update <!-- This is an auto-generated comment: release notes by coderabbit.ai --> ## Summary by CodeRabbit - **New Features** - Introduced a new utility for Avro data encoding and deserialization. - Added a test suite for validating complex Avro payload processing. - **Refactor** - Streamlined data processing and transformation logic for improved performance and maintainability. - Enhanced logging for execution plan processing. - **Tests** - Expanded test coverage for Avro deserialization and SQL transformation scenarios. - Added new test cases for list and array containers. - **Chores** - Made minor formatting updates for better code clarity. - Deleted obsolete helper methods. <!-- end of auto-generated comment: release notes by coderabbit.ai --> --------- Co-authored-by: Nikhil Simha <[email protected]>
## Summary
Our wheel seems to be broken after we added the lineage.thrift file -
```
zipline compile --input_path group_bys/search/beacons/listing.py
Traceback (most recent call last):
File "/Users/pnarang/workspace/zipline/dev_chronon/bin/zipline", line 5, in <module>
from ai.chronon.repo.zipline import zipline
File "/Users/pnarang/workspace/zipline/dev_chronon/lib/python3.11/site-packages/ai/chronon/repo/__init__.py", line 15, in <module>
from ai.chronon.api.ttypes import GroupBy, Join, StagingQuery, Model
File "/Users/pnarang/workspace/zipline/dev_chronon/lib/python3.11/site-packages/ai/chronon/api/ttypes.py", line 16, in <module>
import ai.chronon.lineage.ttypes
ModuleNotFoundError: No module named 'ai.chronon.lineage'
```
## Checklist
- [ ] Added Unit Tests
- [ ] Covered by existing CI
- [ ] Integration tested
- [ ] Documentation update
<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->
## Summary by CodeRabbit
- **New Features**
- Enhanced the automated generation process to support all Thrift files
in the `api/thrift/` directory, improving efficiency.
- Updated the build routine to streamline the generation of Thrift
files, ensuring consistency and smooth integration.
- Consolidated ignored directory paths in the version control system for
better management.
- **Bug Fixes**
- Removed the deprecated script responsible for generating Python Thrift
files, eliminating associated issues.
<!-- end of auto-generated comment: release notes by coderabbit.ai -->
## Summary
## Checklist
- [ ] Added Unit Tests
- [ ] Covered by existing CI
- [ ] Integration tested
- [ ] Documentation update
<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->
## Summary by CodeRabbit
- **New Features**
- Enhanced data write configuration by adding options for distribution
mode (none) and target file size (512 MB) during the write operation.
<!-- end of auto-generated comment: release notes by coderabbit.ai -->
<!-- av pr metadata
This information is embedded by the av CLI when creating PRs to track
the status of stacks when using Aviator. Please do not delete or edit
this section of the PR.
```
{"parent":"main","parentHead":"","trunk":"main"}
```
-->
---------
Co-authored-by: Thomas Chow <[email protected]>
## Summary This can help us dig into why our Flink jobs are slow. [Flink docs](https://nightlies.apache.org/flink/flink-docs-master/docs/ops/debugging/flame_graphs/) ## Checklist - [ ] Added Unit Tests - [ ] Covered by existing CI - [X] Integration tested - Tested by running on Etsy side - [ ] Documentation update <!-- This is an auto-generated comment: release notes by coderabbit.ai --> ## Summary by CodeRabbit - **New Features** - Enhanced job monitoring through the addition of flame graph generation for improved REST API insights. <!-- end of auto-generated comment: release notes by coderabbit.ai -->
## Summary Simplified tests to only have unit tests using PostgreSQL test containers. Deleted the integration tests, as they were not truly integration tests and we had to use a slightly complicated mix-in composition to remove duplicate code. Ideally we would have used a test container that starts a PGAdapter connection to the Spanner emulator, but that implementation doesn't exist yet, and creating our own was challenging: the Docker image for local PGAdapter doesn't have good multi-platform support (it only works on Linux and fails on Mac), and we had trouble making the tests work on all platforms. ## Checklist - [ ] Added Unit Tests - [x] Covered by existing CI - [ ] Integration tested - [ ] Documentation update <!-- This is an auto-generated comment: release notes by coderabbit.ai --> ## Summary by CodeRabbit - **New Features** - Introduced enhanced batch processing for managing dependency relationships, strengthening orchestration performance. - **Chores** - Streamlined testing infrastructure for improved database management and tighter code encapsulation. - Retired outdated integration tests and refined dependency configurations to optimize the build process. <!-- end of auto-generated comment: release notes by coderabbit.ai -->
## Summary
- Create a `/canary` dir for integration testing
- Migrate the `gcp` integration test script.
## Checklist
- [ ] Added Unit Tests
- [ ] Covered by existing CI
- [ ] Integration tested
- [ ] Documentation update
<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->
## Summary by CodeRabbit
- **New Features**
- Introduced new configuration files and pipelines for purchase metric
aggregations across production, development, and test environments for
both AWS and GCP.
- New README files added for canary configurations and AWS Zipline
project, providing structured guides and cloud-specific instructions.
- **Documentation**
- Released new and updated guides with cloud-specific instructions and
sample pipelines.
- **Refactor**
- Enhanced clarity and consistency in variable initialization for cloud
file uploads.
- Streamlined quickstart scripts by centralizing repository paths and
removing redundant setup steps.
- **Bug Fixes**
- Improved handling of variable initialization for cloud file uploads,
ensuring consistent behavior across different operational modes.
- Consolidated customer ID handling in upload scripts for improved
usability.
<!-- end of auto-generated comment: release notes by coderabbit.ai -->
<!-- av pr metadata
This information is embedded by the av CLI when creating PRs to track
the status of stacks when using Aviator. Please do not delete or edit
this section of the PR.
```
{"parent":"main","parentHead":"","trunk":"main"}
```
-->
---------
Co-authored-by: Thomas Chow <[email protected]>
## Summary - We want users to be able to specify write options. In the case of iceberg tables, these can be set at the table level through table properties. The chronon API already supports table properties, so let's just ride on those rails. - Users can set write distribution options through this mechanism: https://iceberg.apache.org/docs/latest/spark-writes/#writing-distribution-modes ## Checklist - [ ] Added Unit Tests - [ ] Covered by existing CI - [ ] Integration tested - [ ] Documentation update <!-- This is an auto-generated comment: release notes by coderabbit.ai --> ## Summary by CodeRabbit - **New Features** - Enabled dynamic updates of table properties during partition insertions, offering more flexible table management. - **Refactor** - Streamlined table write operations by removing legacy configuration options to simplify data handling. <!-- end of auto-generated comment: release notes by coderabbit.ai --> <!-- av pr metadata This information is embedded by the av CLI when creating PRs to track the status of stacks when using Aviator. Please do not delete or edit this section of the PR. ``` {"parent":"main","parentHead":"","trunk":"main"} ``` --> --------- Co-authored-by: Thomas Chow <[email protected]>
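Since these ride on table properties, the effect is roughly equivalent to setting Iceberg's write properties directly on the table. A PySpark sketch, assuming an Iceberg catalog named `my_catalog` is already configured; the table name and property values (hash distribution, 512 MB target file size) are placeholders.

```python
from pyspark.sql import SparkSession

# Assumes a SparkSession whose catalog `my_catalog` is already configured for Iceberg.
spark = SparkSession.builder.getOrCreate()

# Iceberg reads write behavior from table properties, so setting them once on the table
# applies to subsequent writes.
spark.sql("""
    ALTER TABLE my_catalog.db.my_table SET TBLPROPERTIES (
        'write.distribution-mode' = 'hash',
        'write.target-file-size-bytes' = '536870912'
    )
""")
```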
## Summary
## Checklist
- [ ] Added Unit Tests
- [ ] Covered by existing CI
- [ ] Integration tested
- [ ] Documentation update
<!-- av pr metadata
This information is embedded by the av CLI when creating PRs to track
the status of stacks when using Aviator. Please do not delete or edit
this section of the PR.
```
{"parent":"main","parentHead":"","trunk":"main"}
```
-->
<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->
## Summary by CodeRabbit
- **Refactor**
- Streamlined internal table data handling for improved consistency.
- Removed outdated partition processing functionality.
<!-- end of auto-generated comment: release notes by coderabbit.ai -->
Co-authored-by: Thomas Chow <[email protected]>
## Summary This updates the Push To Canary workflow to push the updated jars and wheel to the cloud canaries and run integration tests. The jobs are as follows: * **build_artifacts** builds the jars and wheel from the main branch and uploads them to github for use in separate jobs simultaneously * **push_to_aws** downloads the jars and wheel from github and uploads them to the s3 zipline-artifacts-canary bucket under version "candidate" * **push_to_gcp** downloads the jars and wheel from github and uploads them to the gcs zipline-artifacts-canary bucket under version "candidate" * **run_aws_integration_tests** runs the quickstart compile and backfill based on [scripts/distribution/run_aws_quickstart.sh](https://github.com/zipline-ai/chronon/blob/main/scripts/distribution/run_aws_quickstart.sh) * **run_gcp_integration_tests** runs the quickstart compile, backfill, group-by-upload, upload-to-kv, metadata-upload, and fetch based on [scripts/distribution/run_gcp_quickstart.sh](https://github.com/zipline-ai/chronon/blob/main/scripts/distribution/run_gcp_quickstart.sh) * **clean_up_artifacts** removes the jars and wheel from github to minimize the storage costs. * **merge_to_main-passing-tests** if the tests all pass, this will merge the latest commits to the branch "main-passing-tests". This also creates a txt file artifact marking which commit sha passed which is saved by github. * **on_fail_notify_slack** if the tests fail, this will send a notification to the "integration-tests" channel of our Slack. A test run of the workflow can be seen here: https://github.com/zipline-ai/chronon/actions/runs/13913346353 ## Checklist - [ ] Added Unit Tests - [ ] Covered by existing CI - [x] Integration tested - [ ] Documentation update <!-- This is an auto-generated comment: release notes by coderabbit.ai --> ## Summary by CodeRabbit ## Summary by CodeRabbit - **Chores** - Enhanced our release pipeline to streamline package generation and distribution. - Automated deployment of new build packages across AWS and GCP. - Introduced cleanup routines to optimize the post-deployment process. - Updated job structure for improved artifact management and deployment. - Added new jobs for building artifacts and uploading them to cloud storage, including `build_artifacts`, `push_to_aws`, `push_to_gcp`, and `clean_up_artifacts`. - Defined outputs for generated artifacts to improve tracking. - Improved flexibility in deployment scripts for different environments (dev and canary). - Updated scripts to conditionally handle resource management based on environment settings. - Introduced environment-specific configurations for AWS and GCP operations. - Added integration testing jobs for AWS and GCP post-deployment. - Enhanced scripts for AWS and GCP to support environment-specific operations and configurations. - Introduced a mechanism to switch between different environments affecting resource management and command configurations. <!-- end of auto-generated comment: release notes by coderabbit.ai -->
## Summary
Pulled initialization-related code out of the CU hot path (the `row => {...}` methods), based on flamegraph debugging.
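As a rough illustration of this kind of change (a minimal sketch, not the actual Chronon code; `Encoder`, `buildEncoder`, and `schema` are hypothetical stand-ins), hoisting initialization out of the per-row closure looks like this:
```scala
object HotPathSketch {
  // Hypothetical stand-in for expensive, init-heavy setup work.
  final class Encoder(schema: Seq[String]) {
    def encode(row: Map[String, Any]): Array[Any] =
      schema.map(k => row.getOrElse(k, null)).toArray
  }
  def buildEncoder(schema: Seq[String]): Encoder = new Encoder(schema)

  val schema: Seq[String] = Seq("user_id", "purchase_price")

  // Before: setup re-runs inside the row => {...} closure for every row.
  def transformSlow(rows: Iterator[Map[String, Any]]): Iterator[Array[Any]] =
    rows.map { row =>
      val encoder = buildEncoder(schema) // rebuilt per row; shows up in the flamegraph
      encoder.encode(row)
    }

  // After: initialization is hoisted out of the hot path and reused across rows.
  def transformFast(rows: Iterator[Map[String, Any]]): Iterator[Array[Any]] = {
    val encoder = buildEncoder(schema) // built once
    rows.map(row => encoder.encode(row))
  }
}
```
The output is identical; only the setup cost moves from once-per-row to once-per-call, which is exactly the kind of hotspot a flamegraph surfaces.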
## Checklist
- [ ] Added Unit Tests
- [X] Covered by existing CI
- [X] Integration tested
- [ ] Documentation update
<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->
## Summary by CodeRabbit
- **New Features**
- Extended validation reports now include additional row count metrics,
offering enhanced insight into data processing performance.
- **Refactor**
- Optimized transformation logic for greater efficiency and consistency.
- Improved error handling and logging for better performance tracking.
<!-- end of auto-generated comment: release notes by coderabbit.ai -->
## Summary
## Checklist
- [ ] Added Unit Tests
- [x] Covered by existing CI
- [ ] Integration tested
- [ ] Documentation update
<!-- This is an auto-generated comment: release notes by coderabbit.ai -->
## Summary by CodeRabbit
- **New Features**
- Integrated a new linting tool into the development workflow for improved code quality.
- **Refactor/Style**
- Streamlined and reorganized code structure by reordering import statements and adjusting default parameters.
- Enhanced error handling to provide clearer diagnostics without impacting functionality.
- **Chores**
- Updated configuration and build settings to standardize the codebase and CI processes.
- Added a new configuration file for the Ruff linter, specifying linting settings and rules.
<!-- end of auto-generated comment: release notes by coderabbit.ai -->
## Summary
This PR is mostly a continuation of the initial Tailwind 4 [upgrade](#430), with some additional best practices and simplification applied, and has the goal of introducing minimal visual changes. That said, there are some minor visual changes due to the removal of all the [custom](https://github.com/zipline-ai/chronon/pull/505/files#diff-8fdab0c1827a182dfaa61b1253086037a44105350409f6701c49ebe130f6705bL168-L243) font sizes. This removal helps simplify the config and usage (Tailwind already provides many [sizes](https://tailwindcss.com/docs/font-size) and [weights](https://tailwindcss.com/docs/font-weight) by default). We can still refine the sizing if desired with one-off custom sizes (e.g. `text-[13px]`), but the simplification improved readability. I already manually refined a few sizes to tighten up the design.

- Switch from `@tailwindcss/postcss` to `@tailwindcss/vite` ([recommended](https://tailwindcss.com/docs/upgrade-guide#using-vite))
- Migrate config from [legacy](https://tailwindcss.com/docs/upgrade-guide#using-a-javascript-config-file) `tailwind.config.js` (js-based) to `app.css` (css-first with directives)
- Cleanup / simplify
  - Remove unused `@tailwindcss/typography` plugin
  - Remove unused container config (no longer [supported](https://tailwindcss.com/docs/upgrade-guide#container-configuration) in v4 as well)
  - Remove use of deprecated/unsupported [safelist](https://tailwindcss.com/docs/upgrade-guide#using-a-javascript-config-file)
- Simplify job tracking styling by using css directly (remove js / tailwind indirection)
- Rename `MetadataTableSection` to `MetadataTable` and replace `MetadataTable` usage with a simple div
  - `MetadataTableSection` was the actual `Table`
  - Removes the dynamic `grid-col-{$columns}` class string, which isn't [supported](https://tailwindcss.com/docs/detecting-classes-in-source-files#dynamic-class-names) by Tailwind (especially without `safelist` support)
- Update [LayerChart](https://github.com/techniq/layerchart/pull/449/files) and [Svelte UX](techniq/svelte-ux#571) to `2.0.0@next` (improved compatibility with Tailwind 4 and [later](techniq/layerchart#458) Svelte 5)

## Screenshots
### Dark mode
Before/after screenshots (dark mode).
### Light mode
Before/after screenshots (light mode).

## Checklist
- [ ] Added Unit Tests
- [ ] Covered by existing CI
- [ ] Integration tested
- [ ] Documentation update
<!-- av pr metadata
This information is embedded by the av CLI when creating PRs to track the status of stacks when using Aviator. Please do not delete or edit this section of the PR.
```
{"parent":"main","parentHead":"","trunk":"main"}
```
-->
<!-- This is an auto-generated comment: release notes by coderabbit.ai -->
## Summary by CodeRabbit
- **New Features**
- Introduced enhanced theming with a comprehensive set of CSS custom properties for improved color schemes and typography.
- Streamlined metadata and status displays for a cleaner and more intuitive presentation.
- **Style**
- Refined UI elements including navigation bars, headers, and buttons for better readability and visual consistency.
- Updated layout adjustments across key pages, offering a more cohesive look and feel.
- **Chores**
- Upgraded dependencies and optimized build configurations while removing legacy styling tools for improved performance.
<!-- end of auto-generated comment: release notes by coderabbit.ai -->
---------
Co-authored-by: Sean Lynch <[email protected]>
## Summary
Ran the canary integration tests!
```bash
<-----------------------------------------------------------------------------------
------------------------------------------------------------------------------------
DATAPROC LOGS
------------------------------------------------------------------------------------
------------------------------------------------------------------------------------>
++ cat tmp_metadata_upload.out
++ grep 'Dataproc submitter job id'
++ cut -d ' ' -f5
+ METADATA_UPLOAD_JOB_ID=0f0a7099-3f03-4487-b291-de26f3d56cdb
+ check_dataproc_job_state 0f0a7099-3f03-4487-b291-de26f3d56cdb
+ JOB_ID=0f0a7099-3f03-4487-b291-de26f3d56cdb
+ '[' -z 0f0a7099-3f03-4487-b291-de26f3d56cdb ']'
+ echo -e '\033[0;32m <<<<<<<<<<<<<<<<-----------------JOB STATUS----------------->>>>>>>>>>>>>>>>>\033[0m'
<<<<<<<<<<<<<<<<-----------------JOB STATUS----------------->>>>>>>>>>>>>>>>>
++ gcloud dataproc jobs describe 0f0a7099-3f03-4487-b291-de26f3d56cdb --region=us-central1 --format=flattened
++ grep status.state:
+ JOB_STATE='status.state: DONE'
+ echo status.state: DONE
status.state: DONE
+ '[' -z 'status.state: DONE' ']'
+ echo -e '\033[0;32m<<<<<.....................................FETCH.....................................>>>>>\033[0m'
<<<<<.....................................FETCH.....................................>>>>>
+ touch tmp_fetch.out
+ zipline run --repo=/Users/thomaschow/zipline-ai/chronon/api/py/test/canary --mode fetch --conf=production/group_bys/gcp/purchases.v1_dev -k '{"user_id":"5"}' --name gcp.purchases.v1_dev
+ tee tmp_fetch.out
+ grep -q purchase_price_average_14d
+ cat tmp_fetch.out
+ grep purchase_price_average_14d
"purchase_price_average_14d" : 72.5,
+ '[' 0 -ne 0 ']'
+ echo -e '\033[0;32m<<<<<.....................................SUCCEEDED!!!.....................................>>>>>\033[0m'
<<<<<.....................................SUCCEEDED!!!.....................................>>>>>
```
## Checklist
- [ ] Added Unit Tests
- [ ] Covered by existing CI
- [ ] Integration tested
- [ ] Documentation update
<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->
## Summary by CodeRabbit
- **Chores**
- Upgraded several core dependency versions from 1.5.2 to 1.6.1.
- Updated artifact metadata, including revised checksums and additional
dependency entries, to enhance concurrency handling and HTTP client
support.
<!-- end of auto-generated comment: release notes by coderabbit.ai -->
<!-- av pr metadata
This information is embedded by the av CLI when creating PRs to track
the status of stacks when using Aviator. Please do not delete or edit
this section of the PR.
```
{"parent":"main","parentHead":"","trunk":"main"}
```
-->
---------
Co-authored-by: Thomas Chow <[email protected]>
## Summary
Adding stubs and schemas for orchestrator
## Checklist
- [x] Added Unit Tests
- [x] Covered by existing CI
- [x] Integration tested
- [ ] Documentation update
<!-- This is an auto-generated comment: release notes by coderabbit.ai -->
## Summary by CodeRabbit
- **New Features**
- Introduced an orchestration service with HTTP endpoints for health checks, configuration display, and diff uploads.
- Expanded branch mapping capabilities and enhanced data exchange processes.
- Added new data models and database access logic for configuration management.
- Implemented a new upload handler for managing configuration uploads and diff operations.
- Added a new `UploadHandler` class for handling uploads and diff requests.
- Introduced a new `WorkflowHandler` class for future workflow management.
- **Refactor**
- Streamlined job orchestration for join, merge, and source operations by adopting unified node-based configurations.
- Restructured metadata and persistence handling for greater clarity and efficiency.
- Updated import paths to reflect new organizational structures within the codebase.
- **Chore**
- Updated dependency integrations and package structures.
- Removed obsolete components and tests to improve overall maintainability.
- Introduced new test specifications for validating database access logic.
- Added new tests for the `NodeDao` class to ensure functionality.
<!-- end of auto-generated comment: release notes by coderabbit.ai -->
---------
Co-authored-by: ezvz <[email protected]>
Co-authored-by: Nikhil Simha <[email protected]>
Walkthrough
This update adds a new test case in the API to verify that a base64-encoded tile key written in Flink format is properly deserialized.
Actionable comments posted: 1
🧹 Nitpick comments (1)
api/src/test/scala/ai/chronon/api/test/TilingUtilSpec.scala (1)
27-32: Test only verifies non-null; it should check key properties. The test verifies that deserialization succeeds but doesn't validate the key contents. Consider assertions for dataset, keyBytes, etc.
```diff
  "TilingUtils" should "deser Flink written tile key" in {
    val base64String =
      "GChTRUFSQ0hfQkVBQ09OU19MSVNUSU5HX0FDVElPTlNfU1RSRUFNSU5HGWMCkJHL6Q0WgLq3AxaAyJ/uuWUA" // "Aiw=" // "ArL4u9gL"
    val bytes = java.util.Base64.getDecoder.decode(base64String)
    val deserializedKey = TilingUtils.deserializeTileKey(bytes)
    deserializedKey should not be null
+   deserializedKey.getDataset should be("SEARCH_BEACONS_LISTING_ACTIONS_STREAMING")
+   deserializedKey.getTileSizeMillis should be > 0L
+   deserializedKey.getTileStartTimestampMillis should be > 0L
  }
```
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro (Legacy)
📒 Files selected for processing (2)
- api/src/main/scala/ai/chronon/api/TilingUtils.scala (2 hunks)
- api/src/test/scala/ai/chronon/api/test/TilingUtilSpec.scala (1 hunks)
🧰 Additional context used
🧬 Code Definitions (1)
api/src/test/scala/ai/chronon/api/test/TilingUtilSpec.scala (1)
api/src/main/scala/ai/chronon/api/TilingUtils.scala (2)
- TilingUtils (11-35)
- deserializeTileKey (30-34)
🔇 Additional comments (4)
api/src/main/scala/ai/chronon/api/TilingUtils.scala (4)
17-19: Protocol change needs validation. Changed from binary to compact serialization; ensure compatibility with existing data.
22-24: Protocol change needs compatibility testing. Switching from binary to compact deserialization may affect reading existing data.
26-28: Serialization method update looks good. Updated to use the compact serializer.
30-34: Deserialization method update looks good. Updated to use the compact deserializer.
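For context, a rough sketch of the binary-to-compact switch these comments refer to, using plain Thrift serializers. `TileKey` is assumed to be the Thrift-generated struct returned by `TilingUtils.deserializeTileKey` (its import is omitted), and the object and method names here are illustrative, not the actual Chronon code:
```scala
import org.apache.thrift.{TDeserializer, TSerializer}
import org.apache.thrift.protocol.{TBinaryProtocol, TCompactProtocol}

object TileKeyCodecSketch {
  // Old scheme: binary protocol (what the code used before this change).
  def serializeBinary(key: TileKey): Array[Byte] =
    new TSerializer(new TBinaryProtocol.Factory()).serialize(key)

  // New scheme: compact protocol (what the reviewed change switches to).
  def serializeCompact(key: TileKey): Array[Byte] =
    new TSerializer(new TCompactProtocol.Factory()).serialize(key)

  def deserializeCompact(bytes: Array[Byte]): TileKey = {
    val key = new TileKey()
    new TDeserializer(new TCompactProtocol.Factory()).deserialize(key, bytes)
    key
  }
}
```
The compatibility concern is that bytes written with the binary protocol will not decode through a compact-protocol `TDeserializer` (and vice versa), so any tile keys already persisted under the old format would need a rewrite or a fallback read path.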