fix: use tiling flag from groupby + code re-use #538
## Summary This PR aligns the UI with the Figma design (~95% match), allowing for minor padding, color, and typography differences. ### Key Updates - **Job Status Colors:** Updated to match Figma; defined in `app.css`. - **Zipline Logo:** Replaced with the new official logo. - **Reusable Metadata Table:** Added `MetadataTable.svelte` component for flexible use. - **Support Buttons:** Adjusted behavior in the bottom-left navigation bar. - **Route Change:** Renamed `job-tracker` to `job-tracking`. ## Checklist - [ ] Added Unit Tests - [ ] Covered by existing CI - [ ] Integration tested - [ ] Documentation update <!-- This is an auto-generated comment: release notes by coderabbit.ai --> ## Summary by CodeRabbit ## Release Notes - **New Features** - Added a new `MetadataTable` component for structured data display. - Introduced new icons for improved visual navigation. - Enhanced job tracking and observability pages with modular components. - **UI/UX Improvements** - Updated favicon from PNG to SVG. - Refined styling for buttons, tabs, and status indicators. - Improved navigation bar with new feedback and support buttons. - **Styling Changes** - Revamped job status color and border styling. - Adjusted component padding and margins. - Updated color variables in CSS and Tailwind configuration. - **Component Enhancements** - Added flexibility to existing components with new props. - Removed `LogsDownloadButton` and replaced with a more generic button. - Improved metadata and job tracking displays. - Enhanced collapsible sections for better consistency. - Introduced new properties for better customization in components. <!-- end of auto-generated comment: release notes by coderabbit.ai --> --------- Co-authored-by: Sean Lynch <[email protected]>
## Summary Accept `1h`, `30d`, etc. as valid entries for the `windows` arg of Aggregation. ## Checklist - [x] Added Unit Tests - [x] Covered by existing CI - [x] Integration tested - [x] Documentation update <!-- This is an auto-generated comment: release notes by coderabbit.ai --> ## Summary by CodeRabbit ## Release Notes - **New Features** - Enhanced window specification in GroupBy functionality, now supporting string-based time window definitions (e.g., "7d", "1h"). - **Improvements** - Simplified window configuration syntax across multiple files. - Removed deprecated `Window` and `TimeUnit` class imports. - Improved code readability in various sample and test files. - **Tests** - Added new test case to validate string-based window specifications. - **Documentation** - Updated documentation examples to reflect new window specification method. <!-- end of auto-generated comment: release notes by coderabbit.ai -->
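A minimal sketch of how such shorthand window strings can be normalized into a (length, unit) pair. This is an illustrative parser only, assuming `h`/`d` suffixes; it is not the project's actual implementation.

```python
import re

# Map shorthand suffixes to time units; the unit names here are assumptions.
_UNITS = {"h": "HOURS", "d": "DAYS"}

def parse_window(spec: str):
    """Parse shorthand window specs such as '1h' or '30d' into (length, unit)."""
    match = re.fullmatch(r"(\d+)([hd])", spec.strip().lower())
    if not match:
        raise ValueError(f"Invalid window spec: {spec!r}")
    length, suffix = match.groups()
    return int(length), _UNITS[suffix]

print(parse_window("7d"))   # (7, 'DAYS')
print(parse_window("1h"))   # (1, 'HOURS')
```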
## Summary Switched over to using the layerstack format function. This should preserve functionality (and treat dates in the job tracker "as is" without converting for timezone). ## Checklist - [ ] Added Unit Tests - [ ] Covered by existing CI - [ ] Integration tested - [ ] Documentation update <!-- This is an auto-generated comment: release notes by coderabbit.ai --> ## Summary by CodeRabbit - **Changes in Date Formatting** - Removed `formatDate` function from multiple components. - Updated date formatting approach to utilize a new utility function for consistency. - Simplified date display logic, enhancing clarity in job run information. - **Impact on User Experience** - Date representations may appear slightly different. - Consistent date formatting across the application. - Minor visual changes in job tracking and observability views. <!-- end of auto-generated comment: release notes by coderabbit.ai -->
## Summary Migrated `Flink` module to Bazel from sbt. ## Checklist - [ ] Added Unit Tests - [x] Covered by existing CI - [ ] Integration tested - [ ] Documentation update <!-- This is an auto-generated comment: release notes by coderabbit.ai --> ## Summary by CodeRabbit - **New Features** - Added Scala library definitions for the main application and testing. - Introduced new Apache Flink dependencies for streaming, metrics, and client interactions. - **Chores** - Updated build configuration to support a modular Scala project structure. - Enhanced Maven repository with additional Flink-related libraries. <!-- end of auto-generated comment: release notes by coderabbit.ai -->
## Summary Eugene changed some of the job tracker colors; they are updated here ([link, they are below this view](https://www.figma.com/design/zM3rfODJTgJ6iu5zbXIPqk/%5BCS%5D-ZIpline-AI-Platform?node-id=1108-7802&m=dev)). I also noticed that Tailwind was ignoring the hover styles (weird, since it worked for me yesterday; must be a bug), so I added a regex to catch those. ## Checklist - [ ] Added Unit Tests - [ ] Covered by existing CI - [ ] Integration tested - [ ] Documentation update <!-- This is an auto-generated comment: release notes by coderabbit.ai --> ## Summary by CodeRabbit - **Style** - Updated color values for job state styles in the application - Added new CSS custom properties for job states, including waiting and invalid states - Enhanced visual representation of job status colors - **Chores** - Modified Tailwind CSS configuration to support dynamic job status border styles <!-- end of auto-generated comment: release notes by coderabbit.ai -->
… partitioned tables. (#257) … partitioned tables. ## Summary ## Checklist - [ ] Added Unit Tests - [ ] Covered by existing CI - [ ] Integration tested - [ ] Documentation update <!-- This is an auto-generated comment: release notes by coderabbit.ai --> ## Summary by CodeRabbit - **Bug Fixes** - Improved BigQuery data writing method to support partitioned tables by changing the write method from "direct" to "indirect". <!-- end of auto-generated comment: release notes by coderabbit.ai -->
## Summary Migrated the `cloud_gcp` module to Bazel from sbt. The items below are still pending and will be fixed in a separate PR: * A few tests are failing; this is actually a general problem across the project and is being worked on in parallel * The cloud_gcp_submitter target logic also needs to be migrated ## Checklist - [ ] Added Unit Tests - [x] Covered by existing CI - [ ] Integration tested - [ ] Documentation update <!-- This is an auto-generated comment: release notes by coderabbit.ai --> ## Summary by CodeRabbit ## Release Notes - **Dependency Updates** - Downgraded `rules_jvm_external` from version 6.6 to 4.5 - Updated Mockito version to 5.12.0 - Added new Google Cloud service libraries (BigQuery, Bigtable, PubSub, Dataproc) - Added Apache Curator artifact - **Build Configuration** - Updated library visibility settings - Modified logging dependencies in Spark library - Enhanced Maven repository with additional artifacts - **Development Improvements** - Expanded testing and dependency management capabilities <!-- end of auto-generated comment: release notes by coderabbit.ai -->
## Summary Migrated `service` module to Bazel from sbt. ## Checklist - [ ] Added Unit Tests - [x] Covered by existing CI - [ ] Integration tested - [ ] Documentation update <!-- This is an auto-generated comment: release notes by coderabbit.ai --> ## Summary by CodeRabbit - **Build Configuration** - Added new Bazel targets for Java library and test suite - Configured dependencies for library and testing - **Test Improvements** - Updated test code to use more concise collection initialization methods - Simplified test code structure without changing functionality - **Dependency Management** - Added Vert.x Unit testing library version 4.5.10 <!-- end of auto-generated comment: release notes by coderabbit.ai --> --------- Co-authored-by: nikhil-zlai <[email protected]> Co-authored-by: Kumar Teja Chippala <[email protected]>
## Summary Migrated `hub` module to Bazel from sbt. ## Checklist - [ ] Added Unit Tests - [x] Covered by existing CI - [ ] Integration tested - [ ] Documentation update <!-- This is an auto-generated comment: release notes by coderabbit.ai --> ## Summary by CodeRabbit ## Release Notes - **Dependency Management** - Added new dependency lists for Flink, Vert.x, and Circe libraries - Consolidated and standardized dependency management across multiple projects - Introduced new Maven artifact dependencies for Circe and Vert.x - **Build Configuration** - Refactored dependency references in multiple `BUILD.bazel` files - Updated build rules to use centralized dependency variables - Simplified library and test suite configurations - **Library Updates** - Added Scala libraries for hub project - Updated service commons library dependencies <!-- end of auto-generated comment: release notes by coderabbit.ai --> --------- Co-authored-by: nikhil-zlai <[email protected]> Co-authored-by: Kumar Teja Chippala <[email protected]>
## Summary Migrated `orchestration` module to Bazel from sbt. ## Checklist - [ ] Added Unit Tests - [x] Covered by existing CI - [ ] Integration tested - [ ] Documentation update <!-- This is an auto-generated comment: release notes by coderabbit.ai --> ## Summary by CodeRabbit - **New Features** - Added Scala library configurations for main and test code. - Introduced new Log4j dependencies for enhanced logging capabilities. - **Chores** - Updated build configuration to support Scala project structure. - Expanded Maven repository with additional logging libraries. <!-- end of auto-generated comment: release notes by coderabbit.ai -->
## Summary
eval("select ... from ... where ...") is now supported. easier for
exploration.
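A rough sketch of what a limited, error-handled SQL eval could look like on top of a SparkSession. The `eval_sql` helper, its default limit, and the temp view are illustrative assumptions, not the actual implementation.

```python
from pyspark.sql import SparkSession
from pyspark.sql.utils import AnalysisException

def eval_sql(spark: SparkSession, query: str, limit: int = 100):
    """Run an ad-hoc SQL query, returning at most `limit` rows and surfacing analysis errors."""
    try:
        return spark.sql(query).limit(limit).collect()
    except AnalysisException as e:
        print(f"Query failed to analyze: {e}")
        return []

if __name__ == "__main__":
    spark = SparkSession.builder.master("local[*]").getOrCreate()
    spark.range(5).createOrReplaceTempView("numbers")
    for row in eval_sql(spark, "select id from numbers where id > 2"):
        print(row)
```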
## Checklist
- [ ] Added Unit Tests
- [ ] Covered by existing CI
- [x] Integration tested
- [x] Documentation update
<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->
## Summary by CodeRabbit
- **Documentation**
- Updated setup instructions for Chronon Python interactive runner
- Modified environment variable paths and gateway service instructions
- **New Features**
- Added ability to specify result limit when evaluating queries
- Enhanced SQL query evaluation with error handling
- **Bug Fixes**
- Simplified interactive usage example
- Updated Spark session method in DataFrame creation
<!-- end of auto-generated comment: release notes by coderabbit.ai -->
## Summary ^^^ - Did have to add `bigtables.tables.create` permission to the dataproc service account `[email protected]` since one of the tables `CHRONON_ENTITY_BY_TEAM` didn't exist at the time. See job: https://console.cloud.google.com/dataproc/jobs/a04f9ba8-583c-475e-8956-9b53d28f3ed6/monitoring?region=us-central1&inv=1&invt=AbnIoA&project=canary-443022 cc @chewy-zlai ## Checklist - [ ] Added Unit Tests - [ ] Covered by existing CI - [ ] Integration tested - [ ] Documentation update Successful test in Dataproc: https://console.cloud.google.com/dataproc/jobs/04ff550c-4df6-47ce-a36b-3238c1cb63a1/monitoring?region=us-central1&inv=1&invt=AbnIzw&project=canary-443022 See: ``` (dev_chronon) davidhan@Davids-MacBook-Pro: ~/zipline/chronon/api/py/test/sample (davidhan/metadata_upload) $ cbt -project=canary-443022 -instance=zipline-canary-instance read CHRONON_METADATA 2025/01/17 21:44:34 -creds flag unset, will use gcloud credential ---------------------------------------- CHRONON_METADATA#purchases.v1 cf:value @ 2025/01/17-21:30:36.130000 "{\"metaData\":{\"name\":\"quickstart.purchases.v1\",\"online\":1,\"customJson\":\"{\\\"lag\\\": 0, \\\"groupby_tags\\\": null, \\\"column_tags\\\": {}}\",\"dependencies\":[\"{\\\"name\\\": \\\"wait_for_data.purchases_external_ds\\\", \\\"spec\\\": \\\"data.purchases_external/ds={{ ds }}\\\", \\\"start\\\": null, \\\"end\\\": null}\"],\"outputNamespace\":\"data\",\"team\":\"quickstart\",\"offlineSchedule\":\"@daily\"},\"sources\":[{\"events\":{\"table\":\"data.purchases_external\",\"query\":{\"selects\":{\"user_id\":\"user_id\",\"purchase_price\":\"purchase_price\"},\"timeColumn\":\"ts\",\"setups\":[]}}}],\"keyColumns\":[\"user_id\"],\"aggregations\":[{\"inputColumn\":\"purchase_price\",\"operation\":7,\"argMap\":{},\"windows\":[{\"length\":3,\"timeUnit\":1},{\"length\":14,\"timeUnit\":1},{\"length\":30,\"timeUnit\":1}]},{\"inputColumn\":\"purchase_price\",\"operation\":6,\"argMap\":{},\"windows\":[{\"length\":3,\"timeUnit\":1},{\"length\":14,\"timeUnit\":1},{\"length\":30,\"timeUnit\":1}]},{\"inputColumn\":\"purchase_price\",\"operation\":8,\"argMap\":{},\"windows\":[{\"length\":3,\"timeUnit\":1},{\"length\":14,\"timeUnit\":1},{\"length\":30,\"timeUnit\":1}]},{\"inputColumn\":\"purchase_price\",\"operation\":13,\"argMap\":{\"k\":\"10\"}}],\"backfillStartDate\":\"2023-11-20\"}" cf:value @ 2025/01/17-21:28:45.503000 "{\"metaData\":{\"name\":\"quickstart.purchases.v1\",\"online\":1,\"customJson\":\"{\\\"lag\\\": 0, \\\"groupby_tags\\\": null, \\\"column_tags\\\": {}}\",\"dependencies\":[\"{\\\"name\\\": \\\"wait_for_data.purchases_external_ds\\\", \\\"spec\\\": \\\"data.purchases_external/ds={{ ds }}\\\", \\\"start\\\": null, \\\"end\\\": 
null}\"],\"outputNamespace\":\"data\",\"team\":\"quickstart\",\"offlineSchedule\":\"@daily\"},\"sources\":[{\"events\":{\"table\":\"data.purchases_external\",\"query\":{\"selects\":{\"user_id\":\"user_id\",\"purchase_price\":\"purchase_price\"},\"timeColumn\":\"ts\",\"setups\":[]}}}],\"keyColumns\":[\"user_id\"],\"aggregations\":[{\"inputColumn\":\"purchase_price\",\"operation\":7,\"argMap\":{},\"windows\":[{\"length\":3,\"timeUnit\":1},{\"length\":14,\"timeUnit\":1},{\"length\":30,\"timeUnit\":1}]},{\"inputColumn\":\"purchase_price\",\"operation\":6,\"argMap\":{},\"windows\":[{\"length\":3,\"timeUnit\":1},{\"length\":14,\"timeUnit\":1},{\"length\":30,\"timeUnit\":1}]},{\"inputColumn\":\"purchase_price\",\"operation\":8,\"argMap\":{},\"windows\":[{\"length\":3,\"timeUnit\":1},{\"length\":14,\"timeUnit\":1},{\"length\":30,\"timeUnit\":1}]},{\"inputCol ``` ``` (dev_chronon) davidhan@Davids-MacBook-Pro: ~/zipline/chronon/api/py/test/sample (davidhan/metadata_upload) $ cbt -project=canary-443022 -instance=zipline-canary-instance read CHRONON_ENTITY_BY_TEAM 2025/01/17 21:45:43 -creds flag unset, will use gcloud credential ---------------------------------------- CHRONON_ENTITY_BY_TEAM#group_bys/quickstart cf:value @ 2025/01/17-21:30:36.621000 "cHVyY2hhc2VzLnYx" ``` <!-- This is an auto-generated comment: release notes by coderabbit.ai --> ## Summary by CodeRabbit ## Release Notes - **New Features** - Added configuration type specification option for command-line interfaces. - Enhanced metadata processing with more flexible configuration handling. - **Improvements** - Refined command execution logic for better control flow. - Updated metadata parsing to support additional configuration types. - **Documentation** - Added clarifying comments about asynchronous task behaviors in dataset and table creation processes. The release introduces more flexible configuration management and improved command-line argument handling across various components. <!-- end of auto-generated comment: release notes by coderabbit.ai -->
## Summary Input/out schema for the Job tracker and lineage API on ZiplineHub. ## Checklist - [ ] Added Unit Tests - [x] Covered by existing CI - [ ] Integration tested - [ ] Documentation update <!-- This is an auto-generated comment: release notes by coderabbit.ai --> ## Summary by CodeRabbit - **New Features** - Introduced a comprehensive job tracking system with support for tracking job execution, status, and metadata. - Added a new API endpoint for handling job type requests. - Defined structures for job tracking requests and responses, including job metadata. - Introduced enumerations for job processing direction, execution modes, and statuses. - Added a new handler for processing job tracking requests. - **Bug Fixes** - Corrected a typographical error in a comment within the orchestration Thrift file. <!-- end of auto-generated comment: release notes by coderabbit.ai --> --------- Co-authored-by: ezvz <[email protected]> Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>
## Summary
- Should be using the same reader as every other case.
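A sketch of the general idea in PySpark terms: route every table load through one helper so DataFrameReader options are applied uniformly across formats. The helper name and options are illustrative, not the actual `TableUtils` code.

```python
from typing import Optional

from pyspark.sql import DataFrame, SparkSession

# Options applied to every table load, regardless of format (Hive, Delta, Iceberg).
DEFAULT_READ_OPTIONS = {"mergeSchema": "false"}

def load_table(spark: SparkSession, table_name: str,
               extra_options: Optional[dict] = None) -> DataFrame:
    """Load a table through a single code path so reader options are always applied."""
    options = {**DEFAULT_READ_OPTIONS, **(extra_options or {})}
    return spark.read.options(**options).table(table_name)
```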
## Checklist
- [ ] Added Unit Tests
- [ ] Covered by existing CI
- [ ] Integration tested
- [ ] Documentation update
<!-- av pr metadata
This information is embedded by the av CLI when creating PRs to track
the status of stacks when using Aviator. Please do not delete or edit
this section of the PR.
```
{"parent":"main","parentHead":"","trunk":"main"}
```
-->
<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->
## Summary by CodeRabbit
- **Improvements**
- Enhanced data loading consistency for Hive, Delta, and Iceberg table
formats
- Ensured that DataFrameReader options are consistently applied when
loading tables
<!-- end of auto-generated comment: release notes by coderabbit.ai -->
Co-authored-by: Thomas Chow <[email protected]>
## Summary
- Remove the implicit dependency on FormatProvider from CreationUtils.
Step 1 toward supporting format-specific table creation.
## Checklist
- [ ] Added Unit Tests
- [ ] Covered by existing CI
- [ ] Integration tested
- [ ] Documentation update
<!-- av pr metadata
This information is embedded by the av CLI when creating PRs to track
the status of stacks when using Aviator. Please do not delete or edit
this section of the PR.
```
{"parent":"main","parentHead":"","trunk":"main"}
```
-->
<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->
## Summary by CodeRabbit
- **Code Improvements**
- Updated method signatures in `TableUtils` and `CreationUtils` to
enhance clarity and flexibility of table creation and partition
insertion processes
- Refined how file formats and table types are specified during table
operations
- Streamlined parameter handling for table creation methods
<!-- end of auto-generated comment: release notes by coderabbit.ai -->
Co-authored-by: Thomas Chow <[email protected]>
## Summary
- Set writeOptions instead of the global conf to avoid mutation and races
between threads
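An illustrative PySpark contrast of the two approaches; the option name is a placeholder, not the exact BigQuery setting touched in this change, and the BigQuery write assumes the spark-bigquery connector is on the classpath.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[*]").getOrCreate()
df = spark.createDataFrame([(1, "a")], ["id", "name"])

# Before (sketch): mutating the shared session conf, which can race across threads.
# spark.conf.set("some.connector.setting", "value")

# After (sketch): pass settings as per-write options so nothing global is mutated.
(df.write
   .format("bigquery")
   .option("some.connector.setting", "value")  # scoped to this write only
   .mode("append")
   .save("dataset.table"))
```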
## Checklist
- [ ] Added Unit Tests
- [ ] Covered by existing CI
- [ ] Integration tested
- [ ] Documentation update
<!-- av pr metadata
This information is embedded by the av CLI when creating PRs to track
the status of stacks when using Aviator. Please do not delete or edit
this section of the PR.
```
{"parent":"main","parentHead":"","trunk":"main"}
```
-->
<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->
## Summary by CodeRabbit
- **Configuration Management**
- Simplified BigQuery configuration handling in Spark session
- Removed configuration restoration logic after method execution
Note: The changes are primarily internal and do not directly impact
end-user functionality.
<!-- end of auto-generated comment: release notes by coderabbit.ai -->
Co-authored-by: Thomas Chow <[email protected]>
## Summary - Initial implementation of Python `compile` -- compiles the whole repo - Placeholder implementation of `backfill` -- next step will be to wire up to the orchestrator for implementation - Documentation for the broader user API surface area and how it should behave (should maybe move this to pydocs) Currently, the interaction looks like this: <img width="729" alt="image" src="https://github.com/user-attachments/assets/a52cf582-4394-449a-8567-32c0f108d2ed" /> Next steps: - Connect with ZiplineHub for lineage, and backfill execution ## Checklist - [ ] Added Unit Tests - [x] Covered by existing CI - [ ] Integration tested - [x] Documentation update <!-- This is an auto-generated comment: release notes by coderabbit.ai --> ## Summary by CodeRabbit - **Documentation** - Expanded CLI documentation with detailed descriptions of `plan`, `backfill`, `deploy`, and `fetch` commands. - Added comprehensive usage instructions and examples for each command. - **New Features** - Enhanced repository management functions for Chronon objects. - Added methods for compiling, uploading, and managing configuration details. - Introduced new utility functions for module and variable name retrieval. - Added functionality to compute and upload differences in a local repository. - New function to convert JSON to Thrift binary format. - **Improvements** - Streamlined function signatures in extraction and compilation modules. - Added error handling and logging improvements. - Enhanced object serialization and diff computation capabilities. - Dynamic method bindings for `backfill`, `deploy`, and `info` functions in `GroupBy` and `Join`. <!-- end of auto-generated comment: release notes by coderabbit.ai --> --------- Co-authored-by: ezvz <[email protected]> Co-authored-by: Nikhil Simha <[email protected]>
## Summary
## Checklist
- [ ] Added Unit Tests
- [ ] Covered by existing CI
- [ ] Integration tested
- [ ] Documentation update
<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->
## Summary by CodeRabbit
- **Bug Fixes**
- Improved BigQuery partition loading mechanism to use more direct
method calls when retrieving partition columns and values.
Note: The changes appear technical and primarily affect internal data
loading processes, with no direct end-user visible impact.
<!-- end of auto-generated comment: release notes by coderabbit.ai -->
<!-- av pr metadata
This information is embedded by the av CLI when creating PRs to track
the status of stacks when using Aviator. Please do not delete or edit
this section of the PR.
```
{"parent":"main","parentHead":"","trunk":"main"}
```
-->
---------
Co-authored-by: Thomas Chow <[email protected]>
## Summary
## Checklist
- [ ] Added Unit Tests
- [ ] Covered by existing CI
- [ ] Integration tested
- [ ] Documentation update
<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->
## Summary by CodeRabbit
- **Chores**
- Updated Java version to Corretto 11.0.25.9.1
- Upgraded Google Cloud SDK (gcloud) from version 504.0.1 to 507.0.0
<!-- end of auto-generated comment: release notes by coderabbit.ai -->
<!-- av pr metadata
This information is embedded by the av CLI when creating PRs to track
the status of stacks when using Aviator. Please do not delete or edit
this section of the PR.
```
{"parent":"main","parentHead":"","trunk":"main"}
```
-->
---------
Co-authored-by: Thomas Chow <[email protected]>
## Summary We have a strict dependency on Java 11 for all the Dataproc stuff, so it's good to be consistent across our project. Currently only the service_commons package has a strict dependency on Java 17, so changes were made to make it compatible with Java 11. I was able to successfully build using both sbt and Bazel. ## Checklist - [ ] Added Unit Tests - [x] Covered by existing CI - [ ] Integration tested - [ ] Documentation update <!-- This is an auto-generated comment: release notes by coderabbit.ai --> ## Summary by CodeRabbit - **Configuration** - Downgraded Java language and runtime versions from 17 to 11 in Bazel configuration - **Code Improvement** - Updated type handling in `RouteHandlerWrapper` method signature for enhanced type safety <!-- end of auto-generated comment: release notes by coderabbit.ai -->
## Summary Set up a Flink job that can take beacon data as avro (configured in gcs) and emit it at a configurable rate to Kafka. We can use this stream in our GB streaming jobs ## Checklist - [ ] Added Unit Tests - [ ] Covered by existing CI - [X] Integration tested Kicked off the job and you can see events flowing in topic [test-beacon-main](https://console.cloud.google.com/managedkafka/us-central1/clusters/zipline-kafka-cluster/topics/test-beacon-main?hl=en&invt=Abnpeg&project=canary-443022) - [ ] Documentation update <!-- This is an auto-generated comment: release notes by coderabbit.ai --> ## Summary by CodeRabbit ## Release Notes - **New Features** - Added Kafka data ingestion capabilities using Apache Flink. - Introduced a new driver for streaming events from GCS to Kafka with configurable delay. - **Dependencies** - Added Apache Flink connectors for Kafka, Avro, and file integration. - Integrated managed Kafka authentication handler for cloud environments. - **Infrastructure** - Created new project configuration for Kafka data ingestion. - Updated build settings to support advanced streaming workflows. - Updated cluster name configuration for Dataproc submitter. <!-- end of auto-generated comment: release notes by coderabbit.ai -->
## Summary
- The way tables are created can differ depending on the underlying
storage catalog we are using. In the case of BigQuery, we cannot issue a
Spark SQL statement to do this operation. Let's abstract that out into
the `Format` layer for now, but eventually we will need a `Catalog`
abstraction that supports this.
- Try to remove the dependency on a SparkSession in `Format`; invert the
dependency with a higher-order function (HOF).
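A small Python sketch of that inversion: instead of `Format` holding a SparkSession, the caller passes in a function that executes SQL. All class and method names here are illustrative stand-ins for the Scala code.

```python
from typing import Callable, List

class Format:
    """Base format: builds/receives DDL but delegates execution to a caller-supplied function."""

    def create_table(self, ddl: str, run_sql: Callable[[str], None]) -> None:
        run_sql(ddl)

class BigQueryFormat(Format):
    """BigQuery cannot be driven by Spark SQL DDL, so it would use a client call instead."""

    def create_table(self, ddl: str, run_sql: Callable[[str], None]) -> None:
        # Placeholder: a real implementation would call the BigQuery client API here.
        print(f"Creating table via BigQuery client, ignoring Spark SQL: {ddl!r}")

if __name__ == "__main__":
    executed: List[str] = []
    Format().create_table("CREATE TABLE db.t (id BIGINT)", executed.append)
    BigQueryFormat().create_table("CREATE TABLE db.t (id BIGINT)", executed.append)
    print(executed)  # only the base format ran SQL through the injected function
```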
## Checklist
- [ ] Added Unit Tests
- [ ] Covered by existing CI
- [ ] Integration tested
- [ ] Documentation update
<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->
## Summary by CodeRabbit
- **New Features**
- Enhanced BigQuery integration with new client parameter
- Added support for more flexible table creation methods
- Improved logging capabilities for table operations
- **Bug Fixes**
- Standardized table creation process across different formats
- Removed unsupported BigQuery table creation operations
- **Refactor**
- Simplified table creation and partition insertion logic
- Updated method signatures to support more comprehensive table
management
<!-- end of auto-generated comment: release notes by coderabbit.ai -->
<!-- av pr metadata
This information is embedded by the av CLI when creating PRs to track
the status of stacks when using Aviator. Please do not delete or edit
this section of the PR.
```
{"parent":"main","parentHead":"","trunk":"main"}
```
-->
---------
Co-authored-by: Thomas Chow <[email protected]>
## Summary - https://app.asana.com/0/1208949807589885/1209206040434612/f - Support explicit bigquery table creation. ## Checklist - [ ] Added Unit Tests - [ ] Covered by existing CI - [ ] Integration tested - [ ] Documentation update --- - To see the specific tasks where the Asana app for GitHub is being used, see below: - https://app.asana.com/0/0/1209206040434612 <!-- This is an auto-generated comment: release notes by coderabbit.ai --> ## Summary by CodeRabbit - **New Features** - Enhanced BigQuery table creation functionality with improved schema and partitioning support. - Streamlined table creation process in Spark's TableUtils. - **Refactor** - Simplified table existence checking logic. - Consolidated import statements for better readability. - Removed unused import in BigQuery catalog test. - Updated import statement in GcpFormatProviderTest for better integration with Spark BigQuery connector. - **Bug Fixes** - Improved error handling for table creation scenarios. <!-- end of auto-generated comment: release notes by coderabbit.ai --> <!-- av pr metadata This information is embedded by the av CLI when creating PRs to track the status of stacks when using Aviator. Please do not delete or edit this section of the PR. ``` {"parent":"main","parentHead":"","trunk":"main"} ``` --> --------- Co-authored-by: Thomas Chow <[email protected]>
…obs (#274) ## Summary Changes to avoid runtime errors while running Spark and Flink jobs on the cluster using Dataproc submit for the newly built bazel assembly jars. Successfully ran the DataprocSubmitterTest jobs using the newly built bazel jars for testing. Testing Steps 1. Build the assembly jars ``` bazel build //cloud_gcp:lib_deploy.jar bazel build //flink:assembly_deploy.jar bazel build //flink:kafka-assembly_deploy.jar ``` 2. Copy the jars to the GCP account used by our jobs ``` gsutil cp bazel-bin/cloud_gcp/lib_deploy.jar gs://zipline-jars/bazel-cloud-gcp.jar gsutil cp bazel-bin/flink/assembly_deploy.jar gs://zipline-jars/bazel-flink.jar gsutil cp bazel-bin/flink/kafka-assembly_deploy.jar gs://zipline-jars/bazel-flink-kafka.jar ``` 3. Modify the jar paths in the DataprocSubmitterTest file and run the tests ## Checklist - [ ] Added Unit Tests - [x] Covered by existing CI - [ ] Integration tested - [ ] Documentation update <!-- This is an auto-generated comment: release notes by coderabbit.ai --> ## Summary by CodeRabbit ## Release Notes - **New Features** - Added Kafka and Avro support for Flink. - Introduced new Flink assembly and Kafka assembly binaries. - **Dependency Management** - Updated Maven artifact dependencies, including Kafka and Avro. - Excluded specific Hadoop and Spark-related artifacts to prevent runtime conflicts. - **Build Configuration** - Enhanced build rules for Flink and Spark environments. - Improved dependency management to prevent class compatibility issues. <!-- end of auto-generated comment: release notes by coderabbit.ai -->
## Summary
The string formatting here was breaking zipline commands when I built a
new wheel:
```
File "/Users/davidhan/zipline/chronon/dev_chronon/lib/python3.11/site-packages/ai/chronon/repo/hub_uploader.py", line 73
print(f"\n\nUploading:\n {"\n".join(diffed_entities.keys())}")
^
SyntaxError: unexpected character after line continuation character
```
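On Python versions before 3.12, an f-string cannot reuse the same quote character inside its expression, which is what triggers the error above. One fix that works (the actual patch may differ) is to build the joined string first:

```python
diffed_entities = {"team.group_by.v1": {}, "team.join.v1": {}}  # example input

# Joining outside the f-string avoids nesting double quotes inside a double-quoted f-string.
entity_names = "\n".join(diffed_entities.keys())
print(f"\n\nUploading:\n {entity_names}")
```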
## Checklist
- [ ] Added Unit Tests
- [ ] Covered by existing CI
- [ ] Integration tested
- [ ] Documentation update
<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->
## Summary by CodeRabbit
- **Style**
- Improved string formatting and error message construction for better
code readability
- Enhanced log message clarity in upload process
Note: These changes are internal improvements that do not impact
end-user functionality.
<!-- end of auto-generated comment: release notes by coderabbit.ai -->
## Summary - With #263 we control table creation ourselves. We don't need to rely on indirect writes to then do the table creation (and partitioning) for us, we just simply use the storage API to write directly into the table we created. This should be much more performant and preferred over indirect writes because we don't need to stage data, then load as a temp BQ table, and it uses the BigQuery storage API directly. - Remove configs that are used only for indirect writes ## Checklist - [ ] Added Unit Tests - [ ] Covered by existing CI - [ ] Integration tested - [ ] Documentation update <!-- This is an auto-generated comment: release notes by coderabbit.ai --> ## Summary by CodeRabbit ## Release Notes - **Improvements** - Enhanced BigQuery data writing process with more precise configuration options. - Simplified table creation and partition insertion logic. - Improved handling of DataFrame column arrangements during data operations. - **Changes** - Updated BigQuery write method to use a direct writing approach. - Introduced a new option to prevent table creation if it does not exist. - Modified table creation process to be more format-aware. - Streamlined partition insertion mechanism. These updates improve data management and writing efficiency in cloud data processing workflows. <!-- end of auto-generated comment: release notes by coderabbit.ai --> <!-- av pr metadata This information is embedded by the av CLI when creating PRs to track the status of stacks when using Aviator. Please do not delete or edit this section of the PR. ``` {"parent":"main","parentHead":"","trunk":"main"} ``` --> --------- Co-authored-by: Thomas Chow <[email protected]>
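For reference, the spark-bigquery connector selects the write path via its `writeMethod` option. A hedged PySpark sketch of a direct write into a pre-created table; the dataset/table names are placeholders and `df` is assumed to be an existing DataFrame with the connector on the classpath.

```python
(df.write
   .format("bigquery")
   .option("writeMethod", "direct")              # write through the BigQuery Storage API
   .option("createDisposition", "CREATE_NEVER")  # table was already created explicitly
   .mode("append")
   .save("my_dataset.my_table"))
```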
## Summary
- Pull alter table properties functionality into `Format`
## Checklist
- [ ] Added Unit Tests
- [ ] Covered by existing CI
- [ ] Integration tested
- [ ] Documentation update
<!-- av pr metadata
This information is embedded by the av CLI when creating PRs to track
the status of stacks when using Aviator. Please do not delete or edit
this section of the PR.
```
{"parent":"main","parentHead":"","trunk":"main"}
```
-->
<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->
## Summary by CodeRabbit
- **New Features**
- Added a placeholder method for altering table properties in BigQuery
format
- Introduced a new method to modify table properties across different
Spark formats
- Enhanced table creation utility to use format-specific property
alteration methods
- **Refactor**
- Improved table creation process by abstracting table property
modifications
- Standardized approach to handling table property changes across
different formats
<!-- end of auto-generated comment: release notes by coderabbit.ai -->
---------
Co-authored-by: Thomas Chow <[email protected]>
## Summary
## Checklist
- [ ] Added Unit Tests
- [ ] Covered by existing CI
- [ ] Integration tested
- [ ] Documentation update
<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->
## Summary by CodeRabbit
- **New Features**
- Added configurable repartitioning option for DataFrame writes.
- Introduced a new configuration setting to control repartitioning
behavior.
- Enhanced test suite with functionality to handle empty DataFrames.
- **Chores**
- Improved code formatting and logging for DataFrame writing process.
<!-- end of auto-generated comment: release notes by coderabbit.ai -->
<!-- av pr metadata
This information is embedded by the av CLI when creating PRs to track
the status of stacks when using Aviator. Please do not delete or edit
this section of the PR.
```
{"parent":"main","parentHead":"","trunk":"main"}
```
-->
---------
Co-authored-by: tchow-zlai <[email protected]>
Co-authored-by: Thomas Chow <[email protected]>
) ## Summary we cannot represent absent data in time series with null values because the thrift serializer doesn't allow nulls in list<double> - so we create a magic double value (based on string "chronon") to represent nulls ## Checklist - [x] Added Unit Tests - [ ] Covered by existing CI - [ ] Integration tested - [ ] Documentation update <!-- This is an auto-generated comment: release notes by coderabbit.ai --> ## Summary by CodeRabbit ## Release Notes - **New Features** - Introduced a new constant `magicNullDouble` for handling null value representations - **Bug Fixes** - Improved null value handling in serialization and data processing workflows - **Tests** - Added new test case for `TileDriftSeries` serialization to validate null value management The changes enhance data processing consistency and provide a standardized approach to managing null values across the system. <!-- end of auto-generated comment: release notes by coderabbit.ai -->
## Summary This fixes an issue where Infinity/NaN values in drift calculations were causing JSON parse errors in the frontend. These special values are now converted to our standard magic null value (-2.7980863399423856E16) before serialization. ## Checklist - [x] Added Unit Tests - [ ] Covered by existing CI - [ ] Integration tested - [ ] Documentation update <!-- This is an auto-generated comment: release notes by coderabbit.ai --> ## Summary by CodeRabbit - **Bug Fixes** - Enhanced error handling for special double values (infinity and NaN) in data processing - Improved serialization test case to handle null and special values more robustly - **Tests** - Updated test case for `TileDriftSeries` serialization to cover edge cases with special double values <!-- end of auto-generated comment: release notes by coderabbit.ai -->
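A simplified Python sketch of the sanitization step, using the magic value quoted above; the real logic lives on the Scala side and may differ in detail.

```python
import math

MAGIC_NULL_DOUBLE = -2.7980863399423856e16  # sentinel standing in for absent values

def sanitize(values):
    """Replace None, NaN, and +/-Infinity with the sentinel so the series serializes cleanly."""
    return [
        MAGIC_NULL_DOUBLE if v is None or math.isnan(v) or math.isinf(v) else v
        for v in values
    ]

print(sanitize([1.5, None, float("nan"), float("inf"), 2.0]))
```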
## Summary ## Checklist - [ ] Added Unit Tests - [ ] Covered by existing CI - [ ] Integration tested - [ ] Documentation update <!-- This is an auto-generated comment: release notes by coderabbit.ai --> ## Summary by CodeRabbit - **Chores** - Revised the upload process so that release artifacts are now organized into distinct directories based on their type. This update separates application packages into dedicated folders, ensuring a more orderly and reliable distribution of the latest releases. <!-- end of auto-generated comment: release notes by coderabbit.ai -->
## Summary ## Checklist - [ ] Added Unit Tests - [ ] Covered by existing CI - [ ] Integration tested - [ ] Documentation update <!-- This is an auto-generated comment: release notes by coderabbit.ai --> ## Summary by CodeRabbit - **New Features** - Improved cloud artifact deployment now ensures that release items are available on both legacy and updated storage paths for enhanced reliability. - Enhanced data processing now dynamically determines partition configurations and provides clear notifications when sub-partition filtering isn’t supported. <!-- end of auto-generated comment: release notes by coderabbit.ai -->
## Summary
```bash
➜ chronon git:(tchow/zipline-init) ✗ zipline init
Warning: /Users/thomaschow/zipline-ai/chronon/zipline is not empty. Proceed? [y/N]: Y
Generating scaffolding at /Users/thomaschow/zipline-ai/chronon/zipline...
Project scaffolding created successfully! 🎉
```
## Checklist
- [ ] Added Unit Tests
- [ ] Covered by existing CI
- [ ] Integration tested
- [ ] Documentation update
<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->
## Summary by CodeRabbit
- **New Features**
- Introduced a new CLI tool for safely initializing and scaffolding
projects with user confirmation to avoid accidental overwrites.
- Integrated the new command into the main interface for streamlined
project setup.
- Added a configuration file for custom sample team join and aggregation
settings.
- **Chores**
- Updated dependency management by reintroducing essential packages and
adding new ones.
- Revised package metadata and version requirements to support a newer
Python version.
<!-- end of auto-generated comment: release notes by coderabbit.ai -->
<!-- av pr metadata
This information is embedded by the av CLI when creating PRs to track
the status of stacks when using Aviator. Please do not delete or edit
this section of the PR.
```
{"parent":"main","parentHead":"","trunk":"main"}
```
-->
---------
Co-authored-by: Thomas Chow <[email protected]>
## Summary
- Do not use the `DelegatingBigQueryMetastoreCatalog` as the session
catalog; just have it be a custom catalog.
This will change the configuration set from:
```bash
spark.sql.catalog.spark_catalog.warehouse: "gs://zipline-warehouse-etsy/data/tables/"
spark.sql.catalog.spark_catalog.gcp_location: "us"
spark.sql.catalog.spark_catalog.gcp_project: "etsy-zipline-dev"
spark.sql.catalog.spark_catalog.catalog-impl: org.apache.iceberg.gcp.bigquery.BigQueryMetastoreCatalog
spark.sql.catalog.spark_catalog: ai.chronon.integrations.cloud_gcp.DelegatingBigQueryMetastoreCatalog
spark.sql.catalog.spark_catalog.io-impl: org.apache.iceberg.io.ResolvingFileIO
spark.sql.catalog.default_iceberg: ai.chronon.integrations.cloud_gcp.DelegatingBigQueryMetastoreCatalog
spark.sql.catalog.default_iceberg.catalog-impl: org.apache.iceberg.gcp.bigquery.BigQueryMetastoreCatalog
spark.sql.catalog.default_iceberg.io-impl: org.apache.iceberg.io.ResolvingFileIO
spark.sql.catalog.default_iceberg.warehouse: "gs://zipline-warehouse-etsy/data/tables/"
spark.sql.catalog.default_iceberg.gcp_location: "us"
spark.sql.catalog.default_iceberg.gcp_project: "etsy-zipline-dev"
spark.sql.defaultUrlStreamHandlerFactory.enabled: "false"
spark.kryo.registrator: "ai.chronon.integrations.cloud_gcp.ChrononIcebergKryoRegistrator"
```
to:
```bash
spark.sql.defaultCatalog: "default_iceberg"
spark.sql.catalog.default_iceberg: "ai.chronon.integrations.cloud_gcp.DelegatingBigQueryMetastoreCatalog"
spark.sql.catalog.default_iceberg.catalog-impl: "org.apache.iceberg.gcp.bigquery.BigQueryMetastoreCatalog"
spark.sql.catalog.default_iceberg.io-impl: "org.apache.iceberg.io.ResolvingFileIO"
spark.sql.catalog.default_iceberg.warehouse: "gs://zipline-warehouse-etsy/data/tables/"
spark.sql.catalog.default_iceberg.gcp_location: "us"
spark.sql.catalog.default_iceberg.gcp_project: "etsy-zipline-dev"
spark.sql.defaultUrlStreamHandlerFactory.enabled: "false"
spark.kryo.registrator: "ai.chronon.integrations.cloud_gcp.ChrononIcebergKryoRegistrator"
spark.sql.catalog.default_bigquery: "ai.chronon.integrations.cloud_gcp.DelegatingBigQueryMetastoreCatalog"
```
## Checklist
- [ ] Added Unit Tests
- [ ] Covered by existing CI
- [ ] Integration tested
- [ ] Documentation update
<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->
## Summary by CodeRabbit
- **Refactor**
- Improved internal table processing by restructuring class integrations
and enhancing error messaging when a table isn’t found.
- **Tests**
- Updated integration settings and adjusted reference parameters to
ensure validations remain aligned with the new catalog implementation.
<!-- end of auto-generated comment: release notes by coderabbit.ai -->
<!-- av pr metadata
This information is embedded by the av CLI when creating PRs to track
the status of stacks when using Aviator. Please do not delete or edit
this section of the PR.
```
{"parent":"main","parentHead":"","trunk":"main"}
```
-->
---------
Co-authored-by: Thomas Chow <[email protected]>
…493) ## Summary Adding function stubs and thrift schemas for orchestration with column level lineage. ## Checklist - [ ] Added Unit Tests - [ ] Covered by existing CI - [ ] Integration tested - [ ] Documentation update <!-- This is an auto-generated comment: release notes by coderabbit.ai --> ## Summary by CodeRabbit - **New Features** - Streamlined CLI operations with a simplified backfill command and removal of an obsolete synchronization command. - Enhanced configuration processing that now automatically compiles and returns results. - Expanded physical planning capabilities with new mechanisms for managing computation graphs. - Extended tracking of data transformations through improved column lineage support in metadata. - Introduction of new classes and methods for managing physical nodes and graphs, enhancing data processing workflows. - New interface for orchestrators to handle configuration fetching and uploading. - New structures for detailed column lineage tracking throughout data processing stages. - Added support for generating Python code from new Thrift definitions in the testing workflow. - **Bug Fixes** - Improved documentation for clarity on the functionality and logic of certain fields. <!-- end of auto-generated comment: release notes by coderabbit.ai --> --------- Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com> Co-authored-by: ezvz <[email protected]>
## Summary Initial skeleton implementation for the execution layer with the following 1. Custom Thrift payload converter for our inputs/outputs 2. NodeExecutionWorkflow implementation which utilizes NodeRunnerActivity implementation to recursively trigger and wait for dependencies before submitting the job for the node 3. NodeExecutionWorker and NodeExecutionWorkflowRunner implementation for local testing (this can be refactored and modified for the actual production deployment) 4. Added basic unit tests for the DagExecution workflow and will add more in future iterations ## Checklist - [x] Added Unit Tests - [ ] Covered by existing CI - [ ] Integration tested - [ ] Documentation update <!-- This is an auto-generated comment: release notes by coderabbit.ai --> ## Summary by CodeRabbit - **New Features** - Introduced a new orchestration system for executing node-based operations with dependency management, including enhanced worker and runner components. - Added a new `DummyNode` structure for improved data representation in workflows. - Implemented a `ThriftPayloadConverter` for custom serialization and deserialization of Thrift objects in workflows. - Developed a `NodeExecutionWorkflow` for managing the execution of nodes within a Directed Acyclic Graph (DAG). - Created a `NodeExecutionActivity` interface for handling node dependencies and job submissions. - Added a `TaskQueue` trait to facilitate load balancing and independent scaling of task queues. - Introduced a `WorkflowOperations` interface for managing workflow operations and statuses. - **Documentation** - Added a README.md file detailing the orchestration module's functionalities and usage instructions. - **Tests** - Expanded test coverage for workflow executions, activity handling, and serialization, ensuring improved reliability. - Introduced new unit and integration tests for `NodeExecutionWorkflow`, `NodeExecutionActivity`, and `ThriftPayloadConverter`. - **Chores** - Streamlined build configurations and updated dependency management for enhanced performance and maintainability. <!-- end of auto-generated comment: release notes by coderabbit.ai --> --------- Co-authored-by: david-zlai <[email protected]> Co-authored-by: chewy-zlai <[email protected]> Co-authored-by: Nikhil Simha <[email protected]> Co-authored-by: Ken Morton <[email protected]> Co-authored-by: Sean Lynch <[email protected]> Co-authored-by: Sean Lynch <[email protected]> Co-authored-by: Piyush Narang <[email protected]> Co-authored-by: Thomas Chow <[email protected]> Co-authored-by: Thomas Chow <[email protected]>
…#489) ## Summary ## Checklist - [ ] Added Unit Tests - [ ] Covered by existing CI - [ ] Integration tested - [ ] Documentation update <!-- This is an auto-generated comment: release notes by coderabbit.ai --> ## Summary by CodeRabbit - **Refactor** - Updated the method for loading data in tests to enhance consistency and abstraction across various cases. - **Style** - Reorganized import statements within tests for improved clarity and maintainability. *Note: These behind‑the‑scenes improvements bolster overall system quality without affecting end‑user functionality.* <!-- end of auto-generated comment: release notes by coderabbit.ai --> <!-- av pr metadata This information is embedded by the av CLI when creating PRs to track the status of stacks when using Aviator. Please do not delete or edit this section of the PR. ``` {"parent":"main","parentHead":"","trunk":"main"} ``` --> Co-authored-by: Thomas Chow <[email protected]>
… exec nodes (#515) ## Summary The updated Etsy listing actions GroupBy consists of a select which has "explode(transform(...))". When we run that through our Validation Flink job we see no matches. This is due to the explode + transform resulting in spark splitting up the whole stage code gen into multiple stages (and one of the stages being a 'GenerateExec' node for the explode). This PR reworks our CU code quite a bit to support this. * With these changes we consistently do see the validation upfront succeeding * Added a set of tests that mimic the listing actions behavior in terms of avro shape, mock data examples and confirm they pass * There are some perf issues still - we see that we need higher //ism to keep up with the kafka stream (need 96). This is something we can tune in a follow up. ## Checklist - [X] Added Unit Tests - [ ] Covered by existing CI - [X] Integration tested - [ ] Documentation update <!-- This is an auto-generated comment: release notes by coderabbit.ai --> ## Summary by CodeRabbit - **New Features** - Introduced a new utility for Avro data encoding and deserialization. - Added a test suite for validating complex Avro payload processing. - **Refactor** - Streamlined data processing and transformation logic for improved performance and maintainability. - Enhanced logging for execution plan processing. - **Tests** - Expanded test coverage for Avro deserialization and SQL transformation scenarios. - Added new test cases for list and array containers. - **Chores** - Made minor formatting updates for better code clarity. - Deleted obsolete helper methods. <!-- end of auto-generated comment: release notes by coderabbit.ai --> --------- Co-authored-by: Nikhil Simha <[email protected]>
## Summary
Our wheel seems to be broken after we added the lineage.thrift file -
```
zipline compile --input_path group_bys/search/beacons/listing.py
Traceback (most recent call last):
File "/Users/pnarang/workspace/zipline/dev_chronon/bin/zipline", line 5, in <module>
from ai.chronon.repo.zipline import zipline
File "/Users/pnarang/workspace/zipline/dev_chronon/lib/python3.11/site-packages/ai/chronon/repo/__init__.py", line 15, in <module>
from ai.chronon.api.ttypes import GroupBy, Join, StagingQuery, Model
File "/Users/pnarang/workspace/zipline/dev_chronon/lib/python3.11/site-packages/ai/chronon/api/ttypes.py", line 16, in <module>
import ai.chronon.lineage.ttypes
ModuleNotFoundError: No module named 'ai.chronon.lineage'
```
## Checklist
- [ ] Added Unit Tests
- [ ] Covered by existing CI
- [ ] Integration tested
- [ ] Documentation update
<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->
## Summary by CodeRabbit
- **New Features**
- Enhanced the automated generation process to support all Thrift files
in the `api/thrift/` directory, improving efficiency.
- Updated the build routine to streamline the generation of Thrift
files, ensuring consistency and smooth integration.
- Consolidated ignored directory paths in the version control system for
better management.
- **Bug Fixes**
- Removed the deprecated script responsible for generating Python Thrift
files, eliminating associated issues.
<!-- end of auto-generated comment: release notes by coderabbit.ai -->
## Summary
## Checklist
- [ ] Added Unit Tests
- [ ] Covered by existing CI
- [ ] Integration tested
- [ ] Documentation update
<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->
## Summary by CodeRabbit
- **New Features**
- Enhanced data write configuration by adding options for distribution
mode (none) and target file size (512 MB) during the write operation.
<!-- end of auto-generated comment: release notes by coderabbit.ai -->
<!-- av pr metadata
This information is embedded by the av CLI when creating PRs to track
the status of stacks when using Aviator. Please do not delete or edit
this section of the PR.
```
{"parent":"main","parentHead":"","trunk":"main"}
```
-->
---------
Co-authored-by: Thomas Chow <[email protected]>
## Summary This can help us dig into why our Flink jobs are slow. [Flink docs](https://nightlies.apache.org/flink/flink-docs-master/docs/ops/debugging/flame_graphs/) ## Checklist - [ ] Added Unit Tests - [ ] Covered by existing CI - [X] Integration tested - Tested by running on Etsy side - [ ] Documentation update <!-- This is an auto-generated comment: release notes by coderabbit.ai --> ## Summary by CodeRabbit - **New Features** - Enhanced job monitoring through the addition of flame graph generation for improved REST API insights. <!-- end of auto-generated comment: release notes by coderabbit.ai -->
## Summary Simplified tests to only have unit tests using PostgreSQL test containers. Deleted the integration tests, as they were not truly integration tests and we had to use a slightly complicated mix-in composition to remove duplicate code. Ideally we would have used a test container that starts a PGAdapter connection to the Spanner emulator, but that implementation doesn't exist yet, and creating our own was challenging: the Docker image for local PGAdapter doesn't have good multi-platform support (it only works on Linux and fails on Mac), and we had trouble making the tests work on all platforms. ## Checklist - [ ] Added Unit Tests - [x] Covered by existing CI - [ ] Integration tested - [ ] Documentation update <!-- This is an auto-generated comment: release notes by coderabbit.ai --> ## Summary by CodeRabbit - **New Features** - Introduced enhanced batch processing for managing dependency relationships, strengthening orchestration performance. - **Chores** - Streamlined testing infrastructure for improved database management and tighter code encapsulation. - Retired outdated integration tests and refined dependency configurations to optimize the build process. <!-- end of auto-generated comment: release notes by coderabbit.ai -->
## Summary
- Create a `/canary` dir for integration testing
- Migrate the `gcp` integration test script.
## Checklist
- [ ] Added Unit Tests
- [ ] Covered by existing CI
- [ ] Integration tested
- [ ] Documentation update
<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->
## Summary by CodeRabbit
- **New Features**
- Introduced new configuration files and pipelines for purchase metric
aggregations across production, development, and test environments for
both AWS and GCP.
- New README files added for canary configurations and AWS Zipline
project, providing structured guides and cloud-specific instructions.
- **Documentation**
- Released new and updated guides with cloud-specific instructions and
sample pipelines.
- **Refactor**
- Enhanced clarity and consistency in variable initialization for cloud
file uploads.
- Streamlined quickstart scripts by centralizing repository paths and
removing redundant setup steps.
- **Bug Fixes**
- Improved handling of variable initialization for cloud file uploads,
ensuring consistent behavior across different operational modes.
- Consolidated customer ID handling in upload scripts for improved
usability.
<!-- end of auto-generated comment: release notes by coderabbit.ai -->
<!-- av pr metadata
This information is embedded by the av CLI when creating PRs to track
the status of stacks when using Aviator. Please do not delete or edit
this section of the PR.
```
{"parent":"main","parentHead":"","trunk":"main"}
```
-->
---------
Co-authored-by: Thomas Chow <[email protected]>
## Summary - We want users to be able to specify write options. In the case of iceberg tables, these can be set at the table level through table properties. The chronon API already supports table properties, so let's just ride on those rails. - Users can set write distribution options through this mechanism: https://iceberg.apache.org/docs/latest/spark-writes/#writing-distribution-modes ## Checklist - [ ] Added Unit Tests - [ ] Covered by existing CI - [ ] Integration tested - [ ] Documentation update <!-- This is an auto-generated comment: release notes by coderabbit.ai --> ## Summary by CodeRabbit - **New Features** - Enabled dynamic updates of table properties during partition insertions, offering more flexible table management. - **Refactor** - Streamlined table write operations by removing legacy configuration options to simplify data handling. <!-- end of auto-generated comment: release notes by coderabbit.ai --> <!-- av pr metadata This information is embedded by the av CLI when creating PRs to track the status of stacks when using Aviator. Please do not delete or edit this section of the PR. ``` {"parent":"main","parentHead":"","trunk":"main"} ``` --> --------- Co-authored-by: Thomas Chow <[email protected]>
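Since these ride on table properties, the effect is roughly equivalent to setting Iceberg's write properties directly on the table. A PySpark sketch, assuming an Iceberg catalog named `my_catalog` is already configured; the table name and property values (hash distribution, 512 MB target file size) are placeholders.

```python
from pyspark.sql import SparkSession

# Assumes a SparkSession whose catalog `my_catalog` is already configured for Iceberg.
spark = SparkSession.builder.getOrCreate()

# Iceberg reads write behavior from table properties, so setting them once on the table
# applies to subsequent writes.
spark.sql("""
    ALTER TABLE my_catalog.db.my_table SET TBLPROPERTIES (
        'write.distribution-mode' = 'hash',
        'write.target-file-size-bytes' = '536870912'
    )
""")
```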
## Summary
## Checklist
- [ ] Added Unit Tests
- [ ] Covered by existing CI
- [ ] Integration tested
- [ ] Documentation update
<!-- av pr metadata
This information is embedded by the av CLI when creating PRs to track
the status of stacks when using Aviator. Please do not delete or edit
this section of the PR.
```
{"parent":"main","parentHead":"","trunk":"main"}
```
-->
<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->
## Summary by CodeRabbit
- **Refactor**
- Streamlined internal table data handling for improved consistency.
- Removed outdated partition processing functionality.
<!-- end of auto-generated comment: release notes by coderabbit.ai -->
Co-authored-by: Thomas Chow <[email protected]>
## Summary This updates the Push To Canary workflow to push the updated jars and wheel to the cloud canaries and run integration tests. The jobs are as follows: * **build_artifacts** builds the jars and wheel from the main branch and uploads them to github for use in separate jobs simultaneously * **push_to_aws** downloads the jars and wheel from github and uploads them to the s3 zipline-artifacts-canary bucket under version "candidate" * **push_to_gcp** downloads the jars and wheel from github and uploads them to the gcs zipline-artifacts-canary bucket under version "candidate" * **run_aws_integration_tests** runs the quickstart compile and backfill based on [scripts/distribution/run_aws_quickstart.sh](https://github.com/zipline-ai/chronon/blob/main/scripts/distribution/run_aws_quickstart.sh) * **run_gcp_integration_tests** runs the quickstart compile, backfill, group-by-upload, upload-to-kv, metadata-upload, and fetch based on [scripts/distribution/run_gcp_quickstart.sh](https://github.com/zipline-ai/chronon/blob/main/scripts/distribution/run_gcp_quickstart.sh) * **clean_up_artifacts** removes the jars and wheel from github to minimize the storage costs. * **merge_to_main-passing-tests** if the tests all pass, this will merge the latest commits to the branch "main-passing-tests". This also creates a txt file artifact marking which commit sha passed which is saved by github. * **on_fail_notify_slack** if the tests fail, this will send a notification to the "integration-tests" channel of our Slack. A test run of the workflow can be seen here: https://github.com/zipline-ai/chronon/actions/runs/13913346353 ## Checklist - [ ] Added Unit Tests - [ ] Covered by existing CI - [x] Integration tested - [ ] Documentation update <!-- This is an auto-generated comment: release notes by coderabbit.ai --> ## Summary by CodeRabbit ## Summary by CodeRabbit - **Chores** - Enhanced our release pipeline to streamline package generation and distribution. - Automated deployment of new build packages across AWS and GCP. - Introduced cleanup routines to optimize the post-deployment process. - Updated job structure for improved artifact management and deployment. - Added new jobs for building artifacts and uploading them to cloud storage, including `build_artifacts`, `push_to_aws`, `push_to_gcp`, and `clean_up_artifacts`. - Defined outputs for generated artifacts to improve tracking. - Improved flexibility in deployment scripts for different environments (dev and canary). - Updated scripts to conditionally handle resource management based on environment settings. - Introduced environment-specific configurations for AWS and GCP operations. - Added integration testing jobs for AWS and GCP post-deployment. - Enhanced scripts for AWS and GCP to support environment-specific operations and configurations. - Introduced a mechanism to switch between different environments affecting resource management and command configurations. <!-- end of auto-generated comment: release notes by coderabbit.ai -->
## Summary
Pulled initialization-related code out of the CU hot path (the `row => {...}` methods), based on flamegraph debugging.
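As a rough illustration of this kind of change (a minimal sketch, not the actual Chronon code; `Encoder`, `buildEncoder`, and `schema` are hypothetical stand-ins), hoisting initialization out of the per-row closure looks like this:
```scala
object HotPathSketch {
  // Hypothetical stand-in for expensive, init-heavy setup work.
  final class Encoder(schema: Seq[String]) {
    def encode(row: Map[String, Any]): Array[Any] =
      schema.map(k => row.getOrElse(k, null)).toArray
  }
  def buildEncoder(schema: Seq[String]): Encoder = new Encoder(schema)

  val schema: Seq[String] = Seq("user_id", "purchase_price")

  // Before: setup re-runs inside the row => {...} closure for every row.
  def transformSlow(rows: Iterator[Map[String, Any]]): Iterator[Array[Any]] =
    rows.map { row =>
      val encoder = buildEncoder(schema) // rebuilt per row; shows up in the flamegraph
      encoder.encode(row)
    }

  // After: initialization is hoisted out of the hot path and reused across rows.
  def transformFast(rows: Iterator[Map[String, Any]]): Iterator[Array[Any]] = {
    val encoder = buildEncoder(schema) // built once
    rows.map(row => encoder.encode(row))
  }
}
```
The output is identical; only the setup cost moves from once-per-row to once-per-call, which is exactly the kind of hotspot a flamegraph surfaces.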
## Checklist
- [ ] Added Unit Tests
- [X] Covered by existing CI
- [X] Integration tested
- [ ] Documentation update
<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->
## Summary by CodeRabbit
- **New Features**
- Extended validation reports now include additional row count metrics,
offering enhanced insight into data processing performance.
- **Refactor**
- Optimized transformation logic for greater efficiency and consistency.
- Improved error handling and logging for better performance tracking.
<!-- end of auto-generated comment: release notes by coderabbit.ai -->
## Summary
## Checklist
- [ ] Added Unit Tests
- [x] Covered by existing CI
- [ ] Integration tested
- [ ] Documentation update
<!-- This is an auto-generated comment: release notes by coderabbit.ai -->
## Summary by CodeRabbit
- **New Features**
- Integrated a new linting tool into the development workflow for improved code quality.
- **Refactor/Style**
- Streamlined and reorganized code structure by reordering import statements and adjusting default parameters.
- Enhanced error handling to provide clearer diagnostics without impacting functionality.
- **Chores**
- Updated configuration and build settings to standardize the codebase and CI processes.
- Added a new configuration file for the Ruff linter, specifying linting settings and rules.
<!-- end of auto-generated comment: release notes by coderabbit.ai -->
## Summary
This PR is mostly a continuation of the initial Tailwind 4 [upgrade](#430), with some additional best practices and simplification applied, and has the goal of introducing minimal visual changes. That said, there are some minor visual changes due to the removal of all the [custom](https://github.com/zipline-ai/chronon/pull/505/files#diff-8fdab0c1827a182dfaa61b1253086037a44105350409f6701c49ebe130f6705bL168-L243) font sizes. This removal helps simplify the config and usage (Tailwind already provides many [sizes](https://tailwindcss.com/docs/font-size) and [weights](https://tailwindcss.com/docs/font-weight) by default). We can still refine the sizing if desired with one-off custom sizes (e.g. `text-[13px]`), but the simplification improved readability. I already manually refined a few sizes to tighten up the design.

- Switch from `@tailwindcss/postcss` to `@tailwindcss/vite` ([recommended](https://tailwindcss.com/docs/upgrade-guide#using-vite))
- Migrate config from [legacy](https://tailwindcss.com/docs/upgrade-guide#using-a-javascript-config-file) `tailwind.config.js` (js-based) to `app.css` (css-first with directives)
- Cleanup / simplify
  - Remove unused `@tailwindcss/typography` plugin
  - Remove unused container config (no longer [supported](https://tailwindcss.com/docs/upgrade-guide#container-configuration) in v4 as well)
  - Remove use of deprecated/unsupported [safelist](https://tailwindcss.com/docs/upgrade-guide#using-a-javascript-config-file)
- Simplify job tracking styling by using css directly (remove js / tailwind indirection)
- Rename `MetadataTableSection` to `MetadataTable` and replace `MetadataTable` usage with a simple div
  - `MetadataTableSection` was the actual `Table`
  - Removes the dynamic `grid-col-{$columns}` class string, which isn't [supported](https://tailwindcss.com/docs/detecting-classes-in-source-files#dynamic-class-names) by Tailwind (especially without `safelist` support)
- Update [LayerChart](https://github.com/techniq/layerchart/pull/449/files) and [Svelte UX](techniq/svelte-ux#571) to `2.0.0@next` (improved compatibility with Tailwind 4 and [later](techniq/layerchart#458) Svelte 5)

## Screenshots
### Dark mode
Before/after screenshots (dark mode).
### Light mode
Before/after screenshots (light mode).

## Checklist
- [ ] Added Unit Tests
- [ ] Covered by existing CI
- [ ] Integration tested
- [ ] Documentation update
<!-- av pr metadata
This information is embedded by the av CLI when creating PRs to track the status of stacks when using Aviator. Please do not delete or edit this section of the PR.
```
{"parent":"main","parentHead":"","trunk":"main"}
```
-->
<!-- This is an auto-generated comment: release notes by coderabbit.ai -->
## Summary by CodeRabbit
- **New Features**
- Introduced enhanced theming with a comprehensive set of CSS custom properties for improved color schemes and typography.
- Streamlined metadata and status displays for a cleaner and more intuitive presentation.
- **Style**
- Refined UI elements including navigation bars, headers, and buttons for better readability and visual consistency.
- Updated layout adjustments across key pages, offering a more cohesive look and feel.
- **Chores**
- Upgraded dependencies and optimized build configurations while removing legacy styling tools for improved performance.
<!-- end of auto-generated comment: release notes by coderabbit.ai -->
---------
Co-authored-by: Sean Lynch <[email protected]>
## Summary
Ran the canary integration tests!
```bash
<-----------------------------------------------------------------------------------
------------------------------------------------------------------------------------
DATAPROC LOGS
------------------------------------------------------------------------------------
------------------------------------------------------------------------------------>
++ cat tmp_metadata_upload.out
++ grep 'Dataproc submitter job id'
++ cut -d ' ' -f5
+ METADATA_UPLOAD_JOB_ID=0f0a7099-3f03-4487-b291-de26f3d56cdb
+ check_dataproc_job_state 0f0a7099-3f03-4487-b291-de26f3d56cdb
+ JOB_ID=0f0a7099-3f03-4487-b291-de26f3d56cdb
+ '[' -z 0f0a7099-3f03-4487-b291-de26f3d56cdb ']'
+ echo -e '\033[0;32m <<<<<<<<<<<<<<<<-----------------JOB STATUS----------------->>>>>>>>>>>>>>>>>\033[0m'
<<<<<<<<<<<<<<<<-----------------JOB STATUS----------------->>>>>>>>>>>>>>>>>
++ gcloud dataproc jobs describe 0f0a7099-3f03-4487-b291-de26f3d56cdb --region=us-central1 --format=flattened
++ grep status.state:
+ JOB_STATE='status.state: DONE'
+ echo status.state: DONE
status.state: DONE
+ '[' -z 'status.state: DONE' ']'
+ echo -e '\033[0;32m<<<<<.....................................FETCH.....................................>>>>>\033[0m'
<<<<<.....................................FETCH.....................................>>>>>
+ touch tmp_fetch.out
+ zipline run --repo=/Users/thomaschow/zipline-ai/chronon/api/py/test/canary --mode fetch --conf=production/group_bys/gcp/purchases.v1_dev -k '{"user_id":"5"}' --name gcp.purchases.v1_dev
+ tee tmp_fetch.out
+ grep -q purchase_price_average_14d
+ cat tmp_fetch.out
+ grep purchase_price_average_14d
"purchase_price_average_14d" : 72.5,
+ '[' 0 -ne 0 ']'
+ echo -e '\033[0;32m<<<<<.....................................SUCCEEDED!!!.....................................>>>>>\033[0m'
<<<<<.....................................SUCCEEDED!!!.....................................>>>>>
```
## Checklist
- [ ] Added Unit Tests
- [ ] Covered by existing CI
- [ ] Integration tested
- [ ] Documentation update
<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->
## Summary by CodeRabbit
- **Chores**
- Upgraded several core dependency versions from 1.5.2 to 1.6.1.
- Updated artifact metadata, including revised checksums and additional
dependency entries, to enhance concurrency handling and HTTP client
support.
<!-- end of auto-generated comment: release notes by coderabbit.ai -->
<!-- av pr metadata
This information is embedded by the av CLI when creating PRs to track
the status of stacks when using Aviator. Please do not delete or edit
this section of the PR.
```
{"parent":"main","parentHead":"","trunk":"main"}
```
-->
---------
Co-authored-by: Thomas Chow <[email protected]>
## Summary
Adding stubs and schemas for orchestrator
## Checklist
- [x] Added Unit Tests
- [x] Covered by existing CI
- [x] Integration tested
- [ ] Documentation update
<!-- This is an auto-generated comment: release notes by coderabbit.ai -->
## Summary by CodeRabbit
- **New Features**
- Introduced an orchestration service with HTTP endpoints for health checks, configuration display, and diff uploads.
- Expanded branch mapping capabilities and enhanced data exchange processes.
- Added new data models and database access logic for configuration management.
- Implemented a new upload handler for managing configuration uploads and diff operations.
- Added a new `UploadHandler` class for handling uploads and diff requests.
- Introduced a new `WorkflowHandler` class for future workflow management.
- **Refactor**
- Streamlined job orchestration for join, merge, and source operations by adopting unified node-based configurations.
- Restructured metadata and persistence handling for greater clarity and efficiency.
- Updated import paths to reflect new organizational structures within the codebase.
- **Chore**
- Updated dependency integrations and package structures.
- Removed obsolete components and tests to improve overall maintainability.
- Introduced new test specifications for validating database access logic.
- Added new tests for the `NodeDao` class to ensure functionality.
<!-- end of auto-generated comment: release notes by coderabbit.ai -->
---------
Co-authored-by: ezvz <[email protected]>
Co-authored-by: Nikhil Simha <[email protected]>
Walkthrough
This update adds a new test case in the API to verify that a base64-encoded tile key written in Flink format is properly deserialized.
Actionable comments posted: 1
🧹 Nitpick comments (1)
api/src/test/scala/ai/chronon/api/test/TilingUtilSpec.scala (1)
27-32: Test only verifies non-null; it should check key properties. The test verifies that deserialization succeeds but doesn't validate the key contents. Consider assertions for dataset, keyBytes, etc.
```diff
  "TilingUtils" should "deser Flink written tile key" in {
    val base64String =
      "GChTRUFSQ0hfQkVBQ09OU19MSVNUSU5HX0FDVElPTlNfU1RSRUFNSU5HGWMCkJHL6Q0WgLq3AxaAyJ/uuWUA" // "Aiw=" // "ArL4u9gL"
    val bytes = java.util.Base64.getDecoder.decode(base64String)
    val deserializedKey = TilingUtils.deserializeTileKey(bytes)
    deserializedKey should not be null
+   deserializedKey.getDataset should be("SEARCH_BEACONS_LISTING_ACTIONS_STREAMING")
+   deserializedKey.getTileSizeMillis should be > 0L
+   deserializedKey.getTileStartTimestampMillis should be > 0L
  }
```
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro (Legacy)
📒 Files selected for processing (2)
- api/src/main/scala/ai/chronon/api/TilingUtils.scala (2 hunks)
- api/src/test/scala/ai/chronon/api/test/TilingUtilSpec.scala (1 hunks)
🧰 Additional context used
🧬 Code Definitions (1)
api/src/test/scala/ai/chronon/api/test/TilingUtilSpec.scala (1)
api/src/main/scala/ai/chronon/api/TilingUtils.scala (2)
- TilingUtils (11-35)
- deserializeTileKey (30-34)
🔇 Additional comments (4)
api/src/main/scala/ai/chronon/api/TilingUtils.scala (4)
17-19: Protocol change needs validation. Changed from binary to compact serialization; ensure compatibility with existing data.
22-24: Protocol change needs compatibility testing. Switching from binary to compact deserialization may affect reading existing data.
26-28: Serialization method update looks good. Updated to use the compact serializer.
30-34: Deserialization method update looks good. Updated to use the compact deserializer.
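For context, a rough sketch of the binary-to-compact switch these comments refer to, using plain Thrift serializers. `TileKey` is assumed to be the Thrift-generated struct returned by `TilingUtils.deserializeTileKey` (its import is omitted), and the object and method names here are illustrative, not the actual Chronon code:
```scala
import org.apache.thrift.{TDeserializer, TSerializer}
import org.apache.thrift.protocol.{TBinaryProtocol, TCompactProtocol}

object TileKeyCodecSketch {
  // Old scheme: binary protocol (what the code used before this change).
  def serializeBinary(key: TileKey): Array[Byte] =
    new TSerializer(new TBinaryProtocol.Factory()).serialize(key)

  // New scheme: compact protocol (what the reviewed change switches to).
  def serializeCompact(key: TileKey): Array[Byte] =
    new TSerializer(new TCompactProtocol.Factory()).serialize(key)

  def deserializeCompact(bytes: Array[Byte]): TileKey = {
    val key = new TileKey()
    new TDeserializer(new TCompactProtocol.Factory()).deserialize(key, bytes)
    key
  }
}
```
The compatibility concern is that bytes written with the binary protocol will not decode through a compact-protocol `TDeserializer` (and vice versa), so any tile keys already persisted under the old format would need a rewrite or a fallback read path.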