@chewy-zlai chewy-zlai commented May 15, 2025

Summary

Remove customer names from commit history

Checklist

  • Added Unit Tests
  • Covered by existing CI
  • Integration tested
  • Documentation update

Summary by CodeRabbit

  • New Features

    • Introduced a comprehensive Python API and CLI for defining, compiling, and running data pipelines, including GroupBy, Join, and StagingQuery objects.
    • Added support for AWS and GCP cloud environments with sample configurations, deployment scripts, and specialized runners.
    • Provided utilities for evaluating and sampling data, managing environment variables, and handling cloud-specific job execution.
    • Included rich CLI tools for project initialization, compilation, and job orchestration.
    • Added detailed airflow dependency extraction and metadata management helpers.
  • Bug Fixes

    • Improved handling of partition columns, window specifications, and query parameters in aggregation and query definitions.
    • Enhanced error reporting and validation for configuration and compilation workflows.
    • Fixed streaming job submission and concurrency checks to prevent duplicate runs.
  • Documentation

    • Added detailed README files and cloud-specific instructions for project setup and usage.
    • Updated and expanded docstrings and usage examples throughout the API.
    • Added canary and sample pipeline documentation.
  • Refactor

    • Modernized and reorganized test and core modules for clarity and maintainability.
    • Migrated Scala test suites to ScalaTest and improved Python test structure.
    • Streamlined configuration, metadata, and environment management for better consistency.
    • Simplified join and group-by construction APIs and metadata handling.
    • Removed legacy Airflow integration and replaced with CLI-driven orchestration.
  • Chores

    • Updated and pinned Python and tool dependencies for reproducibility.
    • Added and improved configuration files for linting, build, and CI workflows.
    • Removed deprecated and unused files including legacy Airflow DAG constructors, operators, helpers, and example code.
    • Added Dockerfile and GitHub Actions workflows for CI/CD and image builds.
  • Style

    • Standardized import orders, argument formatting, and code style across modules.
    • Improved code readability and formatting in Python and Scala components.
    • Adopted consistent naming and JSON serialization conventions.
  • Tests

    • Expanded and refactored test coverage for core aggregation, join, and utility logic.
    • Added cloud-specific and sample pipeline tests for validation.
    • Migrated tests from JUnit to ScalaTest with idiomatic style.
  • Revert

    • Removed legacy Airflow DAG constructors, operators, and helpers in favor of the new unified CLI and cloud-native approach.

ken-zlai and others added 30 commits February 19, 2025 15:47
## Summary
Changed the backend code to only compute 3 percentiles (p5, p50, p95)
for returning to the frontend.
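
A minimal sketch of the selection step, assuming the backend holds a full percentile array with one entry per integer percentile (p0 through p100, 101 values). That layout and the helper names `percentile_to_index` and `filter_percentiles` are assumptions for illustration, not the code in this commit.

```python
from typing import List, Sequence

# Default labels named in the summary above.
DEFAULT_PERCENTILES = ("p5", "p50", "p95")


def percentile_to_index(label: str) -> int:
    """Convert a label such as "p95" into an index into a p0..p100 array."""
    if not label.startswith("p") or not label[1:].isdigit():
        raise ValueError(f"invalid percentile label: {label!r}")
    index = int(label[1:])
    if index > 100:
        raise ValueError(f"percentile out of range: {label!r}")
    return index


def filter_percentiles(
    values: Sequence[float],
    labels: Sequence[str] = DEFAULT_PERCENTILES,
) -> List[float]:
    """Keep only the requested percentiles from a 101-entry array (assumed layout)."""
    if len(values) != 101:
        raise ValueError("expected one value per percentile from p0 to p100")
    return [values[percentile_to_index(label)] for label in labels]


# Hypothetical data: the value at percentile i is simply 2 * i.
full = [2.0 * i for i in range(101)]
selected = filter_percentiles(full)
```

With the hypothetical array above, `selected` holds the values at indices 5, 50, and 95.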

## Checklist
- [ ] Added Unit Tests
- [ ] Covered by existing CI
- [ ] Integration tested
- [ ] Documentation update



<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->
## Summary by CodeRabbit

- **Bug Fixes**
- Enhanced statistical data processing to consistently handle cases with
missing values by using a robust placeholder, ensuring clearer
downstream analytics.
- Adjusted the percentile chart configuration so that the 95th, 50th,
and 5th percentiles are accurately rendered, providing more reliable
insights for users.
- Relaxed the null ratio validation in summary data, allowing for a
broader acceptance of null values, which may affect drift metric
interpretations.

- **New Features**
- Introduced methods for converting percentile strings to index values
and filtering percentiles based on user-defined requests, improving data
handling and representation.
<!-- end of auto-generated comment: release notes by coderabbit.ai -->
## Summary
Changes to support builds/tests with both Scala 2.12 and 2.13.
By default we build against 2.12; pass the "--config scala_2.13"
option to "bazel build/test" to override it.

ScalaFmt seems to be breaking for 2.13 with the bazel rules_scala package.
A [fix](bazel-contrib/rules_scala#1631) is already
deployed, but a release with that change is not available yet, so we
temporarily disabled ScalaFmt checks for 2.13 and will re-enable them once the
fix is released.

## Checklist
- [ ] Added Unit Tests
- [x] Covered by existing CI
- [ ] Integration tested
- [ ] Documentation update



<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->
## Summary by CodeRabbit


- **New Features**
- Enabled flexible Scala version selection (2.12 and 2.13) for smoother
builds and enhanced compatibility.
- Introduced a default Scala version constant and a repository rule for
improved version management.
- Added support for additional Scala 2.13 dependencies in the build
configuration.

- **Refactor and Improvements**
- Streamlined build and dependency management for increased stability
and performance.
- Consolidated collection conversion utilities to boost reliability in
tests and runtime processing.
- Enhanced type safety and clarity in collection handling across various
modules.
- Improved handling of Scala collections and maps throughout the
codebase for better type consistency and safety.
- Updated method implementations to ensure explicit type conversions,
enhancing clarity and preventing runtime errors.
- Modified method signatures and internal logic to utilize `Seq` for
improved type clarity and consistency.
- Enhanced the `maven_artifact` function to accept an optional version
parameter for better dependency management.

<!-- end of auto-generated comment: release notes by coderabbit.ai -->
## Summary

- #381 introduced the ability
to configure a partition column at the node-level. This PR simply fixes
a missed spot on the plumbing of the new StagingQuery attribute.

## Checklist
- [ ] Added Unit Tests
- [ ] Covered by existing CI
- [ ] Integration tested
- [ ] Documentation update
<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->
## Summary by CodeRabbit

- **New Features**
- Enhanced the query builder to support specifying a partition column,
providing greater customization for query formation and partitioning.
- **Improvements**
- Improved handling of partition columns by introducing a fallback
mechanism to ensure valid values are used when necessary.
<!-- end of auto-generated comment: release notes by coderabbit.ai -->

<!-- av pr metadata
This information is embedded by the av CLI when creating PRs to track
the status of stacks when using Aviator. Please do not delete or edit
this section of the PR.
```
{"parent":"main","parentHead":"","trunk":"main"}
```
-->

---------

Co-authored-by: Thomas Chow <[email protected]>
## Summary
Add CI checks to make sure we are able to build and test all
modules on both Scala 2.12 and 2.13.

## Checklist
- [ ] Added Unit Tests
- [x] Covered by existing CI
- [ ] Integration tested
- [ ] Documentation update



<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->
## Summary by CodeRabbit

- **Chores**
- Updated automated testing workflows to support Scala 2.12 and added
new workflows for Scala 2.13, ensuring consistent testing for both Spark
and non-Spark modules.

- **Documentation**
- Enhanced build instructions with updated commands for creating Uber
Jars and new automation shortcuts to streamline code formatting,
committing, and pushing changes.
<!-- end of auto-generated comment: release notes by coderabbit.ai -->
## Summary
Added pinning support for both our maven and spark repositories so we
don't have to resolve them during builds.

Going forward, whenever we update the artifacts in either the
maven or spark repositories, we need to re-pin the changed repos
using the following commands and check in the updated json files.

```
REPIN=1 bazel run @maven//:pin
REPIN=1 bazel run @spark//:pin
```

## Checklist
- [ ] Added Unit Tests
- [x] Covered by existing CI
- [ ] Integration tested
- [ ] Documentation update



<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->
## Summary by CodeRabbit

- **New Features**
- Integrated enhanced repository management for Maven and Spark,
providing improved dependency installation.
- Added support for JSON configuration files for Maven and Spark
installations.

- **Chores**
- Updated documentation to include instructions on pinning Maven
artifacts and managing dependency versions effectively.
<!-- end of auto-generated comment: release notes by coderabbit.ai -->
A VSCode plugin for feature authoring that detects errors and uses data
sampling to speed up the iteration cycle. The goal is to reduce
the time spent memorizing commands, typing / clicking, and waiting for
clusters to spin up and jobs to finish.

In this example, we have a complex expression operating on nested data.
The eval button appears above Chronon types.

When you click the Eval button, it samples your data, runs your code,
and shows errors or the transformed result within seconds.



![zipline_vscode_plugin](https://github.com/user-attachments/assets/5ac56764-f6e7-4998-b5aa-1f4cabde42f9)


## Checklist
- [ ] Added Unit Tests
- [ ] Covered by existing CI
- [x] Integration tested (see above)
- [ ] Documentation update



<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->
## Summary by CodeRabbit

- **New Features**
- Introduced a new Visual Studio Code extension that enhances Python
development.
- The extension displays an evaluation button alongside specific
assignment statements in Python files, allowing users to trigger
evaluation commands directly in the terminal.
- Added a command to execute evaluation actions related to Zipline AI
configurations.
  
- **Documentation**
  - Added a new LICENSE file containing the MIT License text.
  
- **Configuration**
- Introduced new configuration files for TypeScript and Webpack to
support the extension's development and build processes.
  
- **Exclusions**
- Updated `.gitignore` and added `.vscodeignore` to streamline version
control and packaging processes.
<!-- end of auto-generated comment: release notes by coderabbit.ai -->
## Summary

Moved scala dependencies to separate scala_2_12 and scala_2_13
repositories so we can load the right repo based on config instead of
loading both.

## Checklist
- [ ] Added Unit Tests
- [x] Covered by existing CI
- [ ] Integration tested
- [ ] Documentation update



<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->
## Summary by CodeRabbit


- **Chores**
- Upgraded Scala dependencies to newer versions with updated
verification, ensuring improved stability.
- Removed outdated package references to streamline dependency
management.
- Introduced new repository configurations for Scala 2.12 and 2.13 to
enhance dependency management.
- Added `.gitignore` entry to exclude `node_modules` in the
`authoring/vscode` path.
  - Created `LICENSE` file with MIT License text for the new extension.
  
- **New Features**
- Introduced a Visual Studio Code extension with a CodeLens provider for
Python files, allowing users to evaluate variables directly in the
editor.

- **Refactor**
- Updated dependency declarations to utilize a new method for handling
Scala artifacts, improving consistency across the project.
<!-- end of auto-generated comment: release notes by coderabbit.ai -->

---------

Co-authored-by: Nikhil Simha <[email protected]>
## Summary
Adds AWS build and push commands to the distribution script.

## Checklist
- [ ] Added Unit Tests
- [ ] Covered by existing CI
- [ ] Integration tested
- [ ] Documentation update



<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->
## Summary by CodeRabbit

- **New Features**
  - Introduced an automated quickstart process for GCP deployments.
- Enhanced the build and upload tool with flexible command-line options,
supporting artifact creation for both AWS and GCP environments.
  - Added a new script for running the Zipline quickstart on GCP.

- **Refactor**
  - Updated the AWS quickstart process to ensure consistent execution.
<!-- end of auto-generated comment: release notes by coderabbit.ai -->
…FilePath and replacing `/` to `.` in MetaData names (#398)

## Summary

^^^

Tested on our clients' laptop.

## Checklist
- [ ] Added Unit Tests
- [ ] Covered by existing CI
- [ ] Integration tested
- [ ] Documentation update



<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->
## Summary by CodeRabbit

- **Bug Fixes**
- Improved error handling to explicitly report when configuration values
are missing.
- **New Features**
- Introduced standardized constants for various configuration types,
ensuring consistent key naming.
- **Refactor**
- Unified metadata processing by using direct metadata names instead of
file paths.
- Enhanced type safety in configuration options for clearer and more
reliable behavior.
- **Tests**
- Updated test cases and parameters to reflect the improved metadata and
configuration handling.
<!-- end of auto-generated comment: release notes by coderabbit.ai -->
Reverts #373

Passing in options to push to only one customer is broken.

<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->

## Summary by CodeRabbit

- **Refactor**
- Streamlined the deployment process to automatically build and upload
artifacts exclusively to Google Cloud Platform.
- Removed configuration options and handling for an alternative cloud
provider, resulting in a simpler, more focused workflow.

<!-- end of auto-generated comment: release notes by coderabbit.ai -->
## Summary
Building the join output schema belongs in the metadata store; moving it
there also reduces the size of the fetcher.

## Checklist
- [ ] Added Unit Tests
- [x] Covered by existing CI
- [ ] Integration tested
- [ ] Documentation update



<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->
## Summary by CodeRabbit

- **New Features**
- Introduced an optimized caching mechanism for data join operations,
resulting in improved performance and reliability.
- Added new methods to facilitate the creation and management of join
codecs.
  
- **Bug Fixes**
- Enhanced error handling for join codec operations, ensuring clearer
context for failures.
  
- **Documentation**
- Improved code readability and clarity through updated comments and
method signatures.
<!-- end of auto-generated comment: release notes by coderabbit.ai -->
…#422)

## Summary
Add support to run the fetcher service in docker. Also add rails to
publish to docker hub as a private image -
[ziplineai/chronon-fetcher](https://hub.docker.com/repository/docker/ziplineai/chronon-fetcher)

I wasn't able to sort out logback / log4j2 logging as there's a lot of
deps messing things up - Vert.x supports JUL configs and that is
seemingly working so starting with that for now.

Tested with:
```
docker run -v ~/.config/gcloud/application_default_credentials.json:/gcp/credentials.json \
 -p 9000:9000 \
 -e "GCP_PROJECT_ID=canary-443022" \
 -e "GOOGLE_CLOUD_PROJECT=canary-443022" \
 -e "GCP_BIGTABLE_INSTANCE_ID=zipline-canary-instance" \
 -e "STATSD_HOST=127.0.0.1" \
 -e GOOGLE_APPLICATION_CREDENTIALS=/gcp/credentials.json \
 ziplineai/chronon-fetcher
```

And then you can `curl http://localhost:9000/ping`

On our clients' side, you just need to swap out the project and BT instance ID,
and then you can curl the actual join:
```
curl -X POST http://localhost:9000/v1/fetch/join/search.ranking.v1_web_zipline_cdc_and_beacon_external -H 'Content-Type: application/json' -d '[{"listing_id":"632126370","shop_id":"53908089","shipping_profile_id":"235561688531"}]'
{"results":[{"status":"Success","entityKeys":{"listing_id":"632126370","shop_id":"53908089","shipping_profile_id":"235561688531"},"features":{...
```

## Checklist
- [ ] Added Unit Tests
- [ ] Covered by existing CI
- [X] Integration tested
- [ ] Documentation update



<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->

## Summary by CodeRabbit

- **New Features**
- Added an automation script that streamlines the container image build
and publication process with improved error handling.
- Introduced a new container configuration that installs essential
dependencies, sets environment variables, and incorporates a health
check for enhanced reliability.
- Implemented a robust logging setup that standardizes console and file
outputs with log rotation.
- Provided a startup script for the service that verifies required
settings and applies platform-specific options for seamless execution.

<!-- end of auto-generated comment: release notes by coderabbit.ai -->
## Summary

Adds the ability to push artifacts to aws in addition to gcp. Also adds
ability to specify specific customer ids to push to.

## Checklist
- [ ] Added Unit Tests
- [ ] Covered by existing CI
- [ ] Integration tested
- [ ] Documentation update



<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->
## Summary by CodeRabbit

- **New Features**
- Introduced a new automation script that streamlines the process of
building artifacts and deploying them to both AWS and GCP with improved
error handling and user confirmation.

- **Chores**
- Removed a legacy artifact upload script that previously handled only
GCP deployments.
<!-- end of auto-generated comment: release notes by coderabbit.ai -->
## Summary

- Supporting StagingQueries for configurable compute engines. To support
BigQuery, the simplest way is to just write bigquery sql and run it on
bq to create the final table. Let's first make the API change.

## Checklist
- [ ] Added Unit Tests
- [ ] Covered by existing CI
- [ ] Integration tested
- [ ] Documentation update
<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->
## Summary by CodeRabbit


- **New Features**
- Added an option for users to specify the compute engine when
processing queries, offering choices such as Spark and BigQuery.
- Introduced validation to ensure that queries run only with the
designated engine.

- **Style**
  - Streamlined code organization for enhanced readability.
  - Consolidated and reordered import statements for improved clarity.
<!-- end of auto-generated comment: release notes by coderabbit.ai -->

<!-- av pr metadata
This information is embedded by the av CLI when creating PRs to track
the status of stacks when using Aviator. Please do not delete or edit
this section of the PR.
```
{"parent":"main","parentHead":"","trunk":"main"}
```
-->

---------

Co-authored-by: Thomas Chow <[email protected]>
## Summary
fetcher has grown over time into a large file with many large functions
that are hard to work with. This refactoring doesn't change any
functionality - just placement.

Made some of the scala code more idiomatic - if(try.isFailed) - vs
try.recoverWith
Made Metadata methods more explicit
FetcherBase -> JoinPartFetcher + GroupByFetcher + GroupByResponseHandler
Added fetch context - to replace 10 constructor params


## Checklist
- [ ] Added Unit Tests
- [x] Covered by existing CI
- [ ] Integration tested
- [ ] Documentation update



<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->
## Summary by CodeRabbit


- **New Features**
- Introduced a unified configuration context that enhances data
fetching, including improved group-by and join operations with more
robust error handling.
- Added a new `FetchContext` class to manage fetching operations and
execution contexts.
- Implemented a new `GroupByFetcher` class for efficient group-by data
retrieval.
- **Refactor**
- Upgraded serialization and deserialization to use a more efficient,
compact protocol.
- Standardized API definitions and type declarations across modules to
improve clarity and maintainability.
- Enhanced error handling in various methods to provide more informative
messages.
- **Chores**
	- Removed outdated utilities and reorganized dependency imports.
	- Updated test suites to align with the refactored architecture.

<!-- end of auto-generated comment: release notes by coderabbit.ai -->
## Summary

- Staging query should in theory already work for external tables
without additional code changes as long as we do some setup work to pin
up a view first.

## Checklist
- [ ] Added Unit Tests
- [ ] Covered by existing CI
- [ ] Integration tested
- [ ] Documentation update

<!-- av pr metadata
This information is embedded by the av CLI when creating PRs to track
the status of stacks when using Aviator. Please do not delete or edit
this section of the PR.
```
{"parent":"main","parentHead":"","trunk":"main"}
```
-->

---------

Co-authored-by: Thomas Chow <[email protected]>
## Summary
The existing aggregations configure the items sketch incorrectly. Split
it into two: one that works purely with skewed data, and one that makes a
best-effort attempt to collect the most frequent items.

## Checklist
- [x] Added Unit Tests
- [x] Covered by existing CI
- [ ] Integration tested
- [ ] Documentation update



<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->
## Summary by CodeRabbit

- **New Features**
- Introduced new utility functions to streamline expression composition
and cleanup.
  - Enhanced aggregation descriptions for clearer operation choices.
  - Added new aggregation types for improved data analysis.

- **Refactor**
- Revamped frequency analysis logic with improved error handling and
optimized sizing.
- Replaced legacy histogram approaches with a more robust frequent item
detection mechanism.

- **Tests**
- Added tests to validate heavy hitter detection and skewed data
scenarios, while removing obsolete histogram tests.
  - Updated existing tests to reflect changes in aggregation parameters.

- **Chores**
  - Removed deprecated interactive modules for a leaner deployment.

- **Configuration**
- Adjusted default aggregation parameters for more consistent
processing, including changes to the `k` value in multiple
configurations.
<!-- end of auto-generated comment: release notes by coderabbit.ai -->
## Summary
Add a couple of APIs to help with our clients' ***REMOVED*** integration. One is to
list out all online joins and the second is to retrieve the join schema
details for a given Join.

As part of wiring up list support, I tweaked a couple of properties like
the list pagination key / list call limit to make things consistent
between DynamoDB and BigTable.

For the BT implementation we issue a range query under the 'joins/'
prefix. Subsequent calls (in case of pagination) continue off this range
(verified via unit tests and basic sanity checks on our clients' side).
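
The paginated prefix scan described above can be sketched roughly as follows. This is an in-memory stand-in, not the BigTable or DynamoDB code: a sorted list of keys plays the role of the table, and `list_page`, `list_all`, and the continuation-key handling are hypothetical names chosen for illustration.

```python
import bisect
from typing import List, Optional, Sequence, Tuple


def list_page(
    sorted_keys: Sequence[str],
    prefix: str,
    limit: int,
    continuation_key: Optional[str] = None,
) -> Tuple[List[str], Optional[str]]:
    """Return up to `limit` keys under `prefix`, plus a key to resume from."""
    if continuation_key is None:
        # First call: start at the beginning of the prefix range.
        start = bisect.bisect_left(sorted_keys, prefix)
    else:
        # Later calls: continue just after the last key already returned.
        start = bisect.bisect_right(sorted_keys, continuation_key)

    page: List[str] = []
    for key in sorted_keys[start:]:
        if not key.startswith(prefix) or len(page) == limit:
            break
        page.append(key)

    # Hand back a continuation key only if more matching keys remain.
    end = start + len(page)
    has_more = end < len(sorted_keys) and sorted_keys[end].startswith(prefix)
    return page, (page[-1] if page and has_more else None)


def list_all(sorted_keys: Sequence[str], prefix: str, limit: int) -> List[str]:
    """Follow continuation keys until the prefix range is exhausted."""
    results: List[str] = []
    token: Optional[str] = None
    while True:
        page, token = list_page(sorted_keys, prefix, limit, token)
        results.extend(page)
        if token is None:
            return results


keys = sorted(["group_bys/a", "joins/a", "joins/b", "joins/c", "staging/x"])
first_page, token = list_page(keys, "joins/", limit=2)
```

Resuming from the last returned key keeps each call stateless, so the caller only has to carry the continuation key between requests.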

APIs added are:
* /v1/joins -> Return the list of online joins
* /v1/join/schema/join-name -> Return a payload consisting of
{"joinName": "..", "keySchema": "avro schema", "valueSchema": "avro
schema", "schemaHash": "hash"}

Tested by dropping the docker container and confirming things on
our clients' side:
```
$ curl http://localhost:9000/v1/joins
{"joinNames":["search.ranking.v1_web_zipline_cdc_and_beacon_external" ...}
```

And
```
curl http://localhost:9000/v1/join/schema/search.ranking.v1_web_zipline_cdc_and_beacon_external
{ big payload }
```
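
For callers that prefer a script over `curl`, a small client sketch follows. The endpoint paths and JSON field names come from the description above; the helper names, the `BASE_URL` default, and the validation in `parse_join_schema` are assumptions for illustration.

```python
import json
from typing import Dict, List
from urllib.parse import quote
from urllib.request import urlopen

BASE_URL = "http://localhost:9000"  # assumed local fetcher service, as in the curl examples

# Fields listed for the schema payload in the description above.
SCHEMA_FIELDS = ("joinName", "keySchema", "valueSchema", "schemaHash")


def joins_url(base_url: str = BASE_URL) -> str:
    """URL that lists online joins."""
    return f"{base_url}/v1/joins"


def join_schema_url(join_name: str, base_url: str = BASE_URL) -> str:
    """URL that returns the schema details for one join."""
    return f"{base_url}/v1/join/schema/{quote(join_name, safe='')}"


def parse_join_names(body: str) -> List[str]:
    """Extract the join names from a /v1/joins response body."""
    return list(json.loads(body)["joinNames"])


def parse_join_schema(body: str) -> Dict[str, str]:
    """Extract the documented fields from a schema response body."""
    payload = json.loads(body)
    missing = [field for field in SCHEMA_FIELDS if field not in payload]
    if missing:
        raise KeyError(f"schema payload is missing fields: {missing}")
    return {field: payload[field] for field in SCHEMA_FIELDS}


def fetch(url: str, timeout: float = 5.0) -> str:
    """Issue a GET request and return the response body as text."""
    with urlopen(url, timeout=timeout) as response:
        return response.read().decode("utf-8")
```

With the service running, `parse_join_names(fetch(joins_url()))` would return the list of join names, and `parse_join_schema(fetch(join_schema_url(name)))` the schema fields for one of them.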

## Checklist
- [X] Added Unit Tests
- [ ] Covered by existing CI
- [X] Integration tested
- [ ] Documentation update



<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->
## Summary by CodeRabbit

- **New Features**
- Introduced new API endpoints that let users list available joins and
retrieve detailed join schema information.
- Added enhanced configuration options to support complex join
workflows.
- New test cases for validating join listing and schema retrieval
functionalities.
  - Added new constants for pagination and entity type handling.

- **Improvements**
- Standardized pagination and entity handling across cloud integrations,
ensuring a consistent and reliable data listing experience.
- Enhanced error handling and response formatting for join-related
requests.
- Expanded testing capabilities with additional dependencies and
resource inclusion.
<!-- end of auto-generated comment: release notes by coderabbit.ai -->
## Summary
#398 updated the module path from `"/"` to `"."`, but not all code was
migrated to the new convention, causing frontend API calls to fail when
retrieving joins.

@david-zlai – Can you review the code to ensure it fully aligns with the
new convention?
@sean-zlai – Can you tear down all Docker images and rebuild on this
branch to confirm observability works as expected?

## Checklist
- [ ] Added Unit Tests
- [ ] Covered by existing CI
- [ ] Integration tested
- [ ] Documentation update



<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->

## Summary by CodeRabbit

- **Refactor**
- Streamlined how configuration names are handled in observability
views. Names are now displayed as originally provided without extra
formatting, ensuring a consistent and straightforward presentation. The
fallback label remains “Unknown” when a name is not available.

<!-- end of auto-generated comment: release notes by coderabbit.ai -->
## Summary

- Everywhere else we want to handle partitions that could be non-string
types. This is similar to the change in:
https://github.com/zipline-ai/chronon/blob/0d78a99e44f97f95d05e528a749837bc9a38b32e/cloud_gcp/src/main/scala/ai/chronon/integrations/cloud_gcp/BigQueryFormat.scala#L122-L128

## Checklist
- [ ] Added Unit Tests
- [ ] Covered by existing CI
- [ ] Integration tested
- [ ] Documentation update



<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->
## Summary by CodeRabbit

- **New Features**
- Enhanced partition date display by introducing configurable date
formatting.
- Partition dates are now consistently formatted based on user
configuration, ensuring reliable and predictable output across the
system.
- Improved retrieval of partition format for BigQuery operations,
allowing for broader usage across different packages.
<!-- end of auto-generated comment: release notes by coderabbit.ai -->

<!-- av pr metadata
This information is embedded by the av CLI when creating PRs to track
the status of stacks when using Aviator. Please do not delete or edit
this section of the PR.
```
{"parent":"main","parentHead":"","trunk":"main"}
```
-->

---------

Co-authored-by: Thomas Chow <[email protected]>
## Summary
Enable batch IR caching by default & fix an issue where our Vertx init
code tries to connect to BT at startup and takes a second or two on the
worker threads (and results in the warning - 'Thread
Thread[vert.x-eventloop-thread-1,5,main] has been blocked for 2976 ms,
time limit is 2000 ms').

## Checklist
- [ ] Added Unit Tests
- [ ] Covered by existing CI
- [X] Integration tested
- [ ] Documentation update



<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->
## Summary by CodeRabbit

- **Refactor**
- Streamlined caching configuration and logic with a consistent default
setting for improved behavior.
- Enhanced service startup by shifting to asynchronous initialization
with better error handling for a more robust launch.

- **Tests**
- Removed an outdated test case that validated previous caching
behavior.
<!-- end of auto-generated comment: release notes by coderabbit.ai -->
## Summary
This PR allows the frontend to specify which percentiles it retrieves
from the backend. The percentiles can be passed as a query parameter:

```
percentiles=p0,p10,p90
```
If omitted, the default percentiles are used:  
```
percentiles=p5,p50,p95
```

### Example Requests *(App must be running)*  

#### Default (uses `p5,p50,p95`)  
```sh
curl "http://localhost:5173/api/v1/join/risk.user_transactions.txn_join/column/txn_by_user_transaction_amount_count_1h/summary?startTs=1672531200000&endTs=1677628800000"
```

#### Equivalent Explicit Default  
```sh
curl "http://localhost:5173/api/v1/join/risk.user_transactions.txn_join/column/txn_by_user_transaction_amount_count_1h/summary?startTs=1672531200000&endTs=1677628800000&percentiles=p5,p50,p95"
```

#### Custom Percentiles (`p0,p10,p90`)  
```sh
curl "http://localhost:5173/api/v1/join/risk.user_transactions.txn_join/column/txn_by_user_transaction_amount_count_1h/summary?startTs=1672531200000&endTs=1677628800000&percentiles=p0,p10,p90"
```

### Notes  
- Omitting the `percentiles` parameter is the same as explicitly setting
`percentiles=p5,p50,p95`.
- You can test using `curl` or Postman.  
- We need to let users change these percentiles via checkboxes or
another UI control.
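
The defaulting rule in the notes above can be sketched as a small parser. This is an illustration under assumptions, not the handler in this PR: `requested_percentiles` is a hypothetical helper, and treating an empty `percentiles=` value like an omitted one is an assumption.

```python
from typing import List
from urllib.parse import parse_qs, urlsplit

# Default stated in the PR description.
DEFAULT_PERCENTILES = ("p5", "p50", "p95")


def requested_percentiles(url: str) -> List[str]:
    """Read the `percentiles` query parameter, falling back to the default."""
    query = parse_qs(urlsplit(url).query)
    raw = query.get("percentiles")
    if not raw:
        # Parameter omitted (or blank, which parse_qs drops by default).
        return list(DEFAULT_PERCENTILES)
    labels = [label.strip() for label in raw[-1].split(",") if label.strip()]
    return labels or list(DEFAULT_PERCENTILES)


base = (
    "http://localhost:5173/api/v1/join/risk.user_transactions.txn_join"
    "/column/txn_by_user_transaction_amount_count_1h/summary"
    "?startTs=1672531200000&endTs=1677628800000"
)
default_labels = requested_percentiles(base)
custom_labels = requested_percentiles(base + "&percentiles=p0,p10,p90")
```

Here `default_labels` matches the explicit `percentiles=p5,p50,p95` request, which is the equivalence the notes describe.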

## Checklist
- [ ] Added Unit Tests
- [ ] Covered by existing CI
- [ ] Integration tested
- [ ] Documentation update



<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->
## Summary by CodeRabbit

- **New Features**
- Added support for customizable percentile parameters in summary data
requests, with a default setting of "p5, p50, p95".
- Enhanced the ability to retrieve detailed statistical summaries by
allowing users to specify percentile values when querying data.
  - Introduced two new optional dependencies for improved functionality.

- **Bug Fixes**
- Adjusted method signatures to ensure compatibility with the new
percentile parameters in various components.
<!-- end of auto-generated comment: release notes by coderabbit.ai -->
## Summary
I noticed we were missing the core chronon fetcher logs during feature
lookup requests. As we anyway wanted to rip out the JUL & logback, I
went ahead and dropped those for a log4j2 properties file.

Confirmed that I am seeing the relevant fetcher logs from classes like
the SawtoothOnlineAggregator etc when I hit the service with a feature
look up request.

## Checklist
- [ ] Added Unit Tests
- [ ] Covered by existing CI
- [X] Integration tested
- [ ] Documentation update



<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->

## Summary by CodeRabbit

- **Refactor**
- Consolidated service deployment paths and streamlined startup
configuration.
- Improved metrics handling by conditionally enabling reporting based on
environment settings.

- **Chores**
  - Optimized resource packaging and removed legacy dependencies.
- Upgraded logging configuration to enhance performance and log
management.

<!-- end of auto-generated comment: release notes by coderabbit.ai -->
#438)

## Summary

1. Added offset and bound support to staging query macros. `{{ start_date }}`
is valid as before; now `{{ start_date(offset=-10,
lower_bound='2023-01-01') }}` is also valid.

2. Previously we required users to pass in quotes around the macro
separately. This PR removes the need for it:
`{{ start_date }}` used to become `2023-01-01`; it now becomes
`'2023-01-01'`.

3. Added a unified top-level module `api.chronon.types` that contains
everything that users need.

4. Added wrappers on source subtypes to directly return sources:
```py
ttypes.Source(events=ttypes.EventSource(...))

# now becomes
EventSource(...)
```
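
The macro behavior described in the first two items can be sketched as below. This is not the `StagingQuery` implementation: `render_macros` is a hypothetical helper, and the semantics of `offset` (a shift in days) and `lower_bound` (a floor on the resulting date) are assumptions inferred from the parameter names.

```python
import re
from datetime import date, timedelta
from typing import Dict

# Matches `{{ name }}` and `{{ name(arg=value, ...) }}`.
MACRO = re.compile(r"\{\{\s*(\w+)\s*(?:\((.*?)\))?\s*\}\}")
# Matches `key='text'` or `key=-10` inside the parentheses.
ARG = re.compile(r"(\w+)\s*=\s*(?:'([^']*)'|(-?\d+))")


def render_macros(sql: str, dates: Dict[str, date]) -> str:
    """Replace date macros with quoted ISO dates (assumed semantics)."""

    def substitute(match):
        name, raw_args = match.group(1), match.group(2) or ""
        if name not in dates:
            return match.group(0)  # leave unknown macros untouched
        args = {key: quoted or number for key, quoted, number in ARG.findall(raw_args)}
        day = dates[name]
        if "offset" in args:
            # Assumption: the offset is a number of days.
            day += timedelta(days=int(args["offset"]))
        if "lower_bound" in args:
            # Assumption: the bound clamps the shifted date from below.
            day = max(day, date.fromisoformat(args["lower_bound"]))
        # The quotes are part of the substitution, as in item 2.
        return f"'{day.isoformat()}'"

    return MACRO.sub(substitute, sql)


dates = {"start_date": date(2023, 1, 5)}
plain = render_macros("WHERE ds >= {{ start_date }}", dates)
bounded = render_macros(
    "WHERE ds >= {{ start_date(offset=-10, lower_bound='2023-01-01') }}", dates
)
```

Under these assumptions, `plain` renders with `'2023-01-05'`, while `bounded` shifts back ten days and is then clamped to `'2023-01-01'`.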

## Checklist
- [ ] Added Unit Tests
- [ ] Covered by existing CI
- [ ] Integration tested
- [ ] Documentation update



<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->
## Summary by CodeRabbit

- **New Features**
- Added new functions for creating event, entity, and join data sources.
- Introduced enhanced date macro utilities to enable flexible SQL query
substitutions.

- **Refactor**
- Streamlined naming conventions and standardized parameter formatting.
- Consolidated and simplified import structures for improved
consistency.
- Updated method signatures and calls from `select` to `selects` across
various components.
- Removed reliance on `ttypes` for source definitions and standardized
parameter naming conventions.
  - Simplified macro substitution logic in the `StagingQuery` object.

- **Tests**
- Implemented comprehensive tests for date manipulation features to
ensure robust behavior.
- Updated existing tests to reflect changes in method names and query
formatting.
- Adjusted data generation parameters in tests to increase transaction
volumes.

- **Documentation**
- Updated configuration descriptions to clearly illustrate new date
template options and parameter adjustments.
<!-- end of auto-generated comment: release notes by coderabbit.ai -->
## Summary
cleaning up top level dir 

## Checklist
- [ ] Added Unit Tests
- [x] Covered by existing CI
- [ ] Integration tested
- [ ] Documentation update



<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->

## Summary by CodeRabbit

- **Chores**
- Refined version control and build settings by updating ignored paths
and tool versions.
- Removed obsolete internal configurations, tooling, and Docker build
files for a cleaner project structure.
- **Documentation**
  - Updated installation guidance links for clearer setup instructions.
- Eliminated legacy contributor, governance, and quickstart guides to
reduce clutter.

<!-- end of auto-generated comment: release notes by coderabbit.ai -->
## Summary
No turning back now

## Checklist
- [ ] Added Unit Tests
- [x] Covered by existing CI
- [ ] Integration tested
- [ ] Documentation update



<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->

## Summary by CodeRabbit

- **Refactor**
- Removed legacy internal components from workflow orchestration and
task management to streamline operations.
- **Documentation**
  - Updated deployment guidance by removing outdated references.

These internal improvements enhance maintainability and performance
without altering your current user experience.

<!-- end of auto-generated comment: release notes by coderabbit.ai -->
## Summary
move OSS docsite release scripts

## Checklist
- [ ] Added Unit Tests
- [x] Covered by existing CI
- [ ] Integration tested
- [ ] Documentation update



<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->

## Summary by CodeRabbit

- **Chores**
- Made behind‑the‑scenes updates to streamline our internal release
management processes.

There are no visible changes to functionality for end-users in this
release.

<!-- end of auto-generated comment: release notes by coderabbit.ai -->
## Summary

## Checklist
- [ ] Added Unit Tests
- [ ] Covered by existing CI
- [ ] Integration tested
- [ ] Documentation update
<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->
## Summary by CodeRabbit

- **Chores**
- Consolidated and streamlined build dependencies for improved
integration with AWS services and data processing libraries.
- Expanded the set of supported third-party libraries, including new
artifacts for enhanced performance and compatibility.
- Added new dependencies for Hudi, Jackson, and Zookeeper to enhance
functionality.
- Introduced additional Hudi artifacts for Scala 2.12 and 2.13 to
broaden available functionalities.

- **Tests**
- Added a new test class to verify reliable write/read operations on
Hudi tables using a Spark session.

- **Refactor**
- Enhanced serialization registration to support a broader range of data
types, improving overall processing stability.
- Introduced a new variable for shared library dependencies to simplify
dependency management.
<!-- end of auto-generated comment: release notes by coderabbit.ai -->

<!-- av pr metadata
This information is embedded by the av CLI when creating PRs to track
the status of stacks when using Aviator. Please do not delete or edit
this section of the PR.
```
{"parent":"main","parentHead":"","trunk":"main"}
```
-->

---------

Co-authored-by: Thomas Chow <[email protected]>
## Summary

## Checklist
- [ ] Added Unit Tests
- [ ] Covered by existing CI
- [ ] Integration tested
- [ ] Documentation update



<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->

## Summary by CodeRabbit

- **Refactor**
- Improved the internal setup for fetch operations by reorganizing the
underlying structure. This update streamlines background processing and
enhances overall maintainability while keeping user-facing functionality
unchanged.

<!-- end of auto-generated comment: release notes by coderabbit.ai -->
nikhil-zlai and others added 23 commits May 8, 2025 14:46
## Summary

## Checklist
- [ ] Added Unit Tests
- [ ] Covered by existing CI
- [ ] Integration tested
- [ ] Documentation update


<!-- av pr metadata
This information is embedded by the av CLI when creating PRs to track
the status of stacks when using Aviator. Please do not delete or edit
this section of the PR.
```
{"parent":"main","parentHead":"","trunk":"main"}
```
-->


<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->

## Summary by CodeRabbit

- **Style**
  - Reorganized import statements for improved readability.

- **Chores**
- Removed debugging print statements from partition insertion to clean
up console output.

<!-- end of auto-generated comment: release notes by coderabbit.ai -->

Co-authored-by: thomaschow <[email protected]>
## Summary

Run push_to_platform on pull request merge only. Also use default
message


## Checklist
- [ ] Added Unit Tests
- [ ] Covered by existing CI
- [ ] Integration tested
- [ ] Documentation update



<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->

## Summary by CodeRabbit

- **Chores**
- Updated workflow to run only after a pull request is merged into the
main branch, instead of on every push.
- Adjusted the commit message behavior for subtree updates to use the
default message.

<!-- end of auto-generated comment: release notes by coderabbit.ai -->
## Summary

## Checklist
- [ ] Added Unit Tests
- [ ] Covered by existing CI
- [ ] Integration tested
- [ ] Documentation update
<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->
## Summary by CodeRabbit

- **Chores**
- Removed the synthetic dataset generation script for browser and device
fingerprinting data.
- Removed related test configurations and documentation for AWS Zipline
and Plaid data processing.
- Updated AWS release workflow to exclude the "our clients" customer ID from
S3 uploads.
- Cleaned up commented-out AWS S3 and Glue deletion commands in
deployment scripts.
<!-- end of auto-generated comment: release notes by coderabbit.ai -->

<!-- av pr metadata
This information is embedded by the av CLI when creating PRs to track
the status of stacks when using Aviator. Please do not delete or edit
this section of the PR.
```
{"parent":"main","parentHead":"","trunk":"main"}
```
-->

---------

Co-authored-by: thomaschow <[email protected]>
## Summary

## Checklist
- [ ] Added Unit Tests
- [ ] Covered by existing CI
- [ ] Integration tested
- [ ] Documentation update



<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->

## Summary by CodeRabbit

- **Chores**
- Removed references to "our clients" as a customer ID from workflows, scripts,
and documentation.
- Deleted test and configuration files related to "our clients" and sample
teams.
- Updated Avro schema namespaces and default values from "com.our clients" to
"com.customer" and related URLs.
	- Improved indentation and formatting in sample configuration files.
- **Tests**
- Updated test arguments and removed obsolete test data related to
"our clients".

<!-- end of auto-generated comment: release notes by coderabbit.ai -->
…-passing-candidate to line up with publish_release (#760)

## Summary

## Checklist
- [ ] Added Unit Tests
- [ ] Covered by existing CI
- [ ] Integration tested
- [ ] Documentation update



<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->

## Summary by CodeRabbit

- **Chores**
- Updated storage paths for artifact uploads to cloud storage in
deployment workflows.

- **Documentation**
- Corrected a type annotation in the documentation for a query
parameter.

- **Tests**
  - Enhanced a test to include and verify a new query parameter.

<!-- end of auto-generated comment: release notes by coderabbit.ai -->
…mapping (#728)

## Summary

Updating the JoinSchemaResponse to include a mapping from feature ->
listing key. This PR updates our JoinSchemaResponse to include a value
info case class with these details.

## Checklist
- [X] Added Unit Tests
- [X] Covered by existing CI
- [ ] Integration tested
- [ ] Documentation update



<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->
## Summary by CodeRabbit


- **New Features**
- Added detailed metadata for join value fields, including feature
names, group names, prefixes, left keys, and schema descriptions, now
available in join schema responses.
- **Bug Fixes**
- Improved consistency and validation between join configuration keys
and value field metadata.
- **Tests**
- Enhanced and added tests to validate the presence and correctness of
value field metadata in join schema responses.
- Introduced new test suites covering fetcher failure scenarios and
metadata store functionality.
- Refactored existing fetcher tests to use external utility methods for
data generation.
- Added utility methods for generating deterministic, random, and
event-only test data configurations.
<!-- end of auto-generated comment: release notes by coderabbit.ai -->
## Summary

## Checklist
- [ ] Added Unit Tests
- [ ] Covered by existing CI
- [x] Integration tested
- [ ] Documentation update



<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->

## Summary by CodeRabbit

- **Bug Fixes**
- Improved the handling of the `--mode` command-line option to ensure
all available choices are displayed as strings. This enhances
compatibility and usability when selecting modes.

<!-- end of auto-generated comment: release notes by coderabbit.ai -->
## Summary

As we will be publishing from platform for now, delete this workflow
from chronon.

## Checklist
- [ ] Added Unit Tests
- [ ] Covered by existing CI
- [ ] Integration tested
- [ ] Documentation update



<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->

## Summary by CodeRabbit

- **Chores**
- Removed the automated release publishing workflow, including all
related build, validation, artifact promotion, and cleanup steps.

<!-- end of auto-generated comment: release notes by coderabbit.ai -->
## Summary

## Checklist
- [ ] Added Unit Tests
- [X] Covered by existing CI
- [ ] Integration tested
- [ ] Documentation update



<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->

## Summary by CodeRabbit

- **Refactor**
- Updated test cases to use a new event schema with revised field names
and structure.
- Renamed and adjusted test data and helper methods to align with the
new schema and naming conventions.

<!-- end of auto-generated comment: release notes by coderabbit.ai -->
## Summary
Pulling this out from PR #751, as
we're waiting on an r there and it shows up as noise in various places,
so let's just fix it.

## Checklist
- [ ] Added Unit Tests
- [ ] Covered by existing CI
- [ ] Integration tested
- [ ] Documentation update



<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->

## Summary by CodeRabbit

- **Bug Fixes**
- Improved handling of metrics exporter URL configuration to prevent
errors when the URL is not defined.
- Ensured metrics are only initialized when both metrics are enabled and
an exporter URL is present.

- **Refactor**
- Enhanced internal logic for safer initialization of metrics reporting,
reducing the risk of misconfiguration.

<!-- end of auto-generated comment: release notes by coderabbit.ai -->
## Summary

Add Cloud GCP Embedded Jar to canary build process.

## Checklist
- [ ] Added Unit Tests
- [ ] Covered by existing CI
- [ ] Integration tested
- [ ] Documentation update



<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->
## Summary by CodeRabbit

- **Chores**
- Enhanced CI/CD workflow to build, upload, and manage a new embedded
GCP jar artifact throughout the deployment process.
<!-- end of auto-generated comment: release notes by coderabbit.ai -->
…rfaces (#751)

## Summary
Refactor some of the schema provider shaped code to -
* Use the existing SerDe class interfaces we have
* Work with Mutation types via the SerDe classes
* Primary shuffling is around pulling the Avro deser out of the existing
BaseAvroDeserializationSchema and delegating that to the SerDe to get a
Mutation back, as well as shifting things a bit to call CatalystUtil with
the Mutation Array[Any] types.
* Provide rails for users to provide a custom schema provider. I used
this to test a version of the beacon app out in canary - I'll put up a
separate PR for the test job in a follow up.
* Other misc piled-up fixes - check that GBUs don't compute empty
results; fix our Otel metrics code to be turned off by default to reduce
log spam.

## Checklist
- [X] Added Unit Tests
- [X] Covered by existing CI
- [X] Integration tested
-- Tested via canary on our env / cust env and confirmed we pass the
validation piece as well as see the jobs come up and write out data to
BT.
- [ ] Documentation update



<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->
## Summary by CodeRabbit

- **New Features**
- Added Avro serialization and deserialization support for online data
processing.
- Introduced flexible schema registry and custom schema provider
selection for Flink streaming sources.
- **Refactor**
- Unified and renamed the serialization/deserialization interface to
`SerDe` across modules.
- Centralized and simplified schema provider and deserialization logic
for Flink jobs.
  - Improved visibility and type safety for internal utilities.
- **Bug Fixes**
- Enhanced error handling and robustness in metrics initialization and
deserialization workflows.
- **Tests**
- Added and updated tests for Avro deserialization and schema registry
integration.
  - Removed outdated or redundant test suites.
- **Chores**
  - Updated external dependencies to include Avro support.
  - Cleaned up unused files and legacy code.
<!-- end of auto-generated comment: release notes by coderabbit.ai -->
## Summary
Builds on top of PR: #751. 

This PR adds a streaming GroupBy that can be run as a canary to sanity
check and test things out while making Flink changes. I used this to
sanity check the creation & use of a Mock schema serde that some users
have been asking for.

Can be submitted via:
```
$ CHRONON_ROOT=`pwd`/api/python/test/canary
$ zipline compile --chronon-root=$CHRONON_ROOT
$ zipline run --repo=$CHRONON_ROOT --version $VERSION --mode streaming --conf compiled/group_bys/gcp/item_event_canary.actions_v1 --kafka-bootstrap=bootstrap.zipline-kafka-cluster.us-central1.managedkafka.canary-443022.cloud.goog:9092 --groupby-name gcp.item_event_canary.actions_v1 --validate
```

(Needs the Flink event driver to be running - triggered via
DataProcSubmitterTest)

## Checklist
- [ ] Added Unit Tests
- [ ] Covered by existing CI
- [X] Integration tested
- [ ] Documentation update



<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->
## Summary by CodeRabbit


- **New Features**
- Introduced a new group-by aggregation for item event actions,
supporting real-time analytics by listing ID with data sourced from GCP
Kafka and BigQuery.
  - Added a mock schema provider for testing item event ingestion.

- **Bug Fixes**
- Updated test configurations to use new event schemas, topics, and data
paths for improved accuracy in Flink Kafka ingest job tests.

- **Refactor**
- Renamed and restructured the event driver to focus on item events,
with a streamlined schema and updated job naming.

- **Chores**
- Added new environment variable for Flink state storage configuration.
  - Updated build configuration to reference the renamed event driver.
<!-- end of auto-generated comment: release notes by coderabbit.ai -->
## Summary

Adding a field `LogicalType` to `conf` thrift, and fixing a typo.

## Checklist
- [ ] Added Unit Tests
- [x] Covered by existing CI
- [ ] Integration tested
- [ ] Documentation update



<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->

## Summary by CodeRabbit

- **New Features**
- Added an optional field for logical type classification to the
configuration in the orchestration service API.

- **Style**
  - Updated a parameter name in a method signature for improved clarity.

<!-- end of auto-generated comment: release notes by coderabbit.ai -->

Co-authored-by: ezvz <[email protected]>
)

## Summary

This is the command we expect users to run in their Airflow setup 
```
zipline run --mode streaming deploy --kafka-bootstrap=<KAFKA_BOOTSTRAP> --conf <CONF>  --version-check --latest-savepoint --disable-cloud-logging
```
- This command first does a version check that compares the local
zipline version with the zipline version of the running Flink app. If
they're equal, it's a no-op.
- If they're different, we proceed with deploying. We get the latest
savepoint/checkpoint and then deploy the Flink app with that. Then in
the CLI, we proceed to poll for the manifest file that will be written
out by the Flink app to update with the updated Flink app id + new
dataproc id.

In addition to `--latest-savepoint`, we're going to support
`--no-savepoint` and `--custom-savepoint` deployment strategies.

In addition, we're also going to support:
```
zipline run --mode streaming check-if-job-is-running --conf <CONF> 
```
To check if there is a running Flink job. We implement this by using the
Dataproc client to filter active jobs with custom labels we set on
job-type and metadata-name.
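
The deploy decision described above can be sketched as a small pure function. This is an illustrative sketch only: `plan_deploy` and `SavepointStrategy` are hypothetical names, and the real CLI additionally talks to Dataproc and polls for the manifest file, which is omitted here.

```python
from enum import Enum
from typing import Optional


class SavepointStrategy(Enum):
    # Names mirror the CLI flags mentioned in the summary.
    LATEST = "latest-savepoint"
    NONE = "no-savepoint"
    CUSTOM = "custom-savepoint"


def plan_deploy(
    local_version: str,
    running_version: Optional[str],
    strategy: SavepointStrategy,
    latest_savepoint: Optional[str] = None,
    custom_savepoint: Optional[str] = None,
) -> dict:
    """Hypothetical planner: decide whether to deploy and from which savepoint."""
    # Version check: identical versions mean there is nothing to do.
    if running_version is not None and running_version == local_version:
        return {"action": "noop", "savepoint": None}
    if strategy is SavepointStrategy.LATEST:
        savepoint = latest_savepoint
    elif strategy is SavepointStrategy.CUSTOM:
        savepoint = custom_savepoint
    else:
        savepoint = None  # start from a clean state
    return {"action": "deploy", "savepoint": savepoint}


plan = plan_deploy("0.2.0", "0.1.9", SavepointStrategy.LATEST, latest_savepoint="gs://bucket/sp-42")
print(plan["action"], plan["savepoint"])
```

Keeping the decision separate from the submission call makes the no-op path easy to unit test without a running cluster.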



## Checklist
- [x] Added Unit Tests
- [ ] Covered by existing CI
- [x] Integration tested
- [ ] Documentation update



<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->
## Summary by CodeRabbit

- **New Features**
- Added Google Cloud Storage client with file listing, existence checks,
and in-memory downloads.
- Enhanced Flink streaming job management with checkpointing, savepoint
strategies, version checks, and deployment verification.
- Extended CLI and environment variables to support advanced Flink and
Spark job deployment controls.
- Introduced new configuration templates and test resources for
quickstart and team metadata.
- Added new Flink job option to write internal manifest linking Flink
job ID and parent job ID.

- **Improvements**
- Upgraded Python and Scala dependencies for improved compatibility and
security.
- Improved logging consistency, error handling, and job state tracking
for Dataproc deployments.
- Refactored job submission logic for better modularity and streaming
support.
  - Enhanced deployment scripts with optional git check skipping.

- **Bug Fixes**
- Standardized logging and refined error detection in deployment
scripts.
- Improved error handling during streaming job polling and deployment
verification.

- **Tests**
- Added extensive tests for GCS client, Dataproc submitter, job
submission workflows, and configuration handling.

- **Chores**
- Updated build scripts and Bazel files to include new dependencies and
test resources.
<!-- end of auto-generated comment: release notes by coderabbit.ai -->
## Summary

It seems when I copied the workflow to push_to_platform.yaml, I forgot
to delete the trigger workflow. They are now racing with each other
since both repos are currently private.

## Checklist
- [ ] Added Unit Tests
- [ ] Covered by existing CI
- [ ] Integration tested
- [ ] Documentation update



<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->

## Summary by CodeRabbit

- **Chores**
- Removed the automated workflow that triggered platform subtree updates
on new changes to the main branch.

<!-- end of auto-generated comment: release notes by coderabbit.ai -->
…ons (#771)

## Summary

^^^

Currently, we'll face unexpected behavior if multiple people are working
and iterating on the same GroupBy/Join and changing the conf because
we'll upload to the same GCS path.

This change adds the job id to the destination GCS path.
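
A rough sketch of the idea: including the job id in the destination keeps concurrent uploads of the same conf from overwriting each other. The helper name `conf_destination` and the `metadata/<job_id>/` layout are assumptions made for illustration; the PR does not spell out the actual path format.

```python
import posixpath


def conf_destination(bucket: str, conf_path: str, job_id: str) -> str:
    """Hypothetical layout: gs://<bucket>/metadata/<job_id>/<conf file name>."""
    conf_name = posixpath.basename(conf_path)
    return "gs://" + posixpath.join(bucket, "metadata", job_id, conf_name)


conf = "compiled/group_bys/gcp/item_event_canary.actions_v1"
# Two people iterating on the same conf now upload to distinct paths.
print(conf_destination("my-bucket", conf, "job-a"))
print(conf_destination("my-bucket", conf, "job-b"))
```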

## Checklist
- [ ] Added Unit Tests
- [ ] Covered by existing CI
- [x] Integration tested
- [ ] Documentation update



<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->
## Summary by CodeRabbit

- **Refactor**
- Streamlined job submission to upload a single metadata configuration
file, simplifying the process.
- Enhanced job ID management by requiring and propagating a job ID
argument, improving job tracking and consistency.
<!-- end of auto-generated comment: release notes by coderabbit.ai -->
…partition (#772)

## Summary

- Fix the partition sensor check; it needs to check that the primary
partition value is present.

## Checklist
- [ ] Added Unit Tests
- [ ] Covered by existing CI
- [ ] Integration tested
- [ ] Documentation update
<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->
## Summary by CodeRabbit

- **New Features**
- Enhanced logging to show detailed partition keys and values during
partition checks for improved transparency.

- **Style**
- Improved organization and grouping of import statements for clarity
and consistency.
<!-- end of auto-generated comment: release notes by coderabbit.ai -->

<!-- av pr metadata
This information is embedded by the av CLI when creating PRs to track
the status of stacks when using Aviator. Please do not delete or edit
this section of the PR.
```
{"parent":"main","parentHead":"","trunk":"main"}
```
-->

Co-authored-by: thomaschow <[email protected]>
## Summary

Adding a flag so that airflow integration knows whether to schedule a
join or not

## Checklist
- [ ] Added Unit Tests
- [ ] Covered by existing CI
- [ ] Integration tested
- [ ] Documentation update



<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->

## Summary by CodeRabbit

- **New Features**
- Enhanced join metadata to include a flag indicating the presence of
label parts.
- **Tests**
- Updated sample join test to include label part information in join
instantiation.

<!-- end of auto-generated comment: release notes by coderabbit.ai -->

---------

Co-authored-by: ezvz <[email protected]>
## Summary

- We should be running setups regardless of whether things are
partitioned.

## Checklist
- [ ] Added Unit Tests
- [ ] Covered by existing CI
- [ ] Integration tested
- [ ] Documentation update


<!-- av pr metadata
This information is embedded by the av CLI when creating PRs to track
the status of stacks when using Aviator. Please do not delete or edit
this section of the PR.
```
{"parent":"main","parentHead":"","trunk":"main"}
```
-->


<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->

## Summary by CodeRabbit

- **Refactor**
- Adjusted the timing of SQL setup command execution to occur earlier in
the staging query process, ensuring setups run before any query
execution or partition checks. No changes to user-facing features or
functionality.

<!-- end of auto-generated comment: release notes by coderabbit.ai -->

Co-authored-by: thomaschow <[email protected]>
…gInfo (#774)

## Summary
When we add fields in our API, we can run into backwards / forwards
compat issues depending on when the json updates make their way out to
the GroupByServingInfo (on orch side / serving side). Turning off the
round trip check to help cut the noise on these issues. If we can
deserialize the thrift json we proceed; otherwise this code will throw a
JsonException.
Some details - [slack
thread](https://zipline-2kh4520.slack.com/archives/C08345NBWH4/p1747092844340579)
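
To illustrate why a round trip check is noisy when fields are added, here is a minimal Python stand-in. It is not the Thrift/Scala code touched by this change: `load_conf` and `known_fields` are hypothetical, and plain JSON dictionaries stand in for the Thrift structs.

```python
import json


def load_conf(raw: str, known_fields: set, round_trip_check: bool = False) -> dict:
    """Hypothetical loader: keep only fields this reader version knows about."""
    parsed = json.loads(raw)  # malformed input raises json.JSONDecodeError
    conf = {k: v for k, v in parsed.items() if k in known_fields}
    if round_trip_check:
        # Re-serialize and compare: fails whenever the writer knows a newer field.
        if json.dumps(conf, sort_keys=True) != json.dumps(parsed, sort_keys=True):
            raise ValueError("round-trip mismatch: payload has fields this reader does not know")
    return conf


raw = '{"name": "gb", "newField": 1}'  # written by a newer API version
print(load_conf(raw, {"name"}))        # succeeds with the check turned off
```

With `round_trip_check=True` the same payload is rejected even though it deserializes cleanly, which is the kind of noise the change is meant to cut.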

## Checklist
- [ ] Added Unit Tests
- [X] Covered by existing CI
- [ ] Integration tested
- [ ] Documentation update



<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->

## Summary by CodeRabbit

- **Bug Fixes**
- Improved compatibility when loading certain configuration data by
relaxing validation during data processing.

<!-- end of auto-generated comment: release notes by coderabbit.ai -->
@coderabbitai

coderabbitai bot commented May 15, 2025

Caution

Review failed

An error occurred during the review process. Please try again later.

Walkthrough

This change introduces Bazel build and GitHub Actions CI configuration, migrates Python API compilation and runtime logic to a new modular structure, adds GCP/AWS runner support, implements a new CLI, restructures sample/test data, and removes legacy Airflow DAG/operator code. It also updates aggregation APIs and window handling, and modernizes test and requirements files.

Changes

| File(s) / Path(s) | Change Summary |
| --- | --- |
| .bazelignore, .bazeliskrc, .bazelproject, .bazelrc, .plugin-versions, .tool-versions, .scalafmt.conf, .gitignore, WORKSPACE | Added Bazel and tool versioning/configuration files; updated ignore patterns and formatting settings. |
| .github/..., .github/workflows/... | Added/updated GitHub Actions workflows for build, test, Docker, canary/platform pushes, and branch protection. Added issue/PR templates and release changelog config. |
| aggregator/BUILD.bazel, api/BUILD.bazel | Introduced Bazel build rules for Scala/Java/Python modules, including test targets and dependencies. |
| aggregator/src/main/..., aggregator/src/test/... | Migrated Scala test classes to ScalaTest AnyFlatSpec; refactored comments, imports, and formatting; removed ApproxHistogram logic and related tests. Updated aggregation logic to support new frequent/heavy hitter operations. |
| airflow/... | Deleted all Airflow DAG constructors, operators, helpers, and constants; removed orchestration readme and Python run logic. |
| api/py/..., api/python/ai/chronon/... | Removed legacy Python API, run logic, and sample configs. Added new modular Python API: CLI compile context, compiler, serialization, validation, logger, repo runners (default, AWS, GCP), utils, and consolidated types. Added new sample/test data for GCP/AWS. Refactored aggregation, join, query, and window APIs. |
| api/python/requirements/..., api/python/pyproject.toml | Added/updated requirements files for base and dev environments; added Ruff linter config. |
| README.md, api/python/README.md, api/python/test/canary/README.md, api/python/test/sample/README.md, api/python/ai/chronon/resources/gcp/README.md | Updated or added documentation to reflect new project structure, usage, and cloud-specific instructions. |
| api/python/ai/chronon/types.py, api/python/ai/chronon/source.py, api/python/ai/chronon/windows.py | Added unified type imports, source wrappers, and window parsing utilities for easier API usage. |
| api/python/ai/chronon/repo/zipline.py, api/python/ai/chronon/repo/init.py, api/python/ai/chronon/repo/compilev3.py, api/python/ai/chronon/repo/run.py | Added new CLI entrypoints for project initialization, compilation, and job execution. |
| api/python/ai/chronon/group_by.py, api/python/ai/chronon/join.py, api/python/ai/chronon/query.py | Refactored APIs to support new execution info, window string parsing, and removed deprecated parameters. |
| api/python/ai/chronon/cli/compile/..., api/python/ai/chronon/repo/compilev2.py, api/python/ai/chronon/repo/compile.py | Added new compile context, compiler, display, and parsing logic for Python config objects; improved error handling and status reporting. |
| api/python/ai/chronon/cli/logger.py, api/python/ai/chronon/cli/git_utils.py | Added colored logging and git utility helpers for CLI and compilation. |
| api/python/test/canary/..., api/python/test/sample/... | Added/updated test sample data and team configs for GCP/AWS; migrated window specifications to string format; updated imports and aggregation logic. |
| api/python/ai/chronon/resources/gcp/... | Added GCP sample project with GroupBy, Join, Source definitions, teams config, and install script. |
| AUTHORS, CONTRIBUTING.md, GOVERNANCE.md, LICENSE, README.md (old), api/py/... (old), airflow/... (all), api/python/tox.ini, api/python/requirements/base.txt (old), api/python/requirements/base.in (old), etc. | Deleted legacy documentation, license, Airflow, and old Python API files. |

Sequence Diagram(s)

sequenceDiagram
    participant User
    participant CLI
    participant Compiler
    participant Validator
    participant Runner
    participant CloudProvider

    User->>CLI: zipline compile/run/init
    CLI->>Compiler: CompileContext setup
    Compiler->>Validator: Validate GroupBy/Join/StagingQuery
    Validator-->>Compiler: Validation results
    Compiler->>CLI: Write compiled objects / errors
    User->>CLI: zipline run --mode ...
    CLI->>Runner: Prepare job, set env, download jars
    Runner->>CloudProvider: Submit Spark/Flink job (GCP/AWS/Local)
    CloudProvider-->>Runner: Job status/results
    Runner-->>CLI: Output results/status

Possibly related PRs

  • cli v2 #277: Adds a new compilation context class and CLI improvements, directly related to the new CompileContext and CLI compilation logic.
  • Python compile and API skeleton/docs #113: Implements initial Python compile functionality and runtime methods for GroupBy/Join, matching the modular runner and compile system here.
  • fix: Fix approx_histogram_k aggregations #411: Refactors aggregation operations, removing APPROX_HISTOGRAM_K and adding APPROX_FREQUENT_K, which aligns with the aggregation API changes in this PR.

Suggested reviewers

  • tchow-zlai
  • kumar-zlai

Poem

In Bazel’s light, the runners wake,
Old Airflow scripts the dust now take.
Python modules, fresh and neat,
Compile, validate, and jobs repeat.
Cloud and test, both GCP and AWS,
Zipline’s CLI brings order from the mess.

🚀 Code reborn, let features flow—
Onward, Chrononauts, let’s go!



varant-zlai and others added 2 commits May 15, 2025 17:58
## Summary

replace partition spec with column -> partition spec

## Checklist
- [ ] Added Unit Tests
- [ ] Covered by existing CI
- [ ] Integration tested
- [ ] Documentation update



<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->
## Summary by CodeRabbit

- **Refactor**
- Simplified partition specification handling across planners and
utilities by removing the custom partition spec wrapper and
standardizing on a single partition spec type.
- Updated related methods and class constructors to use the new
partition spec approach, streamlining partition metadata access.
- Removed unused fields and imports related to the old partition spec
wrapper.
<!-- end of auto-generated comment: release notes by coderabbit.ai -->

---------

Co-authored-by: ezvz <[email protected]>
## Summary

Setting the partition spec correctly in table dependencies

## Checklist
- [ ] Added Unit Tests
- [x] Covered by existing CI
- [ ] Integration tested
- [ ] Documentation update



<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->

## Summary by CodeRabbit

- **Bug Fixes**
- Corrected the partition column used for snapshot table dependencies to
ensure accurate data partitioning.

<!-- end of auto-generated comment: release notes by coderabbit.ai -->

Co-authored-by: ezvz <[email protected]>
Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 53

🔭 Outside diff range comments (6)
api/python/test/sample/joins/sample_team/sample_chaining_join.py (1)

1-34: 💡 Verification agent

🧩 Analysis chain

File deletion seems appropriate.

Part of broader cleanup of sample configurations.


🏁 Script executed:

#!/bin/bash
# Check if any other files reference this deleted sample
rg -l "sample_chaining_join" --type py

Length of output: 134


Remove stale references to deleted sample_chaining_join.py

  • In api/python/test/test_utils.py: remove the sample/joins/sample_team/sample_chaining_join.py entry
  • In api/python/test/sample/group_bys/sample_team/chaining_group_by.py: strip out any doc-comment links to sample_chaining_join.py
api/python/ai/chronon/cli/logger.py (1)

54-62: ⚠️ Potential issue

Fix done() signature. The cond parameter is unused; simplify to def done(message):.

api/python/ai/chronon/cli/compile/compiler.py (1)

151-161: ⚠️ Potential issue

Return type mismatch
_write_object is annotated to return Optional[List[BaseException]] but returns nothing. Either return collected errors or change the annotation to None.
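
One way to make the annotation and the behaviour agree is to return the errors that were collected. This is a minimal sketch only: `write_object` and its `writers` argument are hypothetical stand-ins, not the actual `_write_object` code.

```python
from typing import Callable, List, Optional


def write_object(writers: List[Callable[[], None]]) -> Optional[List[BaseException]]:
    """Run each writer; return the collected errors, or None if all succeeded."""
    errors: List[BaseException] = []
    for write in writers:
        try:
            write()
        except Exception as exc:
            # Collect instead of aborting on the first failure.
            errors.append(exc)
    # An empty list becomes None so the Optional[...] annotation is meaningful.
    return errors or None
```

If callers never inspect the result, changing the annotation to `None` is the smaller change.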

api/python/ai/chronon/repo/validator.py (1)

159-176: ⚠️ Potential issue

build_derived_columns mutates caller input & assumes Set APIs on list ➜ runtime crash
output_columns = pre_derived_columns keeps a reference to the caller’s collection and later calls .clear(), .add(), .remove().
Because get_join_output_columns now passes a list (183-188), the first call to .add() will raise AttributeError.
Fix: copy into a set up-front and only sort when returning.

-    output_columns = pre_derived_columns
+    # Work on an internal set copy – never mutate the caller.
+    output_columns = set(pre_derived_columns)

Also replace .remove(...) with .discard(...) to avoid KeyError.
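
The copy-then-sort pattern suggested above could look roughly like this. The function name and the `added`/`removed` parameters are hypothetical and only illustrate the idea; they are not the real `build_derived_columns` signature.

```python
from typing import Iterable, List


def build_output_columns(
    pre_derived_columns: Iterable[str],
    added: Iterable[str],
    removed: Iterable[str],
) -> List[str]:
    # Copy into a set so the caller's list (or set) is never mutated.
    output_columns = set(pre_derived_columns)
    output_columns.update(added)
    for name in removed:
        # discard() is a no-op for missing names; remove() would raise KeyError.
        output_columns.discard(name)
    # Sort only at the boundary so callers get a deterministic list.
    return sorted(output_columns)
```

For example, `build_output_columns(["b", "a"], ["c"], ["a", "missing"])` leaves the input list untouched and tolerates the unknown name.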

api/python/ai/chronon/join.py (1)

65-91: 🛠️ Refactor suggestion

Avoid global __import__ monkey-patching

Mutating __builtins__["__import__"] even temporarily is risky in multithreaded processes; any concurrent import during this window may break. Wrap this in a small context-manager (or avoid entirely).
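
A context manager at least guarantees that the original hook is restored, even when the body raises. The sketch below is an assumption about what such a wrapper might look like (`recording_imports` is a hypothetical name). The lock only serializes other patchers; imports running on other threads still pass through the hook while it is installed.

```python
import builtins
import contextlib
import threading

# Re-entrant so nested use on one thread does not deadlock.
_IMPORT_PATCH_LOCK = threading.RLock()


@contextlib.contextmanager
def recording_imports():
    """Temporarily wrap builtins.__import__ and record requested module names."""
    seen = []
    with _IMPORT_PATCH_LOCK:
        original_import = builtins.__import__

        def hook(name, globals=None, locals=None, fromlist=(), level=0):
            seen.append(name)
            return original_import(name, globals, locals, fromlist, level)

        builtins.__import__ = hook
        try:
            yield seen
        finally:
            # Always restore, even if the body raised.
            builtins.__import__ = original_import
```

Usage is `with recording_imports() as seen: ...`; once the block exits, `builtins.__import__` is the original function again.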

api/python/ai/chronon/group_by.py (1)

608-610: ⚠️ Potential issue

{source} not interpolated.

Second string isn’t an f-string, so {source} is printed literally.

-                "in source {source}. Please specify only the `timeColumn`"
+                f"in source {source}. Please specify only the `timeColumn`"
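
The pitfall is easy to reproduce in isolation: with implicit string concatenation, each literal needs its own `f` prefix. The message text below is made up for illustration and is not the real error message.

```python
def broken_message(source: str) -> str:
    # Only the first literal is an f-string; the second keeps "{source}" verbatim.
    return f"Conflict " "in source {source}."


def fixed_message(source: str) -> str:
    # Both literals carry the f prefix, so {source} is interpolated.
    return f"Conflict " f"in source {source}."
```

`broken_message("purchases")` still contains the literal text `{source}`, while `fixed_message("purchases")` contains `purchases`.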
♻️ Duplicate comments (2)
.github/workflows/push_to_canary.yaml (2)

392-397: Repeat of path/quoting issue in push_to_aws_passing
Fix as above for wheel & jars.
Also add quotes around ${{ needs.build_artifacts.outputs.version }} to eliminate SC2086 warnings.


452-462: Duplicate GCP passing step – path + quoting
Same corrections required for GCP promotion step. Consider extracting an action/step template to DRY.

🧹 Nitpick comments (87)
api/python/README.md (1)

134-136: Add language specifier to code fence.

Markdown linter flags missing language identifier.

-```
+```bash
🧰 Tools
🪛 markdownlint-cli2 (0.17.2)

134-134: Fenced code blocks should have a language specified

(MD040, fenced-code-language)

api/python/test/sample/README.md (1)

1-7: Duplicate of canary README.

Content identical to canary README. Consider consolidating or differentiating content.

.github/pull_request_template.md (1)

9-9: Remove stray character.

Line contains only "9" which appears to be unintentional.

-9
.github/release.yml (2)

9-9: Fix trailing spaces.

Remove trailing space after "Minor features".

-    - title: Minor features 
+    - title: Minor features
🧰 Tools
🪛 YAMLlint (1.35.1)

[error] 9-9: trailing spaces

(trailing-spaces)


17-17: Add newline at EOF.

Add newline character at end of file.

        - "*"
+
🧰 Tools
🪛 YAMLlint (1.35.1)

[error] 17-17: no new line character at the end of file

(new-line-at-end-of-file)

api/python/ai/chronon/eval/query_parsing.py (1)

1-20: SQL table extraction utility is well-implemented.

Function cleanly extracts table names from SQL queries using sqlglot with BigQuery dialect. Consider adding error handling for malformed queries.

 def get_tables_from_query(sql_query) -> List[str]:
     import sqlglot
-
-    # Parse the query
-    parsed = sqlglot.parse_one(sql_query, dialect="bigquery")
+    try:
+        # Parse the query
+        parsed = sqlglot.parse_one(sql_query, dialect="bigquery")
 
-    # Extract all table references
-    tables = parsed.find_all(sqlglot.exp.Table)
+        # Extract all table references
+        tables = parsed.find_all(sqlglot.exp.Table)
 
-    table_names = []
-    for table in tables:
-        name_parts = [part for part in [table.catalog, table.db, table.name] if part]
-        table_name = ".".join(name_parts)
-        table_names.append(table_name)
+        table_names = []
+        for table in tables:
+            name_parts = [part for part in [table.catalog, table.db, table.name] if part]
+            table_name = ".".join(name_parts)
+            table_names.append(table_name)
 
-    return table_names
+        return table_names
+    except Exception as e:
+        # Return empty list or raise a more specific error
+        return []
aggregator/src/test/scala/ai/chronon/aggregator/test/EditDistanceTest.scala (1)

25-25: Consider better test description.

"basic" is vague. Suggest more descriptive name like "correctly calculate edit distances" to improve test clarity.

api/python/test/sample/joins/sample_team/sample_chaining_join_parent.py (2)

1-24: Add docstring.

Missing explanatory docstring.


12-19: Extract repeated key_mapping.

Duplicate key_mapping. Consider creating constant.

api/python/pyproject.toml (1)

35-35: Remove commented line.

Decide on E402 rule or remove comment.

api/python/ai/chronon/cli/compile/display/diff_result.py (2)

14-21: Move signage methods up.

Define at class level if reused elsewhere.


5-6: Remove extra blank line.

Consistency improvement.

api/python/ai/chronon/resources/gcp/joins/test/data.py (2)

1-1: Use absolute import for group_bys.test.data.

Use an absolute import path to avoid potential import resolution issues.

-from group_bys.test.data import group_by_v1
+from api.python.ai.chronon.resources.gcp.group_bys.test.data import group_by_v1

23-28: Consider using more descriptive variable names.

Generic variable names like v1 lack context.

-v1 = Join(
+checkout_features_join = Join(
    left=source,
    right_parts=[
        JoinPart(group_by=group_by_v1)
    ],
)
.github/workflows/build_and_push_docker.yaml (1)

35-35: Add newline at end of file

File should end with a newline.

           push: true
           tags: ${{ env.REGISTRY }}/${{ env.IMAGE_NAME }}:latest
+
🧰 Tools
🪛 YAMLlint (1.35.1)

[error] 35-35: no new line character at the end of file

(new-line-at-end-of-file)

api/python/ai/chronon/repo/explore.py (1)

76-76: Key renamed for consistency

Changed from "output_namespace" to 'outputNamespace' to align with camelCase convention.

.github/image/Dockerfile (2)

59-59: Use --no-cache-dir with pip.
Avoid inflating the final image layer by adding --no-cache-dir to pip3 install ….


61-63: Combine the second apt update.
You run apt update twice; merge into the earlier layer to shrink image & speed builds.

api/python/test/canary/group_bys/aws/purchases.py (2)

19-19: day shadowing is fine but lint-unfriendly.
Minor: use days or length to appease linters/readers.


21-46: Duplicate definitions inflate config.
v1_dev and v1_test differ only in name; build one helper and clone it to cut noise.

.github/workflows/push_to_platform.yaml (1)

34-34: Fix formatting issues.

Remove trailing spaces and add newline at end of file.

-          chmod 600 ~/.ssh/id_rsa
-          
+          chmod 600 ~/.ssh/id_rsa
+
-          ssh-keyscan github.com >> ~/.ssh/known_hosts
-          
+          ssh-keyscan github.com >> ~/.ssh/known_hosts
+
-          ssh-add ~/.ssh/id_rsa
-          
+          ssh-add ~/.ssh/id_rsa
+
-          EOF
-          
+          EOF
+
-          git remote add chronon [email protected]:zipline-ai/chronon.git || true
-          
+          git remote add chronon [email protected]:zipline-ai/chronon.git || true
+
-        run: git push origin main
+        run: git push origin main
+

Also applies to: 37-37, 41-41, 49-49, 52-52, 57-57

🧰 Tools
🪛 YAMLlint (1.35.1)

[error] 34-34: trailing spaces

(trailing-spaces)

api/python/ai/chronon/windows.py (1)

50-51: Consider supporting minutes and weeks.

Function only supports hours and days. Consider adding minutes and weeks for greater flexibility.

.github/workflows/test_scala_fmt.yaml (2)

50-50: Missing newline at end of file

Add a newline at the end to follow standard coding practices.

            {}.format-test
+
🧰 Tools
🪛 YAMLlint (1.35.1)

[error] 50-50: no new line character at the end of file

(new-line-at-end-of-file)


42-43: Add error handling for credential decoding

Base64 decoding may fail silently.

      - name: Setup Bazel cache credentials
        run: |
-          echo "${{ secrets.BAZEL_CACHE_CREDENTIALS }}" | base64 -d > bazel-cache-key.json
+          if ! echo "${{ secrets.BAZEL_CACHE_CREDENTIALS }}" | base64 -d > bazel-cache-key.json; then
+            echo "Failed to decode BAZEL_CACHE_CREDENTIALS"
+            exit 1
+          fi
.github/workflows/test_bazel_config.yaml (2)

22-23: Remove extra blank line

For consistency, remove the extra newline.

      - 'WORKSPACE'
-

  concurrency:

43-45: Add error handling for credential decoding

Base64 decoding may fail silently.

      - name: Setup Bazel cache credentials
        run: |
-          echo "${{ secrets.BAZEL_CACHE_CREDENTIALS }}" | base64 -d > bazel-cache-key.json
+          if ! echo "${{ secrets.BAZEL_CACHE_CREDENTIALS }}" | base64 -d > bazel-cache-key.json; then
+            echo "Failed to decode BAZEL_CACHE_CREDENTIALS"
+            exit 1
+          fi
api/python/test/canary/teams.py (1)

111-111: Consider additional AWS configuration

AWS team has minimal configuration compared to GCP.

Would you like me to suggest additional AWS-specific configurations to match GCP's comprehensive setup?

api/python/ai/chronon/cli/plan/controller_iface.py (1)

23-23: Fix typo in method name. Rename upload_branch_mappsing to upload_branch_mapping.

api/python/ai/chronon/resources/gcp/teams.py (2)

27-32: Placeholders in config. Values like <customer_id> remain; add runtime validation or template injection.


41-48: Unfilled environment placeholders. Consider fail-fast on missing values.

aggregator/src/test/scala/ai/chronon/aggregator/test/TwoStackLiteAggregatorTest.scala (1)

92-100: Clean up dead code. Remove or re-enable the commented-out naive aggregator block.

api/python/ai/chronon/eval/__init__.py (2)

50-54: String replacement can mangle nested names

str.replace() might inadvertently rewrite substrings inside other identifiers (e.g. project.table vs project.table2). Use word-boundary regex or sqlglot rewriting instead.

+import re
+...
+clean_query = re.sub(rf'\b{re.escape(table_name)}\b', clean_name, clean_query)
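
A self-contained sketch of the word-boundary approach, using only the standard library. `replace_table_name` is a hypothetical helper name, and the table names in the usage note are invented for the example.

```python
import re


def replace_table_name(query: str, table_name: str, clean_name: str) -> str:
    # \b on both sides stops "project.table" from matching inside "project.table2".
    pattern = rf"\b{re.escape(table_name)}\b"
    # A callable replacement keeps backslashes in clean_name from being read as escapes.
    return re.sub(pattern, lambda _match: clean_name, query)
```

With `"SELECT * FROM project.table JOIN project.table2"`, only the first reference is rewritten, whereas `str.replace` would rewrite both. Note that `\b` assumes the name starts and ends with a word character, so backtick-quoted identifiers would need a different pattern.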

102-108: Optional typing for _spark

Initialising _spark: SparkSession = None violates the declared non-nullable type. Declare as Optional[SparkSession] (Python 3.10+: SparkSession | None) to appease type-checkers.

aggregator/src/main/scala/ai/chronon/aggregator/row/ColumnAggregator.scala (2)

137-139: Box helpers already exist

java.lang.Long.valueOf / Double.valueOf are used directly two lines below, making toJLong/toJDouble redundant. Consider inlining to cut noise.


268-290: FrequentItems mapper covers boxed numerics only

FrequentItems generic type must be boxed; the conversions provided handle this, good. Be mindful that k default 8 may be too low for realistic heavy-hitter detection—verify with domain use-cases.

README.md (1)

15-16: Tighten wording

Replace “on a regular basis” with “regularly” for brevity.

🧰 Tools
🪛 LanguageTool

[style] ~15-~15: ‘on a regular basis’ might be wordy. Consider a shorter alternative.
Context: ...on are picked and merged into this repo on a regular basis, and improvements made to this reposito...

(EN_WORDINESS_PREMIUM_ON_A_REGULAR_BASIS)

api/python/ai/chronon/repo/zipline.py (4)

1-3: Consolidate importlib imports. Combine to a single line, e.g.:

from importlib.metadata import version as ver, PackageNotFoundError

11-27: Extract large ASCII logo. Consider moving LOGO to a separate text/resource file or trimming inline art.


30-36: Cache package version. You call ver("zipline-ai") twice; store its result in a module‐level constant to avoid duplicate lookups.


39-45: Avoid duplicate version retrieval. _set_package_version() is invoked in both the decorator and function body—call it once and reuse the value.
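
Both version-lookup comments could be addressed by memoizing the lookup. A minimal sketch, assuming a fallback string of `"unknown"` is acceptable when the distribution is not installed; `package_version` is a hypothetical helper, not existing code in zipline.py.

```python
from functools import lru_cache
from importlib.metadata import PackageNotFoundError, version


@lru_cache(maxsize=None)
def package_version(name: str = "zipline-ai") -> str:
    """Look up the installed version once; later calls hit the cache."""
    try:
        return version(name)
    except PackageNotFoundError:
        # Assumed fallback for source checkouts without an installed distribution.
        return "unknown"
```

The decorator and the function body can then both call `package_version()` without triggering a second metadata lookup.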

api/python/ai/chronon/types.py (1)

5-11: Reorder and simplify imports. Follow PEP8: standard libs first, then local; use from ai.chronon.api.common import ttypes as common.

.github/workflows/test_scala_2_12_non_spark.yaml (2)

47-59: Pin checkout action. Consider using actions/checkout@v3 for a stable major release instead of @v4.


35-203: DRY up repeated jobs. All module tests share identical steps; use a matrix or reusable workflow to minimize duplication.

.github/workflows/test_scala_2_13_non_spark.yaml (2)

45-46: Upgrade checkout. Recommend actions/checkout@v3 over @v4 for major‐release stability.


33-209: Centralize repeated definitions. Leveraging a matrix or composite action will make this DRY and easier to maintain.

api/python/ai/chronon/cli/logger.py (3)

1-4: Reorder imports. Follow stdlib → third-party → local convention: e.g. import logging, sys; from datetime import datetime.


17-29: Include exception details. Current format() omits record.exc_info/stack_info; extend it to capture tracebacks when present.
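
A formatter that keeps tracebacks could be sketched as follows. The class name and message layout are assumptions; only the `logging.Formatter` methods used here (`formatException`, `formatStack`) come from the standard library.

```python
import logging


class TracebackAwareFormatter(logging.Formatter):
    def format(self, record: logging.LogRecord) -> str:
        message = f"{record.levelname} {record.getMessage()}"
        if record.exc_info:
            # Render the traceback attached via logger.exception(...) or exc_info=True.
            message += "\n" + self.formatException(record.exc_info)
        if record.stack_info:
            # Render the stack captured via stack_info=True.
            message += "\n" + self.formatStack(record.stack_info)
        return message
```

Any color wrapping already done in the custom `format()` can be applied to the first line before the traceback is appended.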


46-52: Leverage click styles. Replace custom ANSI wrappers with click.style(text, fg='red') for portability.

.github/workflows/test_scala_2_12_spark.yaml (2)

33-46: Trim duplication with a matrix / reusable step

Seven jobs repeat identical checkout + cache-credential + Bazel test logic. A strategy.matrix or a reusable workflow would shrink ~100 lines and ease maintenance.

Also applies to: 59-71, 85-97, 111-123, 137-149, 163-175, 189-201


201-201: Missing trailing newline

Add a final newline to keep linters quiet.

🧰 Tools
🪛 YAMLlint (1.35.1)

[error] 201-201: no new line character at the end of file

(new-line-at-end-of-file)

.github/workflows/test_python.yaml (1)

71-71: Add trailing newline

Ends-with-newline = happier YAML linters.

🧰 Tools
🪛 YAMLlint (1.35.1)

[error] 71-71: no new line character at the end of file

(new-line-at-end-of-file)

aggregator/src/main/scala/ai/chronon/aggregator/base/SimpleAggregators.scala (1)

398-406: Magic constants – extract for clarity

0.75 * 0.5 comes from internal purge + load-factor assumptions. Consider:

private val PurgeFactor   = 0.5
private val LoadFactor    = 0.75
val sketchSize = nextPowerOfTwo(math.ceil(mapSize /(PurgeFactor*LoadFactor)).toInt max 2)

Easier to tune later.

api/python/ai/chronon/eval/sample_tables.py (2)

13-14: Simplified variable naming needed.

Variable raw_scan_query is redundant since it just stores query parameter.

-    raw_scan_query = query
-    print(f"Sampling {table} with query: {raw_scan_query}")
+    print(f"Sampling {table} with query: {query}")

58-59: Remove duplicate import.

os already imported at module level.

-    import os
-
api/python/ai/chronon/resources/gcp/README.md (6)

33-34: Fix incomplete code block.

Missing closing backtick.

   ```bash
   ./zipline-cli-install.sh
-
+  ```

98-98: Fix heading style.

Remove trailing period per markdown standards.

-## 🧪 Running a GroupBy upload (GBU) job.
+## 🧪 Running a GroupBy upload (GBU) job
🧰 Tools
🪛 markdownlint-cli2 (0.17.2)

98-98: Trailing punctuation in heading
Punctuation: '.'

(MD026, no-trailing-punctuation)


111-111: Fix heading style.

Remove trailing period.

-## 🧪 Upload the GBU values to online KV store.
+## 🧪 Upload the GBU values to online KV store
🧰 Tools
🪛 markdownlint-cli2 (0.17.2)

111-111: Trailing punctuation in heading
Punctuation: '.'

(MD026, no-trailing-punctuation)


122-122: Fix heading style.

Remove trailing period.

-## 🧪 Upload the metadata of Chronon GroupBy or Join to online KV store for serving.
+## 🧪 Upload the metadata of Chronon GroupBy or Join to online KV store for serving
🧰 Tools
🪛 markdownlint-cli2 (0.17.2)

122-122: Trailing punctuation in heading
Punctuation: '.'

(MD026, no-trailing-punctuation)


140-140: Fix heading style.

Remove trailing period.

-## 🧪 Fetch feature values from Chronon GroupBy or Join.
+## 🧪 Fetch feature values from Chronon GroupBy or Join
🧰 Tools
🪛 markdownlint-cli2 (0.17.2)

140-140: Trailing punctuation in heading
Punctuation: '.'

(MD026, no-trailing-punctuation)


167-167: Update GitHub link.

Link points to airbnb/chronon but should be zipline-ai/chronon.

-[GitHub](https://github.com/airbnb/chronon)
+[GitHub](https://github.com/zipline-ai/chronon)
api/python/ai/chronon/repo/aws.py (1)

49-61: Prefer logger over print
Use LOG.info/exception to keep output consistent and structured.

api/python/ai/chronon/cli/compile/display/compile_status.py (1)

32-40: Key may be None
Using None as dict key for trackers is error-prone; derive key from compiled.obj_type when absent.

api/python/ai/chronon/cli/compile/display/class_tracker.py (1)

68-72: Initialize closed flag
self.closed first used here; define in __init__ for clarity.

   def __init__(self):
       ...
       self.deleted_names: List[str] = []
+      self.closed: bool = False
api/python/ai/chronon/cli/compile/parse_configs.py (1)

29-30: Consider checking file existence before import.

Add file existence check before attempting to import to prevent unexpected errors.

- try:
-     results_dict = from_file(f, cls, input_dir)
+ try:
+     if not os.path.isfile(f):
+         raise FileNotFoundError(f"File {f} does not exist")
+     results_dict = from_file(f, cls, input_dir)
api/python/ai/chronon/source.py (1)

59-61: Documentation mix-up in EntitySource.

Parameter descriptions appear to be misaligned - the description for mutationTopic seems to describe what query should do.

-     - mutationTopic: The logic used to scan both the table and the topic. Contains row level transformations
-                      and filtering expressed as Spark SQL statements.
-     - query: If each new hive partition contains not just the current day's events but the entire set
+     - mutationTopic: Kafka topic containing mutation events for the entity.
+     - query: The logic used to scan both the table and the topic. Contains row level transformations
+              and filtering expressed as Spark SQL statements.
api/python/ai/chronon/repo/run.py (2)

74-74: Remove commented-out fetch_online_jar reference.

Since there's a clear note explaining why it's not used, consider removing the commented code.

-        # NOTE: We don't want to ever call the fetch_online_jar.py script since we're working
-        # on our internal zipline fork of the chronon repo
-        # "online_jar_fetch": os.path.join(chronon_repo_path, "scripts/fetch_online_jar.py"),

239-239: Enhance error message for missing configuration.

Provide more actionable guidance when config file is missing.

-        raise ValueError(f"Conf file {conf_path} does not exist.")
+        raise ValueError(f"Conf file {conf_path} does not exist. Ensure the file path is correct relative to {repo}.")
api/python/ai/chronon/repo/default_runner.py (1)

186-193: Shell-string command building invites injection & quoting bugs
You interpolate user-supplied strings straight into a shell command, then execute via utils.check_call. Prefer subprocess.run([...]) with an arg list or shlex.quote every piece.
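
As an illustration of the argument-list form, the sketch below runs a child process without a shell, so metacharacters in user input are passed through verbatim. `run_command` is a hypothetical helper; the real code goes through `utils.check_call`, whose signature is not shown here.

```python
import subprocess
import sys
from typing import List


def run_command(args: List[str]) -> str:
    # No shell is involved, so each list element reaches the child process as one argument.
    completed = subprocess.run(args, check=True, capture_output=True, text=True)
    return completed.stdout


if __name__ == "__main__":
    # The "; echo injected" part is printed as data, not executed.
    print(run_command([sys.executable, "-c", "import sys; print(sys.argv[1])", "a b; echo injected"]))
```

If a single shell string is unavoidable, wrapping every interpolated piece in `shlex.quote` is the fallback.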

api/python/ai/chronon/cli/compile/compiler.py (1)

7-14: Duplicate import
ai.chronon.cli.compile.display.compiled_obj is imported twice (line 7 and 13). Drop one.

.github/workflows/test_scala_2_13_spark.yaml (2)

36-39: Securely handling credentials, but consider GitHub OIDC

The credential handling works but GitHub OIDC is more secure for GCP auth.

Also applies to: 63-66, 90-93, 117-120, 144-147, 171-174, 198-201


209-209: Missing newline at end of file

Add a newline at the end of the file to fix the YAMLlint warning.

            //spark:streaming_test
+
🧰 Tools
🪛 YAMLlint (1.35.1)

[error] 209-209: no new line character at the end of file

(new-line-at-end-of-file)

api/python/ai/chronon/staging_query.py (1)

18-22: Use list default_factory for safety
additional_partitions will be shared if later changed in-place.

-    additional_partitions: Optional[List[str]] = None
+    additional_partitions: Optional[List[str]] = field(default_factory=list)
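
For reference, this is how `default_factory` behaves on a dataclass. The class below is a hypothetical stand-in, since the real definition in staging_query.py is not shown in this comment.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class StagingQueryOptions:  # hypothetical name, for illustration only
    # Each instance gets its own fresh list; nothing is shared between instances.
    additional_partitions: List[str] = field(default_factory=list)
```

Appending to one instance's `additional_partitions` leaves other instances empty. A bare `= []` default would be rejected by `dataclass` with a `ValueError` at class-definition time.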
api/python/ai/chronon/cli/compile/parse_teams.py (2)

42-48: Don’t shadow built-in print
Parameter name hides the global; rename to verbose.


28-38: Handle missing loader
spec or spec.loader can be None, causing AttributeError. Add a guard.
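
A guarded loader might look like the sketch below. `load_module_from_path` is a hypothetical name; the `importlib.util` calls are the standard ones, and `spec_from_file_location` returns `None` when no loader matches the file suffix.

```python
import importlib.util
from types import ModuleType


def load_module_from_path(module_name: str, file_path: str) -> ModuleType:
    spec = importlib.util.spec_from_file_location(module_name, file_path)
    # Both the spec and its loader may be None, e.g. for an unrecognized suffix.
    if spec is None or spec.loader is None:
        raise ImportError(f"Cannot load module {module_name!r} from {file_path!r}")
    module = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(module)
    return module
```

Raising `ImportError` with the path gives a clearer message than the `AttributeError` on `None` that would otherwise surface.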

api/python/ai/chronon/repo/compile.py (2)

89-91: Path split is OS-specific
Hard-coding "/" breaks on Windows. Use os.sep or pathlib.Path(input_path).parts.
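
The difference can be shown with `pathlib`'s pure path flavors, which parse separators without touching the filesystem. `split_path` is a hypothetical helper and the example paths are invented.

```python
from pathlib import Path, PurePath, PurePosixPath, PureWindowsPath
from typing import Tuple, Type


def split_path(input_path: str, flavor: Type[PurePath] = Path) -> Tuple[str, ...]:
    # .parts splits on the separator(s) of the chosen flavor; Path follows the current OS.
    return flavor(input_path).parts


if __name__ == "__main__":
    print(split_path("joins/sample_team/sample_join.py", PurePosixPath))
    print(split_path(r"joins\sample_team\sample_join.py", PureWindowsPath))
```

Both calls yield the same three components, which a hard-coded `split("/")` would not do for the Windows-style path.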


238-239: Avoid mutating loop var name
Reassigning name hampers traceability; use a new variable.

api/python/ai/chronon/eval/table_scan.py (1)

64-68: Side-effect free helper wanted.
coalesce(self.query.reversalColumn, "is_before") mutates nothing; yet later base_selects["is_before"] is added even if original key exists – overwriting silently. Consider setdefault.

api/python/ai/chronon/cli/compile/compile_context.py (1)

142-156: File handle leak.
open(full_path) without context will keep FDs open; use Path.read_text().
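
A minimal sketch of the suggested replacement; `read_file` is a hypothetical helper and the UTF-8 encoding is an assumption about the config files.

```python
from pathlib import Path


def read_file(full_path: str) -> str:
    # read_text opens, reads, and closes the file in one call, so no descriptor leaks.
    return Path(full_path).read_text(encoding="utf-8")
```

An equivalent alternative is `with open(full_path) as handle: return handle.read()`.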

api/python/ai/chronon/repo/utils.py (1)

50-53: Param style
ignoreError deviates from snake_case (ignore_error). Rename for consistency.

api/python/ai/chronon/cli/git_utils.py (3)

61-68: Potential large-file load
git show reads full file into memory; for big binaries this may explode. Consider size cap or streaming.


36-41: Use logger, not print
Direct print mixes with log output; prefer logger.error(...).


130-149: Minor inefficiency
real_changes building could use list-comp with predicate instead of loop.

api/python/ai/chronon/repo/validator.py (1)

389-393: Variable shadowing hurts readability
The list-comprehension reuses the outer name errors, shadowing the surrounding list.

-            for errors in group_by_errors
-            for error in errors
+            for gb_errs in group_by_errors
+            for error in gb_errs
api/python/ai/chronon/utils.py (1)

223-228: dict_to_bash_commands does not quote values
If a value contains spaces or shell-special characters the generated CLI string will break.

-        cmd = (
-            f"--{key.replace('_', '-')}={value}"
-            if value
-            else f"--{key.replace('_', '-')}"
-        )
+        safe_val = shlex.quote(str(value))  # shell-safe quoting; requires `import shlex`
+        cmd = f"--{key.replace('_', '-')}" + (f"={safe_val}" if value else "")
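
The quoting concern can be sketched end to end with `shlex.quote` from the standard library. `dict_to_flags` is a hypothetical stand-in for `dict_to_bash_commands`, whose full body is not shown here; it mirrors the truthiness check from the snippet above.

```python
import shlex
from typing import Any, Dict


def dict_to_flags(options: Dict[str, Any]) -> str:
    parts = []
    for key, value in options.items():
        flag = f"--{key.replace('_', '-')}"
        # Falsy values yield a bare flag; everything else is quoted for a POSIX shell.
        parts.append(f"{flag}={shlex.quote(str(value))}" if value else flag)
    return " ".join(parts)
```

Round-tripping the result through `shlex.split` returns each `--key=value` pair as a single token, even when the value contains spaces or `;`.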
api/python/ai/chronon/repo/gcp.py (1)

248-276: final_args template may swallow {} inside user_args

str.format interprets braces as replacement fields, not literally; if user_args already contains {}, the format call can fail. Consider string.Template or double-brace escaping.
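
Both options are sketched below, assuming the user arguments are spliced into the template string before it is formatted. The template text and function names are hypothetical and do not reflect the real `final_args` layout.

```python
from string import Template


def escape_braces(text: str) -> str:
    # Doubling braces makes str.format emit them literally.
    return text.replace("{", "{{").replace("}", "}}")


def build_args_with_format(user_args: str, conf: str) -> str:
    template = "--conf {conf} " + escape_braces(user_args)
    return template.format(conf=conf)


def build_args_with_template(user_args: str, conf: str) -> str:
    # string.Template only reacts to "$", so braces in user_args pass through untouched.
    return Template("--conf $conf " + user_args).substitute(conf=conf)
```

The `Template` variant trades one special character for another: a literal `$` in the user arguments would need to be written as `$$`.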

api/python/ai/chronon/group_by.py (2)

226-238: Avoid relying on private helpers.

window_utils._from_str is underscored ⇒ internal API; risk of breakage. Expose a public from_str() wrapper or inline the parsing logic.


183-185: Use public enum names.

common.TimeUnit._VALUES_TO_NAMES is private. Prefer common.TimeUnit._NAMES_TO_VALUES reverse lookup, or store window.timeUnit.name.lower().

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro (Legacy)

📥 Commits

Reviewing files that changed from the base of the PR and between 3114e00 and 4e7b865.

⛔ Files ignored due to path filters (5)
  • api/python/test/sample/data/checkouts.csv is excluded by !**/*.csv
  • api/python/test/sample/data/purchases.csv is excluded by !**/*.csv
  • api/python/test/sample/data/purchases_new.csv is excluded by !**/*.csv
  • api/python/test/sample/data/returns.csv is excluded by !**/*.csv
  • api/python/test/sample/data/users.csv is excluded by !**/*.csv
📒 Files selected for processing (207)
  • .bazelignore (1 hunks)
  • .bazeliskrc (1 hunks)
  • .bazelproject (1 hunks)
  • .bazelrc (1 hunks)
  • .github/ISSUE_TEMPLATE/bug_report.md (1 hunks)
  • .github/ISSUE_TEMPLATE/feature_request.md (1 hunks)
  • .github/image/Dockerfile (1 hunks)
  • .github/pull_request_template.md (1 hunks)
  • .github/release.yml (1 hunks)
  • .github/workflows/build_and_push_docker.yaml (1 hunks)
  • .github/workflows/push_to_canary.yaml (1 hunks)
  • .github/workflows/push_to_platform.yaml (1 hunks)
  • .github/workflows/require_triggered_status_checks.yaml (1 hunks)
  • .github/workflows/test_bazel_config.yaml (1 hunks)
  • .github/workflows/test_python.yaml (1 hunks)
  • .github/workflows/test_scala_2_12_non_spark.yaml (1 hunks)
  • .github/workflows/test_scala_2_12_spark.yaml (1 hunks)
  • .github/workflows/test_scala_2_13_non_spark.yaml (1 hunks)
  • .github/workflows/test_scala_2_13_spark.yaml (1 hunks)
  • .github/workflows/test_scala_fmt.yaml (1 hunks)
  • .gitignore (2 hunks)
  • .plugin-versions (1 hunks)
  • .scalafix.conf (0 hunks)
  • .scalafmt.conf (1 hunks)
  • .tool-versions (1 hunks)
  • AUTHORS (0 hunks)
  • CONTRIBUTING.md (0 hunks)
  • GOVERNANCE.md (0 hunks)
  • LICENSE (0 hunks)
  • README.md (1 hunks)
  • WORKSPACE (1 hunks)
  • aggregator/BUILD.bazel (1 hunks)
  • aggregator/src/main/scala/ai/chronon/aggregator/base/MinHeap.scala (1 hunks)
  • aggregator/src/main/scala/ai/chronon/aggregator/base/SimpleAggregators.scala (2 hunks)
  • aggregator/src/main/scala/ai/chronon/aggregator/base/TimedAggregators.scala (2 hunks)
  • aggregator/src/main/scala/ai/chronon/aggregator/row/ColumnAggregator.scala (3 hunks)
  • aggregator/src/main/scala/ai/chronon/aggregator/row/MapColumnAggregator.scala (1 hunks)
  • aggregator/src/main/scala/ai/chronon/aggregator/row/RowAggregator.scala (1 hunks)
  • aggregator/src/main/scala/ai/chronon/aggregator/row/StatsGenerator.scala (5 hunks)
  • aggregator/src/main/scala/ai/chronon/aggregator/stats/EditDistance.scala (1 hunks)
  • aggregator/src/main/scala/ai/chronon/aggregator/windowing/HopsAggregator.scala (1 hunks)
  • aggregator/src/main/scala/ai/chronon/aggregator/windowing/Resolution.scala (2 hunks)
  • aggregator/src/main/scala/ai/chronon/aggregator/windowing/SawtoothAggregator.scala (1 hunks)
  • aggregator/src/main/scala/ai/chronon/aggregator/windowing/SawtoothMutationAggregator.scala (4 hunks)
  • aggregator/src/main/scala/ai/chronon/aggregator/windowing/TwoStackLiteAggregator.scala (2 hunks)
  • aggregator/src/test/scala/ai/chronon/aggregator/test/ApproxDistinctTest.scala (2 hunks)
  • aggregator/src/test/scala/ai/chronon/aggregator/test/ApproxHistogramTest.scala (0 hunks)
  • aggregator/src/test/scala/ai/chronon/aggregator/test/ApproxPercentilesTest.scala (3 hunks)
  • aggregator/src/test/scala/ai/chronon/aggregator/test/DataGen.scala (4 hunks)
  • aggregator/src/test/scala/ai/chronon/aggregator/test/EditDistanceTest.scala (1 hunks)
  • aggregator/src/test/scala/ai/chronon/aggregator/test/FrequentItemsTest.scala (7 hunks)
  • aggregator/src/test/scala/ai/chronon/aggregator/test/MinHeapTest.scala (1 hunks)
  • aggregator/src/test/scala/ai/chronon/aggregator/test/MomentTest.scala (2 hunks)
  • aggregator/src/test/scala/ai/chronon/aggregator/test/NaiveAggregator.scala (1 hunks)
  • aggregator/src/test/scala/ai/chronon/aggregator/test/RowAggregatorTest.scala (4 hunks)
  • aggregator/src/test/scala/ai/chronon/aggregator/test/SawtoothAggregatorTest.scala (3 hunks)
  • aggregator/src/test/scala/ai/chronon/aggregator/test/SawtoothOnlineAggregatorTest.scala (2 hunks)
  • aggregator/src/test/scala/ai/chronon/aggregator/test/TwoStackLiteAggregatorTest.scala (5 hunks)
  • aggregator/src/test/scala/ai/chronon/aggregator/test/VarianceTest.scala (2 hunks)
  • airflow/constants.py (0 hunks)
  • airflow/decorators.py (0 hunks)
  • airflow/group_by_dag_constructor.py (0 hunks)
  • airflow/helpers.py (0 hunks)
  • airflow/join_dag_constructor.py (0 hunks)
  • airflow/online_offline_consistency_dag_constructor.py (0 hunks)
  • airflow/operators.py (0 hunks)
  • airflow/readme.md (0 hunks)
  • airflow/staging_query_dag_constructor.py (0 hunks)
  • api/BUILD.bazel (1 hunks)
  • api/py/ai/__init__.py (0 hunks)
  • api/py/ai/chronon/__init__.py (0 hunks)
  • api/py/ai/chronon/repo/run.py (0 hunks)
  • api/py/ai/chronon/scheduler/adapters/airflow_adapter.py (0 hunks)
  • api/py/ai/chronon/scheduler/interfaces/flow.py (0 hunks)
  • api/py/ai/chronon/scheduler/interfaces/node.py (0 hunks)
  • api/py/ai/chronon/scheduler/interfaces/orchestrator.py (0 hunks)
  • api/py/example.py (0 hunks)
  • api/py/requirements/base.in (0 hunks)
  • api/py/requirements/base.txt (0 hunks)
  • api/py/test/sample/group_bys/risk/merchant_data.py (0 hunks)
  • api/py/test/sample/group_bys/risk/user_data.py (0 hunks)
  • api/py/test/sample/joins/risk/user_transactions.py (0 hunks)
  • api/py/test/sample/joins/sample_team/sample_chaining_join.py (0 hunks)
  • api/py/test/sample/production/group_bys/risk/transaction_events.txn_group_by_merchant (0 hunks)
  • api/py/test/sample/production/group_bys/risk/transaction_events.txn_group_by_user (0 hunks)
  • api/py/test/sample/production/joins/sample_team/sample_join_from_shorthand.v1 (0 hunks)
  • api/py/test/sample/production/models/quickstart/test.v1 (0 hunks)
  • api/py/test/sample/production/models/risk/transaction_model.v1 (0 hunks)
  • api/py/test/sample/sources/test_sources.py (0 hunks)
  • api/py/test/sample/teams.json (0 hunks)
  • api/py/test/scheduler/test_flow.py (0 hunks)
  • api/py/test/test_join.py (0 hunks)
  • api/py/test/test_run.py (0 hunks)
  • api/py/tox.ini (0 hunks)
  • api/python/README.md (1 hunks)
  • api/python/ai/chronon/airflow_helpers.py (1 hunks)
  • api/python/ai/chronon/cli/compile/compile_context.py (1 hunks)
  • api/python/ai/chronon/cli/compile/compiler.py (1 hunks)
  • api/python/ai/chronon/cli/compile/conf_validator.py (1 hunks)
  • api/python/ai/chronon/cli/compile/display/class_tracker.py (1 hunks)
  • api/python/ai/chronon/cli/compile/display/compile_status.py (1 hunks)
  • api/python/ai/chronon/cli/compile/display/compiled_obj.py (1 hunks)
  • api/python/ai/chronon/cli/compile/display/console.py (1 hunks)
  • api/python/ai/chronon/cli/compile/display/diff_result.py (1 hunks)
  • api/python/ai/chronon/cli/compile/fill_templates.py (1 hunks)
  • api/python/ai/chronon/cli/compile/parse_configs.py (1 hunks)
  • api/python/ai/chronon/cli/compile/parse_teams.py (1 hunks)
  • api/python/ai/chronon/cli/compile/serializer.py (1 hunks)
  • api/python/ai/chronon/cli/git_utils.py (1 hunks)
  • api/python/ai/chronon/cli/logger.py (1 hunks)
  • api/python/ai/chronon/cli/plan/controller_iface.py (1 hunks)
  • api/python/ai/chronon/eval/__init__.py (1 hunks)
  • api/python/ai/chronon/eval/query_parsing.py (1 hunks)
  • api/python/ai/chronon/eval/sample_tables.py (1 hunks)
  • api/python/ai/chronon/eval/table_scan.py (1 hunks)
  • api/python/ai/chronon/group_by.py (14 hunks)
  • api/python/ai/chronon/join.py (9 hunks)
  • api/python/ai/chronon/model.py (1 hunks)
  • api/python/ai/chronon/query.py (3 hunks)
  • api/python/ai/chronon/repo/__init__.py (1 hunks)
  • api/python/ai/chronon/repo/aws.py (1 hunks)
  • api/python/ai/chronon/repo/compile.py (8 hunks)
  • api/python/ai/chronon/repo/compilev2.py (1 hunks)
  • api/python/ai/chronon/repo/compilev3.py (1 hunks)
  • api/python/ai/chronon/repo/constants.py (1 hunks)
  • api/python/ai/chronon/repo/default_runner.py (1 hunks)
  • api/python/ai/chronon/repo/explore.py (6 hunks)
  • api/python/ai/chronon/repo/extract_objects.py (2 hunks)
  • api/python/ai/chronon/repo/gcp.py (1 hunks)
  • api/python/ai/chronon/repo/hub_uploader.py (1 hunks)
  • api/python/ai/chronon/repo/init.py (1 hunks)
  • api/python/ai/chronon/repo/run.py (1 hunks)
  • api/python/ai/chronon/repo/runner.py (1 hunks)
  • api/python/ai/chronon/repo/serializer.py (4 hunks)
  • api/python/ai/chronon/repo/team_json_utils.py (2 hunks)
  • api/python/ai/chronon/repo/utils.py (1 hunks)
  • api/python/ai/chronon/repo/validator.py (15 hunks)
  • api/python/ai/chronon/repo/zipline.py (1 hunks)
  • api/python/ai/chronon/resources/gcp/README.md (1 hunks)
  • api/python/ai/chronon/resources/gcp/group_bys/test/data.py (1 hunks)
  • api/python/ai/chronon/resources/gcp/joins/test/data.py (1 hunks)
  • api/python/ai/chronon/resources/gcp/sources/test/data.py (1 hunks)
  • api/python/ai/chronon/resources/gcp/teams.py (1 hunks)
  • api/python/ai/chronon/resources/gcp/zipline-cli-install.sh (1 hunks)
  • api/python/ai/chronon/source.py (1 hunks)
  • api/python/ai/chronon/staging_query.py (1 hunks)
  • api/python/ai/chronon/types.py (1 hunks)
  • api/python/ai/chronon/utils.py (10 hunks)
  • api/python/ai/chronon/windows.py (1 hunks)
  • api/python/pyproject.toml (1 hunks)
  • api/python/requirements/base.in (1 hunks)
  • api/python/requirements/base.txt (1 hunks)
  • api/python/requirements/dev.in (1 hunks)
  • api/python/requirements/dev.txt (3 hunks)
  • api/python/setup.py (3 hunks)
  • api/python/test/canary/README.md (1 hunks)
  • api/python/test/canary/deprecated_teams.json (1 hunks)
  • api/python/test/canary/group_bys/aws/purchases.py (1 hunks)
  • api/python/test/canary/group_bys/gcp/item_event_canary.py (1 hunks)
  • api/python/test/canary/group_bys/gcp/purchases.py (1 hunks)
  • api/python/test/canary/joins/gcp/training_set.py (1 hunks)
  • api/python/test/canary/teams.py (1 hunks)
  • api/python/test/conftest.py (1 hunks)
  • api/python/test/sample/README.md (1 hunks)
  • api/python/test/sample/aws/teams.json (1 hunks)
  • api/python/test/sample/deprecated_teams.json (1 hunks)
  • api/python/test/sample/group_bys/kaggle/clicks.py (2 hunks)
  • api/python/test/sample/group_bys/kaggle/outbrain.py (3 hunks)
  • api/python/test/sample/group_bys/quickstart/purchases.py (2 hunks)
  • api/python/test/sample/group_bys/quickstart/returns.py (1 hunks)
  • api/python/test/sample/group_bys/quickstart/schema.py (2 hunks)
  • api/python/test/sample/group_bys/quickstart/users.py (2 hunks)
  • api/python/test/sample/group_bys/risk/merchant_data.py (1 hunks)
  • api/python/test/sample/group_bys/risk/transaction_events.py (2 hunks)
  • api/python/test/sample/group_bys/risk/user_data.py (1 hunks)
  • api/python/test/sample/group_bys/sample_team/chaining_group_by.py (1 hunks)
  • api/python/test/sample/group_bys/sample_team/entity_sample_group_by_from_module.py (1 hunks)
  • api/python/test/sample/group_bys/sample_team/event_sample_group_by.py (1 hunks)
  • api/python/test/sample/group_bys/sample_team/group_by_with_kwargs.py (2 hunks)
  • api/python/test/sample/group_bys/sample_team/label_part_group_by.py (1 hunks)
  • api/python/test/sample/group_bys/sample_team/mutation_sample_group_by.py (1 hunks)
  • api/python/test/sample/group_bys/sample_team/sample_chaining_group_by.py (2 hunks)
  • api/python/test/sample/group_bys/sample_team/sample_group_by.py (3 hunks)
  • api/python/test/sample/group_bys/sample_team/sample_group_by_from_join_part.py (1 hunks)
  • api/python/test/sample/group_bys/sample_team/sample_group_by_from_module.py (2 hunks)
  • api/python/test/sample/group_bys/sample_team/sample_group_by_group_by.py (1 hunks)
  • api/python/test/sample/group_bys/sample_team/sample_group_by_missing_input_column.py (1 hunks)
  • api/python/test/sample/group_bys/sample_team/sample_group_by_with_derivations.py (1 hunks)
  • api/python/test/sample/group_bys/sample_team/sample_group_by_with_incorrect_derivations.py (1 hunks)
  • api/python/test/sample/group_bys/sample_team/sample_non_prod_group_by.py (1 hunks)
  • api/python/test/sample/joins/kaggle/outbrain.py (1 hunks)
  • api/python/test/sample/joins/quickstart/training_set.py (1 hunks)
  • api/python/test/sample/joins/risk/user_transactions.py (1 hunks)
  • api/python/test/sample/joins/sample_team/sample_backfill_mutation_join.py (1 hunks)
  • api/python/test/sample/joins/sample_team/sample_chaining_join.py (2 hunks)
  • api/python/test/sample/joins/sample_team/sample_chaining_join_parent.py (1 hunks)
  • api/python/test/sample/joins/sample_team/sample_join.py (2 hunks)
  • api/python/test/sample/joins/sample_team/sample_join_bootstrap.py (3 hunks)
  • api/python/test/sample/joins/sample_team/sample_join_derivation.py (1 hunks)
  • api/python/test/sample/joins/sample_team/sample_join_external_parts.py (1 hunks)
  • api/python/test/sample/joins/sample_team/sample_join_from_group_by_from_join.py (1 hunks)
  • api/python/test/sample/joins/sample_team/sample_join_from_module.py (1 hunks)
  • api/python/test/sample/joins/sample_team/sample_join_from_module_skipped.py (1 hunks)
  • api/python/test/sample/joins/sample_team/sample_join_with_derivations_on_external_parts.py (1 hunks)
  • api/python/test/sample/joins/sample_team/sample_label_join.py (1 hunks)
  • api/python/test/sample/joins/sample_team/sample_label_join_with_agg.py (1 hunks)
  • api/python/test/sample/joins/sample_team/sample_online_join.py (1 hunks)
💤 Files with no reviewable changes (40)
  • api/py/ai/__init__.py
  • airflow/readme.md
  • api/py/tox.ini
  • api/py/requirements/base.in
  • AUTHORS
  • api/py/test/sample/joins/sample_team/sample_chaining_join.py
  • .scalafix.conf
  • api/py/requirements/base.txt
  • api/py/ai/chronon/scheduler/interfaces/node.py
  • api/py/test/sample/production/models/quickstart/test.v1
  • CONTRIBUTING.md
  • api/py/test/sample/group_bys/risk/merchant_data.py
  • airflow/decorators.py
  • LICENSE
  • api/py/test/sample/production/group_bys/risk/transaction_events.txn_group_by_merchant
  • api/py/test/sample/joins/risk/user_transactions.py
  • aggregator/src/test/scala/ai/chronon/aggregator/test/ApproxHistogramTest.scala
  • api/py/ai/chronon/scheduler/interfaces/flow.py
  • api/py/test/test_join.py
  • airflow/staging_query_dag_constructor.py
  • api/py/ai/chronon/scheduler/interfaces/orchestrator.py
  • airflow/constants.py
  • api/py/test/sample/group_bys/risk/user_data.py
  • api/py/ai/chronon/__init__.py
  • api/py/test/sample/production/group_bys/risk/transaction_events.txn_group_by_user
  • api/py/test/sample/production/joins/sample_team/sample_join_from_shorthand.v1
  • airflow/online_offline_consistency_dag_constructor.py
  • api/py/test/scheduler/test_flow.py
  • airflow/join_dag_constructor.py
  • airflow/group_by_dag_constructor.py
  • airflow/helpers.py
  • api/py/test/sample/teams.json
  • api/py/ai/chronon/scheduler/adapters/airflow_adapter.py
  • api/py/test/sample/production/models/risk/transaction_model.v1
  • api/py/example.py
  • api/py/test/test_run.py
  • api/py/test/sample/sources/test_sources.py
  • api/py/ai/chronon/repo/run.py
  • airflow/operators.py
  • GOVERNANCE.md
🧰 Additional context used
🧬 Code Graph Analysis (47)
aggregator/src/test/scala/ai/chronon/aggregator/test/EditDistanceTest.scala (1)
aggregator/src/main/scala/ai/chronon/aggregator/stats/EditDistance.scala (1)
  • EditDistance (19-112)
api/python/test/sample/group_bys/sample_team/chaining_group_by.py (3)
api/python/ai/chronon/group_by.py (1)
  • Operation (60-146)
api/python/ai/chronon/source.py (1)
  • JoinSource (74-88)
api/python/ai/chronon/query.py (1)
  • selects (103-126)
api/python/test/sample/group_bys/kaggle/outbrain.py (2)
api/python/test/sample/sources/kaggle/outbrain.py (1)
  • outbrain_left_events (28-40)
api/python/ai/chronon/group_by.py (2)
  • Accuracy (56-57)
  • Operation (60-146)
aggregator/src/main/scala/ai/chronon/aggregator/row/MapColumnAggregator.scala (1)
api/src/main/scala/ai/chronon/api/ScalaJavaConversions.scala (1)
  • ScalaJavaConversions (6-97)
api/python/ai/chronon/eval/query_parsing.py (3)
online/src/main/scala/ai/chronon/online/connectors/Catalog.scala (1)
  • Table (8-13)
cloud_gcp/src/main/scala/ai/chronon/integrations/cloud_gcp/DelegatingBigQueryMetastoreCatalog.scala (1)
  • name (172-172)
api/python/ai/chronon/eval/table_scan.py (1)
  • table_name (36-37)
api/python/ai/chronon/resources/gcp/sources/test/data.py (2)
api/python/ai/chronon/source.py (1)
  • EventSource (8-35)
api/python/ai/chronon/query.py (1)
  • selects (103-126)
aggregator/src/main/scala/ai/chronon/aggregator/windowing/HopsAggregator.scala (2)
api/src/main/scala/ai/chronon/api/Row.scala (1)
  • Row (72-126)
api/src/main/scala/ai/chronon/api/TsUtils.scala (1)
  • TsUtils (23-42)
api/python/test/sample/joins/kaggle/outbrain.py (1)
api/python/test/sample/sources/kaggle/outbrain.py (1)
  • outbrain_left_events (28-40)
aggregator/src/test/scala/ai/chronon/aggregator/test/NaiveAggregator.scala (2)
api/src/main/scala/ai/chronon/api/Row.scala (1)
  • Row (72-126)
api/src/main/scala/ai/chronon/api/TsUtils.scala (1)
  • TsUtils (23-42)
api/python/test/sample/joins/quickstart/training_set.py (3)
api/python/ai/chronon/source.py (1)
  • EventSource (8-35)
api/src/main/scala/ai/chronon/api/Builders.scala (1)
  • Source (106-140)
api/python/ai/chronon/query.py (1)
  • selects (103-126)
api/python/test/sample/group_bys/sample_team/mutation_sample_group_by.py (1)
api/python/ai/chronon/group_by.py (1)
  • Accuracy (56-57)
aggregator/src/main/scala/ai/chronon/aggregator/windowing/SawtoothAggregator.scala (3)
api/src/main/scala/ai/chronon/api/Builders.scala (1)
  • AggregationPart (68-85)
api/src/main/scala/ai/chronon/api/Row.scala (1)
  • Row (72-126)
api/src/main/scala/ai/chronon/api/TsUtils.scala (1)
  • TsUtils (23-42)
api/python/test/sample/group_bys/sample_team/sample_group_by_from_module.py (1)
api/python/ai/chronon/group_by.py (1)
  • Operation (60-146)
api/python/test/sample/joins/sample_team/sample_join_bootstrap.py (2)
api/python/ai/chronon/query.py (1)
  • selects (103-126)
api/python/ai/chronon/utils.py (1)
  • get_join_output_table_name (317-332)
api/python/test/sample/group_bys/sample_team/sample_group_by.py (1)
api/python/ai/chronon/group_by.py (1)
  • Operation (60-146)
api/python/test/sample/group_bys/risk/user_data.py (4)
api/python/ai/chronon/source.py (1)
  • EntitySource (38-71)
api/src/main/scala/ai/chronon/api/Builders.scala (1)
  • Source (106-140)
api/src/main/scala/ai/chronon/api/Extensions.scala (1)
  • query (382-390)
api/python/ai/chronon/query.py (1)
  • selects (103-126)
api/python/ai/chronon/resources/gcp/teams.py (1)
api/python/ai/chronon/repo/constants.py (1)
  • RunMode (4-30)
api/python/test/sample/group_bys/sample_team/sample_chaining_group_by.py (3)
api/python/ai/chronon/group_by.py (1)
  • Operation (60-146)
api/python/ai/chronon/source.py (1)
  • JoinSource (74-88)
api/python/ai/chronon/query.py (1)
  • selects (103-126)
api/python/ai/chronon/resources/gcp/group_bys/test/data.py (1)
api/python/ai/chronon/group_by.py (3)
  • Operation (60-146)
  • TimeUnit (178-180)
  • Window (244-245)
api/python/test/sample/group_bys/quickstart/purchases.py (3)
api/python/ai/chronon/source.py (1)
  • EventSource (8-35)
api/python/ai/chronon/group_by.py (1)
  • Operation (60-146)
api/python/ai/chronon/query.py (1)
  • selects (103-126)
api/python/test/sample/group_bys/quickstart/schema.py (3)
api/python/ai/chronon/source.py (1)
  • EventSource (8-35)
api/python/ai/chronon/group_by.py (1)
  • Operation (60-146)
api/python/ai/chronon/query.py (1)
  • selects (103-126)
api/python/test/sample/group_bys/sample_team/sample_non_prod_group_by.py (1)
api/python/ai/chronon/group_by.py (1)
  • Operation (60-146)
api/python/ai/chronon/eval/sample_tables.py (2)
api/python/ai/chronon/eval/__init__.py (1)
  • eval (17-42)
api/python/ai/chronon/eval/table_scan.py (2)
  • output_path (30-31)
  • raw_scan_query (52-58)
api/python/test/sample/group_bys/quickstart/returns.py (3)
api/python/ai/chronon/source.py (1)
  • EventSource (8-35)
api/python/ai/chronon/group_by.py (1)
  • Operation (60-146)
api/python/ai/chronon/query.py (1)
  • selects (103-126)
api/python/test/sample/joins/risk/user_transactions.py (3)
api/python/ai/chronon/source.py (1)
  • EventSource (8-35)
api/src/main/scala/ai/chronon/api/Builders.scala (1)
  • Source (106-140)
api/python/ai/chronon/query.py (1)
  • selects (103-126)
api/python/ai/chronon/resources/gcp/zipline-cli-install.sh (1)
scripts/distribution/build_and_upload_artifacts.sh (1)
  • print_usage (3-12)
api/python/test/sample/joins/sample_team/sample_label_join.py (1)
api/python/ai/chronon/join.py (1)
  • LabelParts (252-287)
api/python/test/sample/group_bys/risk/transaction_events.py (3)
api/python/ai/chronon/source.py (1)
  • EventSource (8-35)
api/python/ai/chronon/group_by.py (1)
  • Operation (60-146)
api/python/ai/chronon/query.py (1)
  • selects (103-126)
api/python/ai/chronon/repo/explore.py (1)
api/python/ai/chronon/cli/compile/parse_teams.py (1)
  • load_teams (42-69)
api/python/ai/chronon/windows.py (1)
api/python/ai/chronon/group_by.py (2)
  • Window (244-245)
  • TimeUnit (178-180)
api/python/test/sample/group_bys/kaggle/clicks.py (4)
api/python/ai/chronon/source.py (1)
  • EventSource (8-35)
api/python/ai/chronon/group_by.py (1)
  • Operation (60-146)
api/python/ai/chronon/query.py (1)
  • selects (103-126)
api/python/ai/chronon/utils.py (1)
  • get_staging_query_output_table_name (304-309)
api/python/test/sample/joins/sample_team/sample_label_join_with_agg.py (1)
api/python/ai/chronon/join.py (1)
  • LabelParts (252-287)
api/python/test/sample/group_bys/risk/merchant_data.py (3)
api/python/ai/chronon/source.py (1)
  • EntitySource (38-71)
api/src/main/scala/ai/chronon/api/Builders.scala (1)
  • Source (106-140)
api/python/ai/chronon/query.py (1)
  • selects (103-126)
api/python/test/sample/joins/sample_team/sample_join.py (2)
api/python/ai/chronon/repo/constants.py (1)
  • RunMode (4-30)
api/python/ai/chronon/join.py (1)
  • LabelParts (252-287)
api/python/ai/chronon/eval/__init__.py (3)
api/python/ai/chronon/eval/query_parsing.py (1)
  • get_tables_from_query (4-19)
api/python/ai/chronon/eval/sample_tables.py (2)
  • sample_tables (20-24)
  • sample_with_query (7-17)
api/python/ai/chronon/eval/table_scan.py (10)
  • TableScan (23-86)
  • clean_table_name (12-13)
  • table_scans_in_group_by (151-155)
  • table_scans_in_join (158-186)
  • table_scans_in_source (113-139)
  • table_name (36-37)
  • output_path (30-31)
  • raw_scan_query (52-58)
  • view_name (33-34)
  • scan_query (60-86)
api/python/test/sample/group_bys/sample_team/group_by_with_kwargs.py (1)
api/python/ai/chronon/group_by.py (1)
  • Operation (60-146)
api/python/ai/chronon/cli/compile/display/compile_status.py (2)
api/python/ai/chronon/cli/compile/display/class_tracker.py (7)
  • ClassTracker (10-107)
  • add (26-42)
  • add_existing (23-24)
  • close (68-71)
  • to_status (73-93)
  • to_errors (95-103)
  • diff (106-107)
api/python/ai/chronon/cli/compile/display/compiled_obj.py (1)
  • CompiledObj (6-12)
api/python/test/sample/group_bys/sample_team/event_sample_group_by.py (1)
api/python/ai/chronon/group_by.py (1)
  • Operation (60-146)
aggregator/src/main/scala/ai/chronon/aggregator/row/ColumnAggregator.scala (3)
api/python/ai/chronon/group_by.py (1)
  • Operation (60-146)
api/src/main/scala/ai/chronon/api/Extensions.scala (1)
  • getInt (215-223)
api/src/main/scala/ai/chronon/api/DataType.scala (6)
  • IntType (138-138)
  • LongType (140-140)
  • ShortType (146-146)
  • DoubleType (142-142)
  • FloatType (144-144)
  • StringType (152-152)
api/python/test/canary/teams.py (1)
api/python/ai/chronon/repo/constants.py (1)
  • RunMode (4-30)
api/python/ai/chronon/types.py (7)
api/src/main/scala/ai/chronon/api/Extensions.scala (1)
  • query (382-390)
api/python/ai/chronon/query.py (1)
  • selects (103-126)
api/src/main/scala/ai/chronon/api/Builders.scala (2)
  • Source (106-140)
  • MetaData (261-315)
api/python/ai/chronon/source.py (3)
  • EventSource (8-35)
  • EntitySource (38-71)
  • JoinSource (74-88)
api/python/ai/chronon/group_by.py (5)
  • Operation (60-146)
  • Window (244-245)
  • TimeUnit (178-180)
  • DefaultAggregation (157-175)
  • Accuracy (56-57)
api/python/ai/chronon/join.py (1)
  • LabelParts (252-287)
api/python/ai/chronon/staging_query.py (1)
  • TableDependency (18-21)
api/python/ai/chronon/repo/aws.py (3)
api/python/ai/chronon/repo/default_runner.py (3)
  • Runner (19-284)
  • run (173-247)
  • _gen_final_args (249-284)
api/python/ai/chronon/repo/utils.py (5)
  • JobType (20-22)
  • check_call (66-68)
  • extract_filename_from_path (62-63)
  • get_customer_id (58-59)
  • split_date_range (442-467)
api/python/ai/chronon/repo/gcp.py (1)
  • run (363-563)
api/python/ai/chronon/source.py (2)
api/src/main/scala/ai/chronon/api/Extensions.scala (3)
  • query (382-390)
  • topic (453-463)
  • isCumulative (447-451)
api/src/main/scala/ai/chronon/api/Builders.scala (2)
  • Source (106-140)
  • joinSource (132-139)
api/python/ai/chronon/airflow_helpers.py (3)
api/python/ai/chronon/group_by.py (1)
  • GroupBy (429-674)
api/python/ai/chronon/join.py (1)
  • Join (355-554)
api/python/ai/chronon/utils.py (1)
  • get_query (135-136)
aggregator/src/test/scala/ai/chronon/aggregator/test/DataGen.scala (2)
api/src/main/scala/ai/chronon/api/Constants.scala (1)
  • Constants (23-100)
api/src/main/scala/ai/chronon/api/DataType.scala (2)
  • FloatType (144-144)
  • LongType (140-140)
api/python/ai/chronon/repo/compilev2.py (5)
api/python/ai/chronon/repo/serializer.py (1)
  • thrift_simple_json_protected (127-141)
api/python/ai/chronon/repo/validator.py (1)
  • ChrononRepoValidator (195-485)
api/python/ai/chronon/repo/extract_objects.py (1)
  • from_folderV2 (44-65)
api/python/ai/chronon/repo/team_json_utils.py (1)
  • get_team_conf (38-46)
api/python/ai/chronon/utils.py (2)
  • log_table_name (300-301)
  • output_table_name (241-248)
api/python/ai/chronon/cli/git_utils.py (2)
api/python/ai/chronon/repo/utils.py (1)
  • check_output (71-73)
api/python/ai/chronon/repo/runner.py (1)
  • info (189-205)
🪛 markdownlint-cli2 (0.17.2)
api/python/README.md

134-134: Fenced code blocks should have a language specified

(MD040, fenced-code-language)

api/python/ai/chronon/resources/gcp/README.md

98-98: Trailing punctuation in heading
Punctuation: '.'

(MD026, no-trailing-punctuation)


111-111: Trailing punctuation in heading
Punctuation: '.'

(MD026, no-trailing-punctuation)


122-122: Trailing punctuation in heading
Punctuation: '.'

(MD026, no-trailing-punctuation)


140-140: Trailing punctuation in heading
Punctuation: '.'

(MD026, no-trailing-punctuation)

🪛 LanguageTool
README.md

[style] ~15-~15: ‘on a regular basis’ might be wordy. Consider a shorter alternative.
Context: ...on are picked and merged into this repo on a regular basis, and improvements made to this reposito...

(EN_WORDINESS_PREMIUM_ON_A_REGULAR_BASIS)

api/python/requirements/base.txt

[duplication] ~31-~31: Possible typo: you repeated a word.
Context: ...irements/base.in face==24.0.0 # via glom glom==24.11.0 # via -r requirements/base...

(ENGLISH_WORD_REPEAT_RULE)


[duplication] ~48-~48: Possible typo: you repeated a word.
Context: ...e.in google-cloud-core==2.4.3 # via google-cloud-storage google-cloud-storage==2.19.0 # via -r requirements/base....

(ENGLISH_WORD_REPEAT_RULE)


[duplication] ~54-~54: Possible typo: you repeated a word.
Context: ...ia # google-cloud-storage # google-resumable-media google-resumable-media==2.7.2 # via google-cloud-storage g...

(ENGLISH_WORD_REPEAT_RULE)


[duplication] ~64-~64: Possible typo: you repeated a word.
Context: ... # via # google-api-core # grpcio-status grpcio-status==1.71.0 # via google-api-core idna=...

(ENGLISH_WORD_REPEAT_RULE)

api/python/requirements/dev.txt

[duplication] ~63-~63: Possible typo: you repeated a word.
Context: ...1 # via tox pytest==8.3.5 # via pytest-cov pytest-cov==6.1.1 # via -r requirements/dev.in...

(ENGLISH_WORD_REPEAT_RULE)

🪛 YAMLlint (1.35.1)
.github/workflows/test_scala_fmt.yaml

[error] 50-50: no new line character at the end of file

(new-line-at-end-of-file)

.github/release.yml

[error] 9-9: trailing spaces

(trailing-spaces)


[error] 17-17: no new line character at the end of file

(new-line-at-end-of-file)

.github/workflows/build_and_push_docker.yaml

[error] 35-35: no new line character at the end of file

(new-line-at-end-of-file)

.github/workflows/test_python.yaml

[error] 71-71: no new line character at the end of file

(new-line-at-end-of-file)

.github/workflows/test_scala_2_12_spark.yaml

[error] 201-201: no new line character at the end of file

(new-line-at-end-of-file)

.github/workflows/push_to_platform.yaml

[error] 34-34: trailing spaces

(trailing-spaces)


[error] 37-37: trailing spaces

(trailing-spaces)


[error] 41-41: trailing spaces

(trailing-spaces)


[error] 49-49: trailing spaces

(trailing-spaces)


[error] 52-52: trailing spaces

(trailing-spaces)


[error] 57-57: no new line character at the end of file

(new-line-at-end-of-file)

.github/workflows/test_scala_2_13_spark.yaml

[error] 209-209: no new line character at the end of file

(new-line-at-end-of-file)

.github/workflows/push_to_canary.yaml

[error] 79-79: trailing spaces

(trailing-spaces)


[error] 81-81: trailing spaces

(trailing-spaces)


[error] 102-102: trailing spaces

(trailing-spaces)


[error] 105-105: trailing spaces

(trailing-spaces)


[error] 108-108: trailing spaces

(trailing-spaces)


[warning] 287-287: too many spaces after colon

(colons)


[warning] 336-336: too many spaces after colon

(colons)
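
Most of the YAMLlint errors above fall into two classes: trailing spaces and a missing newline at the end of the file. Both can be cleaned up mechanically. The sketch below is one possible way to do it, not part of this PR; the helper name `fix_whitespace` is hypothetical, and it assumes the workflow files have no block scalars where trailing whitespace is meaningful.

```python
from pathlib import Path


def fix_whitespace(text: str) -> str:
    """Strip trailing whitespace from each line and end the text with one newline.

    Hypothetical helper: it does not touch the "too many spaces after colon"
    warnings, which need a manual edit.
    """
    lines = [line.rstrip() for line in text.splitlines()]
    # Drop blank lines at the very end so exactly one newline terminates the file.
    while lines and lines[-1] == "":
        lines.pop()
    return "\n".join(lines) + "\n" if lines else ""


def fix_file(path: Path) -> bool:
    """Rewrite the file in place; return True if its content changed."""
    original = path.read_text(encoding="utf-8")
    cleaned = fix_whitespace(original)
    if cleaned != original:
        path.write_text(cleaned, encoding="utf-8")
        return True
    return False
```

Calling `fix_file` on each flagged path (for example `.github/release.yml`) and re-running YAMLlint would confirm whether the trailing-spaces and new-line-at-end-of-file errors are gone.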

🪛 actionlint (1.7.4)
.github/workflows/build_and_push_docker.yaml

24-24: the runner of "docker/login-action@v1" action is too old to run on GitHub Actions. update the action's version to fix this issue

(action)

.github/workflows/test_scala_2_12_spark.yaml

23-23: label "ubuntu-8_cores-32_gb" is unknown. available labels are "windows-latest", "windows-latest-8-cores", "windows-2022", "windows-2019", "ubuntu-latest", "ubuntu-latest-4-cores", "ubuntu-latest-8-cores", "ubuntu-latest-16-cores", "ubuntu-24.04", "ubuntu-22.04", "ubuntu-20.04", "macos-latest", "macos-latest-xl", "macos-latest-xlarge", "macos-latest-large", "macos-15-xlarge", "macos-15-large", "macos-15", "macos-14-xl", "macos-14-xlarge", "macos-14-large", "macos-14", "macos-13-xl", "macos-13-xlarge", "macos-13-large", "macos-13", "macos-12-xl", "macos-12-xlarge", "macos-12-large", "macos-12", "self-hosted", "x64", "arm", "arm64", "linux", "macos", "windows". if it is a custom label for self-hosted runner, set list of labels in actionlint.yaml config file

(runner-label)


48-48: label "ubuntu-8_cores-32_gb" is unknown. available labels are "windows-latest", "windows-latest-8-cores", "windows-2022", "windows-2019", "ubuntu-latest", "ubuntu-latest-4-cores", "ubuntu-latest-8-cores", "ubuntu-latest-16-cores", "ubuntu-24.04", "ubuntu-22.04", "ubuntu-20.04", "macos-latest", "macos-latest-xl", "macos-latest-xlarge", "macos-latest-large", "macos-15-xlarge", "macos-15-large", "macos-15", "macos-14-xl", "macos-14-xlarge", "macos-14-large", "macos-14", "macos-13-xl", "macos-13-xlarge", "macos-13-large", "macos-13", "macos-12-xl", "macos-12-xlarge", "macos-12-large", "macos-12", "self-hosted", "x64", "arm", "arm64", "linux", "macos", "windows". if it is a custom label for self-hosted runner, set list of labels in actionlint.yaml config file

(runner-label)


74-74: label "ubuntu_32_core_128gb" is unknown. available labels are "windows-latest", "windows-latest-8-cores", "windows-2022", "windows-2019", "ubuntu-latest", "ubuntu-latest-4-cores", "ubuntu-latest-8-cores", "ubuntu-latest-16-cores", "ubuntu-24.04", "ubuntu-22.04", "ubuntu-20.04", "macos-latest", "macos-latest-xl", "macos-latest-xlarge", "macos-latest-large", "macos-15-xlarge", "macos-15-large", "macos-15", "macos-14-xl", "macos-14-xlarge", "macos-14-large", "macos-14", "macos-13-xl", "macos-13-xlarge", "macos-13-large", "macos-13", "macos-12-xl", "macos-12-xlarge", "macos-12-large", "macos-12", "self-hosted", "x64", "arm", "arm64", "linux", "macos", "windows". if it is a custom label for self-hosted runner, set list of labels in actionlint.yaml config file

(runner-label)


100-100: label "ubuntu_32_core_128gb" is unknown. available labels are "windows-latest", "windows-latest-8-cores", "windows-2022", "windows-2019", "ubuntu-latest", "ubuntu-latest-4-cores", "ubuntu-latest-8-cores", "ubuntu-latest-16-cores", "ubuntu-24.04", "ubuntu-22.04", "ubuntu-20.04", "macos-latest", "macos-latest-xl", "macos-latest-xlarge", "macos-latest-large", "macos-15-xlarge", "macos-15-large", "macos-15", "macos-14-xl", "macos-14-xlarge", "macos-14-large", "macos-14", "macos-13-xl", "macos-13-xlarge", "macos-13-large", "macos-13", "macos-12-xl", "macos-12-xlarge", "macos-12-large", "macos-12", "self-hosted", "x64", "arm", "arm64", "linux", "macos", "windows". if it is a custom label for self-hosted runner, set list of labels in actionlint.yaml config file

(runner-label)


126-126: label "ubuntu_32_core_128gb" is unknown. available labels are "windows-latest", "windows-latest-8-cores", "windows-2022", "windows-2019", "ubuntu-latest", "ubuntu-latest-4-cores", "ubuntu-latest-8-cores", "ubuntu-latest-16-cores", "ubuntu-24.04", "ubuntu-22.04", "ubuntu-20.04", "macos-latest", "macos-latest-xl", "macos-latest-xlarge", "macos-latest-large", "macos-15-xlarge", "macos-15-large", "macos-15", "macos-14-xl", "macos-14-xlarge", "macos-14-large", "macos-14", "macos-13-xl", "macos-13-xlarge", "macos-13-large", "macos-13", "macos-12-xl", "macos-12-xlarge", "macos-12-large", "macos-12", "self-hosted", "x64", "arm", "arm64", "linux", "macos", "windows". if it is a custom label for self-hosted runner, set list of labels in actionlint.yaml config file

(runner-label)


152-152: label "ubuntu-8_cores-32_gb" is unknown. available labels are "windows-latest", "windows-latest-8-cores", "windows-2022", "windows-2019", "ubuntu-latest", "ubuntu-latest-4-cores", "ubuntu-latest-8-cores", "ubuntu-latest-16-cores", "ubuntu-24.04", "ubuntu-22.04", "ubuntu-20.04", "macos-latest", "macos-latest-xl", "macos-latest-xlarge", "macos-latest-large", "macos-15-xlarge", "macos-15-large", "macos-15", "macos-14-xl", "macos-14-xlarge", "macos-14-large", "macos-14", "macos-13-xl", "macos-13-xlarge", "macos-13-large", "macos-13", "macos-12-xl", "macos-12-xlarge", "macos-12-large", "macos-12", "self-hosted", "x64", "arm", "arm64", "linux", "macos", "windows". if it is a custom label for self-hosted runner, set list of labels in actionlint.yaml config file

(runner-label)


178-178: label "ubuntu-8_cores-32_gb" is unknown. available labels are "windows-latest", "windows-latest-8-cores", "windows-2022", "windows-2019", "ubuntu-latest", "ubuntu-latest-4-cores", "ubuntu-latest-8-cores", "ubuntu-latest-16-cores", "ubuntu-24.04", "ubuntu-22.04", "ubuntu-20.04", "macos-latest", "macos-latest-xl", "macos-latest-xlarge", "macos-latest-large", "macos-15-xlarge", "macos-15-large", "macos-15", "macos-14-xl", "macos-14-xlarge", "macos-14-large", "macos-14", "macos-13-xl", "macos-13-xlarge", "macos-13-large", "macos-13", "macos-12-xl", "macos-12-xlarge", "macos-12-large", "macos-12", "self-hosted", "x64", "arm", "arm64", "linux", "macos", "windows". if it is a custom label for self-hosted runner, set list of labels in actionlint.yaml config file

(runner-label)

.github/workflows/test_scala_2_13_spark.yaml

23-23: label "ubuntu-8_cores-32_gb" is unknown. available labels are "windows-latest", "windows-latest-8-cores", "windows-2022", "windows-2019", "ubuntu-latest", "ubuntu-latest-4-cores", "ubuntu-latest-8-cores", "ubuntu-latest-16-cores", "ubuntu-24.04", "ubuntu-22.04", "ubuntu-20.04", "macos-latest", "macos-latest-xl", "macos-latest-xlarge", "macos-latest-large", "macos-15-xlarge", "macos-15-large", "macos-15", "macos-14-xl", "macos-14-xlarge", "macos-14-large", "macos-14", "macos-13-xl", "macos-13-xlarge", "macos-13-large", "macos-13", "macos-12-xl", "macos-12-xlarge", "macos-12-large", "macos-12", "self-hosted", "x64", "arm", "arm64", "linux", "macos", "windows". if it is a custom label for self-hosted runner, set list of labels in actionlint.yaml config file

(runner-label)


50-50: label "ubuntu-8_cores-32_gb" is unknown. available labels are "windows-latest", "windows-latest-8-cores", "windows-2022", "windows-2019", "ubuntu-latest", "ubuntu-latest-4-cores", "ubuntu-latest-8-cores", "ubuntu-latest-16-cores", "ubuntu-24.04", "ubuntu-22.04", "ubuntu-20.04", "macos-latest", "macos-latest-xl", "macos-latest-xlarge", "macos-latest-large", "macos-15-xlarge", "macos-15-large", "macos-15", "macos-14-xl", "macos-14-xlarge", "macos-14-large", "macos-14", "macos-13-xl", "macos-13-xlarge", "macos-13-large", "macos-13", "macos-12-xl", "macos-12-xlarge", "macos-12-large", "macos-12", "self-hosted", "x64", "arm", "arm64", "linux", "macos", "windows". if it is a custom label for self-hosted runner, set list of labels in actionlint.yaml config file

(runner-label)


77-77: label "ubuntu_32_core_128gb" is unknown. available labels are "windows-latest", "windows-latest-8-cores", "windows-2022", "windows-2019", "ubuntu-latest", "ubuntu-latest-4-cores", "ubuntu-latest-8-cores", "ubuntu-latest-16-cores", "ubuntu-24.04", "ubuntu-22.04", "ubuntu-20.04", "macos-latest", "macos-latest-xl", "macos-latest-xlarge", "macos-latest-large", "macos-15-xlarge", "macos-15-large", "macos-15", "macos-14-xl", "macos-14-xlarge", "macos-14-large", "macos-14", "macos-13-xl", "macos-13-xlarge", "macos-13-large", "macos-13", "macos-12-xl", "macos-12-xlarge", "macos-12-large", "macos-12", "self-hosted", "x64", "arm", "arm64", "linux", "macos", "windows". if it is a custom label for self-hosted runner, set list of labels in actionlint.yaml config file

(runner-label)


104-104: label "ubuntu_32_core_128gb" is unknown. available labels are "windows-latest", "windows-latest-8-cores", "windows-2022", "windows-2019", "ubuntu-latest", "ubuntu-latest-4-cores", "ubuntu-latest-8-cores", "ubuntu-latest-16-cores", "ubuntu-24.04", "ubuntu-22.04", "ubuntu-20.04", "macos-latest", "macos-latest-xl", "macos-latest-xlarge", "macos-latest-large", "macos-15-xlarge", "macos-15-large", "macos-15", "macos-14-xl", "macos-14-xlarge", "macos-14-large", "macos-14", "macos-13-xl", "macos-13-xlarge", "macos-13-large", "macos-13", "macos-12-xl", "macos-12-xlarge", "macos-12-large", "macos-12", "self-hosted", "x64", "arm", "arm64", "linux", "macos", "windows". if it is a custom label for self-hosted runner, set list of labels in actionlint.yaml config file

(runner-label)


131-131: label "ubuntu_32_core_128gb" is unknown. available labels are "windows-latest", "windows-latest-8-cores", "windows-2022", "windows-2019", "ubuntu-latest", "ubuntu-latest-4-cores", "ubuntu-latest-8-cores", "ubuntu-latest-16-cores", "ubuntu-24.04", "ubuntu-22.04", "ubuntu-20.04", "macos-latest", "macos-latest-xl", "macos-latest-xlarge", "macos-latest-large", "macos-15-xlarge", "macos-15-large", "macos-15", "macos-14-xl", "macos-14-xlarge", "macos-14-large", "macos-14", "macos-13-xl", "macos-13-xlarge", "macos-13-large", "macos-13", "macos-12-xl", "macos-12-xlarge", "macos-12-large", "macos-12", "self-hosted", "x64", "arm", "arm64", "linux", "macos", "windows". if it is a custom label for self-hosted runner, set list of labels in actionlint.yaml config file

(runner-label)


158-158: label "ubuntu-8_cores-32_gb" is unknown. available labels are "windows-latest", "windows-latest-8-cores", "windows-2022", "windows-2019", "ubuntu-latest", "ubuntu-latest-4-cores", "ubuntu-latest-8-cores", "ubuntu-latest-16-cores", "ubuntu-24.04", "ubuntu-22.04", "ubuntu-20.04", "macos-latest", "macos-latest-xl", "macos-latest-xlarge", "macos-latest-large", "macos-15-xlarge", "macos-15-large", "macos-15", "macos-14-xl", "macos-14-xlarge", "macos-14-large", "macos-14", "macos-13-xl", "macos-13-xlarge", "macos-13-large", "macos-13", "macos-12-xl", "macos-12-xlarge", "macos-12-large", "macos-12", "self-hosted", "x64", "arm", "arm64", "linux", "macos", "windows". if it is a custom label for self-hosted runner, set list of labels in actionlint.yaml config file

(runner-label)


185-185: label "ubuntu-8_cores-32_gb" is unknown. available labels are "windows-latest", "windows-latest-8-cores", "windows-2022", "windows-2019", "ubuntu-latest", "ubuntu-latest-4-cores", "ubuntu-latest-8-cores", "ubuntu-latest-16-cores", "ubuntu-24.04", "ubuntu-22.04", "ubuntu-20.04", "macos-latest", "macos-latest-xl", "macos-latest-xlarge", "macos-latest-large", "macos-15-xlarge", "macos-15-large", "macos-15", "macos-14-xl", "macos-14-xlarge", "macos-14-large", "macos-14", "macos-13-xl", "macos-13-xlarge", "macos-13-large", "macos-13", "macos-12-xl", "macos-12-xlarge", "macos-12-large", "macos-12", "self-hosted", "x64", "arm", "arm64", "linux", "macos", "windows". if it is a custom label for self-hosted runner, set list of labels in actionlint.yaml config file

(runner-label)

.github/workflows/push_to_canary.yaml

58-58: shellcheck reported issue in this script: SC2086:info:9:20: Double quote to prevent globbing and word splitting

(shellcheck)


74-74: shellcheck reported issue in this script: SC2086:info:13:46: Double quote to prevent globbing and word splitting

(shellcheck)


74-74: shellcheck reported issue in this script: SC2086:info:14:38: Double quote to prevent globbing and word splitting

(shellcheck)


503-503: shellcheck reported issue in this script: SC2086:info:2:36: Double quote to prevent globbing and word splitting

(shellcheck)


503-503: shellcheck reported issue in this script: SC2086:info:4:16: Double quote to prevent globbing and word splitting

(shellcheck)


503-503: shellcheck reported issue in this script: SC2086:info:8:19: Double quote to prevent globbing and word splitting

(shellcheck)


503-503: shellcheck reported issue in this script: SC2086:info:10:17: Double quote to prevent globbing and word splitting

(shellcheck)

⏰ Context from checks skipped due to timeout of 90000ms (5)
  • GitHub Check: groupby_tests
  • GitHub Check: spark_tests
  • GitHub Check: batch_tests
  • GitHub Check: spark_tests
  • GitHub Check: batch_tests
🔇 Additional comments (300)

api/python/requirements/base.txt (1)

1-126: Autogenerated pinned deps
Approved; ensures reproducible builds.

🧰 Tools
🪛 LanguageTool

[duplication] ~31-~31: Possible typo: you repeated a word.
Context: ...irements/base.in face==24.0.0 # via glom glom==24.11.0 # via -r requirements/base...

(ENGLISH_WORD_REPEAT_RULE)


[duplication] ~48-~48: Possible typo: you repeated a word.
Context: ...e.in google-cloud-core==2.4.3 # via google-cloud-storage google-cloud-storage==2.19.0 # via -r requirements/base....

(ENGLISH_WORD_REPEAT_RULE)


[duplication] ~54-~54: Possible typo: you repeated a word.
Context: ...ia # google-cloud-storage # google-resumable-media google-resumable-media==2.7.2 # via google-cloud-storage g...

(ENGLISH_WORD_REPEAT_RULE)


[duplication] ~64-~64: Possible typo: you repeated a word.
Context: ... # via # google-api-core # grpcio-status grpcio-status==1.71.0 # via google-api-core idna=...

(ENGLISH_WORD_REPEAT_RULE)
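The reproducibility point made for base.txt depends on every requirement being pinned to an exact version. A minimal Python sketch of such a check, assuming pip-compile-style lines (`name==version` entries with `# via ...` comment lines); the helper name `unpinned_requirements` and the unpinned sample entry are hypothetical and not part of this PR:

```python
def unpinned_requirements(lines):
    """Return requirement lines that are not pinned with '=='."""
    unpinned = []
    for raw in lines:
        # Drop inline comments and surrounding whitespace.
        line = raw.split("#", 1)[0].strip()
        # Skip blanks, pure comment lines, and pip options such as "-r other.txt".
        if not line or line.startswith("-"):
            continue
        if "==" not in line:
            unpinned.append(line)
    return unpinned


sample = [
    "glom==24.11.0",
    "    # via -r requirements/base.in",
    "google-cloud-storage==2.19.0",
    "requests>=2.0",  # illustrative unpinned entry, not from base.txt
]
print(unpinned_requirements(sample))
```

An empty result would mean every listed requirement carries an exact pin.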

api/python/requirements/base.in (1)

1-12: Remove lingering refs
Ensure CI, docs, and scripts no longer reference base.in.

api/python/README.md (1)

137-137: Path reference updated correctly.

Updated directory reference from api/py to api/python in pre-commit setup instructions.

aggregator/src/test/scala/ai/chronon/aggregator/test/DataGen.scala (8)

68-71: Formatting improvement.

Method signature formatting improved for readability.


97-102: Improved chain formatting.

Chain of method calls reformatted for better readability.


166-168: Simplified API by removing redundant parameter.

Removed partitionColumn parameter since it can be derived from partitionSpec.


175-176: Local derivation of partitionColumn.

Now deriving partitionColumn locally instead of passing as parameter.


185-185: Alignment fix.

Fixed alignment with other case patterns.


192-192: Updated method call.

Removed redundant partitionColumn parameter from genImpl call.


194-196: Updated method calls.

Removed redundant partitionColumn parameter from genImpl calls.


201-202: Simplified method signature.

Simplified gen method to only take partitionSpec.

api/python/test/sample/joins/sample_team/sample_backfill_mutation_join.py (1)

20-22: Import reordering.

Imports reordered for better organization.

api/python/test/sample/joins/sample_team/sample_join_from_module_skipped.py (1)

20-20: Reordered import statement.

Import reordering is consistent with style conventions.

api/python/ai/chronon/model.py (1)

3-4: Import reordering looks good.

Added spacing improves readability.

aggregator/src/main/scala/ai/chronon/aggregator/base/MinHeap.scala (1)

77-77: Comment formatting fixed.

Added space after comment slashes for consistency.

api/python/test/conftest.py (1)

18-19: Import reordering looks good.

Added blank line improves readability.

aggregator/src/main/scala/ai/chronon/aggregator/stats/EditDistance.scala (1)

17-17: Package declaration updated to reflect new structure.

Package changed from spark.stats to aggregator.stats, aligning with module restructuring.

.bazeliskrc (1)

1-1: Bazel version pinned for reproducible builds.

Using Bazel 6.4.0 ensures consistent build environment across development machines.

.bazelignore (1)

1-2: Appropriate exclusion of Git directory.

Excluding .git from Bazel scanning improves build performance.

api/python/test/sample/joins/sample_team/sample_join_external_parts.py (1)

22-22: Import statements consolidated.

Multiple imports consolidated into single line, improving code readability.

api/python/requirements/dev.in (1)

8-9: Additions of zipp and importlib-metadata packages

These packages support Python package metadata handling, commonly used together.

api/python/ai/chronon/repo/team_json_utils.py (2)

1-1: Docstring formatting improved

Single-line docstring is more concise.


20-20: Consistent string quoting style

Changed from single to double quotes for consistency.

api/python/test/sample/group_bys/sample_team/sample_group_by_with_derivations.py (1)

17-17: Import consolidation

Simplified multiple imports into a single line.

aggregator/src/main/scala/ai/chronon/aggregator/base/TimedAggregators.scala (2)

75-75: Comment formatting fix

Added space after comment slashes for readability.


95-95: Comment formatting fix

Added space after comment slashes for consistency.

aggregator/src/test/scala/ai/chronon/aggregator/test/NaiveAggregator.scala (1)

21-21: Import consolidation looks good.

Importing multiple elements from the same package is cleaner.

api/python/test/sample/group_bys/sample_team/sample_group_by_with_incorrect_derivations.py (1)

18-18: Clean import statement.

Single-line import improves readability.

api/python/ai/chronon/cli/compile/display/console.py (1)

1-3: Good console setup.

Simple and effective Rich console initialization for formatted output.

api/python/test/sample/group_bys/sample_team/sample_group_by_missing_input_column.py (3)

17-17: Proper spacing after imports.

Blank line improves code readability.


20-20: Good import reordering.

Alphabetical order is more maintainable.


28-28: Helpful test case documentation.

Comment clarifies the intentional error case being tested.

.scalafmt.conf (2)

1-2: Version upgrade with dialect specification.

Version updated and dialect explicitly set to Scala 2.12.


7-8: Docstring wrapping disabled.

Formatting preference added to prevent docstring wrapping.

api/python/test/canary/README.md (1)

1-7: Minimal documentation added.

Basic structure with cloud-specific links established.

.tool-versions (1)

1-7: Tool versions pinned.

Development environment dependencies properly specified for consistent builds.

aggregator/src/main/scala/ai/chronon/aggregator/row/MapColumnAggregator.scala (1)

21-21: Import update looks good

Replaced Scala's utility with a project-specific conversion that handles nulls gracefully.

.plugin-versions (1)

1-6: Pinned plugin versions as expected

Version pinning ensures consistent development environment.

api/python/test/sample/joins/kaggle/outbrain.py (1)

17-21: Import cleanup looks good

Simplified and consolidated imports while maintaining functionality.

aggregator/src/main/scala/ai/chronon/aggregator/windowing/HopsAggregator.scala (1)

21-21: Import consolidation looks good

Grouped related imports from same package for better readability.

api/python/test/sample/group_bys/sample_team/entity_sample_group_by_from_module.py (2)

21-21: Clean import consolidation.

Imports now use ai.chronon.types module for all type definitions.


28-28: Improved window readability.

Window format changed from explicit object to cleaner string notation.

aggregator/src/main/scala/ai/chronon/aggregator/windowing/SawtoothAggregator.scala (1)

20-20: Consolidated imports.

Import statement now concisely groups all imports from ai.chronon.api.

api/python/test/sample/group_bys/sample_team/sample_group_by_from_module.py (2)

21-25: Consolidated type imports.

Imports now consistently use ai.chronon.types for all type-related classes.


35-35: Simplified window syntax.

Window format changed from explicit object to cleaner string notation "7d".
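To illustrate what a string window such as "7d" encodes, here is a minimal Python sketch that splits it into a length and a unit. The function `parse_window` and the accepted units (`h`, `d`) are assumptions for illustration, not Chronon's actual parsing logic:

```python
import re


def parse_window(spec):
    """Split a duration string like '7d' into (length, unit)."""
    match = re.fullmatch(r"(\d+)([hd])", spec.strip())
    if match is None:
        raise ValueError(f"unrecognized window: {spec!r}")
    return int(match.group(1)), match.group(2)


print(parse_window("7d"))
```

Rejecting anything that does not match the pattern keeps typos such as "7 days" from silently becoming a different window.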

api/python/test/sample/group_bys/sample_team/mutation_sample_group_by.py (1)

21-26: Organized imports.

Import statement now consistently groups all type-related classes from ai.chronon.types.

.github/pull_request_template.md (1)

1-8: Looks good.

Clean and simple PR template.

api/python/test/sample/joins/sample_team/sample_join_from_group_by_from_join.py (2)

15-17: Import ordering improvement.

Cleaner import structure.


23-27:

✅ Verification successful

Verify version change.

Changed from using v1 to v2 for group_by reference. Confirm this is intentional.


🏁 Script executed:

#!/bin/bash
# Check if sample_group_by_from_join_part.v2 exists and compare with v1
grep -rn "v2 = " group_bys/sample_team/sample_group_by_from_join_part.py

Length of output: 163


🏁 Script executed:

#!/bin/bash
# locate group_by part file
file=$(fd sample_group_by_from_join_part.py)
echo "file: $file"
# list v1 and v2 defs
grep -nE 'v1\s*=|v2\s*=' "$file"

Length of output: 427


Version change verified: sample_group_by_from_join_part.py only defines v2 (at line 26), and v1 is absent—referencing v2 is intentional.
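The second script's check (list the lines that assign v1 or v2) can also be expressed in Python. This is a sketch of a similar regular-expression search over an in-memory string; `find_assignments` and the sample source are illustrative, not taken from the repository, and the pattern adds a word boundary that the grep expression does not have:

```python
import re


def find_assignments(source, names=("v1", "v2")):
    """Return (line_number, name) pairs for lines that assign one of `names`."""
    pattern = re.compile(r"\b(" + "|".join(map(re.escape, names)) + r")\s*=")
    hits = []
    for lineno, line in enumerate(source.splitlines(), start=1):
        match = pattern.search(line)
        if match:
            hits.append((lineno, match.group(1)))
    return hits


sample_source = "import os\n\nv2 = None  # placeholder definition\n"
print(find_assignments(sample_source))
```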

aggregator/src/main/scala/ai/chronon/aggregator/windowing/TwoStackLiteAggregator.scala (2)

47-53: Improved lambda syntax.

More concise and readable code structure.


164-217: Better documentation format.

Converted to proper ScalaDoc style. Content preserved.

api/python/test/sample/group_bys/sample_team/sample_non_prod_group_by.py (3)

15-21: Simplified imports.

Consolidated imports from ai.chronon.types.


26-29: Improved window syntax.

Using string shorthand "7d" is cleaner than Window object.


30-33: Added trailing comma.

Good practice for cleaner diffs when parameters are added.
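The diff argument can be made concrete with Python's standard `difflib`. The sketch below counts added and removed lines when one argument is appended to a call, with and without a trailing comma; the `call(...)` snippets are made-up stand-ins, not code from this PR:

```python
import difflib


def changed_lines(before, after):
    """Count added and removed lines in a unified diff of two line lists."""
    diff = difflib.unified_diff(before, after, lineterm="", n=0)
    return sum(
        1
        for line in diff
        if line.startswith(("+", "-")) and not line.startswith(("+++", "---"))
    )


# Without a trailing comma, appending an argument also touches the previous line.
no_comma_before = ["call(", "    a=1,", "    b=2", ")"]
no_comma_after = ["call(", "    a=1,", "    b=2,", "    c=3", ")"]

# With a trailing comma, only the new line shows up in the diff.
comma_before = ["call(", "    a=1,", "    b=2,", ")"]
comma_after = ["call(", "    a=1,", "    b=2,", "    c=3,", ")"]

print(changed_lines(no_comma_before, no_comma_after))  # 3
print(changed_lines(comma_before, comma_after))  # 1
```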

.github/release.yml (1)

1-17: Solid changelog configuration.

Clean setup for semantic versioning labels.

🧰 Tools
🪛 YAMLlint (1.35.1)

[error] 9-9: trailing spaces

(trailing-spaces)


[error] 17-17: no new line character at the end of file

(new-line-at-end-of-file)

api/python/test/sample/group_bys/sample_team/sample_group_by_group_by.py (1)

17-20: Import reorganization looks good.

Cleaner import structure.

.github/ISSUE_TEMPLATE/feature_request.md (1)

1-20: Well-structured feature request template.

Standard GitHub template with clear sections.

aggregator/src/main/scala/ai/chronon/aggregator/row/RowAggregator.scala (1)

39-63: Good lambda refactoring.

Improved pattern matching syntax while maintaining functionality.

api/python/test/sample/joins/sample_team/sample_join_derivation.py (1)

21-21: Import changes follow best practices.

Reordering imports alphabetically and grouping by source improves readability.

Also applies to: 23-23, 25-25

api/python/test/sample/group_bys/sample_team/group_by_with_kwargs.py (2)

21-25: Consolidated imports enhance maintainability.

Moving all imports to ai.chronon.types simplifies the import structure.


36-36: Simplified window specification improves usability.

String-based window specification "7d" replaces Window object, aligning with API simplification goals.

aggregator/src/test/scala/ai/chronon/aggregator/test/MinHeapTest.scala (1)

21-21: Test framework migration improves consistency.

Migrating from JUnit to ScalaTest's AnyFlatSpec aligns with project's test standardization while preserving test logic.

Also applies to: 26-27

api/python/test/sample/group_bys/quickstart/users.py (3)

15-15: Import changes align with API updates.

Import order updated and selects replaces select.

Also applies to: 19-19


26-35: Correctly migrated to selects function.

Updated to use the new selects function with proper formatting.


37-42: Added trailing commas for consistency.

Formatting improvements with trailing commas.

.github/workflows/require_triggered_status_checks.yaml (1)

1-14: Good branch protection workflow.

Correctly configured to enforce required status checks before allowing pushes.

api/python/ai/chronon/cli/compile/display/compiled_obj.py (1)

1-12: Well-structured CompiledObj dataclass.

Clean implementation with proper type hints.

api/python/test/sample/joins/quickstart/training_set.py (4)

19-22: Import statements properly reorganized.

Imports are now better grouped and organized.


27-37: Source definition correctly updated.

Migrated to selects function with improved formatting.


39-44: Join constructor nicely reformatted.

Better multi-line style with clear comments.


46-52: Second join constructor properly formatted.

Consistent with the first join's style improvements.

api/python/test/sample/joins/sample_team/sample_join_with_derivations_on_external_parts.py (3)

19-22: Import reordering looks good.

The reorganization of imports is clean and follows a logical grouping.


23-23: Import moved to improve organization.

Moving test_sources import after group_bys imports follows a better organizational pattern.


25-33: Import reordering for better organization.

The reordering groups related concepts together - contextual elements, data types, and operations.

api/python/test/sample/group_bys/quickstart/schema.py (4)

1-3: Updates API imports correctly.

Properly updates imports to use selects instead of deprecated select.


9-12: Updates API usage and improves formatting.

Correctly changes select to selects and adds trailing comma for better git diffs.


22-22: Simplified aggregation formatting.

Converted multi-line list to more concise single-line format.


24-24: Added trailing comma for consistency.

Improves maintainability by enabling cleaner diffs in future changes.

.bazelproject (1)

1-22: Good Bazel workspace configuration.

Properly configures the project for Bazel builds with appropriate language support for Python, Scala, TypeScript, and Java.

api/python/test/sample/group_bys/sample_team/sample_group_by.py (4)

17-17: Imports organized on a single line.

Consolidated related imports for better readability.


29-30: Added trailing comma for consistency.

Improves git diff readability for future changes.


32-32: Added online=True parameter.

Explicitly enables online processing mode.


46-48: Improved derivations formatting.

Converted to more concise inline format with proper trailing commas.

aggregator/src/test/scala/ai/chronon/aggregator/test/EditDistanceTest.scala (1)

17-23: Framework migration looks good.

Clean migration from JUnit to ScalaTest with proper imports and class inheritance.

api/python/test/sample/joins/sample_team/sample_join_bootstrap.py (3)

26-27: Import consolidation looks good.

Properly consolidated imports from ai.chronon submodules.


32-44: String literal style change is consistent.

Good conversion from single to double quotes for dictionary keys.


59-63: API update is correct.

Properly updated to use selects instead of select function.

api/python/test/sample/aws/teams.json (2)

7-7: Placeholder needs attention.

<CUSTOMER_ID> placeholder aligns with PR objective to remove customer names.


1-16:

✅ Verification successful

New configuration file looks good.

Clean JSON structure with appropriate defaults for AWS environment.


🏁 Script executed:

#!/bin/bash
# Validate JSON syntax
jq '.' api/python/test/sample/aws/teams.json > /dev/null
echo $?

Length of output: 55


Configuration file approved
Syntax validation passed (jq . exit code 0). JSON is clean with correct AWS defaults. Ready to merge.
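For readers without jq, roughly the same syntax check can be done with Python's standard `json` module. This is a sketch only: `is_valid_json` is a hypothetical helper, and the sample strings are invented rather than copied from teams.json:

```python
import json


def is_valid_json(text):
    """Return True if `text` parses as a single JSON document."""
    try:
        json.loads(text)
    except json.JSONDecodeError:
        return False
    return True


print(is_valid_json('{"default": {"namespace": "example"}}'))
print(is_valid_json('{"default": }'))
```

Unlike jq, `json.loads` accepts only one top-level value, so a file containing several concatenated documents would be reported as invalid here.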

api/python/test/sample/group_bys/sample_team/sample_chaining_group_by.py (6)

21-23: Import restructuring looks good.

Clean import organization.


25-35: Simplified imports from ai.chronon.types.

Consolidated imports improve maintainability.


42-43: Standardized string quotes in key_mapping.

Consistent double quote usage.

Also applies to: 46-47


50-51: Added check_consistency flag.

Important safety feature.


54-66: Updated sources formatting and API usage.

Proper list format for sources and using selects instead of select.


76-77: Added trailing comma.

Style consistency improvement.

.github/ISSUE_TEMPLATE/bug_report.md (1)

1-38: Standard GitHub bug report template.

Template follows GitHub best practices with clear sections for bug description, reproduction steps, expected behavior, and environment details.

api/python/test/sample/group_bys/sample_team/chaining_group_by.py (2)

1-4: Clean imports.

Well-organized imports.


5-29: Well-structured GroupBy definition.

GroupBy correctly uses JoinSource with appropriate configuration.

api/python/test/sample/group_bys/sample_team/sample_group_by_from_join_part.py (4)

15-17: Reordered import.

Import restructuring improves readability.


18-24: Consolidated imports from ai.chronon.types.

Better module organization.


26-29: Renamed variable and improved formatting.

Renamed from v1 to v2 with better multi-line formatting.


35-36: Added trailing comma.

Consistent style.

api/python/ai/chronon/resources/gcp/sources/test/data.py (4)

1-3: Clean imports.

Good structure of imports with clear separation.


4-9: Clear documentation.

Docstring explains purpose concisely.


14-21: Well-structured Source definition.

Source configuration is clear with good inline comments explaining each component.


23-24: Helpful usage note.

Good indication of how the object can be used.

api/python/test/sample/joins/sample_team/sample_join_from_module.py (4)

19-22: Import reordering.

Imports are now better organized.


25-25: Cleaner import structure.

Clean single-line import format.


28-28: Consistent spacing.

Removed extra spaces around the equals sign.


32-33: Standardized string quotes.

Switched from single to double quotes for consistency and added trailing comma for better git diffs.

Also applies to: 36-37

api/python/ai/chronon/resources/gcp/group_bys/test/data.py (3)

2-4: Clean imports.

Good separation of imports with appropriate blank line.


6-6: Concise window definition.

Good use of list comprehension with clear inline comment.


8-33: Well-structured GroupBy configuration.

Clear organization with good comments explaining each aggregation type.

aggregator/src/main/scala/ai/chronon/aggregator/row/StatsGenerator.scala (8)

21-21: Updated import for ScalaJavaConversions.

Standardized Scala-Java conversion utility.


28-35: Improved comment formatting.

Better spacing in ScalaDoc comment.


47-54: Fixed comment formatting.

Improved ScalaDoc comment structure.


61-62: Consistent ScalaDoc style.

Fixed spacing in comment.


69-71: Improved comment formatting.

Better spacing in ScalaDoc.


114-124: Cleaner pattern matching.

More idiomatic Scala with case expressions.


145-153: Fixed comment formatting.

Better spacing in documentation.


160-165: More readable method chaining.

Better line breaks and indentation in method calls.

api/python/test/sample/group_bys/risk/user_data.py (1)

1-28: Clean implementation of user data GroupBy

Good structure with clear documentation and well-organized fields selection for user data.

api/python/test/sample/group_bys/quickstart/purchases.py (4)

15-17: Import refactoring looks good

Updated imports align with new API conventions.


24-35: Source definition update looks good

Changed selects usage and added bucket_rand column.


37-37: Window format simplification

Good conversion to string-based window format.


43-66: GroupBy aggregation updates look good

Added LAST_K operations with appropriate configurations.

api/python/ai/chronon/repo/__init__.py (1)

15-32: Good mapping structure for folder names to classes

Clean implementation of class mapping for the new compilation framework.

api/python/test/sample/group_bys/kaggle/outbrain.py (3)

15-22: Import cleanup looks good

Removed unnecessary imports.


43-53: Simplified window format

Good conversion to string-based window format for aggregations.


79-81: Improved formatting

Better multi-line formatting for readability.

aggregator/src/test/scala/ai/chronon/aggregator/test/SawtoothAggregatorTest.scala (4)

26-26: Appropriate framework migration

ScalaTest import replaces JUnit.


49-49: Good modernization

Updated to extend AnyFlatSpec instead of TestCase.


51-51: Improved test style

Converted to ScalaTest's BDD style.


122-122: Consistent test style

Matches ScalaTest convention used throughout file.

aggregator/src/test/scala/ai/chronon/aggregator/test/VarianceTest.scala (3)

21-21: Framework modernization

ScalaTest import added.


25-25: Modern test base class

AnyFlatSpec is appropriate for this test.


63-63: Better test declaration style

Converted to BDD "it should" format.

api/python/test/canary/joins/gcp/training_set.py (1)

1-36: Clean training set join definition

Join setup correctly references GCP group_bys. Well-documented.

aggregator/src/test/scala/ai/chronon/aggregator/test/RowAggregatorTest.scala (7)

22-22: Framework migration

ScalaTest import added.


51-52: Modern test structure

Updated to AnyFlatSpec with BDD style.


88-88: Helpful comment added

Explanation clarifies calculation.


118-121: Cleaner pattern matching

Improved lambda syntax.


123-126: Consistent pattern matching

Matches style used elsewhere.


132-135: Cleaner lambda syntax

Pattern matching improves readability.


138-140: Simplified lambda

Cleaner pattern matching syntax.

api/python/requirements/dev.txt (1)

1-77: Dependency updates look good.

Version upgrades maintain compatibility.

🧰 Tools
🪛 LanguageTool

[duplication] ~63-~63: Possible typo: you repeated a word.
Context: ...1 # via tox pytest==8.3.5 # via pytest-cov pytest-cov==6.1.1 # via -r requirements/dev.in...

(ENGLISH_WORD_REPEAT_RULE)

api/python/ai/chronon/cli/compile/display/diff_result.py (1)

1-46: Well-structured diff display.

Clean implementation of colored diff output.

api/python/ai/chronon/resources/gcp/joins/test/data.py (1)

11-21: Good implementation with clear comments.

The source definition is well-structured with helpful inline comments.

.gitignore (4)

14-32: Good updates for new project structure.

Directory path updates reflect migration from api/py/ to api/python/.


39-44: Path updates aligned with directory restructuring.

Python distribution patterns updated properly.


84-86: Good addition for frontend generated files.

Properly excludes Thrift-generated types from version control.


103-114: Appropriate Bazel-related exclusions.

Correctly excludes Bazel temporary files and local configuration.

.bazelrc (4)

1-2: Good practice for local overrides.

Allows developers to customize build settings locally.


3-14: Well-structured Scala version configuration.

Clear documentation and configuration options for different Scala versions.


21-31: Good platform-specific configurations.

Properly handles compiler warning suppressions for different platforms.


15-17:

✅ Verification successful

Consider securing the remote cache URL.

Hardcoded GCS URL could expose sensitive infrastructure details.

Consider moving the remote cache URL to the .bazelrc.local file or using an environment variable.


🏁 Script executed:

#!/bin/bash
# Check if this URL is publicly accessible or contains sensitive information
curl -I https://storage.googleapis.com/zipline-bazel-cache 2>/dev/null | head -n 1

Length of output: 84


Remote cache URL is private (HTTP 403)
The GCS endpoint rejects anonymous requests, so the bucket isn’t publicly accessible. No change required.
Optional: if you’d like to abstract infra details, you can move the URL to .bazelrc.local or reference an environment variable instead.

aggregator/src/test/scala/ai/chronon/aggregator/test/SawtoothOnlineAggregatorTest.scala (2)

27-27: Successful migration to ScalaTest.

Clean conversion from JUnit to ScalaTest's AnyFlatSpec style.

Also applies to: 34-34, 36-36


52-123: Improved formatting consistency.

Parameter alignment in Builders.Aggregation calls is now more consistent.

aggregator/src/test/scala/ai/chronon/aggregator/test/ApproxPercentilesTest.scala (4)

23-23: Appropriate import for ScalaTest migration

Clean import for AnyFlatSpec.


29-29: Good migration to ScalaTest

Proper inheritance from AnyFlatSpec instead of TestCase.


59-65: Well-formatted ScalaTest syntax

Clean conversion from JUnit to ScalaTest style using "it should".


77-95: Clean test migration

Good conversion of "testPSIDrifts" to ScalaTest format.

api/python/ai/chronon/repo/serializer.py (5)

18-25: Well-organized imports

Better organization of Thrift-related imports.


31-41: Improved code formatting

Better indentation in ThriftJSONDecoder methods.


62-75: Better readability for complex dictionary comprehension

Multi-line formatting improves readability.


95-102: Useful new utility function

Good addition of json2binary for Thrift serialization.


109-112: Improved error message

Better formatted error message with more context.

aggregator/src/main/scala/ai/chronon/aggregator/windowing/SawtoothMutationAggregator.scala (4)

41-41: Made parameter public

Converting the constructor parameter to a val makes it accessible outside the class.


112-115: Performance optimization

Storing windowMillis avoids repeated property access.


131-135: Loop optimization

Reduces repeated calls to windowMappings.


148-153: Consistent optimization

Same pattern applied throughout the class.

api/python/test/sample/group_bys/kaggle/clicks.py (4)

15-25: Better import organization

Appropriate import grouping and updated to use selects.


44-52: Updated query construction

Using selects instead of select aligns with API changes.


58-60: Enhanced aggregations

Added COUNT operation and simplified window syntax.


64-65: Consistent formatting

Added trailing comma for better git diffs.

api/python/test/sample/group_bys/quickstart/returns.py (4)

15-15: Imports updated to use EventSource before Source

Updated import order provides better structure for dependency resolution.


21-21: Updated to use selects instead of select

API updated to use the newer selects function from the query module.


38-38: Simplified window sizes with string durations

Replaced Window objects with more readable string format.


44-53: Reordered aggregations for clarity

Updated the order of operations to be more logical (SUM, COUNT, AVERAGE).

api/python/test/sample/joins/sample_team/sample_online_join.py (4)

26-27: Added important imports for environment configuration

New imports enable proper environment configuration.


34-43: Standardized key format to double quotes

Changed key_mapping values to use consistent double quotes.


45-49: Improved environment variable configuration

Replaced legacy parameters with structured EnvironmentVariables object.


50-51: Added important join configuration flags

Added online and consistency checking parameters.

api/python/ai/chronon/repo/init.py (4)

14-28: Well-structured CLI command definition

Good use of click decorators with appropriate options.


33-39: Added safeguard for existing directories

Prevents accidental overwrites with user confirmation.


42-47: Clear user feedback with helpful instructions

Good UX with success message and PYTHONPATH instructions.


48-49: Robust error handling

Exception handling prints full traceback for debugging.

api/python/test/sample/joins/sample_team/sample_label_join_with_agg.py (5)

23-23: Import structure updated

Cleaner import from dedicated module.


26-26: Consolidated import from types module

Good simplification using the centralized types module.


34-35: String format standardization

Consistent use of double quotes in key mappings.

Also applies to: 38-39


41-48: Updated label parts structure

Migrated to the newer LabelParts API that wraps a list of JoinPart objects.


49-50: Simplified Join parameters

Removed deprecated parameters and improved formatting.

api/python/test/sample/joins/risk/user_transactions.py (4)

1-4: Well-organized imports

Good organization of domain-specific group-bys.


5-7: Clear API imports

Clean import structure for Chronon types.


9-13: Well-defined source

Clean source definition with appropriate table and query configuration.


15-23: Well-structured join

Effective organization of join parts with appropriate prefixes to differentiate fields.

api/python/test/sample/joins/sample_team/sample_label_join.py (5)

23-23: Import from dedicated module

Improved structure using dedicated module.


26-26: Organized imports

Good import organization from join module.


34-35: Consistent string format

Standardized double quotes in key mappings.

Also applies to: 38-39


41-48: Updated label parts structure

Using the newer LabelParts API with list of JoinPart objects.


49-50: Simplified constructor parameters

Removed deprecated parameters for cleaner interface.

api/python/test/sample/group_bys/sample_team/event_sample_group_by.py (2)

17-17: Consolidated imports

Good use of centralized types module.


23-30: Streamlined aggregations

Improved aggregation structure with:

  • Simplified window specification ("7d")
  • Better organization of operations
  • Clearer percentile parameters
api/python/ai/chronon/repo/explore.py (4)

147-148: Updated team attribute access

Now using __dict__ attribute access instead of direct dictionary access, aligning with the new Team object model.


315-316: Fixed variable shadowing

Renamed file to filepath to avoid shadowing the parameter name.


370-384: Added support for Python-based team configuration

New function supports both JSON and Python module loading, enhancing flexibility.


403-403: Updated call to load_team_data

Now passing teams_root parameter to support the enhanced team loading functionality.

api/python/test/sample/group_bys/sample_team/label_part_group_by.py (1)

1-24: LGTM - Clean GroupBy definitions

The file properly defines two GroupBy objects with different configurations.

aggregator/src/test/scala/ai/chronon/aggregator/test/ApproxDistinctTest.scala (3)

21-23: Updated test framework

Migrated from JUnit to ScalaTest by extending AnyFlatSpec.


53-57: Modernized test style

Converted to ScalaTest's more readable "it should" syntax.


59-63: Consistent test style updates

Matches the ScalaTest pattern applied to other tests.

api/python/setup.py (7)

30-32: Added explicit version default

Set default version to "0.0.1" for better versioning control.


52-52: Added resource collection

New glob pattern collects test samples recursively.


55-55: Updated Python version requirement

Now requiring Python 3.11 in classifiers.


59-63: Added CLI entry point

Console script maps "zipline" command to ai.chronon.repo.zipline:zipline.


64-67: Package rebranding

Changed name from "chronon-ai" to "zipline-ai" with updated description.


68-70: Improved package data handling

Now includes resources directory content in the package.


74-78: Updated packaging settings

Increased minimum Python version to 3.11 and set zip_safe to False.

api/python/test/canary/group_bys/aws/purchases.py (1)

42-44: Confirm Operation.LAST_K(10) API.
If LAST_K is an enum member rather than a callable, calling it will raise a TypeError at runtime. Please verify.

api/python/test/sample/group_bys/risk/transaction_events.py (8)

1-3: Updated imports for standardized API.

Import changes align with API updates - using selects instead of select and removing unused imports.


9-9: Line spacing improvement.

Added space for better code readability.


16-17: Updated to use selects function and improved formatting.

Correctly updated to use selects function and added trailing comma for consistent formatting.


22-23: Enhanced readability with line spacing.

Added space for better code organization.


23-23: Simplified window specification.

Changed from explicit Window objects to string literals for more concise window specification.


35-36: Updated window specifications to use string format.

String-based window specifications replace explicit Window objects, consistent with API updates.

Also applies to: 40-41


45-45: Improved code organization.

Added space for better visual separation between function and implementation.


50-50: Added trailing comma for consistency.

Fixed formatting consistency.

api/python/test/sample/group_bys/risk/merchant_data.py (1)

1-29: New GroupBy for merchant data.

Well-structured merchant data source and GroupBy definition. Selects relevant merchant fields and correctly sets up the GroupBy keyed by merchant_id.

.github/workflows/push_to_platform.yaml (4)

3-11: Well-configured workflow trigger.

Correctly set to run only when PRs to main are closed and merged.


14-20: Proper repository checkout.

Correctly checks out platform repo with appropriate depth and reference.


22-25: Git configuration for Actions user.

Properly configures Git user identity for automated commits.


27-54: SSH setup for repository access.

Comprehensive SSH configuration for secure subtree operations.

🧰 Tools
🪛 YAMLlint (1.35.1)

[error] 34-34: trailing spaces

(trailing-spaces)


[error] 37-37: trailing spaces

(trailing-spaces)


[error] 41-41: trailing spaces

(trailing-spaces)


[error] 49-49: trailing spaces

(trailing-spaces)


[error] 52-52: trailing spaces

(trailing-spaces)

api/python/ai/chronon/windows.py (3)

1-51: Well-structured window parsing implementation.

Clean implementation for converting string duration formats to Window objects with proper error handling.


1-10: Helper functions for window creation.

Simple, focused helper functions that encapsulate window creation logic.


12-50: Robust string parsing with comprehensive error handling.

The _from_str function handles all edge cases with clear error messages.
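
A minimal sketch of the kind of string-to-window parsing described above, assuming a duration is an integer followed by `h` or `d`. The `Window` dataclass and the `parse_window` name are hypothetical stand-ins for illustration, not the actual `_from_str` implementation or the Thrift `Window` type.

```python
import re
from dataclasses import dataclass


# Hypothetical stand-in for the real Window type; used only for illustration.
@dataclass(frozen=True)
class Window:
    length: int
    unit: str  # "HOURS" or "DAYS"


_UNITS = {"h": "HOURS", "d": "DAYS"}
_PATTERN = re.compile(r"(\d+)([a-zA-Z]+)")


def parse_window(text: str) -> Window:
    """Parse strings such as "7d" or "24h" into a Window."""
    match = _PATTERN.fullmatch(text.strip())
    if match is None:
        raise ValueError(f"Invalid window {text!r}: expected <int><unit>, e.g. '7d'")
    length = int(match.group(1))
    suffix = match.group(2).lower()
    if suffix not in _UNITS:
        raise ValueError(f"Unsupported unit {suffix!r} in window {text!r}")
    if length <= 0:
        raise ValueError(f"Window length must be positive, got {length}")
    return Window(length, _UNITS[suffix])
```

Rejecting malformed input with a `ValueError` that echoes the offending string keeps configuration mistakes visible at compile time rather than at job run time.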

api/python/test/sample/joins/sample_team/sample_join.py (4)

17-17: New imports added for enhanced functionality

Imports restructured to include RunMode, EnvironmentVariables, and LabelParts.

Also applies to: 22-23


27-27: Removed experimental tag

Experimental tag removed from JoinPart instances.


30-36: Modernized environment variable handling

Replaced legacy env dictionary with structured EnvironmentVariables class. Added online flag and label_part configuration.


43-43: Updated string formatting

Changed single quotes to double quotes for consistency.

api/python/ai/chronon/cli/compile/fill_templates.py (3)

6-14: Well-structured helper function

Simple utility function for template substitution.


17-31: Clean implementation for join template handling

Function properly processes bootstrapParts and dependencies.


32-40: Label dependencies template handling

Correctly handles join_backfill_table template replacement for labelParts.

api/python/ai/chronon/repo/compilev3.py (2)

11-32: Well-designed CLI command

Good use of Click decorators with appropriate defaults and help text. Path handling is robust.


35-52: Clean compilation helper function

Proper directory validation and error handling. Good separation of concerns.

api/python/test/sample/deprecated_teams.json (2)

1-36: Appropriate default configurations

Generic default settings with placeholders for sensitive values.


37-64: Generic team definitions

Teams defined using generic names instead of actual customer names, consistent with PR objective.

api/python/test/canary/deprecated_teams.json (1)

9-11: Replace TODO placeholders

Replace temporary placeholder values with actual paths and classes.

These placeholders indicate incomplete configuration that should be addressed before production use.

api/python/test/canary/teams.py (2)

7-7: Update email placeholder

Replace customer placeholder with actual email.

-    email="ml-infra@<customer>.com",  # TODO: Infra team email
+    email="[email protected]",  # Replace with actual team email

18-20: Replace TODO placeholders

Three TODO items need resolution before production use.

These placeholders indicate incomplete configuration:

  • Hadoop directory path
  • Online class specification
  • Online args configuration
api/python/test/canary/group_bys/gcp/purchases.py (1)

30-37: Verify GroupBy signature. Ensure a name parameter isn’t required by GroupBy.

aggregator/src/main/scala/ai/chronon/aggregator/row/ColumnAggregator.scala (1)

142-148: Double → Long conversion silently truncates

Casting a Double (or Float) to Long via longValue() discards the fractional part. If callers expect rounding, this will surprise them—document or adjust.
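
The difference between truncating and rounding can be illustrated in Python, where `int()` drops the fractional part toward zero much as `longValue()` does on the JVM. This is only an analogue of the Scala behavior; the helper names below are hypothetical.

```python
import math


def truncate(value: float) -> int:
    # int() discards the fractional part, i.e. rounds toward zero.
    return int(value)


def round_half_up(value: float) -> int:
    # One possible alternative if callers expect rounding to the nearest integer,
    # with halves rounded up.
    return math.floor(value + 0.5)
```

For example, `truncate(2.9)` yields 2 while `round_half_up(2.9)` yields 3, which is the kind of surprise the comment warns about.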

api/python/ai/chronon/repo/zipline.py (2)

6-9: Internal CLI imports look solid. Paths and aliases align with the new modular layout.


47-49: Subcommand registration is correct. The CLI group is wired to compile, run, and init as intended.

api/python/ai/chronon/types.py (7)

13-20: Approve source re-exports. Clear aliases for Query and Source improve discoverability.


21-28: Approve aggregation exports. Exposing GroupBy, Aggregation, and friends in one place is helpful.


29-32: Accuracy enums correct. TEMPORAL and SNAPSHOT are properly aliased.


33-44: Join primitives consolidated. All join-related types are neatly re-exported.


46-49: StagingQuery & MetaData aliases. Good centralization of staging‐query concepts.


51-55: Common types re-exported. ExecutionInfo and TableDependency are critical for workflow definitions.


56-57: Verify Team existence. Ensure ttypes.Team is defined in your Thrift schema to prevent import failures.

.github/workflows/test_scala_2_12_non_spark.yaml (1)

31-34: Concurrency is well set. Cancelling in-progress runs for the same branch is a solid safeguard.

.github/workflows/test_scala_2_13_non_spark.yaml (2)

51-58: Scala 2.13 config applied. --config=scala_2.13 is correctly included in the test commands.


29-32: Concurrency block is good. Settings ensure orderly execution.

api/python/ai/chronon/eval/sample_tables.py (1)

24-24:

✅ Verification successful

Verify path handling.

Ensure local_warehouse is a Path object if using / operator.


🏁 Script executed:

#!/bin/bash
# Check local_warehouse definition
grep -r "local_warehouse" api/python/ai/chronon/eval/

Length of output: 686


Path handling validated

local_warehouse is defined as a pathlib.Path in api/python/ai/chronon/eval/table_scan.py, so using the / operator is safe.
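
As a small illustration of why the type matters: the `/` operator is defined on `pathlib.Path` but not between two plain strings. The sketch below coerces its input so either type works; `table_dir` and the example paths are hypothetical and not taken from `table_scan.py`.

```python
from pathlib import Path
from typing import Union


def table_dir(warehouse: Union[str, Path], table_name: str) -> Path:
    # Path(...) accepts both str and Path, so the "/" join below is always valid.
    # Joining two plain strings with "/" would raise TypeError instead.
    return Path(warehouse) / table_name
```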

api/python/ai/chronon/cli/compile/serializer.py (2)

24-84: LGTM!

Well-structured code with good type handling.


86-116: Clean implementation with good error handling.

Functions properly handle serialization/deserialization.

aggregator/BUILD.bazel (3)

1-21: LGTM!

Well-structured Scala library definition.


36-52: LGTM!

Test configuration looks good.


23-23:

✅ Verification successful

Verify dependency definition.

_SCALA_TEST_DEPS is used but definition not shown.


🏁 Script executed:

#!/bin/bash
# Find where _SCALA_TEST_DEPS is defined
grep -r "_SCALA_TEST_DEPS" --include="BUILD*" .

Length of output: 463


🏁 Script executed:

#!/bin/bash
# locate definition of _SCALA_TEST_DEPS
grep -R "_SCALA_TEST_DEPS" -n . --include="*.bzl"

Length of output: 49


🏁 Script executed:

#!/bin/bash
grep -R "load" -n aggregator/BUILD.bazel

Length of output: 40


🏁 Script executed:

#!/bin/bash
# search repo for definition
rg -n '_SCALA_TEST_DEPS' .

Length of output: 522


_SCALA_TEST_DEPS definition found
The _SCALA_TEST_DEPS list is declared in tools/build_rules/prelude_bazel:39. No further action needed here.

api/python/ai/chronon/query.py (6)

15-16: Added OrderedDict import and reordered imports for new selects function.


27-27: Fixed mutable default argument pattern.

Changed from setups: List[str] = [] to setups: List[str] = None to avoid potential bugs with shared mutable defaults.
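
The pitfall being avoided can be shown with a minimal sketch. Both functions below are hypothetical and unrelated to the real `Query` signature beyond the `setups` parameter name.

```python
from typing import List, Optional


def shared_default(setups: List[str] = []) -> List[str]:
    # The default list is created once at definition time and reused
    # across calls, so state leaks from one call into the next.
    setups.append("SET a=1")
    return setups


def fresh_default(setups: Optional[List[str]] = None) -> List[str]:
    # A None sentinel yields a new list on every call; copying also
    # avoids mutating a list passed in by the caller.
    setups = [] if setups is None else list(setups)
    setups.append("SET a=1")
    return setups
```

Calling `shared_default()` twice returns the same list object, which has grown to two entries, whereas `fresh_default()` returns a single-entry list each time.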


30-32: Added new partition-related parameters.

New parameters enhance query configuration flexibility.


77-85: Well-documented new parameters.

Clear descriptions for all new parameters.


89-99: Updated API object construction with explicit keyword arguments.

Properly added new parameters to the API Query constructor.


103-126: Renamed select to selects and improved implementation.

The new implementation:

  1. Uses OrderedDict to preserve argument order
  2. Handles both positional and keyword arguments
  3. Has clear documentation with examples

This is a better design that offers more flexibility.
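
A sketch of how such a helper might behave, assuming positional arguments select a column under its own name and keyword arguments map an alias to an expression. This illustrates the described design and is not necessarily the code in `query.py`.

```python
from collections import OrderedDict


def selects(*args: str, **kwargs: str) -> "OrderedDict[str, str]":
    """Build an ordered alias -> expression mapping."""
    result = OrderedDict()
    # Positional columns are selected as-is, in the order given.
    for column in args:
        result[column] = column
    # Keyword arguments follow, mapping alias -> SQL expression.
    for alias, expression in kwargs.items():
        result[alias] = expression
    return result
```

Under these assumptions, `selects("user_id", amount_usd="amount * 100")` produces a mapping whose keys appear in the order `user_id`, `amount_usd`.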

api/python/ai/chronon/cli/compile/parse_configs.py (1)

15-74: Well-structured function for recursive config parsing.

Function properly handles:

  • File discovery
  • Object extraction
  • Metadata updates
  • Error handling with graceful recovery
api/python/ai/chronon/source.py (3)

8-35: Well-designed EventSource wrapper with comprehensive documentation.

Clean function signature with clear docstring explaining event source semantics.


38-71: EntitySource wrapper with thorough documentation.

Function correctly constructs the entity source with appropriate parameters.


74-88: Concise JoinSource wrapper with clear online/offline flow documentation.

Function correctly constructs a join source with informative description of how it's used.

api/python/ai/chronon/repo/run.py (2)

52-86: Comprehensive default value handling.

Function properly initializes defaults from environment variables with fallbacks.


247-268: Cloud provider-specific runner dispatch logic.

The code effectively handles different cloud environments by dispatching to appropriate runners.

api/python/ai/chronon/repo/hub_uploader.py (1)

12-18: changed_entity_names logic is placeholder but misleading
Assigning the full dict to changed_entity_names then iterating over it always returns all entities. Add a TODO comment or implement real diffing to avoid accidental full uploads.
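
One way real diffing could look, sketched under the assumption that each entity is a JSON-serializable dict and that content hashes of previously uploaded entities are available. The function names and the `remote_hashes` argument are hypothetical, not part of `hub_uploader.py`.

```python
import hashlib
import json
from typing import Dict, List


def content_hash(entity: dict) -> str:
    # Serialize with sorted keys so logically equal dicts hash identically.
    payload = json.dumps(entity, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


def changed_entity_names(
    local: Dict[str, dict], remote_hashes: Dict[str, str]
) -> List[str]:
    # An entity counts as changed if it is new or its hash differs from the remote one.
    return sorted(
        name
        for name, entity in local.items()
        if remote_hashes.get(name) != content_hash(entity)
    )
```

Only the names returned here would then need uploading, which avoids the accidental full upload the comment describes.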

.github/workflows/test_scala_2_13_spark.yaml (2)

1-19: Well-structured workflow trigger configuration

Properly configured to run only when relevant files change.


23-23:

❓ Verification inconclusive

Custom runner labels need verification

These runner labels aren't standard GitHub Actions runners.

Also applies to: 50-50, 77-77, 104-104, 131-131, 158-158, 185-185


🏁 Script executed:

#!/bin/bash
# Check if these are self-hosted runners defined in the repository
gh api repos/$GITHUB_REPOSITORY/actions/runners | jq '.runners[] | .name'

Length of output: 140


Verify custom runner labels exist

These labels aren’t standard GitHub-hosted runners. Ensure you have matching self-hosted runners registered with these exact labels in your repo/org; otherwise these jobs will fail.

Affected locations:

  • .github/workflows/test_scala_2_13_spark.yaml: line 23 (runs-on: ubuntu-8_cores-32_gb)
  • Same pattern at lines 50, 77, 104, 131, 158, 185
🧰 Tools
🪛 actionlint (1.7.4)

23-23: label "ubuntu-8_cores-32_gb" is unknown. available labels are "windows-latest", "windows-latest-8-cores", "windows-2022", "windows-2019", "ubuntu-latest", "ubuntu-latest-4-cores", "ubuntu-latest-8-cores", "ubuntu-latest-16-cores", "ubuntu-24.04", "ubuntu-22.04", "ubuntu-20.04", "macos-latest", "macos-latest-xl", "macos-latest-xlarge", "macos-latest-large", "macos-15-xlarge", "macos-15-large", "macos-15", "macos-14-xl", "macos-14-xlarge", "macos-14-large", "macos-14", "macos-13-xl", "macos-13-xlarge", "macos-13-large", "macos-13", "macos-12-xl", "macos-12-xlarge", "macos-12-large", "macos-12", "self-hosted", "x64", "arm", "arm64", "linux", "macos", "windows". if it is a custom label for self-hosted runner, set list of labels in actionlint.yaml config file

(runner-label)

api/python/ai/chronon/repo/extract_objects.py (4)

25-41: Simplified from_folder with better parameter alignment

Properly removed unused root_path parameter.


44-65: Enhanced error tracking in V2 implementation

Good addition of structured error handling with target file tracking.


83-103: Improved from_file with better path handling

Now uses dedicated helper functions for path conversion.


106-131: New robust path handling utilities

Good assertions to catch invalid paths early.

api/BUILD.bazel (3)

1-19: Well-structured Thrift generation and Java library setup

Clear dependencies with appropriate visibility.


21-43: Clean Scala library configuration with conditional formatting

Smart conditional formatting based on Scala version.


45-72: Comprehensive test setup

Well-structured test dependencies and suite configuration.

aggregator/src/test/scala/ai/chronon/aggregator/test/FrequentItemsTest.scala (4)

8-17: Proper migration to ScalaTest

Well-executed transition from JUnit to ScalaTest.


83-96: Updated sketch size expectations

Reordered and updated expected sketch size mappings.


166-182: Excellent test data generation helper

Creates realistic skewed data distribution for testing.


184-210: Comprehensive test cases for frequent items behavior

Tests both frequent items and heavy hitters configuration modes.

api/python/ai/chronon/staging_query.py (1)

90-92: Caller-path team detection is brittle
inspect.stack()[1].filename.split("/")[-2] fails on shallow paths & Windows. Consider an explicit team arg or pathlib.Path.
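
A minimal sketch of the suggested alternative, assuming the team name is the caller file's parent directory unless given explicitly. `team_from_path` is a hypothetical helper, not an existing function in `staging_query.py`.

```python
from pathlib import Path
from typing import Optional


def team_from_path(filename: str, team: Optional[str] = None) -> str:
    # An explicit team always wins over path-based inference.
    if team:
        return team
    # Path handles the platform's separator, unlike splitting on "/".
    parent = Path(filename).parent.name
    if not parent:
        raise ValueError(
            f"Cannot infer team from {filename!r}; pass team= explicitly"
        )
    return parent
```

A caller could pass `inspect.stack()[1].filename` as `filename`; failing loudly on shallow paths is a design choice here, preferred over silently returning a wrong team.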

api/python/ai/chronon/cli/compile/compile_context.py (1)

106-113: Dot-split assumes team present.
compiled_obj.name.split(".", 1) will fail if no dot. Guard to avoid ValueError.
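
A sketch of one possible guard, assuming the split result is unpacked into a team and a remainder (which is what raises `ValueError` when no dot is present). `split_team` is a hypothetical helper name.

```python
from typing import Optional, Tuple


def split_team(name: str) -> Tuple[Optional[str], str]:
    # partition() always returns three parts, so unpacking cannot fail.
    team, separator, rest = name.partition(".")
    if not separator:
        # No dot: report a missing team instead of raising.
        return None, name
    return team, rest
```

Returning `None` for the team lets the caller decide whether a missing prefix is an error or acceptable.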

api/python/ai/chronon/repo/constants.py (1)

72-78: Check Spark/Scala mapping.
Spark 3.2.1 is usually built for Scala 2.12, not 2.13. Confirm before releasing.

api/python/ai/chronon/join.py (1)

523-529: Verify ExecutionInfo field names

common.ExecutionInfo is populated with env=env_vars; schema often uses envVars (camel-case). Confirm to avoid lost data.

api/python/ai/chronon/group_by.py (1)

143-146: Default k dropped from 128→20 – please confirm.

Lower k increases error on APPROX_PERCENTILE. Is 20 empirically justified?

esac
done

gcloud storage cp "${ARTIFACT_PREFIX%/}/release/$VERSION/wheels/zipline_ai-$VERSION-py3-none-any.whl" .

🛠️ Refactor suggestion

Add error handling for download.

Missing check if download fails.

-gcloud storage cp "${ARTIFACT_PREFIX%/}/release/$VERSION/wheels/zipline_ai-$VERSION-py3-none-any.whl" .
+if ! gcloud storage cp "${ARTIFACT_PREFIX%/}/release/$VERSION/wheels/zipline_ai-$VERSION-py3-none-any.whl" .; then
+    echo "Error: Failed to download wheel file."
+    exit 1
+fi
🤖 Prompt for AI Agents
In api/python/ai/chronon/resources/gcp/zipline-cli-install.sh at line 48, the
script downloads a file using gcloud storage cp but does not check if the
download succeeds. Add error handling by checking the exit status of the gcloud
command immediately after it runs. If the command fails, output an error message
and exit the script with a non-zero status to prevent further execution.


trap 'rm -f ./zipline_ai-$VERSION-py3-none-any.whl' EXIT

pip3 uninstall zipline-ai

🛠️ Refactor suggestion

Add -y flag to uninstall.

Pip uninstall prompts for confirmation.

-pip3 uninstall zipline-ai
+pip3 uninstall -y zipline-ai || true
🤖 Prompt for AI Agents
In api/python/ai/chronon/resources/gcp/zipline-cli-install.sh at line 52, the
pip3 uninstall command lacks the -y flag, causing it to prompt for confirmation.
Add the -y flag to the pip3 uninstall command to enable automatic confirmation
and prevent the prompt during uninstallation.


pip3 uninstall zipline-ai

pip3 install ./zipline_ai-$VERSION-py3-none-any.whl

🛠️ Refactor suggestion

Add error handling for install.

Missing check if install fails.

-pip3 install ./zipline_ai-$VERSION-py3-none-any.whl
+if ! pip3 install ./zipline_ai-$VERSION-py3-none-any.whl; then
+    echo "Error: Failed to install wheel file."
+    exit 1
+fi
🤖 Prompt for AI Agents
In api/python/ai/chronon/resources/gcp/zipline-cli-install.sh at line 54, the
pip3 install command lacks error handling. Modify the script to check the exit
status of the pip3 install command immediately after it runs, and if the
installation fails (non-zero exit code), output an error message and exit the
script with a failure status to prevent further execution.

Comment on lines 11 to 14
if [ $# -ne 4 ]; then
print_usage
exit 1
fi

⚠️ Potential issue

Fix argument validation logic.

Script requires exactly 4 args but doesn't verify both required params are present.

-if [ $# -ne 4 ]; then
+if [ $# -lt 2 ]; then
     print_usage
     exit 1
 fi
🤖 Prompt for AI Agents
In api/python/ai/chronon/resources/gcp/zipline-cli-install.sh around lines 11 to
14, the argument validation only checks if the number of arguments is not equal
to 4 but does not verify that all required parameters are present. Update the
validation logic to explicitly check that each required argument is provided and
not empty, ensuring the script only proceeds when all four necessary parameters
are correctly passed.

Comment on lines 24 to 28
uses: docker/login-action@v1
with:
registry: ${{ env.REGISTRY }}
username: ${{ github.actor }}
password: ${{ secrets.GITHUB_TOKEN }}

⚠️ Potential issue

Update docker/login-action version

The runner for docker/login-action@v1 is too old for GitHub Actions.

-      - name: Log in to the Container registry
-        uses: docker/login-action@v1
+      - name: Log in to the Container registry
+        uses: docker/login-action@v2

Committable suggestion skipped: line range outside the PR's diff.

🧰 Tools
🪛 actionlint (1.7.4)

24-24: the runner of "docker/login-action@v1" action is too old to run on GitHub Actions. update the action's version to fix this issue

(action)

🤖 Prompt for AI Agents
In .github/workflows/build_and_push_docker.yaml around lines 24 to 28, the
docker/login-action is using version v1, which is outdated for the current
GitHub Actions runner. Update the action version from docker/login-action@v1 to
the latest stable version, such as docker/login-action@v2, to ensure
compatibility and improved functionality.

Comment on lines 185 to 193
- name: Push Jars to s3 Bucket
shell: bash
run: |
set -eo pipefail
aws s3 cp ${{ needs.build_artifacts.outputs.wheel_file }} s3://zipline-artifacts-canary/release/${{ needs.build_artifacts.outputs.version }}/wheels/ --metadata="updated_date=$(date),commit=$(git rev-parse HEAD),branch=$(git rev-parse --abbrev-ref HEAD)"
aws s3 cp flink_assembly_deploy.jar s3://zipline-artifacts-canary/release/${{ needs.build_artifacts.outputs.version }}/jars/flink_assembly_deploy.jar --metadata="updated_date=$(date),commit=$(git rev-parse HEAD),branch=$(git rev-parse --abbrev-ref HEAD)"
aws s3 cp cloud_aws_lib_deploy.jar s3://zipline-artifacts-canary/release/${{ needs.build_artifacts.outputs.version }}/jars/cloud_aws_lib_deploy.jar --metadata="updated_date=$(date),commit=$(git rev-parse HEAD),branch=$(git rev-parse --abbrev-ref HEAD)"
aws s3 cp service_assembly_deploy.jar s3://zipline-artifacts-canary/release/${{ needs.build_artifacts.outputs.version }}/jars/service_assembly_deploy.jar --metadata="updated_date=$(date),commit=$(git rev-parse HEAD),branch=$(git rev-parse --abbrev-ref HEAD)"

⚠️ Potential issue

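Artifact paths may be wrong ⇒ aws s3 cp will fail
When actions/download-artifact downloads several artifacts at once (no name given), it restores each one under a directory named after the artifact (flink-assembly-jar/…).
aws s3 cp flink_assembly_deploy.jar … then fails because the local file does not exist at that path.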

-aws s3 cp flink_assembly_deploy.jar ...
+aws s3 cp "flink-assembly-jar/flink_assembly_deploy.jar" ...

Apply the same prefix for all jar & wheel uploads, and quote variables to silence SC2086.

Committable suggestion skipped: line range outside the PR's diff.

🤖 Prompt for AI Agents
In .github/workflows/push_to_canary.yaml around lines 185 to 193, the aws s3 cp
commands use incorrect artifact paths causing 404 errors because the downloaded
artifacts are inside directories named after the artifact. Fix this by prefixing
the jar and wheel file paths with their respective artifact directory names.
Also, quote all variable expansions in the commands to prevent word splitting
and silence shellcheck SC2086 warnings.

Comment on lines 248 to 258
set -eo pipefail
gcloud storage cp ${{ needs.build_artifacts.outputs.wheel_file }} gs://zipline-artifacts-canary/release/${{ needs.build_artifacts.outputs.version }}/wheels/
gcloud storage objects update gs://zipline-artifacts-canary/release/${{ needs.build_artifacts.outputs.version }}/wheels/${{ needs.build_artifacts.outputs.wheel_file }} --custom-metadata="updated_date=$(date),commit=$(git rev-parse HEAD),branch=$(git rev-parse --abbrev-ref HEAD)"
gcloud storage cp flink_assembly_deploy.jar gs://zipline-artifacts-canary/release/${{ needs.build_artifacts.outputs.version }}/jars/flink_assembly_deploy.jar
gcloud storage objects update gs://zipline-artifacts-canary/release/${{ needs.build_artifacts.outputs.version }}/jars/flink_assembly_deploy.jar --custom-metadata="updated_date=$(date),commit=$(git rev-parse HEAD),branch=$(git rev-parse --abbrev-ref HEAD)"
gcloud storage cp cloud_gcp_lib_deploy.jar gs://zipline-artifacts-canary/release/${{ needs.build_artifacts.outputs.version }}/jars/cloud_gcp_lib_deploy.jar
gcloud storage objects update gs://zipline-artifacts-canary/release/${{ needs.build_artifacts.outputs.version }}/jars/cloud_gcp_lib_deploy.jar --custom-metadata="updated_date=$(date),commit=$(git rev-parse HEAD),branch=$(git rev-parse --abbrev-ref HEAD)"
gcloud storage cp cloud_gcp_embedded_lib_deploy.jar gs://zipline-artifacts-canary/release/${{ needs.build_artifacts.outputs.version }}/jars/cloud_gcp_embedded_lib_deploy.jar
gcloud storage objects update gs://zipline-artifacts-canary/release/${{ needs.build_artifacts.outputs.version }}/jars/cloud_gcp_embedded_lib_deploy.jar --custom-metadata="updated_date=$(date),commit=$(git rev-parse HEAD),branch=$(git rev-parse --abbrev-ref HEAD)"
gcloud storage cp service_assembly_deploy.jar gs://zipline-artifacts-canary/release/${{ needs.build_artifacts.outputs.version }}/jars/service_assembly_deploy.jar
gcloud storage objects update gs://zipline-artifacts-canary/release/${{ needs.build_artifacts.outputs.version }}/jars/service_assembly_deploy.jar --custom-metadata="updated_date=$(date),commit=$(git rev-parse HEAD),branch=$(git rev-parse --abbrev-ref HEAD)"

🛠️ Refactor suggestion

Same missing sub-folder with gcloud storage cp
All gcloud storage cp lines assume flat files. Prefix with the download directory & quote paths.

Example:

-gcloud storage cp flink_assembly_deploy.jar ...
+gcloud storage cp "flink-assembly-jar/flink_assembly_deploy.jar" ...
🤖 Prompt for AI Agents
In .github/workflows/push_to_canary.yaml around lines 248 to 258, the gcloud
storage cp commands do not include the source sub-folder path and do not quote
file paths, which can cause errors if files are not in the current directory or
paths contain spaces. Fix this by prefixing each source file with its download
directory path and enclosing all file paths in quotes to ensure correct file
referencing and handling of spaces.

Comment on lines 252 to 287
def LabelParts(
labels: List[api.JoinPart],
left_start_offset: int,
left_end_offset: int,
label_offline_schedule: str = "@daily",
) -> api.LabelPart:
) -> api.LabelParts:
"""
Used to describe labels in join. Label part can be viewed as regular join part but represent
label data instead of regular feature data. Once labels are mature, label join job would join
labels with features in the training window user specified using `leftStartOffset` and
`leftEndOffset`.
The offsets are relative days compared to given label landing date `label_ds`. This parameter is required to be
passed in for each label join job. For example, given `label_ds = 2023-04-30`, `left_start_offset = 30`, and
`left_end_offset = 10`, the left size start date will be computed as 30 days before `label_ds` (inclusive),
which is 2023-04-01. Similarly, the left end date will be 2023-04-21. Labels will be refreshed within this window
[2023-04-01, 2023-04-21] in this specific label job run.
labels with features in the training window user specified within the label GroupBy-s.
Since label join job will run continuously based on the schedule, multiple labels could be generated but with
different label_ds or label version. Label join job would have all computed label versions available, as well as
a view of latest version for easy label retrieval.
LabelPart definition can be updated along the way, but label join job can only accommodate these changes going
LabelParts definition can be updated along the way, but label join job can only accommodate these changes going
forward unless a backfill is manually triggered.
Label aggregation is also supported but with conditions applied. Single aggregation with one window is allowed
for now. If aggregation is present, we would infer the left_start_offset and left_end_offset same as window size
and the param input will be ignored.
:param labels: List of labels
:param left_start_offset: Relative integer to define the earliest date label should be refreshed
compared to label_ds date specified. For labels with aggregations,
this param has to be same as aggregation window size.
:param left_end_offset: Relative integer to define the most recent date(inclusive) label should be refreshed.
e.g. left_end_offset = 3 most recent label available will be 3 days
prior to 'label_ds' (including `label_ds`). For labels with aggregations, this param
has to be same as aggregation window size.
:param label_offline_schedule: Cron expression for Airflow to schedule a DAG for offline
label join compute tasks
"""

label_metadata = api.MetaData(offlineSchedule=label_offline_schedule)
exec_info = common.ExecutionInfo(
scheduleCron=label_offline_schedule,
)
label_metadata = api.MetaData(executionInfo=exec_info)

for label in labels:
if label.groupBy.aggregations is not None:
assert len(labels) == 1, "Multiple label joinPart is not supported yet"
valid_agg = (
len(label.groupBy.aggregations) == 1
and label.groupBy.aggregations[0].windows is not None
and len(label.groupBy.aggregations[0].windows) == 1
)
assert valid_agg, (
"Too many aggregations or invalid windows found. "
"Single aggregation with one window allowed."
)
valid_time_unit = (
label.groupBy.aggregations[0].windows[0].timeUnit == api.TimeUnit.DAYS
)
assert valid_time_unit, "Label aggregation window unit must be DAYS"
window_size = label.groupBy.aggregations[0].windows[0].length
if left_start_offset != window_size or left_start_offset != left_end_offset:
assert (
left_start_offset == window_size and left_end_offset == window_size
), (
# f-strings so that {window_size} is actually interpolated into the message
f"left_start_offset and left_end_offset will be inferred to be same as aggregation "
f"window {window_size} and the incorrect values will be ignored. "
)

return api.LabelPart(
return api.LabelParts(
labels=labels,
leftStartOffset=left_start_offset,
leftEndOffset=left_end_offset,
metaData=label_metadata,
)

⚠️ Potential issue

left_start_offset / left_end_offset silently ignored

The parameters remain in the signature & docstring but are never used, so callers believe offsets are honored while they’re discarded. Either accept + store them or delete from the API.

🤖 Prompt for AI Agents
In api/python/ai/chronon/join.py between lines 252 and 287, the parameters
left_start_offset and left_end_offset are present in the function signature and
docstring but are not used in the function body, causing confusion. To fix this,
either remove these parameters from the function signature and update the
docstring accordingly, or modify the function to accept and store these offset
values properly, ensuring they are utilized or reflected in the returned
api.LabelParts object or related metadata.
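
For reference, the offset arithmetic that the docstring describes can be reproduced with plain date math. This is only a sketch, not code from the PR: `label_window` is a hypothetical helper, and the "go back offset minus one days" rule is inferred from the docstring's own worked example (`label_ds = 2023-04-30` with offsets 30 and 10 giving the window [2023-04-01, 2023-04-21]).

```python
from datetime import date, timedelta
from typing import Tuple


def label_window(
    label_ds: date, left_start_offset: int, left_end_offset: int
) -> Tuple[date, date]:
    """Hypothetical helper: compute the inclusive refresh window for a label job.

    Assumption: offsets count days inclusively, so an offset of N covers
    N days ending on label_ds, i.e. the boundary is N - 1 days before it.
    """
    start = label_ds - timedelta(days=left_start_offset - 1)
    end = label_ds - timedelta(days=left_end_offset - 1)
    return start, end


# Values taken from the docstring's example.
window = label_window(date(2023, 4, 30), 30, 10)
```

If the offsets are dropped from the API as the comment suggests, callers relying on this window would need it derived elsewhere, for example from the aggregation window size.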

Comment on lines 200 to 206
"""
assert name != "contextual", "Please use `ContextualSource`"
return api.ExternalSource(
metadata=api.MetaData(name=name, team=team, customJson=custom_json),
metadata=api.MetaData(name=name, team=team),
keySchema=DataType.STRUCT(f"ext_{name}_keys", *key_fields),
valueSchema=DataType.STRUCT(f"ext_{name}_values", *value_fields),
)

⚠️ Potential issue

Field name typo breaks Thrift serialization

api.ExternalSource expects metaData, not metadata; current code silently drops the metadata, making source.metaData.name later None and crashing duplicate checks.

-        metadata=api.MetaData(name=name, team=team),
+        metaData=api.MetaData(name=name, team=team),
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
"""
assert name != "contextual", "Please use `ContextualSource`"
return api.ExternalSource(
metadata=api.MetaData(name=name, team=team, customJson=custom_json),
metadata=api.MetaData(name=name, team=team),
keySchema=DataType.STRUCT(f"ext_{name}_keys", *key_fields),
valueSchema=DataType.STRUCT(f"ext_{name}_values", *value_fields),
)
"""
assert name != "contextual", "Please use `ContextualSource`"
return api.ExternalSource(
metaData=api.MetaData(name=name, team=team),
keySchema=DataType.STRUCT(f"ext_{name}_keys", *key_fields),
valueSchema=DataType.STRUCT(f"ext_{name}_values", *value_fields),
)
🤖 Prompt for AI Agents
In api/python/ai/chronon/join.py around lines 200 to 206, the argument name for
the metadata parameter in the api.ExternalSource constructor is incorrectly
written as "metadata" instead of the expected "metaData". This typo causes the
metadata to be ignored, leading to None values and crashes later. Fix this by
renaming the argument from "metadata" to "metaData" to ensure proper
serialization and avoid runtime errors.
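
A missing `metaData` can also be made to fail early with a clear message instead of crashing inside a later duplicate check. The sketch below uses stand-in dataclasses, not the real Thrift-generated `api.MetaData` and `api.ExternalSource` (their constructor behavior is not shown in this thread), and `source_names` is a hypothetical helper.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class MetaData:  # stand-in for api.MetaData (assumption)
    name: Optional[str] = None
    team: Optional[str] = None


@dataclass
class ExternalSource:  # stand-in for api.ExternalSource (assumption)
    metaData: Optional[MetaData] = None


def source_names(sources: List[ExternalSource]) -> List[str]:
    """Collect source names, failing loudly when metaData or its name is unset."""
    names = []
    for index, source in enumerate(sources):
        if source.metaData is None or source.metaData.name is None:
            raise ValueError(f"external source #{index} has no metaData.name")
        names.append(source.metaData.name)
    return names


good = ExternalSource(metaData=MetaData(name="payments", team="risk"))
names = source_names([good])

# A source built without metaData is reported by position, not by a later crash.
try:
    source_names([good, ExternalSource()])
    error = None
except ValueError as exc:
    error = str(exc)
```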

Comment on lines 150 to 166
pre_derived_columns: Set[str], derivations: List[Derivation]
) -> List[str]:
"""
Build the derived columns from pre-derived columns and derivations.
"""
# if derivations contain star, then all columns are included except the columns which are renamed
output_columns = pre_derived_columns
if derivations:
found = any(derivation.expression == "*" for derivation in derivations)
if not found:
output_columns.clear()
for derivation in derivations:
if found and is_identifier(derivation.expression):
output_columns.remove(derivation.expression)
if derivation.name != "*":
output_columns.add(derivation.name)
return list(output_columns)

🛠️ Refactor suggestion

Don’t mutate caller’s set

output_columns = pre_derived_columns shares the original set; later .clear() wipes data upstream. Copy first.

-    output_columns = pre_derived_columns
+    output_columns = set(pre_derived_columns)  # local copy
📝 Committable suggestion

Suggested change
pre_derived_columns: Set[str], derivations: List[Derivation]
) -> List[str]:
"""
Build the derived columns from pre-derived columns and derivations.
"""
# if derivations contain star, then all columns are included except the columns which are renamed
output_columns = pre_derived_columns
if derivations:
found = any(derivation.expression == "*" for derivation in derivations)
if not found:
output_columns.clear()
for derivation in derivations:
if found and is_identifier(derivation.expression):
output_columns.remove(derivation.expression)
if derivation.name != "*":
output_columns.add(derivation.name)
return list(output_columns)
pre_derived_columns: Set[str], derivations: List[Derivation]
) -> List[str]:
"""
Build the derived columns from pre-derived columns and derivations.
"""
# if derivations contain star, then all columns are included except the columns which are renamed
output_columns = set(pre_derived_columns) # local copy
if derivations:
found = any(derivation.expression == "*" for derivation in derivations)
if not found:
output_columns.clear()
for derivation in derivations:
if found and is_identifier(derivation.expression):
output_columns.remove(derivation.expression)
if derivation.name != "*":
output_columns.add(derivation.name)
return list(output_columns)
🤖 Prompt for AI Agents
In api/python/ai/chronon/cli/compile/conf_validator.py around lines 150 to 166,
the variable output_columns is assigned directly from pre_derived_columns,
causing mutations like clear() to affect the caller's set. To fix this, create a
copy of pre_derived_columns when assigning to output_columns to avoid mutating
the original set passed by the caller.
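
The aliasing problem described above can be reproduced in isolation. The following sketch uses two simplified, hypothetical functions (`derive_aliasing` and `derive_copying`) that stand in for the real validator logic; they only illustrate how assigning a set without copying exposes the caller to `.clear()`.

```python
from typing import List, Set


def derive_aliasing(columns: Set[str], new_columns: List[str]) -> List[str]:
    output = columns  # same object as the caller's set
    output.clear()  # wipes the caller's data too
    output.update(new_columns)
    return sorted(output)


def derive_copying(columns: Set[str], new_columns: List[str]) -> List[str]:
    output = set(columns)  # local copy; the caller's set is untouched
    output.clear()
    output.update(new_columns)
    return sorted(output)


caller_columns = {"a", "b"}

derive_copying(caller_columns, ["c"])
after_copy = set(caller_columns)  # snapshot after the copying variant

derive_aliasing(caller_columns, ["c"])
after_alias = set(caller_columns)  # snapshot after the aliasing variant
```

After the copying variant the caller still holds its original columns, while the aliasing variant leaves it holding only the derived ones. Separately, `set.remove` raises `KeyError` when the element is absent, so `set.discard` may be worth considering where a renamed column is not guaranteed to be present.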


10 participants