Conversation

@david-zlai (Contributor) commented Apr 9, 2025

Summary

Tested on etsy:

```
uv run zipline run --mode metastore check-partitions --partition-names=search.beacon_main_v2/_DATE=2025-04-06/_HOUR=23 --conf airflow_conf/airflow/search/common_conf
```

Checklist

  • Added Unit Tests
  • Covered by existing CI
  • Integration tested
  • Documentation update

Summary by CodeRabbit

  • New Features

    • Added support for compiling and handling team metadata files without standard metadata attributes.
    • Introduced a method to check if specific partitions exist in tables, improving partition validation.
    • Added merging functionality for team execution information into metadata during compilation.
  • Improvements

    • Enhanced robustness and flexibility in configuration and runtime environment handling, including safer argument parsing and improved error handling.
    • Streamlined logic for uploading local files and generating command-line arguments for job submission.
    • Refined logic for determining configuration properties and partition checks in Spark utilities.
    • Improved command-line argument construction and safer configuration retrieval for job submission.
  • Chores

    • Updated default values and configuration constants to reflect new handling logic and avoid unnecessary defaults.

@coderabbitai bot (Contributor) commented Apr 9, 2025

Walkthrough

This set of changes enhances configuration handling and metadata compilation across both Python and Scala components. Updates include safer and more flexible argument construction for runners, improved parsing and merging of team metadata, and new support for compiling team metadata objects. Scala utilities gain a new method for partition checks, and argument parsing in job submission is made more robust. Minor import reorganizations and type annotation improvements are also present.

Changes

  • api/python/ai/chronon/repo/default_runner.py, api/python/ai/chronon/repo/constants.py, api/python/ai/chronon/repo/gcp.py, api/python/ai/chronon/repo/run.py: Refactored runner argument construction for safety and flexibility; removed the default conf_type; updated how Dataproc submitter arguments are generated and handled; changed MODE_ARGS for metastore mode to be handled elsewhere.
  • api/python/ai/chronon/repo/utils.py: Improved runtime environment setup with more robust extraction of metadata, safer error handling, and streamlined logging.
  • api/python/ai/chronon/cli/compile/compile_context.py: Added support for team metadata configs without standard attributes; updated type annotations; improved parsing of configuration files and output path logic.
  • api/python/ai/chronon/cli/compile/compiler.py: Added a _compile_team_metadata method to compile and serialize team metadata; integrated this step into the overall compilation process.
  • api/python/ai/chronon/cli/compile/parse_teams.py: Introduced a merge_team_execution_info function for modular merging of team execution info into metadata; refactored code for clarity and maintainability.
  • spark/src/main/scala/ai/chronon/spark/submission/JobSubmitter.scala: Enhanced config property extraction: added debug logs, refined logic for parsing and handling confType and originalMode, and improved error handling for metadata extraction.
  • spark/src/main/scala/ai/chronon/spark/TableUtils.scala: Added a containsPartitions method for checking if a partition spec exists in a table, with special handling for the Iceberg format.
  • spark/src/main/scala/ai/chronon/spark/Driver.scala: Consolidated imports; replaced partition check logic with the new containsPartitions method and added logging.

Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant Compiler
    participant CompileContext
    participant MetaData
    participant Serializer

    Compiler->>CompileContext: Get teams_dict
    loop For each team
        Compiler->>MetaData: Create MetaData object
        Compiler->>MetaData: merge_team_execution_info
        Compiler->>Serializer: Serialize MetaData
        Compiler->>Compiler: Write CompiledObj
    end
    Compiler->>CompileContext: Update compile status with team metadata
```

```mermaid
sequenceDiagram
    participant DefaultRunner
    participant GcpRunner
    participant JobSubmitter

    DefaultRunner->>DefaultRunner: _gen_final_args (builds args safely)
    GcpRunner->>GcpRunner: generate_dataproc_submitter_args (uploads local files if needed)
    GcpRunner->>JobSubmitter: Submit job with constructed args
```

Possibly related PRs

  • zipline-ai/chronon#549: Refactors how configuration type and paths are parsed and passed in job submission, closely related to the changes in argument handling and config extraction in this PR.
  • zipline-ai/chronon#597: Modifies partition checking logic in the same run method, related to partition presence verification changes here.
  • zipline-ai/chronon#613: Fixes mode config logic in job submission, related to config parsing updates in JobSubmitter.scala.

Suggested reviewers

  • nikhil-zlai
  • piyush-zlai

Poem

In lines of code both old and new,
Metadata finds a broader view.
Teams now compile with grace and flair,
Partition checks are handled with care.
Arguments built with safety in mind,
Robust and tidy, all aligned.
🎉 Cheers to progress—onward we go! 🚀

Warning

Review ran into problems

🔥 Problems

GitHub Actions and Pipeline Checks: Resource not accessible by integration - https://docs.github.com/rest/actions/workflow-runs#list-workflow-runs-for-a-repository.

Please grant the required permissions to the CodeRabbit GitHub App under the organization or repository settings.


📜 Recent review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro (Legacy)

📥 Commits

Reviewing files that changed from the base of the PR and between 5ab3456 and edc6c98.

📒 Files selected for processing (1)
  • spark/src/main/scala/ai/chronon/spark/submission/JobSubmitter.scala (2 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • spark/src/main/scala/ai/chronon/spark/submission/JobSubmitter.scala
⏰ Context from checks skipped due to timeout of 90000ms (16)
  • GitHub Check: streaming_tests
  • GitHub Check: spark_tests
  • GitHub Check: join_tests
  • GitHub Check: analyzer_tests
  • GitHub Check: groupby_tests
  • GitHub Check: fetcher_tests
  • GitHub Check: batch_tests
  • GitHub Check: streaming_tests
  • GitHub Check: join_tests
  • GitHub Check: analyzer_tests
  • GitHub Check: spark_tests
  • GitHub Check: scala_compile_fmt_fix
  • GitHub Check: non_spark_tests
  • GitHub Check: non_spark_tests
  • GitHub Check: enforce_triggered_workflows
  • GitHub Check: python_tests


@coderabbitai bot (Contributor) left a comment

Actionable comments posted: 0

🧹 Nitpick comments (1)
spark/src/main/scala/ai/chronon/spark/Driver.scala (1)

1008-1010: Enhanced debugging in partition checking.

Added variable and logging to better understand partition state during checks, supporting the check-partitions verb indicated in PR objectives.

Consider using a more descriptive logging format if the partitions list could be large:

```diff
-        logger.info("Current partitions: " + currentPartitions.mkString(", "))
+        logger.info(s"Found ${currentPartitions.size} partitions for table $tbl: " +
+            currentPartitions.take(10).mkString(", ") +
+            (if (currentPartitions.size > 10) "..." else ""))
```
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro (Legacy)

📥 Commits

Reviewing files that changed from the base of the PR and between 6e279cb and 00825de.

📒 Files selected for processing (1)
  • spark/src/main/scala/ai/chronon/spark/Driver.scala (2 hunks)
🧰 Additional context used
🧬 Code Graph Analysis (1)
spark/src/main/scala/ai/chronon/spark/Driver.scala (4)
spark/src/main/scala/ai/chronon/spark/stats/drift/Summarizer.scala (3)
  • spark (264-296)
  • Summarizer (38-297)
  • Summarizer (353-394)
spark/src/main/scala/ai/chronon/spark/stats/CompareBaseJob.scala (1)
  • CompareBaseJob (31-185)
spark/src/main/scala/ai/chronon/spark/stats/CompareJob.scala (2)
  • CompareJob (41-113)
  • CompareJob (115-184)
spark/src/main/scala/ai/chronon/spark/TableUtils.scala (2)
  • sql (298-326)
  • allPartitions (137-159)
⏰ Context from checks skipped due to timeout of 90000ms (18)
  • GitHub Check: streaming_tests
  • GitHub Check: analyzer_tests
  • GitHub Check: join_tests
  • GitHub Check: streaming_tests
  • GitHub Check: groupby_tests
  • GitHub Check: groupby_tests
  • GitHub Check: analyzer_tests
  • GitHub Check: fetcher_tests
  • GitHub Check: fetcher_tests
  • GitHub Check: batch_tests
  • GitHub Check: join_tests
  • GitHub Check: spark_tests
  • GitHub Check: spark_tests
  • GitHub Check: batch_tests
  • GitHub Check: scala_compile_fmt_fix
  • GitHub Check: non_spark_tests
  • GitHub Check: non_spark_tests
  • GitHub Check: enforce_triggered_workflows
🔇 Additional comments (4)
spark/src/main/scala/ai/chronon/spark/Driver.scala (4)

30-31: Improved import organization.

Consolidated related imports for better readability.


36-41: Better import organization for StreamingQueryListener.

Expanded imports with a structured format.


44-45: Organized imports for scallop and slf4j.

Better structured for maintainability.


49-53: Improved organization of utility imports.

Expanded imports to explicitly list required classes.


```diff
-        if self.conf:
+        if (self.conf
+            and (self.mode != "metastore" and "check-partitions" in args)):  # TODO: don't check for metastore
```
Collaborator commented:

just leave out the check-partitions arg. Anything in metastore can prob just skip the regular checking.

done


```bash
if [[ -n $(git diff HEAD) ]]; then
```
Collaborator commented:

we removing this?

@david-zlai (Author) replied:

let me put that back

@coderabbitai bot (Contributor) left a comment

Actionable comments posted: 0

🧹 Nitpick comments (1)
spark/src/main/scala/ai/chronon/spark/TableUtils.scala (1)

136-152: Added partition existence check method.

New method checks if a partition exists in a table, handling different table formats appropriately.

This implementation:

  • Returns false early if the table isn't reachable
  • Uses different check logic for Iceberg vs other formats
  • Handles error cases properly

Consider documenting the behavioral difference: for Iceberg tables only the partition-column keys are checked, while for other formats the exact partition specification must match.
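
For illustration, a minimal standalone sketch of such a check follows; the SHOW PARTITIONS parsing, the `isIceberg` flag, and all names are assumptions for the example, not the actual TableUtils code:

```scala
import org.apache.spark.sql.SparkSession

// Illustrative sketch only: names and the SHOW PARTITIONS parsing are
// assumptions, not the real TableUtils implementation.
object PartitionCheckSketch {

  // Parse "k1=v1/k2=v2" partition strings into maps (Hive-style layout assumed).
  private def allPartitions(tableName: String)(implicit spark: SparkSession): Seq[Map[String, String]] =
    spark.sql(s"SHOW PARTITIONS $tableName").collect().toSeq.map { row =>
      row.getString(0).split("/").map { kv =>
        val Array(k, v) = kv.split("=", 2)
        k -> v
      }.toMap
    }

  def containsPartitions(tableName: String,
                         partitionSpec: Map[String, String],
                         isIceberg: Boolean)(implicit spark: SparkSession): Boolean = {
    // Early exit: an unreachable table simply reports the partition as absent.
    if (!spark.catalog.tableExists(tableName)) return false

    if (isIceberg) {
      // Iceberg path: only verify that the requested keys are partition columns.
      val partitionCols = spark.catalog
        .listColumns(tableName).collect().filter(_.isPartition).map(_.name).toSet
      partitionSpec.keySet.subsetOf(partitionCols)
    } else {
      // Other formats: require a concrete partition matching every k=v pair.
      allPartitions(tableName).exists(p => partitionSpec.toSet.subsetOf(p.toSet))
    }
  }
}
```

A call like `containsPartitions("search.beacon_main_v2", Map("_DATE" -> "2025-04-06", "_HOUR" -> "23"), isIceberg = false)` would mirror the check-partitions invocation from the PR summary.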

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro (Legacy)

📥 Commits

Reviewing files that changed from the base of the PR and between d68da5c and 777b4c6.

📒 Files selected for processing (1)
  • spark/src/main/scala/ai/chronon/spark/TableUtils.scala (2 hunks)
⏰ Context from checks skipped due to timeout of 90000ms (19)
  • GitHub Check: streaming_tests
  • GitHub Check: streaming_tests
  • GitHub Check: join_tests
  • GitHub Check: join_tests
  • GitHub Check: groupby_tests
  • GitHub Check: groupby_tests
  • GitHub Check: fetcher_tests
  • GitHub Check: analyzer_tests
  • GitHub Check: analyzer_tests
  • GitHub Check: fetcher_tests
  • GitHub Check: spark_tests
  • GitHub Check: scala_compile_fmt_fix
  • GitHub Check: spark_tests
  • GitHub Check: batch_tests
  • GitHub Check: non_spark_tests
  • GitHub Check: python_tests
  • GitHub Check: non_spark_tests
  • GitHub Check: batch_tests
  • GitHub Check: enforce_triggered_workflows
🔇 Additional comments (1)
spark/src/main/scala/ai/chronon/spark/TableUtils.scala (1)

25-25: Modified import to include required format classes.

Import statement updated to include FormatProvider and Iceberg for the new partition checking functionality.


```scala
format match {
  case Iceberg => {
    partitionSpec.keySet.subsetOf(this.partitions(tableName).toSet)
```
Collaborator commented:

let's not worry about the Iceberg case, or follow up with it in another PR. We should just support this in the format itself.
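
Read as a hypothetical sketch (the trait shape and method names are illustrative, not the repo's actual Format API), that suggestion could look like:

```scala
// Hypothetical sketch: each format owns its notion of partition containment,
// so TableUtils needs no per-format branching. Stub bodies (???) mark the
// format-specific logic a real implementation would fill in.
trait Format {
  def containsPartitions(tableName: String, partitionSpec: Map[String, String]): Boolean
}

object HiveFormat extends Format {
  // Exact match of the full spec against the table's materialized partitions.
  def containsPartitions(tableName: String, partitionSpec: Map[String, String]): Boolean = ???
}

object IcebergFormat extends Format {
  // Key-level check against the table's partition columns.
  def containsPartitions(tableName: String, partitionSpec: Map[String, String]): Boolean = ???
}
```

Dispatching through the format would give each backend a single place to define what "contains" means.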

@david-zlai merged commit 7ab3c3b into main on Apr 15, 2025 (22 checks passed).
@david-zlai deleted the davidhan/support_non_object_confs branch on April 15, 2025 at 14:59.
kumar-zlai pushed a commit that referenced this pull request on Apr 25, 2025.
kumar-zlai pushed a commit that referenced this pull request on Apr 29, 2025.
chewy-zlai pushed a commit that referenced this pull request on May 15, 2025.
chewy-zlai pushed a commit that referenced this pull request on May 15, 2025.
chewy-zlai pushed a commit that referenced this pull request on May 16, 2025.