Changes needed to support check-partitions verb #615
Conversation
This reverts commit 12e09ef.
Walkthrough

This set of changes enhances configuration handling and metadata compilation across both Python and Scala components. Updates include safer and more flexible argument construction for runners, improved parsing and merging of team metadata, and new support for compiling team metadata objects. Scala utilities gain a new method for partition checks, and argument parsing in job submission is made more robust. Minor import reorganizations and type annotation improvements are also present.
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant Compiler
    participant CompileContext
    participant MetaData
    participant Serializer
    Compiler->>CompileContext: Get teams_dict
    loop For each team
        Compiler->>MetaData: Create MetaData object
        Compiler->>MetaData: merge_team_execution_info
        Compiler->>Serializer: Serialize MetaData
        Compiler->>Compiler: Write CompiledObj
    end
    Compiler->>CompileContext: Update compile status with team metadata
```

```mermaid
sequenceDiagram
    participant DefaultRunner
    participant GcpRunner
    participant JobSubmitter
    DefaultRunner->>DefaultRunner: _gen_final_args (builds args safely)
    GcpRunner->>GcpRunner: generate_dataproc_submitter_args (uploads local files if needed)
    GcpRunner->>JobSubmitter: Submit job with constructed args
```
Actionable comments posted: 0
🧹 Nitpick comments (1)
spark/src/main/scala/ai/chronon/spark/Driver.scala (1)
1008-1010: Enhanced debugging in partition checking. Added a variable and logging to better understand partition state during checks, supporting the check-partitions verb described in the PR objectives.
Consider using a more descriptive logging format if the partitions list could be large:

```diff
- logger.info("Current partitions: " + currentPartitions.mkString(", "))
+ logger.info(s"Found ${currentPartitions.size} partitions for table $tbl: " +
+   currentPartitions.take(10).mkString(", ") +
+   (if (currentPartitions.size > 10) "..." else ""))
```
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro (Legacy)
📒 Files selected for processing (1)
spark/src/main/scala/ai/chronon/spark/Driver.scala (2 hunks)
🧰 Additional context used
🧬 Code Graph Analysis (1)
spark/src/main/scala/ai/chronon/spark/Driver.scala (4)
- spark/src/main/scala/ai/chronon/spark/stats/drift/Summarizer.scala (3): spark (264-296), Summarizer (38-297), Summarizer (353-394)
- spark/src/main/scala/ai/chronon/spark/stats/CompareBaseJob.scala (1): CompareBaseJob (31-185)
- spark/src/main/scala/ai/chronon/spark/stats/CompareJob.scala (2): CompareJob (41-113), CompareJob (115-184)
- spark/src/main/scala/ai/chronon/spark/TableUtils.scala (2): sql (298-326), allPartitions (137-159)
⏰ Context from checks skipped due to timeout of 90000ms (18)
- GitHub Check: streaming_tests
- GitHub Check: analyzer_tests
- GitHub Check: join_tests
- GitHub Check: streaming_tests
- GitHub Check: groupby_tests
- GitHub Check: groupby_tests
- GitHub Check: analyzer_tests
- GitHub Check: fetcher_tests
- GitHub Check: fetcher_tests
- GitHub Check: batch_tests
- GitHub Check: join_tests
- GitHub Check: spark_tests
- GitHub Check: spark_tests
- GitHub Check: batch_tests
- GitHub Check: scala_compile_fmt_fix
- GitHub Check: non_spark_tests
- GitHub Check: non_spark_tests
- GitHub Check: enforce_triggered_workflows
🔇 Additional comments (4)
spark/src/main/scala/ai/chronon/spark/Driver.scala (4)
30-31: Improved import organization. Consolidated related imports for better readability.
36-41: Better import organization for StreamingQueryListener. Expanded imports with a structured format.
44-45: Organized imports for scallop and slf4j. Better structured for maintainability.
49-53: Improved organization of utility imports. Expanded imports to explicitly list required classes.
```diff
- if self.conf:
+ if (self.conf
+     and (self.mode != "metastore" and "check-partitions" in args)):  # TODO: don't check for metastore
```
just leave out the check-partitions arg. Anything in metastore can prob just skip the regular checking.
```bash
done

if [[ -n $(git diff HEAD) ]]; then
```
we removing this?
let me put that back
Actionable comments posted: 0
🧹 Nitpick comments (1)
spark/src/main/scala/ai/chronon/spark/TableUtils.scala (1)
136-152: Added partition existence check method. New method checks if a partition exists in a table, handling different table formats appropriately.
This implementation:
- Returns false early if table isn't reachable
- Uses different check logic for Iceberg vs other formats
- Handles error cases properly
Consider adding documentation explaining the different behavior for Iceberg tables, where only column keys are checked, versus other formats, where exact specification matching is performed (a sketch of this behavior follows).
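To make the suggested documentation concrete, here is a minimal Scala sketch of the behavior this review describes; the `Format` cases and the helper functions (`tableFormat`, `tableReachable`, `partitionColumns`, `partitionSpecs`) are hypothetical stand-ins for illustration, not the actual TableUtils API.

```scala
import scala.util.control.NonFatal

// Hypothetical stand-ins for illustration; not the real Chronon types.
sealed trait Format
case object Iceberg extends Format
case object Hive extends Format

class PartitionChecker(
    tableFormat: String => Format,                      // resolve a table's storage format
    tableReachable: String => Boolean,                  // can the table be described at all?
    partitionColumns: String => Seq[String],            // partition column names
    partitionSpecs: String => Seq[Map[String, String]]  // concrete partition specs
) {

  /** True if `spec` exists in `tableName`.
    * Iceberg: only the spec's keys are validated against partition columns.
    * Other formats: the spec must match an existing partition exactly.
    */
  def partitionExists(tableName: String, spec: Map[String, String]): Boolean = {
    if (!tableReachable(tableName)) return false // unreachable table => treat as absent
    try {
      tableFormat(tableName) match {
        case Iceberg => spec.keySet.subsetOf(partitionColumns(tableName).toSet)
        case _       => partitionSpecs(tableName).contains(spec)
      }
    } catch {
      case NonFatal(_) => false // lookup failures are reported as "partition absent"
    }
  }
}
```

A caller would then write something like `checker.partitionExists("search.beacon_main_v2", Map("_DATE" -> "2025-04-06", "_HOUR" -> "23"))`, mirroring the partition-names syntax from the PR summary.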
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro (Legacy)
📒 Files selected for processing (1)
spark/src/main/scala/ai/chronon/spark/TableUtils.scala (2 hunks)
⏰ Context from checks skipped due to timeout of 90000ms (19)
- GitHub Check: streaming_tests
- GitHub Check: streaming_tests
- GitHub Check: join_tests
- GitHub Check: join_tests
- GitHub Check: groupby_tests
- GitHub Check: groupby_tests
- GitHub Check: fetcher_tests
- GitHub Check: analyzer_tests
- GitHub Check: analyzer_tests
- GitHub Check: fetcher_tests
- GitHub Check: spark_tests
- GitHub Check: scala_compile_fmt_fix
- GitHub Check: spark_tests
- GitHub Check: batch_tests
- GitHub Check: non_spark_tests
- GitHub Check: python_tests
- GitHub Check: non_spark_tests
- GitHub Check: batch_tests
- GitHub Check: enforce_triggered_workflows
🔇 Additional comments (1)
spark/src/main/scala/ai/chronon/spark/TableUtils.scala (1)
25-25: Modified import to include required format classes. Import statement updated to include FormatProvider and Iceberg for the new partition-checking functionality.
```scala
format match {
  case Iceberg => {
    partitionSpec.keySet.subsetOf(this.partitions(tableName).toSet)
```
let's not worry about the Iceberg case, or follow up with it in another PR. We should just support this in the format itself.
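As a rough sketch of that follow-up, under the assumption (not confirmed by this PR) that formats are modeled as objects implementing a shared trait, the check could live on the format itself so callers stop branching on Iceberg:

```scala
// Hypothetical shape only; the real Format abstraction may differ.
trait TableFormat {
  def partitionExists(spec: Map[String, String],
                      partitionColumns: => Seq[String],
                      partitionSpecs: => Seq[Map[String, String]]): Boolean
}

object IcebergFormat extends TableFormat {
  // Iceberg: validating the spec's keys against the partition columns suffices.
  def partitionExists(spec: Map[String, String],
                      partitionColumns: => Seq[String],
                      partitionSpecs: => Seq[Map[String, String]]): Boolean =
    spec.keySet.subsetOf(partitionColumns.toSet)
}

object HiveFormat extends TableFormat {
  // Hive-style tables: require an exact partition-spec match.
  def partitionExists(spec: Map[String, String],
                      partitionColumns: => Seq[String],
                      partitionSpecs: => Seq[Map[String, String]]): Boolean =
    partitionSpecs.contains(spec)
}
```

The by-name parameters keep the lookups lazy: IcebergFormat never lists concrete partitions, and HiveFormat never forces the column-name lookup.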
## Summary

Tested on etsy:

```
uv run zipline run --mode metastore check-partitions --partition-names=search.beacon_main_v2/_DATE=2025-04-06/_HOUR=23 --conf airflow_conf/airflow/search/common_conf
```

## Checklist

- [ ] Added Unit Tests
- [ ] Covered by existing CI
- [ ] Integration tested
- [ ] Documentation update

## Summary by CodeRabbit

- **New Features**
  - Added support for compiling and handling team metadata files without standard metadata attributes.
  - Introduced a method to check if specific partitions exist in tables, improving partition validation.
  - Added merging functionality for team execution information into metadata during compilation.
- **Improvements**
  - Enhanced robustness and flexibility in configuration and runtime environment handling, including safer argument parsing and improved error handling.
  - Streamlined logic for uploading local files and generating command-line arguments for job submission.
  - Refined logic for determining configuration properties and partition checks in Spark utilities.
  - Improved command-line argument construction and safer configuration retrieval for job submission.
- **Chores**
  - Updated default values and configuration constants to reflect new handling logic and avoid unnecessary defaults.