spark opt: turn off time range check by default and select only relevant cols from left #577
Actionable comments posted: 0
🧹 Nitpick comments (1)
spark/src/main/scala/ai/chronon/spark/JoinPartJob.scala (1)
170-177: Feature toggle for time range checking. This conditional logic implements the PR objective of turning off time range checks by default, which should improve performance but may affect accuracy in some cases.
Consider adding a comment explaining when time range checking should be enabled vs. disabled, for better maintainability.
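The toggle discussed above could look roughly like the following. This is a minimal sketch, not the PR's actual implementation: the config key is taken from the MutationsTest setup cited later in this review, while the object and method names, and the `false` default, are assumptions consistent with the PR's goal of disabling the check unless explicitly enabled.

```scala
// Hypothetical sketch of a config-gated time range check.
// The config key appears in MutationsTest.scala; defaulting to "false"
// matches the PR objective of turning the check off by default.
object TimeRangeCheck {
  val CheckLeftTimeRangeKey = "spark.chronon.join.backfill.check.left_time_range"

  // Returns true only when the user has explicitly opted in.
  def shouldCheckLeftTimeRange(conf: Map[String, String]): Boolean =
    conf.getOrElse(CheckLeftTimeRangeKey, "false").toBoolean
}
```

A maintainer comment at the call site could then state the trade-off directly: enabling the check validates that the left side's timestamps fall inside the backfill range (safer, slower); disabling it skips that scan (faster, assumes upstream data is well-partitioned).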
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro (Legacy)
📒 Files selected for processing (3)
- spark/src/main/scala/ai/chronon/spark/JoinPartJob.scala (7 hunks)
- spark/src/main/scala/ai/chronon/spark/JoinUtils.scala (0 hunks)
- spark/src/test/scala/ai/chronon/spark/test/streaming/MutationsTest.scala (2 hunks)
💤 Files with no reviewable changes (1)
- spark/src/main/scala/ai/chronon/spark/JoinUtils.scala
🧰 Additional context used
🧬 Code Definitions (1)
spark/src/main/scala/ai/chronon/spark/JoinPartJob.scala (3)
api/src/main/scala/ai/chronon/api/Extensions.scala (5)
- rightToLeft (744-754)
- query (337-345)
- toPartitionRange (1244-1248)
- dataModel (320-325)
- dataModel (456-463)

spark/src/main/scala/ai/chronon/spark/TableUtils.scala (1)
- scanDf (593-613)

spark/src/main/scala/ai/chronon/spark/JoinUtils.scala (2)
- JoinUtils (37-618)
- genBloomFilterIfNeeded (333-368)
⏰ Context from checks skipped due to timeout of 90000ms (16)
- GitHub Check: streaming_tests
- GitHub Check: join_tests
- GitHub Check: groupby_tests
- GitHub Check: analyzer_tests
- GitHub Check: non_spark_tests
- GitHub Check: spark_tests
- GitHub Check: fetcher_tests
- GitHub Check: non_spark_tests
- GitHub Check: scala_compile_fmt_fix
- GitHub Check: streaming_tests
- GitHub Check: fetcher_tests
- GitHub Check: groupby_tests
- GitHub Check: analyzer_tests
- GitHub Check: join_tests
- GitHub Check: spark_tests
- GitHub Check: enforce_triggered_workflows
🔇 Additional comments (8)
spark/src/test/scala/ai/chronon/spark/test/streaming/MutationsTest.scala (2)
38-42: Configuration setup looks good. The SparkSession configuration sets "spark.chronon.join.backfill.check.left_time_range" to "true", which aligns with the changes in JoinPartJob.scala.
827-827: Line formatting looks fine. Simple whitespace addition that improves readability.
spark/src/main/scala/ai/chronon/spark/JoinPartJob.scala (6)
5-6: Import statements updated appropriately. Added the toTimeRange function import and the Builders class to support the new implementation.
46-53: Efficiency improvement by selecting only relevant columns. Instead of fetching all columns, the code now selects only the columns needed for the join operation, which should reduce I/O and memory usage.
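The column-pruning idea above can be sketched as follows. This is an illustrative sketch only: the helper name and the choice of time/partition column names are assumptions, not the PR's actual API; the real code in JoinPartJob.scala derives the needed columns from the join configuration.

```scala
// Hypothetical sketch: rather than scanning every column of the left table,
// keep only what the join actually needs -- the join keys plus the time and
// partition columns. Names here are illustrative.
object LeftColumnPruning {
  def relevantLeftColumns(joinKeys: Seq[String],
                          timeColumn: String,
                          partitionColumn: String): Seq[String] =
    (joinKeys :+ timeColumn :+ partitionColumn).distinct

  // In Spark this projection would then be applied before the join, e.g.:
  //   leftDf.select(relevantLeftColumns(keys, "ts", "ds").map(col): _*)
  // so Spark reads and shuffles only the pruned column set.
}
```

Pruning before the join matters most for wide left tables on columnar formats like Parquet, where an explicit select lets the reader skip unreferenced columns entirely.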
68-68: Simplified bloom filter generation. Directly using the dateRange parameter instead of a variable reference.
120-121: Enhanced logging. Added elapsed time information to the log output for better debugging and monitoring.
213-213: Variable usage updated for consistency. Using the new unfilledPartitionRange variable for shifting operations.
226-226: Cleaner time range conversion. Using the new toTimeRange utility function provides a more maintainable approach to converting partition ranges to time ranges.
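A partition-to-time-range conversion like the one referenced here typically maps an inclusive range of date-string partitions to millisecond bounds. The sketch below is a guess at the semantics (start of the first day through the last millisecond of the last day, in UTC); Chronon's actual toTimeRange lives in Extensions.scala and may differ in details such as time zone and partition format handling.

```scala
import java.time.{LocalDate, ZoneOffset}

// Hypothetical sketch of converting an inclusive partition range of
// "yyyy-MM-dd" strings into an epoch-millisecond time range.
case class PartitionRange(start: String, end: String) {
  def toTimeRange: (Long, Long) = {
    // First millisecond of the start day (UTC).
    val startMs =
      LocalDate.parse(start).atStartOfDay(ZoneOffset.UTC).toInstant.toEpochMilli
    // Last millisecond of the end day (UTC): start of the next day minus 1.
    val endMs =
      LocalDate.parse(end).plusDays(1).atStartOfDay(ZoneOffset.UTC).toInstant.toEpochMilli - 1
    (startMs, endMs)
  }
}
```

Keeping this conversion in one utility, as the PR does, avoids scattering off-by-one-day boundary arithmetic across call sites.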
spark opt: turn off time range check by default and select only relevant cols from left (#577)

## Summary

spark optimizations

## Checklist

- [ ] Added Unit Tests
- [ ] Covered by existing CI
- [ ] Integration tested
- [ ] Documentation update

## Summary by CodeRabbit

- **New Features**
  - Introduced a new configuration option that lets users control time range validation during join backfill operations, providing more consistent data processing.
  - Added a method for converting partition ranges to time ranges, enhancing flexibility in time range handling.
  - Added a new test case for data generation and validation during join operations.
- **Refactor**
  - Optimized the way data merging queries are constructed and how time ranges are applied during join operations, ensuring enhanced precision and performance across event data processing.
  - Simplified method signatures by removing unnecessary parameters, streamlining the process of generating Bloom filters.

Co-authored-by: ezvz <[email protected]>