feat: Support GB backfill #1014
Conversation
Walkthrough
A new case branch for handling the GROUP_BY_BACKFILL node content type is added to BatchNodeRunner, dispatching group-by backfill runs to GroupBy.computeBackfill.
Changes
Sequence Diagram(s)
```mermaid
sequenceDiagram
    participant Runner as BatchNodeRunner
    participant Conf as Config
    participant GroupBy as GroupBy
    Runner->>Conf: getSetField
    alt GROUP_BY_BACKFILL
        Runner->>GroupBy: computeBackfill(groupByConf, endPartition, tableUtils, startPartitionOverride)
    else Other node types
        Runner->>...: Existing logic
    end
```
Estimated code review effort: 🎯 2 (Simple) | ⏱️ ~7 minutes
Possibly related PRs
Suggested reviewers
Poem
Actionable comments posted: 1
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
spark/src/main/scala/ai/chronon/spark/batch/BatchNodeRunner.scala (2 hunks)
🧰 Additional context used
🧠 Learnings (2)
📓 Common learnings
Learnt from: nikhil-zlai
PR: zipline-ai/chronon#50
File: spark/src/main/scala/ai/chronon/spark/stats/drift/SummaryUploader.scala:19-47
Timestamp: 2024-11-03T14:51:40.825Z
Learning: In Scala, the `grouped` method on collections returns an iterator, allowing for efficient batch processing without accumulating all records in memory.
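As a quick illustration of that learning, here is a minimal, self-contained sketch (the record values and batch size are invented for the example, not taken from SummaryUploader):

```scala
// Batching with `grouped` on an Iterator: batches are produced lazily,
// so only one batch of records is materialized in memory at a time.
object GroupedBatchingExample {
  def main(args: Array[String]): Unit = {
    val records: Iterator[String] = Iterator.tabulate(10000)(i => s"record-$i")

    records.grouped(500).foreach { batch =>
      // `batch` is a Seq of at most 500 elements; it can be processed
      // and discarded before the next batch is pulled from the source.
      println(s"processing batch of ${batch.size} records")
    }
  }
}
```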
spark/src/main/scala/ai/chronon/spark/batch/BatchNodeRunner.scala (4)
Learnt from: nikhil-zlai
PR: #50
File: spark/src/main/scala/ai/chronon/spark/stats/drift/SummaryUploader.scala:19-47
Timestamp: 2024-11-03T14:51:40.825Z
Learning: In Scala, the grouped method on collections returns an iterator, allowing for efficient batch processing without accumulating all records in memory.
Learnt from: chewy-zlai
PR: #62
File: spark/src/main/scala/ai/chronon/spark/stats/drift/SummaryUploader.scala:9-10
Timestamp: 2024-11-06T21:54:56.160Z
Learning: In Spark applications, when defining serializable classes, passing an implicit ExecutionContext parameter can cause serialization issues. In such cases, it's acceptable to use scala.concurrent.ExecutionContext.Implicits.global.
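A hedged sketch of the pattern this learning describes (the class and method names are invented for illustration, not the actual SummaryUploader code):

```scala
import scala.concurrent.{ExecutionContext, Future}

// A serializable class that avoids capturing an implicit ExecutionContext
// as a constructor parameter, which can break Spark task serialization.
class UploaderSketch extends Serializable {
  def upload(payload: String): Future[Unit] = {
    // Per the learning above: reference the global ExecutionContext
    // inside the method body instead of taking one implicitly.
    implicit val ec: ExecutionContext = scala.concurrent.ExecutionContext.Implicits.global
    Future {
      println(s"uploading $payload")
    }
  }
}
```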
Learnt from: nikhil-zlai
PR: #70
File: service/src/main/java/ai/chronon/service/ApiProvider.java:6-6
Timestamp: 2024-12-03T04:04:33.809Z
Learning: The import scala.util.ScalaVersionSpecificCollectionsConverter in service/src/main/java/ai/chronon/service/ApiProvider.java is correct and should not be flagged in future reviews.
Learnt from: tchow-zlai
PR: #263
File: cloud_gcp/src/main/scala/ai/chronon/integrations/cloud_gcp/BigQueryFormat.scala:56-57
Timestamp: 2025-01-24T23:55:40.650Z
Learning: For BigQuery table creation operations in BigQueryFormat.scala, allow exceptions to propagate directly without wrapping them in try-catch blocks, as the original BigQuery exceptions provide sufficient context.
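To make that trade-off concrete, a generic sketch (the function names are hypothetical stand-ins, not the actual BigQueryFormat code):

```scala
object ExceptionPropagationSketch {
  // Discouraged here: wrapping adds a layer without adding context,
  // since the original exception already describes the failure well.
  def createTableWrapped(create: () => Unit): Unit =
    try create()
    catch { case e: Exception => throw new RuntimeException("table creation failed", e) }

  // Preferred per the learning above: call directly and let the
  // provider's exception propagate to the caller unchanged.
  def createTableDirect(create: () => Unit): Unit =
    create()
}
```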
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (17)
- GitHub Check: fetcher_tests
- GitHub Check: batch_tests
- GitHub Check: spark_tests
- GitHub Check: streaming_tests
- GitHub Check: groupby_tests
- GitHub Check: analyzer_tests
- GitHub Check: join_tests
- GitHub Check: service_tests
- GitHub Check: cloud_gcp_tests
- GitHub Check: cloud_aws_tests
- GitHub Check: service_commons_tests
- GitHub Check: api_tests
- GitHub Check: online_tests
- GitHub Check: flink_tests
- GitHub Check: aggregator_tests
- GitHub Check: scala_compile_fmt_fix
- GitHub Check: enforce_triggered_workflows
🔇 Additional comments (1)
spark/src/main/scala/ai/chronon/spark/batch/BatchNodeRunner.scala (1)
13-13: LGTM! Clean import addition following existing patterns.
```scala
case NodeContent._Fields.GROUP_BY_BACKFILL =>
  GroupBy.computeBackfill(
    conf.getGroupByBackfill.groupBy,
    range.end,
    tableUtils,
    overrideStartPartition = Option(range.start)
  )
```
🛠️ Refactor suggestion
Add logging and validation for consistency.
Other case branches include logging and validation. Consider adding:
```diff
 case NodeContent._Fields.GROUP_BY_BACKFILL =>
+  require(conf.getGroupByBackfill.isSetGroupBy, "GroupByBackfillNode must have a groupBy set")
+  logger.info(s"Running groupBy backfill for '${metadata.name}' for range: [${range.start}, ${range.end}]")
   GroupBy.computeBackfill(
     conf.getGroupByBackfill.groupBy,
     range.end,
     tableUtils,
     overrideStartPartition = Option(range.start)
   )
+  logger.info(s"Successfully completed groupBy backfill for '${metadata.name}'")
```

📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
The suggested replacement for the snippet above:

```scala
case NodeContent._Fields.GROUP_BY_BACKFILL =>
  require(
    conf.getGroupByBackfill.isSetGroupBy,
    "GroupByBackfillNode must have a groupBy set"
  )
  logger.info(
    s"Running groupBy backfill for '${metadata.name}' for range: [${range.start}, ${range.end}]"
  )
  GroupBy.computeBackfill(
    conf.getGroupByBackfill.groupBy,
    range.end,
    tableUtils,
    overrideStartPartition = Option(range.start)
  )
  logger.info(
    s"Successfully completed groupBy backfill for '${metadata.name}'"
  )
```
🤖 Prompt for AI Agents
In spark/src/main/scala/ai/chronon/spark/batch/BatchNodeRunner.scala around
lines 103 to 109, the GROUP_BY_BACKFILL case lacks logging and validation
present in other case branches. Add appropriate logging statements before and
after the GroupBy.computeBackfill call to track execution flow and outcomes.
Also, include validation checks on inputs such as
conf.getGroupByBackfill.groupBy and range values to ensure they meet expected
criteria before proceeding.