[WIP] Debugging second write failure #205
Walkthrough

This pull request introduces multiple configuration files for data processing entities across various teams and domains. The changes primarily involve adding new JSON configurations for group-by and join operations, with a focus on data aggregation, metadata management, and partition handling. Several files in the production group-bys and joins directories have been updated, along with modifications to Scala utility classes for partition and data range management.
Sequence Diagram

```mermaid
sequenceDiagram
    participant Config as JSON Configuration
    participant GroupBy as Group-By Operations
    participant Join as Join Processing
    participant PartitionRange as Partition Range Utility
    Config->>GroupBy: Define Aggregation Rules
    GroupBy->>Join: Provide Processed Data
    Join->>PartitionRange: Manage Partition Ranges
    PartitionRange-->>Join: Return Collapsed Ranges
```
Actionable comments posted: 7
🧹 Nitpick comments (6)
spark/src/main/scala/ai/chronon/spark/TableUtils.scala (1)

Line range hint 635-643: Enhance assertion message readability.

The multi-line error message could be more readable.

```diff
-    assert(
-      inputStart.isDefined,
-      s"""Either partition range needs to have a valid start or
-         |an input table with valid data needs to be present
-         |inputTables: $inputTables, partitionRange: $outputPartitionRange
-         |""".stripMargin
-    )
+    assert(
+      inputStart.isDefined,
+      s"""Validation failed:
+         |1. Partition range must have a valid start, or
+         |2. Input table must have valid data
+         |
+         |Details:
+         |- Input tables: $inputTables
+         |- Partition range: $outputPartitionRange
+         |""".stripMargin
+    )
```

online/src/test/scala/ai/chronon/online/test/DataRangeTest.scala (1)
44-49: Add test case for null input

Add a test case to verify behavior with null input.

```diff
+  it should "handle null input" in {
+    val partitions: List[String] = null
+    val result = collapseToRanges(partitions)
+    result should be (empty)
+  }
```

api/py/test/sample/production/group_bys/sample_team/sample_chaining_group_by.chaining_group_by_v1 (2)

60-69: Add 'production' field

Include `"production": 1` in the groupBy's `metaData` for consistency.

122-132: Add 'production' field

Include `"production": 1` in the groupBy's `metaData` for consistency.

api/py/test/sample/production/group_bys/sample_team/sample_group_by_group_by.require_backfill (2)

37-48: Document operation code 7

Add a comment explaining what operation code 7 represents for better maintainability.

1-60: Add configuration documentation

Since this is a debugging PR:

- Document the configuration changes
- Outline the testing strategy for the changes
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro (Legacy)
📒 Files selected for processing (20)
- api/py/test/sample/production/group_bys/sample_team/sample_chaining_group_by.chaining_group_by_v1 (1 hunks)
- api/py/test/sample/production/group_bys/sample_team/sample_group_by.require_backfill (1 hunks)
- api/py/test/sample/production/group_bys/sample_team/sample_group_by_group_by.require_backfill (1 hunks)
- api/py/test/sample/production/group_bys/sample_team/sample_group_by_group_by.v1 (1 hunks)
- api/py/test/sample/production/joins/kaggle/outbrain.training_set (1 hunks)
- api/py/test/sample/production/joins/risk/user_transactions.txn_join (1 hunks)
- api/py/test/sample/production/joins/sample_team/sample_chaining_join.parent_join (0 hunks)
- api/py/test/sample/production/joins/sample_team/sample_join_bootstrap.v1 (0 hunks)
- api/py/test/sample/production/joins/sample_team/sample_join_bootstrap.v2 (0 hunks)
- api/py/test/sample/production/joins/sample_team/sample_join_derivation.v1 (0 hunks)
- api/py/test/sample/production/joins/sample_team/sample_join_from_module.v1 (0 hunks)
- api/py/test/sample/production/joins/sample_team/sample_join_with_derivations_on_external_parts.v1 (0 hunks)
- api/py/test/sample/production/joins/sample_team/sample_label_join.v1 (0 hunks)
- api/py/test/sample/production/joins/sample_team/sample_label_join_with_agg.v1 (0 hunks)
- api/py/test/sample/production/joins/sample_team/sample_online_join.v1 (0 hunks)
- online/src/main/scala/ai/chronon/online/DataRange.scala (1 hunks)
- online/src/test/scala/ai/chronon/online/test/DataRangeTest.scala (5 hunks)
- spark/src/main/scala/ai/chronon/spark/JoinUtils.scala (11 hunks)
- spark/src/main/scala/ai/chronon/spark/TableUtils.scala (6 hunks)
- spark/src/main/scala/ai/chronon/spark/utils/PartitionRunner.scala (2 hunks)
💤 Files with no reviewable changes (9)
- api/py/test/sample/production/joins/sample_team/sample_join_derivation.v1
- api/py/test/sample/production/joins/sample_team/sample_label_join.v1
- api/py/test/sample/production/joins/sample_team/sample_join_with_derivations_on_external_parts.v1
- api/py/test/sample/production/joins/sample_team/sample_chaining_join.parent_join
- api/py/test/sample/production/joins/sample_team/sample_join_from_module.v1
- api/py/test/sample/production/joins/sample_team/sample_join_bootstrap.v1
- api/py/test/sample/production/joins/sample_team/sample_label_join_with_agg.v1
- api/py/test/sample/production/joins/sample_team/sample_online_join.v1
- api/py/test/sample/production/joins/sample_team/sample_join_bootstrap.v2
✅ Files skipped from review due to trivial changes (1)
- spark/src/main/scala/ai/chronon/spark/JoinUtils.scala
🧰 Additional context used
📓 Learnings (1)
api/py/test/sample/production/joins/risk/user_transactions.txn_join (1)
Learnt from: chewy-zlai
PR: zipline-ai/chronon#30
File: api/py/test/sample/production/joins/risk/user_transactions.txn_join:217-218
Timestamp: 2024-11-12T09:38:33.532Z
Learning: The JSON files in this project are automatically generated and should not be manually modified or refactored.
⏰ Context from checks skipped due to timeout of 90000ms (7)
- GitHub Check: table_utils_delta_format_spark_tests
- GitHub Check: other_spark_tests
- GitHub Check: mutation_spark_tests
- GitHub Check: fetcher_spark_tests
- GitHub Check: no_spark_scala_tests
- GitHub Check: join_spark_tests
- GitHub Check: scala_compile_fmt_fix
🔇 Additional comments (9)
spark/src/main/scala/ai/chronon/spark/TableUtils.scala (2)
606-619: LGTM! Clean formatting changes.

The reformatting improves readability while maintaining the same functionality.

684-690: LGTM! Improved logging clarity.

Using `PartitionRange.collapsedPrint` provides more concise output.

online/src/main/scala/ai/chronon/online/DataRange.scala (1)

180-183: LGTM!

Good defensive programming with the null check.
spark/src/main/scala/ai/chronon/spark/utils/PartitionRunner.scala (1)
7-7: LGTM!

Import and usage correctly updated to use `collapseToRanges`.
Also applies to: 82-82
api/py/test/sample/production/group_bys/sample_team/sample_chaining_group_by.chaining_group_by_v1 (1)
23-25: Confirm 'production' status

Is `"production": 0` in the join's `metaData` intentional?

api/py/test/sample/production/joins/risk/user_transactions.txn_join (1)

8-10: Unbounded dependencies could cause write failures.

All dependencies have null start/end dates, which might lead to unbounded data fetching. Consider adding date bounds to prevent excessive data loading.
api/py/test/sample/production/joins/kaggle/outbrain.training_set (1)
104-104: Inconsistent accuracy settings.

Accuracy settings vary between 0 and 1 across group-bys. Verify whether this is intentional.

Also applies to: 173-173, 241-241, 312-312
✅ Verification successful
Accuracy settings are intentionally varied
The alternating pattern (1,0,0,1) in outbrain.training_set appears deliberate, while other configurations consistently use 0.
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:

```shell
#!/bin/bash
# Search for other accuracy settings in similar configurations
rg -A 1 '"accuracy":'
```

Length of output: 1437
api/py/test/sample/production/group_bys/sample_team/sample_group_by_group_by.require_backfill (2)
6-8: Verify dependency start date alignment

The dependency start date (2021-03-01) precedes the backfill start date (2023-01-01). Consider aligning these dates to avoid unnecessary data processing.

27-28: Confirm if setups configuration is needed

The setups array is empty. Verify whether any data transformations are required.
```scala
def compactPrint(partitions: Iterable[String])(implicit partitionSpec: PartitionSpec): String = {
  if (partitions == null) return "NO_PARTITIONS_FOUND"
  rangesToString(collapseToRanges(partitions))
}
```
🛠️ Refactor suggestion

Maintain consistent null handling

Return an empty string for null input to match the behavior in `collapsedPrint`.

```diff
- if (partitions == null) return "NO_PARTITIONS_FOUND"
+ if (partitions == null) return ""
```

📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
```scala
def compactPrint(partitions: Iterable[String])(implicit partitionSpec: PartitionSpec): String = {
  if (partitions == null) return ""
  rangesToString(collapseToRanges(partitions))
}
```
```scala
if (partitions == null) return null
if (partitions.isEmpty) return Seq.empty
```
Return empty sequence instead of null

Returning null could cause NPEs downstream. Return `Seq.empty` for both null and empty cases.

```diff
- if (partitions == null) return null
- if (partitions.isEmpty) return Seq.empty
+ if (partitions == null || partitions.isEmpty) return Seq.empty
```

📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
```scala
if (partitions == null || partitions.isEmpty) return Seq.empty
```
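A hedged sketch of how `collapseToRanges` could behave with the suggested guard. This assumes daily `yyyy-MM-dd` partitions for illustration; the project's version takes an implicit `PartitionSpec` instead of hard-coding the format:

```scala
import java.time.LocalDate

object PartitionCollapseSketch {
  // Collapse daily partition strings into contiguous (start, end) ranges,
  // returning Seq.empty for both null and empty input per the review.
  def collapseToRanges(partitions: Iterable[String]): Seq[(String, String)] = {
    if (partitions == null || partitions.isEmpty) return Seq.empty
    val sorted = partitions.toSeq.sorted
    sorted.tail.foldLeft(Vector((sorted.head, sorted.head))) { case (acc, p) =>
      val (start, end) = acc.last
      if (LocalDate.parse(end).plusDays(1) == LocalDate.parse(p))
        acc.init :+ ((start, p)) // extend the current contiguous range
      else
        acc :+ ((p, p)) // a gap: start a new single-day range
    }
  }
}
```

Returning `Seq.empty` in both degenerate cases keeps downstream callers such as `compactPrint` free of null checks.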
```json
"name": "sample_team.sample_group_by_group_by.v1",
"production": 0,
```
🛠️ Refactor suggestion
Set 'production' flag
Consider setting `"production": 1` if this configuration is ready for production.
```json
"name": "sample_team.sample_group_by.require_backfill",
"production": 0,
```
🛠️ Refactor suggestion
Set 'production' flag
Consider setting `"production": 1` if this configuration is ready for production.
```json
    }
  ]
},
"prefix": "user"
```
Duplicate prefix "user" in joinParts.

The prefix "user" is used multiple times, which could cause column name collisions.

```diff
- "prefix": "user"
+ "prefix": "user_data"
```

Also applies to: 209-209
| { | ||
| "events": { | ||
| "table": "kaggle_outbrain_base_table", | ||
| "topic": "some_topic", |
Placeholder topic name could cause write failure.
The topic name "some_topic" appears to be a placeholder.
Specify the correct Kafka topic name for the streaming configuration.
```json
"derivations": [
  {
    "name": "derived_field",
    "expression": ""
  },
  {
    "name": "*",
    "expression": "*"
  }
]
```
Fix incomplete derivation configuration

Issues found:

- Empty expression for `derived_field`
- Wildcard (`*`) usage could impact performance
Summary by CodeRabbit
Based on the comprehensive summary, here are the release notes:
New Configurations

Metadata Refinements

- `tableProperties` and `outputNamespace` fields

Code Improvements

- `DataRange.scala`

Technical Updates

- Renamed the `collapseToRange` method to `collapseToRanges` across multiple files

These changes primarily focus on improving data processing configurations, refining metadata management, and enhancing partition range handling across the system.