-
Notifications
You must be signed in to change notification settings - Fork 2.5k
[HUDI-5394] Fix tests for RowCustomColumnsSortPartitioner #8741
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
| boolean populateMetaFields) { | ||
| Dataset<Row> records1 = generateTestRecords(); | ||
| Dataset<Row> records2 = generateTestRecords(); | ||
| Dataset<Row> records = generateTestRecords(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
not sure why the existing test case runs the same logic twice with records1 and records2. @boneanxs any thoughts?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
testCustomColumnSortPartitionerWithRows was copied from testBulkInsertInternalPartitioner. And I looked org.apache.hudi.execution.bulkinsert.TestBulkInsertInternalPartitioner#testBulkInsertInternalPartitioner:177, it actually generates two records sets with different union times:
JavaRDD<HoodieRecord> records1 = generateTestRecordsForBulkInsert(jsc);
JavaRDD<HoodieRecord> records2 = generateTripleTestRecordsForBulkInsert(jsc);So I think this should be a mistake, and I think union it twice should be enough(Here different union times for different partitions?)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There is no change for any codes in the write path, so why the tests run successfully for Spark 3.1 or 2.4 ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the test only passes spark 2.4 which is an coincident. The existing test logic asserts 2 rdd partitions after re-partition by the partitioner. with spark 2.4's sort and coalesce, it gives 2 and passes the test as a local partitioner. The correct expectation is the partitioner is doing global sort and the resulting num partition should be 2 or less, which is what spark 3 gives us.
| records2, true, false, true, generateExpectedPartitionNumRecords(records2), Option.of(comparator), true); | ||
| records, true, true, true, generateExpectedPartitionNumRecords(records), Option.of(comparator), true); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the existing test treated the partitioner as non-global and hence failed the test scenario under spark 3.2
boneanxs
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry for the mistake. +1 for this
| boolean populateMetaFields) { | ||
| Dataset<Row> records1 = generateTestRecords(); | ||
| Dataset<Row> records2 = generateTestRecords(); | ||
| Dataset<Row> records = generateTestRecords(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
testCustomColumnSortPartitionerWithRows was copied from testBulkInsertInternalPartitioner. And I looked org.apache.hudi.execution.bulkinsert.TestBulkInsertInternalPartitioner#testBulkInsertInternalPartitioner:177, it actually generates two records sets with different union times:
JavaRDD<HoodieRecord> records1 = generateTestRecordsForBulkInsert(jsc);
JavaRDD<HoodieRecord> records2 = generateTripleTestRecordsForBulkInsert(jsc);So I think this should be a mistake, and I think union it twice should be enough(Here different union times for different partitions?)
danny0405
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1
Change Logs
Fix test to use
RowCustomColumnsSortPartitioneras global sort partitioner.This blocks #8445
Impact
NA
Risk level
Low.
Documentation Update
NA
Contributor's checklist