@xushiyan xushiyan commented May 17, 2023

Change Logs

Fix test to use RowCustomColumnsSortPartitioner as global sort partitioner.

This blocks #8445

Impact

NA

Risk level

Low.

Documentation Update

NA

Contributor's checklist

  • Read through contributor's guide
  • Change Logs and Impact were stated clearly
  • Adequate tests were added if applicable
  • CI passed

boolean populateMetaFields) {
Dataset<Row> records1 = generateTestRecords();
Dataset<Row> records2 = generateTestRecords();
Dataset<Row> records = generateTestRecords();
Member Author

@xushiyan xushiyan May 17, 2023

Not sure why the existing test case runs the same logic twice, with records1 and records2. @boneanxs, any thoughts?

Contributor

testCustomColumnSortPartitionerWithRows was copied from testBulkInsertInternalPartitioner. I looked at org.apache.hudi.execution.bulkinsert.TestBulkInsertInternalPartitioner#testBulkInsertInternalPartitioner:177, and it actually generates two record sets that are unioned a different number of times:

JavaRDD<HoodieRecord> records1 = generateTestRecordsForBulkInsert(jsc);
JavaRDD<HoodieRecord> records2 = generateTripleTestRecordsForBulkInsert(jsc);

So I think this was a mistake, and I think unioning it twice should be enough. (Are the different union counts here meant for different partitions?)

Contributor

There is no change to any code in the write path, so why do the tests run successfully on Spark 3.1 or 2.4?

Member Author

The test only passes on Spark 2.4, which is a coincidence. The existing test logic asserts 2 RDD partitions after repartitioning by the partitioner. With Spark 2.4's sort and coalesce, it gives 2 and passes the test as a local partitioner. The correct expectation is that the partitioner does a global sort and the resulting number of partitions should be 2 or less, which is what Spark 3 gives us.
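
To illustrate that expectation, here is a minimal sketch in plain Java. It is not Hudi's test code and uses no Spark APIs: the class `GlobalSortCheck` and the helper `isGloballySorted` are hypothetical names, and partitions are modeled as nested lists. It assumes a global sort means two things: the partition count does not exceed the requested number, and the records are ordered across partition boundaries, not only within each partition.

```java
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

public class GlobalSortCheck {

  // Hypothetical helper: returns true when the partition count stays within the
  // requested bound and reading the partitions in order yields a sorted sequence.
  public static <T> boolean isGloballySorted(List<List<T>> partitions, int maxPartitions, Comparator<T> cmp) {
    if (partitions.size() > maxPartitions) {
      return false;
    }
    T previous = null;
    boolean seen = false;
    for (List<T> partition : partitions) {
      for (T record : partition) {
        // Compare against the last record seen, even if it came from an earlier partition.
        if (seen && cmp.compare(previous, record) > 0) {
          return false;
        }
        previous = record;
        seen = true;
      }
    }
    return true;
  }

  public static void main(String[] args) {
    Comparator<String> cmp = Comparator.naturalOrder();
    // One partition out of a requested two: still a valid global sort.
    List<List<String>> coalesced = Arrays.asList(Arrays.asList("a", "b", "c", "d"));
    // Two partitions, each sorted locally, but not ordered across partitions.
    List<List<String>> localOnly = Arrays.asList(Arrays.asList("a", "c"), Arrays.asList("b", "d"));
    System.out.println(isGloballySorted(coalesced, 2, cmp)); // prints "true"
    System.out.println(isGloballySorted(localOnly, 2, cmp)); // prints "false"
  }
}
```

A check of this shape accepts "2 or less" partitions, whereas asserting exactly 2 partitions only holds when the engine happens to produce that count.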

@xushiyan xushiyan requested a review from yihua May 17, 2023 17:15
Comment on lines -148 to +135
records2, true, false, true, generateExpectedPartitionNumRecords(records2), Option.of(comparator), true);
records, true, true, true, generateExpectedPartitionNumRecords(records), Option.of(comparator), true);
Member Author

The existing test treated the partitioner as non-global and hence failed the test scenario under Spark 3.2.

Contributor

@boneanxs boneanxs left a comment

Sorry for the mistake. +1 for this

@xushiyan xushiyan requested review from danny0405 and removed request for yihua May 18, 2023 07:27
Contributor

@danny0405 danny0405 left a comment

+1

@xushiyan xushiyan merged commit 9ef7bd8 into apache:master May 18, 2023
@xushiyan xushiyan deleted the HUDI-5394-fix-partitioner-ut branch May 18, 2023 11:08
@xushiyan xushiyan added the priority:high Significant impact; potential bugs label May 18, 2023