@guanlisheng guanlisheng commented Jun 19, 2023

Change Logs

Refine the average record size calculation by considering both commit and deltacommit.

Contributor's checklist

  • Read through contributor's guide
  • Change Logs and Impact were stated clearly
  • Adequate tests were added if applicable
  • CI passed

@guanlisheng
Author

#6864 introduced an optimization, but a restrictive one.

actually, we need to consider deltacommit as well, per https://hudi.apache.org/docs/timeline

@danny0405
Contributor

Is there any possibility we write some test cases then?

@guanlisheng
Author

thanks @danny0405 for replying, and sure thing I will try to add a unit test later in the PR.

@guanlisheng
Author

guanlisheng commented Jun 19, 2023

hi @danny0405,
I found it a little hard to add a unit test with the current code structure, because the change is in how we call the averageBytesPerRecord util function (both in this PR and in the related PR #6864).

a possible way is to consolidate the change into the util function. what do you think?

@danny0405
Contributor

I'm wondering whether we can add some tests for [UpsertPartitioner.java](https://github.com/apache/hudi/pull/6864/files#diff-c24514cd65d7c8a0adab52c40ee2a6ad18faca8b65d41e2be06094643c84c8f5)

@guanlisheng
Author

hey @danny0405, I understand your ask in general, but it is not that easy.

currently, both PRs change the input parameters passed to averageBytesPerRecord.
https://github.com/apache/hudi/blob/master/hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/table/action/commit/UpsertPartitioner.java#L164-L174

I think locating the change within averageBytesPerRecord would be a good implementation, and also UT-friendly.

furthermore, leveraging hoodie.parquet.max.file.size along with fileSizeThreshold would achieve the same goal.
https://github.com/apache/hudi/blob/master/hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/table/action/commit/UpsertPartitioner.java#L386-L388

@yihua yihua self-assigned this Jun 20, 2023
@yihua
Contributor

yihua commented Jun 20, 2023

> hi @danny0405, I found it a little hard to add a unit test with the current code structure, because the change is in how we call the averageBytesPerRecord util function (both in this PR and in the related PR #6864).
>
> a possible way is to consolidate the change into the util function. what do you think?

+1, let's move averageBytesPerRecord to a util class and make it public, for unit testing.

     */
    long averageRecordSize = averageBytesPerRecord(table.getMetaClient().getActiveTimeline()
-       .getTimelineOfActions(CollectionUtils.createSet(COMMIT_ACTION)).filterCompletedInstants(), config);
+       .getTimelineOfActions(CollectionUtils.createSet(COMMIT_ACTION, DELTA_COMMIT_ACTION)).filterCompletedInstants(), config);
Contributor

Could we be more strict here on the calculation of average bytes per record? deltacommits may contain log files and the number of bytes written for log files cannot be applied to writing parquet base files. Let's only consider parquet file size here for record size estimation.

Contributor

There is a transformation ratio between log and parquet, but may not be very accurate.

Author

could you advise how I can distinguish/separate them?

Contributor

I think it's fine with your change now, just need to add some tests for the method that you might abstract out.

Author

as I commented previously, adding a UT for the assignInserts function is a little hard...
any suggestions from your side?

Contributor

We can add a UT for averageBytesPerRecord, I guess?

Author

averageBytesPerRecord is neither changed nor impacted, as it just goes through the commits in reverse order to find the first eligible one and calculates the size from it. this explains why #6864 did not modify it either.

here are the existing UTs for the function. I know the value of UTs, but it is not easy to add one with the current code structure.
https://github.com/apache/hudi/blob/master/hudi-client/hudi-spark-client/src/test/java/org/apache/hudi/table/action/commit/TestUpsertPartitioner.java#L168-L190
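
For readers following along, the reverse scan described in this comment can be sketched roughly as below. This is a simplified illustration and not the Hudi implementation: `InstantStats`, `ReverseScanSketch`, and the `defaultEstimate` parameter are hypothetical stand-ins, and only the eligibility check and the `Math.ceil` expression are taken from the diff quoted in this thread.

```java
import java.util.List;

/** Hypothetical, simplified stand-in for the write stats of one completed instant. */
final class InstantStats {
    final long totalBytesWritten;   // bytes summed over every file the instant wrote
    final long totalRecordsWritten; // records summed over every file the instant wrote

    InstantStats(long totalBytesWritten, long totalRecordsWritten) {
        this.totalBytesWritten = totalBytesWritten;
        this.totalRecordsWritten = totalRecordsWritten;
    }
}

final class ReverseScanSketch {
    /**
     * Scans instants from newest to oldest and returns the average record size
     * of the first instant that wrote more than fileSizeThreshold bytes.
     * Falls back to defaultEstimate (an assumed parameter) when none qualifies.
     */
    static long averageBytesPerRecord(List<InstantStats> oldestFirst,
                                      long fileSizeThreshold,
                                      long defaultEstimate) {
        for (int i = oldestFirst.size() - 1; i >= 0; i--) {
            InstantStats stats = oldestFirst.get(i);
            if (stats.totalBytesWritten > fileSizeThreshold && stats.totalRecordsWritten > 0) {
                return (long) Math.ceil((1.0 * stats.totalBytesWritten) / stats.totalRecordsWritten);
            }
        }
        return defaultEstimate;
    }
}
```

Because the function only sees whatever timeline it is handed, changing which actions are passed in changes the estimate without touching the function body, which is the point made above.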

Contributor

We can add a UT for averageBytesPerRecord to validate that both commit and delta_commit are included in the algorithm.
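
One way such a check could look, sketched with plain Java and no Hudi or JUnit dependencies. Everything here is an assumption made for illustration: the `Instant` class, the helper names, and the use of the action strings "commit" and "deltacommit" as stand-ins for COMMIT_ACTION and DELTA_COMMIT_ACTION. A real test would build a Hudi timeline with commit metadata instead.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

final class ActionFilterSketch {
    /** Hypothetical stand-in for a completed instant and its aggregated write stats. */
    static final class Instant {
        final String action;
        final long bytesWritten;
        final long recordsWritten;

        Instant(String action, long bytesWritten, long recordsWritten) {
            this.action = action;
            this.bytesWritten = bytesWritten;
            this.recordsWritten = recordsWritten;
        }
    }

    /** Keeps only instants whose action is in the given set, preserving order. */
    static List<Instant> filterByActions(List<Instant> timeline, Set<String> actions) {
        List<Instant> kept = new ArrayList<>();
        for (Instant instant : timeline) {
            if (actions.contains(instant.action)) {
                kept.add(instant);
            }
        }
        return kept;
    }

    /** Newest-first scan using the eligibility check quoted in this thread. */
    static long estimate(List<Instant> oldestFirst, long threshold, long fallback) {
        for (int i = oldestFirst.size() - 1; i >= 0; i--) {
            Instant instant = oldestFirst.get(i);
            if (instant.bytesWritten > threshold && instant.recordsWritten > 0) {
                return (long) Math.ceil((1.0 * instant.bytesWritten) / instant.recordsWritten);
            }
        }
        return fallback;
    }
}
```

With a timeline that holds only deltacommits, filtering on "commit" alone leaves nothing to scan and the fallback is returned, while filtering on both actions produces an estimate from the newest eligible deltacommit. Asserting on that difference is the behavior the requested UT would pin down.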

-     if (totalBytesWritten > fileSizeThreshold && totalRecordsWritten > 0) {
+     if (totalBytesWritten > fileSizeThreshold && totalBytesWritten <= fileSizeUpper && totalRecordsWritten > 0) {
        avgSize = (long) Math.ceil((1.0 * totalBytesWritten) / totalRecordsWritten);
        break;
Contributor

What's the purpose of this limitation?

Author

to achieve a similar goal as #6864.

this consolidated implementation avoids logic fragments and is also UT-friendly.

Contributor

But totalBytesWritten is the size accumulated from all the written files, so what is the meaning of comparing it with a single-file size threshold?
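
A small worked example of the mismatch raised here, as a sketch with made-up numbers. The class and method names are hypothetical; only the boolean condition mirrors the upper-bounded check from the diff.

```java
final class ThresholdMismatchSketch {
    /** The upper-bounded check from the diff, lifted into a standalone method. */
    static boolean eligible(long totalBytesWritten, long totalRecordsWritten,
                            long fileSizeThreshold, long fileSizeUpper) {
        return totalBytesWritten > fileSizeThreshold
            && totalBytesWritten <= fileSizeUpper
            && totalRecordsWritten > 0;
    }

    /** Sums per-file sizes the way a commit-level total accumulates them. */
    static long total(long[] perFileBytes) {
        long sum = 0;
        for (long bytes : perFileBytes) {
            sum += bytes;
        }
        return sum;
    }
}
```

Suppose a commit writes four files of 100 MB each and the per-file upper bound is 120 MB. Every file is within the bound, yet the commit-level total is 400 MB, so the check rejects the commit. A commit that writes a single 100 MB file passes. The bound therefore ends up filtering on how many files a commit wrote rather than on how large any one file is.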

Author

you are correct and I misunderstood the totalBytesWritten.

then there might be two options:

  1. consider extra FilesInsert/FileUpdated
  2. filter out REPLACE commit within the averageBytesPerRecord function.

which one do you prefer?

Author

I just updated the code to reflect option 2.

    while (instants.hasNext()) {
      HoodieInstant instant = instants.next();
      if (instant.getAction().equals(REPLACE_COMMIT_ACTION)) {
        continue;
Contributor

Would prefer CollectionUtils.createSet(COMMIT_ACTION, DELTA_COMMIT_ACTION). One thing to note: the avro file uses Java reflection to calculate the in-memory size of a record, while the Parquet file size is the actual file size, so it is more accurate. Not sure whether we should include the log if there are parquets; that is, the parquet file size should have higher priority than the log.
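
One possible reading of this priority idea, sketched under loud assumptions: the action string is used as a rough proxy for "wrote parquet" ("commit") versus "may have written log files" ("deltacommit"), which is a simplification, and all class, method, and parameter names below are hypothetical rather than Hudi APIs.

```java
import java.util.List;

final class BaseFilePrioritySketch {
    /** Hypothetical stand-in for a completed instant and its aggregated write stats. */
    static final class Instant {
        final String action;
        final long bytesWritten;
        final long recordsWritten;

        Instant(String action, long bytesWritten, long recordsWritten) {
            this.action = action;
            this.bytesWritten = bytesWritten;
            this.recordsWritten = recordsWritten;
        }
    }

    /** Prefers the newest eligible commit; uses deltacommits only when no commit qualifies. */
    static long estimate(List<Instant> oldestFirst, long threshold, long fallback) {
        long fromCommits = scan(oldestFirst, "commit", threshold);
        if (fromCommits > 0) {
            return fromCommits;
        }
        long fromDeltaCommits = scan(oldestFirst, "deltacommit", threshold);
        return fromDeltaCommits > 0 ? fromDeltaCommits : fallback;
    }

    /** Newest-first scan restricted to one action; returns -1 when nothing is eligible. */
    private static long scan(List<Instant> oldestFirst, String action, long threshold) {
        for (int i = oldestFirst.size() - 1; i >= 0; i--) {
            Instant instant = oldestFirst.get(i);
            if (action.equals(instant.action)
                && instant.bytesWritten > threshold
                && instant.recordsWritten > 0) {
                return (long) Math.ceil((1.0 * instant.bytesWritten) / instant.recordsWritten);
            }
        }
        return -1;
    }
}
```

The trade-off is that an older commit can win over a newer deltacommit, so the estimate may lag behind recent changes in record shape in exchange for being based on actual parquet sizes.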

Author

hi @danny0405, I just reverted the code to the very first version.
Whether or not the log is included, the current one would be better than the released one, at least for MoR tables.

Contributor

Agree, it would be great if you can add a test for the method.

@danny0405 danny0405 changed the title from "refine avg record size by considering both commit and deltacommit" to "[HUDI-6437] Refine avg record size by considering both commit and deltacommit" Jun 26, 2023
@github-actions github-actions bot added the size:XS PR with lines of changes in <= 10 label Feb 26, 2024
@danny0405
Contributor

Closing this one as it is resolved in #10763.

@danny0405 danny0405 closed this Mar 12, 2024