[HUDI-6437] Refine avg record size by considering both commit and deltacommit #9013
Conversation
#6864 introduced an optimization, but it is a restrictive change. Actually, we need to consider both commit and deltacommit actions.
Is there any possibility we could write some test cases then?
Thanks @danny0405 for replying; sure thing, I will try to add a unit test later in the PR.
Hi @danny0405, a possible way is to consolidate the change into the util function. What do you think?
I'm wondering whether we can add some tests for the changed logic.
Hey @danny0405, I understood your ask in general, but it is not that easy. Currently, both PRs only change the input parameters of averageBytesPerRecord rather than the function itself. I think locating the change within averageBytesPerRecord would consolidate the logic; further, leveraging that function's existing UTs would make it easier to test.
+1, let's move the change into the util function.
```diff
   */
  long averageRecordSize = averageBytesPerRecord(table.getMetaClient().getActiveTimeline()
-     .getTimelineOfActions(CollectionUtils.createSet(COMMIT_ACTION)).filterCompletedInstants(), config);
+     .getTimelineOfActions(CollectionUtils.createSet(COMMIT_ACTION, DELTA_COMMIT_ACTION)).filterCompletedInstants(), config);
```
Could we be more strict here on the calculation of average bytes per record? Deltacommits may contain log files, and the number of bytes written to log files does not carry over to writing parquet base files. Let's only consider the parquet file size here for record size estimation.
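For illustration, a "parquet-only" accumulation could look roughly like the sketch below. It reads the per-file write stats from HoodieCommitMetadata; the path-suffix filter and the helper shape are illustrative assumptions, not the actual Hudi implementation.

```java
import java.util.List;

import org.apache.hudi.common.model.HoodieCommitMetadata;
import org.apache.hudi.common.model.HoodieWriteStat;

public class ParquetOnlyRecordSize {
  // Sketch: accumulate bytes/records only from parquet base files, skipping log files,
  // so log-file byte counts never skew the base-file record size estimate.
  static long averageParquetBytesPerRecord(HoodieCommitMetadata metadata) {
    long parquetBytes = 0;
    long parquetRecords = 0;
    for (List<HoodieWriteStat> stats : metadata.getPartitionToWriteStats().values()) {
      for (HoodieWriteStat stat : stats) {
        String path = stat.getPath();
        if (path != null && path.endsWith(".parquet")) { // illustrative filter, not the real check
          parquetBytes += stat.getTotalWriteBytes();
          parquetRecords += stat.getNumWrites();
        }
      }
    }
    return parquetRecords > 0 ? (long) Math.ceil((1.0 * parquetBytes) / parquetRecords) : 0L;
  }
}
```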
There is a transformation ratio between log and parquet sizes, but it may not be very accurate.
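If log bytes were folded in anyway, the idea would be a discount factor along these lines. The 0.35 ratio and the formula are purely illustrative, since the comment above only notes that such a ratio exists and may be inaccurate.

```java
public class LogToParquetRatioSketch {
  // Hypothetical: discount log-file bytes by an assumed log->parquet size ratio before
  // averaging. Neither the 0.35 factor nor this formula comes from the actual Hudi code.
  static long averageBytesWithLogDiscount(long parquetBytes, long logBytes, long totalRecords) {
    double logToParquetRatio = 0.35; // illustrative constant
    long effectiveBytes = parquetBytes + (long) (logBytes * logToParquetRatio);
    return totalRecords > 0 ? (long) Math.ceil((1.0 * effectiveBytes) / totalRecords) : 0L;
  }
}
```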
Would you advise how I can distinguish/separate them?
I think your change is fine now; we just need to add some tests for the method that you might abstract out.
As I commented previously, adding a UT for the assignInserts function is a little hard...
Any suggestions from your side?
We can add a UT for averageBytesPerRecord, I guess?
averageBytesPerRecord is not changed or impacted, as it just walks the commits in reverse order to find the first eligible one for calculating the size. This explains why #6864 didn't modify it either.
Here are the existing UTs for the function. I know the value of UTs, but it is not easy to add one with the current code structure.
https://github.com/apache/hudi/blob/master/hudi-client/hudi-spark-client/src/test/java/org/apache/hudi/table/action/commit/TestUpsertPartitioner.java#L168-L190
We can add a UT for averageBytesPerRecord to validate that both commit and delta_commit are included in the algorithm.
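A possible shape for that UT, loosely following the existing TestUpsertPartitioner pattern linked above. This is a sketch, not the final test: serializedMetadata is a hypothetical helper that would serialize a HoodieCommitMetadata with the given byte/record totals, and it assumes the test lives alongside TestUpsertPartitioner so the package-visible averageBytesPerRecord is reachable.

```java
import static org.junit.jupiter.api.Assertions.assertEquals;
import static org.mockito.Mockito.mock;
import static org.mockito.Mockito.when;

import java.util.stream.Stream;

import org.apache.hudi.common.table.timeline.HoodieInstant;
import org.apache.hudi.common.table.timeline.HoodieTimeline;
import org.apache.hudi.common.util.Option;
import org.apache.hudi.config.HoodieCompactionConfig;
import org.apache.hudi.config.HoodieWriteConfig;
import org.junit.jupiter.api.Test;

public class TestAverageBytesPerRecordSketch {

  @Test
  public void averageBytesPerRecordConsidersDeltaCommits() throws Exception {
    // One completed commit and one completed deltacommit, newest (deltacommit) first.
    HoodieInstant commit = new HoodieInstant(false, HoodieTimeline.COMMIT_ACTION, "001");
    HoodieInstant deltaCommit = new HoodieInstant(false, HoodieTimeline.DELTA_COMMIT_ACTION, "002");

    HoodieTimeline timeline = mock(HoodieTimeline.class);
    when(timeline.empty()).thenReturn(false);
    when(timeline.getReverseOrderedInstants()).thenReturn(Stream.of(deltaCommit, commit));
    // serializedMetadata(bytes, records) is a hypothetical helper that returns the
    // byte[] of a HoodieCommitMetadata whose write stats sum to the given totals.
    when(timeline.getInstantDetails(deltaCommit)).thenReturn(Option.of(serializedMetadata(2048L, 4L)));
    when(timeline.getInstantDetails(commit)).thenReturn(Option.of(serializedMetadata(2048L, 8L)));

    // Keep the small-file threshold below 2048 bytes so both instants are eligible.
    HoodieWriteConfig config = HoodieWriteConfig.newBuilder()
        .withPath("/tmp/hoodie-ut")
        .withCompactionConfig(HoodieCompactionConfig.newBuilder().compactionSmallFileSize(1000L).build())
        .build();

    // The newest eligible instant is the deltacommit, so it should drive the
    // estimate: ceil(2048 / 4) = 512.
    assertEquals(512L, UpsertPartitioner.averageBytesPerRecord(timeline, config));
  }
}
```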
```diff
- if (totalBytesWritten > fileSizeThreshold && totalRecordsWritten > 0) {
+ if (totalBytesWritten > fileSizeThreshold && totalBytesWritten <= fileSizeUpper && totalRecordsWritten > 0) {
    avgSize = (long) Math.ceil((1.0 * totalBytesWritten) / totalRecordsWritten);
    break;
```
What's the purpose of this limitation?
To achieve a similar goal as #6864.
This consolidated implementation avoids fragmenting the logic and is also UT-friendly.
But totalBytesWritten is the size accumulated across all the written files; what is the meaning of comparing it with a single-file size threshold?
You are correct; I misunderstood totalBytesWritten.
Then there might be two options:
- consider the extra FilesInsert/FileUpdated stats
- filter out REPLACE commits within the averageBytesPerRecord function

Which one do you prefer?
I just updated the code to reflect option 2.
```diff
  while (instants.hasNext()) {
    HoodieInstant instant = instants.next();
+   if (instant.getAction().equals(REPLACE_COMMIT_ACTION)) {
+     continue;
```
I would prefer CollectionUtils.createSet(COMMIT_ACTION, DELTA_COMMIT_ACTION). One thing to note is that the avro log files use Java reflection to calculate the in-memory size of a record, while the parquet file size is the actual on-disk size, so it is more accurate. I'm not sure whether we should include the logs when there are parquet files; that is, the parquet file size should have higher priority than the logs.
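Concretely, that preference matches the call-site shape in the diff at the top of this thread; a sketch, with table and config assumed to be the surrounding UpsertPartitioner context:

```java
import static org.apache.hudi.common.table.timeline.HoodieTimeline.COMMIT_ACTION;
import static org.apache.hudi.common.table.timeline.HoodieTimeline.DELTA_COMMIT_ACTION;

import org.apache.hudi.common.table.timeline.HoodieTimeline;
import org.apache.hudi.common.util.CollectionUtils;

// Restrict the timeline to commit + deltacommit actions up front, rather than
// iterating every instant and skipping REPLACE commits inside the loop.
HoodieTimeline eligible = table.getMetaClient().getActiveTimeline()
    .getTimelineOfActions(CollectionUtils.createSet(COMMIT_ACTION, DELTA_COMMIT_ACTION))
    .filterCompletedInstants();
long averageRecordSize = averageBytesPerRecord(eligible, config);
```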
Hi @danny0405, I just reverted the code to the very first version.
Whether we include the logs or not, the current one would be better than the released one, at least for MOR tables.
Agree, it would be great if you can add a test for the method.
Closing this one as it is resolved in #10763.
Change Logs
Refine the average record size calculation by considering both commit and deltacommit actions.
Contributor's checklist