@@ -56,6 +56,7 @@
 import scala.Tuple2;

 import static org.apache.hudi.common.table.timeline.HoodieTimeline.COMMIT_ACTION;
+import static org.apache.hudi.common.table.timeline.HoodieTimeline.DELTA_COMMIT_ACTION;

 /**
  * Packs incoming records to be upserted, into buckets (1 bucket = 1 RDD partition).

@@ -170,7 +171,7 @@ private void assignInserts(WorkloadProfile profile, HoodieEngineContext context)
      * may result in OOM by making spark underestimate the actual input record sizes.
      */
     long averageRecordSize = averageBytesPerRecord(table.getMetaClient().getActiveTimeline()
-        .getTimelineOfActions(CollectionUtils.createSet(COMMIT_ACTION)).filterCompletedInstants(), config);
+        .getTimelineOfActions(CollectionUtils.createSet(COMMIT_ACTION, DELTA_COMMIT_ACTION)).filterCompletedInstants(), config);
Contributor:
Could we be more strict here in the calculation of average bytes per record? Delta commits may contain log files, and the number of bytes written to log files cannot be applied to writing parquet base files. Let's only consider parquet file sizes here for record size estimation.

Contributor:
There is a transformation ratio between log and parquet files, but it may not be very accurate.
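
For illustration, a hedged sketch of how such a ratio could be folded into the record size estimate. The helper name and the ratio handling are assumptions for this sketch, not code from this PR; Hudi's hoodie.logfile.to.parquet.compression.ratio config is related but is not used here.

// Hypothetical sketch only: convert log-file bytes into a parquet-equivalent
// figure using an assumed log-to-parquet size ratio, then compute bytes per
// record. Neither this helper nor its parameters come from the PR.
static long estimateAvgRecordSize(long parquetBytesWritten, long logBytesWritten,
                                  long recordsWritten, double logToParquetRatio) {
  long parquetEquivalentBytes = parquetBytesWritten + (long) (logBytesWritten * logToParquetRatio);
  return (long) Math.ceil((1.0 * parquetEquivalentBytes) / Math.max(1, recordsWritten));
}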

Author:
Would you advise how I can distinguish/separate them?

Contributor:
I think your change is fine now; we just need to add some tests for the method that you might abstract out.

Author:
As I commented previously, adding a UT for the assignInserts function is a little hard...
Any suggestions from your side?

Contributor:
We can add a UT for averageBytesPerRecord, I guess?

Author:
averageBytesPerRecord is not changed or impacted, as it just walks through the commits in reverse order to find the first eligible one for calculating the size. This explains why #6864 didn't modify it either.

Here are the existing UTs for the function. I know the value of UTs, but it is not easy to add one with the current code structure:
https://github.com/apache/hudi/blob/master/hudi-client/hudi-spark-client/src/test/java/org/apache/hudi/table/action/commit/TestUpsertPartitioner.java#L168-L190
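
For readers following along, a simplified paraphrase of the estimation logic being discussed, reconstructed from the description above; exact signatures and helpers in UpsertPartitioner may differ across Hudi versions.

import java.io.IOException;
import java.util.Iterator;

import org.apache.hudi.common.model.HoodieCommitMetadata;
import org.apache.hudi.common.table.timeline.HoodieInstant;
import org.apache.hudi.common.table.timeline.HoodieTimeline;
import org.apache.hudi.config.HoodieWriteConfig;

// Simplified paraphrase: walk the completed commits newest-first and take
// bytes/record from the first commit whose written bytes exceed a size
// threshold; otherwise fall back to the configured estimate.
static long averageBytesPerRecord(HoodieTimeline commitTimeline, HoodieWriteConfig config) throws IOException {
  long avgSize = config.getCopyOnWriteRecordSizeEstimate();
  long threshold = (long) (config.getRecordSizeEstimationThreshold() * config.getParquetSmallFileLimit());
  Iterator<HoodieInstant> instants = commitTimeline.getReverseOrderedInstants().iterator();
  while (instants.hasNext()) {
    HoodieInstant instant = instants.next();
    HoodieCommitMetadata metadata = HoodieCommitMetadata
        .fromBytes(commitTimeline.getInstantDetails(instant).get(), HoodieCommitMetadata.class);
    long totalBytes = metadata.fetchTotalBytesWritten();
    long totalRecords = metadata.fetchTotalRecordsWritten();
    if (totalBytes > threshold && totalRecords > 0) {
      avgSize = (long) Math.ceil((1.0 * totalBytes) / totalRecords);
      break; // the most recent eligible commit wins
    }
  }
  return avgSize;
}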

Contributor:
We can add a UT for averageBytesPerRecord to validate that both commit and delta_commit are included in the algorithm.
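
A rough sketch of what such a UT might look like, modeled loosely on the existing TestUpsertPartitioner tests linked above. The mocked timeline, the metadataBytes helper, and the exact builder/constructor signatures are assumptions that may need adjusting to the real test harness (the test must also live in the same package to reach the protected method).

import static org.apache.hudi.common.table.timeline.HoodieTimeline.COMMIT_ACTION;
import static org.apache.hudi.common.table.timeline.HoodieTimeline.DELTA_COMMIT_ACTION;
import static org.junit.jupiter.api.Assertions.assertEquals;
import static org.mockito.Mockito.mock;
import static org.mockito.Mockito.when;

import java.util.stream.Stream;

import org.apache.hudi.common.model.HoodieCommitMetadata;
import org.apache.hudi.common.model.HoodieWriteStat;
import org.apache.hudi.common.table.timeline.HoodieInstant;
import org.apache.hudi.common.table.timeline.HoodieTimeline;
import org.apache.hudi.common.util.Option;
import org.apache.hudi.config.HoodieCompactionConfig;
import org.apache.hudi.config.HoodieWriteConfig;
import org.junit.jupiter.api.Test;

public class TestAverageBytesPerRecord {

  // Hypothetical helper: serialized commit metadata carrying one write stat
  // with the given bytes/records written.
  private static Option<byte[]> metadataBytes(long bytes, long records) throws Exception {
    HoodieWriteStat stat = new HoodieWriteStat();
    stat.setTotalWriteBytes(bytes);
    stat.setNumWrites(records);
    HoodieCommitMetadata metadata = new HoodieCommitMetadata();
    metadata.addWriteStat("2023/01/01", stat);
    return Option.of(metadata.toJsonString().getBytes());
  }

  @Test
  public void averageBytesPerRecordConsidersDeltaCommits() throws Exception {
    HoodieTimeline timeline = mock(HoodieTimeline.class);
    HoodieInstant deltaCommit = new HoodieInstant(false, DELTA_COMMIT_ACTION, "002");
    HoodieInstant commit = new HoodieInstant(false, COMMIT_ACTION, "001");
    when(timeline.empty()).thenReturn(false);
    // Newest instant first, mirroring getReverseOrderedInstants() semantics.
    when(timeline.getReverseOrderedInstants()).thenReturn(Stream.of(deltaCommit, commit));
    when(timeline.getInstantDetails(deltaCommit)).thenReturn(metadataBytes(2000L, 10L));
    when(timeline.getInstantDetails(commit)).thenReturn(metadataBytes(5000L, 10L));

    HoodieWriteConfig config = HoodieWriteConfig.newBuilder()
        .withPath("/tmp/hoodie-test")
        // Keep the small-file limit low so both mocked instants clear the
        // record-size-estimation threshold.
        .withCompactionConfig(HoodieCompactionConfig.newBuilder().compactionSmallFileSize(1000).build())
        .build();

    // If delta commits are included, the newest (delta) commit drives the
    // estimate: 2000 bytes / 10 records = 200.
    assertEquals(200, UpsertPartitioner.averageBytesPerRecord(timeline, config));
  }
}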

     LOG.info("AvgRecordSize => " + averageRecordSize);

     Map<String, List<SmallFile>> partitionSmallFilesMap =