
Conversation

@scxwhite
Contributor

Brief change log

  • Improve compaction

I found that when the compaction plan is generated, the delta log files under each file group are arranged in the natural (ascending) order of instant time. In the majority of cases we can assume that the latest data is in the latest delta log file, so we instead sort the files from largest to smallest instant time. This largely avoids rewriting data during compaction and therefore shortens compaction time.
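For illustration, the descending sort could look roughly like the sketch below. This is a minimal standalone example; DeltaLogFile and its instantTime field are hypothetical stand-ins, not the actual Hudi classes.

import java.util.Comparator;
import java.util.List;
import java.util.stream.Collectors;

class DeltaLogFile {
  // Hypothetical stand-in: the commit instant that produced this log file.
  final String instantTime;
  DeltaLogFile(String instantTime) { this.instantTime = instantTime; }
}

class CompactionPlanSketch {
  // Sort delta log files from the newest instant to the oldest, so the
  // records most likely to win the merge are read first.
  static List<DeltaLogFile> newestFirst(List<DeltaLogFile> logFiles) {
    return logFiles.stream()
        .sorted(Comparator.comparing((DeltaLogFile f) -> f.instantTime).reversed())
        .collect(Collectors.toList());
  }
}

Since Hudi instant times are fixed-width timestamp strings, the lexicographic comparison above matches chronological order.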

In addition, when reading the delta log files, we compare the data in the ExternalSpillableMap with the delta log data. If the old record is selected, there is no need to rewrite the entry in the ExternalSpillableMap; rewriting the data wastes a lot of resources when the map has spilled to disk.
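The idea can be sketched as follows (a minimal example with hypothetical names; the real code uses Hudi's ExternalSpillableMap and HoodieRecordPayload#preCombine):

import java.util.HashMap;
import java.util.Map;

class MergeSketch {
  // Stand-in for the ExternalSpillableMap, keyed by record key.
  final Map<String, String> records = new HashMap<>();

  // Merge an incoming delta-log value against what is already in the map.
  void merge(String key, String newValue) {
    String oldValue = records.get(key);
    String combinedValue = preCombine(oldValue, newValue);
    // Write back only when the winner differs from the stored value; a
    // redundant put would trigger costly re-serialization once the map
    // has spilled to disk.
    if (!combinedValue.equals(oldValue)) {
      records.put(key, combinedValue);
    }
  }

  // Toy placeholder for HoodieRecordPayload#preCombine semantics: keep the
  // record that is already present.
  private String preCombine(String oldValue, String newValue) {
    return oldValue == null ? newValue : oldValue;
  }
}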

This pull request is already covered by existing tests, such as (please describe tests).

Committer checklist

  • [*] Has a corresponding JIRA in PR title & commit

  • [*] Commit message is descriptive of the change

  • [ ] CI is green

  • [ ] Necessary doc changes done or have another open PR

  • [ ] For large changes, please consider breaking it into sub-tasks under an umbrella JIRA.

@yihua yihua self-assigned this Dec 20, 2021

@vinothchandar (Member) left a comment

Have a clarification on the first fix. Could you add some UTs for this?

.getLatestFileSlices(partitionPath)
.filter(slice -> !fgIdsInPendingCompactionAndClustering.contains(slice.getFileGroupId()))
.map(s -> {
// We can think that the latest data is in the latest delta log file, so we sort it from large to small by instant time
Member

I think you are assuming that later writes in the log always overwrite the earlier ones? That is not always true.

Contributor Author

You're right, but in most cases the new data is in the latest delta log, so we sort from largest to smallest instant time. The program then avoids updating the data in the ExternalSpillableMap, which saves compaction time. What do you think?

Contributor Author

> Have a clarification on the first fix. Could you add some UTs for this?

OK, I'll try to add some UTs.

Contributor Author

> I think you are assuming that later writes in the log always overwrite the earlier ones? That is not always true.

In the compaction plan generation phase, I only changed the order in which delta log files are read. We have used this approach in our internal production environment for a month, and no data anomalies have occurred. However, I'm not sure how I should test this. Can you give me some suggestions?
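For illustration, one possible check is that merging the same records in ascending and descending order produces identical results whenever a strict ordering field decides the winner (a generic, self-contained sketch with hypothetical names, not the Hudi test harness):

import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class OrderInvarianceSketch {
  // Merge key/orderingVal pairs in the given order; the larger orderingVal
  // wins for each key, mimicking preCombine with a strict ordering field.
  static Map<String, Long> merge(List<Map.Entry<String, Long>> updates) {
    Map<String, Long> out = new HashMap<>();
    for (Map.Entry<String, Long> u : updates) {
      out.merge(u.getKey(), u.getValue(), Math::max);
    }
    return out;
  }

  public static void main(String[] args) {
    List<Map.Entry<String, Long>> asc = Arrays.asList(
        Map.entry("k1", 1L), Map.entry("k1", 2L), Map.entry("k2", 7L));
    List<Map.Entry<String, Long>> desc = new ArrayList<>(asc);
    Collections.reverse(desc);
    // The merge result must not depend on the read order of the log files.
    if (!merge(asc).equals(merge(desc))) {
      throw new AssertionError("merge result depends on read order");
    }
    System.out.println("order-invariant: " + merge(asc));
  }
}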

Contributor Author

In addition, I changed the reading order of the delta logs to avoid data rewriting as much as possible. HoodieRecordPayload#preCombine will still execute and select the correct data.

HoodieOperation operation = choosePrev ? oldRecord.getOperation() : hoodieRecord.getOperation();
records.put(key, new HoodieRecord<>(new HoodieKey(key, hoodieRecord.getPartitionPath()), combinedValue, operation));
// If combinedValue is oldValue, no need to re-put oldRecord
if (!combinedValue.equals(oldValue)) {
Member

This feels like a valid optimization.

@nsivabalan
Contributor

@yihua: Can you follow up on the review, please?

loukey-lj and others added 25 commits February 18, 2022 13:31
* [HUDI-3389] fix ColumnarArrayData ClassCastException issue

* [HUDI-3389] remove MapColumnVector.java, RowColumnVector.java, and add test case for array<int> field
…pache#4837)

* [HUDI-3446] Supports batch Reader in BootstrapOperator#loadRecords
* Fixing restore with metadata enabled

* Fixing test failures
…n operations are present using a config. (apache#4212)


Co-authored-by: sivabalan <[email protected]>
…ot be reused (apache#4861)

* Before the patch, the Flink streaming reader cached the meta client and thus the archived timeline; when fetching instant details from the reused timeline, an exception was thrown
* Add a method in HoodieTableMetaClient to return a fresh archived timeline each time
zhangyue19921010 and others added 22 commits February 25, 2022 16:46
ParquetColumnarRowSplitReader#batchSize is 2048, so changing MINI_BATCH_SIZE to 2048 will reduce the memory cache.
*  Use iterator to avoid eager materialization and be memory friendly
@scxwhite
Contributor Author

scxwhite commented Mar 2, 2022

I found that after modifying the reading order of the delta log, HoodieRecordPayload#preCombine may have problems during compaction (when the orderingVal of two records is the same, the most recently committed data will not be selected). I will submit a separate PR later to fix this issue.
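To see why ties are order-sensitive, consider a payload whose preCombine keeps `this` when ordering values are equal; reversing the read order then reverses which record survives (a sketch of typical payload semantics, not the exact Hudi implementation):

class PayloadSketch {
  final long orderingVal;
  final String data;

  PayloadSketch(long orderingVal, String data) {
    this.orderingVal = orderingVal;
    this.data = data;
  }

  // Mirrors the shape of HoodieRecordPayload#preCombine: choose between two
  // payloads for the same key. On a tie it keeps `this`, so whichever record
  // happens to be the incoming one wins.
  PayloadSketch preCombine(PayloadSketch other) {
    return other.orderingVal > this.orderingVal ? other : this;
  }

  public static void main(String[] args) {
    PayloadSketch older = new PayloadSketch(5L, "committed earlier");
    PayloadSketch newer = new PayloadSketch(5L, "committed later");
    // Ascending read order: the later commit is the incoming record and
    // wins the tie.
    System.out.println(newer.preCombine(older).data); // committed later
    // Descending read order: the earlier commit is the incoming record and
    // now wins the tie, dropping the latest data.
    System.out.println(older.preCombine(newer).data); // committed earlier
  }
}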

So this PR just optimizes the code.

@hudi-bot
Collaborator

hudi-bot commented Mar 2, 2022

CI report:

Bot commands
@hudi-bot supports the following commands:
  • @hudi-bot run azure: re-run the last Azure build

@scxwhite
Contributor Author

scxwhite commented Mar 2, 2022

Very sorry, my fault; there was a problem with the merge. I will split it into two PRs and resubmit.

