[HUDI-2833][Design] Merge small archive files instead of expanding indefinitely. #4078
Conversation
Hi @bhasudha. Sorry to bother you. Would you mind taking a look at this Hudi-on-S3 related issue?
yihua
left a comment
@zhangyue19921010 Thanks for putting thought into improving the archived timeline. Some functionality still relies on the archived timeline, such as HoodieRepairTool. My concern is that simply keeping the most recent few archive files may cause side effects given the information loss. Some further improvements I can think of are:
(1) Rewrite the archived timeline content into a smaller number of files
(2) When deleting the archived files, make sure the table does not have any corresponding base or log files from the contained instants, so there is essentially no information loss of the table states.
Wdyt?
Hi @yihua, thanks a lot for your attention. I agree with you: deleting archived files does lose historical instant information and affects some Hudi functions such as HoodieRepairTool. At present, using Hudi still involves some judgment calls on the user's side.
Sometimes users need a clear understanding of their configuration; capping the number of archive files loses historical instant information, much like how the cleaner limits time travel. (Of course we need to call this out in the documentation.) Fortunately, users can adopt this feature according to their own circumstances: if they need to keep all instant information, they simply leave it disabled; if they do not care about instants after archiving, they can turn it on and keep a smaller value. On the other hand, I think some information loss is inevitable since we cannot keep all the data forever; the questions are when and how. The improvement you mentioned is very reasonable, e.g., letting Hudi support appending to archive files on file systems that do not support append. Do you think we need to get it done in this PR, or can we walk step by step toward the final state? As for (1), rewriting the archived timeline content into a smaller number of files would lead to archive file write amplification.
I also echo Ethan's comment, but if this patch is guarded by a config flag and the default is not to clean up any archive files, I guess we should be good. Until we have a way to fold N archive files into one, at least users will have some way to trim them so they do not keep expanding indefinitely. As @zhangyue19921010 pointed out, we can call it out in our documentation and let users decide if they are okay enabling the config.
nsivabalan
left a comment
LGTM at a high level. Left some minor comments.
    .withDocumentation("The numbers of kept archive files under archived");

public static final ConfigProperty<String> CLEAN_ARCHIVE_FILE_ENABLE_DROP = ConfigProperty
    .key("hoodie.archive.clean.enable")
This could be confused with the regular clean. Can we call it "hoodie.auto.trim.archive.files" or "hoodie.auto.delete.archive.files" or something along those lines, and "hoodie.max.archive.files" for the previous config?
Thanks for your review. Changed to hoodie.auto.trim.archive.files and hoodie.max.archive.files.
while (iter.hasNext()) {
  files.add(iter.next());
}
assertEquals(archiveFilesToKeep, files.size());
Can we also add an assertion that the earliest files are deleted and not the latest ones in the archive folder?
Changed.
As discussed offline, we should warn users to avoid the config if they don't understand the mechanism. They should only use it if they know what they are doing. We can follow up with a more comprehensive mechanism around cleaning the archived timeline. @zhangyue19921010, you can create a Jira ticket to track the future directions.
public static final ConfigProperty<String> MAX_ARCHIVE_FILES_TO_KEEP_PROP = ConfigProperty
    .key("hoodie.max.archive.files")
    .defaultValue("10")
Let's make this noDefault() in case it's accidentally invoked?
Sure thing. Changed.
public static final ConfigProperty<String> AUTO_TRIM_ARCHIVE_FILES_DROP = ConfigProperty
    .key("hoodie.auto.trim.archive.files")
    .defaultValue("false")
    .withDocumentation("When enabled, Hoodie will keep the most recent " + MAX_ARCHIVE_FILES_TO_KEEP_PROP.key()
Let's add a WARNING to both configs, something like: WARNING: do not use this config unless you know what you're doing. If enabled, details of older archived instants are deleted, resulting in information loss in the archived timeline, which may affect tools like the CLI and repair. Only enable this if you hit severe performance issues when retrieving the archived timeline. (Feel free to add more details.)
Appreciate it. Changed
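For context, a minimal sketch of how the two configs might look with the warning text folded in, assuming the builder methods shown elsewhere in this PR (key / noDefaultValue / defaultValue / withDocumentation); the exact keys, defaults, and wording in the merged code may differ:

```java
// Sketch only: config keys and warning text as discussed in this thread, not the final merged code.
public static final ConfigProperty<String> MAX_ARCHIVE_FILES_TO_KEEP_PROP = ConfigProperty
    .key("hoodie.max.archive.files")
    .noDefaultValue()
    .withDocumentation("The number of archive files to keep under the archived folder. "
        + "WARNING: do not use this config unless you know what you are doing; older archived "
        + "instants are deleted, which may affect tools like the CLI and repair.");

public static final ConfigProperty<String> AUTO_TRIM_ARCHIVE_FILES_DROP = ConfigProperty
    .key("hoodie.auto.trim.archive.files")
    .defaultValue("false")
    .withDocumentation("When enabled, Hoodie keeps only the most recent archive files and deletes older ones. "
        + "WARNING: enabling this loses part of the archived timeline; only enable it if you hit severe "
        + "performance issues when retrieving the archived timeline.");
```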
public static final ConfigProperty<String> MAX_ARCHIVE_FILES_TO_KEEP_PROP = ConfigProperty
    .key("hoodie.max.archive.files")
    .defaultValue("10")
    .withDocumentation("The numbers of kept archive files under archived.");

public static final ConfigProperty<String> AUTO_TRIM_ARCHIVE_FILES_DROP = ConfigProperty
    .key("hoodie.auto.trim.archive.files")
    .defaultValue("false")
    .withDocumentation("When enabled, Hoodie will keep the most recent " + MAX_ARCHIVE_FILES_TO_KEEP_PROP.key()
        + " archive files and delete older one which lose part of archived instants information.");
Should these configs live in HoodieWriteConfig instead of HoodieCompactionConfig?
Emmm, because all the archive-related configs such as hoodie.archive.automatic, hoodie.commits.archival.batch, and hoodie.keep.min.commits already live in HoodieCompactionConfig, maybe it's better to keep them together :)
Got it. Ideally, the archive configs should not be in HoodieCompactionConfig. Let's keep it as is for now and clean this up in a follow-up PR.
if (!skipped.isEmpty()) {
  LOG.info("Deleting archive files : " + skipped);
  context.setJobStatus(this.getClass().getSimpleName(), "Delete archive files");
  Map<String, Boolean> result = deleteFilesParallelize(metaClient, skipped, context, true);
Remove the local variable assignment since it's not used?
Changed.
| ""); | ||
| List<HoodieLogFile> sortedLogFilesList = allLogFiles.sorted(HoodieLogFile.getReverseLogFileComparator()).collect(Collectors.toList()); | ||
| if (!sortedLogFilesList.isEmpty()) { | ||
| List<String> skipped = sortedLogFilesList.stream().skip(maxArchiveFilesToKeep).map(HoodieLogFile::getPath).map(Path::toString).collect(Collectors.toList()); |
nit: skipped -> archiveFilesToDelete
Changed.
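For reference, a consolidated sketch of the trim step with the rename applied, stitched together from the hunks quoted above (this mirrors the review discussion, not necessarily the exact merged code):

```java
// Sort archive log files newest-first, keep the most recent maxArchiveFilesToKeep, delete the rest.
List<HoodieLogFile> sortedLogFiles = allLogFiles
    .sorted(HoodieLogFile.getReverseLogFileComparator())
    .collect(Collectors.toList());
if (!sortedLogFiles.isEmpty()) {
  List<String> archiveFilesToDelete = sortedLogFiles.stream()
      .skip(maxArchiveFilesToKeep)
      .map(HoodieLogFile::getPath)
      .map(Path::toString)
      .collect(Collectors.toList());
  if (!archiveFilesToDelete.isEmpty()) {
    LOG.info("Deleting archive files : " + archiveFilesToDelete);
    context.setJobStatus(this.getClass().getSimpleName(), "Delete archive files");
    deleteFilesParallelize(metaClient, archiveFilesToDelete, context, true);
  }
}
```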
assertFalse(currentExistArchiveFiles.containsAll(archiveFilesDeleted));
// assert most recent archive files are preserved
assertTrue(currentExistArchiveFiles.containsAll(archiveFilesKept));
}
Add a check when archive trim is disabled as well?
Added.
Just raised a new ticket to track the further improvements: https://issues.apache.org/jira/browse/HUDI-3038
yihua
left a comment
Overall LGTM. Left a couple of nits on the naming.
| + " This is critical in computing the insert parallelism and bin-packing inserts into small files."); | ||
|
|
||
| public static final ConfigProperty<String> MAX_ARCHIVE_FILES_TO_KEEP_PROP = ConfigProperty | ||
| .key("hoodie.max.archive.files") |
nit: after thinking about the naming again, let's use the hoodie.archive prefix for the archive configs and update the variable naming accordingly.
For this one, it can be hoodie.archive.max.files.
I was thinking more like "hoodie.max.archive.files.to.retain" sort of.
This falls under the archived timeline, so it's better to have the same hoodie.archive prefix. Given that the writer archives instants instead of files, this shouldn't create confusion.
Changed.
    .withDocumentation("The numbers of kept archive files under archived.");

public static final ConfigProperty<String> AUTO_TRIM_ARCHIVE_FILES_DROP = ConfigProperty
    .key("hoodie.auto.trim.archive.files")
This one can be hoodie.archive.auto.trim.enable.
+1
Changed.
@hudi-bot run azure
} catch (Exception originalException) {
  // merge small archive files may left uncompleted archive file which will cause exception.
  // need to ignore this kind of exception here.
  try {
    Path planPath = new Path(metaClient.getArchivePath(), "mergeArchivePlan");
    HoodieWrapperFileSystem fileSystem = metaClient.getFs();
    if (fileSystem.exists(planPath)) {
      HoodieMergeArchiveFilePlan plan = TimelineMetadataUtils.deserializeAvroMetadata(FileIOUtils.readDataFromPath(fileSystem, planPath).get(), HoodieMergeArchiveFilePlan.class);
      String mergedArchiveFileName = plan.getMergedArchiveFileName();
      if (!StringUtils.isNullOrEmpty(mergedArchiveFileName) && fs.getPath().getName().equalsIgnoreCase(mergedArchiveFileName)) {
        LOG.warn("Catch exception because of reading uncompleted merging archive file " + mergedArchiveFileName + ". Ignore it here.");
        continue;
      }
    }
    throw originalException;
  } catch (Exception e) {
    // If anything wrong during parsing merge archive plan, we need to throw the original exception.
    // For example corrupted archive file and corrupted plan are both existed.
    throw originalException;
  }
We use this code to check whether originalException is caused by a corrupted merged archive file and ignore it in that case. Anything else needs to be thrown again.
Hi @nsivabalan and @yihua,
The common concern is incomplete/duplicate data left behind when the last merge of small archive files fails and the current Hudi writer/commit is configured with archive file merging disabled.
Ideally we need to check and clean up dirty data before every archive. The reasons I think we need this flag before doing the cleanup work are:
- This is a new feature, so it's safer to have a default-false switch here.
- I am pretty worried about multi-writer scenarios; at least this gives us a way to ensure only one writer does the merge work.
As for making sure that incomplete data causes no damage when loading the archived timeline until the next cleanup:
- we use a HashSet to avoid duplicate instants when loading archived instants.
- we use this try-catch to handle exceptions caused by loading an incomplete merged archive file.
As the next step, maybe we can take care of multi-writer, let this run stably for some time in my staging/production environment, and finally remove the strict restriction around verifyLastMergeArchiveFilesIfNecessary here :)
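To make the recovery idea above concrete, here is a hedged sketch of what a pre-archive verification could look like (the actual verifyLastMergeArchiveFilesIfNecessary in the PR may differ in signature and details; it presumably also handles a corrupted plan by deleting it):

```java
// If a previous merge left a plan behind, remove the possibly half-written merged archive
// file and the plan itself so the next archive run starts from a clean state.
private void verifyLastMergeArchiveFilesIfNecessary() throws IOException {
  Path planPath = new Path(metaClient.getArchivePath(), "mergeArchivePlan");
  HoodieWrapperFileSystem fileSystem = metaClient.getFs();
  if (fileSystem.exists(planPath)) {
    HoodieMergeArchiveFilePlan plan = TimelineMetadataUtils.deserializeAvroMetadata(
        FileIOUtils.readDataFromPath(fileSystem, planPath).get(), HoodieMergeArchiveFilePlan.class);
    Path mergedArchiveFile = new Path(metaClient.getArchivePath(), plan.getMergedArchiveFileName());
    if (fileSystem.exists(mergedArchiveFile)) {
      fileSystem.delete(mergedArchiveFile, false);
    }
    fileSystem.delete(planPath, false);
  }
}
```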
I agree that we should have a feature flag to turn all new logic off and skip the corrupted merged archive files when loading the archive timeline, in case there is an incomplete archive merge operation and the feature is turned off in the next run.
@hudi-bot run azure
Hi @yihua, all comments are addressed and the Azure CI passed. PTAL. Thanks a lot :)
nsivabalan
left a comment
LGTM. @yihua: can you please follow up?
  }
}

public static void createFileInPath(FileSystem fileSystem, org.apache.hadoop.fs.Path fullPath, Option<byte[]> content) {
nit: declare throws HoodieIOException in the method signature, so that the caller can decide whether the exception can be ignored?
Sure. Changed.
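For illustration, a hedged sketch of the signature change (declaring the unchecked HoodieIOException so callers can see it and decide whether to swallow it); the body is illustrative, not the exact code in FileIOUtils:

```java
public static void createFileInPath(FileSystem fileSystem, org.apache.hadoop.fs.Path fullPath,
                                    Option<byte[]> content) throws HoodieIOException {
  try {
    if (content.isPresent()) {
      // create (or overwrite) the file and write the given bytes
      try (FSDataOutputStream out = fileSystem.create(fullPath, true)) {
        out.write(content.get());
      }
    } else {
      fileSystem.createNewFile(fullPath);
    }
  } catch (IOException e) {
    // surface I/O failures as the declared unchecked HoodieIOException
    throw new HoodieIOException("Failed to create file " + fullPath, e);
  }
}
```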
  }
}

public static Option<byte[]> readDataFromPath(FileSystem fileSystem, org.apache.hadoop.fs.Path detailPath) {
Similar here.
Changed.
private final int minInstantsToKeep;
private final HoodieTable<T, I, K, O> table;
private final HoodieTableMetaClient metaClient;
private final String mergeArchivePlanName = "mergeArchivePlan";
Should this be a public static final variable?
Changed. Just added public static final String MERGE_ARCHIVE_PLAN_NAME = "mergeArchivePlan"; in HoodieArchivedTimeline.java.
Sg
// merge small archive files may left uncompleted archive file which will cause exception.
// need to ignore this kind of exception here.
try {
  Path planPath = new Path(metaClient.getArchivePath(), "mergeArchivePlan");
Reuse HoodieTimelineArchiveLog::mergeArchivePlanName?
We have to add public static final String MERGE_ARCHIVE_PLAN_NAME = "mergeArchivePlan"; in HoodieArchivedTimeline.java and let HoodieTimelineArchiveLog.java use it, because of a dependency issue.
Got it.
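For clarity, a small sketch of the resolution (class bodies elided; only the constant and one reference are shown):

```java
// In HoodieArchivedTimeline (timeline side), so the archiver can reference it
// without introducing a circular dependency:
public static final String MERGE_ARCHIVE_PLAN_NAME = "mergeArchivePlan";

// Used when locating the plan file, e.g. in HoodieTimelineArchiveLog:
Path planPath = new Path(metaClient.getArchivePath(), HoodieArchivedTimeline.MERGE_ARCHIVE_PLAN_NAME);
```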
Path planPath = new Path(metaClient.getArchivePath(), "mergeArchivePlan");
HoodieWrapperFileSystem fileSystem = metaClient.getFs();
if (fileSystem.exists(planPath)) {
  HoodieMergeArchiveFilePlan plan = TimelineMetadataUtils.deserializeAvroMetadata(FileIOUtils.readDataFromPath(fileSystem, planPath).get(), HoodieMergeArchiveFilePlan.class);
The logic here looks okay to me. Could you add a few unit tests to guard this logic and the failure recovery in the archival merging path as well, since this logic is critical?
I'm thinking about the following two cases:
(1) Construct a corrupted mergeArchivePlan file with random content so that it cannot be deserialized.
(1.1) When archival merging is enabled, the plan should be deleted first.
(1.2) When archival merging is disabled, the archived timeline can still be read successfully.
(1.3) If there are other corrupted archived files not from merging, the loading of archived timeline should fail and original exception should be thrown.
(2) Construct a working mergeArchivePlan file and a corrupted merged archive file with random content so that it cannot be deserialized.
(2.1) When archival merging is enabled, the corrupted merged archive file should be deleted first and proceed.
(2.2) When archival merging is disabled, the archived timeline can still be read successfully and the corrupted archive file is skipped.
Sure thing, added:
- testMergeSmallArchiveFilesRecoverFromBuildPlanFailed to cover (1)
- testMergeSmallArchiveFilesRecoverFromMergeFailed to cover (2)
- Also added testLoadArchiveTimelineWithDamagedPlanFile and testLoadArchiveTimelineWithUncompletedMergeArchiveFile to guard the archived timeline loading logic.
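For illustration, a hedged sketch of what case (1.2)-style coverage might look like (the metaClient setup and exact assertions are assumptions; the actual tests in the PR may differ):

```java
@Test
public void testLoadArchiveTimelineWithDamagedPlanFile() throws Exception {
  // Write random bytes to the merge plan path so it cannot be deserialized as Avro.
  Path planPath = new Path(metaClient.getArchivePath(), "mergeArchivePlan");
  try (FSDataOutputStream out = metaClient.getFs().create(planPath, true)) {
    out.write("not-a-valid-avro-plan".getBytes(StandardCharsets.UTF_8));
  }
  // With merging disabled, loading the archived timeline should still succeed
  // despite the damaged plan file.
  HoodieArchivedTimeline archivedTimeline = metaClient.getArchivedTimeline();
  assertNotNull(archivedTimeline);
}
```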
@hudi-bot run azure
yihua
left a comment
LGTM. I made one small nit fix to your PR.
cc @vinothchandar: this PR adds new functionality in the archived timeline behind a feature flag, plus a piece of error-handling logic which cannot be feature-flagged. You may want to take another look.
https://issues.apache.org/jira/browse/HUDI-2833
What is the purpose of the pull request
As we know, most storage systems do not support the append action, so Hudi creates a new archive file under the archived directory every time it archives.
As time goes by, there may be thousands of archive files, most of which are no longer useful.
It may be worthwhile to have a function that merges small archive files into bigger ones.
Add three configs to control the small-archive-file merging behavior (a brief usage example follows the list):
hoodie.archive.auto.merge.enable (default false): When enabled, Hudi will automatically merge several small archive files into a larger one. It's useful when the storage scheme doesn't support the append operation.
hoodie.archive.files.merge.batch.size (default 10): The number of small archive files merged at once.
hoodie.archive.merge.small.file.limit.bytes (default 20971520): The archive file size limit below which an archive file becomes a candidate to be selected as a small file.
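For example, a minimal sketch of enabling the feature via writer properties (how these properties are plumbed into your HoodieWriteConfig or job depends on your setup):

```java
Properties props = new Properties();
// Opt in to merging small archive files (off by default).
props.setProperty("hoodie.archive.auto.merge.enable", "true");
// Merge up to 10 small archive files at a time.
props.setProperty("hoodie.archive.files.merge.batch.size", "10");
// Archive files smaller than ~20 MB are merge candidates.
props.setProperty("hoodie.archive.merge.small.file.limit.bytes", "20971520");
```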
Add a new plan named HoodieMergeArchiveFilePlan.
We use this plan to record which candidate small archive files are merged into which bigger archive file.
It's useful for dealing with merge action failures.
Code Flow
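A hedged outline of the flow (helper names such as getSmallArchiveFiles, buildArchiveMergePlan, mergeArchiveFiles, deleteOriginArchiveFiles, and deleteMergeArchivePlan are illustrative, not necessarily the names in the PR; the point is that the plan is persisted before the merge so a failure can be detected and recovered):

```java
private void mergeArchiveFilesIfNecessary(HoodieEngineContext context) throws IOException {
  // 1. Pick archive files below hoodie.archive.merge.small.file.limit.bytes,
  //    up to hoodie.archive.files.merge.batch.size of them.
  List<FileStatus> candidates = getSmallArchiveFiles();
  if (candidates.isEmpty()) {
    return;
  }
  // 2. Persist the HoodieMergeArchiveFilePlan first (candidate files + target file name),
  //    so an interrupted merge can be cleaned up on the next run.
  String mergedArchiveFileName = computeMergedArchiveFileName(candidates);
  buildArchiveMergePlan(candidates, mergedArchiveFileName);
  // 3. Write the merged (bigger) archive file.
  mergeArchiveFiles(candidates, mergedArchiveFileName);
  // 4. Delete the original small archive files, then the plan marker.
  deleteOriginArchiveFiles(context, candidates);
  deleteMergeArchivePlan();
}
```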