
Conversation

@prashantwason (Member) commented on May 10, 2023:

[HUDI-6200] Enhancements to the MDT for improving performance of larger indexes.

Change Logs

Major changes

  1. Moved MDT compaction out of the pre-commit phase into the initialization phase within the write client

    • MDT compaction and clean are attempted during the WriteClient initialization stage, before any changes to the dataset are initiated.
    • This ensures correct ordering of the compactions.
    • This also creates a codepath to call compaction on the MDT without actual commits.
  2. Removed support for setting the fileGroup count on the MetadataPartitionType enum

    • Since enums are singletons, this design prevented having multiple datasets with different fileGroup counts for MDT partitions in the same JVM instance.
    • When there is a large number of fileGroups (assume 10K+) for an MDT partition, listing its fileSlices may take a considerable amount of time (10K slices * M basefile versions + 10K * N log files per slice). Hence, this is no longer done in the HoodieBackedTableMetadataWriter constructor and is delayed until actually required (prepRecords).
  3. When initializing multiple MDT partitions together, they are handled one at a time

    • Reduces memory requirements, as larger indexes may need a huge amount of memory.
    • Improves the chances of the indexes being built successfully.
    • If the FILES partition is already initialized, it is listed to find the files, thereby saving an expensive listing of the entire dataset while initializing additional partitions.
  4. Initial commit into the MDT to initialize a new partition is now a bulkCommit

    • HFiles are created directly for the first commit.
    • Since first commits write a large amount of data, a bulkCommit into a basefile is much faster than writing log files.
    • Added SparkHoodieMetadataBulkInsertPartitioner as the bulk insert partitioner. It is optimized to partition records in a single pass.
  5. The number of fileGroups for each MDT partition is determined as part of the initialization process.

    • This allows the fileGroup count to be optimized based on the number of records being added (see the sketch after this list).
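
A minimal sketch of how the fileGroup count in item 5 could be derived from the record count. This is an illustration only, not the code added by this PR; the class, parameter names and bounds are assumptions.

class FileGroupCountEstimator {
  // Estimate how many file groups a new MDT partition needs, given the number of
  // records to be written and a target number of records per file group.
  static int estimateFileGroupCount(long totalRecords,
                                    long targetRecordsPerFileGroup,
                                    int minFileGroups,
                                    int maxFileGroups) {
    // Ceiling division: file groups needed to keep each one near the target size.
    long needed = (totalRecords + targetRecordsPerFileGroup - 1) / targetRecordsPerFileGroup;
    // Clamp to configured bounds so tiny or huge partitions stay within sane limits.
    return (int) Math.max(minFileGroups, Math.min(maxFileGroups, needed));
  }
}

For example, indexing 50 million records with a target of 5 million records per file group would yield 10 file groups, subject to the configured bounds.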

Other changes:

  1. Created a helper method to check whether an MDT partition is enabled: HoodieTableConfig.isMetadataPartitionEnabled()
  2. Created a helper method to set the state (enabled/disabled) of an MDT partition: HoodieTableConfig.setMetadataPartitionState()
  3. Created a helper method to delete the MDT: HoodieTableMetadataUtil.deleteMetadataTable(). It correctly updates the hoodie.properties file.
  4. Created a helper method to set the inflight state (enabled/disabled) of an MDT partition: HoodieTableMetadataUtil.setMetadataInflightPartitions
  5. Created a helper method to delete an MDT partition: deleteMetadataTablePartition. It correctly updates the hoodie.properties file.

Impact

TBD

Risk level (write none, low, medium or high below)

TBD

Documentation Update

None

Contributor's checklist

  • Read through contributor's guide
  • Change Logs and Impact were stated clearly
  • Adequate tests were added if applicable
  • CI passed

@nsivabalan (Contributor) left a comment:

Will continue to review tomorrow; sending the feedback I have for now.

.initTable(hadoopConf.get(), metadataWriteConfig.getBasePath());
.setPopulateMetaFields(DEFAULT_METADATA_POPULATE_META_FIELDS)
.setKeyGeneratorClassProp(HoodieTableMetadataKeyGenerator.class.getCanonicalName())
.initTable(hadoopConf.get(), metadataWriteConfig.getBasePath());
Contributor: Fix the indentation.

Member Author: Fixed.

@nsivabalan (Contributor) left a comment:

A couple of pending items for me to review or follow up:

  • Revisit the async indexer code/flow
  • Where exactly do we determine the file group count for every partition?

@prashantwason force-pushed the pw_one_index_at_a_time branch from e2785f4 to 364b1fe on May 12, 2023 17:05
@nsivabalan (Contributor) left a comment:

I ran into a compilation issue while reviewing this patch. Here is the diff to fix it:

git diff
diff --git a/hudi-client/hudi-spark-client/src/test/java/org/apache/hudi/table/TestCleaner.java b/hudi-client/hudi-spark-client/src/test/java/org/apache/hudi/table/TestCleaner.java
index 5313d63575..a3bd3d9536 100644
--- a/hudi-client/hudi-spark-client/src/test/java/org/apache/hudi/table/TestCleaner.java
+++ b/hudi-client/hudi-spark-client/src/test/java/org/apache/hudi/table/TestCleaner.java
@@ -598,7 +598,7 @@ public class TestCleaner extends HoodieClientTestBase {
           }
         })
     );
-    metadataWriter.update(commitMetadata, "00000000000001", false);
+    metadataWriter.update(commitMetadata, "00000000000001");
     metaClient.getActiveTimeline().saveAsComplete(
         new HoodieInstant(State.INFLIGHT, HoodieTimeline.COMMIT_ACTION, "00000000000001"),
         Option.of(commitMetadata.toJsonString().getBytes(StandardCharsets.UTF_8)));
@@ -1053,7 +1053,7 @@ public class TestCleaner extends HoodieClientTestBase {
           }
         })
     );
-    metadataWriter.update(commitMetadata, "00000000000001", false);
+    metadataWriter.update(commitMetadata, "00000000000001");
     metaClient.getActiveTimeline().saveAsComplete(
         new HoodieInstant(State.INFLIGHT, HoodieTimeline.COMMIT_ACTION, "00000000000001"),
         Option.of(commitMetadata.toJsonString().getBytes(StandardCharsets.UTF_8)));
@@ -1179,7 +1179,7 @@ public class TestCleaner extends HoodieClientTestBase {
         throw new RuntimeException(e);
       }
     });
-    metadataWriter.update(commitMeta, instantTime, false);
+    metadataWriter.update(commitMeta, instantTime);
     metaClient.getActiveTimeline().saveAsComplete(
         new HoodieInstant(State.INFLIGHT, HoodieTimeline.COMMIT_ACTION, instantTime),
         Option.of(commitMeta.toJsonString().getBytes(StandardCharsets.UTF_8)));

Went through the async indexer flow; it looks OK to me. I have not tested it, though; will test it out while we try to land this patch.

@nsivabalan (Contributor) left a comment:

Re-reviewed again.

@danny0405 (Contributor) left a comment:

Thanks for the contribution, I have reviewed and left some comments.

partitionsInflight.remove(partition.getPartitionPath());
}
setValue(TABLE_METADATA_PARTITIONS, partitions.stream().sorted().collect(Collectors.joining(CONFIG_VALUES_DELIMITER)));
setValue(TABLE_METADATA_PARTITIONS_INFLIGHT, partitionsInflight.stream().sorted().collect(Collectors.joining(CONFIG_VALUES_DELIMITER)));
Contributor: Do we need to persist these options?

Member Author: It is persisted from the caller side - HoodieTableMetadataUtil.setMetadataPartitionState.

I have removed the HoodieTableMetadataUtil.setMetadataPartitionState as it was unnecessary.

case UPSERT_PREPPED:
case BULK_INSERT:
case BULK_INSERT_PREPPED:
case DELETE:
@danny0405 (Contributor) commented on May 16, 2023:

Enumerating the write operations is really hard to maintain; can we trigger the table service regardless of the operation?

Member Author: On second thought, the switch is not necessary. The above code runs within a transaction lock, so there should not be any conflicts from multiple writers optimizing the MDT together. The checks within performTableServices should be light enough, or we can optimize them. (A hedged sketch of this flow follows.)
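
A minimal sketch of the flow described above, with hypothetical stand-in types and names; it is not the PR's code. The point is that the MDT table services run under the transaction lock for every write operation, so no per-operation switch is needed.

import java.util.concurrent.locks.ReentrantLock;

class MetadataTableServiceRunner {
  // Stands in for the transaction lock mentioned above.
  private final ReentrantLock txnLock = new ReentrantLock();

  // Hypothetical interface; the real writer exposes performTableServices with its own signature.
  interface MetadataWriter {
    void performTableServices(String inFlightInstantTime); // compaction + clean on the MDT
  }

  void beforeDataTableWrite(MetadataWriter metadataWriter, String inFlightInstantTime) {
    txnLock.lock();
    try {
      // Runs for every operation type; internal checks decide whether anything is actually due.
      metadataWriter.performTableServices(inFlightInstantTime);
    } finally {
      txnLock.unlock();
    }
  }
}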

if (DEFAULT_METADATA_POPULATE_META_FIELDS != metadataMetaClient.getTableConfig().populateMetaFields()) {
LOG.info("Re-initiating metadata table properties since populate meta fields have changed");
metadataMetaClient = initializeMetaClient(DEFAULT_METADATA_POPULATE_META_FIELDS);
metadataMetaClient = initializeMetaClient();
Contributor: If the MDT does not exist and metadataMetaClient.getTableConfig().populateMetaFields() is true, initializeMetaClient() could be invoked twice, which could throw an exception.

Member Author: Good catch. Moved this block to within the try-catch.

.withFileExtension(HoodieLogFile.DELTA_EXTENSION).build();
.withSizeThreshold(metadataWriteConfig.getLogFileMaxSize())
.withFs(dataMetaClient.getFs())
.withRolloverLogWriteToken(HoodieLogFormat.DEFAULT_WRITE_TOKEN)
Contributor: Fix the indentation.

Member Author: Fixed.


// Do timeline validation before scheduling compaction/logcompaction operations.
if (!validateTimelineBeforeSchedulingCompaction(inFlightInstantTimestamp, latestDeltacommitTime)) {
return;
Contributor: We should not return directly here, because archiving is also blocked; even if no compaction plan can be scheduled, archiving should still be triggered. (A sketch of the suggested flow follows.)
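
A minimal sketch of the control flow suggested above, using the method names visible in the snippet plus assumed helper names (cleanIfNecessary, archiveIfNecessary); it is not the actual code. Only the compaction scheduling is skipped when the timeline validation fails, while cleaning and archiving still run.

abstract class MetadataTableServices {
  abstract boolean validateTimelineBeforeSchedulingCompaction(String inFlightInstantTimestamp, String latestDeltacommitTime);
  abstract void compactIfNecessary(String latestDeltacommitTime);
  abstract void cleanIfNecessary();
  abstract void archiveIfNecessary();

  void runTableServices(String inFlightInstantTimestamp, String latestDeltacommitTime) {
    // Do timeline validation before scheduling compaction/logcompaction operations,
    // but do not return early: archiving must not be blocked by it.
    if (validateTimelineBeforeSchedulingCompaction(inFlightInstantTimestamp, latestDeltacommitTime)) {
      compactIfNecessary(latestDeltacommitTime);
    }
    cleanIfNecessary();
    archiveIfNecessary();
  }
}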

Member Author: Done.

Contributor: I see your point, but in reality I feel we may not gain much. Unless compaction in the MDT kicks in, archival might not have anything to do after the last time it was able to archive something. So it is not a bad idea to skip archival when there is no compaction. The same applies to clean as well, but let's not optimize anything in this patch.

Contributor: Regarding "unless compaction in MDT kicks in, archival might not have anything to do after the last time it was able to archive something": then archiving will always be blocked by the compaction.

@danny0405 (Contributor) left a comment:

Overall looks good. There are many unnecessary indentation changes; can you set up the checkstyle of your IDEA again following the rules here: https://github.com/apache/hudi/blob/master/style/checkstyle.xml (if you installed the checkstyle plugin, you can import it manually)?

@danny0405 (Contributor): @prashantwason Did you notice that there are conflicts in the code?

@prashantwason force-pushed the pw_one_index_at_a_time branch from 226c96c to ad81a08 on June 6, 2023 09:57
@danny0405 (Contributor) left a comment:

+1, looks good, we are good to land once all the CI tests are green.

} else if (FSUtils.isDataFile(status.getPath())) {
// Regular HUDI data file (base file or log file)
filenameToSizeMap.put(status.getPath().getName(), status.getLen());
String dataFileCommitTime = FSUtils.getCommitTime(status.getPath().getName());
Contributor: In the case of a MOR table, the base instant time of a log file could be earlier than the actual delta commit time. So we might skip log files based on this logic, since in L1125 we are filtering for files whose commit time < max instant time. Maybe we should use the last modification time instead?

Contributor: This bug may exist even without this patch; we might need a follow-up fix.

try (HoodieTableMetadataWriter writer = SparkHoodieBackedTableMetadataWriter.create(context.getHadoopConf().get(), config,
context, Option.empty(), inFlightInstantTimestamp)) {
context, Option.empty(), inFlightInstantTimestamp)) {
if (writer.isInitialized()) {
Contributor: We added the guard rail for table services in the MDT to be triggered only by regular writers in the data table, so that in single-writer mode with async table services there won't be any race conditions.
Ref: #3900
But the code has evolved and we now automatically enable the in-process lock provider for single-writer mode with async table services, so we should be good to remove the constraint. Just that we might keep triggering the scheduling of compaction in the MDT every time; maybe we can check the active timeline for when compaction was last triggered and add some optimization (a sketch follows).
Nothing is required for this patch, but as a follow-up.
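
A minimal sketch of that follow-up idea, assuming the standard Hudi timeline accessors; the class name and threshold are made up for illustration and this is not part of the PR.

import org.apache.hudi.common.table.HoodieTableMetaClient;
import org.apache.hudi.common.table.timeline.HoodieInstant;
import org.apache.hudi.common.util.Option;

class MdtCompactionGate {
  private static final int MAX_DELTA_COMMITS_BEFORE_COMPACTION = 10; // hypothetical threshold

  // Count completed delta commits on the MDT timeline since the last completed
  // compaction commit; only attempt to schedule a new compaction past the threshold.
  static boolean shouldAttemptCompaction(HoodieTableMetaClient metadataMetaClient) {
    Option<HoodieInstant> lastCompaction = metadataMetaClient.getActiveTimeline()
        .getCommitTimeline().filterCompletedInstants().lastInstant();
    String lastCompactionTime = lastCompaction.isPresent() ? lastCompaction.get().getTimestamp() : "0";
    long deltaCommitsSince = metadataMetaClient.getActiveTimeline()
        .getDeltaCommitTimeline().filterCompletedInstants()
        .findInstantsAfter(lastCompactionTime, Integer.MAX_VALUE)
        .countInstants();
    return deltaCommitsSince >= MAX_DELTA_COMMITS_BEFORE_COMPACTION;
  }
}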

@nsivabalan (Contributor) left a comment:

LGTM. Once CI passes, we can land.

Comment on lines -134 to -139
if (canTriggerTableService) {
// trigger compaction before doing the delta commit. this is to ensure, if this delta commit succeeds in metadata table, but failed in data table,
// we would have compacted metadata table and so could have included uncommitted data which will never be ignored while reading from metadata
// table (since reader will filter out only from delta commits)
compactIfNecessary(writeClient, instantTime);
}
Member: Why is this removed? Don't we want compaction to run?

Contributor: You are right, we should trigger the metadata table compaction on each commit.

@codope force-pushed the pw_one_index_at_a_time branch from df3a5b5 to da92c6c on June 8, 2023 10:26
@hudi-bot (Collaborator) commented on Jun 8, 2023:

CI report:

Bot commands: @hudi-bot supports the following commands:
  • @hudi-bot run azure: re-run the last Azure build

@nsivabalan (Contributor): CI is green.


@Override
public String getFileIdPfx(int partitionId) {
return fileIDPfxs.get(partitionId);
Contributor: This may not be right when one HFile in a file group becomes too large to write to. There is an upper size threshold for each base file handle, so HoodieAvroHFileWrite#canWrite may return false, and the write handle factory would then suffix the file group id with another auto-incremented number like -1, which is no longer correct.

Contributor: Yes, this is known. We have set the file size to 1GB, so users have to set the right config for the number of file groups.

Contributor: No one can be aware that the file group number is pertinent to correctness. It is a bug, not a usability issue.

Member Author: The idea behind sharding in the MDT is that you can create more shards rather than splitting a shard into two files. Splitting would produce one large and one small file, which is not optimal. For optimal performance, all shards should be of a similar size, which we achieve by hash partitioning the keys (see the sketch below).
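
A minimal illustration of that hash-partitioning idea, not the actual MDT key-to-file-group mapping: each record key is hashed to one of a fixed number of shards, so keys spread roughly evenly and the shards stay a similar size.

class ShardMapper {
  static int keyToShard(String recordKey, int numShards) {
    // floorMod keeps the shard index non-negative even for negative hash codes.
    return Math.floorMod(recordKey.hashCode(), numShards);
  }
}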

Member Author: This PR also adds support for automatic estimation of the shard counts for each partition; that can be enhanced.

Contributor: Regarding "automatic estimation of the shard counts for each partition, that can be enhanced": this may be a solution if we can make an accurate estimation of the file group size.


Labels

priority:blocker (Production down; release blocker), release-0.14.0
