[HUDI-53] Implementation of a native DFS based index based on the metadata table. #5581
Conversation
…ta table. 1. Complete implementation 2. Metrics 3. Performance tuning of metadata table
@vinothchandar @nsivabalan This is the complete code from my internal branch ported to 0.10.1. I will update HUDI-53 to list the various enhancements that are part of this large PR. I will need your help to break down the various enhancements into separate PRs to apply to master after discussion. Some code had merge conflicts, so I removed some testing code. Disregard the failing tests for now.
prasannarajaperumal
left a comment
This is super-exciting to see the implementation in a PR :). Thanks for doing this.
I am about half-way through, need to look into the read/write part a bit more closely. Will do it very soon.
}

@CliCommand(value = "metadata init", help = "Update the metadata table from commits since the creation")
public String init(@CliOption(key = {"readonly"}, unspecifiedDefaultValue = "false",
Is init being removed?
 * @param maxFileGroupSizeBytes Maximum size of the file group.
 * @return The estimated number of file groups.
 */
private int estimateFileGroupCount(String partitionName, long recordCount, int averageRecordSize, int minFileGroupCount,
Is the plan to generate file group count for other metadata partitions (bloom, column stats) also using some heuristics like this?
Yes, we should. It definitely helps.
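For illustration, a minimal sketch of such a size-based heuristic. This is a hypothetical standalone helper (names and the clamping behavior are assumptions), not the code in this PR:

```java
/**
 * Hypothetical sketch of a size-based heuristic, not the code in this PR.
 * Estimates how many file groups a metadata partition needs so that each file
 * group stays under maxFileGroupSizeBytes, clamped to [min, max].
 */
static int estimateFileGroupCount(long recordCount, int averageRecordSizeBytes,
                                  int minFileGroupCount, int maxFileGroupCount,
                                  long maxFileGroupSizeBytes) {
  long estimatedTotalSizeBytes = recordCount * averageRecordSizeBytes;
  int estimatedCount = (int) Math.ceil((double) estimatedTotalSizeBytes / maxFileGroupSizeBytes);
  return Math.max(minFileGroupCount, Math.min(maxFileGroupCount, estimatedCount));
}
```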
  List<HoodieRecord> records = convertMetadataFunction.convertMetadata();
  commit(engineContext.parallelize(records, 1), MetadataPartitionType.FILES.partitionPath(), instantTime, canTriggerTableService);
}

private void commitFileListingUpdate(String instantTime, List<HoodieRecord> recordList, boolean canTriggerTableService) {
Thanks for renaming this to make it more clear.
// File groups in each partition are fixed at creation time and we do not want them to be split into multiple files
// ever. Hence we use a very large base file size in the metadata table. The actual size of the HFiles created will
// eventually depend on the number of file groups selected for each partition (see the estimateFileGroupCount function).
final long maxHFileSizeBytes = 10 * 1024 * 1024 * 1024L; // 10GB
Does the mapping of row key to file group break down if the HFile actually gets to this size and a new base file gets created? Should we just disable the max file size check here?
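For context, the metadata table locates a key by hashing it into one of a fixed number of file groups, so an extra base file appearing in a partition would indeed break lookups. A minimal illustrative sketch (hypothetical helper, not the PR's code):

```java
// Hypothetical sketch: record keys are mapped to one of N fixed file groups by
// hashing, so N must never change after the partition is initialized.
static int mapRecordKeyToFileGroupIndex(String recordKey, int fileGroupCount) {
  int hash = recordKey.hashCode() & Integer.MAX_VALUE; // force non-negative
  return hash % fileGroupCount;
}
```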
public boolean commitStats(String instantTime, List<HoodieWriteStat> stats, Option<Map<String, String>> extraMetadata,
    String commitActionType) {
  return commitStats(instantTime, stats, extraMetadata, commitActionType, Collections.emptyMap());

public boolean commitStats(String instantTime, HoodieData<WriteStatus> writeStatuses, List<HoodieWriteStat> stats,
Just clarifying: WriteStatus contains only deflated HoodieRecords, right? Making sure we don't hold on to the data in memory until we update the metadata table.
/**
 * If a valid location is specified, a copy of the write config is saved before each operation.
 */
public static final ConfigProperty<String> CONFIG_EXPORT_DIR = ConfigProperty
It is not clear where this is used. Are write configs persisted after each operation?
// We need to track written records within WriteStatus in two cases:
// 1. When the HoodieIndex being used is not implicit with storage
// 2. If any of the metadata table partitions (record index, etc.) which require written record tracking are enabled
final boolean trackSuccessRecords = !hoodieTable.getIndex().isImplicitWithStorage()
nit: move !hoodieTable.getIndex().isImplicitWithStorage() also into HoodieTableMetadataUtil.needsWriteStatusTracking
 * These partitions need the list of written records so that they can update their metadata.
 */
public static List<MetadataPartitionType> needWriteStatusTracking() {
  return Arrays.asList(MetadataPartitionType.RECORD_INDEX);
nit: just simplify this to a boolean and check if it's RECORD_INDEX?
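A minimal sketch of what the two nits above could look like when combined. The method name follows the suggestion, while the record-index accessor on the config is an assumption, not the PR's actual code:

```java
// Hypothetical sketch combining the two suggestions above, not the PR's code.
// Returns true if written records must be tracked in WriteStatus, i.e. when the
// index is not implicit with storage, or when the record index partition is enabled.
public static boolean needsWriteStatusTracking(HoodieTable hoodieTable, HoodieWriteConfig config) {
  return !hoodieTable.getIndex().isImplicitWithStorage()
      || config.getMetadataConfig().isRecordIndexEnabled(); // assumed accessor name
}
```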
  taggedRecords.add(rec);
}

List<String> recordKeys = keyToIndexMap.keySet().stream().sorted().collect(Collectors.toList());
Hi, why do we need to sort here? And it seems we can directly maintain a mapping from key to records, instead of another list-index indirection.
I assume this sorting is to ensure that lookups in the HFile do not incur unnecessary seeks.
But we already sort all lookups within HFileDataBlock:
if (!enableFullScan) {
// HFile read will be efficient if keys are sorted, since on storage, records are sorted by key. This will avoid unnecessary seeks.
Collections.sort(keys);
}
So, I don't see a need to sort here. Or do you mean to sort for some other reason?
 * A {@code BulkInsertPartitioner} implementation for Metadata Table to improve performance of initialization of metadata
 * table partition when a very large number of records are inserted.
 *
 * This partitioner requires the records to tbe already tagged with the appropriate file slice.
typo: "tbe" should be "to be".
this.txnManager.beginTransaction(Option.of(clusteringInstant), Option.empty());
finalizeWrite(table, clusteringCommitTime, writeStats);
writeTableMetadataForTableServices(table, metadata, clusteringInstant);
preCommit(clusteringInstant, metadata);
Looking at this fix, I was bummed that we did not have preCommit even for compaction (preCommit is where we do conflict resolution).
Probably it was designed that way.
- For compaction: conflict resolution is done during compaction planning, so there is no other condition on which we might abort compaction during completion; that is probably why it was left out.
- For clustering: in the most common scenario (update reject strategy), an incoming write that overlaps with the clustering file groups (to be replaced) will be aborted right away, and the new file groups being created by clustering are not known to any new writes until clustering completes. So there is no real necessity to do conflict resolution during clustering completion.
Did you identify any other gaps w.r.t. conflict resolution?
protected void commit(HoodieData<HoodieRecord> records, String instantTime, String partitionName,
    int fileGroupCount) {
  LOG.info("Performing bulk insert for partition " + partitionName + " with " + fileGroupCount + " file groups");
  SparkHoodieMetadataBulkInsertPartitioner partitioner = new SparkHoodieMetadataBulkInsertPartitioner(fileGroupCount);
good optimization 👏
HoodieRecord record = recordItr.next();
final String fileID = record.getCurrentLocation().getFileId();
// Remove the write-token from the fileID as we need to return only the prefix
int index = fileID.lastIndexOf("-");
Let's move this to FSUtils or something like PathUtils, i.e. parsing the fileID prefix given a file name.
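A minimal sketch of the suggested helper (hypothetical name and placement, not the PR's code):

```java
/**
 * Hypothetical FSUtils/PathUtils-style helper, not the code in this PR:
 * strips the trailing write-token from a fileId of the form "<prefix>-<writeToken>".
 */
public static String getFileIdPrefix(String fileId) {
  int index = fileId.lastIndexOf("-");
  return index < 0 ? fileId : fileId.substring(0, index);
}
```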
}

@Override
public List<String> generateFileIDPfxs(int numPartitions) {
I am not sure where this is used. Can you shed some light?
Because, within SparkBulkInsertHelper, we are not leveraging this new method; it looks like we are generating the prefixes inline there (L110 in SparkBulkInsertHelper):
// generate new file ID prefixes for each output partition
final List<String> fileIDPrefixes =
    IntStream.range(0, parallelism).mapToObj(i -> FSUtils.createNewFileIdPfx()).collect(Collectors.toList());
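If the intent is for SparkBulkInsertHelper to use the new method, one possible wiring would be the hypothetical one-liner below (it assumes the helper has access to the partitioner instance; this is only an illustration, not code from the PR):

```java
// Hypothetical sketch: let the bulk-insert helper ask the partitioner for prefixes
// instead of generating them inline with FSUtils.createNewFileIdPfx().
final List<String> fileIDPrefixes = partitioner.generateFileIDPfxs(parallelism);
```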
.withDocumentation("Enable the internal metadata table which serves table metadata like level file listings");
.withDocumentation("Create and use the metadata table (MDT) which serves table metadata like file listings "
    + "and indexes. If set to true, MDT will be created. Once created, this setting only controls if the MDT "
    + "will be used on reader side. MDT cannot be disabled/removed by setting this to false.");
Actually, we have fixed this flow in master, i.e. on the write path, if the metadata table was disabled and Hudi detects that the metadata table exists, it will delete the metadata table.
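For context, the flag being documented here is set through the write config; a hedged sketch of how it is typically enabled (builder method names as commonly used in Hudi, exact signatures may differ across versions):

```java
// Hedged sketch of enabling the metadata table (hoodie.metadata.enable) via the
// write config; exact builder methods may differ across Hudi versions.
HoodieWriteConfig writeConfig = HoodieWriteConfig.newBuilder()
    .withPath(basePath)
    .withMetadataConfig(HoodieMetadataConfig.newBuilder().enable(true).build())
    .build();
```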
}

@Override
public String next() {
We can probably use parquetUtils.readRowKeys() here.
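A hedged usage sketch of that suggestion; the exact readRowKeys signature (Hadoop Configuration plus the base file Path, returning the set of record keys) is an assumption here:

```java
// Hypothetical usage of the suggestion above; the exact readRowKeys signature is assumed.
Set<String> recordKeys = parquetUtils.readRowKeys(hadoopConf, new Path(baseFilePath));
```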
 * @param recordKeys The list of record keys to read
 */
@Override
public Map<String, HoodieRecordGlobalLocation> readRecordIndex(List<String> recordKeys) {
The list of record keys is going to scale up; we need to change this to HoodieData.
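A sketch of the signature shape being asked for here (hypothetical, not the PR's code):

```java
// Hypothetical signature sketch: operate on distributed collections so record
// index lookups are not bounded by driver memory.
HoodieData<Pair<String, HoodieRecordGlobalLocation>> readRecordIndex(HoodieData<String> recordKeys);
```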
}
// TODO: This does not work with clustering
if (!previousRecord.getRecordGlobalLocation().equals(getRecordGlobalLocation())) {
  throw new HoodieMetadataException(String.format("Location for %s should not change from %s to %s", previousRecord.key,
Why do we need this check? With clustering, records could move from one location to another, right? Any harm in removing this?
|
Closing this as a new PR is being raised.
Tips
What is the purpose of the pull request
(For example: This pull request adds quick-start document.)
Brief change log
(for example:)
Verify this pull request
(Please pick either of the following options)
This pull request is a trivial rework / code cleanup without any test coverage.
(or)
This pull request is already covered by existing tests, such as (please describe tests).
(or)
This change added tests and can be verified as follows:
(example:)
Committer checklist
Has a corresponding JIRA in PR title & commit
Commit message is descriptive of the change
CI is green
Necessary doc changes done or have another open PR
For large changes, please consider breaking it into sub-tasks under an umbrella JIRA.