[HUDI-2488][HUDI-3175] Implement async metadata indexing #4693
Conversation
Force-pushed from 238b128 to ca12a78.
manojpec left a comment:
Comments so far.
Force-pushed from c5c563f to 06c6dd9.
Force-pushed from 06c6dd9 to 7920cb1.
prashantwason left a comment:
Good progress on this one. Getting close to being complete.
      return UtilHelpers.retry(retry, () -> {
        switch (cfg.runningMode.toLowerCase()) {
          case SCHEDULE: {
            LOG.info("Running Mode: [" + SCHEDULE + "]; Do schedule");
If you move all this code into scheduleIndexing(), you can simplify SCHEDULE_AND_EXECUTE and remove some code duplication.
            return result;
          }
          case SCHEDULE_AND_EXECUTE: {
            LOG.info("Running Mode: [" + SCHEDULE_AND_EXECUTE + "]");
This could simply be scheduleIndexing(..) followed by runIndexing(..).
      }

      private int scheduleAndRunIndexing(JavaSparkContext jsc) throws Exception {
        String schemaStr = UtilHelpers.getSchemaFromLatestInstant(metaClient);
Code duplication here; I suggested above how to remove this function.
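For illustration, a rough sketch of the suggested simplification; the scheduleIndexing/runIndexing helpers and the EXECUTE constant used below are hypothetical names, not necessarily what the PR ends up with (cfg, LOG and the mode constants are the ones already visible in the diff):

    // Sketch only: each helper is assumed to return a process exit code (0 = success).
    switch (cfg.runningMode.toLowerCase()) {
      case SCHEDULE:
        LOG.info("Running Mode: [" + SCHEDULE + "]; Do schedule");
        return scheduleIndexing(jsc);
      case EXECUTE:
        LOG.info("Running Mode: [" + EXECUTE + "]; Do execute");
        return runIndexing(jsc);
      case SCHEDULE_AND_EXECUTE:
        LOG.info("Running Mode: [" + SCHEDULE_AND_EXECUTE + "]");
        // schedule first, then execute only if scheduling succeeded
        return scheduleIndexing(jsc) == 0 ? runIndexing(jsc) : 1;
      default:
        LOG.error("Unknown running mode: " + cfg.runningMode);
        return 1;
    }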
        }
      }

      private int handleError(Option<HoodieIndexCommitMetadata> commitMetadata) {
A boolean seems a better return value for this function.
Force-pushed from 7920cb1 to e6e3e16.
Thanks @prashantwason for reviewing. I'll address your comments soon.
Force-pushed from e6e3e16 to 4a036d8.
nsivabalan left a comment:
Will sync up directly with you on some of the feedback.
      final String fileGroupFileId = String.format("%s%04d", metadataPartition.getFileIdPrefix(), i);
      // if a writer or async indexer had already initialized the filegroup then continue
      if (!fileSlices.isEmpty() && fileSlices.stream().anyMatch(fileSlice -> fileGroupFileId.equals(fileSlice.getFileGroupId().getFileId()))) {
        continue;
Can you help me understand how a partially failed filegroup instantiation is handled? Do we clean up all file groups and start from scratch, or do we continue from where we left off, i.e., if the indexer restarts next time around?
Not handled currently. First we check whether a particular partition needs to be initialized or not. If yes, we initialize it, but in case of a partially failed filegroup instantiation we will clean up all file groups and start from scratch. Will add this logic.
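A minimal sketch of that clean-up-and-restart check; this is not the code added by the PR, and it assumes the number of files under the partition path approximates the file groups written so far and that expectedFileGroupCount is available for the partition type:

    // Illustrative only: detect a partially initialized metadata partition left
    // behind by a failed attempt and wipe it, so initialization restarts cleanly.
    Path partitionPath = new Path(metadataWriteConfig.getBasePath(), metadataPartition.getPartitionPath());
    FileSystem fs = dataMetaClient.getFs();
    if (fs.exists(partitionPath)) {
      // number of files already created for this partition (approximates initialized file groups)
      int existingFiles = fs.listStatus(partitionPath).length;
      if (existingFiles > 0 && existingFiles < expectedFileGroupCount) {
        LOG.warn("Found partially initialized partition " + partitionPath + "; deleting and re-initializing");
        fs.delete(partitionPath, true);
      }
    }
    // ... then initialize all file groups for this partition as usual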
    }

    public void dropIndex(List<MetadataPartitionType> indexesToDrop) throws IOException {
      // TODO: update table config and do it in a transaction
Please file a tracking ticket if we don't have one.
If a writer is holding onto an instance of hoodieTableConfig, it may not refresh from time to time, right? So if a partition was deleted mid-way, when the writer tries to apply a commit to the metadata table, won't hoodieTableConfig.getMetadataPartitionsToUpdate() return stale values? Do we ensure such a flow succeeds even if there are partitions to update but the actual MDT partition is deleted?
     * record-index-bucket-0000, .... -> ..., record-index-bucket-0009
     */
-   private void initializeFileGroups(HoodieTableMetaClient dataMetaClient, MetadataPartitionType metadataPartition, String instantTime,
+   public void initializeFileGroups(HoodieTableMetaClient dataMetaClient, MetadataPartitionType metadataPartition, String instantTime,
Can we check the bootstrapping code snippet? For example, we check the latest synced instant in the metadata table and check whether it is already archived in the data table. With multiple partitions, each partition could be instantiated at different points in time. Can we check all such guards/conditions and ensure it's all intact with the latest state of the metadata table?
Even though each partition could be instantiated at different physical times, the logical times (Hudi instant timestamps) will be the same. Are you asking from a rescheduling-index point of view? Anyway, I think we should run the same checks as in initializeIfNeeded before initializing file groups.
I get it. What happens if someone triggers HoodieIndexer without even enabling MDT for regular writers?
Think through what happens in case someone sets the FILES partition also to be indexed.
    private List<String> getMetadataPartitionsToUpdate() {
      // find last (pending or) completed index instant and get partitions (to be) written
      Option<HoodieInstant> lastIndexingInstant = dataMetaClient.getActiveTimeline()
Guess we have to fix this to read from table properties?
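A sketch of what reading from table properties could look like; getCompletedMetadataPartitions is the helper that appears elsewhere in this diff, while getEnabledPartitionTypes is a hypothetical fallback:

    // Illustrative: derive the partitions to update from table config instead of
    // scanning the timeline for the last indexing instant.
    private List<String> getMetadataPartitionsToUpdate() {
      Set<String> completedPartitions = getCompletedMetadataPartitions(dataMetaClient.getTableConfig());
      if (!completedPartitions.isEmpty()) {
        return new ArrayList<>(completedPartitions);
      }
      // nothing recorded yet (e.g. a table migrated from an older release):
      // fall back to the partition types enabled in the write config
      return getEnabledPartitionTypes().stream()   // hypothetical helper
          .map(MetadataPartitionType::getPartitionPath)
          .collect(Collectors.toList());
    }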
    if (enabled && metadata != null) {
      Map<MetadataPartitionType, HoodieData<HoodieRecord>> partitionRecordsMap = convertMetadataFunction.convertMetadata();
      commit(instantTime, partitionRecordsMap, canTriggerTableService);
      List<String> partitionsToUpdate = getMetadataPartitionsToUpdate();
How does this work for a table that migrated from 0.10.0, for example? They may not have added the "files" partition to table properties, i.e., the list of fully completed metadata partitions.
Synced up offline. But can you show me where we update the files partition in the hoodie table config?
I see it in the 3-to-4 upgrade handler. But what if someone enables it a few commits after the upgrade?
    Future<?> postRequestIndexingTaskFuture = executorService.submit(new PostRequestIndexingTask(metadataWriter, finalRemainingInstantsToIndex));
    try {
      // TODO: configure timeout
      postRequestIndexingTaskFuture.get(60, TimeUnit.SECONDS);
60 seconds is too short. If there are 100+ instants to catch up, would we complete in 60 seconds?
Right now it is configured to be 5 minutes by default. I did 10 small deltastreamer commits (12 columns, 1000 records in each round) using ksql-datagen and it was fine. I understand this could be time-consuming; I'll run a scale test later and try to figure out a better default value.
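A sketch of making this timeout configurable instead of hard-coded; the config key below is a placeholder, not necessarily the property name that ends up in the PR (imports and surrounding context omitted):

    // Illustrative: read the catchup timeout from the write config.
    long catchupTimeoutSeconds = config.getProps().getLong("hoodie.index.catchup.timeout.seconds", 300L); // placeholder key, default 5 minutes
    Future<?> postRequestIndexingTaskFuture =
        executorService.submit(new PostRequestIndexingTask(metadataWriter, finalRemainingInstantsToIndex));
    try {
      postRequestIndexingTaskFuture.get(catchupTimeoutSeconds, TimeUnit.SECONDS);
    } catch (TimeoutException e) {
      postRequestIndexingTaskFuture.cancel(true);
      throw new HoodieIndexException("Indexing catchup did not finish within " + catchupTimeoutSeconds + " seconds", e);
    }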
    HoodieTableMetaClient metadataMetaClient = HoodieTableMetaClient.builder().setConf(hadoopConf).setBasePath(metadataBasePath).build();
    Set<HoodieInstant> metadataCompletedTimeline = metadataMetaClient.getActiveTimeline()
        .getCommitsTimeline().filterCompletedInstants().getInstants().collect(Collectors.toSet());
    List<HoodieInstant> finalRemainingInstantsToIndex = remainingInstantsToIndex.map(
I see we fetch all instants (pending and complete) at L106, so I assume finalRemainingInstantsToIndex could contain inflight commits as well. So there is a chance that, by the time PostRequestIndexingTask executes, the actual writer has already applied the commit to the MDT. Have we considered this scenario?
So now we're checking against completed instants in the MDT timeline (see the getCompletedArchivedAndActiveInstantsAfter method). Only if an instant in the data timeline is yet to be indexed but not present in the MDT do we wait till it gets completed (reloading the timeline periodically until timeout).
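A simplified sketch of that wait; the poll interval and method shape are illustrative, not the exact code in the PR, and the surrounding catchup task is assumed to run under its own timeout:

    // Illustrative: block until the given data-table instant shows up as completed
    // in the metadata table timeline.
    private void waitForInstantInMetadataTable(HoodieTableMetaClient metadataMetaClient, String instantTime)
        throws InterruptedException {
      while (metadataMetaClient.reloadActiveTimeline()
          .filterCompletedInstants()
          .getInstants()
          .noneMatch(i -> i.getTimestamp().equals(instantTime))) {
        LOG.info("Waiting for instant " + instantTime + " to be committed to the metadata table");
        TimeUnit.SECONDS.sleep(5);
      }
    }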
Force-pushed from 4e8f6e0 to 680a99a.
Force-pushed from d02e0c2 to 69071c6.
    }

    private boolean scheduleIndexingAtInstant(List<MetadataPartitionType> partitionTypes, String instantTime) throws HoodieIOException {
      Option<HoodieIndexPlan> indexPlan = createTable(config, hadoopConf, config.isMetadataTableEnabled())
What happens if someone tries to trigger indexing twice? I expect we would fail the second trigger, conveying that indexing is already in progress.
We check the table config for inflight/completed indexes, and this would return false if triggered twice.
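A sketch of that guard, reusing the getInflightMetadataPartitions/getCompletedMetadataPartitions helpers already visible in this diff; tableConfig and partitionTypes are assumed to come from the caller:

    // Illustrative: make a second scheduling attempt a no-op.
    Set<String> inflightOrCompleted = getInflightMetadataPartitions(tableConfig);
    inflightOrCompleted.addAll(getCompletedMetadataPartitions(tableConfig));
    Set<String> requestedPartitions = partitionTypes.stream()
        .map(MetadataPartitionType::getPartitionPath)
        .collect(Collectors.toSet());
    requestedPartitions.removeAll(inflightOrCompleted);
    if (requestedPartitions.isEmpty()) {
      LOG.warn("All requested index partitions are already built or being built; nothing to schedule");
      return false;
    }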
      System.exit(1);
    }

    final JavaSparkContext jsc = UtilHelpers.buildSparkContext("indexing-" + cfg.tableName, cfg.sparkMaster, cfg.sparkMemory);
Can we validate that hoodie.metadata.enable is set to true? If not, let's throw an exception.
Let's also think through the case where a user wants to initialize the entire MDT via HoodieIndexer, i.e., they are not bringing up any regular writers, but first bring up HoodieIndexer, wait for everything to be built out, and then start the regular writers. From what I see, it's taken care of, but do verify it once.
validated this by:
- writing 2 commits with metadata disabled.
- schedule and build index.
- do another upsert with metadata disabled.
- schedule index
- do another upsert with metadata disabled.
- run index (it also does catchup).
Regarding "But first bring up HoodieIndexer and wait for everything to be built out": we should also be able to support a fully async mode of operation, right?
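A minimal sketch of the hoodie.metadata.enable validation suggested at the top of this thread, assuming props holds the TypedProperties built for the indexer job:

    // Illustrative: fail fast when the metadata table is not enabled for this run.
    if (!Boolean.parseBoolean(props.getProperty(HoodieMetadataConfig.ENABLE.key(), "false"))) {
      throw new HoodieException("Asynchronous indexing requires " + HoodieMetadataConfig.ENABLE.key() + " to be set to true");
    }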
Force-pushed from ee361b1 to be08ba4.
Commits:
- Fix timeline tests
- Make MetadataPartitionType#all generic
- Add instant info to plan and address review comments
- [HUDI-3175] Add index planner and executor
- Fix enum constant
- Rebase and resolve conflicts
- Fix some failing tests in CI
- Add indexer job utility
- Resolve minor rebase conflict
- checkstyle fix
Force-pushed from be08ba4 to 010de76.
vinothchandar left a comment:
I need to make another pass on the core logic of RunIndexActionExecutor. But there are enough code fixes, as well as design/config/corner-case clarifications, to address first.
     * @param partitionTypes - list of {@link MetadataPartitionType} which needs to be indexed
     * @return instant time for the requested INDEX action
     */
    public Option<String> scheduleIndexing(List<MetadataPartitionType> partitionTypes) {
Should this API also take additional args for what kind of indexes to build?
Be consistent in the use of "indexing" vs "index".
    {
      "namespace": "org.apache.hudi.avro.model",
      "type": "record",
      "name": "HoodieIndexCommitMetadata",
Better to call everything "Indexing" rather than "index".
      "fields": [
        {
          "name": "version",
          "doc": "This field replaces the field filesToBeDeletedPerPartition",
Fix the doc string.
        ],
        "default": 1
      },
      {
Do we need this?
    // index catchup for all remaining instants with a timeout
    currentIndexedInstant = indexUptoInstant;
    ExecutorService executorService = Executors.newFixedThreadPool(MAX_CONCURRENT_INDEXING);
Double-check that this is shut down in all failure scenarios.
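A sketch of one way to guarantee shutdown on every path, assuming a configured timeoutSeconds; this is not the exact code in the PR:

    // Illustrative: tear down the catchup executor on success, timeout and failure alike.
    ExecutorService executorService = Executors.newFixedThreadPool(MAX_CONCURRENT_INDEXING);
    try {
      Future<?> catchupFuture =
          executorService.submit(new PostRequestIndexingTask(metadataWriter, finalRemainingInstantsToIndex));
      catchupFuture.get(timeoutSeconds, TimeUnit.SECONDS);
    } catch (Exception e) {
      throw new HoodieIndexException("Indexing catchup failed or timed out", e);
    } finally {
      executorService.shutdownNow(); // also interrupts a still-running catchup task
    }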
    updateTableConfig(table.getMetaClient(), finalIndexPartitionInfos);
    table.getActiveTimeline().saveAsComplete(
        new HoodieInstant(true, INDEX_ACTION, indexInstant.getTimestamp()),
        TimelineMetadataUtils.serializeIndexCommitMetadata(indexCommitMetadata));
What happens if we fail before reaching L165, i.e., the timeline saving? How do we recover/reconcile?
      }
    }

    private static List<HoodieInstant> getRemainingArchivedAndActiveInstantsSince(String instant, HoodieTableMetaClient metaClient) {
Add tests for all these methods.
- Abort gracefully after deleting partition and instant.
- Handle other actions in timeline to consider before catching up.
vinothchandar left a comment:
A couple of nits.
      completedIndexes.addAll(partitionTypes.stream().map(MetadataPartitionType::getPartitionPath).collect(Collectors.toList()));
      dataMetaClient.getTableConfig().setValue(HoodieTableConfig.TABLE_METADATA_INDEX_COMPLETED.key(), String.join(",", completedIndexes));
    private void updateInitializedPartitionsInTableConfig(List<MetadataPartitionType> partitionTypes) {
      Set<String> completedIndexes = getCompletedMetadataPartitions(dataMetaClient.getTableConfig());
completedPartitions
    String REQUESTED_EXTENSION = ".requested";
    String RESTORE_ACTION = "restore";
-   String INDEX_ACTION = "index";
+   String INDEX_ACTION = "indexing";
INDEXING_ACTION
    // schema for the files partition is same between the two versions
    if (config.isMetadataTableEnabled() && metadataPartitionExists(config.getBasePath(), context, MetadataPartitionType.FILES)) {
      tablePropsToAdd.put(TABLE_METADATA_PARTITIONS, MetadataPartitionType.FILES.getPartitionPath());
    }
Hi @codope, just thinking: when a user's table is already at current version 4, which means there is no need for an upgrade/downgrade, then how can we update the TABLE_METADATA_PARTITIONS property here?
@zhangyue19921010 Good question! If no upgrade is required, or say you upgraded to the current version with metadata disabled and metadata was enabled a few commits later, then this table config gets updated in the metadata initialization path, i.e., where HoodieBackedTableMetadataWriter#updateInitializedPartitionsInTableConfig is called.
vinothchandar left a comment:
I have one conceptual question on considering non-write actions during scheduling. We can fix minor things, land, and then follow up.
    }
    HoodieTableConfig.update(dataMetaClient.getFs(), new Path(dataMetaClient.getMetaPath()), dataMetaClient.getTableConfig().getProps());
    LOG.warn("Deleting Metadata Table partitions: " + partitionPath);
    dataMetaClient.getFs().delete(new Path(metadataWriteConfig.getBasePath(), partitionPath), true);
What happens if the delete fails midway before finishing? There is a follow-on to use DELETE_PARTITION instead? Even there, that operation could fail midway, and we need some mechanism to reconcile/retry the next time we try to build that partition.
Yes, this will be replaced by the DELETE_PARTITION path; we just got the lazy deletion of partitions landed. Indeed there are multiple points of failure, but unlike schedule/run index, delete is a bit safer in terms of partial failures. We would be in trouble if the partition gets deleted but the table config is not updated, so we update the table config first. If the table config is updated but the partition is not fully deleted, users can re-trigger the drop.
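A sketch of the config-first ordering described here, using the config key and helpers visible in this diff; the exact drop-index code in the PR may differ:

    // Illustrative: persist the table config change before deleting any files, so a
    // partially deleted partition can always be re-dropped safely.
    for (String partitionPath : requestedPartitions) {
      // 1. mark the partition as no longer complete and persist the config
      completedPartitions.remove(partitionPath);
      dataMetaClient.getTableConfig().setValue(
          HoodieTableConfig.TABLE_METADATA_PARTITIONS.key(), String.join(",", completedPartitions));
      HoodieTableConfig.update(dataMetaClient.getFs(), new Path(dataMetaClient.getMetaPath()),
          dataMetaClient.getTableConfig().getProps());
      // 2. only then delete the metadata table partition; if this step fails midway,
      //    the config already reflects the drop and the operation can be re-triggered
      dataMetaClient.getFs().delete(new Path(metadataWriteConfig.getBasePath(), partitionPath), true);
    }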
    indexesInflightOrCompleted.addAll(getCompletedMetadataPartitions(tableConfig));
    Set<String> requestedPartitions = partitionIndexTypes.stream().map(MetadataPartitionType::getPartitionPath).collect(Collectors.toSet());
    requestedPartitions.removeAll(indexesInflightOrCompleted);
    if (!requestedPartitions.isEmpty()) {
If empty, then return?
      LOG.warn(String.format("Following partitions already exist or inflight: %s. Going to index only these partitions: %s",
          indexesInflightOrCompleted, requestedPartitions));
    }
    List<MetadataPartitionType> finalPartitionsToIndex = partitionIndexTypes.stream()
Why can't we just use requestedPartitions?
We need to pass a list of MetadataPartitionType here rather than a list of partition paths. This will change when we get to secondary indexes; it should then be a list of partition paths.
    indexesInflightOrCompleted.addAll(getCompletedMetadataPartitions(tableConfig));
    Set<String> requestedPartitions = partitionIndexTypes.stream().map(MetadataPartitionType::getPartitionPath).collect(Collectors.toSet());
    requestedPartitions.removeAll(indexesInflightOrCompleted);
    if (!requestedPartitions.isEmpty()) {
Return if empty?
    // ensure the metadata partitions for the requested indexes are not already available (or inflight)
    HoodieTableConfig tableConfig = table.getMetaClient().getTableConfig();
    Set<String> indexesInflightOrCompleted = getInflightMetadataPartitions(tableConfig);
    indexesInflightOrCompleted.addAll(getCompletedMetadataPartitions(tableConfig));
This code can move to a helper and be shared with the schedule action executor?
    Set<String> inflightPartitions = getInflightMetadataPartitions(table.getMetaClient().getTableConfig());
    Set<String> completedPartitions = getCompletedMetadataPartitions(table.getMetaClient().getTableConfig());
    // delete metadata partition
    requestedPartitions.forEach(partition -> {
Think about what happens if this fails midway
There are multiple points of failure here. To reduce the blast radius, we make changes to the table config first, because after this patch we mostly rely on table configs. Additionally, we need more CLI commands to allow users to recover easily; tracking that in HUDI-3753.
    // since only write timeline was considered while scheduling index, which gives us the indexUpto instant
    // here we consider other valid actions to pick catchupStart instant
    Set<String> validActions = CollectionUtils.createSet(CLEAN_ACTION, RESTORE_ACTION, ROLLBACK_ACTION);
    HoodieInstant catchupStartInstant = table.getMetaClient().reloadActiveTimeline()
Use Option?
    private List<HoodieInstant> getInstantsToCatchup(String indexUptoInstant) {
      // since only write timeline was considered while scheduling index, which gives us the indexUpto instant
      // here we consider other valid actions to pick catchupStart instant
      Set<String> validActions = CollectionUtils.createSet(CLEAN_ACTION, RESTORE_ACTION, ROLLBACK_ACTION);
Should we do it such that any non-write actions are picked up?
Also, why not have scheduling consider non-write actions?
Re "should we do it such that any non-write actions are picked up?": only savepoint is remaining, right? I deliberately avoided savepoint as it does not alter the filegroups in any way (except marking them so the cleaner avoids them), so I did not consider it.
Re "also why not have scheduling consider non-write actions?": yes, that's the way to go. We consider non-write actions to determine the catchup start instant. Going back to the table we discussed in https://github.com/apache/hudi/pull/4693/files#r837817961, we need both the indexUpto and catchupStart instants. I plan to write them into the index plan rather than pass them as parameters. I'm going to revamp the index plan schema so that the API exposes minimal arguments and the plan is the source of truth, as we discussed. Tracking it in HUDI-3755.
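A sketch of picking the catchup start from pending non-write actions, falling back to indexUptoInstant; the timeline filter used here is approximate and not the exact code in the PR:

    // Illustrative: the earliest pending clean/restore/rollback determines where catchup starts.
    Set<String> nonWriteActions = CollectionUtils.createSet(CLEAN_ACTION, RESTORE_ACTION, ROLLBACK_ACTION);
    Option<HoodieInstant> earliestPendingNonWrite = Option.fromJavaOptional(
        table.getMetaClient().reloadActiveTimeline()
            .filterInflightsAndRequested()
            .getInstants()
            .filter(instant -> nonWriteActions.contains(instant.getAction()))
            .findFirst());
    String catchupStartTs = earliestPendingNonWrite.isPresent()
        ? earliestPendingNonWrite.get().getTimestamp()
        : indexUptoInstant;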
    // we need take a lock here as inflight writer could also try to update the timeline
    txnManager.beginTransaction(Option.of(instant), Option.empty());
    LOG.info("Updating metadata table for instant: " + instant);
    switch (instant.getAction()) {
Isn't there a top-level method in the metadata writer to handle different instant types? We can reuse that or move this code there.
No; I will take this refactoring up in a follow-up task. It needs a little more than extracting a method.
        throw new HoodieIndexException(String.format("Thread interrupted while running indexing check for instant: %s", instant), e);
      }
    }
    // if instant completed, ensure that there was metadata commit, else update metadata for this completed instant
So this is just so that any race causing the inflight to miss this is handled?
Yes, right.
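A sketch of that check, assuming metadataCompletedTimeline is the set of completed MDT instants fetched earlier and commitMetadata is the completed instant's metadata; the update(...) call shape is approximate:

    // Illustrative: after the data-table instant completes, apply it to the metadata
    // table only if the writer has not already done so (handles the race where the
    // inflight writer finishes first).
    boolean alreadyApplied = metadataCompletedTimeline.stream()
        .anyMatch(mdtInstant -> mdtInstant.getTimestamp().equals(instant.getTimestamp()));
    if (instant.isCompleted() && !alreadyApplied) {
      metadataWriter.update(commitMetadata, instant.getTimestamp(), false);
    }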
What is the purpose of the pull request

This PR adds support for asynchronous metadata indexing. Please see the RFC for the design. The main pieces are ScheduleIndexActionExecutor, RunIndexActionExecutor, and the index API in HoodieTableMetadataWriter.

Brief change log

- A new timeline action, INDEX, whose state transition is described in the RFC.
- ScheduleIndexActionExecutor.
- RunIndexActionExecutor.
- Index API in HoodieTableMetadataWriter: a) scheduleIndex generates an index plan based on the latest completed instant, initializes file groups, and adds a requested INDEX instant; b) index executes the index plan and also takes care of writes that happened after indexing was requested; c) dropIndex drops an index by removing the given metadata partition.
- HoodieIndexer, a standalone indexer job utility.
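A hedged usage sketch of the APIs named above; engineContext and writeConfig are assumed to be an existing HoodieSparkEngineContext and a HoodieWriteConfig with the metadata table enabled, and the exact signature of index() may differ from what finally landed:

    // Illustrative: schedule an index and then build it with the write client.
    SparkRDDWriteClient<?> writeClient = new SparkRDDWriteClient<>(engineContext, writeConfig);
    try {
      // schedule: plans the index and adds a requested INDEX instant to the timeline
      Option<String> indexInstant = writeClient.scheduleIndexing(Arrays.asList(MetadataPartitionType.COLUMN_STATS));
      if (indexInstant.isPresent()) {
        // execute: builds the index and catches up writes that landed after scheduling
        Option<HoodieIndexCommitMetadata> result = writeClient.index(indexInstant.get());
      }
    } finally {
      writeClient.close();
    }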
Verify this pull request

(Please pick either of the following options)
Committer checklist
Has a corresponding JIRA in PR title & commit
Commit message is descriptive of the change
CI is green
Necessary doc changes done or have another open PR
For large changes, please consider breaking it into sub-tasks under an umbrella JIRA.