[HUDI-2488][HUDI-3175] Implement async metadata indexing #4693
Changes from all commits
```diff
@@ -24,6 +24,8 @@
 import org.apache.hudi.avro.model.HoodieCleanerPlan;
 import org.apache.hudi.avro.model.HoodieClusteringPlan;
 import org.apache.hudi.avro.model.HoodieCompactionPlan;
+import org.apache.hudi.avro.model.HoodieIndexCommitMetadata;
+import org.apache.hudi.avro.model.HoodieIndexPlan;
 import org.apache.hudi.avro.model.HoodieRestoreMetadata;
 import org.apache.hudi.avro.model.HoodieRestorePlan;
 import org.apache.hudi.avro.model.HoodieRollbackMetadata;
```
```diff
@@ -62,11 +64,13 @@
 import org.apache.hudi.exception.HoodieCommitException;
 import org.apache.hudi.exception.HoodieException;
 import org.apache.hudi.exception.HoodieIOException;
+import org.apache.hudi.exception.HoodieIndexException;
 import org.apache.hudi.exception.HoodieRestoreException;
 import org.apache.hudi.exception.HoodieRollbackException;
 import org.apache.hudi.exception.HoodieSavepointException;
 import org.apache.hudi.index.HoodieIndex;
 import org.apache.hudi.metadata.HoodieTableMetadataWriter;
+import org.apache.hudi.metadata.MetadataPartitionType;
 import org.apache.hudi.metrics.HoodieMetrics;
 import org.apache.hudi.table.BulkInsertPartitioner;
 import org.apache.hudi.table.HoodieTable;
```
```diff
@@ -400,7 +404,6 @@ protected void rollbackFailedBootstrap() {
   public abstract O bulkInsert(I records, final String instantTime,
       Option<BulkInsertPartitioner> userDefinedBulkInsertPartitioner);

   /**
    * Loads the given HoodieRecords, as inserts into the table. This is suitable for doing big bulk loads into a Hoodie
    * table for the very first time (e.g: converting an existing table to Hoodie). The input records should contain no
```
```diff
@@ -925,6 +928,53 @@ public boolean scheduleCompactionAtInstant(String instantTime, Option<Map<String
     return scheduleTableService(instantTime, extraMetadata, TableServiceType.COMPACT).isPresent();
   }

+  /**
+   * Schedules INDEX action.
+   *
+   * @param partitionTypes - list of {@link MetadataPartitionType} that need to be indexed
+   * @return instant time for the requested INDEX action
+   */
+  public Option<String> scheduleIndexing(List<MetadataPartitionType> partitionTypes) {
```
> Member: Should this API also take additional args for what kind of indexes to build?

> Member: consistent use of …
```diff
+    String instantTime = HoodieActiveTimeline.createNewInstantTime();
+    Option<HoodieIndexPlan> indexPlan = createTable(config, hadoopConf, config.isMetadataTableEnabled())
```
> Contributor: What happens if someone tries to trigger indexing twice? I expect we would fail the second trigger, conveying that an indexing is already in progress.

> Member (author): We check the table config for inflight/completed indexes, so this would return false if triggered twice.
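To make that idempotency concrete, here is a minimal sketch, assuming `writeClient` is any already-configured concrete `BaseHoodieWriteClient` (the client setup is not part of this diff):

```java
import java.util.Collections;
import java.util.List;

import org.apache.hudi.client.BaseHoodieWriteClient;
import org.apache.hudi.common.util.Option;
import org.apache.hudi.metadata.MetadataPartitionType;

class ScheduleIndexingTwiceSketch {
  // Illustrative only: observing the double-trigger behavior described above.
  static void scheduleTwice(BaseHoodieWriteClient<?, ?, ?, ?> writeClient) {
    List<MetadataPartitionType> partitions =
        Collections.singletonList(MetadataPartitionType.COLUMN_STATS);

    // First call: an index plan is written and the new instant time comes back.
    Option<String> first = writeClient.scheduleIndexing(partitions);

    // Second call while the first is inflight: the table config already records the
    // pending index, so no new plan is created and the returned Option is empty.
    Option<String> second = writeClient.scheduleIndexing(partitions);
  }
}
```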
```diff
+        .scheduleIndexing(context, instantTime, partitionTypes);
+    return indexPlan.isPresent() ? Option.of(instantTime) : Option.empty();
+  }
+
+  /**
+   * Runs INDEX action to build out the metadata partitions as planned for the given instant time.
+   *
+   * @param indexInstantTime - instant time for the requested INDEX action
+   * @return {@link Option<HoodieIndexCommitMetadata>} after successful indexing.
+   */
+  public Option<HoodieIndexCommitMetadata> index(String indexInstantTime) {
+    return createTable(config, hadoopConf, config.isMetadataTableEnabled()).index(context, indexInstantTime);
+  }
```
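Taken together, the two APIs give the async flow: schedule first, then build, possibly from a separate process. A hedged usage sketch, again with an assumed pre-configured `writeClient`:

```java
import java.util.Arrays;
import java.util.List;

import org.apache.hudi.avro.model.HoodieIndexCommitMetadata;
import org.apache.hudi.client.BaseHoodieWriteClient;
import org.apache.hudi.common.util.Option;
import org.apache.hudi.metadata.MetadataPartitionType;

class AsyncIndexingFlowSketch {
  static void buildIndexes(BaseHoodieWriteClient<?, ?, ?, ?> writeClient) {
    List<MetadataPartitionType> partitions = Arrays.asList(
        MetadataPartitionType.BLOOM_FILTERS, MetadataPartitionType.COLUMN_STATS);

    // 1. Schedule the INDEX action; the instant time comes back only if a plan was written.
    Option<String> indexInstant = writeClient.scheduleIndexing(partitions);

    // 2. Execute the planned INDEX action; this may run in a different process
    //    than the regular writer, which is the point of "async" indexing.
    if (indexInstant.isPresent()) {
      Option<HoodieIndexCommitMetadata> result = writeClient.index(indexInstant.get());
    }
  }
}
```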
```diff
+
+  /**
+   * Drops the index and removes the metadata partitions.
+   *
+   * @param partitionTypes - list of {@link MetadataPartitionType} whose indexes need to be dropped
+   */
+  public void dropIndex(List<MetadataPartitionType> partitionTypes) {
```
> Member: Are there tests for these APIs?

> Member (author): Will add a test for dropIndex. The scheduleIndexing and index APIs are covered in a deltastreamer test; I'll add more failure scenarios in TestHoodieIndexer.
```diff
+    HoodieTable table = createTable(config, hadoopConf);
+    String dropInstant = HoodieActiveTimeline.createNewInstantTime();
+    this.txnManager.beginTransaction();
+    try {
+      context.setJobStatus(this.getClass().getSimpleName(), "Dropping partitions from metadata table");
+      table.getMetadataWriter(dropInstant).ifPresent(w -> {
+        try {
+          ((HoodieTableMetadataWriter) w).dropMetadataPartitions(partitionTypes);
+        } catch (IOException e) {
+          throw new HoodieIndexException("Failed to drop metadata index. ", e);
+        }
+      });
+    } finally {
+      this.txnManager.endTransaction();
+    }
+  }
+
   /**
    * Performs Compaction for the workload stored in instant-time.
    *
```
The second file in the diff changes `processAppendResult` so that column stats are collected only for the configured subset of columns, instead of unconditionally for all columns:
```diff
@@ -50,12 +50,14 @@
 import org.apache.hudi.common.util.Option;
 import org.apache.hudi.common.util.ReflectionUtils;
 import org.apache.hudi.common.util.SizeEstimator;
+import org.apache.hudi.common.util.StringUtils;
 import org.apache.hudi.config.HoodieWriteConfig;
 import org.apache.hudi.exception.HoodieAppendException;
 import org.apache.hudi.exception.HoodieException;
 import org.apache.hudi.exception.HoodieUpsertException;
 import org.apache.hudi.table.HoodieTable;

+import org.apache.avro.Schema;
 import org.apache.avro.generic.GenericRecord;
 import org.apache.avro.generic.IndexedRecord;
 import org.apache.hadoop.fs.Path;
@@ -69,8 +71,10 @@
 import java.util.List;
 import java.util.Map;
 import java.util.Properties;
+import java.util.Set;
 import java.util.concurrent.atomic.AtomicLong;
 import java.util.stream.Collectors;
+import java.util.stream.Stream;

 import static org.apache.hudi.metadata.HoodieTableMetadataUtil.accumulateColumnRanges;
 import static org.apache.hudi.metadata.HoodieTableMetadataUtil.aggregateColumnStats;
```
```diff
@@ -343,16 +347,27 @@ private void processAppendResult(AppendResult result, List<IndexedRecord> record
       updateWriteStatus(stat, result);
     }

-    if (config.isMetadataIndexColumnStatsForAllColumnsEnabled()) {
+    if (config.isMetadataColumnStatsIndexEnabled()) {
```
> Member: nts (note to self): follow up on all this code; it needs to be more modular.
```diff
+      final List<Schema.Field> fieldsToIndex;
+      if (!StringUtils.isNullOrEmpty(config.getColumnsEnabledForColumnStatsIndex())) {
+        Set<String> columnsToIndex = Stream.of(config.getColumnsEnabledForColumnStatsIndex().split(","))
+            .map(String::trim).filter(s -> !s.isEmpty()).collect(Collectors.toSet());
+        fieldsToIndex = writeSchemaWithMetaFields.getFields().stream()
+            .filter(field -> columnsToIndex.contains(field.name())).collect(Collectors.toList());
+      } else {
+        // if column stats index is enabled but columns not configured then we assume that all columns should be indexed
+        fieldsToIndex = writeSchemaWithMetaFields.getFields();
+      }
+
       Map<String, HoodieColumnRangeMetadata<Comparable>> columnRangeMap = stat.getRecordsStats().isPresent()
           ? stat.getRecordsStats().get().getStats() : new HashMap<>();
       final String filePath = stat.getPath();
       // initialize map of column name to map of stats name to stats value
       Map<String, Map<String, Object>> columnToStats = new HashMap<>();
-      writeSchemaWithMetaFields.getFields().forEach(field -> columnToStats.putIfAbsent(field.name(), new HashMap<>()));
+      fieldsToIndex.forEach(field -> columnToStats.putIfAbsent(field.name(), new HashMap<>()));
       // collect stats for columns at once per record and keep iterating through every record to eventually find col stats for all fields.
-      recordList.forEach(record -> aggregateColumnStats(record, writeSchemaWithMetaFields, columnToStats, config.isConsistentLogicalTimestampEnabled()));
-      writeSchemaWithMetaFields.getFields().forEach(field -> accumulateColumnRanges(field, filePath, columnRangeMap, columnToStats));
+      recordList.forEach(record -> aggregateColumnStats(record, fieldsToIndex, columnToStats, config.isConsistentLogicalTimestampEnabled()));
+      fieldsToIndex.forEach(field -> accumulateColumnRanges(field, filePath, columnRangeMap, columnToStats));
       stat.setRecordsStats(new HoodieDeltaWriteStat.RecordsStats<>(columnRangeMap));
     }
```
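The column-selection logic itself is just a comma-split with trimming, where an absent or blank config falls back to all schema fields. Extracted as a self-contained sketch (class and method names here are illustrative, not from the PR):

```java
import java.util.Arrays;
import java.util.Optional;
import java.util.Set;
import java.util.stream.Collectors;

public class ColumnStatsSelectionSketch {

  // Mirrors the diff above: split the configured CSV, trim entries, drop empties.
  // An absent/blank config means "index all columns in the write schema".
  static Optional<Set<String>> configuredColumnsToIndex(String csv) {
    if (csv == null || csv.trim().isEmpty()) {
      return Optional.empty(); // caller falls back to every schema field
    }
    return Optional.of(Arrays.stream(csv.split(","))
        .map(String::trim)
        .filter(s -> !s.isEmpty())
        .collect(Collectors.toSet()));
  }

  public static void main(String[] args) {
    System.out.println(configuredColumnsToIndex(" rider, driver ,fare ,")); // three columns
    System.out.println(configuredColumnsToIndex("  "));                     // empty -> all columns
  }
}
```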