[MINOR] Consider all initialization timestamps from MDT timeline to be valid #8915
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -36,11 +36,11 @@ | |
| import org.apache.hudi.common.table.HoodieTableMetaClient; | ||
| import org.apache.hudi.common.table.timeline.HoodieInstant; | ||
| import org.apache.hudi.common.table.view.HoodieTableFileSystemView; | ||
| import org.apache.hudi.common.util.collection.ClosableIterator; | ||
| import org.apache.hudi.common.util.HoodieTimer; | ||
| import org.apache.hudi.common.util.Option; | ||
| import org.apache.hudi.common.util.SpillableMapUtils; | ||
| import org.apache.hudi.common.util.StringUtils; | ||
| import org.apache.hudi.common.util.collection.ClosableIterator; | ||
| import org.apache.hudi.common.util.collection.Pair; | ||
| import org.apache.hudi.exception.HoodieException; | ||
| import org.apache.hudi.exception.HoodieIOException; | ||
|
|
@@ -471,12 +471,9 @@ public Pair<HoodieMetadataLogRecordReader, Long> getLogRecordScanner(List<Hoodie | |
|
|
||
| // Only those log files which have a corresponding completed instant on the dataset should be read | ||
| // This is because the metadata table is updated before the dataset instants are committed. | ||
| Set<String> validInstantTimestamps = HoodieTableMetadataUtil | ||
| .getValidInstantTimestamps(dataMetaClient, metadataMetaClient); | ||
|
|
||
| Set<String> validInstantTimestamps = HoodieTableMetadataUtil.getValidInstantTimestamps(dataMetaClient, metadataMetaClient); | ||
| Option<HoodieInstant> latestMetadataInstant = metadataMetaClient.getActiveTimeline().filterCompletedInstants().lastInstant(); | ||
| String latestMetadataInstantTime = latestMetadataInstant.map(HoodieInstant::getTimestamp).orElse(SOLO_COMMIT_TIMESTAMP); | ||
|
|
||
| String latestMetadataInstantTime = latestMetadataInstant.map(HoodieInstant::getTimestamp).orElse(HoodieTableMetadataUtil.createIndexInitTimestamp(SOLO_COMMIT_TIMESTAMP, 0)); | ||
|
Contributor
I don't see much benefit here either. This code is invoked only after some partition in the MDT has been initialized (which means the table config has been updated), so latestMetadataInstant should already be valid (the Option will be non-empty). So what are the chances that we call getRecordsByKey with BaseTableMetadata when any of the MDT partitions have been initialized?
Member
Author
Makes sense. Let's actually do
Contributor
Yeah, I agree, we should not be in that state. So, throwing an exception makes sense. |
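A minimal sketch of the fail-fast behaviour discussed in this thread, assuming the intent is to raise an error instead of falling back to a default timestamp when the MDT has no completed instant. It uses java.util.Optional and plain timestamp strings as stand-ins for Hudi's Option and HoodieInstant; the class name, method name, and exception type are hypothetical and not taken from the PR.

```java
import java.util.Comparator;
import java.util.List;
import java.util.Optional;

final class LatestInstantSketch {

    // Returns the latest completed instant time, or fails fast if there is none.
    // Hypothetical helper: the real code works on HoodieInstant objects and a timeline.
    static String latestCompletedInstantTime(List<String> completedInstantTimes) {
        // Instant times are fixed-width digit strings, so lexical order matches time order.
        Optional<String> latest = completedInstantTimes.stream().max(Comparator.naturalOrder());

        // Reading the MDT before any partition is initialized is treated as an invalid state.
        return latest.orElseThrow(() -> new IllegalStateException(
            "Metadata table has no completed instants; it should be initialized before it is read"));
    }

    public static void main(String[] args) {
        System.out.println(latestCompletedInstantTime(
            List.of("20230101000000000", "20230609101500123")));
    }
}
```

Throwing at the call site keeps the "should never happen" state visible instead of masking it with a placeholder timestamp.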
||
| boolean allowFullScan = allowFullScanOverride.orElseGet(() -> isFullScanAllowedForPartition(partitionName)); | ||
|
|
||
| // Load the schema | ||
|
|
||
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -18,13 +18,6 @@ | |
|
|
||
| package org.apache.hudi.metadata; | ||
|
|
||
| import org.apache.avro.AvroTypeException; | ||
| import org.apache.avro.LogicalTypes; | ||
| import org.apache.avro.Schema; | ||
| import org.apache.avro.generic.GenericRecord; | ||
| import org.apache.avro.generic.IndexedRecord; | ||
| import org.apache.hadoop.fs.FileSystem; | ||
| import org.apache.hadoop.fs.Path; | ||
| import org.apache.hudi.avro.ConvertingGenericData; | ||
| import org.apache.hudi.avro.model.HoodieCleanMetadata; | ||
| import org.apache.hudi.avro.model.HoodieMetadataColumnStats; | ||
|
|
@@ -63,10 +56,19 @@ | |
| import org.apache.hudi.io.storage.HoodieFileReader; | ||
| import org.apache.hudi.io.storage.HoodieFileReaderFactory; | ||
| import org.apache.hudi.util.Lazy; | ||
|
|
||
| import org.apache.avro.AvroTypeException; | ||
| import org.apache.avro.LogicalTypes; | ||
| import org.apache.avro.Schema; | ||
| import org.apache.avro.generic.GenericRecord; | ||
| import org.apache.avro.generic.IndexedRecord; | ||
| import org.apache.hadoop.fs.FileSystem; | ||
| import org.apache.hadoop.fs.Path; | ||
| import org.slf4j.Logger; | ||
| import org.slf4j.LoggerFactory; | ||
|
|
||
| import javax.annotation.Nonnull; | ||
|
|
||
| import java.io.FileNotFoundException; | ||
| import java.io.IOException; | ||
| import java.math.BigDecimal; | ||
|
|
@@ -1361,7 +1363,13 @@ public static Set<String> getValidInstantTimestamps(HoodieTableMetaClient dataMe | |
| }); | ||
|
|
||
| // SOLO_COMMIT_TIMESTAMP is used during bootstrap so it is a valid timestamp | ||
| validInstantTimestamps.add(createIndexInitTimestamp(SOLO_COMMIT_TIMESTAMP, PARTITION_INITIALIZATION_TIME_SUFFIX)); | ||
| List<String> metadataInitializationTimestamps = metadataMetaClient.getActiveTimeline() | ||
|
Contributor
I am also wondering whether this gives us any benefit. Let's consider the different cases:
Let me know if I am missing any flow. I'm just trying to avoid going through the entire active timeline of the MDT to filter for SOLO COMMIT TIME if it's never going to be used.
Contributor
In other words,
Contributor
I agree, the
Member
Author
@danny0405 @nsivabalan Agree with your points, but should a util method be aware of different cases? Let's say tomorrow the semantics change for a new MDT partition and it writes a data block with the initializing commit itself; then the author/reviewer needs to come back to the util method and fix it. This is going to be harder to maintain. IMO, the better way to handle such cases is to keep the util method dumb and do any case handling at the call site, or have assertions for invariants such as "a data block can never have the initializing commit time". Wdyt? Btw, the change is backward compatible as it checks for
Contributor
Valid instants are mainly needed for reading the log files.
Contributor
Yes, that's why I'm saying |
||
| .filterCompletedInstants() | ||
| .getInstantsAsStream() | ||
| .map(HoodieInstant::getTimestamp) | ||
| .filter(timestamp -> timestamp.startsWith(SOLO_COMMIT_TIMESTAMP)) | ||
| .collect(Collectors.toList()); | ||
| validInstantTimestamps.addAll(metadataInitializationTimestamps); | ||
| return validInstantTimestamps; | ||
| } | ||
|
|
||
|
|
||
Not sure this solves or gives us much.
1: If we happen to initialize more than one MDT partition, the initialization times will differ: the suffix is 010 for the first and 011 for the second.
2: This API is used only in logging.
So it may not be worth fixing, at least for this (getLatestDataInstantTime).
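To make point 1 concrete, here is a small self-contained sketch of how per-partition initialization timestamps with different suffixes can all be picked up by a prefix filter, as the diff does with startsWith(SOLO_COMMIT_TIMESTAMP). The constant values and the formatting rule are assumptions chosen to reproduce the 010 and 011 suffixes mentioned above; they mirror the names in the diff but are not copied from Hudi's source.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;
import java.util.stream.Collectors;

final class InitTimestampSketch {
    // Assumption: stand-in value for Hudi's SOLO_COMMIT_TIMESTAMP.
    static final String SOLO_COMMIT_TIMESTAMP = "00000000000000";
    // Assumption: base suffix, picked so that offsets 0 and 1 yield "010" and "011".
    static final int PARTITION_INITIALIZATION_TIME_SUFFIX = 10;

    // Assumed formatting rule: base timestamp plus a zero-padded three-digit suffix.
    static String createIndexInitTimestamp(String timestamp, int offset) {
        return String.format("%s%03d", timestamp, PARTITION_INITIALIZATION_TIME_SUFFIX + offset);
    }

    // Keeps every completed instant time that starts with the solo commit prefix,
    // regardless of which partition (suffix) it belongs to.
    static Set<String> initializationTimestamps(List<String> completedInstantTimes) {
        return completedInstantTimes.stream()
            .filter(ts -> ts.startsWith(SOLO_COMMIT_TIMESTAMP))
            .collect(Collectors.toCollection(TreeSet::new));
    }

    public static void main(String[] args) {
        List<String> timeline = Arrays.asList(
            createIndexInitTimestamp(SOLO_COMMIT_TIMESTAMP, 0), // first partition initialized
            createIndexInitTimestamp(SOLO_COMMIT_TIMESTAMP, 1), // second partition initialized
            "20230609101500123");                               // a regular commit
        System.out.println(initializationTimestamps(timeline));
    }
}
```

A single hardcoded initialization timestamp would match only one of the two partitions in this example, which is why the prefix filter covers both.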
I'm scared by the hardcoded -0; it is hard to maintain. At least we should define a constant for it.
I'm gonna remove this method. We don't really need it. I'll also change the log level to debug.