-
Notifications
You must be signed in to change notification settings - Fork 2.5k
[MINOR] Consider all initialization timestamps from MDT timeline to be valid #8915
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
…e valid Fix in some other places
| protected String getLatestDataInstantTime() { | ||
| return dataMetaClient.getActiveTimeline().filterCompletedInstants().lastInstant() | ||
| .map(HoodieInstant::getTimestamp).orElse(SOLO_COMMIT_TIMESTAMP); | ||
| .map(HoodieInstant::getTimestamp).orElse(HoodieTableMetadataUtil.createIndexInitTimestamp(SOLO_COMMIT_TIMESTAMP, 0)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
not sure if this solves/gives us much.
1: if we happened to initialize more than 1 MDT partition, the initialization time will be different. its 010 suffix for 1st and 011 for 2nd.
2: this api is used only in logging.
So, may not be worth fixing it. atleast for this (getLatestDataInstantTime).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm scared by the hardcode -0, it is hard to maintain, at least we should fine a constant for it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i'm gonna remove this method.. we don't really need it and also change the log level to debug.
| Option<HoodieInstant> latestMetadataInstant = metadataMetaClient.getActiveTimeline().filterCompletedInstants().lastInstant(); | ||
| String latestMetadataInstantTime = latestMetadataInstant.map(HoodieInstant::getTimestamp).orElse(SOLO_COMMIT_TIMESTAMP); | ||
|
|
||
| String latestMetadataInstantTime = latestMetadataInstant.map(HoodieInstant::getTimestamp).orElse(HoodieTableMetadataUtil.createIndexInitTimestamp(SOLO_COMMIT_TIMESTAMP, 0)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't see much benefit here too.
but not too strong.
this code will be invoked only after any partition in MDT will be initialized(which means the table config is updated). which means, the latestMetadataInstant should already be valid (Option will be non empty). So, what are the chances that we will call getRecordsByKey with BaseTableMetadata when any of MDT partitions have been initialized.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Makes sense. Let's actually do orElseThrow(() -> new IllegalStateException("No completed instant in the metadata timeline.")) instead of setting some initial timestamp.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, I agree, we should not be in that state. So, throwing an exception makes sense.
Also, this is under a public API, the construction of this class is also not safe. Since, we are allowing the object to be created even if the metadata table is not there. So, should we also throw exception in initIfNeeded method?
|
|
||
| // SOLO_COMMIT_TIMESTAMP is used during bootstrap so it is a valid timestamp | ||
| validInstantTimestamps.add(createIndexInitTimestamp(SOLO_COMMIT_TIMESTAMP, PARTITION_INITIALIZATION_TIME_SUFFIX)); | ||
| List<String> metadataInitializationTimestamps = metadataMetaClient.getActiveTimeline() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am also considering if this will give us any benefit.
validInstantTimestamps are used within LogRecordReader to ignore log blocks which was written using commits which are not completed yet.
Lets consider diff cases:
- for an existing table, we may never use SOLO commit time only since there will def be a latest completed commit from data table that we will use.
- For a new table. the base commit time to initialize MDT will be chosen as SOLO COMMIT TIME + suffix (one for each partition being initialized). and bulk insert will kick in. So, the base instant time will have prefix of SOLO COMMIT TIME . but any new log files will be added using new delta commits which will have diff commit times. so, I don't see a necessity to add SOLO COMMIT TIME to list of valid instant times only.
let me know if I am missing any flow. Just tryin to avoid going through entire active timeline of MDT to filter for SOLO COMMIT TIME if its never going to be used.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In other words,
if valid instant times are used only to filter log blocks, and if there won't be any flow where we might write log blocks with instant times w/ SOLO COMMIT TIME as prefix, we don't need to make this change.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
but any new log files will be added using new delta commits which will have diff commit times.
I agree, the SOLO_COMMIT_TIMESTAMP should be kept for backward compatibility BTW.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@danny0405 @nsivabalan Agree with your points but should a util method be aware of different cases? Let's say tomorrow for a new MDT partition, the semantics change and it writes a data block with the initializing commit itself, then the author/reviewer needs to come back to util method and fix it. This is going to be harder to maintain. IMO, the better way to handle such cases is by keeping the util method dumb, and do any case handling at the call site or have assertions for invariants such as data block can never have initializing commit time. Wdyt?
Btw, the change is backward compatible as it checks for startsWith(SOLO_COMMIT_TIMESTAMP). Also, there is no reloading of the timeline but to avoid going through active timeline again, i can merge the filter with the stream operation at lines 1349-1350.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Valid instants are mainly needed for reading the log files.
Whereas, first bootstrap commit on the partition, will create one set of dummy delete log blocks and a base file. So, all the data written into partition on the boostrap commit will be present in the base file. Even if dont include these SOLO_COMMIT_TIMESTAMP we should be ok I guess.
But it will be good to include them to keep it consistent.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Even if dont include these SOLO_COMMIT_TIMESTAMP we should be ok I guess.
Yes, that's why I'm saying SOLO_COMMIT_TIMESTAMP is for compatibility.
|
hey @codope : if this is not valid, can we close |
Change Logs
After #8684 landed, the initialization timestamp format changed. This PR accounts for the changed format at some places we missed.
Impact
No perf impact.
Risk level (write none, low medium or high below)
low
Documentation Update
Describe any necessary documentation update if there is any new feature, config, or user-facing change
ticket number here and follow the instruction to make
changes to the website.
Contributor's checklist