Skip to content

Conversation

@codope
Copy link
Member

@codope codope commented Jun 9, 2023

Change Logs

After #8684 landed, the initialization timestamp format changed. This PR accounts for the changed format at some places we missed.

Impact

No perf impact.

Risk level (write none, low medium or high below)

low

Documentation Update

Describe any necessary documentation update if there is any new feature, config, or user-facing change

  • The config description must be updated if new configs are added or the default value of the configs are changed
  • Any new feature or user-facing change requires updating the Hudi website. Please create a Jira ticket, attach the
    ticket number here and follow the instruction to make
    changes to the website.

Contributor's checklist

  • Read through contributor's guide
  • Change Logs and Impact were stated clearly
  • Adequate tests were added if applicable
  • CI passed

@hudi-bot
Copy link
Collaborator

hudi-bot commented Jun 9, 2023

CI report:

Bot commands @hudi-bot supports the following commands:
  • @hudi-bot run azure re-run the last Azure build

protected String getLatestDataInstantTime() {
return dataMetaClient.getActiveTimeline().filterCompletedInstants().lastInstant()
.map(HoodieInstant::getTimestamp).orElse(SOLO_COMMIT_TIMESTAMP);
.map(HoodieInstant::getTimestamp).orElse(HoodieTableMetadataUtil.createIndexInitTimestamp(SOLO_COMMIT_TIMESTAMP, 0));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

not sure if this solves/gives us much.
1: if we happened to initialize more than 1 MDT partition, the initialization time will be different. its 010 suffix for 1st and 011 for 2nd.
2: this api is used only in logging.

So, may not be worth fixing it. atleast for this (getLatestDataInstantTime).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm scared by the hardcode -0, it is hard to maintain, at least we should fine a constant for it.

Copy link
Member Author

@codope codope Jun 10, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i'm gonna remove this method.. we don't really need it and also change the log level to debug.

Option<HoodieInstant> latestMetadataInstant = metadataMetaClient.getActiveTimeline().filterCompletedInstants().lastInstant();
String latestMetadataInstantTime = latestMetadataInstant.map(HoodieInstant::getTimestamp).orElse(SOLO_COMMIT_TIMESTAMP);

String latestMetadataInstantTime = latestMetadataInstant.map(HoodieInstant::getTimestamp).orElse(HoodieTableMetadataUtil.createIndexInitTimestamp(SOLO_COMMIT_TIMESTAMP, 0));
Copy link
Contributor

@nsivabalan nsivabalan Jun 9, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't see much benefit here too.
but not too strong.

this code will be invoked only after any partition in MDT will be initialized(which means the table config is updated). which means, the latestMetadataInstant should already be valid (Option will be non empty). So, what are the chances that we will call getRecordsByKey with BaseTableMetadata when any of MDT partitions have been initialized.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Makes sense. Let's actually do orElseThrow(() -> new IllegalStateException("No completed instant in the metadata timeline.")) instead of setting some initial timestamp.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I agree, we should not be in that state. So, throwing an exception makes sense.
Also, this is under a public API, the construction of this class is also not safe. Since, we are allowing the object to be created even if the metadata table is not there. So, should we also throw exception in initIfNeeded method?


// SOLO_COMMIT_TIMESTAMP is used during bootstrap so it is a valid timestamp
validInstantTimestamps.add(createIndexInitTimestamp(SOLO_COMMIT_TIMESTAMP, PARTITION_INITIALIZATION_TIME_SUFFIX));
List<String> metadataInitializationTimestamps = metadataMetaClient.getActiveTimeline()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am also considering if this will give us any benefit.
validInstantTimestamps are used within LogRecordReader to ignore log blocks which was written using commits which are not completed yet.

Lets consider diff cases:

  1. for an existing table, we may never use SOLO commit time only since there will def be a latest completed commit from data table that we will use.
  2. For a new table. the base commit time to initialize MDT will be chosen as SOLO COMMIT TIME + suffix (one for each partition being initialized). and bulk insert will kick in. So, the base instant time will have prefix of SOLO COMMIT TIME . but any new log files will be added using new delta commits which will have diff commit times. so, I don't see a necessity to add SOLO COMMIT TIME to list of valid instant times only.

let me know if I am missing any flow. Just tryin to avoid going through entire active timeline of MDT to filter for SOLO COMMIT TIME if its never going to be used.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In other words,
if valid instant times are used only to filter log blocks, and if there won't be any flow where we might write log blocks with instant times w/ SOLO COMMIT TIME as prefix, we don't need to make this change.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

but any new log files will be added using new delta commits which will have diff commit times.

I agree, the SOLO_COMMIT_TIMESTAMP should be kept for backward compatibility BTW.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@danny0405 @nsivabalan Agree with your points but should a util method be aware of different cases? Let's say tomorrow for a new MDT partition, the semantics change and it writes a data block with the initializing commit itself, then the author/reviewer needs to come back to util method and fix it. This is going to be harder to maintain. IMO, the better way to handle such cases is by keeping the util method dumb, and do any case handling at the call site or have assertions for invariants such as data block can never have initializing commit time. Wdyt?

Btw, the change is backward compatible as it checks for startsWith(SOLO_COMMIT_TIMESTAMP). Also, there is no reloading of the timeline but to avoid going through active timeline again, i can merge the filter with the stream operation at lines 1349-1350.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Valid instants are mainly needed for reading the log files.
Whereas, first bootstrap commit on the partition, will create one set of dummy delete log blocks and a base file. So, all the data written into partition on the boostrap commit will be present in the base file. Even if dont include these SOLO_COMMIT_TIMESTAMP we should be ok I guess.
But it will be good to include them to keep it consistent.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Even if dont include these SOLO_COMMIT_TIMESTAMP we should be ok I guess.

Yes, that's why I'm saying SOLO_COMMIT_TIMESTAMP is for compatibility.

@nsivabalan
Copy link
Contributor

hey @codope : if this is not valid, can we close

@codope codope closed this Jul 20, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants