@satishkotha
Member

What is the purpose of the pull request

  • Add incremental timeline support to update pending clustering operations
  • Fix the timeline to include information about inflight clustering operations

Brief change log

  • Change timeline in filesystem views to include pending replacecommits (Previously it only included completed commits and pending compaction instants).
  • Because the filesystem view now includes pending clustering operations, change HoodieFileGroup#lastInstant to track only completed instants. Note that this required changing some assumptions in the TestUpgradeDowngrade tests; please take a close look.
  • Add incremental timeline support to refresh view based on pending clustering operations
  • Change the replacecommit.inflight file to also include the clustering plan (previously only the requested file had the clustering plan). This is needed to correctly block updates on file groups in pending clustering. One disadvantage is that replacecommit.inflight is sometimes Avro and sometimes JSON (the WorkloadProfile used by insert_overwrite), so a hack is needed to figure out whether an inflight file was created by insert_overwrite or by clustering.
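
To illustrate the kind of format check the last item refers to, here is a minimal sketch. It is not the code in this PR. It assumes the clustering plan is written as an Avro object container file (which starts with the 4-byte magic `Obj` followed by the byte 1) and that the insert_overwrite payload is a JSON object. `InflightFormatSketch` and `detect` are hypothetical names.

```java
/** Hypothetical helper for illustration only; not part of Hudi. */
final class InflightFormatSketch {
  // Avro object container files begin with the bytes 'O', 'b', 'j', 1.
  private static final byte[] AVRO_MAGIC = {'O', 'b', 'j', 1};

  enum Format { AVRO_CONTAINER, JSON, EMPTY_OR_UNKNOWN }

  /** Guesses the serialization format from the leading bytes of the file content. */
  static Format detect(byte[] content) {
    if (content == null || content.length == 0) {
      return Format.EMPTY_OR_UNKNOWN;
    }
    if (startsWithAvroMagic(content)) {
      return Format.AVRO_CONTAINER;
    }
    // Assumption: a JSON payload is an object, so its first
    // non-whitespace byte is '{'.
    for (byte b : content) {
      if (b == ' ' || b == '\t' || b == '\n' || b == '\r') {
        continue;
      }
      return b == '{' ? Format.JSON : Format.EMPTY_OR_UNKNOWN;
    }
    return Format.EMPTY_OR_UNKNOWN;
  }

  private static boolean startsWithAvroMagic(byte[] content) {
    if (content.length < AVRO_MAGIC.length) {
      return false;
    }
    for (int i = 0; i < AVRO_MAGIC.length; i++) {
      if (content[i] != AVRO_MAGIC[i]) {
        return false;
      }
    }
    return true;
  }
}
```

A check like this only sniffs the leading bytes; it does not validate that the rest of the file parses.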

Let me know if you have any suggestions.

Verify this pull request

This change added tests. See TestIncrementalFSViewSync.

Committer checklist

  • Has a corresponding JIRA in PR title & commit

  • Commit message is descriptive of the change

  • CI is green

  • Necessary doc changes done or have another open PR

  • For large changes, please consider breaking it into sub-tasks under an umbrella JIRA.

@codecov-io

codecov-io commented Dec 29, 2020

Codecov Report

Merging #2388 (4423c0c) into master (81ccb0c) will increase coverage by 19.15%.
The diff coverage is n/a.

@@              Coverage Diff              @@
##             master    #2388       +/-   ##
=============================================
+ Coverage     50.27%   69.43%   +19.15%     
+ Complexity     3050      357     -2693     
=============================================
  Files           419       53      -366     
  Lines         18897     1930    -16967     
  Branches       1937      230     -1707     
=============================================
- Hits           9500     1340     -8160     
+ Misses         8622      456     -8166     
+ Partials        775      134      -641     
Flag Coverage Δ Complexity Δ
hudicli ? ?
hudiclient ? ?
hudicommon ? ?
hudiflink ? ?
hudihadoopmr ? ?
hudisparkdatasource ? ?
hudisync ? ?
huditimelineservice ? ?
hudiutilities 69.43% <ø> (ø) 0.00 <ø> (ø)

Impacted Files Coverage Δ Complexity Δ
.../java/org/apache/hudi/common/util/CommitUtils.java
...rg/apache/hudi/common/bloom/SimpleBloomFilter.java
...apache/hudi/common/model/HoodieCommitMetadata.java
...pache/hudi/io/storage/HoodieFileReaderFactory.java
...i/common/table/log/block/HoodieHFileDataBlock.java
...hudi/common/config/DFSPropertiesConfiguration.java
.../org/apache/hudi/hadoop/utils/HoodieHiveUtils.java
...e/hudi/common/model/HoodieRollingStatMetadata.java
...pache/hudi/common/model/HoodieArchivedLogFile.java
.../hive/SlashEncodedHourPartitionValueExtractor.java
... and 346 more

@nsivabalan
Contributor

@vinothchandar @n3nash: gentle reminder to review :) It's been a few days.

Contributor

@bvaradar bvaradar left a comment

Reviewed most of the code. Regarding the final item in your description (quoted below), can you point me to the relevant section?

Change replacecommit.inflight file also to include clustering plan (Previously only requested file has clustering plan). This is needed to block updates on file groups in pending clustering correctly. One disadvantage is replacecommit.inflight has sometimes avro and sometimes json (WorkloadProfile used by insert_overwrite) structure. So there is a hack needed to figure out if a inflight file is created by insert_overwrite or clustering.

- List<HoodieInstant> finishedCompactionInstants = compactionInstants.stream()
- .filter(instantPair -> instantPair.getValue().getAction().equals(HoodieTimeline.COMMIT_ACTION)
- && instantPair.getValue().isCompleted())
+ List<HoodieInstant> finishedViewChangingInstants = viewChangingInstants.stream()

Can you construct a timeline and call timeline.viewAlteringInstants() instead, to avoid duplicating the logic?

List<Pair<HoodieInstant, HoodieInstant>> allTransitions = new ArrayList<>();

- return oldTimeline.filterPendingCompactionTimeline().getInstants().map(instant -> {
+ allTransitions.addAll(oldTimeline.filterPendingCompactionTimeline().getInstants().map(instant -> {

Instead of oldTimeline.filterPendingCompactionTimeline().getInstants(), is there scope to use oldTimeline.viewChangingInstants() and consolidate both this statement and the one below, where we handle replace commits?
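
A rough sketch of the consolidation suggested here, using stand-in types rather than Hudi's timeline classes. It assumes "view changing" means pending compaction plus pending replacecommit instants; `ViewChangingFilterSketch`, `Instant`, and `pendingViewChanging` are hypothetical names, and the action strings are stand-ins for the HoodieTimeline constants.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.stream.Collectors;

/** Stand-in types for illustration only; not Hudi's timeline API. */
final class ViewChangingFilterSketch {
  // Assumption: these are the actions whose pending instants alter the view.
  static final Set<String> VIEW_CHANGING_ACTIONS =
      new HashSet<>(Arrays.asList("compaction", "replacecommit"));

  static final class Instant {
    final String action;
    final String timestamp;
    final boolean completed;

    Instant(String action, String timestamp, boolean completed) {
      this.action = action;
      this.timestamp = timestamp;
      this.completed = completed;
    }
  }

  /** One filter for both action types, instead of one code path per action. */
  static List<Instant> pendingViewChanging(List<Instant> timeline) {
    return timeline.stream()
        .filter(i -> !i.completed && VIEW_CHANGING_ACTIONS.contains(i.action))
        .collect(Collectors.toList());
  }
}
```

With a single filter, the transition logic for compaction and replacecommit can iterate over one stream and branch on the action only where the handling actually differs.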

void addFileGroupsInPendingClustering(Stream<Pair<HoodieFileGroupId, HoodieInstant>> fileGroups) {
fileGroups.forEach(fileGroupInstantPair -> {
- ValidationUtils.checkArgument(fgIdToPendingClustering.containsKey(fileGroupInstantPair.getLeft()),
+ ValidationUtils.checkArgument(!fgIdToPendingClustering.containsKey(fileGroupInstantPair.getLeft()),

This is a bug in 0.7, right? It will fail when RocksDBFileSystemView is used.
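
To show why the negation in the snippet above matters, here is a minimal sketch with a plain HashMap standing in for the view's state. `PendingClusteringSketch` and its methods are hypothetical and only mirror the shape of the check; they are not the Hudi implementation.

```java
import java.util.HashMap;
import java.util.Map;

/** Stand-in for the pending-clustering bookkeeping; for illustration only. */
final class PendingClusteringSketch {
  // file group id -> instant time of the pending clustering operation
  private final Map<String, String> fgIdToPendingClustering = new HashMap<>();

  /** Registers a file group as pending clustering; rejects duplicates. */
  void add(String fileGroupId, String instantTime) {
    // The precondition must be "not already present". Without the
    // negation, the very first add on an empty map would be rejected.
    if (fgIdToPendingClustering.containsKey(fileGroupId)) {
      throw new IllegalArgumentException(
          "File group " + fileGroupId + " is already in pending clustering at "
              + fgIdToPendingClustering.get(fileGroupId));
    }
    fgIdToPendingClustering.put(fileGroupId, instantTime);
  }

  boolean isPending(String fileGroupId) {
    return fgIdToPendingClustering.containsKey(fileGroupId);
  }
}
```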

diffResult.getFinishedViewChangingInstants().stream().forEach(instant -> {
try {
removePendingCompactionInstant(timeline, instant);
if (HoodieTimeline.COMPACTION_ACTION.equals(instant.getAction())) {

Can we introduce something like HoodieInstant.isAction(String action) instead of checking the action names directly here? There could be many occurrences like this.
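
A sketch of what such a helper could look like, on a stand-in class rather than the real HoodieInstant. `isAction` is the hypothetical method proposed in this comment, and the constants here are stand-ins for the ones on HoodieTimeline.

```java
import java.util.Objects;

/** Stand-in for HoodieInstant; for illustration only. */
final class InstantSketch {
  // Stand-ins for the HoodieTimeline action constants.
  static final String COMPACTION_ACTION = "compaction";
  static final String REPLACE_COMMIT_ACTION = "replacecommit";

  private final String action;
  private final String timestamp;

  InstantSketch(String action, String timestamp) {
    this.action = Objects.requireNonNull(action, "action");
    this.timestamp = Objects.requireNonNull(timestamp, "timestamp");
  }

  /** Hypothetical helper: true if this instant has the given action. */
  boolean isAction(String expectedAction) {
    return action.equals(expectedAction);
  }

  String getTimestamp() {
    return timestamp;
  }
}
```

Call sites would then read `instant.isAction(COMPACTION_ACTION)` instead of `COMPACTION_ACTION.equals(instant.getAction())`, which keeps the comparison in one place.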

// Adding mandatory parameters - Last instants affecting file-slice
timeline.lastInstant().ifPresent(instant -> builder.addParameter(LAST_INSTANT_TS, instant.getTimestamp()));
- builder.addParameter(TIMELINE_HASH, timeline.getTimelineHash());
+ builder.addParameter(TIMELINE_HASH, timeline.filterCompletedAndCompactionInstants().getTimelineHash());

Shouldn't this be filterViewChangingInstants?

this.fileSlices = new TreeMap<>(HoodieFileGroup.getReverseCommitTimeComparator());
this.timeline = timeline;
- this.lastInstant = timeline.lastInstant();
+ this.lastInstant = timeline.filterCompletedAndCompactionInstants().lastInstant();

We should avoid this. FileGroup should just act on the timeline it is given, so that file groups stay composable. Can you elaborate on why lastInstant must include only completed instants?
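
A sketch of the composable alternative, with stand-in types and not Hudi's classes: the file group takes whatever timeline it is handed, and the caller decides which instants to filter out before constructing it. For brevity the filter below keeps completed instants only, which is a simplification of filterCompletedAndCompactionInstants. All names are hypothetical.

```java
import java.util.List;
import java.util.Optional;
import java.util.stream.Collectors;

/** Stand-in types for illustration only; not Hudi's HoodieFileGroup. */
final class FileGroupCompositionSketch {
  static final class Instant {
    final String timestamp;
    final boolean completed;

    Instant(String timestamp, boolean completed) {
      this.timestamp = timestamp;
      this.completed = completed;
    }
  }

  static final class FileGroup {
    private final Optional<Instant> lastInstant;

    /** Acts on the given timeline as-is; applies no filtering of its own. */
    FileGroup(List<Instant> timeline) {
      this.lastInstant = timeline.isEmpty()
          ? Optional.empty()
          : Optional.of(timeline.get(timeline.size() - 1));
    }

    Optional<Instant> lastInstant() {
      return lastInstant;
    }
  }

  /** Caller-side filter (simplified: completed instants only). */
  static List<Instant> completedOnly(List<Instant> timeline) {
    return timeline.stream()
        .filter(i -> i.completed)
        .collect(Collectors.toList());
  }
}
```

A caller that wants lastInstant to ignore pending clustering would pass `completedOnly(timeline)` to the constructor, while other callers can still hand over the unfiltered timeline.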

Contributor

@bvaradar bvaradar left a comment

Will review the rest of the section after the diff gets updated.

@n3nash
Contributor

n3nash commented Mar 30, 2021

@satishkotha can you address @bvaradar's comments and rebase? I can take a pass after that.

@n3nash
Contributor

n3nash commented Apr 8, 2021

@satishkotha gentle reminder

@satishkotha
Member Author

@n3nash I don't have time in the next 2-3 weeks to get this done. If you prefer, we can close this one. I can reopen (the same PR or a different one) when I'm ready.

@vinothchandar
Member

@n3nash @satishkotha Any updates on this? Generally, I'd love to get these clustering follow-ups over the fence if we can.

@codope
Member

codope commented Nov 3, 2021

Incremental reads in the presence of pending clustering are covered by #3419 and #3802.
This PR can be closed.

@nsivabalan nsivabalan closed this Nov 3, 2021