Skip to content

Conversation

@YuweiXiao
Copy link
Contributor

@YuweiXiao YuweiXiao commented Jan 8, 2022

Tips

What is the purpose of the pull request

Fix MOR snapshot query path during compaction for HIVE read.

In current implementation, if a write comes in and complete during compaction, it will not be visible to snapshot query until the compaction completes. This is caused by filter logic of getting file group's log files.

Brief change log

  • Include pending compactions in the timeline for the file group retrieval logic.

Verify this pull request

This change added tests and can be verified as follows:

  • Add functional test about snapshot query during compaction.

Committer checklist

  • Has a corresponding JIRA in PR title & commit

  • Commit message is descriptive of the change

  • CI is green

  • Necessary doc changes done or have another open PR

  • For large changes, please consider breaking it into sub-tasks under an umbrella JIRA.

@YuweiXiao YuweiXiao changed the title [HUDI-3194] fix MOR snapshot query (HIVE) during compaction [HUDI-3194][WIP] fix MOR snapshot query (HIVE) during compaction Jan 8, 2022
@nsivabalan
Copy link
Contributor

@YuweiXiao : I see "WIP" in title. Is the patch good to review or is it still being worked upon ?

@YuweiXiao
Copy link
Contributor Author

@YuweiXiao : I see "WIP" in title. Is the patch good to review or is it still being worked upon ?

Yes, it is still in process, as there are failed UT need to be fixed.

@nsivabalan
Copy link
Contributor

@xiarixiaoyao : hey, can you review this patch please. Touches part of the code authored by you.

@YuweiXiao YuweiXiao changed the title [HUDI-3194][WIP] fix MOR snapshot query (HIVE) during compaction [HUDI-3194] fix MOR snapshot query (HIVE) during compaction Jan 10, 2022
@hudi-bot
Copy link
Collaborator

CI report:

Bot commands @hudi-bot supports the following commands:
  • @hudi-bot run azure re-run the last Azure build

Copy link
Member

@codope codope left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@YuweiXiao Why not use BaseFileWithLogsSplit which maps matching log files to base files?

// Both commit and delta-commits are included - pick the latest completed one
Option<HoodieInstant> latestCompletedInstant =
metaClient.getActiveTimeline().getCommitsTimeline().filterCompletedInstants().lastInstant();
metaClient.getActiveTimeline().getWriteTimeline().filterCompletedInstants().lastInstant();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The writeTimeline will also contain the compaction instant compared to commitsTimeline, but how does that matter for this scenario? Since latest active timeline is already being passed to createInMemoryFileSystemViewWithTimeline then latest file slice would contain the file group due to commit during ongoing compaction right?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It won't affect the correctness. The latestCompletedInstant is used to filter file slice. Considering a compaction only case, without including the completed compaction instant, we will end up reading 'old version' file slice (i.e., base file + log) rather than the compacted one (i.e., only base file, which has better performance).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I also don't understand the fix. can you help throw some light. From the description in this patch, the gap is, when compaction is on-going and a new write comes in and completes, it may not be visible to queries.
But the fix here, just includes compaction instants to the list of instants to process. Not sure if the description matches the fix.
or am I missing anything here.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I also don't understand the fix. can you help throw some light. From the description in this patch, the gap is, when compaction is on-going and a new write comes in and completes, it may not be visible to queries. But the fix here, just includes compaction instants to the list of instants to process. Not sure if the description matches the fix. or am I missing anything here.

Hey! In fsView::getLatestMergedFileSlicesBeforeOrOn, there is a logic where we check if a file group is under compaction (under construction), so that we could add logs files generated by concurrent writers. And only passing a timeline including compactions, this logic could work (fsView::fetchMergedFileSlice).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

got it. thanks for explaining in detail. Fix makes sense then.

@nsivabalan nsivabalan added the priority:high Significant impact; potential bugs label Jan 10, 2022
@YuweiXiao
Copy link
Contributor Author

@YuweiXiao Why not use BaseFileWithLogsSplit which maps matching log files to base files?

Actually, I tried with BaseFileWithLogsSplit. IIUC, we need to use BaseFileWithLogsSplit at the place where we generate the split, i.e., HoodieInputFormatUtils::filterFileStatusForSnapshotMode. By looking at its implementation, I found its semantic is to generate baseFile split and log file only file slice. And callers of this function rely on this semantic, such as bootstrap.

@xiarixiaoyao
Copy link
Contributor

@YuweiXiao only a small question: Does clustering also have this problem?

@YuweiXiao
Copy link
Contributor Author

YuweiXiao commented Jan 11, 2022

@YuweiXiao only a small question: Does clustering also have this problem?

I guess not. Currently, clustering doesn't support concurrent updates. So there won't be new log files when doing the clustering.

But you remind me it may be a problem in the future. I am working on consistent hashing index recently, which could enhance clustering to support concurrent update. I will keep an eye on it.

@xiarixiaoyao
Copy link
Contributor

LGTM

@xiarixiaoyao xiarixiaoyao requested review from xiarixiaoyao and removed request for xiarixiaoyao January 11, 2022 03:48
@xiarixiaoyao
Copy link
Contributor

@codope could you pls review this pr again, thanks

@nsivabalan nsivabalan merged commit d365337 into apache:master Jan 17, 2022
@nsivabalan nsivabalan added priority:critical Production degraded; pipelines stalled and removed priority:high Significant impact; potential bugs labels Jan 17, 2022
@vinishjail97 vinishjail97 mentioned this pull request Jan 24, 2022
5 tasks
vingov pushed a commit to vingov/hudi that referenced this pull request Jan 26, 2022
liusenhua pushed a commit to liusenhua/hudi that referenced this pull request Mar 1, 2022
vingov pushed a commit to vingov/hudi that referenced this pull request Apr 3, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

priority:critical Production degraded; pipelines stalled

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants