[HUDI-3194] fix MOR snapshot query (HIVE) during compaction #4540

YuweiXiao · 2022-01-08T13:20:02Z

What is the purpose of the pull request

Fix the MOR snapshot query path for Hive reads during compaction.

In the current implementation, if a write comes in and completes during compaction, it is not visible to snapshot queries until the compaction completes. This is caused by the filter logic that retrieves a file group's log files.

Brief change log

Include pending compactions in the timeline for the file group retrieval logic.
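
As a rough illustration of what including pending compactions in the timeline means, here is a toy model in plain Java. The `Instant` class, the action names, and the two action sets are assumptions made for this sketch; they are not Hudi's classes or its timeline API.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

/** Toy model only: hypothetical names, not Hudi's timeline classes. */
final class TimelineSketch {
    /** Hypothetical instant: a timestamp, an action name, and a completion flag. */
    static final class Instant {
        final String time;
        final String action;
        final boolean completed;

        Instant(String time, String action, boolean completed) {
            this.time = time;
            this.action = action;
            this.completed = completed;
        }
    }

    // Assumption for this sketch: the "write" view is the "commits" view plus compaction.
    static final Set<String> COMMIT_ACTIONS = Set.of("commit", "deltacommit");
    static final Set<String> WRITE_ACTIONS = Set.of("commit", "deltacommit", "compaction");

    /** Describes the instants that a view restricted to the given actions can see. */
    static List<String> visible(List<Instant> timeline, Set<String> actions) {
        List<String> out = new ArrayList<>();
        for (Instant i : timeline) {
            if (actions.contains(i.action)) {
                out.add(i.time + " " + i.action + (i.completed ? "" : " (pending)"));
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<Instant> timeline = List.of(
            new Instant("001", "deltacommit", true),
            new Instant("002", "compaction", false),   // compaction still running
            new Instant("003", "deltacommit", true));  // write that completed meanwhile

        // The commits-only view never learns that a compaction is pending at 002.
        System.out.println(visible(timeline, COMMIT_ACTIONS));
        // The write view also exposes the pending compaction instant.
        System.out.println(visible(timeline, WRITE_ACTIONS));
    }
}
```

In this model, only the second view gives downstream logic a chance to notice that a file group is under compaction.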

Verify this pull request

This change added tests and can be verified as follows:

Add a functional test for snapshot queries during compaction.

nsivabalan · 2022-01-09T17:48:40Z

@YuweiXiao: I see "WIP" in the title. Is the patch ready for review, or is it still being worked on?

YuweiXiao · 2022-01-10T01:36:17Z

Yes, it is still in progress; there are failing unit tests that need to be fixed.

nsivabalan · 2022-01-10T04:26:47Z

@xiarixiaoyao: hey, could you please review this patch? It touches part of the code you authored.

hudi-bot · 2022-01-10T09:46:04Z

CI report:

49b0796 Azure: SUCCESS

codope

@YuweiXiao Why not use BaseFileWithLogsSplit which maps matching log files to base files?

codope · 2022-01-10T15:51:16Z

hudi-hadoop-mr/src/main/java/org/apache/hudi/hadoop/utils/HoodieRealtimeInputFormatUtils.java

        // Both commit and delta-commits are included - pick the latest completed one
        Option<HoodieInstant> latestCompletedInstant =
-            metaClient.getActiveTimeline().getCommitsTimeline().filterCompletedInstants().lastInstant();
+            metaClient.getActiveTimeline().getWriteTimeline().filterCompletedInstants().lastInstant();


Compared to commitsTimeline, writeTimeline also contains the compaction instant, but how does that matter for this scenario? Since the latest active timeline is already passed to createInMemoryFileSystemViewWithTimeline, the latest file slice would already reflect a commit made during the ongoing compaction, right?

YuweiXiao · Jan 11, 2022

It doesn't affect correctness. latestCompletedInstant is used to filter file slices. Consider a compaction-only case: without the completed compaction instant, we would end up reading the old file slice (base file + logs) rather than the compacted one (base file only, which performs better).
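
The compaction-only case can be sketched with a toy model of "latest file slice before or on an instant". The `FileSlice` class and the `latestBeforeOrOn` helper below are hypothetical stand-ins written for this illustration, assuming instant times compare as strings; they are not Hudi's implementation.

```java
import java.util.Comparator;
import java.util.List;
import java.util.Optional;

/** Toy model only: hypothetical names, not Hudi's file system view. */
final class LatestSliceSketch {
    /** Hypothetical file slice: its base instant and how many log files sit on top of the base file. */
    static final class FileSlice {
        final String baseInstant;
        final int logFileCount;

        FileSlice(String baseInstant, int logFileCount) {
            this.baseInstant = baseInstant;
            this.logFileCount = logFileCount;
        }

        @Override
        public String toString() {
            return "slice@" + baseInstant + " with " + logFileCount + " log file(s)";
        }
    }

    /** Picks the newest slice whose base instant is not later than maxInstant. */
    static Optional<FileSlice> latestBeforeOrOn(List<FileSlice> slices, String maxInstant) {
        return slices.stream()
            .filter(s -> s.baseInstant.compareTo(maxInstant) <= 0)
            .max(Comparator.comparing((FileSlice s) -> s.baseInstant));
    }

    public static void main(String[] args) {
        List<FileSlice> slices = List.of(
            new FileSlice("001", 2),   // old slice: base file plus two log files
            new FileSlice("004", 0));  // compacted slice: base file only

        // If the completed compaction at 004 is not counted, the cutoff stays at 003
        // and the reader merges the old base file with its logs.
        System.out.println(latestBeforeOrOn(slices, "003").get());
        // With the compaction counted, the reader gets the cheaper compacted slice.
        System.out.println(latestBeforeOrOn(slices, "004").get());
    }
}
```

Both results return the same data in this model; the difference is only how much merging the reader has to do.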

nsivabalan · Jan 13, 2022

I also don't understand the fix; can you shed some light? According to the description of this patch, the gap is that when a compaction is ongoing and a new write comes in and completes, it may not be visible to queries. But the fix here just adds compaction instants to the list of instants to process. I'm not sure the description matches the fix. Or am I missing something?

YuweiXiao · Jan 13, 2022

Hey! In fsView::getLatestMergedFileSlicesBeforeOrOn, there is logic that checks whether a file group is under compaction, so that the log files generated by concurrent writers can be added. That logic (fsView::fetchMergedFileSlice) only works when the timeline passed in includes compactions.

nsivabalan · Jan 14, 2022

Got it. Thanks for explaining in detail. The fix makes sense then.
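
The merge behavior discussed in this thread can be approximated with a small toy model. Everything below is an assumption made for illustration: the `FileSlice` class, the `filesToRead` helper, the file names, and the rule that a slice is visible only when its base instant is in the timeline. It is a sketch of the idea, not the code in fsView::fetchMergedFileSlice.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

/** Toy model only: hypothetical names, not Hudi's file system view. */
final class MergedSliceSketch {
    /** Hypothetical file slice; baseFile is null while a compaction has not produced it yet. */
    static final class FileSlice {
        final String baseInstant;
        final String baseFile;
        final List<String> logFiles;

        FileSlice(String baseInstant, String baseFile, List<String> logFiles) {
            this.baseInstant = baseInstant;
            this.baseFile = baseFile;
            this.logFiles = logFiles;
        }
    }

    private static void addFiles(List<String> out, FileSlice slice) {
        if (slice.baseFile != null) {
            out.add(slice.baseFile);
        }
        out.addAll(slice.logFiles);
    }

    /**
     * Files a snapshot reader would merge for one file group.
     * Slices are ordered oldest to newest. A slice is visible only if its base instant
     * is in timelineInstants (an assumption of this sketch).
     */
    static List<String> filesToRead(List<FileSlice> slices,
                                    Set<String> timelineInstants,
                                    Set<String> pendingCompactions) {
        List<FileSlice> visible = new ArrayList<>();
        for (FileSlice s : slices) {
            if (timelineInstants.contains(s.baseInstant)) {
                visible.add(s);
            }
        }
        List<String> files = new ArrayList<>();
        if (visible.isEmpty()) {
            return files;
        }
        FileSlice latest = visible.get(visible.size() - 1);
        // Under a pending compaction the latest slice has only new log files,
        // so the previous slice is merged in first.
        if (pendingCompactions.contains(latest.baseInstant) && visible.size() > 1) {
            addFiles(files, visible.get(visible.size() - 2));
        }
        addFiles(files, latest);
        return files;
    }

    public static void main(String[] args) {
        List<FileSlice> slices = List.of(
            new FileSlice("001", "base@001", List.of("log@001")),
            // Compaction scheduled at 002; a concurrent write appended this log file.
            new FileSlice("002", null, List.of("log@002")));

        // Timeline without the pending compaction: the new log file is invisible.
        System.out.println(filesToRead(slices, Set.of("001", "003"), Set.of()));
        // Timeline with the pending compaction: the concurrent write is merged in.
        System.out.println(filesToRead(slices, Set.of("001", "002", "003"), Set.of("002")));
    }
}
```

Under these assumptions the first call returns only the files of the older slice, while the second also returns the log file written during the compaction.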

YuweiXiao · 2022-01-11T02:00:14Z

@YuweiXiao Why not use BaseFileWithLogsSplit which maps matching log files to base files?

Actually, I tried BaseFileWithLogsSplit. IIUC, it has to be used where the splits are generated, i.e., HoodieInputFormatUtils::filterFileStatusForSnapshotMode. Looking at that function's implementation, I found that its semantics are to generate base-file splits plus file slices that contain only log files. Callers of this function, such as bootstrap, rely on these semantics.

xiarixiaoyao · 2022-01-11T03:11:36Z

@YuweiXiao Just one small question: does clustering also have this problem?

YuweiXiao · 2022-01-11T03:13:21Z

I guess not. Currently, clustering doesn't support concurrent updates, so no new log files appear while clustering is running.

But you've reminded me that it may become a problem in the future. I have been working on a consistent hashing index recently, which could let clustering support concurrent updates. I will keep an eye on it.

xiarixiaoyao · 2022-01-11T03:48:02Z

LGTM

xiarixiaoyao · 2022-01-12T14:07:24Z

@codope could you please review this PR again? Thanks.

YuweiXiao changed the title ~~[HUDI-3194] fix MOR snapshot query (HIVE) during compaction~~ [HUDI-3194][WIP] fix MOR snapshot query (HIVE) during compaction Jan 8, 2022

YuweiXiao force-pushed the HUDI-3194 branch from dc6e817 to c3295aa Compare January 10, 2022 02:46

YuweiXiao force-pushed the HUDI-3194 branch from c3295aa to 52cad35 Compare January 10, 2022 07:28

[HUDI-3194] fix MOR snapshot query during compaction

49b0796

YuweiXiao force-pushed the HUDI-3194 branch from 52cad35 to 49b0796 Compare January 10, 2022 08:39

YuweiXiao changed the title ~~[HUDI-3194][WIP] fix MOR snapshot query (HIVE) during compaction~~ [HUDI-3194] fix MOR snapshot query (HIVE) during compaction Jan 10, 2022

codope reviewed Jan 10, 2022

nsivabalan added the priority:high Significant impact; potential bugs label Jan 10, 2022

xiarixiaoyao requested review from xiarixiaoyao and removed request for xiarixiaoyao January 11, 2022 03:48

nsivabalan approved these changes Jan 14, 2022

nsivabalan merged commit d365337 into apache:master Jan 17, 2022

nsivabalan added priority:critical Production degraded; pipelines stalled and removed priority:high Significant impact; potential bugs labels Jan 17, 2022

nsivabalan pushed a commit that referenced this pull request Jan 19, 2022

[HUDI-3194] fix MOR snapshot query during compaction (#4540)

a8ee57f

vinishjail97 mentioned this pull request Jan 24, 2022

FixIgnoreKey nsivabalan/hudi#11

Closed

5 tasks

vingov pushed a commit to vingov/hudi that referenced this pull request Jan 26, 2022

[HUDI-3194] fix MOR snapshot query during compaction (apache#4540)

f9858b1

liusenhua pushed a commit to liusenhua/hudi that referenced this pull request Mar 1, 2022

[HUDI-3194] fix MOR snapshot query during compaction (apache#4540)

7527928

vingov pushed a commit to vingov/hudi that referenced this pull request Apr 3, 2022

[HUDI-3194] fix MOR snapshot query during compaction (apache#4540)

32afc90
