[HUDI-2711] Fallback to fulltable scan for IncrementalRelation if underlying files have been cleared or moved by cleaner #3946
Conversation
@nsivabalan Could you please review this?
| log.info("Checking if paths exists took " + timeTaken + "ms") | ||
|
|
||
| val optStartTs = optParams(DataSourceReadOptions.BEGIN_INSTANTTIME.key) | ||
| val isInstantArchived = optStartTs.compareTo(commitTimeline.firstInstant().get().getTimestamp) < 0 // True if optStartTs < activeTimeline.first |
From a user standpoint, I would expect we should fall back to the first valid commit in the active timeline which the cleaner has not cleaned up. But I guess from an impl standpoint, we can't find this commit that easily. And so is that the rationale to fall back to a snapshot query?
My bad. From re-reading the description, I guess the fix does not sit well. The cleaner will not touch the timeline, right? So, how do we know if a commit has been cleaned up or not (because it could still be part of the active timeline)? Maybe I am missing something.
I revisited this patch. I get it now.
So, we are fixing two things:
1. A commit is valid in the active timeline, but the corresponding data files have been cleaned up.
2. The begin commit is archived.
Makes sense to me.
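A rough sketch of how these two triggers combine, with `commitTimeline`, `optStartTs`, `allFilesToCheck`, and `fs` assumed in scope as in the quoted snippets (the combined variable name is illustrative, not the exact identifier in the patch):

```scala
import org.apache.hadoop.fs.Path

// (2) begin instant predates the first instant on the active timeline => archived
val isInstantArchived =
  optStartTs.compareTo(commitTimeline.firstInstant().get().getTimestamp) < 0
// (1) instant may still be on the active timeline, but the cleaner has already
// removed some data files the incremental scan would need
val anyFileMissing = allFilesToCheck.exists(path => !fs.exists(new Path(path)))
// Either condition forces the fallback to a snapshot (full table) scan
val fallbackToSnapshot = isInstantArchived || anyFileMissing
```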
nsivabalan left a comment:
Left a high-level comment.
Something to think about as a potential solution.
nsivabalan left a comment:
Left some clarifying comments. Changes look good to me in general.
| log.info("Checking if paths exists took " + timeTaken + "ms") | ||
|
|
||
| val optStartTs = optParams(DataSourceReadOptions.BEGIN_INSTANTTIME.key) | ||
| val isInstantArchived = optStartTs.compareTo(commitTimeline.firstInstant().get().getTimestamp) < 0 // True if optStartTs < activeTimeline.first |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I revisited this patch. I get it now.
So, we are fixing two things.
1: a commit is valid in active timeline, but corresponding data files are cleaned up.
2: begin commit is archived.
Makes sense to me.
    val timer = new HoodieTimer().startTimer();
    ...
    val allFilesToCheck = filteredMetaBootstrapFullPaths ++ filteredRegularFullPaths
    val firstNotFoundPath = allFilesToCheck.find(path => !fs.exists(new Path(path)))
fs.exists() should be routed to the metadata table.
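A rough sketch of that suggestion, assuming a HoodieTableMetadata-style listing API; `HoodieTableMetadata.create` and `getAllFilesInPartition` are assumptions whose exact signatures vary across Hudi versions, and `engineContext`, `metadataConfig`, `affectedPartitions`, and `spillableDir` are illustrative:

```scala
import org.apache.hadoop.fs.Path
import org.apache.hudi.metadata.HoodieTableMetadata

// Instead of one fs.exists() RPC per file, list each affected partition once
// through the metadata table and check membership locally.
val metadata = HoodieTableMetadata.create(engineContext, metadataConfig, basePath, spillableDir)
val existingFiles: Set[String] = affectedPartitions.flatMap { partition =>
  metadata.getAllFilesInPartition(new Path(basePath, partition)).map(_.getPath.toString)
}.toSet
val firstNotFoundPath = allFilesToCheck.find(path => !existingFiles.contains(path))
```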
    .schema(usedSchema)
    .format("hudi")
    .load(basePath)
    .filter(String.format("%s > '%s'", HoodieRecord.COMMIT_TIME_METADATA_FIELD, // Notice the > in place of >= because we are working with optParam instead of first commit > optParam
Can you help me understand how this works?
Let's take the example in the tests added:
C0 C1 C2 C3 | C4 C5 | C6 C7 C8 C9
C0 to C3 are archived.
C4 and C5 are cleaned.
Active timeline: C4 to C9.
If someone tries an incremental query with C4 and C5 as begin and end, do we do a full scan of the table for records with commit time > C4 and <= C5?
What's the checkpoint returned at the end? Is it C5, so that next time the caller will make an incremental query with begin time C5?
So, in this case, if records pertaining to C4 and C5 have been updated by future commits, we may return an empty df, is that right?
I guess my question on the incremental query checkpoint may not make sense. If the consumer is a deltastreamer, it will keep track of commits consumed and will send back C5 for the next round. The query as such may not return any explicit checkpoint. Correct me if my understanding is wrong.
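For reference, a minimal sketch of the fallback under discussion, assuming `spark`, `usedSchema`, `basePath`, `startTs`, and `endTs` are in scope; COMMIT_TIME_METADATA_FIELD is Hudi's `_hoodie_commit_time` metadata column:

```scala
import org.apache.hudi.common.model.HoodieRecord

// Snapshot-read the whole table, then keep only records whose commit time falls
// in (startTs, endTs]. '>' rather than '>=' because startTs was already consumed.
val incrementalDf = spark.read
  .schema(usedSchema)
  .format("hudi")
  .load(basePath)
  .filter(s"${HoodieRecord.COMMIT_TIME_METADATA_FIELD} > '$startTs'")
  .filter(s"${HoodieRecord.COMMIT_TIME_METADATA_FIELD} <= '$endTs'")
```

Since records updated by later commits carry the later commit time, C4/C5 rows rewritten by C6+ fall outside the window, consistent with the "empty df" observation above.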
nsivabalan left a comment:
LGTM. I have created a follow-up to address some of the feedback. Will go ahead and land this.
hudi-spark-datasource/hudi-spark/src/main/scala/org/apache/hudi/IncrementalRelation.scala (outdated; resolved)
nsivabalan left a comment:
LGTM. Have filed a follow-up JIRA to address feedback: https://issues.apache.org/jira/browse/HUDI-3189
nsivabalan left a comment:
@jsbali: Can you review the patch once? I made some minor updates, but had to resolve conflicts with latest master. Just wanted to ensure things are in good shape.
Adding more context here. There is one other way, where we skip the HoodieFileIndex path and directly take in a dir glob pattern, but that is one extra config which needs to be maintained (sketched below).
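For context, a sketch of that alternative, assuming Hudi's incremental path-glob option (`DataSourceReadOptions.INCR_PATH_GLOB`, key `hoodie.datasource.read.incr.path.glob`; worth verifying against the Hudi version in use). The glob value and `beginTs` are illustrative:

```scala
import org.apache.hudi.DataSourceReadOptions

// Restrict the incremental scan to partition dirs matching a glob, bypassing
// the HoodieFileIndex path; the glob itself is the extra config to maintain.
val df = spark.read
  .format("hudi")
  .option(DataSourceReadOptions.QUERY_TYPE.key, DataSourceReadOptions.QUERY_TYPE_INCREMENTAL_OPT_VAL)
  .option(DataSourceReadOptions.BEGIN_INSTANTTIME.key, beginTs)
  .option(DataSourceReadOptions.INCR_PATH_GLOB.key, "/year=2021/*/*")
  .load(basePath)
```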
hudi-spark-datasource/hudi-spark/src/main/scala/org/apache/hudi/DefaultSource.scala (outdated; resolved)
nsivabalan left a comment:
Let's clarify with some experts. I am not comfortable changing the DefaultSource.
I am removing this one from 0.10.1 as it needs some discussion to be resolved, but let's try to get closure soon.
@jsbali: I have made some fixes and removed the changes in DefaultSource. Can you take a look?
…erlying files have been cleared or moved by cleaner
…tion column gets added to the end and fails the union of DF while doing incr scan fallback
@nsivabalan Thanks a lot for seeing this through. LGTM
@jsbali: Do you think you can take up similar work for MOR? I can assist you if need be.
…erlying files have been cleared or moved by cleaner (apache#3946) Co-authored-by: sivabalan <[email protected]>
