Skip to content

Conversation

@jsbali
Copy link
Contributor

@jsbali jsbali commented Nov 8, 2021

Tips

What is the purpose of the pull request

(For example: This pull request adds quick-start document.)

Brief change log

(for example:)

  • Modify AnnotationLocation checkstyle rule in checkstyle.xml

Verify this pull request

(Please pick either of the following options)

This pull request is a trivial rework / code cleanup without any test coverage.

(or)

This pull request is already covered by existing tests, such as (please describe tests).

(or)

This change added tests and can be verified as follows:

(example:)

  • Added integration tests for end-to-end.
  • Added HoodieClientWriteTest to verify the change.
  • Manually verified the change by running a job locally.

Committer checklist

  • Has a corresponding JIRA in PR title & commit

  • Commit message is descriptive of the change

  • CI is green

  • Necessary doc changes done or have another open PR

  • For large changes, please consider breaking it into sub-tasks under an umbrella JIRA.

@n3nash
Copy link
Contributor

n3nash commented Nov 23, 2021

@nsivabalan Could you please review this ?

log.info("Checking if paths exists took " + timeTaken + "ms")

val optStartTs = optParams(DataSourceReadOptions.BEGIN_INSTANTTIME.key)
val isInstantArchived = optStartTs.compareTo(commitTimeline.firstInstant().get().getTimestamp) < 0 // True if optStartTs < activeTimeline.first
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

From a user standpoint, I would expect we should fallback to first valid commit in active timeline which cleaner has not cleaned up. But guess from an impl standpoint, we can't find this commit that easily. And so is the rational to fallback to snapshot query?

Copy link
Contributor

@nsivabalan nsivabalan Dec 7, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

my bad. from re-reading the description, guess the fix does not sit well. cleaner will not touch the timeline right. So, how do we know if a commit has been cleaned up or not (bcoz, it could still be part of active timeline). may be I am missing something.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I revisited this patch. I get it now.
So, we are fixing two things.
1: a commit is valid in active timeline, but corresponding data files are cleaned up.
2: begin commit is archived.
Makes sense to me.

Copy link
Contributor

@nsivabalan nsivabalan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

left a high level comment.

@nsivabalan
Copy link
Contributor

something to think about as a potential solution.
applicable only if cleaner policy is set to num commits.
if min and max archival commits is 5 and 10. and cleaner configs is 3 for eg.
when we encounter a commit that got cleaned up, but still part of active timeline, we can fallback to last but 3rd commit. (last but (N commits to retain for cleaner)).

Copy link
Contributor

@nsivabalan nsivabalan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Left some clarifying comments. Changes look good to me in general.

log.info("Checking if paths exists took " + timeTaken + "ms")

val optStartTs = optParams(DataSourceReadOptions.BEGIN_INSTANTTIME.key)
val isInstantArchived = optStartTs.compareTo(commitTimeline.firstInstant().get().getTimestamp) < 0 // True if optStartTs < activeTimeline.first
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I revisited this patch. I get it now.
So, we are fixing two things.
1: a commit is valid in active timeline, but corresponding data files are cleaned up.
2: begin commit is archived.
Makes sense to me.

val timer = new HoodieTimer().startTimer();

val allFilesToCheck = filteredMetaBootstrapFullPaths ++ filteredRegularFullPaths
val firstNotFoundPath = allFilesToCheck.find(path => !fs.exists(new Path(path)))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fs.isExists() should be routed to metadata table.

.schema(usedSchema)
.format("hudi")
.load(basePath)
.filter(String.format("%s > '%s'", HoodieRecord.COMMIT_TIME_METADATA_FIELD, //Notice the > in place of >= because we are working with optParam instead of first commit > optParam
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you help me understand how does this work?
Lets take the example in the tests added.
C0 C1 C2 C3 | C4 C5 | C6 C7 C8 C9

C0 to C3 is archived.
C4 and C4 are cleaned.
Active timeline: C4 to C9.

If someone tries incremental query with C4 and C5 as begin and end,
do we do full scan of table for records with commit time > C4 and <= C5?
Whats the checkpoint returned at the end ? Is it C5 so that next time the caller will make incremental query with begin time C5?
So, in this case, if records pertaining to C4 and C5 have been updated by future commits, we may return empty df is it ?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

guess my question on incremental query checkpoint may not make sense. If consumer is a deltastreamer, it will keep track of commits consumed and will send back C5 for next round. The query as such may not return any explicit checkpoint. Correct me if my understanding is wrong.

@nsivabalan nsivabalan added the priority:critical Production degraded; pipelines stalled label Jan 6, 2022
Copy link
Contributor

@nsivabalan nsivabalan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. I have created a follow up to address some of the feedback. Will go ahead and land this.

Copy link
Contributor

@nsivabalan nsivabalan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. have filed a follow up jira to address feedback
https://issues.apache.org/jira/browse/HUDI-3189

Copy link
Contributor

@nsivabalan nsivabalan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jsbali : Can you review the patch once. I made some minor updates, but had to resolve conflicts w/ latest master. just wanted to ensure things are in good shape.

@jsbali
Copy link
Contributor Author

jsbali commented Jan 7, 2022

Adding more context here
Screenshot 2022-01-07 at 8 52 02 PM
Left side is the one which we get when HoodieFileIndex does its magic. Right side is the one in the commit file
Since partition is expected in the middle, this union fails
Have made the change when we call HadoopFsRelation so that order of schema is maintained.

There is one other way where we skip the HoodieFileIndex path and directly take in dir glob pattern but that is one extra config which needs to be maintained.

Copy link
Contributor

@nsivabalan nsivabalan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Lets clarify with some experts. I am not comfortable changing the DefaultSource.

@nsivabalan
Copy link
Contributor

I am removing this one from 0.10.1 as it needs some discussion to be resolved. but lets try to get a closure soon.

@nsivabalan nsivabalan added priority:high Significant impact; potential bugs and removed priority:critical Production degraded; pipelines stalled labels Jan 10, 2022
@nsivabalan
Copy link
Contributor

@jsbali : I have made some fixes and removed the changes in DefaultSource. can you take a look.

Jagmeet Bali and others added 3 commits January 31, 2022 16:51
…erlying files have been cleared or moved by cleaner
…tion column gets added to the end and fails the union of DF while doing incr scan fallback
@hudi-bot
Copy link
Collaborator

hudi-bot commented Feb 1, 2022

CI report:

Bot commands @hudi-bot supports the following commands:
  • @hudi-bot run azure re-run the last Azure build

@jsbali
Copy link
Contributor Author

jsbali commented Feb 1, 2022

@nsivabalan Thanks a lot for seeing this through. LGTM

@nsivabalan
Copy link
Contributor

@jsbali : Do you think you can take up the similar work for MOR. I can assist you if need be.

@nsivabalan nsivabalan merged commit 7ce0f45 into apache:master Feb 1, 2022
nsivabalan added a commit to onehouseinc/hudi that referenced this pull request Feb 9, 2022
…erlying files have been cleared or moved by cleaner (apache#3946)


Co-authored-by: sivabalan <[email protected]>
liusenhua pushed a commit to liusenhua/hudi that referenced this pull request Mar 1, 2022
…erlying files have been cleared or moved by cleaner (apache#3946)


Co-authored-by: sivabalan <[email protected]>
vingov pushed a commit to vingov/hudi that referenced this pull request Apr 3, 2022
…erlying files have been cleared or moved by cleaner (apache#3946)


Co-authored-by: sivabalan <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

priority:high Significant impact; potential bugs

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants