@nsivabalan nsivabalan commented Feb 14, 2022

What is the purpose of the pull request

When serving remote requests, the timeline server refreshes its local view of the timeline based on a timeline hash. The client sends its timeline hash, the server compares it with its local timeline hash, and if they differ, the server refreshes its timeline before serving the request. However, this refresh is triggered even when the client is behind and the server is already caught up. This can have a severe performance impact with async table services and Spark streaming pipelines, where commit throughput is high. This patch therefore adds a new value, lastUpdatedTime, to be maintained by the timeline; the same value is also sent as a parameter with each remote request.

Fix: the timeline server triggers a refresh of its local timeline only if its last known instant < the client's last known instant.
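
A minimal sketch of the refresh decision described above, not the actual patch. The class and method names are hypothetical, and plain string comparison stands in for HoodieTimeline.compareTimestamps on the assumption that instant times are fixed-width numeric strings whose lexicographic order matches their chronological order.

```java
// Hypothetical sketch; these names are illustrative and are not Hudi classes.
final class RefreshPolicySketch {

    // Assumption: instant times are fixed-width numeric strings
    // (e.g. yyyyMMddHHmmss), so lexicographic order is chronological order.
    static boolean isEarlier(String a, String b) {
        return a.compareTo(b) < 0;
    }

    // Refresh only when the hashes differ AND the server is behind the client.
    static boolean shouldRefresh(String localHash, String clientHash,
                                 String localLastInstant, String clientLastInstant) {
        return !localHash.equals(clientHash)
            && isEarlier(localLastInstant, clientLastInstant);
    }

    public static void main(String[] args) {
        // Client is behind the server: hashes differ, but no refresh is needed.
        System.out.println(shouldRefresh("h2", "h1", "20220214120500", "20220214120000"));
        // Server is behind the client: refresh before serving.
        System.out.println(shouldRefresh("h1", "h2", "20220214120000", "20220214120500"));
    }
}
```

With the hash-only check, both calls would have triggered a refresh; with the extra comparison, only the second one does.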

Brief change log

  • The timeline server refreshes its local timeline only if, in addition to a timeline hash mismatch, its last known instant < the client's last known instant.

Verify this pull request

Relying on existing tests. Will ask some users in the community to test out the patch.

Committer checklist

  • Has a corresponding JIRA in PR title & commit

  • Commit message is descriptive of the change

  • CI is green

  • Necessary doc changes done or have another open PR

  • For large changes, please consider breaking it into sub-tasks under an umbrella JIRA.

@nsivabalan nsivabalan added the priority:critical Production degraded; pipelines stalled label Feb 16, 2022
@nsivabalan nsivabalan changed the title from "[HUDI-2761][WIP] Fixing timeline server for repeated refreshes" to "[HUDI-2761] Fixing timeline server for repeated refreshes" Feb 25, 2022

nsivabalan commented Feb 25, 2022

@danny0405: I might need your help with this patch. There are a few test failures in Flink. Can you take a look and see what's happening? If you can triage a fix and update the patch, that would be of great help.

Description of the patch has all the info/context.

Failing tests:
TestWriteMergeOnReadWithCompact.testIndexStateBootstrap
TestInputFormat.testReadBaseAndLogFilesWithDisorderUpdateDelete

if (!localTimelineHash.equals(timelineHashFromClient)) {
// refresh if timeline hash mismatches and if local's last known instant < client's last known instant
if (!localTimelineHash.equals(timelineHashFromClient)
&& HoodieTimeline.compareTimestamps(localLastKnownInstant, HoodieTimeline.LESSER_THAN, lastKnownInstantFromClient)) {

@danny0405 danny0405 Mar 3, 2022

I finally found the reason:

We have a start-commit method that may generate a rollback instant with a greater timestamp than the instant time actually passed in:

public void startCommitWithTime(String instantTime, String actionType) {

Unfortunately, Flink uses that method (Spark uses it too). Here is how the problem arises:

    delta_commit       compaction      delta_commit     rollback_commit
--- t1 --------------- t2 ------------ t3 ------------- t4 ------------

The rollback instant t4 was created before t3, even though it carries the highest timestamp. The following sequence then happens:

  1. the rollback action refreshes the remote timeline service with the latest timestamp t4 (call the resulting fs view V1)
  2. the t3 delta commit starts to execute and commits; say the commit is successful
  3. we then want to trigger compaction after the commit of t3

And then the tricky thing happens:

The compaction scheduler uses the client, and the client uses the latest timestamp on the timeline to fetch all the file slices. But because the client's timestamp t4 equals the remote timeline's timestamp t4, the view does not sync, so we still get the V1 fs view here and cannot find any compaction plan, because there are no log files in that view.
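
The sequence above can be replayed with a small sketch. Everything in it is a hypothetical stand-in rather than Hudi code: the "storage" list plays the role of the timeline on the file system, the cached list plays the role of the timeline server's fs view, and the labels t1..t4 are compared as plain strings.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical model of the stale-view scenario; not Hudi's actual classes.
final class StaleViewSketch {
    private final List<String> storage = new ArrayList<>(); // instants on storage
    private List<String> serverView = new ArrayList<>();    // server's cached snapshot

    void write(String instant) { storage.add(instant); }

    void refreshServer() { serverView = new ArrayList<>(storage); }

    // Illustrative "hash": the sorted instants joined into one string.
    static String hash(List<String> instants) {
        List<String> sorted = new ArrayList<>(instants);
        Collections.sort(sorted);
        return String.join(",", sorted);
    }

    static String last(List<String> instants) { return Collections.max(instants); }

    // Serve a request from a client that sees the current storage timeline.
    List<String> serve() {
        boolean hashMismatch = !hash(serverView).equals(hash(storage));
        boolean serverBehind = last(serverView).compareTo(last(storage)) < 0;
        if (hashMismatch && serverBehind) {
            refreshServer();
        }
        return serverView;
    }

    static List<String> replay() {
        StaleViewSketch s = new StaleViewSketch();
        s.write("t1");     // delta_commit
        s.write("t2");     // compaction
        s.write("t4");     // rollback instant, created before t3
        s.refreshServer(); // step 1: server caches view V1, last instant t4
        s.write("t3");     // step 2: t3 delta commit succeeds
        return s.serve();  // step 3: client asks with last instant t4
    }

    public static void main(String[] args) {
        System.out.println(replay()); // prints [t1, t2, t4]
    }
}
```

The hashes differ, but both sides report t4 as their last known instant, so the served view is still V1 and does not contain t3.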

Here is my patch, which makes sure the rollback timestamp is not greater than the delta commit time.

HUDI-2761.patch.zip


hudi-bot commented Mar 4, 2022

CI report:

Bot commands @hudi-bot supports the following commands:
  • @hudi-bot run azure re-run the last Azure build


@danny0405 danny0405 left a comment

+1

@danny0405 danny0405 merged commit 6a46130 into apache:master Mar 5, 2022
@xushiyan xushiyan added the status:triaged Issue has been reviewed and categorized label Mar 8, 2022
vingov pushed a commit to vingov/hudi that referenced this pull request Apr 3, 2022
* Fixing timeline server for repeated refreshes
stayrascal pushed a commit to stayrascal/hudi that referenced this pull request Apr 12, 2022
