Skip to content

Conversation

@nsivabalan
Copy link
Contributor

@nsivabalan nsivabalan commented Feb 13, 2022

What is the purpose of the pull request

Timeline server when serving remote requests, has a logic to refresh its local view of the timeline based on timeline hash. Client sends a timeline hash and timeline server compares with its local timeline hash and if they differ, a refresh of timeline happens before serving the request. But this refresh gets triggered even if the client is behind, but the server is already caught up. This could have severe perf impact with async table services and spark streaming pipeline use-cases where commit throughput is high. So, adding a new value to be maintained by the timeline for lastUpdatedTime. and the same will be sent as param with remote request as well.

Fix: So, the fix ensures that timeline server triggers a refresh of local timeline only if its lastUpdatedTime < client's lastUpdatedTime.

To discuss:
Clocks could differ in timeline server compared to that of the executor (client) and there could be drift as well. So, not very sure if we can rely on the exact comparison of last updated time between client and server.

Another option: I am wondering if we can rely on lastKnownInstant(HoodieInstant) from client and compare it w/ that of timeline in timeline server and decide whether to refresh or not instead of the lastUpdatedTime.

Brief change log

  • Added lastUpdatedTime to HoodieTimeline which gets refreshed whenever the timeline is updated. The same value is sent via remote requests to Timeline server.
  • Timeline server triggers a refresh of its local timeline only if its lastUpdatedTime < client's lastUpdatedTime in addition to timeline hash mismatch.

Thanks to guanziyue who helped us with the fix.

Verify this pull request

  • Couple of users in the community actually tested this and contributed the patch.

Committer checklist

  • Has a corresponding JIRA in PR title & commit

  • Commit message is descriptive of the change

  • CI is green

  • Necessary doc changes done or have another open PR

  • For large changes, please consider breaking it into sub-tasks under an umbrella JIRA.

@nsivabalan nsivabalan changed the title [HUDI-2400] Fixing refreshing of timeline in Timelineserver based on last updated time and timeline hash [HUDI-2761] Fixing refreshing of timeline in Timelineserver based on last updated time and timeline hash Feb 13, 2022
@nsivabalan
Copy link
Contributor Author

@bvaradar @xushiyan @n3nash @vinothchandar : Would appreciate if you folks can review this patch. We are making some tweaks to how we refresh local view in timeline server. Wanted to ensure I am not missing anything and there are no gaps.

@nsivabalan nsivabalan changed the title [HUDI-2761] Fixing refreshing of timeline in Timelineserver based on last updated time and timeline hash [HUDI-2761] Fixing unnecessary refreshing of timeline in Timelineserver Feb 13, 2022
@hudi-bot
Copy link
Collaborator

CI report:

Bot commands @hudi-bot supports the following commands:
  • @hudi-bot run azure re-run the last Azure build

@n3nash
Copy link
Contributor

n3nash commented Feb 13, 2022

@nsivabalan Thanks for the fix. Quick comment before reviewing the diff: Is there a particular reason for choosing lastUpdatedTime instead of the HoodieInstant itself like you pointed out in your proposal ? To reduce complexity of understanding, I feel comparing HoodieInstants choice is better but I'd like to understand your reasoning.

@danny0405
Copy link
Contributor

@bvaradar @xushiyan @n3nash @vinothchandar : Would appreciate if you folks can review this patch. We are making some tweaks to how we refresh local view in timeline server. Wanted to ensure I am not missing anything and there are no gaps.

Generally i think the hoodie instant time should be the only truth for timeline versioning. In the before, i found that the timeline service refresh frequently if there are async table services to change the timeline metadata, such as cleaning and compaction,
for compaction the refresh is valid and necessary, but for cleaning, most of the refresh are invalid/unnecessary, wondering whether we can resolve this issue in the PR.

@xushiyan
Copy link
Member

@nsivabalan Thanks for the fix. Quick comment before reviewing the diff: Is there a particular reason for choosing lastUpdatedTime instead of the HoodieInstant itself like you pointed out in your proposal ? To reduce complexity of understanding, I feel comparing HoodieInstants choice is better but I'd like to understand your reasoning.

@nsivabalan I had similar view to what @n3nash was asking here. The problem boils down to a cache invalidation issue: the local timeline view is a cache and we need to compare some timestamps to decide whether to invalidate the cache and reload the timeline view or not. So to avoid unnecessary complexity, is there any strong reason why instant time can't be used here?

@nsivabalan
Copy link
Contributor Author

@xushiyan @n3nash :
I put up a patch as is what I got from the community user. and I heard that its being already run in prod if I am not wrong. so, went ahead and put up a patch. atleast I wanted to have discussions on both approaches.
but as I mentioned in the description, I am also inclined towards using lastInstantTime which makes sense. Will go ahead and fix the patch.

@danny0405 : we need to think more about cleaning not triggering any refresh. If I am not wrong, none of the apis in FileSystemView knows for which operation it is being executed for (for eg, getLatestBaseFiles). So, ignoring the timeline refresh just for cleaning will mean that we leak such information to the FileSystemView which needs some thinking. I am to take a look at the code to see how this might pan out. Will keep you posted.
but thanks for bringing up a good point. appreciate it.

@nsivabalan
Copy link
Contributor Author

Closing this in favor of #4812

@nsivabalan nsivabalan closed this Feb 25, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants