[HUDI-2761] Fixing unnecessary refreshing of timeline in Timelineserver #4800
… time and timeline hash
@bvaradar @xushiyan @n3nash @vinothchandar: I would appreciate it if you could review this patch. We are making some tweaks to how we refresh the local view in the timeline server. I want to make sure I am not missing anything and that there are no gaps.
@nsivabalan Thanks for the fix. Quick comment before reviewing the diff: is there a particular reason for choosing lastUpdatedTime instead of the HoodieInstant itself, as you pointed out in your proposal? To reduce complexity, I feel comparing HoodieInstants is the better choice, but I'd like to understand your reasoning.
Generally I think the hoodie instant time should be the only source of truth for timeline versioning. Previously, I found that the timeline service refreshes frequently if there are async table services that change the timeline metadata, such as cleaning and compaction.
@nsivabalan I had a similar view to what @n3nash was asking here. The problem boils down to a cache invalidation issue: the local timeline view is a cache, and we need to compare some timestamps to decide whether to invalidate the cache and reload the timeline view. So, to avoid unnecessary complexity, is there any strong reason why instant time can't be used here?
@xushiyan @n3nash @danny0405: we need to think more about cleaning not triggering any refresh. If I am not wrong, none of the APIs in FileSystemView (e.g., getLatestBaseFiles) know which operation they are being executed for. So ignoring the timeline refresh just for cleaning would mean leaking that information into the FileSystemView, which needs some thinking. I will take a look at the code to see how this might pan out and will keep you posted.
Closing this in favor of #4812
What is the purpose of the pull request
When serving remote requests, the timeline server refreshes its local view of the timeline based on a timeline hash. The client sends its timeline hash, the timeline server compares it with its local timeline hash, and if they differ, the server refreshes the timeline before serving the request. But this refresh is triggered even when the client is behind and the server is already caught up. This could have a severe performance impact with async table services and Spark streaming pipeline use cases where commit throughput is high. So this patch adds a new value, lastUpdatedTime, to be maintained by the timeline; the same value is sent as a parameter with each remote request.
Fix: the timeline server triggers a refresh of its local timeline only if its lastUpdatedTime < the client's lastUpdatedTime.
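A minimal sketch of the two refresh decisions described above, for illustration only. The class and method names are hypothetical and are not taken from the patch; the sketch assumes lastUpdatedTime is an epoch-millisecond value on both sides.

```java
// Hypothetical sketch; names are illustrative, not Hudi's actual classes.
final class TimelineRefreshCheck {

    /** Behavior before the patch: refresh whenever the two timeline hashes differ. */
    static boolean shouldRefreshByHash(String serverTimelineHash, String clientTimelineHash) {
        return !serverTimelineHash.equals(clientTimelineHash);
    }

    /**
     * Behavior proposed in this PR: refresh only if the server's view is older
     * than the client's. Assumes both values are epoch milliseconds.
     */
    static boolean shouldRefreshByLastUpdatedTime(long serverLastUpdatedTime,
                                                  long clientLastUpdatedTime) {
        return serverLastUpdatedTime < clientLastUpdatedTime;
    }

    public static void main(String[] args) {
        // A client that is behind: its hash differs, but its view is older than the server's.
        String serverHash = "hash-of-newer-timeline";
        String clientHash = "hash-of-older-timeline";
        long serverTime = 2_000L;
        long clientTime = 1_000L;

        System.out.println("hash check refreshes: "
                + shouldRefreshByHash(serverHash, clientHash));
        System.out.println("lastUpdatedTime check refreshes: "
                + shouldRefreshByLastUpdatedTime(serverTime, clientTime));
    }
}
```

In the stale-client case the hash check still asks for a refresh, while the lastUpdatedTime check does not, which is the unnecessary refresh this PR aims to avoid.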
To discuss:
Clocks on the timeline server could differ from those on the executor (client), and there could be drift as well. So I am not sure we can rely on an exact comparison of last updated time between client and server.
Another option: I am wondering whether we can instead rely on lastKnownInstant (HoodieInstant) from the client, compare it with that of the timeline in the timeline server, and decide whether to refresh based on that rather than on lastUpdatedTime.
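The instant-based alternative could look roughly like the sketch below. It is an illustration, not the implementation: the class and method names are hypothetical, and it assumes instant times are fixed-width timestamp strings that sort lexicographically in time order. Because both values would come from the timeline itself rather than from local wall clocks, the comparison would not depend on clock agreement between client and server.

```java
// Hypothetical sketch; names are illustrative, not Hudi's actual classes.
final class InstantRefreshCheck {

    /**
     * Refresh only if the client has seen an instant newer than the server's latest.
     * Assumes instant times are fixed-width timestamp strings, so that
     * lexicographic order matches time order. A null value means "no instant yet".
     */
    static boolean shouldRefresh(String serverLastInstantTime, String clientLastKnownInstantTime) {
        if (clientLastKnownInstantTime == null) {
            return false; // the client knows of nothing the server could be missing
        }
        if (serverLastInstantTime == null) {
            return true; // the server's view is empty but the client has seen an instant
        }
        return serverLastInstantTime.compareTo(clientLastKnownInstantTime) < 0;
    }

    public static void main(String[] args) {
        // Client is behind the server: no refresh needed.
        System.out.println(shouldRefresh("20220214103000", "20220214101500"));
        // Client is ahead of the server: refresh needed.
        System.out.println(shouldRefresh("20220214101500", "20220214103000"));
    }
}
```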
Brief change log
Thanks to guanziyue who helped us with the fix.
Verify this pull request
Committer checklist
Has a corresponding JIRA in PR title & commit
Commit message is descriptive of the change
CI is green
Necessary doc changes done or have another open PR
For large changes, please consider breaking it into sub-tasks under an umbrella JIRA.