Improve Delta lake caching of metadata #20437
Cache metadata and protocol entries so they are only read when creating the TableSnapshot in the first place.
I get the intention of the changes, but I feel that the PR needs some extra polishing.
```java
Optional<MetadataEntry> cachedMetadata,
Optional<ProtocolEntry> cachedProtocol)
```
From the code changes I see that the metadata & protocol are always obtained from the logTail.
Why do we then add those two parameters to the constructor?
We try to get them from the log tail every time, but most of the time they'll be empty, so we need the cached version somehow.
```java
transactionLogTail.getMetadataEntry().or(() -> cachedMetadata),
transactionLogTail.getProtocolEntry().or(() -> cachedProtocol)));
```
I don't quite get what is happening here.
It feels like we may potentially end up down the road with an unwanted state of the TableSnapshot.
I'm not comfortable with the setters for the "cached" metadata & protocol.
The log tail only contains metadata or protocol entries if needed; most of the time this will be an empty optional. The current Trino implementation is to read the transaction log back until we get the latest version; instead this PR brings in caching so that we do not have to re-read the transaction log, yet still get the updated versions when a new one appears in the log tail 👍
cc @jkylling to double-check if I misrepresented something 😄
To arrive at a snapshot of a Delta table we do one of:
1. Read checkpoint + transaction tail since the checkpoint commit.
2. Use existing snapshot + transaction tail since the snapshot version.
As we don't know what the snapshot will be used for, we don't eagerly load the data which is part of a snapshot, like the metadata entry, protocol entry, or add actions. However, almost every query will need the protocol entry and metadata entry. Almost all the time these entries are in the checkpoint, so we should remember these entries to avoid reading the checkpoint all the time.
To get the metadata entry for a snapshot we can do one of:
1. Read the metadata entry from the checkpoint + metadata entries in the transaction tail since the checkpoint commit. Use the last metadata entry.
2. Use the metadata entry from the existing snapshot + metadata entries in the transaction tail since the snapshot version. Use the last metadata entry.
Currently Trino does 1., while this PR does 2. The highlighted snippet implements step 2: it uses the last metadata entry in the tail if one is present, and falls back to the metadata entry from the existing snapshot if there was no new entry in the tail.
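The fallback described above can be illustrated with a minimal, self-contained sketch. Note that the `MetadataEntry` record and `resolveMetadata` helper here are hypothetical simplifications for illustration, not the actual Trino types:

```java
import java.util.Optional;

public class MetadataFallbackSketch
{
    // Hypothetical stand-in for Trino's MetadataEntry class
    record MetadataEntry(String schema) {}

    // Prefer the entry found in the transaction log tail; only fall back to the
    // entry cached from the previous snapshot when the tail contains none.
    static Optional<MetadataEntry> resolveMetadata(
            Optional<MetadataEntry> tailMetadata,
            Optional<MetadataEntry> cachedMetadata)
    {
        return tailMetadata.or(() -> cachedMetadata);
    }

    public static void main(String[] args)
    {
        Optional<MetadataEntry> cached = Optional.of(new MetadataEntry("v1"));
        // No new metadata commit in the tail: the cached entry is reused.
        System.out.println(resolveMetadata(Optional.empty(), cached).get().schema()); // v1
        // A new metadata commit appeared in the tail: it wins over the cache.
        System.out.println(resolveMetadata(Optional.of(new MetadataEntry("v2")), cached).get().schema()); // v2
    }
}
```

This is exactly the semantics of `Optional.or`: the supplier is only evaluated when the first optional is empty, so a fresh entry in the tail always takes precedence over the cached one.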
Hey @findinpath, thank you for the review! I've replied to comments inline; I understand your worries, as cache invalidation is always a tricky beast. If it helps at all, we've been running this change in production for several months now and getting a performance & cost boost by not re-fetching the transaction log from S3 as often.
Is there anything else you have in mind that would de-risk this PR? Any extra test cases you'd like us to implement?
```java
private final long version;
private final Optional<MetadataEntry> metadataEntry;

private TransactionLogTail(List<Transaction> entries, long version)
```
Why should this class know about metadataEntry & protocolEntry?
It does not need to. The getMetadataEntry and getProtocolEntry methods below can be rewritten to get this on the fly from entries.
@Pluies should we change this to compute metadataEntry and protocolEntry on the fly?
Sounds good! On it 👍
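A sketch of what computing the metadata entry on the fly could look like: scan the tail's entries from newest to oldest and return the first (i.e. latest) metadata entry found, if any. The `LogEntry` types below are hypothetical stand-ins for the real transaction log entry classes:

```java
import java.util.List;
import java.util.Optional;

public class OnTheFlyMetadata
{
    // Hypothetical simplified log entry hierarchy, for illustration only
    sealed interface LogEntry permits MetadataEntry, AddEntry {}
    record MetadataEntry(String schema) implements LogEntry {}
    record AddEntry(String path) implements LogEntry {}

    // Walk the entries backwards so the most recent metadata commit wins;
    // an empty result means the caller must fall back to a checkpoint or cache.
    static Optional<MetadataEntry> getMetadataEntry(List<LogEntry> entries)
    {
        for (int i = entries.size() - 1; i >= 0; i--) {
            if (entries.get(i) instanceof MetadataEntry metadata) {
                return Optional.of(metadata);
            }
        }
        return Optional.empty();
    }

    public static void main(String[] args)
    {
        List<LogEntry> tail = List.of(
                new MetadataEntry("v1"),
                new AddEntry("part-0.parquet"),
                new MetadataEntry("v2"),
                new AddEntry("part-1.parquet"));
        System.out.println(getMetadataEntry(tail).get().schema()); // v2
    }
}
```

Deriving the entry from `entries` like this avoids storing a redundant field that could drift out of sync with the list it was computed from.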
```java
    }
    throw e;
}
MetadataEntry metadataEntry = (MetadataEntry) logEntries.get(MetadataEntry.class);
```
Where are these checks being done now?
```java
}

@Test
public void testCheckpointFileOperations()
```
Can you add the test in a preparatory commit, so that the "gains" are easier to see where you add caching?
Fair call 👍
Just to manage expectations on the timeline for this PR: we're internally upgrading to Trino 439 w/ caching, and will check whether this PR is still worth including once file-level caching is in place. I'll report back 👍

This pull request has gone a while without any activity. Tagging the Trino developer relations team: @bitsondatadev @colebow @mosabua

This is still waiting on me giving perf numbers on whether caching delta_log files (via Trino 440+) is enough, or if this PR is still relevant. Will update once we've upgraded and I have concrete numbers 👍

Sounds good @Pluies!

This pull request has gone a while without any activity. Tagging the Trino developer relations team: @bitsondatadev @colebow @mosabua

Closing this pull request, as it has been stale for six weeks. Feel free to re-open at any time.

Reopening with the assumption that the collab here will continue.

Ooookay, I'm back! After an arduous upgrade to Trino 444, here are some benchmark results. First, here is a microbenchmark of

The first run is very slow on both clusters, as Trino has to fetch all of the Delta log, but subsequent queries are noticeably faster with improved metadata caching. Here are some results from a different synthetic benchmark that replays user-submitted queries:

These results are a bit noisy, but the drop in analysis time is also very clear. cc @raunaqmorarka as discussed with @jkylling

NB: the tests above were all run with

@Pluies do you know the reason for the difference? Can you share a JFR profile of the run with `delta.checkpoint-filtering.enabled=true` and fs cache covering the Delta log on the coordinator?

@raunaqmorarka I've never used JFR, but

This pull request has gone a while without any activity. Ask for help on #core-dev on Trino slack.

Closing this pull request, as it has been stale for six weeks. Feel free to re-open at any time.
Description
👋 This is a reworking of #17516 based off the current master branch, copying the description of the previous issue:
In addition, this PR also:
Additional context and related issues
Fixes #17406
Release notes
( ) This is not user-visible or is docs only, and no release notes are required.
( ) Release notes are required. Please propose a release note for me.
(x) Release notes are required, with the following suggested text: