Improve Delta lake caching of metadata #20437

Pluies · 2024-01-22T11:28:12Z

Description

👋 This is a reworking of #17516 based off the current master branch. c/c the description of the previous issue:

Currently, when a new commit is made to a Delta table, the cached metadata entry of the most up-to-date TableSnapshot is invalidated. This means that a metadata entry must be re-read from the checkpoint and the commits made after the checkpoint (please see c0d0937). This is unnecessary work: we always read the new commits anyway, so we could instead reconcile the cached metadata entry with any metadata entries loaded from the new commits.

This PR modifies the TransactionLogTail to keep track of any metadata entries it may contain. In the TableSnapshot the cached metadata entry is reconciled with any metadata entry of the TransactionLogTail.

This fixes the second part of #17406.

Besides, this PR also:

* Extends the caching logic above to protocolEntries
* Fixes the type signature for getProtocolEntry to match getMetadataEntry (kept in a separate commit for ease of review)

Additional context and related issues

Fixes #17406

Release notes

( ) This is not user-visible or is docs only, and no release notes are required.
( ) Release notes are required. Please propose a release note for me.
(x) Release notes are required, with the following suggested text:

# Section
* Improve caching of metadata and protocol entries from Delta Lake logs. ({issue}`17406`)

Cache metadata and protocol entries so they are only read when creating the TableSnapshot in the first place.

findinpath

I get the intention of the changes, but I feel that the PR needs some extra polishing.

findinpath · 2024-01-24T08:24:41Z

...n/trino-delta-lake/src/main/java/io/trino/plugin/deltalake/transactionlog/TableSnapshot.java

+            Optional<MetadataEntry> cachedMetadata,
+            Optional<ProtocolEntry> cachedProtocol)


From the code changes I see that the metadata & protocol are always obtained from the logTail.
Why then do we add those two parameters to the constructor?

We try to get them from the log tail every time, but most of the time they'll be empty, so we need the cached version somehow.

findinpath · 2024-01-24T08:29:09Z

...n/trino-delta-lake/src/main/java/io/trino/plugin/deltalake/transactionlog/TableSnapshot.java

+                transactionLogTail.getMetadataEntry().or(() -> cachedMetadata),
+                transactionLogTail.getProtocolEntry().or(() -> cachedProtocol)));


I don't quite get what is happening here.
It feels like we may end up down the road with an unwanted state in the TableSnapshot.
I'm not comfortable with the setters for the "cached" metadata & protocol.

The log tail only contains metadata or protocol entries if needed; most of the time this will be an empty optional. The current Trino implementation is to read the transaction log back until we get the latest version; instead this PR brings in caching so that we do not have to re-read the transaction log, yet still get the updated versions when a new one appears in the log tail 👍

cc @jkylling to double-check if I misrepresented something 😄

To arrive at a snapshot of a Delta table we do one of:

1. Read checkpoint + transactions tail since checkpoint commit.

2. Use existing snapshot + transaction tail since snapshot version.

As we don't know what the snapshot will be used for, we don't eagerly load the data which is part of a snapshot, like the metadata entry, protocol entry, or add actions. However, almost every query will need the protocol entry and metadata entry. Almost all the time these entries are in the checkpoint, so we should remember these entries to avoid reading the checkpoint all the time.

To get the metadata entry for a snapshot we can do one of:

1. Read metadata entry from checkpoint + metadata entries in transaction tail since checkpoint commit. Use the last metadata entry.

2. Use metadata entry from existing snapshot + metadata entries in transaction tail since snapshot version. Use the last metadata entry.

Currently Trino does 1, while this PR does 2. The highlighted snippet implements option 2:
it uses the last metadata entry in the tail if one is present, and otherwise falls back to the metadata entry from the existing snapshot.
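
For readers following along, the fallback described above can be sketched in isolation. This is only an illustration, assuming plain values as stand-ins for the real entry types; `ReconcileSketch` and `reconcile` are hypothetical names and not part of the PR. The only library call it relies on is `Optional.or` (Java 9+), which is also what the highlighted snippet uses.

```java
import java.util.Optional;

// Hypothetical sketch; not Trino's actual TableSnapshot code.
final class ReconcileSketch
{
    private ReconcileSketch() {}

    // Prefer the entry found in the transaction log tail; if the tail has none,
    // fall back to the entry remembered from the previous snapshot.
    static <T> Optional<T> reconcile(Optional<T> fromTail, Optional<T> cached)
    {
        return fromTail.or(() -> cached);
    }

    public static void main(String[] args)
    {
        Optional<String> cached = Optional.of("metadata@v10");

        // No metadata entry in the new commits: the cached entry is kept.
        System.out.println(reconcile(Optional.empty(), cached));

        // A newer metadata entry in the tail wins over the cached one.
        System.out.println(reconcile(Optional.of("metadata@v12"), cached));
    }
}
```

If both the tail and the cache are empty, the result is empty, which in this sketch would correspond to having to read the entry from the checkpoint as before.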

Pluies

Hey @findinpath, thank you for the review! I've replied to the comments inline. I understand your worries, as cache invalidation is always a tricky beast. If it helps at all, we've been running this change in production for several months now and getting a performance & cost boost by not re-fetching the transaction log from S3 as often.
Is there anything else you have in mind that would de-risk this PR? Any extra test cases you'd like us to implement?

Pluies · 2024-01-29T10:48:57Z

...n/trino-delta-lake/src/main/java/io/trino/plugin/deltalake/transactionlog/TableSnapshot.java

+                transactionLogTail.getMetadataEntry().or(() -> cachedMetadata),
+                transactionLogTail.getProtocolEntry().or(() -> cachedProtocol)));


The log tail only contains metadata or protocol entries if needed; most of the time this will be an empty optional. The current Trino implementation is to read the transaction log back until we get the latest version; instead this PR brings in caching so that we do not have to re-read the transaction log, yet still get the updated versions when a new one appears in the log tail 👍

cc @jkylling to double-check if I misrepresented something 😄

Pluies · 2024-01-29T10:49:41Z

...n/trino-delta-lake/src/main/java/io/trino/plugin/deltalake/transactionlog/TableSnapshot.java

+            Optional<MetadataEntry> cachedMetadata,
+            Optional<ProtocolEntry> cachedProtocol)


We try to get them from the log tail every time, but most of the time they'll be empty, so we need the cached version somehow.

findinpath · 2024-02-04T21:21:16Z

...ke/src/main/java/io/trino/plugin/deltalake/transactionlog/checkpoint/TransactionLogTail.java

    private final long version;

-    private TransactionLogTail(List<Transaction> entries, long version)
+    private final Optional<MetadataEntry> metadataEntry;


why should this class know about metadataEntry & protocolEntry?

It does not need to. The getMetadataEntry and getProtocolEntry methods below can be rewritten to get this on the fly from entries.

@Pluies should we change this to compute metadataEntry and protocolEntry on the fly?

Sounds good! On it 👍
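
To make the "on the fly" suggestion concrete, here is a rough sketch of deriving the latest metadata entry by scanning the tail's entries instead of storing it in a field. It assumes a simplified model in which each commit is a list of log entries and a log entry carries an optional metadata payload; `TailScanSketch`, `LogEntry`, and `lastMetadata` are hypothetical names and do not reflect Trino's actual classes.

```java
import java.util.List;
import java.util.Optional;

// Hypothetical sketch; the real TransactionLogTail and its entry types differ.
final class TailScanSketch
{
    private TailScanSketch() {}

    // Stand-in for one log entry; only the metadata payload is modeled.
    static final class LogEntry
    {
        final String metadata; // null when the entry is not a metadata action

        LogEntry(String metadata)
        {
            this.metadata = metadata;
        }
    }

    // Walk the commits from oldest to newest and keep the last metadata seen,
    // so nothing has to be stored alongside the entries themselves.
    static Optional<String> lastMetadata(List<List<LogEntry>> commits)
    {
        Optional<String> last = Optional.empty();
        for (List<LogEntry> commit : commits) {
            for (LogEntry entry : commit) {
                if (entry.metadata != null) {
                    last = Optional.of(entry.metadata);
                }
            }
        }
        return last;
    }
}
```

The same scan would apply to protocol entries. The trade-off in this sketch is a pass over the tail on each call in exchange for the tail class not having to know about specific entry kinds.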

findinpath · 2024-02-27T13:01:29Z

plugin/trino-delta-lake/src/main/java/io/trino/plugin/deltalake/DeltaLakeMetadata.java

-            }
-            throw e;
-        }
-        MetadataEntry metadataEntry = (MetadataEntry) logEntries.get(MetadataEntry.class);


Where are these checks being done now?

findinpath · 2024-02-27T13:02:25Z

...in/trino-delta-lake/src/test/java/io/trino/plugin/deltalake/TestDeltaLakeFileOperations.java

    }

+    @Test
+    public void testCheckpointFileOperations()


Can you add the test in a preparatory commit so that the "gains" are more easily visible where you add caching?

Fair call 👍

Pluies · 2024-02-27T14:37:07Z

Just to manage expectations on the timeline for this PR, we're internally upgrading to Trino 439 w/ caching, and will check whether this PR is still worth including once file-level caching is in place. I'll report back 👍

github-actions · 2024-03-19T17:02:44Z

This pull request has gone a while without any activity. Tagging the Trino developer relations team: @bitsondatadev @colebow @mosabua

Pluies · 2024-03-19T17:46:19Z

> This pull request has gone a while without any activity. Tagging the Trino developer relations team: @bitsondatadev @colebow @mosabua

This is still waiting on me giving perf numbers on whether caching delta_log files (via Trino 440+) is enough, or if this PR is still relevant. Will update once we've upgraded and I have concrete numbers 👍

mosabua · 2024-03-19T18:21:54Z

Sounds good @Pluies !

github-actions · 2024-04-10T17:46:29Z

This pull request has gone a while without any activity. Tagging the Trino developer relations team: @bitsondatadev @colebow @mosabua

github-actions · 2024-05-02T17:19:44Z

Closing this pull request, as it has been stale for six weeks. Feel free to re-open at any time.

mosabua · 2024-05-02T19:24:21Z

Reopening with the assumption that the collab here will continue.

Pluies · 2024-05-21T13:54:46Z

Ooookay, I'm back! After an arduous upgrade to Trino 444, here are some benchmark results.

First, here is a microbenchmark of select * from table limit 10 from one of our Delta tables, in a fresh cluster, over several runs:

Trino 444

| Run | Elapsed Time | Queued Time | Analysis Time | Planning Time | Execution Time |
|---|---|---|---|---|---|
| 1 | 12.74s | 391.13us | 7.57s | 188.55ms | 5.17s |
| 2 | 4.01s | 234.01us | 1.69s | 110.93ms | 2.31s |
| 3 | 3.78s | 167.67us | 1.49s | 93.53ms | 2.29s |
| 4 | 3.77s | 167.90us | 1.55s | 92.17ms | 2.22s |
| 5 | 3.66s | 169.85us | 1.45s | 92.16ms | 2.20s |

Trino 444 with improved metadata caching

| Run | Elapsed Time | Queued Time | Analysis Time | Planning Time | Execution Time |
|---|---|---|---|---|---|
| 1 | 14.01s | 11.29ms | 7.83s | 529.74ms | 6.17s |
| 2 | 5.08s | 309.44us | 235.96ms | 118.35ms | 4.84s |
| 3 | 2.38s | 481.55us | 118.86ms | 106.40ms | 2.26s |
| 4 | 2.38s | 481.55us | 118.86ms | 106.40ms | 2.26s |
| 5 | 2.38s | 481.55us | 118.86ms | 106.40ms | 2.26s |

The first run is very slow on both clusters, as Trino has to fetch the entire Delta log, but subsequent queries are noticeably faster with improved metadata caching.

Here are some results from a different synthetic benchmark that replays user-submitted queries:

Trino 444

time=2024-05-17T16:11:11.254Z level=INFO source=/integration-tests/cmd/rerunner/main.go:155 msg="rerun stats" totalTasks=1000 concurrency=40 successRate=38% duration=1953.31s avgQueryDuration=74.11s maxQueryDuration=257.05s avgAnalysisTime=2197ms avgPlanningTime=2181ms avgExecutionTime=33960ms avgFinishingTime=3ms

Trino 444 with improved metadata caching

time=2024-05-17T16:42:30.212Z level=INFO source=/integration-tests/cmd/rerunner/main.go:155 msg="rerun stats" totalTasks=1000 concurrency=40 successRate=38% duration=1717.46s avgQueryDuration=66.55s maxQueryDuration=257.97s avgAnalysisTime=661ms avgPlanningTime=2237ms avgExecutionTime=30036ms avgFinishingTime=16ms

These results are a bit noisy, but the drop in analysis time is also very clear.
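
As a quick sanity check on the two log lines above, the relative change can be computed directly from the reported averages. This is just arithmetic on the figures already shown (avgAnalysisTime 2197ms vs 661ms, avgQueryDuration 74.11s vs 66.55s); `ReductionSketch` and `relativeReduction` are names made up for this illustration.

```java
// Hypothetical helper for reading the rerun stats; not part of the benchmark tool.
final class ReductionSketch
{
    private ReductionSketch() {}

    // Percentage by which `after` is lower than `before`.
    static double relativeReduction(double before, double after)
    {
        return (before - after) / before * 100.0;
    }

    public static void main(String[] args)
    {
        // Figures copied from the two "rerun stats" log lines.
        System.out.printf("avgAnalysisTime reduction: %.1f%%%n", relativeReduction(2197, 661));
        System.out.printf("avgQueryDuration reduction: %.1f%%%n", relativeReduction(74.11, 66.55));
    }
}
```

That works out to roughly a 70% drop in average analysis time and roughly a 10% drop in average query duration for this particular replay, with the usual caveat that the run is noisy.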

cc @raunaqmorarka as discussed with @jkylling

Pluies · 2024-05-21T13:56:21Z

NB: the tests above were all run with delta.checkpoint-filtering.enabled=true and fs cache covering Delta log on the coordinator.

raunaqmorarka · 2024-05-22T02:42:21Z

@Pluies do you know the reason for the difference? Can you share a JFR profile of the run with delta.checkpoint-filtering.enabled=true and fs cache covering Delta log on the coordinator?
I think whatever form of caching we have should work without requiring delta.checkpoint-filtering.enabled to be disabled.

Pluies · 2024-05-22T08:14:49Z

@raunaqmorarka I've never used JFR, but ~~from the docs it looks like it can only be used with a commercial Java SE subscription which we don't have, so I don't think I'll be able to provide that.~~ it's open-source since Java 11, will look into it!

github-actions · 2025-05-30T17:03:19Z

This pull request has gone a while without any activity. Ask for help on #core-dev on Trino slack.

github-actions · 2025-06-20T17:03:22Z

Closing this pull request, as it has been stale for six weeks. Feel free to re-open at any time.

Pluies added 2 commits January 22, 2024 10:43

Improve caching of metadata and protocol entries

84a8e0e

Cache metadata and protocol entries so they are only read when creating the TableSnapshot in the first place.

Refactor getProtocolEntry to match argument order of getMetadataEntry

70681fb

cla-bot bot added the cla-signed label Jan 22, 2024

Pluies requested review from ebyhr and findepi and removed request for findepi January 22, 2024 11:28

github-actions bot added the delta-lake label Jan 22, 2024

Pluies requested a review from findepi January 22, 2024 11:28

Pluies mentioned this pull request Jan 22, 2024

Improve Delta lake caching of metadata #17516

Closed

ebyhr requested a review from findinpath January 23, 2024 00:23

findinpath reviewed Jan 24, 2024

Pluies commented Jan 29, 2024

findinpath reviewed Feb 4, 2024

jkylling mentioned this pull request Feb 22, 2024

Enable caching metadata files in iceberg filesystem cache #20803

Merged

findinpath reviewed Feb 27, 2024

github-actions bot added the stale label Mar 19, 2024

github-actions bot removed the stale label Mar 20, 2024

github-actions bot added the stale label Apr 10, 2024

github-actions bot closed this May 2, 2024

mosabua added the stale-ignore label and removed the stale label May 2, 2024

mosabua reopened this May 2, 2024

Pluies mentioned this pull request Dec 16, 2024

Stream large transaction log jsons instead of storing in-memory #24491

Merged

raunaqmorarka removed the stale-ignore label May 9, 2025

github-actions bot added the stale label May 30, 2025

github-actions bot closed this Jun 20, 2025
