@Pluies
Contributor

@Pluies Pluies commented Jan 22, 2024

Description

👋 This is a reworking of #17516, rebased on the current master branch. Copying the description from the previous issue:

Currently, when a new commit is made to a Delta table, the cached metadata entry of the most up-to-date TableSnapshot is invalidated. This means that the metadata entry must be re-read from the checkpoint and from the commits made after the checkpoint (please see c0d0937). This is unnecessary work: we always read the new commits anyway, so we could instead reconcile the cached metadata entry with any metadata entries loaded from the new commits.

This PR modifies the TransactionLogTail to keep track of any metadata entries it may contain. In the TableSnapshot the cached metadata entry is reconciled with any metadata entry of the TransactionLogTail.

This fixes the second part of #17406.

Besides, this PR also:

  • Extends the caching logic above to protocolEntries
  • Fixes the type signature for getProtocolEntry to match getMetadataEntry (kept in a separate commit for ease of review)

Additional context and related issues

Fixes #17406

Release notes

( ) This is not user-visible or is docs only, and no release notes are required.
( ) Release notes are required. Please propose a release note for me.
(x) Release notes are required, with the following suggested text:

# Section
* Improve caching of metadata and protocol entries from Delta Lake logs. ({issue}`17406`)

Cache metadata and protocol entries so they are only read when creating
the TableSnapshot in the first place.
@cla-bot cla-bot bot added the cla-signed label Jan 22, 2024
@Pluies Pluies requested review from ebyhr and findepi and removed request for findepi January 22, 2024 11:28
@github-actions github-actions bot added the delta-lake Delta Lake connector label Jan 22, 2024
@Pluies Pluies requested a review from findepi January 22, 2024 11:28
@ebyhr ebyhr requested a review from findinpath January 23, 2024 00:23
Contributor

@findinpath findinpath left a comment

I get the intention of the changes, but I feel that the PR needs some extra polishing.

Comment on lines +76 to +77
Optional<MetadataEntry> cachedMetadata,
Optional<ProtocolEntry> cachedProtocol)
Contributor

From the code changes I see that the metadata & protocol are always obtained from the logTail.
Why, then, do we add those two parameters to the constructor?

Contributor Author

We try to get them from the log tail every time, but most of the time they'll be empty, so we need the cached version somehow.

Comment on lines +147 to +148
transactionLogTail.getMetadataEntry().or(() -> cachedMetadata),
transactionLogTail.getProtocolEntry().or(() -> cachedProtocol)));
Contributor

I don't quite get what is happening here.
It feels like we may end up down the road with an unwanted state of the TableSnapshot.
I'm not comfortable with the setters for the "cached" metadata & protocol.

Contributor Author

The log tail only contains metadata or protocol entries if needed; most of the time this will be an empty optional. The current Trino implementation is to read the transaction log back until we get the latest version; instead this PR brings in caching so that we do not have to re-read the transaction log, yet still get the updated versions when a new one appears in the log tail 👍

cc @jkylling to double-check if I misrepresented something 😄

Contributor

To arrive at a snapshot of a Delta table we do one of:

  1. Read checkpoint + transactions tail since checkpoint commit.
  2. Use existing snapshot + transaction tail since snapshot version.

As we don't know what the snapshot will be used for, we don't eagerly load the data which is part of a snapshot, like the metadata entry, protocol entry, or add actions. However, almost every query will need the protocol entry and metadata entry. Almost all the time these entries are in the checkpoint, so we should remember these entries to avoid reading the checkpoint all the time.

To get the metadata entry for a snapshot we can do one of:

  1. Read metadata entry from checkpoint + metadata entries in transaction tail since checkpoint commit. Use the last metadata entry.
  2. Use metadata entry from existing snapshot + metadata entries in transaction tail since snapshot version. Use the last metadata entry.

Currently Trino does 1., while this PR does 2. The highlighted snippet does step 2:
It tries to use the last metadata entry in the tail, if it's present, and then uses the metadata entry from existing snapshot if there was no new entry in the tail.
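
The fallback described above can be sketched as follows. This is a minimal illustration, not the PR's code: `SnapshotMetadataSketch`, its simplified `MetadataEntry` record, and the `resolve` and `lastInTail` helpers are hypothetical stand-ins, and the tail is assumed to list metadata entries in commit order.

```java
import java.util.List;
import java.util.Optional;

public class SnapshotMetadataSketch
{
    // Hypothetical stand-in for Trino's MetadataEntry; only an id is modelled.
    record MetadataEntry(String id) {}

    // Last metadata entry in the tail, assuming the list is in commit order.
    static Optional<MetadataEntry> lastInTail(List<MetadataEntry> tailMetadataEntries)
    {
        return tailMetadataEntries.isEmpty()
                ? Optional.empty()
                : Optional.of(tailMetadataEntries.get(tailMetadataEntries.size() - 1));
    }

    // Option 2: prefer the newest entry from the tail, otherwise keep
    // the entry remembered from the existing snapshot.
    static Optional<MetadataEntry> resolve(Optional<MetadataEntry> cached, List<MetadataEntry> tailMetadataEntries)
    {
        return lastInTail(tailMetadataEntries).or(() -> cached);
    }

    public static void main(String[] args)
    {
        Optional<MetadataEntry> cached = Optional.of(new MetadataEntry("v1"));
        // Tail without metadata entries: the cached entry is reused.
        System.out.println(resolve(cached, List.of()));
        // Tail with newer metadata entries: the last one wins.
        System.out.println(resolve(cached, List.of(new MetadataEntry("v2"), new MetadataEntry("v3"))));
    }
}
```

If neither the tail nor the existing snapshot has an entry, the result stays empty, and the caller would presumably fall back to option 1 and read the checkpoint.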

Contributor Author

@Pluies Pluies left a comment

Hey @findinpath, thank you for the review! I've replied to the comments inline. I understand your worries, as cache invalidation is always a tricky beast. If it helps at all, we've been running this change in production for several months now and are getting a performance & cost boost by not re-fetching the transaction log from S3 as often.
Is there anything else you have in mind that would de-risk this PR? Any extra test cases you'd like us to implement?

private final long version;

private TransactionLogTail(List<Transaction> entries, long version)
private final Optional<MetadataEntry> metadataEntry;
Contributor

why should this class know about metadataEntry & protocolEntry?

Contributor

It does not need to. The getMetadataEntry and getProtocolEntry methods below can be rewritten to get this on the fly from entries.
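
A rough sketch of what computing these on the fly from entries could look like. The types here (`LogTailSketch`, `LogEntry`, `AddFileEntry`, and the simplified `Transaction`, `MetadataEntry` and `ProtocolEntry` records) are hypothetical stand-ins rather than Trino's actual classes, and transactions are assumed to be stored in ascending version order.

```java
import java.util.List;
import java.util.Optional;

public class LogTailSketch
{
    // Hypothetical, simplified log entry model (not Trino's real classes).
    interface LogEntry {}
    record MetadataEntry(String id) implements LogEntry {}
    record ProtocolEntry(int minReaderVersion, int minWriterVersion) implements LogEntry {}
    record AddFileEntry(String path) implements LogEntry {}
    record Transaction(long version, List<LogEntry> entries) {}

    // Only the transactions are stored; no metadata or protocol fields.
    private final List<Transaction> entries;

    LogTailSketch(List<Transaction> entries)
    {
        this.entries = List.copyOf(entries);
    }

    Optional<MetadataEntry> getMetadataEntry()
    {
        return lastOfType(MetadataEntry.class);
    }

    Optional<ProtocolEntry> getProtocolEntry()
    {
        return lastOfType(ProtocolEntry.class);
    }

    // Scans all entries in order and keeps the last one of the requested type.
    private <T extends LogEntry> Optional<T> lastOfType(Class<T> type)
    {
        return entries.stream()
                .flatMap(transaction -> transaction.entries().stream())
                .filter(type::isInstance)
                .map(type::cast)
                .reduce((first, second) -> second);
    }

    public static void main(String[] args)
    {
        LogTailSketch tail = new LogTailSketch(List.of(
                new Transaction(11, List.of(new AddFileEntry("part-0.parquet"))),
                new Transaction(12, List.of(new MetadataEntry("m2"), new AddFileEntry("part-1.parquet")))));
        System.out.println(tail.getMetadataEntry());
        System.out.println(tail.getProtocolEntry());
    }
}
```

The trade-off in this sketch is a scan of the tail on every call instead of two extra fields; since the tail only holds commits since the last snapshot, the scan is probably cheap.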

Contributor

@Pluies should we change this to compute metadataEntry and protocolEntry on the fly?

Contributor Author

Sounds good! On it 👍

}
throw e;
}
MetadataEntry metadataEntry = (MetadataEntry) logEntries.get(MetadataEntry.class);
Contributor

Where are these checks being done now?

}

@Test
public void testCheckpointFileOperations()
Contributor

Can you add the test in a preparatory commit, so that the "gains" are more easily visible in the commit where you add caching?

Contributor Author

Fair call 👍

@Pluies
Contributor Author

Pluies commented Feb 27, 2024

Just to manage expectations on the timeline for this PR, we're internally upgrading to Trino 439 w/ caching, and will check whether this PR is still worth including once file-level caching is in place. I'll report back 👍

@github-actions

This pull request has gone a while without any activity. Tagging the Trino developer relations team: @bitsondatadev @colebow @mosabua

@github-actions github-actions bot added the stale label Mar 19, 2024
@Pluies
Contributor Author

Pluies commented Mar 19, 2024

> This pull request has gone a while without any activity. Tagging the Trino developer relations team: @bitsondatadev @colebow @mosabua

This is still waiting on me giving perf numbers on whether caching delta_log files (via Trino 440+) is enough, or if this PR is still relevant. Will update once we've upgraded and I have concrete numbers 👍

@mosabua
Member

mosabua commented Mar 19, 2024

Sounds good @Pluies !

@github-actions github-actions bot removed the stale label Mar 20, 2024
@github-actions

This pull request has gone a while without any activity. Tagging the Trino developer relations team: @bitsondatadev @colebow @mosabua

@github-actions github-actions bot added the stale label Apr 10, 2024
@github-actions

github-actions bot commented May 2, 2024

Closing this pull request, as it has been stale for six weeks. Feel free to re-open at any time.

@github-actions github-actions bot closed this May 2, 2024
@mosabua mosabua added stale-ignore Use this label on PRs that should be ignored by the stale bot so they are not flagged or closed. and removed stale labels May 2, 2024
@mosabua
Member

mosabua commented May 2, 2024

Reopening with the assumption that the collab here will continue.

@mosabua mosabua reopened this May 2, 2024
@Pluies
Contributor Author

Pluies commented May 21, 2024

Ooookay, I'm back! After an arduous upgrade to Trino 444, here are some benchmark results.

First, here is a microbenchmark of select * from table limit 10 from one of our Delta tables, in a fresh cluster, over several runs:

  1. Trino 444

| Run | Elapsed Time | Queued Time | Analysis Time | Planning Time | Execution Time |
|-----|--------------|-------------|---------------|---------------|----------------|
| 1 | 12.74s | 391.13us | 7.57s | 188.55ms | 5.17s |
| 2 | 4.01s | 234.01us | 1.69s | 110.93ms | 2.31s |
| 3 | 3.78s | 167.67us | 1.49s | 93.53ms | 2.29s |
| 4 | 3.77s | 167.90us | 1.55s | 92.17ms | 2.22s |
| 5 | 3.66s | 169.85us | 1.45s | 92.16ms | 2.20s |

  2. Trino 444 with improved metadata caching

| Run | Elapsed Time | Queued Time | Analysis Time | Planning Time | Execution Time |
|-----|--------------|-------------|---------------|---------------|----------------|
| 1 | 14.01s | 11.29ms | 7.83s | 529.74ms | 6.17s |
| 2 | 5.08s | 309.44us | 235.96ms | 118.35ms | 4.84s |
| 3 | 2.38s | 481.55us | 118.86ms | 106.40ms | 2.26s |
| 4 | 2.38s | 481.55us | 118.86ms | 106.40ms | 2.26s |
| 5 | 2.38s | 481.55us | 118.86ms | 106.40ms | 2.26s |

The first run is very slow on both clusters as Trino has to fetch all the Delta log, but subsequent queries are noticeably faster with improved metadata caching.

Here are some results from a different synthetic benchmark that replays user-submitted queries:

  1. Trino 444
time=2024-05-17T16:11:11.254Z level=INFO source=/integration-tests/cmd/rerunner/main.go:155 msg="rerun stats" totalTasks=1000 concurrency=40 successRate=38% duration=1953.31s avgQueryDuration=74.11s maxQueryDuration=257.05s avgAnalysisTime=2197ms avgPlanningTime=2181ms avgExecutionTime=33960ms avgFinishingTime=3ms
  2. Trino 444 with improved metadata caching
time=2024-05-17T16:42:30.212Z level=INFO source=/integration-tests/cmd/rerunner/main.go:155 msg="rerun stats" totalTasks=1000 concurrency=40 successRate=38% duration=1717.46s avgQueryDuration=66.55s maxQueryDuration=257.97s avgAnalysisTime=661ms avgPlanningTime=2237ms avgExecutionTime=30036ms avgFinishingTime=16ms

These results are a bit noisy, but the drop in analysis time is again clear: avgAnalysisTime falls from 2197ms to 661ms, roughly a 70% reduction.

cc @raunaqmorarka as discussed with @jkylling

@Pluies
Contributor Author

Pluies commented May 21, 2024

NB: the tests above were all run with delta.checkpoint-filtering.enabled=true and fs cache covering Delta log on the coordinator.

@raunaqmorarka
Member

@Pluies do you know the reason for the difference? Can you share a JFR profile of the run with delta.checkpoint-filtering.enabled=true and fs cache covering Delta log on the coordinator?
I think whatever form of caching we have should work without requiring delta.checkpoint-filtering.enabled to be disabled.

@Pluies
Contributor Author

Pluies commented May 22, 2024

@raunaqmorarka I've never used JFR, but from the docs it looks like it can only be used with a commercial Java SE subscription, which we don't have, so I don't think I'll be able to provide that. Correction: it has been open-source since Java 11, so I will look into it!

@raunaqmorarka raunaqmorarka removed the stale-ignore Use this label on PRs that should be ignored by the stale bot so they are not flagged or closed. label May 9, 2025
@github-actions

This pull request has gone a while without any activity. Ask for help on #core-dev on Trino slack.

@github-actions github-actions bot added the stale label May 30, 2025
@github-actions

Closing this pull request, as it has been stale for six weeks. Feel free to re-open at any time.

@github-actions github-actions bot closed this Jun 20, 2025
Development

Successfully merging this pull request may close these issues.

Redundant loads of metadata and protocol entries from delta log checkpoint

5 participants