Stream large transaction log jsons instead of storing in-memory #24491
@raunaqmorarka very interesting! I've skimmed the code. Would you consider this an alternative approach to the metadata/protocol caching we were trying to push in #20437, or more of a separate improvement? I assume they'll impact the same sort of time-consuming operations. (Asking as I'm keen on getting our fork closer to upstream; it doesn't really matter which approach ends up being used as long as performance improves 👍)
Thanks for pointing out that PR, I hadn't looked at it before. For me the priority was to deal gracefully with transaction log JSON files that are gigabytes in size. I've tweaked the PR a bit to be better about caching metadata/protocol entries. Feel free to try this out on your workloads or add review comments.
Looks promising to me.
Description
Operations fetching metadata and protocol entries can skip reading the rest of the json file after those entries are found
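The early-exit idea described above can be sketched roughly as follows. This is not the code from this PR: the class `EarlyExitLogScan`, its `Result` holder, and the method name are hypothetical, and the sketch assumes the commit file holds one JSON action per line with the action name (`metaData`, `protocol`) as the first key and no leading whitespace inside the object. A real implementation would more likely use a streaming JSON parser instead of prefix matching.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.util.Optional;

// Hypothetical illustration only; none of these names exist in Trino.
final class EarlyExitLogScan
{
    // Holds the raw JSON lines that were found and how many lines were consumed.
    static final class Result
    {
        final Optional<String> metadataLine;
        final Optional<String> protocolLine;
        final long linesRead;

        Result(String metadataLine, String protocolLine, long linesRead)
        {
            this.metadataLine = Optional.ofNullable(metadataLine);
            this.protocolLine = Optional.ofNullable(protocolLine);
            this.linesRead = linesRead;
        }
    }

    private EarlyExitLogScan() {}

    // Reads the log line by line and stops as soon as both entries are seen,
    // so the remaining add/remove entries are never read or held in memory.
    static Result findMetadataAndProtocol(Reader source)
    {
        String metadata = null;
        String protocol = null;
        long linesRead = 0;
        try (BufferedReader reader = new BufferedReader(source)) {
            String line;
            while ((line = reader.readLine()) != null) {
                linesRead++;
                // Assumption: the action name is the first key of each line.
                String trimmed = line.stripLeading();
                if (metadata == null && trimmed.startsWith("{\"metaData\"")) {
                    metadata = line;
                }
                else if (protocol == null && trimmed.startsWith("{\"protocol\"")) {
                    protocol = line;
                }
                if (metadata != null && protocol != null) {
                    break; // early exit: skip the rest of the file
                }
            }
        }
        catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return new Result(metadata, protocol, linesRead);
    }
}
```

If either entry is absent from the file, the loop falls through to the end, so the saving only applies when both entries appear before the bulk of the `add` entries.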
Additional context and related issues
On an example transaction log JSON of 1.5 GB, the time taken for simple operations like register table, DESCRIBE, and SELECTs that don't use table statistics (or any read with set session delta.statistics_enabled=false) drops from 18s to under 1s on a local machine.
Such large transaction log JSON files were observed to have been produced by the CLONE operation from Apache Spark.
Release notes
( ) This is not user-visible or is docs only, and no release notes are required.
( ) Release notes are required. Please propose a release note for me.
(x) Release notes are required, with the following suggested text: