Reuse metadata and protocol entries while retrieving the active files#19410
Conversation
...in/trino-delta-lake/src/test/java/io/trino/plugin/deltalake/TestDeltaLakeFileOperations.java
Outdated
Show resolved
Hide resolved
There was a problem hiding this comment.
The .*.parquet is pretty broad.
For example, checkpoint files would also match the pattern.
Maybe instead of this,
let's turn dataFilePattern into an ordinary catch all pattern \Q table_directory \E / ( key=value /)? [^/]+
There was a problem hiding this comment.
Modified the pattern to .*?/(?<partition>key=[^/]*/)?[^/]+ and placed it only after matching against metadata files.
...n/trino-delta-lake/src/main/java/io/trino/plugin/deltalake/transactionlog/TableSnapshot.java
Outdated
Show resolved
Hide resolved
...in/trino-delta-lake/src/test/java/io/trino/plugin/deltalake/TestDeltaLakeFileOperations.java
Outdated
Show resolved
Hide resolved
There was a problem hiding this comment.
remove
if (entryTypes.contains(ADD)) {
metadataAndProtocol = Optional.of(getCheckpointMetadataAndProtocolEntries(
session,
checkpointSchemaManager,
typeManager,
fileSystem,
stats,
checkpoint));
from io.trino.plugin.deltalake.transactionlog.TableSnapshot#getCheckpointTransactionLogEntries
it's now dead code
There was a problem hiding this comment.
Not sure about this:
There was a problem hiding this comment.
the linked code place already reads metadata entry.
let it read protocol as well and pass both
...-delta-lake/src/main/java/io/trino/plugin/deltalake/transactionlog/TransactionLogAccess.java
Outdated
Show resolved
Hide resolved
...n/trino-delta-lake/src/main/java/io/trino/plugin/deltalake/transactionlog/TableSnapshot.java
Outdated
Show resolved
Hide resolved
...n/trino-delta-lake/src/main/java/io/trino/plugin/deltalake/transactionlog/TableSnapshot.java
Outdated
Show resolved
Hide resolved
Typically metadata files should be accessed once, so prefer `.add(x)` over `.addCopies(x, 1)`, so that `.addCopies` stand out as potentially something to address.
Once the metadata & protocol entries are found, the scanning of multi-part checkpoint files can be stopped.
5736604 to
7c56c69
Compare
...n/trino-delta-lake/src/main/java/io/trino/plugin/deltalake/transactionlog/TableSnapshot.java
Outdated
Show resolved
Hide resolved
The `metadata` & `protocol` entries are already read (and saved) once when retrieving the table handle. Reuse this information while retrieving the active files for the table.
7c56c69 to
6e2c8fe
Compare
|
@findepi pls test this PR with secrets. |
0061960 to
f308b77
Compare
...ino-delta-lake/src/test/java/io/trino/plugin/deltalake/transactionlog/TestTableSnapshot.java
Outdated
Show resolved
Hide resolved
|
/test-with-secrets sha=f308b77d8670341f48b98c4503b86baea449575c |
f308b77 to
2151f45
Compare
|
🚀 |
Description
This is an incremental improvement in the direction set by #18916
While retrieving the
metadata&protocolentries from a multi-part checkpoint file, stop scanning the checkpoint files as soon as the metadata & protocol entries are actually found.Reuse metadata and protocol entries while retrieving the active files
The
metadata&protocolentries are already read (and saved) oncewhen retrieving the table handle.
Reuse this information while retrieving the active files for the table.
Additional context and related issues
Release notes
(x) This is not user-visible or is docs only, and no release notes are required.
( ) Release notes are required. Please propose a release note for me.
( ) Release notes are required, with the following suggested text: