Reuse metadata and protocol entries while retrieving the active files #19410

Merged
findepi merged 6 commits into trinodb:master from findinpath:findinpath/multipart-checkpoint-files
Oct 18, 2023

@findinpath
Contributor

@findinpath findinpath commented Oct 16, 2023

Description

This is an incremental improvement in the direction set by #18916.

While retrieving the metadata & protocol entries from a multi-part checkpoint file, stop scanning the checkpoint files as soon as the metadata & protocol entries are actually found.
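The early-exit idea can be sketched roughly as follows. This is not the Trino implementation: `CheckpointScanSketch`, `Entry`, `Result`, and `findMetadataAndProtocol` are hypothetical names, and each checkpoint part is modelled as a plain list of entries. The point is only that no further part is opened once both entries have been seen.

```java
import java.util.List;
import java.util.Optional;

// Hypothetical stand-ins for illustration; these are not Trino classes.
final class CheckpointScanSketch
{
    enum EntryType { METADATA, PROTOCOL, ADD }

    record Entry(EntryType type, String payload) {}

    record Result(Optional<Entry> metadata, Optional<Entry> protocol, int partsScanned) {}

    // Scans checkpoint parts in order and stops as soon as both entries are found.
    static Result findMetadataAndProtocol(List<List<Entry>> parts)
    {
        Optional<Entry> metadata = Optional.empty();
        Optional<Entry> protocol = Optional.empty();
        int partsScanned = 0;
        for (List<Entry> part : parts) {
            partsScanned++;
            for (Entry entry : part) {
                if (entry.type() == EntryType.METADATA && metadata.isEmpty()) {
                    metadata = Optional.of(entry);
                }
                else if (entry.type() == EntryType.PROTOCOL && protocol.isEmpty()) {
                    protocol = Optional.of(entry);
                }
                if (metadata.isPresent() && protocol.isPresent()) {
                    // Both entries found: the remaining parts are never opened
                    return new Result(metadata, protocol, partsScanned);
                }
            }
        }
        // At least one entry is missing after scanning every part
        return new Result(metadata, protocol, partsScanned);
    }

    public static void main(String[] args)
    {
        List<List<Entry>> parts = List.of(
                List.of(new Entry(EntryType.PROTOCOL, "p"), new Entry(EntryType.METADATA, "m")),
                List.of(new Entry(EntryType.ADD, "a1")),
                List.of(new Entry(EntryType.ADD, "a2")));
        System.out.println("parts scanned: " + findMetadataAndProtocol(parts).partsScanned());
    }
}
```

In this toy input both entries sit in the first part, so the second and third parts are skipped entirely.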

Reuse metadata and protocol entries while retrieving the active files

The metadata & protocol entries are already read (and saved) once
when retrieving the table handle.
Reuse this information while retrieving the active files for the table.
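A minimal sketch of the reuse described above, under simplifying assumptions: `ReuseEntriesSketch`, its records, and the read counter are all hypothetical and only model the flow "read once for the table handle, pass along afterwards". The real Trino classes and signatures differ.

```java
import java.util.List;

// Hypothetical model of the flow; none of these types are Trino classes.
final class ReuseEntriesSketch
{
    record MetadataEntry(String schema) {}

    record ProtocolEntry(int minReaderVersion) {}

    record TableHandle(MetadataEntry metadata, ProtocolEntry protocol) {}

    // Counts how often the (simulated) checkpoint is scanned for metadata and protocol
    static int metadataAndProtocolReads;

    static TableHandle getTableHandle()
    {
        metadataAndProtocolReads++; // the only place where the entries are read
        return new TableHandle(new MetadataEntry("id bigint"), new ProtocolEntry(1));
    }

    // Takes the already-known entries from the handle instead of reading them again
    static List<String> getActiveFiles(TableHandle handle)
    {
        MetadataEntry metadata = handle.metadata();
        ProtocolEntry protocol = handle.protocol();
        return List.of("part-0000.parquet (schema: " + metadata.schema()
                + ", reader version: " + protocol.minReaderVersion() + ")");
    }

    public static void main(String[] args)
    {
        TableHandle handle = getTableHandle();
        System.out.println(getActiveFiles(handle));
        System.out.println("metadata/protocol reads: " + metadataAndProtocolReads);
    }
}
```

Listing the active files any number of times leaves the read counter unchanged, which is the effect the commit message describes.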

Additional context and related issues

Release notes

(x) This is not user-visible or is docs only, and no release notes are required.
( ) Release notes are required. Please propose a release note for me.
( ) Release notes are required, with the following suggested text:

# Delta Lake
* Avoid redundant reading of the checkpoint files. ({issue}`issuenumber`)

@cla-bot cla-bot bot added the cla-signed label Oct 16, 2023
@findinpath findinpath added the delta-lake (Delta Lake connector) label and removed the cla-signed label Oct 16, 2023
@cla-bot cla-bot bot added the cla-signed label Oct 16, 2023
@findinpath findinpath changed the title from "Avoid useless scanning of multi-part checkpoint files" to "Reuse metadata and protocol entries while retrieving the active files" Oct 16, 2023
@findinpath findinpath self-assigned this Oct 16, 2023
Member

@findepi findepi left a comment

Good find

Member

The `.*.parquet` pattern is pretty broad.
For example, checkpoint files would also match it.

Maybe instead of this,
let's turn `dataFilePattern` into an ordinary catch-all pattern: `\Q table_directory \E / ( key=value /)? [^/]+`

Contributor Author

Modified the pattern to `.*?/(?<partition>key=[^/]*/)?[^/]+` and placed it only after matching against metadata files.
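To make the ordering concern concrete, here is a small sketch with `java.util.regex`. The `DATA` pattern is the one quoted in this reply (for a table partitioned by a column named `key`) and `BROAD` is the pattern questioned in the review. The `TRANSACTION_LOG` pattern, the `classify` helper, and the sample paths are assumptions made for illustration and are much simpler than the patterns in the actual test code.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class FilePatternSketch
{
    // The broad pattern questioned in the review
    static final Pattern BROAD = Pattern.compile(".*.parquet");
    // Assumed, simplified stand-in for the metadata-file patterns
    static final Pattern TRANSACTION_LOG = Pattern.compile(".*/_delta_log/[^/]+");
    // The catch-all data file pattern quoted in the reply
    static final Pattern DATA = Pattern.compile(".*?/(?<partition>key=[^/]*/)?[^/]+");

    static String classify(String path)
    {
        // Metadata files are matched first, because the catch-all data pattern would accept them too
        if (TRANSACTION_LOG.matcher(path).matches()) {
            return "metadata";
        }
        Matcher data = DATA.matcher(path);
        if (data.matches()) {
            String partition = data.group("partition");
            return partition == null ? "data" : "data in " + partition;
        }
        return "unknown";
    }

    public static void main(String[] args)
    {
        String checkpoint = "table/_delta_log/00000000000000000010.checkpoint.parquet";
        // The broad pattern cannot tell a checkpoint from a data file
        System.out.println(BROAD.matcher(checkpoint).matches());
        System.out.println(classify(checkpoint));
        System.out.println(classify("table/key=1/part-0.parquet"));
        System.out.println(classify("table/part-0.parquet"));
    }
}
```

Because the catch-all pattern also matches files under `_delta_log`, the classification only works if the metadata patterns are tried first, which is the placement described above.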

Member

remove

if (entryTypes.contains(ADD)) {
            metadataAndProtocol = Optional.of(getCheckpointMetadataAndProtocolEntries(
                    session,
                    checkpointSchemaManager,
                    typeManager,
                    fileSystem,
                    stats,
                    checkpoint));

from io.trino.plugin.deltalake.transactionlog.TableSnapshot#getCheckpointTransactionLogEntries

it's now dead code

Contributor Author

Not sure about this:

snapshot.getCheckpointTransactionLogEntries(
                session,
                ImmutableSet.of(PROTOCOL, TRANSACTION, ADD, REMOVE, COMMIT),
                checkpointSchemaManager,
                typeManager,
                fileSystem,
                fileFormatDataSourceStats)
        .forEach(checkpointBuilder::addLogEntry);

Member

The linked code already reads the metadata entry.
Let it read the protocol entry as well and pass both.

Typically metadata files should be accessed once, so prefer `.add(x)`
over `.addCopies(x, 1)`, so that `.addCopies` stands out as potentially
something to address.
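
The convention in this commit message can be illustrated with a tiny stand-in. `AccessCounts` below is a hypothetical, hand-rolled multiset builder that only mimics the `add` and `addCopies` method names so the example runs without extra libraries. It is not the builder used in the tests, and the operation strings are made up.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical, minimal multiset builder; it only mimics the two method names from the commit message.
final class AccessCounts
{
    private final Map<String, Integer> counts = new LinkedHashMap<>();

    // Records a single expected access
    AccessCounts add(String operation)
    {
        return addCopies(operation, 1);
    }

    // Records several expected accesses of the same operation
    AccessCounts addCopies(String operation, int occurrences)
    {
        counts.merge(operation, occurrences, Integer::sum);
        return this;
    }

    Map<String, Integer> build()
    {
        return Map.copyOf(counts);
    }

    public static void main(String[] args)
    {
        Map<String, Integer> expected = new AccessCounts()
                .add("read 00000000000000000002.checkpoint.parquet")
                .addCopies("read _last_checkpoint", 2) // a repeated access is easy to spot
                .build();
        System.out.println(expected);
    }
}
```

Both spellings record the same count for a single access, so reserving `addCopies` for counts above one makes repeated reads visible when skimming the expectations.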
Once the metadata & protocol entries are found, the scanning of
multi-part checkpoint files can be stopped.
@findinpath findinpath force-pushed the findinpath/multipart-checkpoint-files branch from 5736604 to 7c56c69 Compare October 16, 2023 19:39
@findinpath findinpath requested review from findepi and homar October 17, 2023 07:01
The `metadata` & `protocol` entries are already read (and saved) once
when retrieving the table handle.
Reuse this information while retrieving the active files for the table.
@findinpath findinpath force-pushed the findinpath/multipart-checkpoint-files branch from 7c56c69 to 6e2c8fe Compare October 17, 2023 07:54
@findinpath
Contributor Author

@findepi pls test this PR with secrets.

@findinpath findinpath force-pushed the findinpath/multipart-checkpoint-files branch from 0061960 to f308b77 Compare October 17, 2023 10:49
@findepi
Member

findepi commented Oct 17, 2023

/test-with-secrets sha=f308b77d8670341f48b98c4503b86baea449575c

@findinpath findinpath force-pushed the findinpath/multipart-checkpoint-files branch from f308b77 to 2151f45 Compare October 17, 2023 11:56
@findinpath findinpath requested a review from findepi October 18, 2023 03:51
@findepi findepi merged commit f5b1e89 into trinodb:master Oct 18, 2023
@findepi
Member

findepi commented Oct 18, 2023

🚀

@github-actions github-actions bot added this to the 430 milestone Oct 18, 2023
