Construct AddFileEntry instance only if necessary while reading the Delta Lake checkpoint#19795

Merged
raunaqmorarka merged 8 commits into trinodb:master from findinpath:findinpath/add-file-entry
Dec 9, 2023

@findinpath
Contributor

@findinpath findinpath commented Nov 17, 2023

Description

When checkpoint filtering is applied and the partition constraints do not match the
partition values of the entry, avoid eagerly creating the AddFileEntry instance.

Split the loading of the add entries from the Parquet checkpoint into two channels:

  • one channel contains the partitionValues information
  • the other channel contains everything else related to add

When building the add entry, first check the partition constraint against the partition values and only then load the add block, to avoid spending unnecessary resources on deserialization.
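
The check-then-load order described above can be sketched as follows. This is a minimal illustration, not the connector code: `LazyAddEntrySketch`, `CheckpointRow`, `AddEntry`, and `readMatching` are hypothetical names, and the second channel is modeled as a plain `Supplier` instead of a Parquet block.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.function.Predicate;
import java.util.function.Supplier;

// Hypothetical stand-ins for illustration; these are not the Trino classes.
final class LazyAddEntrySketch
{
    // Simplified add entry: only a path and the partition values.
    record AddEntry(String path, Map<String, String> partitionValues) {}

    // One checkpoint row split into two "channels": the partition values,
    // and a deferred loader for everything else (here just the path).
    record CheckpointRow(Map<String, String> partitionValues, Supplier<String> restLoader) {}

    static List<AddEntry> readMatching(List<CheckpointRow> rows, Predicate<Map<String, String>> partitionConstraint)
    {
        List<AddEntry> result = new ArrayList<>();
        for (CheckpointRow row : rows) {
            // Check the partition constraint first.
            if (!partitionConstraint.test(row.partitionValues())) {
                // Non-matching rows are skipped without touching the second channel.
                continue;
            }
            // Only matching rows pay for deserializing the remaining fields.
            result.add(new AddEntry(row.restLoader().get(), row.partitionValues()));
        }
        return result;
    }
}
```

With a constraint such as `pv -> "eu".equals(pv.get("region"))`, the loader of a row whose partition value is `us` is never invoked.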

I tested this feature with a multi-part checkpoint file (25 parts, each around 12MB, ~300MB in total) stored in local MinIO and got the following results:

ADD retrieval of all entries
number of add entries: Optional[1235155]
checkpoint iterator completed positions: Optional[1235157]
checkpoint iterator completed bytes: Optional[323227977]
Elapsed Time in milliseconds: 17866

ADD partition pruning without current changes
number of add entries: Optional[18578]
checkpoint iterator completed positions: Optional[49290]
checkpoint iterator completed bytes: Optional[14099674]
Elapsed Time in milliseconds: 1056

ADD partition pruning with current changes
number of add entries: Optional[18578]
checkpoint iterator completed positions: Optional[49290]
checkpoint iterator completed bytes: Optional[14099674]
Elapsed Time in milliseconds: 701

As the analysis above shows, this change brings no noticeable improvement in terms of IO.
That's because a Parquet page is loaded anyway as soon as it contains at least one entry matching the partition predicate.
This is why, with partition pruning applied, the checkpoint iterator still reads the same number of bytes as the baseline.

However, the elapsed time decreases with this change (from 1056 ms to 701 ms above) because less deserialization is performed.

I also tested with a more permissive filter and did not see an improvement larger than ~0.5s in elapsed time between the baseline and the current changes.

ADD partition pruning without current changes
number of add entries: Optional[210575]
checkpoint iterator completed positions: Optional[248524]
checkpoint iterator completed bytes: Optional[66021553]
Elapsed Time in milliseconds: 3992

ADD partition pruning with current changes
number of add entries: Optional[210575]
checkpoint iterator completed positions: Optional[248524]
checkpoint iterator completed bytes: Optional[66021553]
Elapsed Time in milliseconds: 3532
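
For reference, the relative effect of the measurements above can be worked out with a few lines of arithmetic. The helper below is only an illustration (the class and method names are made up); the input numbers are copied from the results in this description. It yields roughly a 33.6% reduction for the selective filter and roughly 11.5% for the more permissive one, with identical bytes read in both cases.

```java
import java.util.Locale;

final class ElapsedTimeComparison
{
    // Relative reduction of elapsed time, in percent of the baseline.
    static double reductionPercent(long baselineMillis, long changedMillis)
    {
        return 100.0 * (baselineMillis - changedMillis) / baselineMillis;
    }

    public static void main(String[] args)
    {
        // Elapsed times in milliseconds, copied from the measurements above.
        System.out.printf(Locale.ROOT, "selective filter:  %.1f%%%n", reductionPercent(1056, 701));
        System.out.printf(Locale.ROOT, "permissive filter: %.1f%%%n", reductionPercent(3992, 3532));
    }
}
```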

Additional context and related issues

This change builds on top of #19588

Release notes

( ) This is not user-visible or is docs only, and no release notes are required.
( ) Release notes are required. Please propose a release note for me.
(x) Release notes are required, with the following suggested text:

# Delta
* Improve query planning performance on Delta Lake tables. ({issue}`19795`)

@cla-bot cla-bot bot added the cla-signed label Nov 17, 2023
@findinpath findinpath requested a review from ebyhr November 17, 2023 11:31
@findinpath findinpath self-assigned this Nov 17, 2023
@github-actions github-actions bot added the delta-lake Delta Lake connector label Nov 17, 2023
@findinpath findinpath added the delta-lake Delta Lake connector label Nov 17, 2023
@findinpath findinpath force-pushed the findinpath/add-file-entry branch from 1a41795 to a8e51ec Compare November 20, 2023 10:44
Contributor Author

@findinpath findinpath Nov 21, 2023

That's not enough: while debugging buildAddEntry I see that none of the blocks are lazy.

Contributor Author

Even if we use lazy blocks, the block behind the lazy block is the monolithic structure corresponding to the add entry.
If we want to actually avoid reading from parquet add related fields for the entries which are not relevant, we need to refactor the way we are reading the checkpoint.
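
To illustrate the point about a lazy wrapper around a monolithic structure, here is a toy model. It does not use Trino's block classes; `Lazy`, `decodeWholeAddRow`, and `decodePartitionValues` are hypothetical, and the counter only stands in for deserialization cost. Laziness defers the work, but the first access to any field of the monolithic row decodes all of its fields, whereas a separate channel for the partition values decodes just that one.

```java
import java.util.function.Supplier;

final class MonolithicLazySketch
{
    // Memoizing wrapper: the loader runs at most once, on first access.
    static final class Lazy<T> implements Supplier<T>
    {
        private Supplier<T> loader;
        private T value;
        private boolean loaded;

        Lazy(Supplier<T> loader)
        {
            this.loader = loader;
        }

        @Override
        public T get()
        {
            if (!loaded) {
                value = loader.get();
                loaded = true;
                loader = null; // release the loader once the value is materialized
            }
            return value;
        }
    }

    // Counts decoded fields; a stand-in for deserialization cost.
    static int decodedFields;

    // Monolithic row: partition values, path and stats are decoded together.
    static String[] decodeWholeAddRow()
    {
        decodedFields += 3;
        return new String[] {"region=eu", "eu/a.parquet", "{stats}"};
    }

    // Separate channel: only the partition values are decoded.
    static String decodePartitionValues()
    {
        decodedFields += 1;
        return "region=eu";
    }
}
```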

Member

Even if we use lazy blocks, the block behind the lazy block is the monolithic structure

yes

If we want to actually avoid reading from parquet add related fields for the entries which are not relevant,

i don't think the parquet reader supports that, or can support that, given how values are encoded in Parquet.

Contributor Author

i don't think the parquet reader supports that,

I'm pointing here towards using a similar method as for dereference pushdown.

@findinpath findinpath marked this pull request as draft November 22, 2023 09:10
@findinpath findinpath force-pushed the findinpath/add-file-entry branch from efd2637 to e1ce7b8 Compare November 22, 2023 11:56
@findinpath findinpath marked this pull request as ready for review November 22, 2023 11:56
@findinpath findinpath force-pushed the findinpath/add-file-entry branch 2 times, most recently from ee830b6 to 17fac3f Compare November 23, 2023 05:54
@findepi
Member

findepi commented Nov 24, 2023

Does this technically conflict with #19848 ?

@findinpath
Contributor Author

Does this technically conflict with #19848 ?

No, it shouldn't.

The stats to be read with #19848 are built statically.
This change is mostly about splitting the Parquet read of the add entries into two separate channels, specifically:

  • partitionValues
  • everything else

Member

@findepi findepi left a comment

"Avoid early to construct the AddFileEntry"

Member

@findepi findepi left a comment

"Create CheckPointFieldExtractor instance only if necessary"

Member

requireNonNull becomes redundant

@findinpath findinpath force-pushed the findinpath/add-file-entry branch from 17fac3f to b52e93f Compare November 24, 2023 15:18
Member

|| addPartitionValuesBlock.isNull(pagePosition) wasn't here, right?
why is it being added?

Contributor Author

The isNull check for addPartitionValuesBlock is being added because the add entry is now built from 2 blocks (instead of 1 initially) and they need to be consistent.
I'm changing the logic slightly, though. Thank you for raising this.
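
A rough sketch of the consistency requirement, assuming each block exposes a per-position null flag. Plain `boolean[]` arrays stand in for the two blocks here, and `hasAddEntry` is a hypothetical helper; the logic that was finally merged may differ from this.

```java
final class AddBlocksNullCheckSketch
{
    // nulls[i] == true means the corresponding block has no value at position i.
    // An add entry is only built when BOTH blocks carry a value at the position.
    static boolean hasAddEntry(boolean[] addBlockNulls, boolean[] partitionValuesNulls, int position)
    {
        return !(addBlockNulls[position] || partitionValuesNulls[position]);
    }
}
```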

Member

Is this useful, especially considering Block.toString?

or to be removed?

Contributor Author

Are you suggesting that we eventually remove all the debug statements from the CheckpointEntryIterator class?
Potential follow-up?

Member

We could log the field count of RowBlock once per Page somewhere; it looks unnecessary to log it for every position in a Page.
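
The suggestion could look roughly like this. The `Page` record and the list used as a log sink are placeholders for illustration, not the classes used in CheckpointEntryIterator.

```java
import java.util.List;

final class PerPageLoggingSketch
{
    // Hypothetical page: a number of positions whose rows all share one field count.
    record Page(int positionCount, int fieldCount) {}

    // Processes every position of the page and returns how many were handled.
    static int processPage(Page page, List<String> log)
    {
        // The field count is the same for all positions, so record it once per page.
        log.add("row block field count: " + page.fieldCount());
        int processed = 0;
        for (int position = 0; position < page.positionCount(); position++) {
            // Per-position work would go here, without any per-position logging.
            processed++;
        }
        return processed;
    }
}
```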

@findinpath findinpath force-pushed the findinpath/add-file-entry branch 2 times, most recently from 9dda8f5 to 1e3db18 Compare November 27, 2023 10:02
@findinpath findinpath force-pushed the findinpath/add-file-entry branch from 1e3db18 to 8dea018 Compare November 27, 2023 17:06
@findinpath findinpath force-pushed the findinpath/add-file-entry branch from 8dea018 to 29fab3e Compare November 27, 2023 17:19
@findinpath
Contributor Author

Rebased on master to handle conflicts with #19848

@findinpath findinpath force-pushed the findinpath/add-file-entry branch 2 times, most recently from 6f080ce to 206ad69 Compare November 28, 2023 10:28
@findepi
Member

findepi commented Nov 28, 2023

Rebased on master to handle conflicts with #19848

that's why i asked #19795 (comment) :)

@findinpath findinpath force-pushed the findinpath/add-file-entry branch from 206ad69 to ef818db Compare November 28, 2023 12:30
@findinpath findinpath force-pushed the findinpath/add-file-entry branch from ef818db to f1f645e Compare December 6, 2023 11:47
@findinpath
Contributor Author

Rebased on master to address code conflicts.

@raunaqmorarka raunaqmorarka changed the title from "Avoid early to construct the AddFileEntry while reading the Delta Lake checkpoint" to "Avoid eagerly constructing AddFileEntry while reading the Delta Lake checkpoint" Dec 8, 2023
Member

Please reword the commit message to

Construct AddFileEntry lazily

When checkpoint filtering is applied
and there are partition constraints which do not
match the partition values of the entry, avoid
eagerly constructing `AddFileEntry`.

Contributor Author

Modified to

Construct AddFileEntry instance only if necessary

When checkpoint filtering is applied and there
are partition constraints which do not match the
partition values of the entry, avoid eagerly to
construct `AddFileEntry` instances.

@findinpath findinpath force-pushed the findinpath/add-file-entry branch from f1f645e to 2039b5c Compare December 8, 2023 08:34
@findinpath findinpath changed the title from "Avoid eagerly constructing AddFileEntry while reading the Delta Lake checkpoint" to "Construct AddFileEntry instance only if necessary while reading the Delta Lake checkpoint" Dec 8, 2023

Labels

cla-signed delta-lake Delta Lake connector
