raunaqmorarka
Member

@raunaqmorarka raunaqmorarka commented Dec 16, 2024

Description

Operations that fetch metadata and protocol entries can skip reading the rest of the JSON file once those entries are found.
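
The early-exit idea can be illustrated with a minimal sketch. This is not the code from this PR (the connector is written in Java); it only assumes that the transaction log is newline-delimited JSON with one action per line, keyed by names such as `metaData`, `protocol`, `add`, and `commitInfo`. The function name `read_metadata_and_protocol` and the sample log contents are hypothetical.

```python
import io
import json


def read_metadata_and_protocol(lines):
    """Scan newline-delimited JSON actions and stop once both entries are seen.

    `lines` is any iterable of strings (for example an open text file), so the
    remainder of a very large log is never read or parsed.
    """
    metadata = None
    protocol = None
    lines_read = 0
    for line in lines:
        lines_read += 1
        line = line.strip()
        if not line:
            continue
        action = json.loads(line)
        if "metaData" in action:
            metadata = action["metaData"]
        elif "protocol" in action:
            protocol = action["protocol"]
        # Early exit: everything after this point is irrelevant to the caller.
        if metadata is not None and protocol is not None:
            break
    return metadata, protocol, lines_read


# Hypothetical log: three leading actions followed by many `add` entries.
sample_log = io.StringIO(
    '{"commitInfo": {"operation": "CLONE"}}\n'
    '{"protocol": {"minReaderVersion": 1, "minWriterVersion": 2}}\n'
    '{"metaData": {"id": "example-table-id"}}\n'
    + '{"add": {"path": "part-0000.parquet"}}\n' * 1000
)
metadata, protocol, lines_read = read_metadata_and_protocol(sample_log)
```

In this sketch the loop stops after the third line, so the 1000 `add` entries are never parsed. The benefit depends on the metadata and protocol entries appearing early in the file.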

Additional context and related issues

On an example transaction log JSON file of 1.5 GB, the time taken for simple operations
such as registering a table, DESCRIBE, and SELECTs that don't use table statistics (or any read with set session delta.statistics_enabled=false) drops from 18s to under 1s on a local machine.
Such large transaction log JSON files were observed to be produced by the CLONE operation from Apache Spark.

Release notes

( ) This is not user-visible or is docs only, and no release notes are required.
( ) Release notes are required. Please propose a release note for me.
(x) Release notes are required, with the following suggested text:

## Delta Lake
* Improve performance of queries on tables with large transaction log JSON files. ({issue}`24491`)

@cla-bot cla-bot bot added the cla-signed label Dec 16, 2024
@github-actions github-actions bot added the delta-lake Delta Lake connector label Dec 16, 2024
@Pluies
Contributor

Pluies commented Dec 16, 2024

@raunaqmorarka very interesting! I've skimmed the code, would you consider this an alternative approach to the metadata/protocol caching we were trying to push in #20437, or more as a separate improvement? I assume they'll impact the same sort of time-consuming operations.

(Asking as I'm keen on getting our fork closer to upstream; it doesn't really matter which approach ends up being used as long as performance improves 👍 )

@raunaqmorarka raunaqmorarka force-pushed the raunaq/delta-stream-log branch 2 times, most recently from fd27902 to 05a9faa Compare December 17, 2024 11:38
@raunaqmorarka
Member Author

> @raunaqmorarka very interesting! I've skimmed the code, would you consider this an alternative approach to the metadata/protocol caching we were trying to push in #20437, or more as a separate improvement? I assume they'll impact the same sort of time-consuming operations.
>
> (Asking as I'm keen on getting our fork closer to upstream; it doesn't really matter which approach ends up being used as long as performance improves 👍 )

Thanks for pointing out that PR, I hadn't looked at it before. For me the priority was to deal gracefully with transaction log JSON files that are gigabytes in size. I've tweaked the PR a bit to be better about caching metadata/protocol entries. Feel free to try this out on your workloads or add review comments.

Contributor

@findinpath findinpath left a comment

Looks promising to me.

@raunaqmorarka raunaqmorarka force-pushed the raunaq/delta-stream-log branch 2 times, most recently from cc4cc3a to d714365 Compare December 23, 2024 08:09
Operations fetching metadata and protocol entries can skip reading
the rest of the json file after those entries are found
@raunaqmorarka raunaqmorarka force-pushed the raunaq/delta-stream-log branch from d714365 to 0301c34 Compare December 23, 2024 18:14
@raunaqmorarka raunaqmorarka merged commit f888217 into master Dec 24, 2024
57 checks passed
@raunaqmorarka raunaqmorarka deleted the raunaq/delta-stream-log branch December 24, 2024 09:31
@github-actions github-actions bot added this to the 469 milestone Dec 24, 2024