feat: improve read performance by 7x with prebuffer #1709

Merged
rtyler merged 1 commit into delta-io:main from ion-elgreco:feat/enable_prebuffer_pyarrow
Oct 9, 2023

Collaborator

ion-elgreco commented Oct 8, 2023

Description

Enable pre_buffer in pyarrow.dataset.ParquetFragmentScanOptions. A related PR in the Arrow repo made this the default behavior. Older versions of PyArrow do not have that default, however, so we need to set it to True explicitly.

It improves read speed by 6-7x on Azure in one dataset that I have.

Before:
1min 4s ± 3.48 s per loop (mean ± std. dev. of 7 runs, 1 loop each)

After:
8.99 s ± 786 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Related Issue(s)

Closes #1569

github-actions bot added the binding/python (Issues for the Python package) label Oct 8, 2023
@rtyler rtyler enabled auto-merge October 9, 2023 15:16
Member

@rtyler rtyler left a comment

There is a memory tradeoff, but I think upstream making this the default indicates it's well worth it as default behavior here too.

Going to approve, thanks for another solid improvement @ion-elgreco

@rtyler rtyler merged commit ab6b0cf into delta-io:main Oct 9, 2023