Merge string-view2 branch: reading from parquet up to 2x faster for some ClickBench queries (not on by default) #11667

Merged
alamb merged 20 commits into main from string-view2
Jul 29, 2024

Conversation

alamb
Contributor

@alamb alamb commented Jul 26, 2024

Draft until arrow 52.2.0 is released to crates.io (expected Sat July 27)

Which issue does this PR close?

Part of #10918

Closes #10921

Rationale for this change

We have been integrating a set of StringView changes on the string-view2 branch because they rely on unreleased code in arrow-rs. Once those changes are released and DataFusion uses them, we can bring this code directly to main.

What changes are included in this PR?

Are these changes tested?

CI

Are there any user-facing changes?

When StringView is enabled (it is not on by default), some ClickBench queries that read from parquet run up to 2x faster.
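
For readers new to the format, here is a minimal sketch of the 16-byte "view" that the Arrow columnar format defines for Utf8View values. It is an illustration only, not code from this PR or from arrow-rs; `make_view` and `is_inline` are hypothetical names. Strings of 12 bytes or fewer are stored entirely inside the view, while longer strings keep a 4-byte prefix plus a buffer index and offset, so many operations can proceed without copying or even touching the string data.

```rust
/// Hypothetical helper (not an arrow-rs API): pack a string into a
/// 16-byte view following the Arrow Utf8View layout.
fn make_view(s: &str, buffer_index: u32, offset: u32) -> [u8; 16] {
    let bytes = s.as_bytes();
    let mut view = [0u8; 16];
    // Bytes 0..4: string length, little-endian.
    view[0..4].copy_from_slice(&(bytes.len() as u32).to_le_bytes());
    if bytes.len() <= 12 {
        // Short string: the data lives inline in bytes 4..16, zero-padded.
        view[4..4 + bytes.len()].copy_from_slice(bytes);
    } else {
        // Long string: 4-byte prefix, then where to find the full data.
        view[4..8].copy_from_slice(&bytes[0..4]);
        view[8..12].copy_from_slice(&buffer_index.to_le_bytes());
        view[12..16].copy_from_slice(&offset.to_le_bytes());
    }
    view
}

/// Hypothetical helper: true when the string is stored inside the view itself.
fn is_inline(view: &[u8; 16]) -> bool {
    u32::from_le_bytes([view[0], view[1], view[2], view[3]]) <= 12
}

fn main() {
    let short = make_view("parquet", 0, 0);
    let long = make_view("a string longer than twelve bytes", 0, 0);
    println!("short inline: {}", is_inline(&short));
    println!("long inline: {}", is_inline(&long));
}
```

Because the views are fixed-width, filtering or reordering rows only moves 16-byte views and leaves the underlying data buffers in place. That also means a small surviving batch can keep large buffers alive, which is presumably why this branch includes a "gc string view when appropriate" change.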

alamb and others added 9 commits July 16, 2024 15:55
* add a knob to force string view in benchmark

* fix sql logic test

* update doc

* fix ci

* fix ci only test

* Update benchmarks/src/util/options.rs

Co-authored-by: Andrew Lamb <[email protected]>

* Update datafusion/common/src/config.rs

Co-authored-by: Andrew Lamb <[email protected]>

* update tests

---------

Co-authored-by: Andrew Lamb <[email protected]>
* add functions

* add tests for hash util
* Update `string-view` branch to arrow-rs main (#10966)

* Pin to arrow main

* Fix clippy with latest arrow

* Uncomment test that needs new arrow-rs to work

* Update datafusion-cli Cargo.lock

* Update Cargo.lock

* taplo

* merge

* update cast

* consistent dep

* fix ci

* add more tests

* make doc happy

* update new implementation

* fix bug

* avoid unused dep

* update dep

* update

* fix cargo check

* update doc

* pick up the comments change again

---------

Co-authored-by: Andrew Lamb <[email protected]>
…11519)

* add functions

* Update `string-view` branch to arrow-rs main (#10966)

* Pin to arrow main

* Fix clippy with latest arrow

* Uncomment test that needs new arrow-rs to work

* Update datafusion-cli Cargo.lock

* Update Cargo.lock

* taplo

* merge

* update cast

* consistent dep

* fix ci

* avoid unused dep

* update dep

* update

* fix cargo check

* better group value view aggregation

* update

---------

Co-authored-by: Andrew Lamb <[email protected]>
* initial support for string view regex

* update tests
* Add StringView support for date_part and make_date funcs

* run cargo update in datafusion-cli

* cargo fmt

---------

Co-authored-by: Andrew Lamb <[email protected]>
* gc string view when appropriate

* make clippy happy

* address comments

* make doc happy

* update style

* Add comments and tests for gc_string_view_batch

* better heuristic

* update test

* Update datafusion/physical-plan/src/coalesce_batches.rs

Co-authored-by: Andrew Lamb <[email protected]>

---------

Co-authored-by: Andrew Lamb <[email protected]>
@github-actions github-actions bot added labels Jul 26, 2024: documentation (Improvements or additions to documentation); logical-expr (Logical plan and expressions); physical-expr (Physical Expressions); core (Core DataFusion crate); sqllogictest (SQL Logic Tests (.slt))
@alamb
Contributor Author

alamb commented Jul 26, 2024

CI appears to be failing due to #11671

* better default block size

* fix related test
* use inferred schema, don't load schema again

* move config to parquet-only

* update

* update

* better format

* format

* update
@alamb
Contributor Author

alamb commented Jul 27, 2024

Update here is that we are on track to release arrow 52.2.0 to crates.io tomorrow Saturday July 28 (thank you to @waynexia @viirya @wjones127 for verifying / voting 🙏).

Then we'll need a PR to update datafusion to arrow 52.2.0 (which should be straightforward)

Then I will change this PR to ready to review and we should be able to merge it into DataFusion main

🤞

@alamb alamb changed the title Merge string-view2 branch to main Implement StringView reading from parquet, up to 2x faster for some click bench queries (Merge string-view2 branch to main ) Jul 27, 2024
@alamb alamb changed the title Implement StringView reading from parquet, up to 2x faster for some click bench queries (Merge string-view2 branch to main ) StringView reading from parquet, up to 2x faster for some click bench queries (Merge string-view2 branch to main ) Jul 27, 2024
* native support for character length

* Update datafusion/functions/src/unicode/character_length.rs

---------

Co-authored-by: Andrew Lamb <[email protected]>
@alamb alamb changed the title StringView reading from parquet, up to 2x faster for some click bench queries (Merge string-view2 branch to main ) Merge string-view branch: reading from parquet up to 2x faster for some ClickBench queries (not on by default) Jul 29, 2024
@alamb alamb marked this pull request as ready for review July 29, 2024 12:37
@alamb
Contributor Author

alamb commented Jul 29, 2024

This PR is now ready for review (mostly I am hoping another committer will approve/merge it as I can't approve my own PR)

@alamb alamb changed the title Merge string-view branch: reading from parquet up to 2x faster for some ClickBench queries (not on by default) Merge string-view2 branch: reading from parquet up to 2x faster for some ClickBench queries (not on by default) Jul 29, 2024
Contributor

@Dandandan Dandandan left a comment

🚀

@alamb
Contributor Author

alamb commented Jul 29, 2024

Thanks @Dandandan -- so exciting!

I plan to work with @XiangpengHao and figure out what we need to do to get this feature on by default

@alamb
Contributor Author

alamb commented Jul 29, 2024

🚀

@alamb alamb merged commit a591301 into main Jul 29, 2024
51 checks passed
@alamb alamb deleted the string-view2 branch July 29, 2024 20:52
This pull request was closed.
Successfully merging this pull request may close these issues.

use StringViewArray when reading String columns from Parquet