Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Bug] fix bug in return type inference of utf8_to_int_type #11662

Merged
merged 4 commits into from
Jul 26, 2024

Conversation

XiangpengHao
Copy link
Contributor

Which issue does this PR close?

Closes #.

Rationale for this change

get_optimal_return_type was a bit misleading and not super clear what it tries to do.

This PR fix a bug where the wrong return type is inferenced. Basically, whenver we call utf8_to_int_type, the wrong type was returned, i.e., it should return a integeter type, not string type.

Maybe we should file follow up PR to improve the documentation and reduce misuse.

What changes are included in this PR?

Are these changes tested?

Are there any user-facing changes?

@alamb
Copy link
Contributor

alamb commented Jul 26, 2024

I think clippy is failing because the string-view2 branch needs to get the same fixes we have applied to main.

I will make a PR to do this shortly

@@ -41,8 +41,8 @@ macro_rules! get_optimal_return_type {
DataType::LargeUtf8 | DataType::LargeBinary => $largeUtf8Type,
// Binary inputs are automatically coerced to Utf8
DataType::Utf8 | DataType::Binary => $utf8Type,
// Utf8View inputs will yield Utf8View outputs
DataType::Utf8View => DataType::Utf8View,
// Utf8View max offset size is u32::MAX, the same as UTF8
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

how did you find this? Can we add a test for it (e.g. an explain plan test showing a change in type)?

I can help write one if you explain the symptoms you were seeing to me

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Used in string related ScalarUDFs, for example: https://github.com/XiangpengHao/datafusion/blob/string-view2-local/datafusion/functions/src/unicode/character_length.rs#L70

I added a small unit test to the function.

Note that there are a bunch of string related functions that only accepts Utf8 and LargeUtf8, we currently rely on coerce rules to cast them, which won't panic but may be slower than it should be. I think we should add native support to Utf8View, I'm working on it.

@alamb
Copy link
Contributor

alamb commented Jul 26, 2024

I merged up from main to try and get CI passing on this PR

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you

@alamb alamb merged commit f13bb82 into apache:string-view2 Jul 26, 2024
24 checks passed
alamb added a commit that referenced this pull request Jul 29, 2024
… some ClickBench queries (not on by default) (#11667)

* Pin to pre-release version of arrow 52.2.0

* Update for deprecated method

* Add a config to force using string view in benchmark (#11514)

* add a knob to force string view in benchmark

* fix sql logic test

* update doc

* fix ci

* fix ci only test

* Update benchmarks/src/util/options.rs

Co-authored-by: Andrew Lamb <[email protected]>

* Update datafusion/common/src/config.rs

Co-authored-by: Andrew Lamb <[email protected]>

* update tests

---------

Co-authored-by: Andrew Lamb <[email protected]>

* Add String view helper functions (#11517)

* add functions

* add tests for hash util

* Add ArrowBytesViewMap and ArrowBytesViewSet (#11515)

* Update `string-view` branch to arrow-rs main (#10966)

* Pin to arrow main

* Fix clippy with latest arrow

* Uncomment test that needs new arrow-rs to work

* Update datafusion-cli Cargo.lock

* Update Cargo.lock

* tapelo

* merge

* update cast

* consistent dep

* fix ci

* add more tests

* make doc happy

* update new implementation

* fix bug

* avoid unused dep

* update dep

* update

* fix cargo check

* update doc

* pick up the comments change again

---------

Co-authored-by: Andrew Lamb <[email protected]>

* Enable `GroupValueBytesView` for aggregation with StringView types (#11519)

* add functions

* Update `string-view` branch to arrow-rs main (#10966)

* Pin to arrow main

* Fix clippy with latest arrow

* Uncomment test that needs new arrow-rs to work

* Update datafusion-cli Cargo.lock

* Update Cargo.lock

* tapelo

* merge

* update cast

* consistent dep

* fix ci

* avoid unused dep

* update dep

* update

* fix cargo check

* better group value view aggregation

* update

---------

Co-authored-by: Andrew Lamb <[email protected]>

* Initial support for regex_replace on `StringViewArray` (#11556)

* initial support for string view regex

* update tests

* Add support for Utf8View for date/temporal codepaths (#11518)

* Add StringView support for date_part and make_date funcs

* run cargo update in datafusion-cli

* cargo fmt

---------

Co-authored-by: Andrew Lamb <[email protected]>

* GC `StringViewArray` in `CoalesceBatchesStream` (#11587)

* gc string view when appropriate

* make clippy happy

* address comments

* make doc happy

* update style

* Add comments and tests for gc_string_view_batch

* better herustic

* update test

* Update datafusion/physical-plan/src/coalesce_batches.rs

Co-authored-by: Andrew Lamb <[email protected]>

---------

Co-authored-by: Andrew Lamb <[email protected]>

* [Bug] fix bug in return type inference of `utf8_to_int_type` (#11662)

* fix bug in return type inference

* update doc

* add tests

---------

Co-authored-by: Andrew Lamb <[email protected]>

* Fix clippy

* Increase ByteViewMap block size to 2MB (#11674)

* better default block size

* fix related test

* Change `--string-view` to only apply to parquet formats (#11663)

* use inferenced schema, don't load schema again

* move config to parquet-only

* update

* update

* better format

* format

* update

* Implement native support StringView for character length (#11676)

* native support for character length

* Update datafusion/functions/src/unicode/character_length.rs

---------

Co-authored-by: Andrew Lamb <[email protected]>

* Remove uneeded patches

* cargo fmt

---------

Co-authored-by: Xiangpeng Hao <[email protected]>
Co-authored-by: Xiangpeng Hao <[email protected]>
Co-authored-by: Andrew Duffy <[email protected]>
This pull request was closed.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants