Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Cast Utf8View to Utf8 to support || from StringViewArray #11796

Merged
merged 3 commits into from
Aug 7, 2024

Conversation

dharanad
Copy link
Contributor

@dharanad dharanad commented Aug 3, 2024

Which issue does this PR close?

Partially make #11766 work.

Rationale for this change

What changes are included in this PR?

  • coerced Utf8View to Utf8 make concat work
  • Added slt tests cases

Are these changes tested?

Existing test cases

carog test

Are there any user-facing changes?

@github-actions github-actions bot added logical-expr Logical plan and expressions sqllogictest SQL Logic Tests (.slt) labels Aug 3, 2024
@dharanad dharanad changed the title [WIP] Cast Utf8View to Utf8 to support || from StringViewArray Cast Utf8View to Utf8 to support || from StringViewArray Aug 5, 2024
@dharanad dharanad marked this pull request as ready for review August 5, 2024 10:21
Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @dharanad -- this is ceratinly better than what happens on main (which is the query doesn't run)

I do think the plans could be improved if they cast to Utf8Vew instead, but we can do that as a follow on PR as well

(LargeUtf8, from_type) | (from_type, LargeUtf8) => {
string_concat_internal_coercion(from_type, &LargeUtf8)
match (lhs_type, rhs_type) {
// If Utf8View is in any side, we coerce to Utf8.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it would be better to coerce to Utf8View as that coercsion will often be faster (it is faster to cast Utf8 -> Utf8View than the other way around)

Is that possible?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agree, similar to this policy:

fn string_coercion(lhs_type: &DataType, rhs_type: &DataType) -> Option<DataType> {
use arrow::datatypes::DataType::*;
match (lhs_type, rhs_type) {
// If Utf8View is in any side, we coerce to Utf8View.
(Utf8View, Utf8View | Utf8 | LargeUtf8) | (Utf8 | LargeUtf8, Utf8View) => {
Some(Utf8View)
}
// Then, if LargeUtf8 is in any side, we coerce to LargeUtf8.
(LargeUtf8, Utf8 | LargeUtf8) | (Utf8, LargeUtf8) => Some(LargeUtf8),
(Utf8, Utf8) => Some(Utf8),
_ => None,
}
}

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it would be better to coerce to Utf8View as that coercsion will often be faster (it is faster to cast Utf8 -> Utf8View than the other way around)

Is that possible?

How about we do in a seperate PR. Previously, we were coerced to Utf8View, so concat was failing. As a temporary workaround to resolve the issue, I've coerce Utf8 instead.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Make sense!

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

filed #11881 to track

select column2||' is fast' from temp;
----
rust is fast
datafusion is fast
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

❤️

explain select column2 || 'is' || column3 from temp;
----
logical_plan
01)Projection: CAST(temp.column2 AS Utf8) || Utf8("is") || CAST(temp.column3 AS Utf8)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This would be a better plan (likely faster) if the casting was to Utf8View rather than Utf8

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree

explain select column2||' is fast' from temp;
----
logical_plan
01)Projection: CAST(temp.column2 AS Utf8) || Utf8(" is fast")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

likewise here it would be better to use Utf8View

@dharanad
Copy link
Contributor Author

dharanad commented Aug 7, 2024

@alamb Are we good here ? Or do you have more suggestions for this PR
I'm actively investigating alternative solutions for this problem here #11836

@alamb
Copy link
Contributor

alamb commented Aug 7, 2024

@alamb Are we good here ? Or do you have more suggestions for this PR I'm actively investigating alternative solutions for this problem here #11836

Nope, sorry @dharanad -- this is good. I am behind on merging / upstream reviews.

@alamb alamb merged commit 60d1d3a into apache:main Aug 7, 2024
26 checks passed
@alamb
Copy link
Contributor

alamb commented Aug 7, 2024

Thanks again @dharanad and @XiangpengHao -- sorry for the delay

@dharanad dharanad deleted the fix/11766 branch August 13, 2024 14:11
This pull request was closed.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
logical-expr Logical plan and expressions sqllogictest SQL Logic Tests (.slt)
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants