Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

implement utf8_view for replace #12004

Merged
merged 8 commits into from
Aug 19, 2024
Merged

Conversation

thinh2
Copy link
Contributor

@thinh2 thinh2 commented Aug 15, 2024

Which issue does this PR close?

Closes #11913 .

Rationale for this change

Before this change, the replace function only support utf8 and largeUtf8. We used to cast utf8View to utf8, which reduces the performance of the function. This PR is to add support for utf8View inside replace function.

What changes are included in this PR?

  • Create a new function to handle replace for utf8View
  • Implement unit test for the replace function
  • Update sqllogictest to match our utf8View expectation (no CAST in the logical plans)
  • Update the signature of replace to handle utf8View. Without this change, the sql query will still cast utf8View to utf8

Are these changes tested?

  • Implement unit test for the replace function
  • Update sqllogictest to match our utf8View expectation (no CAST in the logical plans)

Are there any user-facing changes?

No

@github-actions github-actions bot added sqllogictest SQL Logic Tests (.slt) functions labels Aug 15, 2024
Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looking nice @thinh2 🙏 -- I'll start CI for this PR

datafusion/functions/src/string/replace.rs Outdated Show resolved Hide resolved
other => {
exec_err!("Unsupported data type {other:?} for function replace")
}
}
}
}

fn replace_view(args: &[ArrayRef]) -> Result<ArrayRef> {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

❤️

01)Projection: replace(__common_expr_1, Utf8("foo"), Utf8("bar")) AS c1, replace(__common_expr_1, CAST(test.column2_utf8view AS Utf8), Utf8("bar")) AS c2
02)--Projection: CAST(test.column1_utf8view AS Utf8) AS __common_expr_1, test.column2_utf8view
03)----TableScan: test projection=[column1_utf8view, column2_utf8view]
01)Projection: replace(test.column1_utf8view, Utf8("foo"), Utf8("bar")) AS c1, replace(test.column1_utf8view, test.column2_utf8view, Utf8("bar")) AS c2
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think adding a test that shows this query running (aka run

SELECT
  REPLACE(column1_utf8view, 'foo', 'bar') as c1,
  REPLACE(column1_utf8view, column2_utf8view, 'bar') as c2
FROM test;

in addition to EXPLAIN
)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @alamb, after updating the replace.rs to process Utf8View, my sql logic test still uses the CAST in its logical plan and this test is failing. Do you have any idea to debug this issue?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @alamb, I already found the issue. The reason is that I didn't update the function's signature to Utf8View.

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me @thinh2 -- thank you 🙏

I pushed another commit to your branch that adds a slt test to very execution, and merged up from main

Thank you very much for your contribution

@thinh2 thinh2 changed the title [draft] implement utf8_view for replace implement utf8_view for replace Aug 16, 2024
Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks again @thinh2

vec![
Exact(vec![Utf8View, Utf8View, Utf8View]),
Exact(vec![Utf8, Utf8, Utf8]),
Exact(vec![LargeUtf8, LargeUtf8, LargeUtf8]),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for also implementing support for LargeUTF8

@alamb
Copy link
Contributor

alamb commented Aug 17, 2024

Looks like there are some clippy errors and test updates needed to get CI clean on this PR

@thinh2
Copy link
Contributor Author

thinh2 commented Aug 17, 2024

Hi @alamb, currently my code fails at this sql logic test https://github.com/apache/datafusion/blob/main/datafusion/sqllogictest/test_files/functions.slt#L825

query T
SELECT replace(arrow_cast('foobar', 'Dictionary(Int32, Utf8)'), 'bar', 'hello')
----
foohello

The reason is because of replace's signature order. If I push Utf8 before Utf8View, it will pass. However, test in string_view.slt will fail. I am trying to understand the functions test above to fix it.

@thinh2
Copy link
Contributor Author

thinh2 commented Aug 19, 2024

Hi @alamb, I already fixed all the test errors. Could you please check and run the CI pipeline again?

@alamb
Copy link
Contributor

alamb commented Aug 19, 2024

Thank you @thinh2

@alamb
Copy link
Contributor

alamb commented Aug 19, 2024

🚀

@alamb alamb merged commit bd48262 into apache:main Aug 19, 2024
24 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
functions sqllogictest SQL Logic Tests (.slt)
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Update REPLACE scalar function to support Utf8View
2 participants