Commit e38e135
[SPARK-49204][SQL] Fix surrogate pair handling in StringInstr and StringLocate
### What changes were proposed in this pull request?
Fix the following string expressions to handle surrogate pairs properly:
- StringInstr
- StringLocate
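The issue is caused by how surrogate pairs are counted. A surrogate pair is a single Unicode code point (and a single UTF-8 character), but it is represented by 2 `char` code units in UTF-16 (Java `String`), so positions derived from UTF-16 indices are shifted for any match that follows such a character.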
Example of incorrect results (under `UNICODE` collation, but similar issues are noted for all ICU collations):
```
StringInstr("😄a", "a") // returns: 3 (incorrect), instead of: 2 (correct)
StringLocate("a", "😄a") // returns: 3 (incorrect), instead of: 2 (correct)
```
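The off-by-one in the example above can be reproduced with plain Java strings. The following is a minimal sketch, not the code from this patch: the class `SurrogateIndexSketch` and the helper `instrByCodePoint` are hypothetical names introduced for illustration, and the sketch ignores collation entirely. It only shows how a UTF-16 index differs from a code point position when a surrogate pair precedes the match.

```java
// Hypothetical illustration; NOT the Spark implementation and not collation-aware.
class SurrogateIndexSketch {

    // Returns the 1-based position of `substr` in `str`, counted in Unicode
    // code points rather than UTF-16 code units; returns 0 if there is no match.
    static int instrByCodePoint(String str, String substr) {
        int utf16Index = str.indexOf(substr); // index in UTF-16 code units
        if (utf16Index < 0) {
            return 0;
        }
        // Convert the UTF-16 offset to a code point offset, then make it 1-based.
        return str.codePointCount(0, utf16Index) + 1;
    }

    public static void main(String[] args) {
        // "😄a": U+1F604 is encoded as the surrogate pair \uD83D\uDE04.
        String s = "\uD83D\uDE04a";

        // Naive 1-based position from the UTF-16 index: prints 3.
        System.out.println(s.indexOf("a") + 1);

        // Code point based 1-based position: prints 2.
        System.out.println(instrByCodePoint(s, "a"));
    }
}
```

The design choice in the sketch is to search in UTF-16 and convert the resulting offset afterwards with `String.codePointCount`; whether the actual fix follows this approach is not stated in the commit message.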
### Why are the changes needed?
Currently, some string expressions return wrong results when their inputs contain surrogate pairs.
### Does this PR introduce _any_ user-facing change?
Yes, these expressions will now work properly with surrogate pairs: `instr`, `locate`/`position`.
### How was this patch tested?
New tests in `CollationSupportSuite`.
### Was this patch authored or co-authored using generative AI tooling?
Yes.
Closes #47711 from uros-db/surrogate-indexof.
Authored-by: Uros Bojanic <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
1 parent 7feea93 commit e38e135
File tree: 2 files changed (+386, -143 lines) under `common/unsafe/src`:
- `main/java/org/apache/spark/sql/catalyst/util`
- `test/java/org/apache/spark/unsafe/types`
Lines changed: 20 additions & 5 deletions