Commit e38e135

uros-db authored and cloud-fan committed
[SPARK-49204][SQL] Fix surrogate pair handling in StringInstr and StringLocate
### What changes were proposed in this pull request?
Fix the following string expressions to handle surrogate pairs properly:
- StringInstr
- StringLocate

The issue has to do with counting surrogate pairs, which are single Unicode code points (and single UTF-8 characters), but are represented using 2 characters in UTF-16 (Java String).

Example of incorrect results (under `UNICODE` collation, but similar issues are noted for all ICU collations):
```
StringInstr("😄a", "a") // returns: 3 (incorrect), instead of: 2 (correct)
StringLocate("a", "😄a") // returns: 3 (incorrect), instead of: 2 (correct)
```

### Why are the changes needed?
Currently, some string expressions are giving wrong results when working with surrogate pairs.

### Does this PR introduce _any_ user-facing change?
Yes, these expressions will now work properly with surrogate pairs: `instr`, `locate`/`position`.

### How was this patch tested?
New tests in `CollationSupportSuite`.

### Was this patch authored or co-authored using generative AI tooling?
Yes.

Closes #47711 from uros-db/surrogate-indexof.

Authored-by: Uros Bojanic <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
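
To make the counting issue concrete, here is a minimal sketch that uses only `java.lang.String`, with no Spark or ICU classes involved. The class name `SurrogatePairCounts` and its helper methods are hypothetical, chosen only for illustration. It shows that the sample string from the example above has 3 UTF-16 code units but 2 code points.

```java
public class SurrogatePairCounts {
  // "😄a": U+1F604 is written as its surrogate pair \uD83D\uDE04, followed by 'a'.
  static final String SAMPLE = "\uD83D\uDE04a";

  // Number of UTF-16 code units (Java chars) in the string.
  static int codeUnits(String s) {
    return s.length();
  }

  // Number of Unicode code points in the string.
  static int codePoints(String s) {
    return s.codePointCount(0, s.length());
  }

  public static void main(String[] args) {
    System.out.println(codeUnits(SAMPLE));    // 3
    System.out.println(codePoints(SAMPLE));   // 2
    System.out.println(SAMPLE.indexOf("a"));  // 2 (0-based UTF-16 index)
  }
}
```

A 1-based position derived from the UTF-16 index would be 3, which matches the incorrect result quoted above. Counting code points instead places `a` at 1-based position 2.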
1 parent 7feea93 commit e38e135

2 files changed: +386 -143 lines changed


common/unsafe/src/main/java/org/apache/spark/sql/catalyst/util/CollationAwareUTF8String.java

Lines changed: 20 additions & 5 deletions
@@ -701,11 +701,26 @@ public static int indexOf(final UTF8String target, final UTF8String pattern,
       final int start, final int collationId) {
     if (pattern.numBytes() == 0) return target.indexOfEmpty(start);
     if (target.numBytes() == 0) return MATCH_NOT_FOUND;
-
-    StringSearch stringSearch = CollationFactory.getStringSearch(target, pattern, collationId);
-    stringSearch.setIndex(start);
-
-    return stringSearch.next();
+    // Initialize the string search with respect to the specified ICU collation.
+    String targetStr = target.toValidString();
+    String patternStr = pattern.toValidString();
+    StringSearch stringSearch =
+      CollationFactory.getStringSearch(targetStr, patternStr, collationId);
+    stringSearch.setOverlapping(true);
+    // Start the search from `start`-th code point (NOT necessarily from the `start`-th character).
+    int startIndex = targetStr.offsetByCodePoints(0, start);
+    stringSearch.setIndex(startIndex);
+    // Perform the search and return the next result, starting from the specified position.
+    int searchIndex = stringSearch.next();
+    if (searchIndex == StringSearch.DONE) {
+      return MATCH_NOT_FOUND;
+    }
+    // Convert the search index from character count to code point count.
+    int indexOf = targetStr.codePointCount(0, searchIndex);
+    if (indexOf < start) {
+      return MATCH_NOT_FOUND;
+    }
+    return indexOf;
   }
 
   private static int findIndex(final StringSearch stringSearch, int count) {
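
The index conversion in the hunk above can be tried in isolation. The following is a minimal sketch, not the Spark implementation. It swaps the collation-aware ICU `StringSearch` for plain `String.indexOf` (binary matching, no collation) so that it runs on the JDK alone. The class name `CodePointIndexOf` is hypothetical, and the value `-1` for `MATCH_NOT_FOUND` is an assumption, since the diff does not show the constant's definition. It also assumes `start` lies between 0 and the number of code points in `target`, because `offsetByCodePoints` throws `IndexOutOfBoundsException` otherwise.

```java
public class CodePointIndexOf {
  // Assumed sentinel; the constant's real value is not visible in the diff.
  static final int MATCH_NOT_FOUND = -1;

  // Returns the 0-based code point index of the first occurrence of `pattern`
  // in `target` at or after the `start`-th code point, or MATCH_NOT_FOUND.
  static int indexOf(String target, String pattern, int start) {
    // Convert the code point offset into a UTF-16 index.
    int startIndex = target.offsetByCodePoints(0, start);
    // Stand-in for the ICU StringSearch: plain code-unit matching.
    int searchIndex = target.indexOf(pattern, startIndex);
    if (searchIndex < 0) {
      return MATCH_NOT_FOUND;
    }
    // Convert the UTF-16 index of the match back into a code point index.
    return target.codePointCount(0, searchIndex);
  }

  public static void main(String[] args) {
    String smileA = "\uD83D\uDE04a"; // "😄a"
    System.out.println(indexOf(smileA, "a", 0)); // 1
    System.out.println(indexOf(smileA, "a", 1)); // 1
    System.out.println(indexOf(smileA, "a", 2)); // -1
  }
}
```

A caller that reports 1-based positions would add 1 to the returned index, giving 2 for `a` in `😄a`, the value the commit message lists as correct.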
