[SPARK-49204][SQL][FOLLOWUP] Fix IndexOutOfBoundsException when dealing with surrogate pairs #47871
Adding @uros-db to take a look.
edit: this PR is regarding non-empty patterns, but please check the behaviour for empty pattern (with start > 0) - it should be the same as the current UTF8_BINARY behaviour (which is arguably incorrect), but we have decided to keep the behaviour as it is
side note: we are aware of this issue (across all collations), and I think we've decided not to fix it just yet...
please see my PR from ~3 months ago:
#46581
as well as the corresponding Jira ticket:
https://issues.apache.org/jira/browse/SPARK-48284
common/unsafe/src/test/java/org/apache/spark/unsafe/types/CollationSupportSuite.java
uros-db
left a comment
just realized, these changes should already hold up well with empty patterns - I still recommend just adding some tests (you can do it in section "// Empty strings." of testStringLocate, near line 2067)
I think you misunderstood the nature of the fix here. As stated in the description, the addressed issue is only for UNICODE code path (for non empty strings).
yes, I see the fix in this PR is regarding non-empty patterns - however, please add these tests with empty patterns as well
MaxGekk
left a comment
Waiting for CI.
String targetStr = target.toValidString();
String patternStr = pattern.toValidString();
// Check if `start` is out of bounds. The provided offset `start` is given in number of
// codepoints, so a simple `targetStr.length` check is not sufficient here. This check is
Thank you for the explanation.
@stevomitric Highly likely this GA Run / Protobuf breaking change detection and Python CodeGen check is not related to your changes. Could you re-run only this failed GA?
Checks passed. @MaxGekk
+1, LGTM. Merging to master.
@stevomitric Just to double check, only master suffers from the issue, correct?
What changes were proposed in this pull request?
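Modified the StringLocate ICU codepath and added a codepoint count check for the provided offset.
Why are the changes needed?
Currently, locating a non-empty pattern on the UNICODE code path with a start offset that is out of bounds when counted in codepoints (possible when the target contains surrogate pairs) throws a java.lang.IndexOutOfBoundsException, while the correct behavior is to return a false match result (0).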
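To illustrate why a length check alone is not enough, here is a minimal sketch using only java.lang.String. It is not the code from this PR: the class CodePointOffsetSketch and the helper indexOfFromCodePoint are hypothetical names, the search uses String.indexOf rather than ICU, and it returns a 0-based codepoint index (or -1) rather than Spark's locate result.

```java
class CodePointOffsetSketch {
    // Hypothetical helper (an assumption for illustration, not Spark's implementation):
    // finds `pattern` in `target` starting at codepoint offset `start`,
    // returning a 0-based codepoint index, or -1 when there is no match.
    static int indexOfFromCodePoint(String target, String pattern, int start) {
        // `start` counts codepoints, while String.length() counts UTF-16 code units,
        // so the bound has to be checked against the codepoint count.
        int codePoints = target.codePointCount(0, target.length());
        if (start < 0 || start > codePoints) {
            return -1;
        }
        // Without the check above, offsetByCodePoints would throw
        // IndexOutOfBoundsException for an out-of-range codepoint offset.
        int charStart = target.offsetByCodePoints(0, start);
        int charIndex = target.indexOf(pattern, charStart);
        return charIndex < 0 ? -1 : target.codePointCount(0, charIndex);
    }

    public static void main(String[] args) {
        // 'a' followed by U+1F600, which is encoded as one surrogate pair.
        String target = "a\uD83D\uDE00";
        System.out.println(target.length());                            // 3 code units
        System.out.println(target.codePointCount(0, target.length()));  // 2 codepoints
        // start = 3 passes a naive `start <= target.length()` check,
        // but is out of bounds in codepoints.
        System.out.println(indexOfFromCodePoint(target, "a", 3));       // -1
        System.out.println(indexOfFromCodePoint(target, "\uD83D\uDE00", 1)); // 1
    }
}
```

The design point is that the bound is computed in the same unit as the offset (codepoints) before any conversion to a char index takes place.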
Does this PR introduce any user-facing change?
Yes, fixes the java.lang.IndexOutOfBoundsException error.
How was this patch tested?
New test cases in this PR.
Was this patch authored or co-authored using generative AI tooling?
No