Skip to content

Conversation

@stevomitric
Copy link
Contributor

What changes were proposed in this pull request?

Modified the StringLocate ICU codepath and added a codepoint count check for the provided offset.

Why are the changes needed?

Currently, doing the following throws an java.lang.IndexOutOfBoundsException.

select position('asd' COLLATE UNICODE, 'a', 10)

while the correct behavior is to return an false match result (0).

Does this PR introduce any user-facing change?

Yes, fixes the java.lang.IndexOutOfBoundsException error.

How was this patch tested?

New test cases in this PR.

Was this patch authored or co-authored using generative AI tooling?

No

@github-actions github-actions bot added the SQL label Aug 26, 2024
@stevomitric
Copy link
Contributor Author

Adding @uros-db to take a look.

Copy link
Contributor

@uros-db uros-db left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

edit: this PR is regarding non-empty patterns, but please check the behaviour for empty pattern (with start > 0) - it should be the same as the current UTF8_BINARY behaviour (which is arguably incorrect), but we have decided to keep the behaviour as it is

side note: we are aware of this issue (across all collations), and I think we've decided not to fix it just yet...

please see my PR from ~3 months ago:
#46581

as well as the corresponding Jira ticket:
https://issues.apache.org/jira/browse/SPARK-48284

Copy link
Contributor

@uros-db uros-db left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

just realized, these changes should already hold up well with empty patterns - I still recommend just adding some tests (you can do it in section "// Empty strings." of testStringLocate, near line 2067)

@stevomitric
Copy link
Contributor Author

I think you misunderstood the nature of the fix here.

As stated in the description, the addressed issue is only for UNICODE code path (for non empty strings).

@uros-db
Copy link
Contributor

uros-db commented Aug 26, 2024

yes, I see the fix in this PR is regarding non-empty patterns - however, please add these tests with empty patterns as well

Copy link
Member

@MaxGekk MaxGekk left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Waiting for CI.

String targetStr = target.toValidString();
String patternStr = pattern.toValidString();
// Check if `start` is out of bounds. The provided offset `start` is given in number of
// codepoints, so a simple `targetStr.length` check is not sufficient here. This check is
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for the explanation.

@MaxGekk
Copy link
Member

MaxGekk commented Aug 27, 2024

@stevomitric Highly likely this GA Run / Protobuf breaking change detection and Python CodeGen check is not related to your changes. Could you re-run this failed GA only.

@stevomitric
Copy link
Contributor Author

Checks passed. @MaxGekk

@MaxGekk
Copy link
Member

MaxGekk commented Aug 27, 2024

+1, LGTM. Merging to master.
Thank you, @stevomitric and @uros-db for review.

@MaxGekk MaxGekk closed this in eb79fc4 Aug 27, 2024
@MaxGekk
Copy link
Member

MaxGekk commented Aug 27, 2024

@stevomitric Just to double check, only master suffers from the issue, correct?

@stevomitric
Copy link
Contributor Author

@MaxGekk yes. the problem was introduces by this PR 2 weeks ago #47711

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants