-
Notifications
You must be signed in to change notification settings - Fork 180
[BugFix] Fix unexpected shift of extraction for rex with nested capture groups in named groups
#4641
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
| matchCount++; | ||
| } else { | ||
| // If extractor returns null, it might indicate an error (like invalid group name) | ||
| // Stop processing to avoid infinite loop |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm currently thinking about adding an error handling here
Signed-off-by: Jialiang Liang <[email protected]>
Signed-off-by: Jialiang Liang <[email protected]>
Signed-off-by: Jialiang Liang <[email protected]>
Signed-off-by: Jialiang Liang <[email protected]>
Signed-off-by: Jialiang Liang <[email protected]>
Signed-off-by: Jialiang Liang <[email protected]>
1a8f836 to
25629e0
Compare
rex with nested capture groups in named groups rex with nested capture groups in named groups
| fieldRex, | ||
| context.rexBuilder.makeLiteral(patternStr), | ||
| context.relBuilder.literal(i + 1), | ||
| context.rexBuilder.makeLiteral(groupName), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I found there is a namedGroups() API in Matcher (since JDK 20?). If we can get correct index here, we don't need to modify the UDFs below? Alternatively we can move capture name -> index logic here from UDFs?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That is correct - the core issue is matching named groups to their correct indices, and Pattern.namedGroups() would be the perfect solution. However, I discovered that we're blocked by a
compatibility constraint:
Pattern.namedGroups()was introduced in JDK 20- We need backward compatibility with JDK 11/17 - for
2.19-dev
I agree that directly leveraging the Pattern.namedGroups() is the right architectural approach - we should definitely migrate to it when we fully upgrade to JDK 20+. At that point, it would be a simple one-line change in CalciteRelNodeVisitor.
Signed-off-by: Jialiang Liang <[email protected]>
| "Rex pattern must contain at least one named capture group"); | ||
| } | ||
|
|
||
| // TODO: Once JDK 20+ is supported, consider using Pattern.namedGroups() API for more efficient |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
as for now, I added a TODO here @dai-chen
…ture groups in named groups (opensearch-project#4641)
Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com> Co-authored-by: Simeon Widdis <[email protected]> Co-authored-by: Manasvini B S <[email protected]> Co-authored-by: opensearch-ci-bot <[email protected]> Co-authored-by: Louis Chu <[email protected]> Co-authored-by: Chen Dai <[email protected]> Co-authored-by: Mebsina <[email protected]> Co-authored-by: Yuanchun Shen <[email protected]> Co-authored-by: opensearch-trigger-bot[bot] <98922864+opensearch-trigger-bot[bot]@users.noreply.github.com> Co-authored-by: Kai Huang <[email protected]> Co-authored-by: Peng Huo <[email protected]> Co-authored-by: Alexey Temnikov <[email protected]> Co-authored-by: Riley Jerger <[email protected]> Co-authored-by: Tomoyuki MORITA <[email protected]> Co-authored-by: Lantao Jin <[email protected]> Co-authored-by: Songkan Tang <[email protected]> Co-authored-by: qianheng <[email protected]> Co-authored-by: Simeon Widdis <[email protected]> Co-authored-by: Xinyuan Lu <[email protected]> Co-authored-by: Jialiang Liang <[email protected]> Co-authored-by: Peter Zhu <[email protected]> Co-authored-by: Vinay Krishna Pudyodu <[email protected]> Co-authored-by: expani <[email protected]> Co-authored-by: expani1729 <[email protected]> Co-authored-by: Vamsi Manohar <[email protected]> Co-authored-by: ritvibhatt <[email protected]> Co-authored-by: Xinyu Hao <[email protected]> Co-authored-by: Marc Handalian <[email protected]> Co-authored-by: Marc Handalian <[email protected]> Fix join type ambiguous issue when specify the join type with sql-like join criteria (opensearch-project#4474) Fix issue 4441 (opensearch-project#4449) Fix missing keywordsCanBeId (opensearch-project#4491) Fix the bug of explicit makeNullLiteral for UDT fields (opensearch-project#4475) Fix mapping after aggregation push down (opensearch-project#4500) Fix percentile bug (opensearch-project#4539) Fix JsonExtractAllFunctionIT failure (opensearch-project#4556) Fix sort push down into agg after project already pushed (opensearch-project#4546) Fix push down failure for min/max on derived field (opensearch-project#4572) Fix compile issue in main (opensearch-project#4608) Fix filter parsing failure on date fields with non-default format (opensearch-project#4616) Fix bin nested fields issue (opensearch-project#4606) Fix: Support Alias Fields in MIN, MAX, FIRST, LAST, and TAKE Aggregations (opensearch-project#4621) fix rename issue (opensearch-project#4670) Fixes for `Multisearch` and `Append` command (opensearch-project#4512) Fix asc/desc keyword behavior for sort command (opensearch-project#4651) Fix] Fix unexpected shift of extraction for `rex` with nested capture groups in named groups (opensearch-project#4641) Fix CVE-2025-48924 (opensearch-project#4665) Fix sub-fields accessing of generated structs (opensearch-project#4683) Fix] Incorrect Field Index Mapping in AVG to SUM/COUNT Conversion (opensearch-project#15)
Description
Fix unexpected shift of extraction for
rexwith nested capture groups in named groupsThe rex command in PPL had a critical bug when using named capture groups that contained nested unnamed groups. This caused extracted field values to shift by one position, producing incorrect results.
1,2,3... but nested groups create non-sequential indices1,3,5...matcher.group(groupName))Example of the Bug
Query:
Expected Result (correct):
Actual Result (wrong):
Root Cause
When Java's regex engine processes the pattern
(?<user>(amber|hattie))[a-z]*, it assigns group numbers to ALL capture groups (named and unnamed):Pattern:
(?<user>(amber|hattie))[a-z]*@(?<domain>(pyrami|netagy))\.(?<tld>(com|org))Group Assignment:
(?<user>...)← Named group "user"(amber|hattie)← Unnamed nested group(?<domain>...)← Named group "domain"(pyrami|netagy)← Unnamed nested group(?<tld>...)← Named group "tld"(com|org)← Unnamed nested groupBefore the fix, the bug is in
CalciteRelNodeVisitor.java(lines 265-321). The code does:The code assumes named groups are at indices
1,2,3, ... but the actual indices are1,3,5, ... due to the unnamed nested groups.With the above buggy logic:
REX_EXTRACT(field, pattern, 1)→ Gets Group 1(?<user>...)= "amberduke" → CORRECTREX_EXTRACT(field, pattern, 2)→ Gets Group 2(amber|hattie)= "amber" → WRONGREX_EXTRACT(field, pattern, 3)→ Gets Group 3(?<domain>...)= "pyrami" → WRONGThe second and third extractions are off by one group because they hit the unnamed nested groups.
Related Issues
rexcommand off-by-one error with nested capture groups in named groups #4466Check List
--signoffor-s.By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.