@RyanL1997 RyanL1997 commented Oct 22, 2025

Description

Fix unexpected shift of extraction for rex with nested capture groups in named groups

The rex command in PPL had a critical bug when using named capture groups that contained nested unnamed groups. This caused extracted field values to shift by one position, producing incorrect results.

  • Root Cause: Code used sequential indices 1, 2, 3... but nested groups create non-sequential indices 1, 3, 5...
  • Solution: Bypass index calculation entirely by using Java's native named group extraction (matcher.group(groupName))

Example of the Bug

Query:

curl -X POST "localhost:9200/_plugins/_ppl" \
    -H "Content-Type: application/json" \
    -d '{
      "query": "source=accounts | rex field=email \"(?<user>(amber|hattie|nanette)[a-z]*)@(?<domain>(pyrami|netagy|quility))\\.(?<tld>(com|org))\" | fields user, domain, tld | head 1"
    }'

Expected Result (correct):

["amberduke", "pyrami", "com"]

Actual Result (wrong):

["amberduke", "amber", "pyrami"]

Root Cause

When Java's regex engine processes a pattern such as (?<user>(amber|hattie)[a-z]*), it assigns group numbers to ALL capture groups (named and unnamed), in the order of their opening parentheses:

Pattern: (?<user>(amber|hattie)[a-z]*)@(?<domain>(pyrami|netagy))\.(?<tld>(com|org))

Group Assignment:

  • Group 0: Entire match
  • Group 1: (?<user>...) ← Named group "user"
  • Group 2: (amber|hattie) ← Unnamed nested group
  • Group 3: (?<domain>...) ← Named group "domain"
  • Group 4: (pyrami|netagy) ← Unnamed nested group
  • Group 5: (?<tld>...) ← Named group "tld"
  • Group 6: (com|org) ← Unnamed nested group
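
To make the numbering concrete, here is a minimal standalone sketch that prints every numbered group for one sample value. The class name GroupNumberingDemo and the input "amberduke@pyrami.com" are assumptions for illustration (the input is reconstructed from the expected result above); only java.util.regex is used, nothing from the plugin.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class, not part of the plugin code base.
final class GroupNumberingDemo {
    // Trimmed version of the pattern from the query in this PR.
    static final Pattern EMAIL = Pattern.compile(
        "(?<user>(amber|hattie)[a-z]*)@(?<domain>(pyrami|netagy))\\.(?<tld>(com|org))");

    /** Returns groups 0..groupCount() of the first match, or an empty array if nothing matches. */
    static String[] numberedGroups(String input) {
        Matcher m = EMAIL.matcher(input);
        if (!m.find()) {
            return new String[0];
        }
        String[] groups = new String[m.groupCount() + 1];
        for (int i = 0; i <= m.groupCount(); i++) {
            groups[i] = m.group(i);
        }
        return groups;
    }

    public static void main(String[] args) {
        // Assumed sample value; any address matching the pattern works.
        String[] groups = numberedGroups("amberduke@pyrami.com");
        for (int i = 0; i < groups.length; i++) {
            System.out.println("group " + i + " = " + groups[i]);
        }
    }
}
```

Groups 1, 3, and 5 hold the named captures, while groups 2, 4, and 6 hold the nested unnamed ones, which is exactly the interleaving listed above.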

Before the fix, the bug lived in CalciteRelNodeVisitor.java (lines 265-321), which built the extraction calls like this:

  List<String> namedGroups = RegexCommonUtils.getNamedGroupCandidates(patternStr);
  // namedGroups = ["user", "domain", "tld"]

  for (int i = 0; i < namedGroups.size(); i++) {
      extractCall = PPLFuncImpTable.INSTANCE.resolve(
          context.rexBuilder,
          BuiltinFunctionName.REX_EXTRACT,
          fieldRex,
          context.rexBuilder.makeLiteral(patternStr),
          context.relBuilder.literal(i + 1));  // ← WRONG: Assumes sequential named groups
      // ...
  }

The code assumes named groups are at indices 1, 2, 3, ... but the actual indices are 1, 3, 5, ... due to the unnamed nested groups.

With the above buggy logic:

  • REX_EXTRACT(field, pattern, 1) → Gets Group 1 (?<user>...) = "amberduke" → CORRECT
  • REX_EXTRACT(field, pattern, 2) → Gets Group 2 (amber|hattie) = "amber" → WRONG (assigned to domain; expected "pyrami")
  • REX_EXTRACT(field, pattern, 3) → Gets Group 3 (?<domain>...) = "pyrami" → WRONG (assigned to tld; expected "com")

The second extraction hits an unnamed nested group, and the third hits the named group that belongs to the previous field, so both domain and tld receive shifted values.

The resulting logical plan carries the wrong indices:

LogicalProject(
    user=[REX_EXTRACT($7, '(?<user>(amber|hattie)[a-z]*)@(?<domain>(pyrami|netagy))\.(?<tld>(com|org))', 1)],
    domain=[REX_EXTRACT($7, '(?<user>(amber|hattie)[a-z]*)@(?<domain>(pyrami|netagy))\.(?<tld>(com|org))', 2)],  -- Wrong!
    tld=[REX_EXTRACT($7, '(?<user>(amber|hattie)[a-z]*)@(?<domain>(pyrami|netagy))\.(?<tld>(com|org))', 3)]     -- Wrong!
)
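
The difference between the two strategies can be reproduced outside the plugin. The sketch below is an illustration only: RexExtractSketch and its two methods are hypothetical names, and the input value is assumed. One method mimics the pre-fix "i + 1" lookup, the other uses matcher.group(groupName) as described in the Solution above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch; it does not reproduce the real REX_EXTRACT UDF.
final class RexExtractSketch {
    static final String PATTERN =
        "(?<user>(amber|hattie)[a-z]*)@(?<domain>(pyrami|netagy))\\.(?<tld>(com|org))";
    static final List<String> NAMES = List.of("user", "domain", "tld");

    /** Pre-fix behaviour: the i-th named group is read from numbered group i + 1. */
    static List<String> extractBySequentialIndex(String input) {
        Matcher m = Pattern.compile(PATTERN).matcher(input);
        List<String> out = new ArrayList<>();
        if (m.find()) {
            for (int i = 0; i < NAMES.size(); i++) {
                out.add(m.group(i + 1)); // wrong once nested groups exist
            }
        }
        return out;
    }

    /** Post-fix idea: look each group up by name, so numbering never matters. */
    static List<String> extractByName(String input) {
        Matcher m = Pattern.compile(PATTERN).matcher(input);
        List<String> out = new ArrayList<>();
        if (m.find()) {
            for (String name : NAMES) {
                out.add(m.group(name));
            }
        }
        return out;
    }

    public static void main(String[] args) {
        String email = "amberduke@pyrami.com"; // assumed sample value
        System.out.println(extractBySequentialIndex(email)); // [amberduke, amber, pyrami]
        System.out.println(extractByName(email));            // [amberduke, pyrami, com]
    }
}
```

The first line of output matches the "Actual Result (wrong)" shown earlier, and the second matches the expected result.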

Related Issues

Check List

  • New functionality includes testing.
  • New functionality has been documented.
  • New functionality has javadoc added.
  • New functionality has a user manual doc added.
  • New PPL command checklist all confirmed.
  • API changes companion pull request created.
  • Commits are signed per the DCO using --signoff or -s.
  • Public documentation issue/PR created.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

matchCount++;
} else {
// If extractor returns null, it might indicate an error (like invalid group name)
// Stop processing to avoid infinite loop
Collaborator Author

I'm currently thinking about adding error handling here.

Signed-off-by: Jialiang Liang <[email protected]>
@RyanL1997 RyanL1997 changed the title from "[BugFix] Fix the off-by-one error for rex with nested capture groups in named groups" to "[BugFix] Fix unexpected shift of extraction for rex with nested capture groups in named groups" Oct 25, 2025
Swiddis previously approved these changes Oct 28, 2025
fieldRex,
context.rexBuilder.makeLiteral(patternStr),
context.relBuilder.literal(i + 1),
context.rexBuilder.makeLiteral(groupName),
Collaborator

@dai-chen dai-chen Oct 28, 2025

I found there is a namedGroups() API in Matcher (since JDK 20?). If we can get correct index here, we don't need to modify the UDFs below? Alternatively we can move capture name -> index logic here from UDFs?

Collaborator Author

That is correct - the core issue is matching named groups to their correct indices, and Pattern.namedGroups() would be the perfect solution. However, I discovered that we're blocked by a compatibility constraint:

  • Pattern.namedGroups() was introduced in JDK 20
  • We need backward compatibility with JDK 11/17 - for 2.19-dev

I agree that directly leveraging the Pattern.namedGroups() is the right architectural approach - we should definitely migrate to it when we fully upgrade to JDK 20+. At that point, it would be a simple one-line change in CalciteRelNodeVisitor.
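
For reference, a name-to-index mapping can also be approximated on JDK 11/17 by scanning the pattern text. The sketch below is only an assumption-laden illustration of that alternative, not what this PR implements (the PR relies on matcher.group(groupName)). NamedGroupIndexSketch and namedGroupIndices are hypothetical names, and the scanner deliberately ignores nested character classes, \Q...\E quoting, and comments mode.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical, simplified scanner; on JDK 20+ Pattern.namedGroups() returns this mapping directly.
final class NamedGroupIndexSketch {
    /** Maps each named group to its capture index by counting capturing '(' from left to right. */
    static Map<String, Integer> namedGroupIndices(String regex) {
        Map<String, Integer> indices = new LinkedHashMap<>();
        int groupNumber = 0;
        boolean inCharClass = false;
        for (int i = 0; i < regex.length(); i++) {
            char c = regex.charAt(i);
            if (c == '\\') {
                i++; // skip the escaped character, e.g. \( or \[
                continue;
            }
            if (inCharClass) {
                if (c == ']') {
                    inCharClass = false;
                }
                continue; // '(' inside [...] is a literal
            }
            if (c == '[') {
                inCharClass = true;
                continue;
            }
            if (c != '(') {
                continue;
            }
            boolean special = i + 1 < regex.length() && regex.charAt(i + 1) == '?';
            if (!special) {
                groupNumber++; // plain unnamed capturing group
                continue;
            }
            // "(?<name>" captures; "(?<=" and "(?<!" are lookbehinds; "(?:" etc. do not capture.
            boolean named = i + 3 < regex.length()
                && regex.charAt(i + 2) == '<'
                && regex.charAt(i + 3) != '='
                && regex.charAt(i + 3) != '!';
            if (named) {
                int end = regex.indexOf('>', i + 3);
                if (end > 0) {
                    groupNumber++;
                    indices.put(regex.substring(i + 3, end), groupNumber);
                    i = end;
                }
            }
        }
        return indices;
    }

    public static void main(String[] args) {
        String p = "(?<user>(amber|hattie)[a-z]*)@(?<domain>(pyrami|netagy))\\.(?<tld>(com|org))";
        System.out.println(namedGroupIndices(p)); // {user=1, domain=3, tld=5}
    }
}
```

A hand-rolled scanner like this has to track regex syntax edge cases, which is one reason looking groups up by name through the Matcher is the lower-risk choice until Pattern.namedGroups() is available.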

"Rex pattern must contain at least one named capture group");
}

// TODO: Once JDK 20+ is supported, consider using Pattern.namedGroups() API for more efficient
Collaborator Author

For now, I added a TODO here @dai-chen

@RyanL1997 RyanL1997 merged commit 0c1ec27 into opensearch-project:main Oct 29, 2025
51 of 52 checks passed
@RyanL1997 RyanL1997 deleted the rex-extract-fix branch October 29, 2025 00:34
opensearch-trigger-bot bot pushed a commit that referenced this pull request Oct 29, 2025
…ture groups in named groups (#4641)

(cherry picked from commit 0c1ec27)
Signed-off-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
RyanL1997 pushed a commit that referenced this pull request Oct 29, 2025
…ture groups in named groups (#4641) (#4692)

Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
expani pushed a commit to vinaykpud/sql that referenced this pull request Nov 4, 2025
sandeshkr419 added a commit to sandeshkr419/sql that referenced this pull request Dec 3, 2025
Labels

  • backport 2.19-dev
  • bug (Something isn't working)
  • bugFix
  • PPL (Piped processing language)

Development

Successfully merging this pull request may close these issues.

[BUG] rex command off-by-one error with nested capture groups in named groups

3 participants