Merged
48 commits
7e5bd57
rex - initial implementation
RyanL1997 Aug 22, 2025
f070540
stop using utils
RyanL1997 Aug 22, 2025
6362dc6
fix spotless check
RyanL1997 Aug 22, 2025
fe9676e
offset_field - initial implementation
RyanL1997 Aug 22, 2025
9eb69f6
max_match - initial implementation
RyanL1997 Aug 22, 2025
8ced572
sed - initial implementation
RyanL1997 Aug 22, 2025
97536c2
fix name capture group for extraction
RyanL1997 Aug 23, 2025
b77aab7
add rex rst doc
RyanL1997 Aug 23, 2025
63c01f6
IT - initial setup
RyanL1997 Aug 24, 2025
08edb1a
add a analyzer test for legacy engine
RyanL1997 Aug 24, 2025
8365237
Add UT for rex
RyanL1997 Aug 24, 2025
f9adf39
sed - add pushdown for sed and explain IT and IT with fix
RyanL1997 Aug 24, 2025
783bfe2
anonymizer - add rex for anonymizer and test
RyanL1997 Aug 24, 2025
422f475
Add cross cluster IT for rex
RyanL1997 Aug 24, 2025
bbd4f17
peng - resolve comments for rst doc 0
RyanL1997 Aug 25, 2025
9c7359a
peng - address some comments 1
RyanL1997 Aug 26, 2025
96de984
peng - resolve comment in rst doc to add a java doc link
RyanL1997 Aug 26, 2025
ce4b9ac
kai - modify the bin ast builder test
RyanL1997 Aug 26, 2025
53645a8
peng - fix the extraction behavior without filter even when there is …
RyanL1997 Aug 26, 2025
28a24c2
fix rex explain no pushdown
RyanL1997 Aug 27, 2025
de1f8a2
change the offset val output format
RyanL1997 Aug 27, 2025
d15bed9
fix rst file
RyanL1997 Aug 27, 2025
d22f733
peng - SWITCH TO USE CALCITE NATIVE OPERATORS
RyanL1997 Aug 28, 2025
5182e99
Peng - fix tests after operator change
RyanL1997 Aug 28, 2025
0ce00b2
support mode=extract and update doc
RyanL1997 Aug 28, 2025
68c2482
fix the issue after rebase
RyanL1997 Aug 28, 2025
1c42e8d
peng - enforce specifying field in antlr for now
RyanL1997 Aug 29, 2025
d888d00
relocate rex cmd IT
RyanL1997 Aug 29, 2025
8162765
peng - simplify vistFunciton
RyanL1997 Aug 29, 2025
fa9af85
peng - add UT for RexExtractMultiFunction
RyanL1997 Aug 29, 2025
a17963c
peng - add UT RexOffsetFunction
RyanL1997 Aug 29, 2025
764f400
fix some tests
RyanL1997 Aug 29, 2025
a711d72
DECOUPLE SED + OFFSET FIELD
RyanL1997 Aug 29, 2025
8c1ec27
Improve error handling for extract
RyanL1997 Aug 29, 2025
691c0fa
add this rex rst into index
RyanL1997 Aug 29, 2025
0ae84d9
fix return type in extract multi
RyanL1997 Aug 29, 2025
ab90dea
add rex doc into doc test
RyanL1997 Aug 29, 2025
d062781
fix doc test
RyanL1997 Aug 29, 2025
9c3f72e
Fix linting
RyanL1997 Sep 2, 2025
47dddae
fix rebase issue
RyanL1997 Sep 2, 2025
13132ef
fix regex anonymizer tests
RyanL1997 Sep 3, 2025
6419982
fix analyzer test and setup to use util function
RyanL1997 Sep 3, 2025
2f4279a
lint fix
RyanL1997 Sep 4, 2025
8133d8c
fix doc test
RyanL1997 Sep 4, 2025
b8a47f4
Add max match limit implementation
RyanL1997 Sep 5, 2025
e0360a1
fix anonymizer test
RyanL1997 Sep 5, 2025
a8df4e7
peng - simplify if
RyanL1997 Sep 5, 2025
b5e8e53
peng - make extract multi to only handle the case of max_match > 1
RyanL1997 Sep 5, 2025
Original file line number Diff line number Diff line change
Expand Up @@ -29,6 +29,7 @@ public enum Key {
PATTERN_MODE("plugins.ppl.pattern.mode"),
PATTERN_MAX_SAMPLE_COUNT("plugins.ppl.pattern.max.sample.count"),
PATTERN_BUFFER_LIMIT("plugins.ppl.pattern.buffer.limit"),
PPL_REX_MAX_MATCH_LIMIT("plugins.ppl.rex.max_match.limit"),

/** Enable Calcite as execution engine */
CALCITE_ENGINE_ENABLED("plugins.calcite.enabled"),
Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -85,6 +85,7 @@
import org.opensearch.sql.ast.tree.RelationSubquery;
import org.opensearch.sql.ast.tree.Rename;
import org.opensearch.sql.ast.tree.Reverse;
import org.opensearch.sql.ast.tree.Rex;
import org.opensearch.sql.ast.tree.Sort;
import org.opensearch.sql.ast.tree.Sort.SortOption;
import org.opensearch.sql.ast.tree.SubqueryAlias;
Expand Down Expand Up @@ -751,6 +752,11 @@ public LogicalPlan visitRegex(Regex node, AnalysisContext context) {
throw getOnlyForCalciteException("Regex");
}

@Override
public LogicalPlan visitRex(Rex node, AnalysisContext context) {
throw getOnlyForCalciteException("Rex");
}

@Override
public LogicalPlan visitPaginate(Paginate paginate, AnalysisContext context) {
LogicalPlan child = paginate.getChild().get(0).accept(this, context);
Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -73,6 +73,7 @@
import org.opensearch.sql.ast.tree.RelationSubquery;
import org.opensearch.sql.ast.tree.Rename;
import org.opensearch.sql.ast.tree.Reverse;
import org.opensearch.sql.ast.tree.Rex;
import org.opensearch.sql.ast.tree.SPath;
import org.opensearch.sql.ast.tree.Sort;
import org.opensearch.sql.ast.tree.SubqueryAlias;
Expand Down Expand Up @@ -270,6 +271,10 @@ public T visitRegex(Regex node, C context) {
return visitChildren(node, context);
}

public T visitRex(Rex node, C context) {
return visitChildren(node, context);
}

public T visitLambdaFunction(LambdaFunction node, C context) {
return visitChildren(node, context);
}
Expand Down
75 changes: 75 additions & 0 deletions core/src/main/java/org/opensearch/sql/ast/tree/Rex.java
Original file line number Diff line number Diff line change
@@ -0,0 +1,75 @@
/*
 * Copyright OpenSearch Contributors
 * SPDX-License-Identifier: Apache-2.0
 */

package org.opensearch.sql.ast.tree;

import com.google.common.collect.ImmutableList;
import java.util.List;
import java.util.Optional;
import lombok.EqualsAndHashCode;
import lombok.Getter;
import lombok.Setter;
import lombok.ToString;
import org.opensearch.sql.ast.AbstractNodeVisitor;
import org.opensearch.sql.ast.expression.Literal;
import org.opensearch.sql.ast.expression.UnresolvedExpression;

/** AST node represent Rex field extraction operation. */
@Getter
@ToString
@EqualsAndHashCode(callSuper = false)
public class Rex extends UnresolvedPlan {

  public enum RexMode {
    EXTRACT
  }

  /** Field to extract from. */
  private final UnresolvedExpression field;

  /** Pattern with named capture groups. */
  private final Literal pattern;

  /** Rex mode (only EXTRACT supported). */
  private final RexMode mode;

  /** Maximum number of matches (optional). */
  private final Optional<Integer> maxMatch;

  /** Child Plan. */
  @Setter private UnresolvedPlan child;

  public Rex(UnresolvedExpression field, Literal pattern) {
    this(field, pattern, RexMode.EXTRACT, Optional.empty());
  }

  public Rex(UnresolvedExpression field, Literal pattern, Optional<Integer> maxMatch) {
    this(field, pattern, RexMode.EXTRACT, maxMatch);
  }

  public Rex(
      UnresolvedExpression field, Literal pattern, RexMode mode, Optional<Integer> maxMatch) {
    this.field = field;
    this.pattern = pattern;
    this.mode = mode;
    this.maxMatch = maxMatch;
  }

  @Override
  public Rex attach(UnresolvedPlan child) {
    this.child = child;
    return this;
  }

  @Override
  public List<UnresolvedPlan> getChild() {
    return ImmutableList.of(child);
  }

  @Override
  public <T, C> T accept(AbstractNodeVisitor<T, C> nodeVisitor, C context) {
    return nodeVisitor.visitRex(this, context);
  }
}
Original file line number Diff line number Diff line change
Expand Up @@ -108,6 +108,7 @@
import org.opensearch.sql.ast.tree.Regex;
import org.opensearch.sql.ast.tree.Relation;
import org.opensearch.sql.ast.tree.Rename;
import org.opensearch.sql.ast.tree.Rex;
import org.opensearch.sql.ast.tree.SPath;
import org.opensearch.sql.ast.tree.Sort;
import org.opensearch.sql.ast.tree.Sort.SortOption;
Expand All @@ -130,6 +131,7 @@
import org.opensearch.sql.exception.SemanticCheckException;
import org.opensearch.sql.expression.function.BuiltinFunctionName;
import org.opensearch.sql.expression.function.PPLFuncImpTable;
import org.opensearch.sql.expression.parse.RegexCommonUtils;
import org.opensearch.sql.utils.ParseUtils;

public class CalciteRelNodeVisitor extends AbstractNodeVisitor<RelNode, CalcitePlanContext> {
Expand Down Expand Up @@ -208,6 +210,50 @@ public RelNode visitRegex(Regex node, CalcitePlanContext context) {
return context.relBuilder.peek();
}

  public RelNode visitRex(Rex node, CalcitePlanContext context) {
    visitChildren(node, context);

    RexNode fieldRex = rexVisitor.analyze(node.getField(), context);
    String patternStr = (String) node.getPattern().getValue();

    List<String> namedGroups = RegexCommonUtils.getNamedGroupCandidates(patternStr);

    if (namedGroups.isEmpty()) {
      throw new IllegalArgumentException(
          "Rex pattern must contain at least one named capture group");
    }

    List<RexNode> newFields = new ArrayList<>();
    List<String> newFieldNames = new ArrayList<>();

    for (int i = 0; i < namedGroups.size(); i++) {
      RexNode extractCall;
      if (node.getMaxMatch().isPresent() && node.getMaxMatch().get() > 1) {
        extractCall =
            PPLFuncImpTable.INSTANCE.resolve(
                context.rexBuilder,
                BuiltinFunctionName.REX_EXTRACT_MULTI,
                fieldRex,
                context.rexBuilder.makeLiteral(patternStr),
                context.relBuilder.literal(i + 1),
                context.relBuilder.literal(node.getMaxMatch().get()));
      } else {
        extractCall =
            PPLFuncImpTable.INSTANCE.resolve(
                context.rexBuilder,
                BuiltinFunctionName.REX_EXTRACT,
                fieldRex,
                context.rexBuilder.makeLiteral(patternStr),
                context.relBuilder.literal(i + 1));
      }
      newFields.add(extractCall);
      newFieldNames.add(namedGroups.get(i));
    }

    projectPlusOverriding(newFields, newFieldNames, context);
    return context.relBuilder.peek();
  }

private boolean containsSubqueryExpression(Node expr) {
if (expr == null) {
return false;
Expand Down
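The `visitRex` hunk above pairs the i-th named group with capture-group index `i + 1`. A minimal standalone sketch of that pairing follows, using only `java.util.regex`. The `namedGroups` helper is a hypothetical stand-in for `RegexCommonUtils.getNamedGroupCandidates`, whose implementation is not shown in this diff, and `extractByPosition` is an illustrative name, not part of the PR.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class NamedGroupSketch {

  // Hypothetical scanner for "(?<name>" openers. Lookbehinds such as "(?<=" or "(?<!"
  // are skipped because a letter must follow the '<'.
  private static final Pattern NAMED_GROUP = Pattern.compile("\\(\\?<([a-zA-Z][a-zA-Z0-9]*)>");

  /** Returns named-group names in order of appearance (assumed behavior of the real helper). */
  static List<String> namedGroups(String pattern) {
    List<String> names = new ArrayList<>();
    Matcher m = NAMED_GROUP.matcher(pattern);
    while (m.find()) {
      names.add(m.group(1));
    }
    return names;
  }

  /** Pairs the i-th named group with capture-group index i + 1, as the visitor loop does. */
  static Map<String, String> extractByPosition(String text, String pattern) {
    List<String> names = namedGroups(pattern);
    Matcher matcher = Pattern.compile(pattern).matcher(text);
    boolean found = matcher.find();
    Map<String, String> fields = new LinkedHashMap<>();
    for (int i = 0; i < names.size(); i++) {
      int groupIndex = i + 1;
      boolean inRange = found && groupIndex <= matcher.groupCount();
      fields.put(names.get(i), inRange ? matcher.group(groupIndex) : null);
    }
    return fields;
  }

  public static void main(String[] args) {
    Map<String, String> fields =
        extractByPosition("amber@example.com", "(?<user>[^@]+)@(?<domain>.+)");
    fields.forEach((name, value) -> System.out.println(name + " = " + value));
  }
}
```

In this sketch the positional pairing only lines up when every capture group in the pattern is named; an unnamed group placed before a named one would shift the indices by one.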
Original file line number Diff line number Diff line change
Expand Up @@ -51,6 +51,17 @@ private PPLOperandTypes() {}
UDFOperandMetadata.wrap((FamilyOperandTypeChecker) OperandTypes.NUMERIC_NUMERIC);
public static final UDFOperandMetadata STRING_INTEGER =
UDFOperandMetadata.wrap(OperandTypes.family(SqlTypeFamily.CHARACTER, SqlTypeFamily.INTEGER));
public static final UDFOperandMetadata STRING_STRING_INTEGER =
UDFOperandMetadata.wrap(
OperandTypes.family(
SqlTypeFamily.CHARACTER, SqlTypeFamily.CHARACTER, SqlTypeFamily.INTEGER));
public static final UDFOperandMetadata STRING_STRING_INTEGER_INTEGER =
UDFOperandMetadata.wrap(
OperandTypes.family(
SqlTypeFamily.CHARACTER,
SqlTypeFamily.CHARACTER,
SqlTypeFamily.INTEGER,
SqlTypeFamily.INTEGER));

public static final UDFOperandMetadata NUMERIC_NUMERIC_OPTIONAL_NUMERIC =
UDFOperandMetadata.wrap(
Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -219,6 +219,8 @@ public enum BuiltinFunctionName {
POSITION(FunctionName.of("position")),
REGEXP(FunctionName.of("regexp")),
REGEX_MATCH(FunctionName.of("regex_match")),
REX_EXTRACT(FunctionName.of("REX_EXTRACT")),
REX_EXTRACT_MULTI(FunctionName.of("REX_EXTRACT_MULTI")),
REPLACE(FunctionName.of("replace")),
REVERSE(FunctionName.of("reverse")),
RIGHT(FunctionName.of("right")),
Expand Down Expand Up @@ -315,7 +317,10 @@ public enum BuiltinFunctionName {
INTERNAL_UNCOLLECT_PATTERNS(FunctionName.of("uncollect_patterns")),
INTERNAL_REGEXP_EXTRACT(FunctionName.of("regexp_extract"), true),
INTERNAL_GROK(FunctionName.of("grok"), true),
INTERNAL_REGEXP_REPLACE_3(FunctionName.of("regexp_replace_3"), true);
INTERNAL_REGEXP_REPLACE_3(FunctionName.of("regexp_replace_3"), true),
INTERNAL_REGEXP_REPLACE_PG_4(FunctionName.of("regexp_replace_pg_4"), true),
INTERNAL_REGEXP_REPLACE_5(FunctionName.of("regexp_replace_5"), true),
INTERNAL_TRANSLATE3(FunctionName.of("translate3"), true);

private final FunctionName name;
private boolean isInternal;
Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -56,6 +56,8 @@
import org.opensearch.sql.expression.function.udf.CryptographicFunction;
import org.opensearch.sql.expression.function.udf.GrokFunction;
import org.opensearch.sql.expression.function.udf.RelevanceQueryFunction;
import org.opensearch.sql.expression.function.udf.RexExtractFunction;
import org.opensearch.sql.expression.function.udf.RexExtractMultiFunction;
import org.opensearch.sql.expression.function.udf.SpanFunction;
import org.opensearch.sql.expression.function.udf.condition.EarliestFunction;
import org.opensearch.sql.expression.function.udf.condition.EnhancedCoalesceFunction;
Expand Down Expand Up @@ -401,6 +403,9 @@ public class PPLBuiltinOperators extends ReflectiveSqlOperatorTable {
public static final SqlOperator RANGE_BUCKET =
new org.opensearch.sql.expression.function.udf.binning.RangeBucketFunction()
.toUDF("RANGE_BUCKET");
public static final SqlOperator REX_EXTRACT = new RexExtractFunction().toUDF("REX_EXTRACT");
public static final SqlOperator REX_EXTRACT_MULTI =
new RexExtractMultiFunction().toUDF("REX_EXTRACT_MULTI");

// Aggregation functions
public static final SqlAggFunction AVG_NULLABLE = new NullableSqlAvgAggFunction(SqlKind.AVG);
Expand Down
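`RexExtractMultiFunction` is registered here but its body is not part of this excerpt. As a rough illustration of what a `max_match`-bounded extraction could look like, the sketch below collects up to `maxMatch` values of one capture group. The class name, the method name, and the choice to return an empty list when nothing matches are all assumptions, not the PR's actual behavior.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class RexExtractMultiSketch {

  /**
   * Collects the value of one capture group from successive matches, stopping after
   * maxMatch values. Hypothetical helper; the real UDF may differ in return type and
   * null handling.
   */
  static List<String> extractMultipleGroups(
      String text, String pattern, int groupIndex, int maxMatch) {
    Matcher matcher = Pattern.compile(pattern).matcher(text);
    List<String> matches = new ArrayList<>();
    while (matches.size() < maxMatch && matcher.find()) {
      if (groupIndex > 0 && groupIndex <= matcher.groupCount()) {
        String value = matcher.group(groupIndex);
        if (value != null) { // a group that did not participate in the match yields null
          matches.add(value);
        }
      }
    }
    return matches;
  }

  public static void main(String[] args) {
    // Four numbers are present, but only the first three are collected.
    System.out.println(extractMultipleGroups("a1 b22 c333 d4444", "(?<num>\\d+)", 1, 3));
  }
}
```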
Original file line number Diff line number Diff line change
Expand Up @@ -83,6 +83,9 @@
import static org.opensearch.sql.expression.function.BuiltinFunctionName.INTERNAL_PATTERN_PARSER;
import static org.opensearch.sql.expression.function.BuiltinFunctionName.INTERNAL_REGEXP_EXTRACT;
import static org.opensearch.sql.expression.function.BuiltinFunctionName.INTERNAL_REGEXP_REPLACE_3;
import static org.opensearch.sql.expression.function.BuiltinFunctionName.INTERNAL_REGEXP_REPLACE_5;
import static org.opensearch.sql.expression.function.BuiltinFunctionName.INTERNAL_REGEXP_REPLACE_PG_4;
import static org.opensearch.sql.expression.function.BuiltinFunctionName.INTERNAL_TRANSLATE3;
import static org.opensearch.sql.expression.function.BuiltinFunctionName.IS_BLANK;
import static org.opensearch.sql.expression.function.BuiltinFunctionName.IS_EMPTY;
import static org.opensearch.sql.expression.function.BuiltinFunctionName.IS_NOT_NULL;
Expand Down Expand Up @@ -159,6 +162,8 @@
import static org.opensearch.sql.expression.function.BuiltinFunctionName.REGEX_MATCH;
import static org.opensearch.sql.expression.function.BuiltinFunctionName.REPLACE;
import static org.opensearch.sql.expression.function.BuiltinFunctionName.REVERSE;
import static org.opensearch.sql.expression.function.BuiltinFunctionName.REX_EXTRACT;
import static org.opensearch.sql.expression.function.BuiltinFunctionName.REX_EXTRACT_MULTI;
import static org.opensearch.sql.expression.function.BuiltinFunctionName.RIGHT;
import static org.opensearch.sql.expression.function.BuiltinFunctionName.RINT;
import static org.opensearch.sql.expression.function.BuiltinFunctionName.ROUND;
Expand Down Expand Up @@ -678,6 +683,9 @@ void populate() {
registerOperator(SHA1, SqlLibraryOperators.SHA1);
registerOperator(INTERNAL_REGEXP_EXTRACT, SqlLibraryOperators.REGEXP_EXTRACT);
registerOperator(INTERNAL_REGEXP_REPLACE_3, SqlLibraryOperators.REGEXP_REPLACE_3);
registerOperator(INTERNAL_REGEXP_REPLACE_PG_4, SqlLibraryOperators.REGEXP_REPLACE_PG_4);
registerOperator(INTERNAL_REGEXP_REPLACE_5, SqlLibraryOperators.REGEXP_REPLACE_5);
registerOperator(INTERNAL_TRANSLATE3, SqlLibraryOperators.TRANSLATE3);

// Register PPL UDF operator
registerOperator(COSH, PPLBuiltinOperators.COSH);
Expand All @@ -703,6 +711,8 @@ void populate() {
registerOperator(SIMPLE_QUERY_STRING, PPLBuiltinOperators.SIMPLE_QUERY_STRING);
registerOperator(QUERY_STRING, PPLBuiltinOperators.QUERY_STRING);
registerOperator(MULTI_MATCH, PPLBuiltinOperators.MULTI_MATCH);
registerOperator(REX_EXTRACT, PPLBuiltinOperators.REX_EXTRACT);
registerOperator(REX_EXTRACT_MULTI, PPLBuiltinOperators.REX_EXTRACT_MULTI);

// Register PPL Datetime UDF operator
registerOperator(TIMESTAMP, PPLBuiltinOperators.TIMESTAMP);
Expand Down
Original file line number Diff line number Diff line change
@@ -0,0 +1,71 @@
/*
 * Copyright OpenSearch Contributors
 * SPDX-License-Identifier: Apache-2.0
 */

package org.opensearch.sql.expression.function.udf;

import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import java.util.regex.PatternSyntaxException;
import org.apache.calcite.adapter.enumerable.NotNullImplementor;
import org.apache.calcite.adapter.enumerable.NullPolicy;
import org.apache.calcite.adapter.enumerable.RexToLixTranslator;
import org.apache.calcite.linq4j.tree.Expression;
import org.apache.calcite.linq4j.tree.Expressions;
import org.apache.calcite.rex.RexCall;
import org.apache.calcite.sql.type.ReturnTypes;
import org.apache.calcite.sql.type.SqlReturnTypeInference;
import org.opensearch.sql.calcite.utils.PPLOperandTypes;
import org.opensearch.sql.expression.function.ImplementorUDF;
import org.opensearch.sql.expression.function.UDFOperandMetadata;

/** Custom REX_EXTRACT function for extracting regex named capture groups. */
public final class RexExtractFunction extends ImplementorUDF {

  public RexExtractFunction() {
    super(new RexExtractImplementor(), NullPolicy.ARG0);
  }

  @Override
  public SqlReturnTypeInference getReturnTypeInference() {
    return ReturnTypes.VARCHAR_2000_NULLABLE;
  }

  @Override
  public UDFOperandMetadata getOperandMetadata() {
    return PPLOperandTypes.STRING_STRING_INTEGER;
  }

  private static class RexExtractImplementor implements NotNullImplementor {

    @Override
    public Expression implement(
        RexToLixTranslator translator, RexCall call, List<Expression> translatedOperands) {
      Expression field = translatedOperands.get(0);
      Expression pattern = translatedOperands.get(1);
      Expression groupIndex = translatedOperands.get(2);

      return Expressions.call(RexExtractFunction.class, "extractGroup", field, pattern, groupIndex);
    }
  }

  public static String extractGroup(String text, String pattern, int groupIndex) {
    try {
      Pattern compiledPattern = Pattern.compile(pattern);
Contributor

[Low priority] A nice-to-have would be caching the compiled pattern. Alternatively, define it as a global Linq4j expression. See the Calcite example: https://github.com/apache/calcite/blob/44b57985eaeb0ef0c1eda2447aa75b5855259356/core/src/main/java/org/apache/calcite/runtime/SqlFunctions.java#L461-L475

If that's not feasible in this PR, maybe we can create an issue to track the perf improvement.

Collaborator

+1, please add an issue to track it. Compiling the pattern for every row is expensive.

Collaborator Author

follow up issue: #4235
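
The caching idea raised in this thread could look roughly like the sketch below, which memoizes compiled patterns in a `ConcurrentHashMap` keyed by the regex string. This is one possible approach under stated assumptions, not what the PR or the follow-up issue implements: the class name `CachedRexExtract` and the unbounded cache are hypothetical, and a real implementation would likely want a size limit.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class CachedRexExtract {

  // Hypothetical cache: unbounded, so only suitable when the set of distinct patterns is small.
  private static final ConcurrentMap<String, Pattern> PATTERNS = new ConcurrentHashMap<>();

  /** Compiles a regex once and reuses the Pattern on later calls. */
  static Pattern compiled(String regex) {
    // If compilation throws PatternSyntaxException, nothing is cached and the error propagates.
    return PATTERNS.computeIfAbsent(regex, Pattern::compile);
  }

  /** Same group lookup as extractGroup in the diff, minus the per-call compile. */
  static String extractGroup(String text, String pattern, int groupIndex) {
    // Pattern is immutable and shareable; Matcher is not, so a new one is created per call.
    Matcher matcher = compiled(pattern).matcher(text);
    if (matcher.find() && groupIndex > 0 && groupIndex <= matcher.groupCount()) {
      return matcher.group(groupIndex);
    }
    return null;
  }

  public static void main(String[] args) {
    Pattern first = compiled("(?<num>\\d+)");
    Pattern second = compiled("(?<num>\\d+)");
    System.out.println(first == second); // same instance is reused
    System.out.println(extractGroup("order 42", "(?<num>\\d+)", 1));
  }
}
```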

      Matcher matcher = compiledPattern.matcher(text);

      if (matcher.find() && groupIndex > 0 && groupIndex <= matcher.groupCount()) {
        return matcher.group(groupIndex);
      }
      return null;
    } catch (PatternSyntaxException e) {
      throw new IllegalArgumentException(
Collaborator Author

Hi @penghuo, I have rechecked the requirement, and yes, a pattern compilation failure should be caught. The current behavior is:

 curl -X POST "localhost:9200/_plugins/_ppl" -H 'Content-Type: application/json' -d'{
    "query": "source=accounts | rex field=email \"(?<invalid>[\" | fields email, invalid | head 1"
  }' | jq

{
  "error": {
    "reason": "Invalid Query",
    "details": "Error in 'rex' command: Encountered the following error while compiling the regex '(?<invalid>[': Unclosed character class near index 11\n(?<invalid>[\n           ^",
    "type": "IllegalArgumentException"
  },
  "status": 400
}

"Error in 'rex' command: Encountered the following error while compiling the regex '"
              + pattern
              + "': "
              + e.getMessage());
    }
  }
}
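
To make the three possible outcomes of `extractGroup` concrete (a match, no match, and an invalid pattern), here is a small standalone sketch that mirrors the logic from the diff above without the Calcite wiring. The class name `RexExtractOutcomes` and the sample inputs are illustrative only.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import java.util.regex.PatternSyntaxException;

final class RexExtractOutcomes {

  // Mirrors the extractGroup logic shown in the diff; only the class around it is new.
  static String extractGroup(String text, String pattern, int groupIndex) {
    try {
      Matcher matcher = Pattern.compile(pattern).matcher(text);
      if (matcher.find() && groupIndex > 0 && groupIndex <= matcher.groupCount()) {
        return matcher.group(groupIndex);
      }
      return null;
    } catch (PatternSyntaxException e) {
      // Wrap the syntax error so the caller sees which command and regex failed.
      throw new IllegalArgumentException(
          "Error in 'rex' command: Encountered the following error while compiling the regex '"
              + pattern
              + "': "
              + e.getMessage());
    }
  }

  public static void main(String[] args) {
    String pattern = "(?<host>[a-z]+)\\.com";

    // 1. The pattern matches, so the first group's value is returned.
    System.out.println(extractGroup("mail@example.com", pattern, 1));

    // 2. The pattern does not match, so the result is null rather than an error.
    System.out.println(extractGroup("no host here", pattern, 1));

    // 3. The pattern is malformed, so compilation fails and the error is rewrapped.
    try {
      extractGroup("anything", "(?<invalid>[", 1);
    } catch (IllegalArgumentException e) {
      System.out.println(e.getMessage());
    }
  }
}
```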