1 change: 1 addition & 0 deletions docs/features/index.md
@@ -322,6 +322,7 @@
- [SQL/PPL Bug Fixes](sql/sql-ppl-bug-fixes.md)
- [SQL/PPL Engine](sql/sql-ppl-engine.md)
- [SQL/PPL Breaking Changes](sql/sql-ppl-breaking-changes.md)
- [PPL Rex and Regex Commands](sql/ppl-rex-and-regex-commands.md)

## asynchronous-search

223 changes: 223 additions & 0 deletions docs/features/sql/ppl-rex-and-regex-commands.md
@@ -0,0 +1,223 @@
# PPL Rex and Regex Commands

## Summary

The `regex` and `rex` commands add regular-expression text processing to PPL (Piped Processing Language). The `regex` command filters records by pattern matching, while the `rex` command extracts fields using named capture groups and performs text transformations. Both commands use Java's regex engine and require the Calcite query engine.

## Details

### Architecture

```mermaid
graph TB
subgraph "PPL Text Processing Commands"
A[PPL Query] --> B{Command Type}
B -->|regex| C[Pattern Filtering]
B -->|rex| D[Field Extraction/Transformation]
end

subgraph "Regex Command Flow"
C --> E[Parse Pattern]
E --> F[REGEXP_CONTAINS]
F --> G{Negated?}
G -->|Yes| H[NOT Filter]
G -->|No| I[Filter Records]
end

subgraph "Rex Command Flow"
D --> J{Mode}
J -->|extract| K[Named Group Extraction]
J -->|sed| L[Text Substitution]
K --> M[REX_EXTRACT UDF]
K --> N[REX_EXTRACT_MULTI UDF]
K --> O[REX_OFFSET UDF]
L --> P[REGEXP_REPLACE]
L --> Q[TRANSLATE3]
end
```

### Data Flow

```mermaid
flowchart LR
subgraph "Input"
A[Source Data]
end

subgraph "Regex Processing"
B[regex command]
B --> C{Pattern Match?}
C -->|Yes| D[Include Record]
C -->|No| E[Exclude Record]
end

subgraph "Rex Processing"
F[rex command]
F --> G{Mode}
G -->|extract| H[Create New Fields]
G -->|sed| I[Modify Field Value]
end

A --> B
A --> F
D --> J[Output]
H --> J
I --> J
```

### Components

| Component | Description |
|-----------|-------------|
| `Regex` | AST node for regex filter command |
| `Rex` | AST node for rex extraction/transformation command |
| `RegexCommonUtils` | Shared utilities for pattern compilation and caching |
| `RexExtractFunction` | UDF for extracting single match from named capture group |
| `RexExtractMultiFunction` | UDF for extracting multiple matches as array |
| `RexOffsetFunction` | UDF for calculating match position offsets |
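
The table describes `RegexCommonUtils` as handling pattern compilation and caching. The following is a minimal sketch of what such a cache could look like, not the actual implementation: the class name `PatternCacheSketch` and the method `compiled` are hypothetical, and the real utility may bound the cache size or use a different structure.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.regex.Pattern;

/** Hypothetical sketch of a compiled-pattern cache; not the real RegexCommonUtils. */
final class PatternCacheSketch {
    // Unbounded map for brevity; a production cache would likely cap its size.
    private static final Map<String, Pattern> CACHE = new ConcurrentHashMap<>();

    /** Compiles a regex on first use and returns the cached Pattern afterwards. */
    static Pattern compiled(String regex) {
        return CACHE.computeIfAbsent(regex, Pattern::compile);
    }

    public static void main(String[] args) {
        Pattern first = compiled("^[A-Z][a-z]+$");
        Pattern second = compiled("^[A-Z][a-z]+$");
        // Both calls return the same Pattern instance, so the regex is compiled once.
        System.out.println(first == second);
        System.out.println(first.matcher("Bond").find());
    }
}
```

Caching matters here because a filter or extraction UDF is evaluated once per row, while the pattern string stays the same for the whole query.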

### Configuration

| Setting | Description | Default |
|---------|-------------|---------|
| `plugins.ppl.rex.max_match.limit` | Maximum value for `max_match` parameter to prevent memory exhaustion | 10 |

### Regex Command

The `regex` command filters records based on regex pattern matching against field values.

#### Syntax

```
regex <field>=<pattern>
regex <field>!=<pattern>
```

#### Parameters

- `field`: Field name to match against (required)
- `pattern`: Java regex pattern string (required)
- `=`: Positive matching (include matches)
- `!=`: Negative matching (exclude matches)
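
The architecture diagram maps `regex` to `REGEXP_CONTAINS`, which suggests "contains a match" semantics rather than a full-string match. The sketch below illustrates that reading with plain `java.util.regex`; the class `RegexFilterSketch` and its `filter` method are hypothetical names for illustration, and null handling is left out.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

/** Hypothetical illustration of regex = / != filtering; not the plugin's code. */
final class RegexFilterSketch {
    /** Keeps values containing a match, or values without one when negated. */
    static List<String> filter(List<String> values, String regex, boolean negated) {
        Pattern pattern = Pattern.compile(regex);
        return values.stream()
            // find() succeeds on a partial match; anchors (^, $) force full matches.
            .filter(value -> pattern.matcher(value).find() != negated)
            .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<String> names = Arrays.asList("Bond", "bates", "Adams");
        System.out.println(filter(names, "^[A-Z][a-z]+$", false)); // [Bond, Adams]
        System.out.println(filter(Arrays.asList("Johnson", "Smith"), ".*son$", true)); // [Smith]
    }
}
```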

#### Examples

```sql
-- Basic pattern matching
source=accounts | regex lastname="^[A-Z][a-z]+$" | fields lastname

-- Negative matching
source=accounts | regex lastname!=".*son$" | fields lastname

-- Email domain filtering
source=accounts | regex email="@gmail\.com$" | fields email

-- Complex patterns with character classes
source=accounts | regex address="\d{3,4}\s+[A-Z][a-z]+\s+(Street|Lane)" | fields address

-- Case-insensitive matching (using inline flag)
source=accounts | regex state="(?i)ca" | fields state
```

### Rex Command

The `rex` command extracts fields using named capture groups or performs text transformations.

#### Syntax

```
rex field=<field> "<pattern>" [max_match=<int>] [offset_field=<string>]
rex field=<field> mode=sed "<sed-expression>"
```

#### Parameters

- `field`: Source field to process (required)
- `pattern`: Regex with named capture groups `(?<name>pattern)` (required for extract mode)
- `max_match`: Maximum number of matches to extract (default: 1). A value of `0` means unlimited, which is capped to the configured limit
- `offset_field`: Field name to store match positions (extract mode only)
- `mode`: `extract` (default) or `sed`
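
To make the interaction between named groups and `max_match` concrete, here is a rough sketch of extract mode using `java.util.regex` directly. The class `RexExtractSketch` and its `extract` method are hypothetical; the real `REX_EXTRACT` and `REX_EXTRACT_MULTI` UDFs may differ in null handling and return types.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical illustration of rex extract mode; not the plugin's UDF code. */
final class RexExtractSketch {
    /** Returns up to maxMatch values captured by the named group, in match order. */
    static List<String> extract(String input, String regex, String group, int maxMatch) {
        Matcher matcher = Pattern.compile(regex).matcher(input);
        List<String> values = new ArrayList<>();
        // Each find() advances to the next non-overlapping match.
        while (values.size() < maxMatch && matcher.find()) {
            values.add(matcher.group(group));
        }
        return values;
    }

    public static void main(String[] args) {
        String emailRegex = "(?<username>[^@]+)@(?<domain>[^.]+)";
        System.out.println(extract("amberduke@pyrami.com", emailRegex, "username", 1)); // [amberduke]
        System.out.println(extract("amberduke@pyrami.com", emailRegex, "domain", 1));   // [pyrami]
        System.out.println(extract("880 Holmes Lane", "(?<words>[A-Za-z]+)", "words", 3)); // [Holmes, Lane]
    }
}
```

With `max_match=1` the sketch yields a single value per group, which corresponds to the scalar field case; larger values correspond to the array case described for multi-value extraction.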

#### Extract Mode Examples

```sql
-- Basic field extraction
source=accounts | rex field=email "(?<username>[^@]+)@(?<domain>[^.]+)"
| fields email, username, domain

-- Multiple named groups
source=accounts | rex field=email "(?<user>[a-zA-Z0-9._%+-]+)@(?<domain>[a-zA-Z0-9.-]+)\.(?<tld>[a-zA-Z]{2,})"
| fields email, user, domain, tld

-- Multi-value extraction (returns array)
source=accounts | rex field=address "(?<words>[A-Za-z]+)" max_match=3
| fields address, words

-- Position tracking
source=accounts | rex field=email "(?<username>[^@]+)@(?<domain>[^.]+)" offset_field=matchpos
| fields email, username, domain, matchpos

-- Chaining multiple rex commands
source=accounts | rex field=firstname "(?<firstinitial>^.)"
| rex field=lastname "(?<lastinitial>^.)"
| fields firstname, lastname, firstinitial, lastinitial
```
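
The position-tracking example relies on knowing where each named group matched. The sketch below shows how those positions can be read from a Java `Matcher`; it is an illustration only. The class `RexOffsetSketch` is hypothetical, the end index is exclusive as Java reports it, and the string format that `offset_field` actually stores is not reproduced here.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical illustration of match-position tracking; not RexOffsetFunction. */
final class RexOffsetSketch {
    /** Maps each requested group to {start, endExclusive} for the first match. */
    static Map<String, int[]> offsets(String input, String regex, String... groups) {
        Map<String, int[]> result = new LinkedHashMap<>();
        Matcher matcher = Pattern.compile(regex).matcher(input);
        if (matcher.find()) {
            for (String group : groups) {
                // Matcher.start(String) and end(String) report per-group positions.
                result.put(group, new int[] {matcher.start(group), matcher.end(group)});
            }
        }
        return result;
    }

    public static void main(String[] args) {
        offsets("amberduke@pyrami.com", "(?<username>[^@]+)@(?<domain>[^.]+)", "username", "domain")
            .forEach((name, span) -> System.out.println(name + "=" + span[0] + "-" + span[1]));
    }
}
```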

#### Sed Mode Examples

```sql
-- Basic substitution
source=accounts | rex field=email mode=sed "s/@.*/@company.com/" | fields email

-- Global replacement
source=logs | rex field=message mode=sed "s/ERROR/WARNING/g" | fields message

-- Nth occurrence replacement
source=data | rex field=text mode=sed "s/word/replacement/2" | fields text

-- Case-insensitive replacement
source=data | rex field=text mode=sed "s/error/ERROR/gi" | fields text

-- Character transliteration
source=data | rex field=title mode=sed "y/ /_/" | fields title

-- Backreferences in replacement
source=data | rex field=phone mode=sed "s/(\d{3})(\d{3})(\d{4})/\1-\2-\3/" | fields phone
```
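
The sed expressions above cover substitution (`s/.../.../flags`) and transliteration (`y/.../.../`). As a rough sketch of their semantics, the code below interprets both forms with `java.util.regex`. It is not the plugin's implementation, which the architecture diagram maps to `REGEXP_REPLACE` and `TRANSLATE3`. The class `SedSketch` is hypothetical, the parser naively splits on `/` (no escaped delimiters), and a literal `$` in the replacement is not handled.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical illustration of rex mode=sed semantics; not the plugin's code. */
final class SedSketch {
    static String apply(String input, String expr) {
        // Naive split: assumes the pattern and replacement contain no '/'.
        String[] parts = expr.split("/", -1);
        if (parts.length != 4) {
            throw new IllegalArgumentException("Unsupported sed expression: " + expr);
        }
        if (parts[0].equals("y")) {
            return transliterate(input, parts[1], parts[2]);
        }
        if (!parts[0].equals("s")) {
            throw new IllegalArgumentException("Unsupported sed command: " + parts[0]);
        }
        String flags = parts[3];
        int patternFlags = flags.contains("i") ? Pattern.CASE_INSENSITIVE : 0;
        Matcher matcher = Pattern.compile(parts[1], patternFlags).matcher(input);
        // sed writes backreferences as \1; Java's replacement syntax uses $1.
        String replacement = parts[2].replaceAll("\\\\(\\d)", "\\$$1");

        String digits = flags.replaceAll("[^0-9]", "");
        if (digits.isEmpty()) {
            return flags.contains("g")
                ? matcher.replaceAll(replacement)
                : matcher.replaceFirst(replacement);
        }
        // Numeric flag: replace only the Nth occurrence.
        int nth = Integer.parseInt(digits);
        StringBuffer out = new StringBuffer();
        int seen = 0;
        while (matcher.find()) {
            if (++seen == nth) {
                matcher.appendReplacement(out, replacement);
                break;
            }
        }
        matcher.appendTail(out);
        return out.toString();
    }

    /** Maps each character in 'from' to the character at the same index in 'to'. */
    static String transliterate(String input, String from, String to) {
        if (from.length() != to.length()) {
            throw new IllegalArgumentException("y/// sets must have equal length");
        }
        StringBuilder out = new StringBuilder(input.length());
        for (char c : input.toCharArray()) {
            int index = from.indexOf(c);
            out.append(index >= 0 ? to.charAt(index) : c);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(apply("amberduke@pyrami.com", "s/@.*/@company.com/"));
        System.out.println(apply("word word word", "s/word/replacement/2"));
        System.out.println(apply("Hello big world", "y/ /_/"));
        System.out.println(apply("5551234567", "s/(\\d{3})(\\d{3})(\\d{4})/\\1-\\2-\\3/"));
    }
}
```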

### Comparison with Related Commands

| Feature | regex | rex | parse |
|---------|-------|-----|-------|
| Pattern Type | Java Regex | Java Regex | Java Regex |
| Named Groups Required | No | Yes (extract mode) | Yes |
| Filtering by Match | Yes | No | Yes |
| Multiple Matches | No | Yes | No |
| Text Substitution | No | Yes (sed mode) | No |
| Offset Tracking | No | Yes | No |

## Limitations

- **Calcite Engine Only**: Both commands require `plugins.calcite.enabled=true`
- **Named Group Naming**: Group names must start with a letter and contain only letters and digits; underscores are rejected (Java regex limitation)
- **String Fields Only**: `regex` command only works with string field types
- **Max Match Limit**: `max_match` values above the configured `plugins.ppl.rex.max_match.limit` are rejected with an error
- **Sed Mode Restrictions**: `offset_field` cannot be used with `mode=sed`
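
The named-group restriction comes from `java.util.regex.Pattern` itself, which can be checked independently of PPL. The sketch below is a small standalone check; the class `GroupNameSketch` and its `compiles` helper are hypothetical names used only for illustration.

```java
import java.util.regex.Pattern;
import java.util.regex.PatternSyntaxException;

/** Standalone check of Java's named-group naming rules. */
final class GroupNameSketch {
    /** Returns true if Java accepts the regex, false if compilation fails. */
    static boolean compiles(String regex) {
        try {
            Pattern.compile(regex);
            return true;
        } catch (PatternSyntaxException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println(compiles("(?<username>[^@]+)"));  // true
        System.out.println(compiles("(?<user_name>[^@]+)")); // false: underscore in name
        System.out.println(compiles("(?<1user>[^@]+)"));     // false: starts with a digit
    }
}
```

In practice this means a pattern such as `(?<user_name>...)` needs to be rewritten with a name like `username` or `userName` before it can be used with `rex`.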

## Related PRs

| Version | PR | Description |
|---------|-----|-------------|
| v3.3.0 | [#4083](https://github.com/opensearch-project/sql/pull/4083) | Implementation of `regex` command in PPL |
| v3.3.0 | [#4109](https://github.com/opensearch-project/sql/pull/4109) | Core implementation of `rex` command (extract mode) |
| v3.3.0 | [#4241](https://github.com/opensearch-project/sql/pull/4241) | Implementation of `sed` mode and `offset_field` in rex command |

## References

- [Issue #4082](https://github.com/opensearch-project/sql/issues/4082): RFC for regex command
- [Issue #4108](https://github.com/opensearch-project/sql/issues/4108): RFC for rex command
- [Java Pattern Documentation](https://docs.oracle.com/javase/8/docs/api/java/util/regex/Pattern.html): Java regex syntax reference

## Change History

- **v3.3.0** (2025-09): Initial implementation of `regex` and `rex` commands with extract mode, sed mode, and offset_field support
137 changes: 137 additions & 0 deletions docs/releases/v3.3.0/features/sql/ppl-rex-and-regex-commands.md
@@ -0,0 +1,137 @@
# PPL Rex and Regex Commands

## Summary

OpenSearch v3.3.0 introduces two new PPL commands for regex-based text processing: `regex` for pattern-based filtering and `rex` for field extraction and text transformation. Both run inside PPL pipelines and target log analysis, data parsing, and text transformation workflows.

## Details

### What's New in v3.3.0

This release adds comprehensive regex support to PPL through two complementary commands:

1. **`regex` command**: Filters records based on regex pattern matching
2. **`rex` command**: Extracts fields using named capture groups and performs text transformations

Both commands are implemented in the Calcite query engine and use Java's regex engine for consistent behavior.

### Technical Changes

#### Architecture Changes

```mermaid
graph TB
subgraph "PPL Query Pipeline"
A[PPL Query] --> B[Parser]
B --> C[AST Builder]
C --> D[Calcite Visitor]
end

subgraph "Regex Command"
D --> E[CalciteRelNodeVisitor]
E --> F[REGEXP_CONTAINS Filter]
F --> G[Script Query Pushdown]
end

subgraph "Rex Command"
D --> H[CalciteRelNodeVisitor]
H --> I{Mode?}
I -->|extract| J[REX_EXTRACT UDF]
I -->|sed| K[REGEXP_REPLACE]
J --> L[Named Group Extraction]
K --> M[Text Substitution]
end
```

#### New Components

| Component | Description |
|-----------|-------------|
| `Regex` AST Node | Represents regex filter command in AST |
| `Rex` AST Node | Represents rex extraction/transformation command |
| `RegexCommonUtils` | Shared utilities for pattern caching and named group extraction |
| `RexExtractFunction` | UDF for single-match field extraction |
| `RexExtractMultiFunction` | UDF for multi-match field extraction (returns arrays) |
| `RexOffsetFunction` | UDF for tracking match positions |

#### New Configuration

| Setting | Description | Default |
|---------|-------------|---------|
| `plugins.ppl.rex.max_match.limit` | Maximum allowed value for `max_match` parameter | 10 |

### Usage Examples

#### Regex Command - Pattern Filtering

```sql
-- Filter records where lastname matches pattern
source=accounts | regex lastname="^[A-Z][a-z]+$" | fields account_number, lastname

-- Negative matching - exclude records
source=accounts | regex lastname!=".*son$" | fields account_number, lastname

-- Email domain filtering
source=accounts | regex email="@pyrami\.com$" | fields account_number, email
```

#### Rex Command - Field Extraction

```sql
-- Extract username and domain from email
source=accounts | rex field=email "(?<username>[^@]+)@(?<domain>[^.]+)"
| fields email, username, domain

-- Extract multiple matches as array
source=accounts | rex field=address "(?<words>[A-Za-z]+)" max_match=3
| fields address, words

-- Track match positions
source=accounts | rex field=email "(?<user>[^@]+)" offset_field=positions
| fields email, user, positions
```

#### Rex Command - Text Transformation (sed mode)

```sql
-- Replace email domain
source=accounts | rex field=email mode=sed "s/@.*/@company.com/" | fields email

-- Global replacement
source=logs | rex field=message mode=sed "s/ERROR/WARNING/g" | fields message

-- Character transliteration
source=data | rex field=title mode=sed "y/ /_/" | fields title
```

### Migration Notes

- Both commands require `plugins.calcite.enabled=true` (Calcite engine)
- Named capture groups must use Java regex syntax: `(?<name>pattern)`
- Group names cannot contain underscores due to Java regex limitations
- `max_match=0` (unlimited) is automatically capped to the configured limit

## Limitations

- **Calcite Engine Only**: Neither command is available when the Calcite engine is disabled
- **Named Group Naming**: Group names must start with a letter and contain only alphanumeric characters (no underscores)
- **String Fields Only**: `regex` command only supports string field types
- **Max Match Limit**: `max_match` values exceeding the configured limit will throw an error
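
The migration notes and limitations together imply two different outcomes for `max_match`: `0` is silently capped to `plugins.ppl.rex.max_match.limit`, while an explicit value above the limit is rejected. The sketch below is one reading of that rule, not the plugin's code: the class `MaxMatchSketch`, the method `effectiveMaxMatch`, the exception type, and the treatment of negative values are all assumptions.

```java
/** Hypothetical illustration of max_match validation; not the plugin's code. */
final class MaxMatchSketch {
    /** Resolves the requested max_match against the configured limit. */
    static int effectiveMaxMatch(int requested, int limit) {
        if (requested < 0) {
            // Assumption: negative values are invalid.
            throw new IllegalArgumentException("max_match must be non-negative");
        }
        if (requested == 0) {
            // 0 means "unlimited", which is capped to the configured limit.
            return limit;
        }
        if (requested > limit) {
            // Assumption: the error is surfaced as an IllegalArgumentException.
            throw new IllegalArgumentException(
                "max_match " + requested + " exceeds the configured limit " + limit);
        }
        return requested;
    }

    public static void main(String[] args) {
        int limit = 10; // documented default of plugins.ppl.rex.max_match.limit
        System.out.println(effectiveMaxMatch(0, limit)); // 10
        System.out.println(effectiveMaxMatch(3, limit)); // 3
    }
}
```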

## Related PRs

| PR | Description |
|----|-------------|
| [#4083](https://github.com/opensearch-project/sql/pull/4083) | Implementation of `regex` command in PPL |
| [#4109](https://github.com/opensearch-project/sql/pull/4109) | Core implementation of `rex` command (extract mode) |
| [#4241](https://github.com/opensearch-project/sql/pull/4241) | Implementation of `sed` mode and `offset_field` in rex command |

## References

- [Issue #4082](https://github.com/opensearch-project/sql/issues/4082): RFC for regex command
- [Issue #4108](https://github.com/opensearch-project/sql/issues/4108): RFC for rex command
- [Java Pattern Documentation](https://docs.oracle.com/javase/8/docs/api/java/util/regex/Pattern.html): Java regex syntax reference

## Related Feature Report

- [Full feature documentation](../../../../features/sql/ppl-rex-and-regex-commands.md)
1 change: 1 addition & 0 deletions docs/releases/v3.3.0/index.md
@@ -137,6 +137,7 @@
### SQL

- [PPL Rename Command - Wildcard Support](features/sql/ppl-rename-command.md)
- [PPL Rex and Regex Commands](features/sql/ppl-rex-and-regex-commands.md)
- [PPL Spath Command](features/sql/ppl-spath-command.md)
- [SQL/PPL Bug Fixes](features/sql/sql-ppl-bug-fixes.md)
