diff --git a/docs/features/index.md b/docs/features/index.md
index 37680cc64..2461bea24 100644
--- a/docs/features/index.md
+++ b/docs/features/index.md
@@ -322,6 +322,7 @@
 - [SQL/PPL Bug Fixes](sql/sql-ppl-bug-fixes.md)
 - [SQL/PPL Engine](sql/sql-ppl-engine.md)
 - [SQL/PPL Breaking Changes](sql/sql-ppl-breaking-changes.md)
+- [PPL Rex and Regex Commands](sql/ppl-rex-and-regex-commands.md)
 
 ## asynchronous-search
 
diff --git a/docs/features/sql/ppl-rex-and-regex-commands.md b/docs/features/sql/ppl-rex-and-regex-commands.md
new file mode 100644
index 000000000..2646228f2
--- /dev/null
+++ b/docs/features/sql/ppl-rex-and-regex-commands.md
@@ -0,0 +1,223 @@
+# PPL Rex and Regex Commands
+
+## Summary
+
+The `regex` and `rex` commands provide comprehensive regex-based text processing capabilities in PPL (Piped Processing Language). The `regex` command filters records based on pattern matching, while the `rex` command extracts fields using named capture groups and performs text transformations. Both commands use Java's regex engine and are available in the Calcite query engine.
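The extract-mode behaviour summarized above can be illustrated with a small sketch. This is not the plugin's implementation: `rex_extract` is a hypothetical helper, Python's `re` module stands in for Java's regex engine (so named groups are written `(?P<name>...)` rather than `(?<name>...)`), and returning a scalar when `max_match=1` is an assumption. The limit of 10 mirrors the documented default of `plugins.ppl.rex.max_match.limit`.

```python
import re

# Documented default of plugins.ppl.rex.max_match.limit.
MAX_MATCH_LIMIT = 10


def rex_extract(value, pattern, max_match=1, limit=MAX_MATCH_LIMIT):
    """Hypothetical helper: map each named group to its match or list of matches."""
    if max_match < 0 or max_match > limit:
        raise ValueError(f"max_match={max_match} is outside the range 0..{limit}")
    # 0 means "unlimited", which the documentation says is capped to the limit.
    effective = limit if max_match == 0 else max_match

    compiled = re.compile(pattern)
    matches = list(compiled.finditer(value))[:effective]

    fields = {}
    for name in compiled.groupindex:
        found = [m.group(name) for m in matches if m.group(name) is not None]
        if max_match == 1:
            # Assumption: a single match yields a scalar, not a one-element list.
            fields[name] = found[0] if found else None
        else:
            fields[name] = found
    return fields


# Python spells named groups (?P<name>...); Java and PPL use (?<name>...).
email_fields = rex_extract(
    "amber@pyrami.com", r"(?P<username>[^@]+)@(?P<domain>[^.]+)"
)
address_words = rex_extract("880 Holmes Lane", r"(?P<words>[A-Za-z]+)", max_match=3)
```

Here `email_fields` holds one value per named group, while `address_words` holds a list because `max_match` is greater than 1.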
+
+## Details
+
+### Architecture
+
+```mermaid
+graph TB
+    subgraph "PPL Text Processing Commands"
+        A[PPL Query] --> B{Command Type}
+        B -->|regex| C[Pattern Filtering]
+        B -->|rex| D[Field Extraction/Transformation]
+    end
+
+    subgraph "Regex Command Flow"
+        C --> E[Parse Pattern]
+        E --> F[REGEXP_CONTAINS]
+        F --> G{Negated?}
+        G -->|Yes| H[NOT Filter]
+        G -->|No| I[Filter Records]
+    end
+
+    subgraph "Rex Command Flow"
+        D --> J{Mode}
+        J -->|extract| K[Named Group Extraction]
+        J -->|sed| L[Text Substitution]
+        K --> M[REX_EXTRACT UDF]
+        K --> N[REX_EXTRACT_MULTI UDF]
+        K --> O[REX_OFFSET UDF]
+        L --> P[REGEXP_REPLACE]
+        L --> Q[TRANSLATE3]
+    end
+```
+
+### Data Flow
+
+```mermaid
+flowchart LR
+    subgraph "Input"
+        A[Source Data]
+    end
+
+    subgraph "Regex Processing"
+        B[regex command]
+        B --> C{Pattern Match?}
+        C -->|Yes| D[Include Record]
+        C -->|No| E[Exclude Record]
+    end
+
+    subgraph "Rex Processing"
+        F[rex command]
+        F --> G{Mode}
+        G -->|extract| H[Create New Fields]
+        G -->|sed| I[Modify Field Value]
+    end
+
+    A --> B
+    A --> F
+    D --> J[Output]
+    H --> J
+    I --> J
+```
+
+### Components
+
+| Component | Description |
+|-----------|-------------|
+| `Regex` | AST node for regex filter command |
+| `Rex` | AST node for rex extraction/transformation command |
+| `RegexCommonUtils` | Shared utilities for pattern compilation and caching |
+| `RexExtractFunction` | UDF for extracting single match from named capture group |
+| `RexExtractMultiFunction` | UDF for extracting multiple matches as array |
+| `RexOffsetFunction` | UDF for calculating match position offsets |
+
+### Configuration
+
+| Setting | Description | Default |
+|---------|-------------|---------|
+| `plugins.ppl.rex.max_match.limit` | Maximum value for `max_match` parameter to prevent memory exhaustion | 10 |
+
+### Regex Command
+
+The `regex` command filters records based on regex pattern matching against field values.
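The filtering semantics can be sketched as follows. This is only an illustration under stated assumptions: `regex_filter` is a hypothetical helper, Python's `re.search` stands in for the `REGEXP_CONTAINS` step shown in the architecture diagram (a match anywhere in the value counts unless the pattern is anchored), and skipping non-string values is an assumption rather than documented behaviour.

```python
import re


def regex_filter(records, field, pattern, negate=False):
    """Hypothetical helper: keep records whose `field` contains a match for `pattern`.

    negate=False models `regex field="pattern"`, negate=True models `regex field!="pattern"`.
    """
    compiled = re.compile(pattern)
    kept = []
    for record in records:
        value = record.get(field)
        if not isinstance(value, str):
            # Assumption: non-string values are dropped in both modes.
            continue
        matched = compiled.search(value) is not None
        if matched != negate:
            kept.append(record)
    return kept


accounts = [
    {"lastname": "Duke"},
    {"lastname": "Adamson"},
    {"lastname": "bond"},
]

capitalized = regex_filter(accounts, "lastname", r"^[A-Z][a-z]+$")
not_son = regex_filter(accounts, "lastname", r".*son$", negate=True)
```

`capitalized` keeps the records whose last name is a capital letter followed by lowercase letters, and `not_son` keeps the records whose last name does not end in "son".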
+
+#### Syntax
+
+```
+regex <field> = <pattern>
+regex <field> != <pattern>
+```
+
+#### Parameters
+
+- `field`: Field name to match against (required)
+- `pattern`: Java regex pattern string (required)
+- `=`: Positive matching (include matches)
+- `!=`: Negative matching (exclude matches)
+
+#### Examples
+
+```sql
+-- Basic pattern matching
+source=accounts | regex lastname="^[A-Z][a-z]+$" | fields lastname
+
+-- Negative matching
+source=accounts | regex lastname!=".*son$" | fields lastname
+
+-- Email domain filtering
+source=accounts | regex email="@gmail\.com$" | fields email
+
+-- Complex patterns with character classes
+source=accounts | regex address="\d{3,4}\s+[A-Z][a-z]+\s+(Street|Lane)" | fields address
+
+-- Case-insensitive matching (using inline flag)
+source=accounts | regex state="(?i)ca" | fields state
+```
+
+### Rex Command
+
+The `rex` command extracts fields using named capture groups or performs text transformations.
+
+#### Syntax
+
+```
+rex field=<field> "<pattern>" [max_match=<int>] [offset_field=<string>]
+rex field=<field> mode=sed "<sed-expression>"
+```
+
+#### Parameters
+
+- `field`: Source field to process (required)
+- `pattern`: Regex with named capture groups `(?<name>pattern)` (required for extract mode)
+- `max_match`: Maximum matches to extract (default: 1; 0 = unlimited, capped at the configured limit)
+- `offset_field`: Field name to store match positions
+- `mode`: `extract` (default) or `sed`
+
+#### Extract Mode Examples
+
+```sql
+-- Basic field extraction
+source=accounts | rex field=email "(?<username>[^@]+)@(?<domain>[^.]+)"
+| fields email, username, domain
+
+-- Multiple named groups
+source=accounts | rex field=email "(?<user>[a-zA-Z0-9._%+-]+)@(?<domain>[a-zA-Z0-9.-]+)\.(?<tld>[a-zA-Z]{2,})"
+| fields email, user, domain, tld
+
+-- Multi-value extraction (returns array)
+source=accounts | rex field=address "(?<words>[A-Za-z]+)" max_match=3
+| fields address, words
+
+-- Position tracking
+source=accounts | rex field=email "(?<username>[^@]+)@(?<domain>[^.]+)" offset_field=matchpos
+| fields email, username, domain, matchpos
+
+-- Chaining multiple rex commands
+source=accounts | rex field=firstname "(?<firstinitial>^.)"
+| rex field=lastname "(?<lastinitial>^.)"
+| fields firstname, lastname, firstinitial, lastinitial
+```
+
+#### Sed Mode Examples
+
+```sql
+-- Basic substitution
+source=accounts | rex field=email mode=sed "s/@.*/@company.com/" | fields email
+
+-- Global replacement
+source=logs | rex field=message mode=sed "s/ERROR/WARNING/g" | fields message
+
+-- Nth occurrence replacement
+source=data | rex field=text mode=sed "s/word/replacement/2" | fields text
+
+-- Case-insensitive replacement
+source=data | rex field=text mode=sed "s/error/ERROR/gi" | fields text
+
+-- Character transliteration
+source=data | rex field=title mode=sed "y/ /_/" | fields title
+
+-- Backreferences in replacement
+source=data | rex field=phone mode=sed "s/(\d{3})(\d{3})(\d{4})/\1-\2-\3/" | fields phone
+```
+
+### Comparison with Related Commands
+
+| Feature | regex | rex | parse |
+|---------|-------|-----|-------|
+| Pattern Type | Java Regex | Java Regex | Java Regex |
+| Named Groups Required | No | Yes (extract mode) | Yes |
+| Filtering by Match | Yes | No | Yes |
+| Multiple Matches | No | Yes | No |
+| Text Substitution | No | Yes (sed mode) | No |
+| Offset Tracking | No | Yes | No |
+
+## Limitations
+
+- **Calcite Engine Only**: Both commands require `plugins.calcite.enabled=true`
+- **Named Group Naming**: Group names cannot contain underscores (Java regex limitation)
+- **String Fields Only**: `regex` command only works with string field types
+- **Max Match Limit**: `max_match` values exceeding the configured limit throw an error
+- **Sed Mode Restrictions**: `offset_field` cannot be used with `mode=sed`
+
+## Related PRs
+
+| Version | PR | Description |
+|---------|-----|-------------|
+| v3.3.0 | [#4083](https://github.com/opensearch-project/sql/pull/4083) | Implementation of `regex` command in PPL |
+| v3.3.0 | [#4109](https://github.com/opensearch-project/sql/pull/4109) | Core implementation of `rex` command (extract mode) |
+| v3.3.0 | [#4241](https://github.com/opensearch-project/sql/pull/4241) | Implementation of `sed` mode and `offset_field` in rex command |
+
+## References
+
+- [Issue #4082](https://github.com/opensearch-project/sql/issues/4082): RFC for regex command
+- [Issue #4108](https://github.com/opensearch-project/sql/issues/4108): RFC for rex command
+- [Java Pattern Documentation](https://docs.oracle.com/javase/8/docs/api/java/util/regex/Pattern.html): Java regex syntax reference
+
+## Change History
+
+- **v3.3.0** (2025-09): Initial implementation of `regex` and `rex` commands with extract mode, sed mode, and offset_field support
diff --git a/docs/releases/v3.3.0/features/sql/ppl-rex-and-regex-commands.md b/docs/releases/v3.3.0/features/sql/ppl-rex-and-regex-commands.md
new file mode 100644
index 000000000..252b750b4
--- /dev/null
+++ b/docs/releases/v3.3.0/features/sql/ppl-rex-and-regex-commands.md
@@ -0,0 +1,137 @@
+# PPL Rex and Regex Commands
+
+## Summary
+
+OpenSearch v3.3.0 introduces two new PPL commands for regex-based text processing: `regex` for pattern-based filtering and `rex` for field extraction and text transformation. These commands support log analysis, data parsing, and text transformation workflows within PPL pipelines.
+
+## Details
+
+### What's New in v3.3.0
+
+This release adds comprehensive regex support to PPL through two complementary commands:
+
+1. **`regex` command**: Filters records based on regex pattern matching
+2. **`rex` command**: Extracts fields using named capture groups and performs text transformations
+
+Both commands are implemented in the Calcite query engine and use Java's regex engine for consistent behavior.
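The text-transformation side of `rex` (sed mode) can be approximated with the sketch below. It is an illustration only: `sed_apply` is a hypothetical helper, Python's `re` module stands in for Java's regex engine, and the expression is split naively on its delimiter, so escaped delimiters inside the pattern are not handled. It covers the forms that appear in this documentation: `s/pattern/replacement/` with the `g`, `i`, and numeric flags, and `y/source/target/` transliteration.

```python
import re


def sed_apply(value, expression):
    """Hypothetical helper: apply a sed-style s/// or y/// expression to a string."""
    kind, delimiter = expression[0], expression[1]
    # Naive split: assumes the delimiter never appears escaped inside the parts.
    parts = expression[2:].split(delimiter)
    if kind == "y":
        source, target = parts[0], parts[1]
        # Character-by-character transliteration.
        return value.translate(str.maketrans(source, target))
    if kind != "s" or len(parts) < 3:
        raise ValueError(f"unsupported sed expression: {expression}")

    pattern, replacement, flags = parts[0], parts[1], parts[2]
    compiled = re.compile(pattern, re.IGNORECASE if "i" in flags else 0)
    if "g" in flags:
        return compiled.sub(replacement, value)

    # Without "g", replace only the Nth occurrence (default: the first).
    digits = "".join(ch for ch in flags if ch.isdigit())
    nth = int(digits) if digits else 1
    seen = 0

    def replace_nth(match):
        nonlocal seen
        seen += 1
        return match.expand(replacement) if seen == nth else match.group(0)

    return compiled.sub(replace_nth, value)


masked_email = sed_apply("amber@pyrami.com", "s/@.*/@company.com/")
second_only = sed_apply("word word word", "s/word/replacement/2")
underscored = sed_apply("a b c", "y/ /_/")
phone = sed_apply("5551234567", r"s/(\d{3})(\d{3})(\d{4})/\1-\2-\3/")
```

A production parser would need to honour escaped delimiters and validate flags; the naive split keeps the sketch short.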
+
+### Technical Changes
+
+#### Architecture Changes
+
+```mermaid
+graph TB
+    subgraph "PPL Query Pipeline"
+        A[PPL Query] --> B[Parser]
+        B --> C[AST Builder]
+        C --> D[Calcite Visitor]
+    end
+
+    subgraph "Regex Command"
+        D --> E[CalciteRelNodeVisitor]
+        E --> F[REGEXP_CONTAINS Filter]
+        F --> G[Script Query Pushdown]
+    end
+
+    subgraph "Rex Command"
+        D --> H[CalciteRelNodeVisitor]
+        H --> I{Mode?}
+        I -->|extract| J[REX_EXTRACT UDF]
+        I -->|sed| K[REGEXP_REPLACE]
+        J --> L[Named Group Extraction]
+        K --> M[Text Substitution]
+    end
+```
+
+#### New Components
+
+| Component | Description |
+|-----------|-------------|
+| `Regex` AST Node | Represents regex filter command in AST |
+| `Rex` AST Node | Represents rex extraction/transformation command |
+| `RegexCommonUtils` | Shared utilities for pattern caching and named group extraction |
+| `RexExtractFunction` | UDF for single-match field extraction |
+| `RexExtractMultiFunction` | UDF for multi-match field extraction (returns arrays) |
+| `RexOffsetFunction` | UDF for tracking match positions |
+
+#### New Configuration
+
+| Setting | Description | Default |
+|---------|-------------|---------|
+| `plugins.ppl.rex.max_match.limit` | Maximum allowed value for `max_match` parameter | 10 |
+
+### Usage Examples
+
+#### Regex Command - Pattern Filtering
+
+```sql
+-- Filter records where lastname matches pattern
+source=accounts | regex lastname="^[A-Z][a-z]+$" | fields account_number, lastname
+
+-- Negative matching - exclude records
+source=accounts | regex lastname!=".*son$" | fields account_number, lastname
+
+-- Email domain filtering
+source=accounts | regex email="@pyrami\.com$" | fields account_number, email
+```
+
+#### Rex Command - Field Extraction
+
+```sql
+-- Extract username and domain from email
+source=accounts | rex field=email "(?<username>[^@]+)@(?<domain>[^.]+)"
+| fields email, username, domain
+
+-- Extract multiple matches as array
+source=accounts | rex field=address "(?<words>[A-Za-z]+)" max_match=3
+| fields address, words
+
+-- Track match positions
+source=accounts | rex field=email "(?<user>[^@]+)" offset_field=positions
+| fields email, user, positions
+```
+
+#### Rex Command - Text Transformation (sed mode)
+
+```sql
+-- Replace email domain
+source=accounts | rex field=email mode=sed "s/@.*/@company.com/" | fields email
+
+-- Global replacement
+source=logs | rex field=message mode=sed "s/ERROR/WARNING/g" | fields message
+
+-- Character transliteration
+source=data | rex field=title mode=sed "y/ /_/" | fields title
+```
+
+### Migration Notes
+
+- Both commands require `plugins.calcite.enabled=true` (Calcite engine)
+- Named capture groups must use Java regex syntax: `(?<name>pattern)`
+- Group names cannot contain underscores due to Java regex limitations
+- `max_match=0` (unlimited) is automatically capped at the configured limit
+
+## Limitations
+
+- **Calcite Engine Only**: Commands are not available in the legacy SQL engine
+- **Named Group Naming**: Group names must start with a letter and contain only alphanumeric characters (no underscores)
+- **String Fields Only**: `regex` command only supports string field types
+- **Max Match Limit**: `max_match` values exceeding the configured limit will throw an error
+
+## Related PRs
+
+| PR | Description |
+|----|-------------|
+| [#4083](https://github.com/opensearch-project/sql/pull/4083) | Implementation of `regex` command in PPL |
+| [#4109](https://github.com/opensearch-project/sql/pull/4109) | Core implementation of `rex` command (extract mode) |
+| [#4241](https://github.com/opensearch-project/sql/pull/4241) | Implementation of `sed` mode and `offset_field` in rex command |
+
+## References
+
+- [Issue #4082](https://github.com/opensearch-project/sql/issues/4082): RFC for regex command
+- [Issue #4108](https://github.com/opensearch-project/sql/issues/4108): RFC for rex command
+- [Java Pattern Documentation](https://docs.oracle.com/javase/8/docs/api/java/util/regex/Pattern.html): Java regex syntax reference
+
+## Related Feature Report
+
+- [Full feature documentation](../../../../features/sql/ppl-rex-and-regex-commands.md)
diff --git a/docs/releases/v3.3.0/index.md b/docs/releases/v3.3.0/index.md
index 20a8bbd9d..c1799e346 100644
--- a/docs/releases/v3.3.0/index.md
+++ b/docs/releases/v3.3.0/index.md
@@ -137,6 +137,7 @@
 ### SQL
 
 - [PPL Rename Command - Wildcard Support](features/sql/ppl-rename-command.md)
+- [PPL Rex and Regex Commands](features/sql/ppl-rex-and-regex-commands.md)
 - [PPL Spath Command](features/sql/ppl-spath-command.md)
 - [SQL/PPL Bug Fixes](features/sql/sql-ppl-bug-fixes.md)