Merge pull request #236 from y-scope/latest
Document CLPDECODE.
kelseiv authored Dec 28, 2023
2 parents 0fd4bde + 3d3b5a0 commit 7618e47
Showing 3 changed files with 71 additions and 2 deletions.
7 changes: 5 additions & 2 deletions basics/data-import/clp.md
@@ -84,7 +84,6 @@ Assuming the user wants to encode `message` and `logPath` as in the example, the

* `stream.kafka.decoder.prop.fieldsForClpEncoding` is a comma-separated list of names for fields that should be encoded with CLP.
* We use [variable-length dictionaries](../../configuration-reference/table#table-index-config) for the logtype and dictionary variables since their length can vary significantly.
* Ideally, we would disable the dictionaries for the encoded variable columns (since they are likely to be random), but currently, a bug prevents us from doing that for multi-valued number-type columns.

### Schema

@@ -134,4 +133,8 @@ For the table's schema, users should configure the CLP-encoded fields as follows

## Searching and decoding CLP-encoded fields

There is currently no built-in support within Pinot for searching and decoding CLP-encoded fields. This will be added in future commits, potentially as a set of UDFs. The development of these features is being tracked in this [design doc](https://docs.google.com/document/d/1nHZb37re4mUwEA258x3a2pgX13EWLWMJ0uLEDk1dUyU/edit).
To decode CLP-encoded fields, use [CLPDECODE](../../configuration-reference/functions/clpdecode.md).

To search CLP-encoded fields, you can combine `CLPDECODE` with `LIKE`. Note that this may degrade performance when the query has to decode a large number of rows.
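
A minimal sketch of such a search, assuming a table named `myTable` (a hypothetical name) whose `message` field was CLP-encoded as in the example; the search pattern is purely illustrative. It uses the form of `CLPDECODE` that names all three columns explicitly, so it should not depend on any additional query rewriter:

```sql
-- Decode each row's message, then filter on the decoded text.
-- myTable and the '%container_15%' pattern are illustrative assumptions.
SELECT CLPDECODE(message_logtype, message_dictionaryVars, message_encodedVars) AS message
FROM myTable
WHERE CLPDECODE(message_logtype, message_dictionaryVars, message_encodedVars) LIKE '%container_15%'
LIMIT 10
```

Because the filter is evaluated on decoded values, every candidate row presumably has to be decoded before it can be matched, which is the likely source of the performance cost on large tables.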

We are working to integrate efficient searches on CLP-encoded columns as another UDF. Development of this feature is tracked in this [design doc](https://docs.google.com/document/d/1nHZb37re4mUwEA258x3a2pgX13EWLWMJ0uLEDk1dUyU/edit).
4 changes: 4 additions & 0 deletions configuration-reference/functions/README.md
@@ -126,6 +126,10 @@ This page contains reference documentation for functions in Apache Pinot.
[chr.md](chr.md)
{% endcontent-ref %}

{% content-ref url="clpdecode.md" %}
[clpdecode.md](clpdecode.md)
{% endcontent-ref %}

{% content-ref url="codepoint.md" %}
[codepoint.md](codepoint.md)
{% endcontent-ref %}
62 changes: 62 additions & 0 deletions configuration-reference/functions/clpdecode.md
@@ -0,0 +1,62 @@
---
description: This section contains reference documentation for the CLPDECODE function.
---

# CLPDECODE

Reconstructs (decodes) the value of a CLP-encoded field from its component columns.

The [CLPLogMessageDecoder](../../basics/data-import/clp.md) can encode fields into a set of three columns:

* `<field>_logtype`
* `<field>_dictionaryVars`
* `<field>_encodedVars`

where `<field>` is the field's name before encoding. We refer to such a set of columns as a column group.

## Signatures

> CLPDECODE(colGroupName)
>
> CLPDECODE(colGroupName, defaultValue)
>
> CLPDECODE(colGroupName_logtype, colGroupName_dictionaryVars, colGroupName_encodedVars)
>
> CLPDECODE(colGroupName_logtype, colGroupName_dictionaryVars, colGroupName_encodedVars, defaultValue)

* The syntax lets you specify either the name of a column group or all three columns within the column group.
* To use the syntax where you specify only the column group's name, you need to enable an additional query rewriter as described [below](#enable-the-column-group-syntax).
* `defaultValue` is optional; it is used in place of the decoded value when a column group can't be decoded (e.g., because it's null).

## Usage Examples

Consider a record that contains a "message" field with the following value:

> INFO Task task_12 assigned to container: [ContainerID:container_15], operation took 0.335 seconds. 8 tasks remaining.

[CLPLogMessageDecoder](../../basics/data-import/clp.md) encodes this value into three columns:

| message_logtype | message_dictionaryVars | message_encodedVars |
|--------------------------------------------------------------------------------------------------------------|-----------------------------|-------------------------|
| INFO Task \x12 assigned to container: [ContainerID:\x12], operation took \x13 seconds. \x11 tasks remaining. | ["task_12", "container_15"] | [0x190000000000014f, 8] |

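Then we can use `CLPDECODE` as follows (this column-group form requires the query rewriter described [below](#enable-the-column-group-syntax)):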

```sql
SELECT CLPDECODE(message) AS message
FROM myTable
```

| message |
|-----------------------------------------------------------------------------------------------------------------------|
| INFO Task task_12 assigned to container: [ContainerID:container_15], operation took 0.335 seconds. 8 tasks remaining. |
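
The explicit three-column signature with a `defaultValue` could be used in the same way. A minimal sketch, reusing the `message` column group and `myTable` from the example above; the fallback string `'N/A'` is an arbitrary illustrative choice:

```sql
-- Explicit three-column form with a fallback value.
-- 'N/A' is an illustrative default, not a value prescribed by Pinot.
SELECT CLPDECODE(
         message_logtype,
         message_dictionaryVars,
         message_encodedVars,
         'N/A'
       ) AS message
FROM myTable
```

For the record shown above, this should produce the same decoded message; the default would only appear for rows whose column group can't be decoded. Since every column is named explicitly, this form should not depend on the `CLPDecodeRewriter`.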

## Enable the column-group syntax

To use the `CLPDECODE` syntax that only specifies the column group name, you must configure the Pinot broker with an additional query rewriter as follows:

```properties
pinot.broker.query.rewriter.class.names=org.apache.pinot.sql.parsers.rewriter.CompileTimeFunctionsInvoker,org.apache.pinot.sql.parsers.rewriter.SelectionsRewriter,org.apache.pinot.sql.parsers.rewriter.PredicateComparisonRewriter,org.apache.pinot.sql.parsers.rewriter.CLPDecodeRewriter,org.apache.pinot.sql.parsers.rewriter.AliasApplier,org.apache.pinot.sql.parsers.rewriter.OrdinalsUpdater,org.apache.pinot.sql.parsers.rewriter.NonAggregationGroupByToDistinctQueryRewriter
```

This adds the `CLPDecodeRewriter` to the default set of query rewriters. Note that the `CLPDecodeRewriter` is placed before the `AliasApplier` so that any aliasing of CLP-encoded fields happens only after the `CLPDECODE` rewrite.
