Document CLPDECODE. #236

Merged on Dec 28, 2023 (8 commits)
Changes from 3 commits
7 changes: 5 additions & 2 deletions basics/data-import/clp.md
@@ -84,7 +84,6 @@ Assuming the user wants to encode `message` and `logPath` as in the example, the

* `stream.kafka.decoder.prop.fieldsForClpEncoding` is a comma-separated list of names for fields that should be encoded with CLP.
* We use [variable-length dictionaries](../../configuration-reference/table#table-index-config) for the logtype and dictionary variables since their length can vary significantly.
* Ideally, we would disable the dictionaries for the encoded variable columns (since they are likely to be random), but currently, a bug prevents us from doing that for multi-valued number-type columns.

### Schema

@@ -134,4 +133,8 @@ For the table's schema, users should configure the CLP-encoded fields as follows

## Searching and decoding CLP-encoded fields

There is currently no built-in support within Pinot for searching and decoding CLP-encoded fields. This will be added in future commits, potentially as a set of UDFs. The development of these features is being tracked in this [design doc](https://docs.google.com/document/d/1nHZb37re4mUwEA258x3a2pgX13EWLWMJ0uLEDk1dUyU/edit).
To decode CLP-encoded fields, users can use [CLPDECODE](../../configuration-reference/functions/clpdecode.md).

To search CLP-encoded fields, users can combine `CLPDECODE` with `LIKE`; however, this may be expensive if there are many rows to query.
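
As a hedged sketch of combining `CLPDECODE` with `LIKE`, the query below filters on the decoded value of the `message` column group. The table name `myTable`, the search pattern, and the `LIMIT` are illustrative assumptions. Because each candidate row presumably has to be decoded before the pattern can be matched, narrowing the scan with other filters first (e.g., a time range) should help keep the cost down.

```sql
-- Illustrative only: myTable and the pattern are assumed names/values.
SELECT CLPDECODE(message) AS message
FROM myTable
WHERE CLPDECODE(message) LIKE '%container_15%'
LIMIT 10
```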

We are working to integrate efficient searches on CLP-encoded columns as another UDF. The development of this feature is being tracked in this [design doc](https://docs.google.com/document/d/1nHZb37re4mUwEA258x3a2pgX13EWLWMJ0uLEDk1dUyU/edit).
4 changes: 4 additions & 0 deletions configuration-reference/functions/README.md
@@ -123,6 +123,10 @@ This page contains reference documentation for functions in Apache Pinot.
[chr.md](chr.md)
{% endcontent-ref %}

{% content-ref url="clpdecode.md" %}
[clpdecode.md](clpdecode.md)
{% endcontent-ref %}

{% content-ref url="codepoint.md" %}
[codepoint.md](codepoint.md)
{% endcontent-ref %}
51 changes: 51 additions & 0 deletions configuration-reference/functions/clpdecode.md
@@ -0,0 +1,51 @@
---
description: This section contains reference documentation for the CLPDECODE function.
---

# CLPDECODE

Reconstructs (decodes) the value of a CLP-encoded field from its component columns.

The [CLPLogMessageDecoder](../../basics/data-import/clp.md) can encode fields into a set of three columns:

* `<field>_logtype`
* `<field>_dictionaryVars`
* `<field>_encodedVars`

where `<field>` is the field's name before encoding. We refer to such a set of columns as a column group.

## Signatures

> CLPDECODE(colGroupName)
>
> CLPDECODE(colGroupName, defaultValue)
>
> CLPDECODE(colGroupName_logtype, colGroupName_dictionaryVars, colGroupName_encodedVars)
>
> CLPDECODE(colGroupName_logtype, colGroupName_dictionaryVars, colGroupName_encodedVars, defaultValue)

* The syntax allows you to specify just the name of a column group or all columns within the column group.
* `defaultValue` is optional and is used when a column group can't be decoded for some reason (e.g., it's null).
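
As a sketch, the signatures listed here might be applied to a `message` column group as follows. The table name `myTable` and the default string `'<decode failed>'` are assumptions chosen for illustration; per the signature list, the two projections are expected to return the same decoded value.

```sql
-- Assumed names: myTable (table), '<decode failed>' (default value).
SELECT
  -- Column-group form with a default value
  CLPDECODE(message, '<decode failed>') AS messageFromGroup,
  -- Explicit-column form with the same default value
  CLPDECODE(message_logtype, message_dictionaryVars, message_encodedVars, '<decode failed>') AS messageFromColumns
FROM myTable
```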

## Usage Examples

Consider a record that contains a "message" field with this value:

> INFO Task task_12 assigned to container: [ContainerID:container_15], operation took 0.335 seconds. 8 tasks remaining.

[CLPLogMessageDecoder](../../basics/data-import/clp.md) will encode it into 3 columns:

| message_logtype | message_dictionaryVars | message_encodedVars |
|--------------------------------------------------------------------------------------------------------------|-----------------------------|-------------------------|
| INFO Task \x12 assigned to container: [ContainerID:\x12], operation took \x13 seconds. \x11 tasks remaining. | ["task_12", "container_15"] | [0x190000000000014f, 8] |

Then we can use `CLPDECODE` as follows:

```sql
SELECT CLPDECODE(message) AS message
FROM myTable
```

| message |
|-----------------------------------------------------------------------------------------------------------------------|
| INFO Task task_12 assigned to container: [ContainerID:container_15], operation took 0.335 seconds. 8 tasks remaining. |