Merge pull request #236 from y-scope/latest

Document CLPDECODE.
pinot-contrib · Dec 28, 2023 · 7618e47
2 parents 0fd4bde + 3d3b5a0

Showing 3 changed files with 71 additions and 2 deletions.
diff --git a/basics/data-import/clp.md b/basics/data-import/clp.md
@@ -84,7 +84,6 @@ Assuming the user wants to encode `message` and `logPath` as in the example, the
 
 * `stream.kafka.decoder.prop.fieldsForClpEncoding` is a comma-separated list of names for fields that should be encoded with CLP.
 * We use [variable-length dictionaries](../../configuration-reference/table#table-index-config) for the logtype and dictionary variables since their length can vary significantly.
-* Ideally, we would disable the dictionaries for the encoded variable columns (since they are likely to be random), but currently, a bug prevents us from doing that for multi-valued number-type columns.
 
 ### Schema
 
@@ -134,4 +133,8 @@ For the table's schema, users should configure the CLP-encoded fields as follows
 
 ## Searching and decoding CLP-encoded fields
 
-There is currently no built-in support within Pinot for searching and decoding CLP-encoded fields. This will be added in future commits, potentially as a set of UDFs. The development of these features is being tracked in this [design doc](https://docs.google.com/document/d/1nHZb37re4mUwEA258x3a2pgX13EWLWMJ0uLEDk1dUyU/edit).
+To decode CLP-encoded fields, use [CLPDECODE](../../configuration-reference/functions/clpdecode.md).
+
+To search CLP-encoded fields, you can combine `CLPDECODE` with `LIKE`. Note that this may be slow when the query scans a large number of rows, since each value must be decoded before it can be matched.
+
+We are working to integrate efficient searches on CLP-encoded columns as another UDF. The development of this feature is being tracked in this [design doc](https://docs.google.com/document/d/1nHZb37re4mUwEA258x3a2pgX13EWLWMJ0uLEDk1dUyU/edit).
diff --git a/configuration-reference/functions/README.md b/configuration-reference/functions/README.md
@@ -126,6 +126,10 @@ This page contains reference documentation for functions in Apache Pinot.
 [chr.md](chr.md)
 {% endcontent-ref %}
 
+{% content-ref url="clpdecode.md" %}
+[clpdecode.md](clpdecode.md)
+{% endcontent-ref %}
+
 {% content-ref url="codepoint.md" %}
 [codepoint.md](codepoint.md)
 {% endcontent-ref %}

diff --git a/configuration-reference/functions/clpdecode.md b/configuration-reference/functions/clpdecode.md
@@ -0,0 +1,62 @@
+---
+description: This section contains reference documentation for the CLPDECODE function.
+---
+
+# CLPDECODE
+
+Reconstructs (decodes) the value of a CLP-encoded field from its component columns.
+
+The [CLPLogMessageDecoder](../../basics/data-import/clp.md) can encode fields into a set of three columns:
+
+* `<field>_logtype`
+* `<field>_dictionaryVars`
+* `<field>_encodedVars`
+
+where `<field>` is the field's name before encoding. We refer to such a set of columns as a column group.
+
+## Signatures
+
+> CLPDECODE(colGroupName)
+> 
+> CLPDECODE(colGroupName, defaultValue)
+> 
+> CLPDECODE(colGroupName_logtype, colGroupName_dictionaryVars, colGroupName_encodedVars)
+> 
+> CLPDECODE(colGroupName_logtype, colGroupName_dictionaryVars, colGroupName_encodedVars, defaultValue)
+
+* The syntax lets you specify the name of a column group or all columns within the column group.
+  * To use the syntax where you only specify the column group's name, you need to enable an additional query rewriter as described [below](#enable-the-column-group-syntax).
+* `defaultValue` is optional; it is returned when a column group can't be decoded (e.g., it's null).
+
+## Usage Examples
+
+Consider a record that contains a `message` field with the following value:
+
+> INFO Task task_12 assigned to container: [ContainerID:container_15], operation took 0.335 seconds. 8 tasks remaining.
+
+[CLPLogMessageDecoder](../../basics/data-import/clp.md) encodes this message into three columns:
+
+| message_logtype                                                                                              | message_dictionaryVars      | message_encodedVars     |
+|--------------------------------------------------------------------------------------------------------------|-----------------------------|-------------------------|
+| INFO Task \x12 assigned to container: [ContainerID:\x12], operation took \x13 seconds. \x11 tasks remaining. | ["task_12", "container_15"] | [0x190000000000014f, 8] |
+
+Then we can use `CLPDECODE` as follows:
+
+```sql
+SELECT CLPDECODE(message) AS message
+FROM myTable
+```
+
+| message                                                                                                               |
+|-----------------------------------------------------------------------------------------------------------------------|
+| INFO Task task_12 assigned to container: [ContainerID:container_15], operation took 0.335 seconds. 8 tasks remaining. |
+
+## Enable the column-group syntax
+
+To use the `CLPDECODE` syntax that only specifies the column group name, you must configure the Pinot broker with an additional query rewriter as follows:
+
+```properties
+pinot.broker.query.rewriter.class.names=org.apache.pinot.sql.parsers.rewriter.CompileTimeFunctionsInvoker,org.apache.pinot.sql.parsers.rewriter.SelectionsRewriter,org.apache.pinot.sql.parsers.rewriter.PredicateComparisonRewriter,org.apache.pinot.sql.parsers.rewriter.CLPDecodeRewriter,org.apache.pinot.sql.parsers.rewriter.AliasApplier,org.apache.pinot.sql.parsers.rewriter.OrdinalsUpdater,org.apache.pinot.sql.parsers.rewriter.NonAggregationGroupByToDistinctQueryRewriter
+```
+
+This adds the `CLPDecodeRewriter` to the default set of query rewriters. Note that the `CLPDecodeRewriter` is placed before the `AliasApplier` so that any aliasing of CLP-encoded fields happens only after the `CLPDECODE` rewrite.
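
As a hedged illustration of the other signatures documented in `clpdecode.md`, the sketch below reuses `myTable` and the `message` column group from the usage example. The fallback string `'<undecodable>'`, the search pattern, and the `LIMIT` values are assumptions chosen for illustration, and neither query has been verified against a specific Pinot release. Both queries name the three columns explicitly, so they should not depend on the `CLPDecodeRewriter` being enabled. Run them as two separate queries.

```sql
-- Query 1: decode with an explicit fallback for rows that can't be decoded.
-- '<undecodable>' is a hypothetical default value, not one defined by Pinot.
SELECT CLPDECODE(message_logtype, message_dictionaryVars, message_encodedVars, '<undecodable>') AS message
FROM myTable
LIMIT 10

-- Query 2: search decoded messages by combining CLPDECODE with LIKE.
-- '%container_15%' is an illustrative pattern taken from the sample record.
SELECT CLPDECODE(message_logtype, message_dictionaryVars, message_encodedVars) AS message
FROM myTable
WHERE CLPDECODE(message_logtype, message_dictionaryVars, message_encodedVars) LIKE '%container_15%'
LIMIT 10
```

Because the second query decodes each value before matching, the performance caveat in `clp.md` applies to it; narrowing the scan with other filters first (for example, on a time column, if the table has one) is one way to limit the cost.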