Add support for list() multi-value stats function#4161
ps48 merged 11 commits into opensearch-project:main from
Version: 3.3.0 (Calcite engine only)

Usage: LIST(expr). Returns an array containing all values of the specified field from the result set, preserving both duplicates and the original order of appearance.

| EARLIEST
| LATEST
| LIST
| VALUES

register(
    LIST,
    (distinct, field, argList, ctx) -> {
@yuancu @qianheng-aws Please help review the customized ARRAY_AGG.
Can we leverage ARRAY_SLICE? apache/calcite#4194
Do you mean using ARRAY_SLICE to cut the length to 100 after ARRAY_AGG? I think it's feasible but less efficient as it needs to construct a complete ARRAY first, which may be too big.
I'm wondering whether we should implement this aggregate function ourselves, which should perform better. The custom function would imitate the implementation of ARRAY_AGG or COLLECT, with additional logic to limit the length of the final output collection.
The current approach has to perform a window operation first and then the aggregation; both are very heavy operators.
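To make the suggestion above concrete, here is a minimal plain-Java sketch of a length-bounded accumulator. It is not the UDAF from this PR and does not use Calcite's aggregate-function interfaces; the class name `BoundedListAccumulator` and its methods are hypothetical, and converting values to strings is an assumption based on the VARCHAR cast in the diff.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

/** Hypothetical accumulator: keeps at most `limit` non-null values, in arrival order. */
final class BoundedListAccumulator {
  private final int limit;
  private final List<String> values = new ArrayList<>();

  BoundedListAccumulator(int limit) {
    if (limit < 0) {
      throw new IllegalArgumentException("limit must be non-negative");
    }
    this.limit = limit;
  }

  /** Adds one input value; nulls are skipped and values past the limit are dropped. */
  void add(Object value) {
    if (value != null && values.size() < limit) {
      // Assumption: values are rendered as strings, like the VARCHAR cast in the diff.
      values.add(String.valueOf(value));
    }
  }

  /** Returns an unmodifiable copy of the collected values, duplicates included. */
  List<String> result() {
    return Collections.unmodifiableList(new ArrayList<>(values));
  }
}
```

Because the limit is checked while accumulating, the list never grows beyond `limit` entries, so no oversized array is built and no separate window pass is needed.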
}

@Test
public void testListAggregationAlone() {
      // Create ROW_NUMBER() OVER() window function properly
      RexNode rowNumber =
          ctx.relBuilder
              .aggregateCall(SqlStdOperatorTable.ROW_NUMBER)
              .over()
              .rowsTo(RexWindowBounds.CURRENT_ROW)
              .toRex();

      // Create condition: ROW_NUMBER() OVER() <= 100
      RelDataType intType =
          ctx.relBuilder.getTypeFactory().createSqlType(SqlTypeName.INTEGER);
      RexNode hundredLiteral = rexBuilder.makeLiteral(100, intType, false);
      RexNode rowNumCondition =
          rexBuilder.makeCall(
              SqlStdOperatorTable.LESS_THAN_OR_EQUAL, rowNumber, hundredLiteral);

      // Create CASE expression: CASE WHEN ROW_NUMBER() OVER() <= 100 THEN cast_field
      // ELSE NULL END
      RexNode limitedCastExpr =
          rexBuilder.makeCall(
              SqlStdOperatorTable.CASE,
              rowNumCondition,
              castToVarchar,
              rexBuilder.makeNullLiteral(UserDefinedFunctionUtils.NULLABLE_STRING));

      // Apply ARRAY_AGG directly to the CASE expression
      // DON'T use limit() or filter() - this keeps it contained within the
      // aggregation
      return ctx.relBuilder
          .aggregateCall(SqlLibraryOperators.ARRAY_AGG, limitedCastExpr)
          .ignoreNulls(true);
    },
From the implementation, it seems it will return fewer elements than expected.
For example, if there are 200 elements and 5 of the first 100 are null, this implementation returns 95 elements rather than the expected 100 (the first 105 elements minus the 5 nulls).
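The arithmetic in this comment can be checked with a small plain-Java simulation. This is only a sketch: it models the CASE/ROW_NUMBER expression and the null-ignoring ARRAY_AGG with ordinary loops rather than Calcite, and the class `ListLimitDemo`, its methods, and the sample data are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

final class ListLimitDemo {
  /** Models CASE WHEN ROW_NUMBER() <= limit THEN v ELSE NULL END, then a null-ignoring ARRAY_AGG. */
  static List<String> limitThenDropNulls(List<String> rows, int limit) {
    List<String> out = new ArrayList<>();
    for (int i = 0; i < rows.size(); i++) {
      int rowNumber = i + 1;
      String v = rowNumber <= limit ? rows.get(i) : null;
      if (v != null) {
        out.add(v);
      }
    }
    return out;
  }

  /** The behavior the reviewer expects: skip nulls first, then keep the first `limit` values. */
  static List<String> dropNullsThenLimit(List<String> rows, int limit) {
    List<String> out = new ArrayList<>();
    for (String v : rows) {
      if (out.size() == limit) {
        break;
      }
      if (v != null) {
        out.add(v);
      }
    }
    return out;
  }

  /** Hypothetical data: 200 rows, of which the first 5 are null. */
  static List<String> sampleRows() {
    List<String> rows = new ArrayList<>();
    for (int i = 1; i <= 200; i++) {
      rows.add(i <= 5 ? null : "v" + i);
    }
    return rows;
  }
}
```

With this data, `limitThenDropNulls(sampleRows(), 100)` yields 95 values, while `dropNullsThenLimit(sampleRows(), 100)` yields 100 values ending at `v105`, which matches the reviewer's numbers.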
Signed-off-by: ps48 <pshenoy36@gmail.com>
Signed-off-by: Shenoy Pratik <sgguruda@amazon.com>
 * <li>Order of values in the result is non-deterministic
 * </ul>
 *
 * <p>Note: Similar to the TAKE function, LIST does not guarantee any specific order of values in
Remove "Similar to the TAKE function"
Sure, I will update this in a follow-up PR.
To backport manually, run these commands in your terminal:
# Navigate to the root of your repository
cd $(git rev-parse --show-toplevel)
# Fetch latest updates from GitHub
git fetch
# Create a new working tree
git worktree add ../.worktrees/sql/backport-2.19-dev 2.19-dev
# Navigate to the new working tree
pushd ../.worktrees/sql/backport-2.19-dev
# Create a new branch
git switch --create backport/backport-4161-to-2.19-dev
# Cherry-pick the merged commit of this pull request and resolve the conflicts
git cherry-pick -x --mainline 1 0875affcd0f120cd9880d28c605e143d0477ac29
# Push it to GitHub
git push --set-upstream origin backport/backport-4161-to-2.19-dev
# Go back to the original working tree
popd
# Delete the working tree
git worktree remove ../.worktrees/sql/backport-2.19-dev
Then, create a pull request where the
…ct#4161)
* Add support for list function
* fix test and resolve comments
* fix spotlesscheck
* revert list() to UDAF
* update tests
* update tests and docs
* apply spotless
* remove order by test
* Add a group by test case in eval
* revert Optionality in UDF
---------
Signed-off-by: ps48 <pshenoy36@gmail.com>
Signed-off-by: Shenoy Pratik <sgguruda@amazon.com>
(cherry picked from commit 0875aff)
Description
This PR adds support for the list() aggregation function in PPL (Piped Processing Language), which collects field values into an array while preserving duplicates and order. Splitting the PR for list and values from: #4042
* … BuiltinFunctionName.java and registered it in PPLFuncImpTable.java
* … OpenSearchPPLParser.g4 to recognize the list function in PPL syntax
* … docs/user/ppl/cmd/stats.rst with comprehensive examples showing:
* … CalciteMultiValueStatsIT.java covering all supported data types (boolean, byte, numeric, string, etc.)
Related Issues
#4026
Check List
--signoff or -s.
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.