Skip to content

Add support for list() multi-value stats function#4161

Merged
ps48 merged 11 commits intoopensearch-project:mainfrom
ps48:mv-list-function
Sep 3, 2025
Merged

Add support for list() multi-value stats function#4161
ps48 merged 11 commits intoopensearch-project:mainfrom
ps48:mv-list-function

Conversation

@ps48
Copy link
Copy Markdown
Member

@ps48 ps48 commented Aug 28, 2025

Description

This PR adds support for the list() aggregation function in PPL (Piped Processing Language), which collects field values into an array while preserving duplicates and order. Splitting the PR for list and values from: #4042

  • Core Function Registration: Added LIST to BuiltinFunctionName.java and registered it in PPLFuncImpTable.java
  • Grammar Support: Updated OpenSearchPPLParser.g4 to recognize the list function in PPL syntax
  • Documentation: Enhanced docs/user/ppl/cmd/stats.rst with comprehensive examples showing:
  • Comprehensive Testing: Added extensive integration tests in CalciteMultiValueStatsIT.java covering all supported data types (boolean, byte, numeric, string, etc.)

Related Issues

#4026

Check List

  • New functionality includes testing.
  • New functionality has been documented.
  • New functionality has javadoc added.
  • New functionality has a user manual doc added.
  • New PPL command checklist all confirmed.
  • API changes companion pull request created.
  • Commits are signed per the DCO using --signoff or -s.
  • Public documentation issue/PR created.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

@ps48 ps48 changed the title Add support for list multi-value stats function Add support for list() multi-value stats function Aug 28, 2025
@ps48 ps48 added the enhancement New feature or request label Aug 28, 2025
Comment thread docs/user/ppl/cmd/stats.rst Outdated

Version: 3.3.0 (Calcite engine only)

Usage: LIST(expr). Returns an array containing all values of the specified field from the result set, preserving both duplicates and the original order of appearance.
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

order is undetermiend

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

updated

| EARLIEST
| LATEST
| LIST
| VALUES
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

revert values

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

reverted


register(
LIST,
(distinct, field, argList, ctx) -> {
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@yuancu @qianheng-aws Please help review customrized ARRAY_ARG.

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we leverage ARRAY_SLICE? apache/calcite#4194

Copy link
Copy Markdown
Collaborator

@qianheng-aws qianheng-aws Aug 29, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you mean using ARRAY_SLICE to cut the length to 100 after ARRAY_AGG? I think it's feasible but less efficient as it needs to construct a complete ARRAY first, which may be too big.

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm thinking if we should implement this aggregate function by ourself, which should get better performance. That customized function should imitate the implementation of ARRAY_AGG or COLLECT except it has additional logic to limit the length of the final output collection.

The current approach has to perform window first and then aggregation, they are both very heavy operators.

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

agree. @ps48 UDAF then make sense.

}

@Test
public void testListAggregationAlone() {
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Add a group by test case

Copy link
Copy Markdown
Member Author

@ps48 ps48 Sep 2, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

added it 3311c89

Comment on lines +1105 to +1133
// Create ROW_NUMBER() OVER() window function properly
RexNode rowNumber =
ctx.relBuilder
.aggregateCall(SqlStdOperatorTable.ROW_NUMBER)
.over()
.rowsTo(RexWindowBounds.CURRENT_ROW)
.toRex();

// Create condition: ROW_NUMBER() OVER() <= 100
RelDataType intType =
ctx.relBuilder.getTypeFactory().createSqlType(SqlTypeName.INTEGER);
RexNode hundredLiteral = rexBuilder.makeLiteral(100, intType, false);
RexNode rowNumCondition =
rexBuilder.makeCall(
SqlStdOperatorTable.LESS_THAN_OR_EQUAL, rowNumber, hundredLiteral);

// Create CASE expression: CASE WHEN ROW_NUMBER() OVER() <= 100 THEN cast_field
// ELSE NULL END
RexNode limitedCastExpr =
rexBuilder.makeCall(
SqlStdOperatorTable.CASE,
rowNumCondition,
castToVarchar,
rexBuilder.makeNullLiteral(UserDefinedFunctionUtils.NULLABLE_STRING));

// Apply ARRAY_AGG directly to the CASE expression
// DON'T use limt() or filter() - this keeps it contained within the
// aggregation
return ctx.relBuilder
.aggregateCall(SqlLibraryOperators.ARRAY_AGG, limitedCastExpr)
.ignoreNulls(true);
},
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

From the implementation, it seems it will return less elements than expected.

For example, if there are 200 elements, but there are 5 nulls in the first 100 elements, this implementation will return 95 elements instead of 100 elements (first 105 elements minus 5 nulls).

ps48 added 2 commits August 29, 2025 11:51
Signed-off-by: ps48 <pshenoy36@gmail.com>
Signed-off-by: ps48 <pshenoy36@gmail.com>
ps48 added 2 commits August 29, 2025 11:52
Signed-off-by: ps48 <pshenoy36@gmail.com>
Signed-off-by: ps48 <pshenoy36@gmail.com>
@ps48 ps48 force-pushed the mv-list-function branch from 84dca8c to 91cb056 Compare August 29, 2025 18:52
ps48 and others added 7 commits August 29, 2025 16:12
Signed-off-by: ps48 <pshenoy36@gmail.com>
Signed-off-by: ps48 <pshenoy36@gmail.com>
Signed-off-by: Shenoy Pratik <sgguruda@amazon.com>
Signed-off-by: ps48 <pshenoy36@gmail.com>
Signed-off-by: ps48 <pshenoy36@gmail.com>
Signed-off-by: ps48 <pshenoy36@gmail.com>
Signed-off-by: ps48 <pshenoy36@gmail.com>
* <li>Order of values in the result is non-deterministic
* </ul>
*
* <p>Note: Similar to the TAKE function, LIST does not guarantee any specific order of values in
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Remove "Similar to the TAKE function"

Copy link
Copy Markdown
Member Author

@ps48 ps48 Sep 3, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure will update this in a following PR.

@ps48 ps48 merged commit 0875aff into opensearch-project:main Sep 3, 2025
23 checks passed
@opensearch-trigger-bot
Copy link
Copy Markdown
Contributor

The backport to 2.19-dev failed:

The process '/usr/bin/git' failed with exit code 128

To backport manually, run these commands in your terminal:

# Navigate to the root of your repository
cd $(git rev-parse --show-toplevel)
# Fetch latest updates from GitHub
git fetch
# Create a new working tree
git worktree add ../.worktrees/sql/backport-2.19-dev 2.19-dev
# Navigate to the new working tree
pushd ../.worktrees/sql/backport-2.19-dev
# Create a new branch
git switch --create backport/backport-4161-to-2.19-dev
# Cherry-pick the merged commit of this pull request and resolve the conflicts
git cherry-pick -x --mainline 1 0875affcd0f120cd9880d28c605e143d0477ac29
# Push it to GitHub
git push --set-upstream origin backport/backport-4161-to-2.19-dev
# Go back to the original working tree
popd
# Delete the working tree
git worktree remove ../.worktrees/sql/backport-2.19-dev

Then, create a pull request where the base branch is 2.19-dev and the compare/head branch is backport/backport-4161-to-2.19-dev.

ps48 added a commit to ps48/sql that referenced this pull request Sep 4, 2025
…ct#4161)

* Add support for list function

Signed-off-by: ps48 <pshenoy36@gmail.com>

* fix test and resolve comments

Signed-off-by: ps48 <pshenoy36@gmail.com>

* fix spotlesscheck

Signed-off-by: ps48 <pshenoy36@gmail.com>

* revert list() to UDAF

Signed-off-by: ps48 <pshenoy36@gmail.com>

* update tests

Signed-off-by: ps48 <pshenoy36@gmail.com>

* update tests and docs

Signed-off-by: ps48 <pshenoy36@gmail.com>

* apply spotless

Signed-off-by: ps48 <pshenoy36@gmail.com>

* remove order by test

Signed-off-by: ps48 <pshenoy36@gmail.com>

* Add a group by test case in eval

Signed-off-by: ps48 <pshenoy36@gmail.com>

* revert Optionality in UDF

Signed-off-by: ps48 <pshenoy36@gmail.com>

---------

Signed-off-by: ps48 <pshenoy36@gmail.com>
Signed-off-by: Shenoy Pratik <sgguruda@amazon.com>
(cherry picked from commit 0875aff)
Swiddis pushed a commit that referenced this pull request Sep 4, 2025
* Add support for list function



* fix test and resolve comments



* fix spotlesscheck



* revert list() to UDAF



* update tests



* update tests and docs



* apply spotless



* remove order by test



* Add a group by test case in eval



* revert Optionality in UDF



---------



(cherry picked from commit 0875aff)

Signed-off-by: ps48 <pshenoy36@gmail.com>
Signed-off-by: Shenoy Pratik <sgguruda@amazon.com>
@LantaoJin LantaoJin added the backport-manually Filed a PR to backport manually. label Sep 9, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

backport 2.19-dev backport-failed backport-manually Filed a PR to backport manually. enhancement New feature or request

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants