(Do not review) feat: TopNRank optimization by aditi-pandit · Pull Request #11554 · facebookincubator/velox

aditi-pandit · 2024-11-15T19:28:55Z

Design doc: https://docs.google.com/document/d/1WQfNigR9bVrbM-PqY7F0mswcetN_tdNahzD9ENye-Q0/edit?usp=sharing

e2e Presto PR (with changes in the Presto optimizer as well): prestodb/presto#24138

Latency for SF1K TPC-DS Q67 fell from 399s to 146s with this change.

(I also started working on a fuzzer in #12103 which I will enhance for the rank and dense_rank functions added here).

liujiayi771 · 2024-11-25T01:46:23Z

velox/exec/tests/TopNRowNumberTest.cpp

Could you also add a test case for the logic in fixTopRank?

@liujiayi771: The fixTopRank logic is tested very thoroughly in the fewPartitions test. I have added a comment there.

JkSelf · 2024-11-27T06:29:01Z

@aditi-pandit
For TopNRowNumber, there is an issue similar to the one with Window: Spark inserts an OrderBy operator before TopNRowNumber to sort the data.

So, do we need to add some abstractions in addInput here as well, to make it easier to add a TopNStreamingRowNumber later on?

aditi-pandit · 2024-11-27T06:32:38Z

@JkSelf: TopNRowNumber is already a somewhat streaming operator in its current implementation. It uses a HashTable internally to map each input row to a partition, and each partition has an accumulator that keeps the ordered rows (only as many as the limit requires) in a priority queue.

Window accumulates all the input rows and fully sorts them to demarcate partitions and to order rows by the order-by keys. So the preceding Sort was useful there, and we abstracted the streaming window.

With TopNRowNumber the tradeoffs are different: a full sort would run first, and TopNRowNumber would then apply the limit to only one partition at a time. Have you considered removing the global sort and checking whether TopNRowNumber alone suffices?

If we eventually decide that a fully streaming operator for TopNRowNumber is useful, it might be worth writing a new operator (rather than enhancing the current one). Of course, we can try to reuse some of the ranking logic.
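To make the structure described above concrete, here is a minimal sketch of per-partition top-N accumulation with a hash map and one priority queue per partition. It is not the Velox code: `Row`, `BySortKey`, and `TopNPerPartition` are hypothetical names, rows are reduced to two integer keys, and ties and spilling are ignored.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <queue>
#include <unordered_map>
#include <vector>

// Hypothetical row type: one partition key and one sort key.
struct Row {
  int64_t partitionKey;
  int64_t sortKey;
};

// Orders rows by ascending sortKey. With std::priority_queue this makes
// top() the row with the LARGEST sortKey, i.e. the first one to evict.
struct BySortKey {
  bool operator()(const Row& a, const Row& b) const {
    return a.sortKey < b.sortKey;
  }
};

using TopRows = std::priority_queue<Row, std::vector<Row>, BySortKey>;

// Hypothetical helper (not a Velox API): keeps at most `limit` rows with the
// smallest sort keys for every partition, without a global sort.
class TopNPerPartition {
 public:
  explicit TopNPerPartition(size_t limit) : limit_(limit) {
    assert(limit > 0);
  }

  void addInput(const Row& row) {
    TopRows& rows = partitions_[row.partitionKey];
    if (rows.size() < limit_) {
      rows.push(row);
    } else if (row.sortKey < rows.top().sortKey) {
      // The new row beats the current worst row, so replace it.
      rows.pop();
      rows.push(row);
    }
  }

  // Returns a copy of the partition's rows ordered by ascending sortKey.
  std::vector<Row> sortedRows(int64_t partitionKey) const {
    auto it = partitions_.find(partitionKey);
    if (it == partitions_.end()) {
      return {};
    }
    TopRows copy = it->second;
    std::vector<Row> out(copy.size());
    // The heap pops the largest key first, so fill the output back to front.
    for (size_t i = out.size(); i > 0; --i) {
      out[i - 1] = copy.top();
      copy.pop();
    }
    return out;
  }

 private:
  size_t limit_;
  std::unordered_map<int64_t, TopRows> partitions_;
};
```

In this sketch each partition holds at most `limit` rows, so memory grows with the number of partitions rather than the number of input rows. That is the tradeoff being contrasted with a full sort.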

JkSelf · 2024-11-27T06:51:52Z

> @JkSelf : TopNRowNumber is a streaming operator... It uses HashTable internally to map the input row to a partition and each partition has an accumulator that maintains the ordered rows in a priority queue.
>
> So the OrderBy is wasteful. It's not required. Can you consider removing the OrderBy before TopNRowNumber?

@aditi-pandit I see. Thanks for your explanations.

Summary: The topNRowNumber node is an optimized plan node for SQL queries that use ranking window functions but limit them to only the top N results. Add a TopNRowNumberFuzzer for plans with this plan node. This fuzzer is closely modeled after the RowNumberFuzzer, so the common code is abstracted into a RowNumberFuzzerBase class, which is the parent class of both RowNumberFuzzer and TopNRowNumberFuzzer. The fuzzer generates plans only for the row_number function right now. It will be enhanced to support the rank and dense_rank functions after #11554.

Pull Request resolved: #12103
Reviewed By: xiaoxmeng
Differential Revision: D69936162
Pulled By: kagamiori
fbshipit-source-id: 81214748c874f219cf7bc57b5eeeb8039325b06c

prestodb-ci · 2025-04-04T07:40:07Z

@ethanyzhang imported this issue into IBM GitHub Enterprise

HolyLow · 2025-04-16T07:37:11Z

This feature is useful, and I am wondering when this PR will be merged. It has been about half a year since it was proposed.

aditi-pandit · 2025-04-18T17:10:14Z

@HolyLow : I'm breaking this feature into smaller pieces to keep reviews moving. We have added the fuzzers needed to keep validating results for this logic now.

Summary: The getOutputFromMemory loop first output the remaining rows (if any) from the current partition. If there was still space in the output buffer, it then entered a second loop that tried to add as many partitions as possible (some in their entirety, possibly followed by a last partial one). The code is simplified to a single loop that carries forward from the current partition to fill the output rows. Like the old code, the loop iterates over as many partitions as can be output.

The priority queue in each TopNRowNumber partition pops rows in reverse order of row number. The old code maintained a remainingRowsInPartition_ variable that was used in some odd ways to compute the 'start' rowNumber for each output block from a partition. This is not needed: partition.rows.size() can be used directly to get the correct row numbers. The unnecessary currentPartition() function is removed as a result.

ref #11554 for TopNRank enhancements to this operator.

Pull Request resolved: #11440
Reviewed By: xiaoxmeng
Differential Revision: D74427134
Pulled By: kagamiori
fbshipit-source-id: d4e422a899db733565846e844834dd6bb408b689
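The row-number observation in this summary can be illustrated with a small sketch. It assumes a partition is just a max-heap of integer sort keys already trimmed to the limit; `PartitionRows` and `popRows` are hypothetical names, and the real operator writes into output vectors rather than returning pairs.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <queue>
#include <utility>
#include <vector>

// Hypothetical stand-in for one partition: a max-heap of sort keys, so
// top() is always the row with the highest remaining row number.
using PartitionRows = std::priority_queue<int64_t>;

// Pops up to maxRows rows and returns (rowNumber, sortKey) pairs in the
// order the heap yields them, i.e. in descending row number.
std::vector<std::pair<size_t, int64_t>> popRows(
    PartitionRows& rows,
    size_t maxRows) {
  std::vector<std::pair<size_t, int64_t>> out;
  while (!rows.empty() && out.size() < maxRows) {
    // Before the pop, size() equals the row number of top(), so no
    // separate "remaining rows" counter is needed.
    out.emplace_back(rows.size(), rows.top());
    rows.pop();
  }
  return out;
}
```

Calling `popRows` again on the same partition continues with the next lower row number, which is how an output batch boundary in the middle of a partition can be handled in this sketch.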

liujiayi771 · 2025-05-26T02:24:54Z

Hi @aditi-pandit. Could you rebase the code?

aditi-pandit · 2025-05-29T21:58:11Z

@liujiayi771 : Done.

Summary: TopNRank comparisons require an API that reports whether the result of a comparison is <, > or =. The old APIs only returned a bool indicating <= or not. This is the second PR towards TopNRank functionality. ref #11554

Pull Request resolved: #13264
Reviewed By: bikramSingh91
Differential Revision: D76729753
Pulled By: mbasmanova
fbshipit-source-id: 0a5f600530793508750d12a4d50201e9d4a50663
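One way to see why a three-way result matters: rank and dense_rank must detect ties (peer rows), which a bool for "<=" cannot separate from "strictly less". The sketch below is only an illustration using integer keys; `compareKeys`, `RankFunction`, and `assignRanks` are hypothetical names and not the RowComparator API.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical three-way comparison: negative, zero, or positive.
int compareKeys(int64_t a, int64_t b) {
  return a < b ? -1 : (a > b ? 1 : 0);
}

enum class RankFunction { kRowNumber, kRank, kDenseRank };

// Assigns a rank to every key of one partition. The keys must already be
// sorted in ascending order.
std::vector<int64_t> assignRanks(
    const std::vector<int64_t>& sortedKeys,
    RankFunction fn) {
  std::vector<int64_t> ranks;
  ranks.reserve(sortedKeys.size());
  int64_t current = 0;
  for (size_t i = 0; i < sortedKeys.size(); ++i) {
    assert(i == 0 || compareKeys(sortedKeys[i - 1], sortedKeys[i]) <= 0);
    // A row is a peer of the previous row when the comparison returns 0.
    const bool peer =
        i > 0 && compareKeys(sortedKeys[i - 1], sortedKeys[i]) == 0;
    const int64_t position = static_cast<int64_t>(i) + 1;
    switch (fn) {
      case RankFunction::kRowNumber:
        current = position; // Ties are ignored.
        break;
      case RankFunction::kRank:
        if (!peer) {
          current = position; // Peers share a rank; gaps follow ties.
        }
        break;
      case RankFunction::kDenseRank:
        if (!peer) {
          ++current; // Peers share a rank; no gaps.
        }
        break;
    }
    ranks.push_back(current);
  }
  return ranks;
}
```

For the keys 10, 10, 20 this yields 1, 2, 3 for row_number, 1, 1, 3 for rank, and 1, 1, 2 for dense_rank.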

kevinwilfong · 2025-06-20T16:04:21Z

@aditi-pandit It looks like TopNRowNumberTest/MultiTopNRowNumberTest.fewPartitions/1 is broken

aditi-pandit · 2025-06-20T16:51:47Z

@kevinwilfong : Thanks for looking at this PR. I've been breaking this feature into smaller PRs and trying to rebase. Will put only the minimal changes here for review after the underlying PRs make it in. Moving this to a draft state.

Summary: Towards #11554

Pull Request resolved: #13265
Reviewed By: xiaoxmeng
Differential Revision: D76933336
Pulled By: kevinwilfong
fbshipit-source-id: 215b67ffa6c9dcc5ef7a92f91d558cc32816f625

Summary: This is the first step towards supporting the rank() and dense_rank() functions in the TopNRowNumber node. ref #11554

Pull Request resolved: #13248
Reviewed By: tanjialiang
Differential Revision: D78392182
Pulled By: xiaoxmeng
fbshipit-source-id: 0e21f334e27ef5aaaf8aaacb619427f2d57ae24d

…nctions in getOutput() logic (#13860)

Summary: Last refactoring towards #11554

Pull Request resolved: #13860
Reviewed By: gggrace14
Differential Revision: D83028157
Pulled By: xiaoxmeng
fbshipit-source-id: 9ba629bc217bd2ac1e9eff35c9a582782099bb20

aditi-pandit · 2025-09-25T00:36:39Z

Closing in favor of #14141 which is the final piece of this PR after all the refactoring PRs linked.

prestodb-ci · 2025-09-25T00:36:41Z

Failed to update imported issue:

PATCH https://github.ibm.com/api/v3/repos/lakehouse/velox/issues/506: 422 Validation Failed []

facebook-github-bot added the CLA Signed label Nov 15, 2024

aditi-pandit requested review from Yuhta, mbasmanova and xiaoxmeng November 15, 2024 19:29

aditi-pandit force-pushed the topn branch 5 times, most recently from 0446e51 to 153a06f Compare November 17, 2024 03:22

liujiayi771 reviewed Nov 25, 2024

aditi-pandit force-pushed the topn branch from 153a06f to ab57e63 Compare November 25, 2024 06:39

aditi-pandit force-pushed the topn branch 2 times, most recently from 227f92c to 67f644d Compare January 10, 2025 00:23

aditi-pandit mentioned this pull request Jan 31, 2025

feat(fuzzer): Add TopNRowNumberFuzzer #12103

Closed

aditi-pandit force-pushed the topn branch 3 times, most recently from f84597c to 2aab22c Compare March 26, 2025 23:59

rui-mo mentioned this pull request Mar 28, 2025

[VL] upstream OAP/Velox commits to upstream apache/gluten#8782

Open

aditi-pandit mentioned this pull request Apr 22, 2025

test: Add TopnRowNumber test for few partitions but repetitive order_by keys #13109

Closed

This was referenced May 6, 2025

feat(TopNRank): Add rank function field to TopNRowNumber node #13248

Closed

refactor: Enhance RowComparator for rank function comparisons #13264

Closed

refactor: TopNRowNumber::getOutputFromMemory loop #11440

Closed

aditi-pandit mentioned this pull request May 7, 2025

refactor: Abstract the processInputRow loop in TopNRowNumber #13265

Closed

aditi-pandit force-pushed the topn branch from 2aab22c to f27ac76 Compare May 29, 2025 21:57

aditi-pandit force-pushed the topn branch 4 times, most recently from 60bc434 to c13136b Compare June 19, 2025 22:43

aditi-pandit changed the title ~~feat: TopNRank optimization~~ (Do not review) feat: TopNRank optimization Jun 20, 2025

aditi-pandit marked this pull request as draft June 20, 2025 16:51

aditi-pandit force-pushed the topn branch 3 times, most recently from 812d59d to 91787f7 Compare June 24, 2025 01:28

Add TopNRank optimization

a8feafe

aditi-pandit force-pushed the topn branch from 91787f7 to a8feafe Compare June 24, 2025 01:31

aditi-pandit mentioned this pull request Jun 24, 2025

refactor(TopNRowNumber): Abstract computeNextRankInMemory(InSpill) functions in getOutput() logic #13860

Closed

aditi-pandit marked this pull request as ready for review July 17, 2025 21:53

aditi-pandit closed this Sep 25, 2025

7 participants
