
(Do not review) feat: TopNRank optimization#11554

Closed
aditi-pandit wants to merge 1 commit into main from topn

Conversation

@aditi-pandit
Collaborator

@aditi-pandit aditi-pandit commented Nov 15, 2024

Design doc : https://docs.google.com/document/d/1WQfNigR9bVrbM-PqY7F0mswcetN_tdNahzD9ENye-Q0/edit?usp=sharing

#9404

e2e Presto PR (with changes in the Presto optimizer as well) prestodb/presto#24138

Latency for SF1K TPC-DS Q67 fell from 399s to 146s with this change.

(I also started working on a fuzzer in #12103 which I will enhance for the rank and dense_rank functions added here).

@facebook-github-bot facebook-github-bot added the CLA Signed label Nov 15, 2024
@netlify

netlify bot commented Nov 15, 2024

Deploy Preview for meta-velox canceled.

Name Link
🔨 Latest commit a8feafe
🔍 Latest deploy log https://app.netlify.com/projects/meta-velox/deploys/6859fff4f4fc6c0008c30f9e

Contributor

@liujiayi771 liujiayi771 Nov 25, 2024


Could you also add a test case for the logic in fixTopRank.

Collaborator Author


@liujiayi771 : The fixTopRank logic is tested very thoroughly in the fewPartitions test. Have added a comment there.

@JkSelf
Collaborator

JkSelf commented Nov 27, 2024

@aditi-pandit
For TopNRowNumber, there is an issue similar to the one with Window: before TopNRowNumber, Spark inserts an OrderBy operator to sort the data, as follows.

image

So, do we need to make some abstractions in addInput here as well, to facilitate the addition of TopNStreamingRowNumber later on?

@aditi-pandit
Collaborator Author

aditi-pandit commented Nov 27, 2024

@aditi-pandit For TopNRowNumber, there is an issue similar to the one with Window: before TopNRowNumber, Spark inserts an OrderBy operator to sort the data, as follows.

image

So, do we need to make some abstractions in addInput here as well, to facilitate the addition of TopNStreamingRowNumber later on?

@JkSelf : TopNRowNumber is a somewhat streaming operator in its current implementation. It uses a HashTable internally to map each input row to a partition, and each partition has an accumulator that maintains the ordered rows (as many as required for the limit) in a priority queue.

Window accumulates all the input rows and does a full sort to demarcate them into partitions and order them by the order-by keys. So the preceding Sort was useful there, and we abstracted the streaming window.

With TopNRowNumber, the tradeoffs of doing a full sort first and then having TopNRowNumber process only one partition at a time are different. Have you considered removing the global sort and checking whether TopNRowNumber alone suffices?

If we eventually decide that a fully streaming operator for TopNRowNumber is useful, it might be worth writing a new operator (rather than enhancing the current one). Of course, we can try to reuse some of the ranking logic pieces.
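
For readers following the thread, the accumulator design described in this comment can be sketched roughly as below. This is a minimal illustration, not the Velox implementation: `Row`, `TopNPerPartition`, `addInput` and `sortedKeys` are hypothetical names, a `std::unordered_map` stands in for the operator's HashTable, and rows are reduced to a single integer sort key.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <queue>
#include <unordered_map>
#include <vector>

// Hypothetical stand-in for an input row: a partition key and a sort key.
struct Row {
  int64_t partitionKey;
  int64_t sortKey;
};

// Keeps, for each partition, only the 'limit' rows with the smallest sort keys.
class TopNPerPartition {
 public:
  explicit TopNPerPartition(size_t limit) : limit_(limit) {
    assert(limit > 0);
  }

  void addInput(const Row& row) {
    auto& heap = partitions_[row.partitionKey];
    if (heap.size() < limit_) {
      heap.push(row.sortKey);
    } else if (row.sortKey < heap.top()) {
      // The new row beats the worst retained row, so it replaces it.
      heap.pop();
      heap.push(row.sortKey);
    }
  }

  // Returns the retained sort keys of one partition in ascending order.
  std::vector<int64_t> sortedKeys(int64_t partitionKey) const {
    std::vector<int64_t> result;
    auto it = partitions_.find(partitionKey);
    if (it == partitions_.end()) {
      return result;
    }
    auto heap = it->second; // Copy so the accumulator stays intact.
    result.resize(heap.size());
    for (size_t i = result.size(); i > 0; --i) {
      result[i - 1] = heap.top();
      heap.pop();
    }
    return result;
  }

 private:
  size_t limit_;
  // Max-heap per partition: top() is the largest retained sort key.
  std::unordered_map<int64_t, std::priority_queue<int64_t>> partitions_;
};
```

Because each partition retains at most `limit` rows, the input does not have to arrive sorted, which is the reason given above for questioning the preceding OrderBy.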

@JkSelf
Collaborator

JkSelf commented Nov 27, 2024

@aditi-pandit For TopNRowNumber, there is an issue similar to the one with Window: before TopNRowNumber, Spark inserts an OrderBy operator to sort the data, as follows.
image
So, do we need to make some abstractions in addInput here as well, to facilitate the addition of TopNStreamingRowNumber later on?

@JkSelf : TopNRowNumber is a streaming operator... It uses HashTable internally to map the input row to a partition and each partition has an accumulator that maintains the ordered rows in a priority queue.

So the OrderBy is wasteful. It's not required. Can you consider removing the OrderBy before TopNRowNumber?

@aditi-pandit I see. Thanks for your explanations.

@aditi-pandit aditi-pandit force-pushed the topn branch 2 times, most recently from 227f92c to 67f644d Compare January 10, 2025 00:23
facebook-github-bot pushed a commit that referenced this pull request Feb 27, 2025
Summary:
The TopNRowNumber node is an optimized plan node for SQL queries that use ranking window functions but keep only the top N results. Add a TopNRowNumberFuzzer for plans with this plan node.

This fuzzer is closely modeled after the RowNumberFuzzer, so the common code is abstracted into a RowNumberFuzzerBase class, which is the parent class of both RowNumberFuzzer and TopNRowNumberFuzzer.

The fuzzer generates plans only for row_number function right now. It will be enhanced to support rank and dense_rank functions after #11554
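
As background for the three functions mentioned here, the sketch below shows how row_number, rank and dense_rank differ on an already sorted list of keys with ties, and how many rows a top-N limit keeps under each. It follows standard SQL semantics; `Ranks`, `computeRanks` and `rowsKept` are hypothetical helper names and are not part of Velox or the fuzzer.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Ranking values for each row of one partition, in input order.
struct Ranks {
  std::vector<int64_t> rowNumber;
  std::vector<int64_t> rank;
  std::vector<int64_t> denseRank;
};

// Computes row_number, rank and dense_rank for keys sorted in ascending order.
Ranks computeRanks(const std::vector<int64_t>& sortedKeys) {
  assert(std::is_sorted(sortedKeys.begin(), sortedKeys.end()));
  Ranks out;
  for (size_t i = 0; i < sortedKeys.size(); ++i) {
    const auto position = static_cast<int64_t>(i) + 1;
    const bool tie = i > 0 && sortedKeys[i] == sortedKeys[i - 1];
    // row_number ignores ties and always increments.
    out.rowNumber.push_back(position);
    // rank repeats for ties and then jumps to the row position.
    out.rank.push_back(tie ? out.rank.back() : position);
    // dense_rank repeats for ties and then increments by one.
    out.denseRank.push_back(
        tie ? out.denseRank.back()
            : (i == 0 ? 1 : out.denseRank.back() + 1));
  }
  return out;
}

// Counts the rows whose ranking value is within the top-N limit.
int64_t rowsKept(const std::vector<int64_t>& values, int64_t limit) {
  return std::count_if(values.begin(), values.end(), [limit](int64_t v) {
    return v <= limit;
  });
}
```

For keys `{10, 10, 20, 30, 30, 30, 40}` and a limit of 3, row_number and rank each keep 3 rows while dense_rank keeps 6, so the number of rows an accumulator must retain depends on the function.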

Pull Request resolved: #12103

Reviewed By: xiaoxmeng

Differential Revision: D69936162

Pulled By: kagamiori

fbshipit-source-id: 81214748c874f219cf7bc57b5eeeb8039325b06c
@aditi-pandit aditi-pandit force-pushed the topn branch 3 times, most recently from f84597c to 2aab22c Compare March 26, 2025 23:59
@prestodb-ci

@ethanyzhang imported this issue into IBM GitHub Enterprise

@HolyLow
Contributor

HolyLow commented Apr 16, 2025

This feature is useful, and I am wondering when will this PR get merged? Seems it's been half a year since proposal.

@aditi-pandit
Collaborator Author

This feature is useful, and I am wondering when will this PR get merged? Seems it's been half a year since proposal.

@HolyLow : I'm breaking this feature into smaller pieces to keep reviews moving. We have added the fuzzers needed to keep validating results for this logic now.

facebook-github-bot pushed a commit that referenced this pull request May 9, 2025
Summary:
The getOutputFromMemory loop first output the remaining rows (if any) from the current partition. If there was still space in the output buffer, it then entered a loop that tried to add as many partitions as possible (some in their entirety, and possibly a last partial one). The code is simplified to a single loop that carries forward from the current partition to fill the output rows. As before, the loop iterates over as many partitions as can be output.

The priority queue in each TopNRowNumber partition pops rows in reverse order of row number. The old code maintained a remainingRowsInPartition_ variable that was used in some odd ways to compute the 'start' row number for each output block from a partition. This is not needed: partition.rows.size() can be used directly to get the correct row numbers.

Remove the unnecessary currentPartition() function as a result.
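
The row-number observation above can be illustrated with a small sketch. It assumes a max-heap of integer sort keys in place of the operator's row pointers, and `drainWithRowNumbers` is a hypothetical helper, not a function in Velox.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <queue>
#include <utility>
#include <vector>

// Drains a max-heap of sort keys and returns (rowNumber, sortKey) pairs in
// ascending row-number order.
std::vector<std::pair<int64_t, int64_t>> drainWithRowNumbers(
    std::priority_queue<int64_t> rows) {
  std::vector<std::pair<int64_t, int64_t>> output(rows.size());
  while (!rows.empty()) {
    // The top row has the largest row number still in the queue, which is
    // exactly the queue's current size. No separate counter is needed.
    const auto rowNumber = static_cast<int64_t>(rows.size());
    assert(rowNumber >= 1);
    output[rowNumber - 1] = {rowNumber, rows.top()};
    rows.pop();
  }
  return output;
}
```

Each pop yields the next-highest row number, so the output is filled from the back toward the front.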

ref #11554 for TopNRank enhancements to this operator.

Pull Request resolved: #11440

Reviewed By: xiaoxmeng

Differential Revision: D74427134

Pulled By: kagamiori

fbshipit-source-id: d4e422a899db733565846e844834dd6bb408b689
@liujiayi771
Contributor

liujiayi771 commented May 26, 2025

Hi @aditi-pandit. Could you rebase the code?

@aditi-pandit
Collaborator Author

Hi @aditi-pandit. Could you rebase the code?

@liujiayi771 : Done.

facebook-github-bot pushed a commit that referenced this pull request Jun 16, 2025
Summary:
TopNRank comparisons require an API that reports whether the result of a comparison is <, > or =. The old APIs returned only a bool indicating <= or not.
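
A rough sketch of the difference, under the assumption that sort keys are plain integer vectors: `lessOrEqual` and `compareKeys` are hypothetical functions written for illustration and are not the Velox comparison APIs this commit changes.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Boolean-style comparison: true if lhs <= rhs in lexicographic order.
// A 'true' result cannot tell a tie apart from a strictly smaller row.
bool lessOrEqual(
    const std::vector<int64_t>& lhs,
    const std::vector<int64_t>& rhs) {
  assert(lhs.size() == rhs.size());
  for (size_t i = 0; i < lhs.size(); ++i) {
    if (lhs[i] != rhs[i]) {
      return lhs[i] < rhs[i];
    }
  }
  return true;
}

// Three-way comparison: negative if lhs < rhs, zero if the keys are equal,
// positive if lhs > rhs.
int compareKeys(
    const std::vector<int64_t>& lhs,
    const std::vector<int64_t>& rhs) {
  assert(lhs.size() == rhs.size());
  for (size_t i = 0; i < lhs.size(); ++i) {
    if (lhs[i] < rhs[i]) {
      return -1;
    }
    if (lhs[i] > rhs[i]) {
      return 1;
    }
  }
  return 0;
}
```

With rank and dense_rank, rows with equal keys share a ranking value, so a caller needs to distinguish the zero result from the negative one; with row_number the distinction does not matter.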

This is the 2nd PR towards TopNRank functionality
ref #11554

Pull Request resolved: #13264

Reviewed By: bikramSingh91

Differential Revision: D76729753

Pulled By: mbasmanova

fbshipit-source-id: 0a5f600530793508750d12a4d50201e9d4a50663
@aditi-pandit aditi-pandit force-pushed the topn branch 4 times, most recently from 60bc434 to c13136b Compare June 19, 2025 22:43
@kevinwilfong
Contributor

kevinwilfong commented Jun 20, 2025

@aditi-pandit It looks like TopNRowNumberTest/MultiTopNRowNumberTest.fewPartitions/1 is broken

@aditi-pandit aditi-pandit changed the title feat: TopNRank optimization (Do not review) feat: TopNRank optimization Jun 20, 2025
@aditi-pandit
Collaborator Author

@kevinwilfong : Thanks for looking at this PR. I've been breaking this feature into smaller PRs and trying to rebase. Will put only the minimal changes here for review after the underlying PRs make it in. Moving this to a draft state.

@aditi-pandit aditi-pandit marked this pull request as draft June 20, 2025 16:51
facebook-github-bot pushed a commit that referenced this pull request Jun 20, 2025
Summary:
Towards #11554

Pull Request resolved: #13265

Reviewed By: xiaoxmeng

Differential Revision: D76933336

Pulled By: kevinwilfong

fbshipit-source-id: 215b67ffa6c9dcc5ef7a92f91d558cc32816f625
@aditi-pandit aditi-pandit force-pushed the topn branch 3 times, most recently from 812d59d to 91787f7 Compare June 24, 2025 01:28
facebook-github-bot pushed a commit that referenced this pull request Jul 17, 2025
Summary:
This is the first step toward supporting the rank() and dense_rank() functions in the TopNRowNumber node.

ref #11554

Pull Request resolved: #13248

Reviewed By: tanjialiang

Differential Revision: D78392182

Pulled By: xiaoxmeng

fbshipit-source-id: 0e21f334e27ef5aaaf8aaacb619427f2d57ae24d
@aditi-pandit aditi-pandit marked this pull request as ready for review July 17, 2025 21:53
facebook-github-bot pushed a commit that referenced this pull request Sep 23, 2025
…nctions in getOutput() logic (#13860)

Summary:
Last refactoring towards #11554

Pull Request resolved: #13860

Reviewed By: gggrace14

Differential Revision: D83028157

Pulled By: xiaoxmeng

fbshipit-source-id: 9ba629bc217bd2ac1e9eff35c9a582782099bb20
@aditi-pandit
Collaborator Author

Closing in favor of #14141 which is the final piece of this PR after all the refactoring PRs linked.

@prestodb-ci

Failed to update imported issue:

PATCH https://github.ibm.com/api/v3/repos/lakehouse/velox/issues/506: 422 Validation Failed []



7 participants