fix(TopNRowNumber): Rank with peer computation #16190
aditi-pandit wants to merge 1 commit into facebookincubator:main from
Pull request overview
Fixes incorrect rank() peer handling when producing in-memory output and makes counting of top-rank peer rows deterministic/correct.
Changes:
- Adjust in-memory rank decrement logic to account for the peer-count of the next rank group.
- Rework TopRows::numTopRankRows() to compute peer counts without relying on std::priority_queue's underlying container ordering.
- Add a new test case exercising peer (tie) scenarios across multiple limits and configurations.
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 1 comment.
| File | Description |
|---|---|
| velox/exec/tests/TopNRowNumberTest.cpp | Adds basicWithPeers to validate correctness under ties/peer rows for all rank functions. |
| velox/exec/TopNRowNumber.h | Updates computeNextRankInMemory signature to allow mutation needed by new peer-count logic. |
| velox/exec/TopNRowNumber.cpp | Fixes in-memory rank decrement behavior and replaces numTopRankRows() implementation to avoid relying on PQ container ordering. |
Pull request overview
Copilot reviewed 3 out of 3 changed files in this pull request and generated 2 comments.
    auto popAndSaveTopRow = [&]() {
      tempTopRankRows.push_back(rows.top());
      rows.pop();
    };
The lambda popAndSaveTopRow performs a simple two-line operation that is only called twice in close proximity. Inlining these operations directly would improve code readability and reduce unnecessary abstraction.
pedroerp left a comment
Small comment, but looks good to me. Thanks for the fix
    @@ -578,7 +578,7 @@ TopNRowNumber::TopRows* TopNRowNumber::nextPartition() {
     template <core::TopNRowNumberNode::RankFunction TRank>
     void TopNRowNumber::computeNextRankInMemory(
    -    const TopRows& partition,
    +    TopRows& partition,
Why can't this be const anymore? I don't see where you modify it below.
This function now calls partition.numTopRankRows(), which is a non-const method, so the partition can no longer be passed as const.
Ideally this logic would not need to mutate the structure, but std::priority_queue offers no access beyond the top element. The only option was to pop() elements, check top(), and then reinsert the popped elements.
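The pop-and-reinsert approach described in this reply can be sketched roughly as follows. This is a minimal illustration on a plain `std::priority_queue<int>`, not the Velox code: the helper name `countTopPeers` and the use of `int` keys are assumptions made for the example.

```cpp
#include <cstddef>
#include <queue>
#include <vector>

// Hypothetical helper (not the Velox implementation): counts how many
// elements compare equal to the current top of a max-heap priority_queue.
// The queue is taken by non-const reference because counting requires pop().
std::size_t countTopPeers(std::priority_queue<int>& rows) {
  if (rows.empty()) {
    return 0;
  }
  const int topKey = rows.top();
  std::vector<int> saved;
  // priority_queue exposes only top(), so peers are discovered by popping
  // until the top changes or the queue is drained.
  while (!rows.empty() && rows.top() == topKey) {
    saved.push_back(rows.top());
    rows.pop();
  }
  // Reinsert the popped elements so the caller sees the same contents.
  for (int key : saved) {
    rows.push(key);
  }
  return saved.size();
}
```

For a queue holding the keys 1, 2, 3, 3, this returns 2 and leaves all four elements in the queue. The temporary mutation is why a const reference no longer works in this style of counting.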
xiaoxmeng left a comment
@aditi-pandit thanks for the change. I don't quite understand what the problem with the peer tracking is. Thanks!
    @@ -587,21 +587,19 @@ void TopNRowNumber::computeNextRankInMemory(
       // This is the logic for rank() and dense_rank().
       // If the next row is a peer of the current one, then the rank remains the
    -  // same, but the number of peers is incremented.
    +  // same.
Does the peer counting have a bug? Why can't we use the current tracking? It seems that numTopRankRows is a bit more expensive than explicitly tracking the peers. Thanks!
@xiaoxmeng:
The peer counting bug is the fix at line 602.
To give an example, say the input had sort keys 1, 2, 2, 3.
The ranks for these should be 1, 2, 2, 4.
When emitting output rows, we go from the top of the queue downward. We know that the top rank found so far is 4.
So the first output row is (3, 4). Peers of 3 = 1.
We then need to compute the rank of the new top of the queue (which is 2). Since 2 != 3, we know it's a new rank value. All rows with sort key 2 share the same rank, and the rank of the row just emitted (4) is that rank plus the number of rows with sort key 2. As there are 2 rows with sort key 2, their rank is 4 - 2 = 2.
So the next output is (2, 2).
The new row at the top of the queue is again 2. As it's not a new sort key, the rank remains the same. So the next output is (2, 2). Peers of 2 = 2.
The new top of the queue is 1. As 1 != 2, it's a new rank. The previous rank (2) is the new rank plus the number of rows with sort key 1. As that number is 1, the new rank is 2 - 1 = 1.
If we were decrementing by the peers of the current row (the old code), we would output ranks 1, 3, 3, 4, which is incorrect.
We can't hold back all output rows until we see the next peer value, so finding the number of top rank rows is the best approach.
We end up calling numTopRankRows only once for each distinct sort key (not for each row), so it's not actually very expensive.
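The walk-through above can be reproduced with a small standalone sketch. It is not the operator's code: `emitRanksDescending` is a hypothetical function that takes sort keys already in ascending order (standing in for the priority queue) and emits (key, rank) pairs from the largest key downward, dropping the rank by the peer count of the group being entered.

```cpp
#include <cstdint>
#include <utility>
#include <vector>

// Hypothetical stand-in for the output loop (not Velox code). Input keys must
// be sorted ascending; output is (key, rank) pairs from largest key to smallest.
std::vector<std::pair<int, int64_t>> emitRanksDescending(
    const std::vector<int>& ascendingKeys) {
  std::vector<std::pair<int, int64_t>> output;
  const auto numRows = static_cast<int64_t>(ascendingKeys.size());
  if (numRows == 0) {
    return output;
  }

  // Counts the rows that share the key at index 'last', where 'last' is the
  // final row of its group in ascending order.
  auto countPeers = [&](int64_t last) {
    int64_t peers = 1;
    while (last - peers >= 0 &&
           ascendingKeys[last - peers] == ascendingKeys[last]) {
      ++peers;
    }
    return peers;
  };

  // rank() of the largest key is one more than the number of rows below its
  // group.
  int64_t rank = numRows - countPeers(numRows - 1) + 1;
  for (int64_t i = numRows - 1; i >= 0; --i) {
    if (i < numRows - 1 && ascendingKeys[i] != ascendingKeys[i + 1]) {
      // Entering a lower group: drop by the peer count of this group, not of
      // the group that was just emitted.
      rank -= countPeers(i);
    }
    output.emplace_back(ascendingKeys[i], rank);
  }
  return output;
}
```

For the keys 1, 2, 2, 3 this yields (3, 4), (2, 2), (2, 2), (1, 1), matching the expected ranks. Subtracting the peer count of the group just emitted instead would give 3 for both rows with key 2, which is the incorrect output described in the comment.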
When decrementing rank values while producing in-memory output, the rank drops by the number of peers of the lower-rank row, not of the higher-rank one. Also fix the computation of numTopRankRows to avoid using the priority queue's underlying container vector, as that does not necessarily maintain the order of ranks.
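To illustrate why the underlying container cannot be trusted for ordering, here is a small sketch. The function `countLeadingEqual` is a hypothetical front-of-container scan invented for this example (how the previous code actually walked the container is not shown in this thread), and `countMaxElements` is the ground truth it is compared against.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Hypothetical scan: counts the leading elements equal to the first element.
// This is only correct if equal keys are stored contiguously at the front.
std::size_t countLeadingEqual(const std::vector<int>& container) {
  std::size_t n = 0;
  while (n < container.size() && container[n] == container.front()) {
    ++n;
  }
  return n;
}

// Ground truth: counts every element equal to the maximum key.
std::size_t countMaxElements(const std::vector<int>& container) {
  if (container.empty()) {
    return 0;
  }
  const int maxKey = *std::max_element(container.begin(), container.end());
  return static_cast<std::size_t>(
      std::count(container.begin(), container.end(), maxKey));
}
```

The vector {3, 1, 3} satisfies `std::is_heap`, so it is a layout a max-heap may legitimately have, yet the two rows with key 3 are not adjacent: the leading scan reports 1 peer while the true count is 2. A heap only guarantees that each parent is not smaller than its children, not that the array is sorted.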
xiaoxmeng left a comment
@aditi-pandit thanks for the explanation!
@xiaoxmeng has imported this pull request. If you are a Meta employee, you can view this in D92526503.
@xiaoxmeng merged this pull request in 34338bf.
Updating for facebookincubator/velox#16190 ``` == NO RELEASE NOTE == ```