-
Notifications
You must be signed in to change notification settings - Fork 1.5k
fix(TopNRowNumber): Rank with peer computation #16190
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Changes from all commits
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -578,7 +578,7 @@ | |
|
|
||
| template <core::TopNRowNumberNode::RankFunction TRank> | ||
| void TopNRowNumber::computeNextRankInMemory( | ||
| const TopRows& partition, | ||
| TopRows& partition, | ||
| vector_size_t outputIndex) { | ||
| if constexpr (TRank == core::TopNRowNumberNode::RankFunction::kRowNumber) { | ||
| nextRank_ -= 1; | ||
|
|
@@ -587,21 +587,19 @@ | |
|
|
||
| // This is the logic for rank() and dense_rank(). | ||
| // If the next row is a peer of the current one, then the rank remains the | ||
| // same, but the number of peers is incremented. | ||
| // same. | ||
|
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Does the peer counting has bug? Why we can't use the current tracking and it seems that numTopRankRows is a bit expensive than explicitly tracking the peers? thanks!
Collaborator
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. @xiaoxmeng : To give an example: When emitting output rows, we are going from top of the queue to lower. We know that the top rank found so far is 4. We need to compute the rank of the new top of the queue (which is 2). Since 2 != 3 we know its a new rank value. All rows with sort key 2 will have the same rank, and that rank value its incremented by the number of peers when computing the rank of the current row which is 4. As there are 2 peers of 2 then their rank should be 4 - 2 = 2. The new row at top of queue is 2. As its not a new sort key the rank remains the same. So next output is (2, 2). Peers of 2 = 2 The new top of queue is 1. As 1 != 2 then its a new rank. The new rank is incremented by the number of its peers to obtain the current rank 2. As number of peers is 1, so new rank is 1. If we were counting peers of current row (old code) then we would output We can't keep all output rows until we see next peer value when outputting, so finding number of top rank rows is the best approach. We end up doing the numTopRankRows only once for each distinct sort key (not each row) so its not very expensive actually. |
||
| if (comparator_.compare(outputRows_[outputIndex], partition.rows.top()) == | ||
| 0) { | ||
| numPeers_ += 1; | ||
| return; | ||
| } | ||
|
|
||
| // The new row is not a peer of the current one. So dense_rank drops the | ||
| // rank by 1, but rank drops it by the number of peers (which is then | ||
| // reset). | ||
| // rank by 1, but rank drops by the number of peers of the new top | ||
| // row (new rank) in TopRows queue. | ||
| if constexpr (TRank == core::TopNRowNumberNode::RankFunction::kDenseRank) { | ||
| nextRank_ -= 1; | ||
| } else { | ||
| nextRank_ -= numPeers_; | ||
| numPeers_ = 1; | ||
| nextRank_ -= partition.numTopRankRows(); | ||
| } | ||
| } | ||
|
|
||
|
|
@@ -1111,18 +1109,32 @@ | |
|
|
||
| vector_size_t TopNRowNumber::TopRows::numTopRankRows() { | ||
| VELOX_CHECK(!rows.empty()); | ||
|
|
||
| tempTopRankRows.clear(); | ||
|
xiaoxmeng marked this conversation as resolved.
|
||
| SCOPE_EXIT { | ||
| tempTopRankRows.clear(); | ||
| }; | ||
| auto popAndSaveTopRow = [&]() { | ||
| tempTopRankRows.push_back(rows.top()); | ||
| rows.pop(); | ||
| }; | ||
|
Comment on lines
+1117
to
+1120
|
||
|
|
||
|
aditi-pandit marked this conversation as resolved.
|
||
| char* topRow = rows.top(); | ||
| vector_size_t numRows = 0; | ||
| const std::vector<char*, StlAllocator<char*>> partitionRowsVector = | ||
| PriorityQueueVector(rows); | ||
| for (const char* row : partitionRowsVector) { | ||
| if (rowComparator.compare(topRow, row) == 0) { | ||
| numRows += 1; | ||
| popAndSaveTopRow(); | ||
| while (!rows.empty()) { | ||
| if (rowComparator.compare(topRow, rows.top()) == 0) { | ||
| popAndSaveTopRow(); | ||
| } else { | ||
| break; | ||
| } | ||
| } | ||
| return numRows; | ||
|
|
||
| vector_size_t numTopRows = tempTopRankRows.size(); | ||
|
Check warning on line 1132 in velox/exec/TopNRowNumber.cpp
|
||
| // Re-insert all rows with the top rank row. | ||
| for (char* row : tempTopRankRows) { | ||
| rows.push(row); | ||
| } | ||
| return numTopRows; | ||
| } | ||
|
|
||
| bool TopNRowNumber::TopRows::isDuplicate( | ||
|
|
||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why can't this be const anymore? I don't see where you modify this below.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This function now uses partition.numTopRankRows() which needs non const variable.
Ideally we should be able to do this logic without mutable structures, but priority_queue doesn't really offer methods beyond top of queue member access. So the only option was to call pop() and check top() and then reinsert elements back.