[SPARK-21401][ML][MLLIB] add poll function for BoundedPriorityQueue #18620
Conversation
|
This isn't used anywhere though? |
|
Yes, my following PR will use it. |
|
Why not add the usage here, and make a JIRA? I don't see a reason to split them |
|
Ok, thanks @srowen . |
|
Test build #79580 has finished for PR 18620 at commit
|
|
Hi @srowen, I have added a test suite for BoundedPriorityQueue. Thanks. |
|
Test build #79603 has finished for PR 18620 at commit
|
|
Test build #79604 has finished for PR 18620 at commit
|
|
retest this please |
|
Test build #79606 has finished for PR 18620 at commit
|
|
@mpjlu I think we can close this right? |
|
Keep it or close it, either is OK with me. We have had a lot of discussion on: |
|
Although it could be rolled into #18624, since we're here, we could merge this. |
|
I have tested poll and toArray.sorted extensively. |
|
So in the first case there is a slight win with `array.sortBy` (but not
that much if I recall, they are more or less on par?) and in the second
case poll is a lot faster?
Given the default block size in ALS, the 1st scenario is by far the most
likely, right? But for very small item sizes it could be different (but in
that case, frankly performance won't really be any issue).
I mean, it's a private util so adding `poll` is not a big deal. It just
feels a little unnecessary.
…On Mon, 17 Jul 2017 at 13:37 Meng, Peng ***@***.***> wrote:
I have tested poll and toArray.sorted extensively.
If the queue is mostly ordered (say, 2000 offers into a queue of size 20),
pq.toArray.sorted is faster.
If the queue is mostly disordered (say, 100 offers into a queue of size
20), pq.poll is much faster.
|
|
Hi @MLnick , |
|
Fair enough
…On Mon, 17 Jul 2017 at 14:14, Meng, Peng ***@***.***> wrote:
Hi @MLnick <https://github.com/mlnick> ,
pq.toArray.sorted is also used in other places, such as Word2Vec and LDA. How
about waiting for my other benchmark results before deciding whether to close
it?
Thanks.
|
|
Hi @MLnick , @srowen . |
|
I'm not understanding why `def sortBy[B](f: A => B)(implicit ord: Ordering[B]): Repr = sorted(ord on f)` |
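To make the quoted definition concrete, here is a minimal sketch showing that `sortBy(f)` and `sorted(ord on f)` produce the same ordering. The object name `SortByVsSorted` and the sample data are hypothetical and chosen only for illustration; integer scores are used to keep the example simple.

```scala
object SortByVsSorted {
  // Hypothetical sample data: (word, score) pairs to be ranked by descending score.
  val scored: Array[(String, Int)] =
    Array(("a", 3), ("b", 9), ("c", 1), ("d", 5))

  // Sort by descending score using sortBy with a key function.
  def viaSortBy(xs: Array[(String, Int)]): Array[(String, Int)] =
    xs.sortBy(-_._2)

  // The same sort spelled out as in the quoted definition:
  // an Ordering on the key, lifted to the element type with `on`.
  def viaSorted(xs: Array[(String, Int)]): Array[(String, Int)] =
    xs.sorted(Ordering[Int].on[(String, Int)](-_._2))
}
```

Since `sortBy` is a thin wrapper over `sorted`, any large timing gap between the two inside a Spark job would presumably come from something other than the sort call itself.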
|
I am also very confused about this. You can change #18624 to use sorted and test it. |
|
My micro-benchmark (a program that tests only pq.toArray.sorted, pq.toArray.sortBy, and pq.poll) did not find a significant performance difference. Only in the Spark job is there a big difference. I'm confused. |
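A micro-benchmark along these lines could be sketched as follows. This is not the program used in the thread: it substitutes a plain `java.util.PriorityQueue` for Spark's private `BoundedPriorityQueue`, and the names `DrainBenchSketch`, `fill`, `snapshot`, and `timeNanos` are hypothetical. Single-shot `System.nanoTime` timings are noisy (JIT warm-up, GC), so the printed numbers should be read as rough indications only.

```scala
import java.util.{PriorityQueue => JPriorityQueue}
import scala.util.Random

object DrainBenchSketch {
  // Keep only the `maxSize` largest values, using a min-heap as a stand-in
  // for Spark's private BoundedPriorityQueue (an assumption of this sketch).
  def fill(values: Seq[Int], maxSize: Int): JPriorityQueue[Int] = {
    val pq = new JPriorityQueue[Int](maxSize, Ordering.Int)
    values.foreach { v =>
      if (pq.size < maxSize) pq.offer(v)
      else if (pq.peek() < v) { pq.poll(); pq.offer(v) }
    }
    pq
  }

  // Copy the queue contents (in heap order, not sorted) into an array.
  private def snapshot(pq: JPriorityQueue[Int]): Array[Int] = {
    val out = new Array[Int](pq.size)
    val it = pq.iterator()
    var i = 0
    while (it.hasNext) { out(i) = it.next(); i += 1 }
    out
  }

  // Approach 1: copy out, then sort ascending.
  def viaSorted(pq: JPriorityQueue[Int]): Array[Int] = snapshot(pq).sorted

  // Approach 2: copy out, then sortBy with an identity key.
  def viaSortBy(pq: JPriorityQueue[Int]): Array[Int] = snapshot(pq).sortBy(identity)

  // Approach 3: repeatedly poll the smallest element. This empties the queue.
  def viaPoll(pq: JPriorityQueue[Int]): Array[Int] = {
    val out = new Array[Int](pq.size)
    var i = 0
    while (!pq.isEmpty) { out(i) = pq.poll(); i += 1 }
    out
  }

  // Run `body` once and return its result with the elapsed nanoseconds.
  def timeNanos[A](body: => A): (A, Long) = {
    val start = System.nanoTime()
    val result = body
    (result, System.nanoTime() - start)
  }

  def main(args: Array[String]): Unit = {
    val rng = new Random(42)
    // The two scenarios from the thread: 2000 vs. 100 offers into a queue of size 20.
    for (offers <- Seq(2000, 100)) {
      val data = Seq.fill(offers)(rng.nextInt())
      val (q1, q2, q3) = (fill(data, 20), fill(data, 20), fill(data, 20))
      val (_, tSorted) = timeNanos(viaSorted(q1))
      val (_, tSortBy) = timeNanos(viaSortBy(q2))
      val (_, tPoll) = timeNanos(viaPoll(q3))
      println(s"offers=$offers sorted=${tSorted}ns sortBy=${tSortBy}ns poll=${tPoll}ns")
    }
  }
}
```

All three approaches return the retained elements in ascending order, so they can be swapped freely when comparing timings.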
|
That would make sense. There must be something else going on. Overall, I don't think it is compelling enough evidence to make the |
|
I am OK with closing this. Thanks @MLnick |
|
My benchmarks locally said poll() is a little faster on moderately large collections, like 100 elements in the queue. I'm really neutral. If it affords a little help, that's great. It's a natural method for a queue to have and no extra implementation cost. |
|
Merged to master. Hey, we gain some tests of this class, which has no tests now. |
What changes were proposed in this pull request?
Most usages of BoundedPriorityQueue in ML/MLlib follow the same pattern:
get the values from the BoundedPriorityQueue, then sort them.
For example, in Word2Vec: pq.toSeq.sortBy(-_._2)
and in ALS: pq.toArray.sorted()
Test results show that using pq.poll is much faster than sorting the values,
so it is worth adding a poll function to BoundedPriorityQueue.
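As a rough illustration of the idea, the sketch below shows a simplified bounded queue with a `poll` method and a drain loop that replaces the sort step. It is not the code from this pull request: `SimpleBoundedQueue` and `drainDescending` are hypothetical names, and the class is only a minimal stand-in for Spark's private BoundedPriorityQueue, built on `java.util.PriorityQueue`.

```scala
import java.util.{PriorityQueue => JPriorityQueue}

// Hypothetical, simplified stand-in for Spark's private BoundedPriorityQueue.
class SimpleBoundedQueue[A](maxSize: Int)(implicit ord: Ordering[A]) {
  // Min-heap under `ord`: the head is the smallest retained element.
  private val underlying = new JPriorityQueue[A](maxSize, ord)

  def size: Int = underlying.size

  // Insert `a`; when full, evict the current smallest element if `a` is larger.
  def +=(a: A): this.type = {
    if (underlying.size < maxSize) underlying.offer(a)
    else if (ord.gt(a, underlying.peek())) { underlying.poll(); underlying.offer(a) }
    this
  }

  // Remove and return the smallest retained element.
  def poll(): A = underlying.poll()
}

object SimpleBoundedQueue {
  // Drain the queue into a list ordered from largest to smallest.
  // poll() yields elements in ascending order, so prepending reverses them.
  def drainDescending[A](q: SimpleBoundedQueue[A]): List[A] = {
    var out = List.empty[A]
    while (q.size > 0) out = q.poll() :: out
    out
  }
}
```

Note that draining with `poll` empties the queue, whereas the `toArray.sorted` pattern leaves it intact; that difference matters if the queue is reused afterwards.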
How was this patch tested?
The existing unit tests.