Skip to content

Conversation

@mpjlu
Copy link

@mpjlu mpjlu commented Jul 13, 2017

What changes were proposed in this pull request?

The most of BoundedPriorityQueue usages in ML/MLLIB are:
Get the value of BoundedPriorityQueue, then sort it.
For example, in Word2Vec: pq.toSeq.sortBy(-_._2)
in ALS, pq.toArray.sorted()

The test results show using pq.poll is much faster than sort the value.
It is good to add the poll function for BoundedPriorityQueue.

How was this patch tested?

The existing UT

@srowen
Copy link
Member

srowen commented Jul 13, 2017

This isn't used anywhere though?

@mpjlu
Copy link
Author

mpjlu commented Jul 13, 2017

Yes, my following PR will use it.

@srowen
Copy link
Member

srowen commented Jul 13, 2017

Why not add the usage here, and make a JIRA? I don't see a reason to split them

@mpjlu
Copy link
Author

mpjlu commented Jul 13, 2017

Ok, thanks @srowen .
I will create a JIRA, and show the usage and performance comparing.

@SparkQA
Copy link

SparkQA commented Jul 13, 2017

Test build #79580 has finished for PR 18620 at commit 5c80798.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@mpjlu mpjlu changed the title [MINOR][ML][MLLIB] add poll function for BoundedPriorityQueue [SPARK-21401][ML][MLLIB] add poll function for BoundedPriorityQueue Jul 13, 2017
@mpjlu
Copy link
Author

mpjlu commented Jul 14, 2017

Hi @srowen , I have added Test Suite for BoundedPriorityQueue. Thanks.

@SparkQA
Copy link

SparkQA commented Jul 14, 2017

Test build #79603 has finished for PR 18620 at commit 5e0ab01.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Jul 14, 2017

Test build #79604 has finished for PR 18620 at commit b6630db.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@mpjlu
Copy link
Author

mpjlu commented Jul 14, 2017

retest this please

@SparkQA
Copy link

SparkQA commented Jul 14, 2017

Test build #79606 has finished for PR 18620 at commit b6630db.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@MLnick
Copy link
Contributor

MLnick commented Jul 17, 2017

@mpjlu I think we can close this right?

@mpjlu
Copy link
Author

mpjlu commented Jul 17, 2017

Keep it or close it, both is ok for me. We have much discussion on:
https://issues.apache.org/jira/browse/SPARK-21401

@srowen
Copy link
Member

srowen commented Jul 17, 2017

Although it could be rolled into #18624, since we're here, we could merge this.

@mpjlu
Copy link
Author

mpjlu commented Jul 17, 2017

I have tested much about poll and toArray.sorted.
If the queue is much ordered (suppose offer 2000 times for queue size 20). Use pq.toArray.sorted is faster.
If the queue is much disordered (suppose offer 100 times for queue size 20), Use pq.poll is much faster.

@MLnick
Copy link
Contributor

MLnick commented Jul 17, 2017 via email

@mpjlu
Copy link
Author

mpjlu commented Jul 17, 2017

Hi @MLnick ,
pq.toArray.sorted also used in other places, like word2vector and LDA, how about waiting for my other benchmark results. Then decide to close it or not.
Thanks.

@MLnick
Copy link
Contributor

MLnick commented Jul 17, 2017 via email

@mpjlu
Copy link
Author

mpjlu commented Jul 18, 2017

Hi @MLnick , @srowen .
My test showing: pq.poll is not significantly faster than pq.toArray.sortBy, but significantly faster than pq.toArray.sorted. Seems not each pq.toArray.sorted (such as used in topByKey) can be replaced by pq.toArray.sortBy, so use pq.poll to replace pq.toArray.sorted will benefit.
You can compare the performance of pq.sorted, pq.sortBy, and pq.poll using: #18624
The performance of pq.toArray.sortBy is about the same as pq.poll, and about 20% improvement comparing pq.toArray.sorted.

@MLnick
Copy link
Contributor

MLnick commented Jul 18, 2017

I'm not understanding why sorted is slower than sortBy - sortBy uses sorted in its implementation:

def sortBy[B](f: A => B)(implicit ord: Ordering[B]): Repr = sorted(ord on f)

@mpjlu
Copy link
Author

mpjlu commented Jul 18, 2017

I also very confused about this. You can change #18624 to sorted and test.

@mpjlu
Copy link
Author

mpjlu commented Jul 18, 2017

My micro benchmark (write a program only test pq.toArray.sorted and pq.Array.sortBy and pq.poll), not find significant performance difference. Only in the Spark job, there is big difference. Confused.

@MLnick
Copy link
Contributor

MLnick commented Jul 18, 2017

That would make sense. There must be something else going on. Overall, I don't think it is compelling enough evidence to make the poll change. (Though as mentioned it's not a huge deal so if others want to do it, no objection)

@mpjlu
Copy link
Author

mpjlu commented Jul 18, 2017

I am ok to close this. Thanks @MLnick

@srowen
Copy link
Member

srowen commented Jul 18, 2017

My benchmarks locally said poll() is a little faster on moderately large collections, like 100 elements in the queue. I'm really neutral. If it affords a little help, that's great. It's a natural method for a queue to have and no extra implementation cost.

@mpjlu
Copy link
Author

mpjlu commented Jul 18, 2017

Thanks @srowen , my test also said pq.poll is a little faster on some cases.
One possible benefit here is if we provide pq.poll, user's first choice may use pq.poll, not pq.toArray.sorted, which may causes performance reduction. As I have encounter for #18624

@srowen
Copy link
Member

srowen commented Jul 19, 2017

Merged to master. Hey, we gain some tests of this class, which has no tests now.

@asfgit asfgit closed this in 46307b2 Jul 19, 2017
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants