[SPARK-34291][ML] LSH hashDistance optimization #31394
zhengruifeng wants to merge 2 commits into apache:master
Conversation
04ed8ce to fb05fd1
Kubernetes integration test starting
Kubernetes integration test status success
Test build #134651 has finished for PR 31394 at commit
```scala
override protected[ml] def hashDistance(x: Seq[Vector], y: Seq[Vector]): Double = {
  // Since it's generated by hashing, it will be a pair of dense vectors.
  x.zip(y).map(vectorPair => Vectors.sqdist(vectorPair._1, vectorPair._2)).min
  // Currently each hash vector (generated by hashFunction) only has one element, this equals to:
```
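The optimization proposed in this PR adds a short-circuit: when the two hash sequences are identical, the distance is 0.0 by definition, so none of the per-pair squared distances need to be computed. A standalone sketch of the idea, using plain arrays and a local `sqdist` helper as hypothetical stand-ins for Spark's `Vector` and `Vectors.sqdist`:

```scala
object HashDistanceSketch {
  // Local stand-in for org.apache.spark.ml.linalg.Vectors.sqdist
  def sqdist(a: Array[Double], b: Array[Double]): Double =
    a.zip(b).map { case (u, v) => (u - v) * (u - v) }.sum

  // hashDistance with the short-circuit this PR describes: identical hash
  // sequences trivially have distance 0.0, so skip the sqdist computations.
  def hashDistance(x: Seq[Array[Double]], y: Seq[Array[Double]]): Double =
    if (x.length == y.length && x.zip(y).forall { case (a, b) => a.sameElements(b) }) 0.0
    else x.zip(y).map { case (a, b) => sqdist(a, b) }.min
}
```

The equality check is cheap relative to computing a squared distance per hash table, which is where the speedup for self-joins (where many pairs compare a row against itself) comes from.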
Is this true? Can't you have multiple hash functions? But the optimization would be OK even if not, I believe.
It is possible for a hash vector to have length > 1, but in the current impl (since 2.1), each vector has only one value.
Huh, that seems odd. So with N hash functions you get N 1-vectors, not 1 N-vector? I read SPARK-18454 referred to in the comments but wasn't clear why it was done this way.
Since you may have thought about this more: is this assumption always going to be true for these two implementations, so we don't need to assert it? Or do we need to check the dim to make sure this doesn't return the wrong answer if it ever changes?
Like was this put in place, do you think, to accommodate future algorithms that need to return longer vectors?
If not I wonder if this is another optimization opportunity, to stop wrapping all these in vectors to begin with.
> So with N hash functions you get N 1-vectors, not 1 N-vector?

Yes, for both MinHash and BucketedRandomProjectionLSH.
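To make the shape in question concrete: with N hash tables, the hash column holds N separate vectors of length 1, not one vector of length N. A minimal plain-Scala sketch of a MinHash-style `hashFunction` (the `prime` and the `(a, b)` coefficients below are made-up illustration values, not the ones Spark uses):

```scala
object MinHashShape {
  // Hypothetical coefficients for N = 3 hash functions h_i(e) = (a_i * (e + 1) + b_i) mod prime
  val prime = 2038074743L
  val coeffs = Seq((1L, 7L), (3L, 11L), (5L, 13L))

  // Each hash function contributes ONE value, wrapped in its own
  // length-1 array -- hence "N 1-vectors, not 1 N-vector".
  def hashFunction(elems: Seq[Long]): Seq[Array[Double]] =
    coeffs.map { case (a, b) =>
      Array(elems.map(e => ((a * (e + 1) + b) % prime).toDouble).min)
    }
}
```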
> is this assumption always going to be true for these two implementations, so we don't need to assert about it?

It seems the community had tried to update this to N M-vectors, but that work has been inactive for a long time.

> to accommodate future algorithms that need to return longer vectors

It may be possible, so I tend to update this PR to not rely on the hash vectors being length-1.

> stop wrapping all these in vectors to begin with

I don't think we can do this, since this column of type Array[Vector] has already been exposed to end users.

LSH is widely used, but the current impl of LSH in mllib does not work well in my opinion. I will study it in the future.
mllib/src/main/scala/org/apache/spark/ml/feature/MinHashLSH.scala
Kubernetes integration test starting
Test build #134862 has finished for PR 31394 at commit
Kubernetes integration test status success
Merged to master

Thanks @srowen for reviewing!
What changes were proposed in this pull request?
`hashDistance` optimization: if the two vectors in a pair are the same, directly return 0.0.

Why are the changes needed?

It should be faster than the existing impl, because of the short-circuit.
Does this PR introduce any user-facing change?
No
How was this patch tested?
Existing test suites.