Skip to content

[SPARK-34291][ML] LSH hashDistance optimization#31394

Closed
zhengruifeng wants to merge 2 commits intoapache:masterfrom
zhengruifeng:min_hash_distance_opt
Closed

[SPARK-34291][ML] LSH hashDistance optimization#31394
zhengruifeng wants to merge 2 commits intoapache:masterfrom
zhengruifeng:min_hash_distance_opt

Conversation

@zhengruifeng
Copy link
Contributor

@zhengruifeng zhengruifeng commented Jan 29, 2021

What changes were proposed in this pull request?

hashDistance optimization: if two vectors in a pair are the same, directly return 0.0

Why are the changes needed?

it should be faster than existing impl, because of short-circuit

Does this PR introduce any user-facing change?

No

How was this patch tested?

existing testsuites

nit

fix
@zhengruifeng zhengruifeng force-pushed the min_hash_distance_opt branch from 04ed8ce to fb05fd1 Compare January 29, 2021 10:28
@github-actions github-actions bot added the ML label Jan 29, 2021
@SparkQA
Copy link

SparkQA commented Jan 29, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/39239/

@SparkQA
Copy link

SparkQA commented Jan 29, 2021

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/39239/

@SparkQA
Copy link

SparkQA commented Jan 29, 2021

Test build #134651 has finished for PR 31394 at commit fb05fd1.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

override protected[ml] def hashDistance(x: Seq[Vector], y: Seq[Vector]): Double = {
// Since it's generated by hashing, it will be a pair of dense vectors.
x.zip(y).map(vectorPair => Vectors.sqdist(vectorPair._1, vectorPair._2)).min
// Currently each hash vector (generated by hashFunction) only has one element, this equals to:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this true? can't you have multiple hash functions? but the optimization would be OK even if not, I believe.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is possible that a hash vector has length>1, but in current impl (since 2.1), each vector has only one value.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Huh, that seems odd. So with N hash functions you get N 1-vectors, not 1 N-vector? I read SPARK-18454 referred to in the comments but wasn't clear why it was done this way.

Since you may have thought about this more - is this assumption always going to be true for these two implementations, so we don't need to assert about it? or do we need to check the dim to make sure this doesn't return the wrong answer if it ever changes?

Like was this put in place, do you think, to accommodate future algorithms that need to return longer vectors?

If not I wonder if this is another optimization opportunity, to stop wrapping all these in vectors to begin with.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So with N hash functions you get N 1-vectors, not 1 N-vector?

Yes, for both MinHash and BucketedRandomProjectionLSH.

is this assumption always going to be true for these two implementations, so we don't need to assert about it?

It seems that community had try to update this to N M-vectors, but seems inactive for a long time.

to accommodate future algorithms that need to return longer vectors

It maybe possible, so I tend to update this PR to not use the attribute of 1-Vector.

stop wrapping all these in vectors to begin with

I think we can not do this, since this column of type Array[Vector] had been already exposed to end user.

LSH is widely used, but current impl of LSH in mllib does not work well in my opinion. I will study it in the future.

@SparkQA
Copy link

SparkQA commented Feb 4, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/39449/

@SparkQA
Copy link

SparkQA commented Feb 4, 2021

Test build #134862 has finished for PR 31394 at commit f8b29dd.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Feb 4, 2021

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/39449/

@srowen
Copy link
Member

srowen commented Feb 6, 2021

Merged to master

@srowen srowen closed this in a9969fa Feb 6, 2021
@zhengruifeng zhengruifeng deleted the min_hash_distance_opt branch February 7, 2021 02:00
@zhengruifeng
Copy link
Contributor Author

thanks @srowen for reviewing!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants