[SPARK-34291][ML] LSH hashDistance optimization #31394
zhengruifeng wants to merge 2 commits into apache:master
Conversation
04ed8ce to fb05fd1
Kubernetes integration test starting
Kubernetes integration test status success
Test build #134651 has finished for PR 31394 at commit
```scala
override protected[ml] def hashDistance(x: Seq[Vector], y: Seq[Vector]): Double = {
  // Since it's generated by hashing, it will be a pair of dense vectors.
  x.zip(y).map(vectorPair => Vectors.sqdist(vectorPair._1, vectorPair._2)).min
  // Currently each hash vector (generated by hashFunction) only has one element, this equals to:
```
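The optimization proposed in this PR adds a short-circuit: when the two hash sequences are identical, the distance is 0.0 by definition, so none of the per-pair squared distances need to be computed. A standalone sketch of the idea, using plain arrays and a local `sqdist` helper as hypothetical stand-ins for Spark's `Vector` and `Vectors.sqdist`:

```scala
object HashDistanceSketch {
  // Local stand-in for org.apache.spark.ml.linalg.Vectors.sqdist
  def sqdist(a: Array[Double], b: Array[Double]): Double =
    a.zip(b).map { case (u, v) => (u - v) * (u - v) }.sum

  // hashDistance with the short-circuit this PR describes: identical hash
  // sequences trivially have distance 0.0, so skip the sqdist computations.
  def hashDistance(x: Seq[Array[Double]], y: Seq[Array[Double]]): Double =
    if (x.length == y.length && x.zip(y).forall { case (a, b) => a.sameElements(b) }) 0.0
    else x.zip(y).map { case (a, b) => sqdist(a, b) }.min
}
```

The equality check is cheap relative to computing a squared distance per hash table, which is where the speedup for self-joins (where many pairs compare a row against itself) comes from.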
Is this true? Can't you have multiple hash functions? But the optimization would be OK even if not, I believe.
It is possible for a hash vector to have length > 1, but in the current impl (since 2.1), each vector has only one value.
Huh, that seems odd. So with N hash functions you get N 1-vectors, not 1 N-vector? I read SPARK-18454 referred to in the comments but wasn't clear why it was done this way.
Since you may have thought about this more: is this assumption always going to be true for these two implementations, so we don't need to assert it? Or do we need to check the dim to make sure this doesn't return the wrong answer if it ever changes?
Like was this put in place, do you think, to accommodate future algorithms that need to return longer vectors?
If not I wonder if this is another optimization opportunity, to stop wrapping all these in vectors to begin with.
> So with N hash functions you get N 1-vectors, not 1 N-vector?

Yes, for both MinHash and BucketedRandomProjectionLSH.
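To make the shape in question concrete: with N hash tables, the hash column holds N separate vectors of length 1, not one vector of length N. A minimal plain-Scala sketch of a MinHash-style `hashFunction` (the `prime` and the `(a, b)` coefficients below are made-up illustration values, not the ones Spark uses):

```scala
object MinHashShape {
  // Hypothetical coefficients for N = 3 hash functions h_i(e) = (a_i * (e + 1) + b_i) mod prime
  val prime = 2038074743L
  val coeffs = Seq((1L, 7L), (3L, 11L), (5L, 13L))

  // Each hash function contributes ONE value, wrapped in its own
  // length-1 array -- hence "N 1-vectors, not 1 N-vector".
  def hashFunction(elems: Seq[Long]): Seq[Array[Double]] =
    coeffs.map { case (a, b) =>
      Array(elems.map(e => ((a * (e + 1) + b) % prime).toDouble).min)
    }
}
```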
> is this assumption always going to be true for these two implementations, so we don't need to assert about it?

It seems the community had tried to update this to N M-vectors, but that work has been inactive for a long time.

> to accommodate future algorithms that need to return longer vectors

It may be possible, so I tend to update this PR to not rely on the hash vectors being length-1.

> stop wrapping all these in vectors to begin with

I don't think we can do this, since this column of type Array[Vector] has already been exposed to end users.

LSH is widely used, but the current impl of LSH in mllib does not work well in my opinion. I will study it in the future.
mllib/src/main/scala/org/apache/spark/ml/feature/MinHashLSH.scala
Kubernetes integration test starting
Test build #134862 has finished for PR 31394 at commit
Kubernetes integration test status success
Merged to master

Thanks @srowen for reviewing!
What changes were proposed in this pull request?
`hashDistance` optimization: if the two vectors in a pair are the same, directly return 0.0.

Why are the changes needed?

It should be faster than the existing impl, because of the short-circuit.
Does this PR introduce any user-facing change?
No
How was this patch tested?
Existing test suites.