[SPARK-21680][ML][MLLIB] optimize Vector compress #18899
Conversation
Test build #80473 has finished for PR 18899 at commit
This isn't what was proposed in the JIRA?
Yes. My concern is that if we add toSparse(size) and then have to validate the size in the code (by comparing it with numNonzeros), there will be no performance gain. If we don't need to check the size, adding toSparse(size) is fine.
Check what? You're just saving the extra call to numNonzeros. Change the declaration, then make the implementations override the new private method and use the given nnz arg, and change the callers accordingly.
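Presumably the suggestion is along these lines; this is only a sketch of the declarations reconstructed from the discussion (the private method is renamed later in the thread to dodge an overload ambiguity):

```scala
// In trait Vector: a private variant that accepts a precomputed non-zero count,
// so callers that already know nnz can skip the extra O(size) counting pass.
private[linalg] def toSparse(nnz: Int): SparseVector

// In the implementations (DenseVector / SparseVector): the public method
// simply counts once and delegates to the private variant.
override def toSparse: SparseVector = toSparse(numNonzeros)
```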
Thanks @srowen.
The user can't call toSparse(nnz). It will be private.
Test build #80485 has finished for PR 18899 at commit
Test build #80487 has finished for PR 18899 at commit
```diff
- override def toSparse: SparseVector = {
-   val nnz = numNonzeros
+ override def toSparse: SparseVector = toSparse(numNonzeros)
```
This doesn't need to be overridden. Just define it in the superclass
If we define

```scala
def toSparse: SparseVector = toSparse(numNonzeros)
```

in the superclass, then calls such as dv.toSparse (there are calls of this kind in the code) fail with an error like: "Both toSparse in the DenseVector of type (nnz: Int) org.apache.spark.ml.linalg.SparseVector and toSparse in trait Vector of type => org.apache.spark.ml.linalg.SparseVector match." So we should rename toSparse(nnz: Int), maybe to toSparseWithSize(nnz: Int).
```scala
@Since("2.3.0")
```
Does not need @Since because it is private.
```scala
  ProblemFilters.exclude[IncompatibleResultTypeProblem]("org.apache.spark.ml.regression.RandomForestRegressionModel.numTrees"),
  ProblemFilters.exclude[IncompatibleResultTypeProblem]("org.apache.spark.ml.regression.RandomForestRegressionModel.setFeatureSubsetStrategy")
) ++ Seq(
  // [SPARK-21680][ML][MLLIB] optimize Vector compress
```
Hm, does this really cause a MiMa failure? What's the message? Is it about adding the new method to the interface? I think it could be OK because it's a sealed trait that user code can't implement. CC maybe @MLnick or @sethah or @jkbradley for a thought on that.
The error message is: "method toSparse(nnz: Int) in trait is present only in current version".
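If so, the exclusion presumably ends up looking roughly like the sketch below; the problem type and fully qualified method names here are my guesses and should be copied from the actual MiMa output rather than from this.

```scala
import com.typesafe.tools.mima.core._

// Hypothetical MimaExcludes entries for a method newly added to the sealed Vector trait.
// ReversedMissingMethodProblem is the problem type MiMa commonly reports for
// "is present only in current version" on trait members.
Seq(
  // [SPARK-21680][ML][MLLIB] optimize Vector compress
  ProblemFilters.exclude[ReversedMissingMethodProblem]("org.apache.spark.ml.linalg.Vector.toSparseWithSize"),
  ProblemFilters.exclude[ReversedMissingMethodProblem]("org.apache.spark.mllib.linalg.Vector.toSparseWithSize")
)
```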
First suggestion is that there must be unit tests :)
This approach doesn't feel right to me. The goal of the change is to avoid making a pass over the values to find out if there are any explicit zeros that need to be eliminated, which is fine. Instead of allowing the user to specify how many non-zero elements there are, we should instead allow them to specify a Boolean value on whether or not we should bother removing explicit zeros. Here's a small example to demonstrate why:

```scala
val v = Vectors.dense(1, 2, 3)
val sv = v.toSparse(2)
```

This raises an ArrayIndexOutOfBoundsException. What I have in mind is

```scala
def toSparse(removeExplicitZeros: Boolean): SparseVector
```

That won't work anyway because of ambiguous reference compile errors (another reason unit tests are so important). I ran into that problem before, and never found a good solution, and so you'll have to come up with a way around that.
Btw, I think the compile error is because …
The new method is private. Certainly the user is not intended to call it and supply nnz. This change shouldn't alter any semantics or functionality. It's just trying to avoid calculating nnz twice: to figure out if the vector is sparse, and then to convert it to sparse.
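Concretely, the call site in question is compressed. A rough sketch of the intent (not the exact Spark source, and using the toSparseWithSize name the thread converges on):

```scala
// Sketch: compressed already counts the non-zeros to pick a storage format,
// so it can hand that count to the sparse conversion instead of recounting.
def compressed: Vector = {
  val nnz = numNonzeros
  // A dense vector needs roughly 8 * size bytes, a sparse one roughly 12 * nnz bytes.
  if (1.5 * (nnz + 1.0) < size) {
    toSparseWithSize(nnz)  // reuses the count: one full pass saved vs. calling toSparse
  } else {
    toDense
  }
}
```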
Ok, yes, I see it now. Though the point remains, to a lesser degree: we still have a method, albeit private, that indexes the array at potentially unsafe locations. It's probably ok, but at the very least we need a unit test to document the behavior.
Hi @sethah, the unit test is added. Thanks.
Test build #80517 has finished for PR 18899 at commit
retest this please
In theory there's no new functionality here, so nothing new to test, but more tests never hurt. This seems OK. Is there any other call site where nnz is already known? It is a nontrivial bit of change, though. How much does this speed things up? Do you have any benchmark, for the record?
For PR 18904: before this change, one iteration takes about 58s; after this change, about 40s.
Hi @srowen, how about using our first version? It duplicates some code, but the change is small.
No, duplicate code like that is bad.
@mpjlu sorry, which benchmark are you referring to? PR 18904 doesn't seem to benchmark just this in isolation. I just want to be sure the gain is significant.
I did not benchmark this PR on its own. I was working on PR 18904 and noticed this performance difference.
I think there is new functionality: a new method that needs its functionality defined. As one specific example, we need a test like:

```scala
test("toSparseWithSize") {
  val dv = Vectors.dense(1, 2, 3)
  withClue("toSparseWithSize fails on the wrong number of non-zeros") {
    intercept[java.lang.ArrayIndexOutOfBoundsException] {
      dv.toSparseWithSize(2)
    }
  }
}
```

This is evidence to future developers that the potential failure of this method is known and intended. Also, we need a test for when we specify it incorrectly the other way, i.e. what is the expected outcome of:

```scala
val dv = Vectors.dense(1, 2, 3)
val sv = dv.toSparseWithSize(6)
```

Right now, I get the error … I don't believe there's any way to find a better solution without at least adding an O(nnz) operation. Honestly, some more specific performance results would be great to have here.
This isn't a public method though. The dv.toSparseWithSize(2) error will never come up unless Spark causes it, and there's no contract for its behavior in that case. It's probably over-specifying things to require it to throw AIOOBE, for example; nothing depends on that, nor should it. It doesn't hurt to unit test though, and the additional test seems fine.
Ok, it's fairly safe since it's limited to …

```scala
/**
 * This method is used to avoid re-computing the number of non-zero elements when it is
 * already known. This method should only be called after computing the number of non-zero
 * elements via [[numNonZeros]]. e.g.
 * {{{
 *   val nnz = this.numNonZeros
 *   val sv = toSparse(nnz)
 * }}}
 *
 * If `nnz` is under-specified, a [[java.lang.ArrayIndexOutOfBoundsException]] is thrown.
 */
```
Test build #80529 has finished for PR 18899 at commit
OK, maybe include some of this text in the scaladoc for it, to make it clear it is always intended to be called with the value of numNonzeros.
Test build #80600 has finished for PR 18899 at commit
Test build #80602 has finished for PR 18899 at commit
Test build #80606 has finished for PR 18899 at commit
I have tested the performance of toSparse and toSparseWithSize separately. There is about a 35% performance improvement from this change.
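For the record, a rough way to sanity-check a number like that with only the public API is to time the counting pass (which toSparseWithSize skips) against a full toSparse. This is only a sketch; the vector size and density below are arbitrary choices of mine:

```scala
import org.apache.spark.ml.linalg.Vectors

// Hypothetical micro-benchmark sketch: the gap between these two timings is
// roughly the counting pass that toSparseWithSize avoids repeating.
object ToSparseBench {
  def time[T](label: String)(body: => T): T = {
    val start = System.nanoTime()
    val result = body
    println(f"$label%-26s ${(System.nanoTime() - start) / 1e6}%.1f ms")
    result
  }

  def main(args: Array[String]): Unit = {
    // ~10% non-zeros, similar in spirit to the vectors Vector.compressed targets.
    val v = Vectors.dense(Array.tabulate(10000000)(i => if (i % 10 == 0) 1.0 else 0.0))

    time("numNonzeros (count only)")(v.numNonzeros)
    time("toSparse (count + copy)")(v.toSparse)
  }
}
```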
merged to master |
What changes were proposed in this pull request?
When Vector.compressed is used to convert a Vector to a SparseVector, the performance is much lower than with Vector.toSparse. This is because Vector.compressed has to scan the values three times, while Vector.toSparse only needs two passes. When the vector is long, there is a significant performance difference between the two methods.
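To make the pass counting explicit, here is a small illustration using only the public API; the comments describe the pre-change behavior as summarized above, and toSparseWithSize is the private helper this PR adds:

```scala
import org.apache.spark.ml.linalg.Vectors

// A vector with many explicit zeros, the case this PR targets.
val v = Vectors.dense(Array.tabulate(1000000)(i => if (i % 10 == 0) 1.0 else 0.0))

// toSparse: pass 1 counts the non-zeros (numNonzeros), pass 2 copies indices/values.
val sv = v.toSparse

// compressed, before this change: pass 1 counts non-zeros to choose dense vs. sparse,
// then toSparse counts them again (pass 2) and copies (pass 3).
// After this change, compressed forwards the first count via toSparseWithSize,
// so only two passes remain.
val cv = v.compressed
```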
How was this patch tested?
The existing unit tests.