
Conversation

@mpjlu

@mpjlu mpjlu commented Aug 10, 2017

What changes were proposed in this pull request?

When using Vector.compressed to convert a Vector to a SparseVector, performance is much lower than with Vector.toSparse.
This is because Vector.compressed scans the values three times, while Vector.toSparse only scans them twice.
When the vector is long, the performance difference between the two methods is significant.
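
For reference, here is a simplified sketch of the two pre-change code paths (method bodies abridged, and the compressed threshold shown only approximately), illustrating where the extra scan comes from:

  // compressed counts the non-zeros once to decide on a representation...
  def compressed: Vector = {
    val nnz = numNonzeros                         // scan 1: count the non-zeros
    if (1.5 * (nnz + 1.0) < size) toSparse else toDense
  }

  // ...and DenseVector.toSparse then counts them again before copying, so
  // compressed scans the values three times, while a direct toSparse call
  // scans them only twice.
  override def toSparse: SparseVector = {
    val nnz = numNonzeros                         // scan 2: counted a second time
    val ii = new Array[Int](nnz)
    val vv = new Array[Double](nnz)
    var k = 0
    foreachActive { (i, v) =>                     // scan 3: copy the non-zero entries
      if (v != 0.0) { ii(k) = i; vv(k) = v; k += 1 }
    }
    new SparseVector(size, ii, vv)
  }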

How was this patch tested?

The existing unit tests.

@SparkQA

SparkQA commented Aug 10, 2017

Test build #80473 has finished for PR 18899 at commit 5dc5c89.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@srowen
Member

srowen commented Aug 10, 2017

This isn't what was proposed in the JIRA?

@mpjlu
Author

mpjlu commented Aug 10, 2017

Yes. My concern is that if we add toSparse(size) and have to validate the size in the code (by comparing it with numNonzeros), there will be no performance gain. If we don't need to check the size, adding toSparse(size) is fine.
Thanks.

@srowen
Member

srowen commented Aug 10, 2017

Check what? You're just saving the extra call to numNonzeros. Change the declaration like so:

  def toSparse: SparseVector = toSparse(numNonzeros)

  private[linalg] def toSparse(nnz: Int): SparseVector

Then make the implementations override the new private method and use the given nnz arg, and change compressed to pass the nnz value it has already computed.
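
To make that concrete, a minimal sketch of the two follow-up edits described here (method bodies abridged; the compressed threshold is illustrative):

  // In DenseVector: override the private overload and trust the given nnz
  // instead of recomputing it.
  private[linalg] override def toSparse(nnz: Int): SparseVector = {
    val ii = new Array[Int](nnz)
    val vv = new Array[Double](nnz)
    var k = 0
    foreachActive { (i, v) =>
      if (v != 0.0) { ii(k) = i; vv(k) = v; k += 1 }
    }
    new SparseVector(size, ii, vv)
  }

  // In compressed: pass along the nnz that was just computed.
  val nnz = numNonzeros
  if (1.5 * (nnz + 1.0) < size) toSparse(nnz) else toDense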

@mpjlu
Author

mpjlu commented Aug 10, 2017

Thanks @srowen.
I will revise the code per your suggestion.
When I wrote the code, my concern was that a user might call toSparse(size) with a size that is too small.

@srowen
Member

srowen commented Aug 10, 2017

The user can't call toSparse(nnz). It will be private.

@SparkQA

SparkQA commented Aug 10, 2017

Test build #80485 has finished for PR 18899 at commit cebe600.

  • This patch fails MiMa tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Aug 10, 2017

Test build #80487 has finished for PR 18899 at commit d4f9e51.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.


-  override def toSparse: SparseVector = {
-    val nnz = numNonzeros
+  override def toSparse: SparseVector = toSparse(numNonzeros)
Member

This doesn't need to be overridden. Just define it in the superclass.

Author

If we define
def toSparse: SparseVector = toSparse(numNonzeros)
in the superclass, then calling dv.toSparse (there are calls like this in the code) produces the error:
Both toSparse in the DenseVector of type (nnz: Int)org.apache.spark.ml.linalg.SparseVector and toSparse in trait Vector of type => org.apache.spark.ml.linalg.SparseVector match.
So we should change the name of toSparse(nnz: Int), maybe to toSparseWithSize(nnz: Int).
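
A short sketch of that renamed shape in the trait, which keeps the public toSparse unchanged while sidestepping the ambiguity:

  def toSparse: SparseVector = toSparseWithSize(numNonzeros)

  private[linalg] def toSparseWithSize(nnz: Int): SparseVector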

val nnz = numNonzeros
override def toSparse: SparseVector = toSparse(numNonzeros)

@Since("2.3.0")
Member

This does not need @Since because it is private.

ProblemFilters.exclude[IncompatibleResultTypeProblem]("org.apache.spark.ml.regression.RandomForestRegressionModel.numTrees"),
ProblemFilters.exclude[IncompatibleResultTypeProblem]("org.apache.spark.ml.regression.RandomForestRegressionModel.setFeatureSubsetStrategy")
) ++ Seq(
// [SPARK-21680][ML][MLLIB]optimzie Vector coompress
Member

Hm, does this really cause a MiMa failure? What's the message? Is it about adding the new method to the interface? I think it could be OK because it's a sealed trait that user code can't implement. CC maybe @MLnick or @sethah or @jkbradley for a thought on that.

Author

The error message is: "method toSparse(nnz: Int) in trait is present only in current version".

@sethah
Contributor

sethah commented Aug 10, 2017

First suggestion is that there must be unit tests :)

@sethah
Contributor

sethah commented Aug 10, 2017

This approach doesn't feel right to me. The goal of the change is to avoid making a pass over the values to find out if there are any explicit zeros that need to be eliminated, which is fine. But instead of allowing the user to specify how many non-zero elements there are, we should allow them to specify a Boolean value for whether or not we should bother removing explicit zeros. Here's a small example to demonstrate why:

  val v = Vectors.dense(1, 2, 3)
  val sv = v.toSparse(2)

This raises a java.lang.ArrayIndexOutOfBoundsException, since I incorrectly specified the number of non-zero elements. If instead we use a Boolean like removeExplicitZeros, then when it is false we can just make the values array the size of the dense values in the dense case, and just return this in the sparse case. So I'm suggesting a method like:

  def toSparse(removeExplicitZeros: Boolean): SparseVector

That won't work as-is, though, because of ambiguous reference compile errors (another reason unit tests are so important). I ran into that problem before and never found a good solution, so you'll have to come up with a way around that.
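
For illustration only, a rough sketch of that Boolean variant on DenseVector (this is not what was merged, and as noted it runs into ambiguous-reference problems; `values` here is DenseVector's backing array):

  def toSparse(removeExplicitZeros: Boolean): SparseVector = {
    if (!removeExplicitZeros) {
      // Keep every entry, explicit zeros included: no counting pass is needed.
      // (A SparseVector implementation could simply return this.)
      new SparseVector(size, Array.range(0, size), values.clone())
    } else {
      // Otherwise drop the explicit zeros the usual way, with a counting pass.
      val nnz = numNonzeros
      val ii = new Array[Int](nnz)
      val vv = new Array[Double](nnz)
      var k = 0
      foreachActive { (i, v) => if (v != 0.0) { ii(k) = i; vv(k) = v; k += 1 } }
      new SparseVector(size, ii, vv)
    }
  }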

@sethah
Contributor

sethah commented Aug 10, 2017

Btw, I think the compile error is because v.toSparse(2) could resolve to either v.toSparse(nnz = 2) OR v.toSparse.apply(2).
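
For context, the element-access reading already works with the current public API, since Vector.toSparse returns a SparseVector, which has apply(i: Int); this illustrative snippet shows the call that the proposed overload would collide with:

  import org.apache.spark.ml.linalg.Vectors

  val v = Vectors.dense(1.0, 0.0, 3.0)
  val x: Double = v.toSparse(2)   // today this is v.toSparse.apply(2), i.e. element 2 == 3.0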

@srowen
Member

srowen commented Aug 10, 2017

The new method is private. Certainly the user is not intended to call it and supply nnz. This change shouldn't alter any semantics or functionality. It's just trying to avoid calculating nnz twice: to figure out if the vector is sparse, and then to convert it to sparse.

@sethah
Contributor

sethah commented Aug 10, 2017

OK, yes, I see it now. Though the point remains, to a lesser degree: we still have a method, albeit private, that indexes the array at potentially unsafe locations. It's probably OK, but at the very least we need a unit test to document the behavior.

@mpjlu
Author

mpjlu commented Aug 11, 2017

Hi @sethah, the unit test has been added. Thanks.

@SparkQA

SparkQA commented Aug 11, 2017

Test build #80517 has finished for PR 18899 at commit 4404b10.

  • This patch fails PySpark pip packaging tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@mpjlu
Author

mpjlu commented Aug 11, 2017

retest this please

@mpjlu
Author

mpjlu commented Aug 11, 2017

retest this please

@srowen
Member

srowen commented Aug 11, 2017

In theory there's no new functionality here so nothing new to test, but more tests never hurt.

This seems OK. Is there any other call site where nnz is already known?

It is a nontrivial bit of change though. How much does this speed things up, do you have any benchmark, for the record?

@mpjlu
Author

mpjlu commented Aug 11, 2017

For PR 18904: before this change, one iteration takes about 58s; after this change, one iteration takes about 40s.

@mpjlu
Author

mpjlu commented Aug 11, 2017

Hi @srowen, how about using our first version? It duplicates some code, but the change is small.

@srowen
Member

srowen commented Aug 11, 2017

No, duplicate code like that is bad.

@srowen
Member

srowen commented Aug 11, 2017

@mpjlu Sorry, which benchmark are you referring to? PR 18904 doesn't seem to benchmark just this change in isolation. I just want to be sure the gain is significant.

@mpjlu
Author

mpjlu commented Aug 11, 2017

I did not benchmark only this PR. I was working on PR 18904 and noticed this performance difference.

@sethah
Contributor

sethah commented Aug 11, 2017

I think there is new functionality: a new method whose behavior needs to be defined. As one specific example, we need a test like:

  test("toSparseWithSize") {
    val dv = Vectors.dense(1, 2, 3)
    withClue("toSparseWithSize fails on the wrong number of non-zeros") {
      intercept[java.lang.ArrayIndexOutOfBoundsException] {
        dv.toSparseWithSize(2)
      }
    }
  }

This is evidence to future developers that the potential failure of this method is known and intended. Also, we need a test for when we specify it incorrectly the other way, i.e. what is the expected outcome of:

  val dv = Vectors.dense(1, 2, 3)
  val sv = dv.toSparseWithSize(6)

Right now, I get the error java.lang.IllegalArgumentException: requirement failed: You provided 6 indices and values, which exceeds the specified vector size 3. These behaviors are undesirable IMO, but should at least be tested.

I don't believe there's any way to find a better solution without at least adding an O(nnz) operation. Honestly, some more specific performance results would be great to have here.
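
For completeness, a sketch of the complementary test for the over-specified case, mirroring the test shown earlier (the exception type matches the error reported above; the exact message is not asserted):

  test("toSparseWithSize with over-specified nnz") {
    val dv = Vectors.dense(1, 2, 3)
    withClue("toSparseWithSize fails when nnz exceeds the vector size") {
      intercept[IllegalArgumentException] {
        dv.toSparseWithSize(6)
      }
    }
  }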

@srowen
Member

srowen commented Aug 11, 2017

This isn't a public method though. The dv.toSparseWithSize(2) error will never come up unless Spark causes it, and there's no contract for its behavior in that case. It's probably over-specifying things to require it to throw AIOOBE, for example; nothing depends on that nor should.

It doesn't hurt to unit test though and the additional test seems fine.

@sethah
Contributor

sethah commented Aug 11, 2017

Ok, it's fairly safe since it's limited to private[linalg]. The confusion for me is that this method introduces all sorts of edge cases which have behavior that is not at all obvious or clear. If we don't think we need to test all of these edge cases, I'd like to at least be very explicit in the method doc, e.g.:

  /**
   * This method is used to avoid re-computing the number of non-zero elements when it is
   * already known. This method should only be called after computing the number of non-zero
   * elements via [[numNonzeros]]. e.g.
   *   {{{
   *     val nnz = this.numNonzeros
   *     val sv = toSparse(nnz)
   *   }}}
   *
   * If `nnz` is under-specified, a [[java.lang.ArrayIndexOutOfBoundsException]] is thrown.
   */

@SparkQA

SparkQA commented Aug 11, 2017

Test build #80529 has finished for PR 18899 at commit 4404b10.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@srowen
Copy link
Member

srowen commented Aug 13, 2017

OK, maybe include some of this text in the scaladoc for it, to make it clear it is always intended to be called with the value of numNonzeros.

@mpjlu
Author

mpjlu commented Aug 14, 2017

Thanks @sethah @srowen. The comment has been added.

@SparkQA

SparkQA commented Aug 14, 2017

Test build #80600 has finished for PR 18899 at commit d50de99.

  • This patch fails to generate documentation.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Aug 14, 2017

Test build #80602 has finished for PR 18899 at commit 83ac893.

  • This patch fails to generate documentation.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Aug 14, 2017

Test build #80606 has finished for PR 18899 at commit 7ab264d.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@mpjlu mpjlu changed the title [SPARK-21680][ML][MLLIB]optimzie Vector coompress [SPARK-21680][ML][MLLIB]optimize Vector compress Aug 15, 2017
@mpjlu
Author

mpjlu commented Aug 15, 2017

I have tested the performance of toSparse and toSparseWithSize separately. There is about a 35% performance improvement from this change.

@srowen
Member

srowen commented Aug 16, 2017

merged to master

@asfgit asfgit closed this in a0345cb Aug 16, 2017