
Conversation

@mpjlu

@mpjlu mpjlu commented Aug 10, 2017

What changes were proposed in this pull request?

When using Vector.compressed to convert a Vector to a SparseVector, performance is much lower than with Vector.toSparse.
This is because Vector.compressed scans the values three times, while Vector.toSparse only scans them twice.
When the vector is long, the performance difference between the two methods is significant.
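
For reference, here is a simplified sketch of the two pre-change code paths (method bodies abridged, and the compressed threshold shown only approximately), illustrating where the extra scan comes from:

  // compressed counts the non-zeros once to decide on a representation...
  def compressed: Vector = {
    val nnz = numNonzeros                         // scan 1: count the non-zeros
    if (1.5 * (nnz + 1.0) < size) toSparse else toDense
  }

  // ...and DenseVector.toSparse then counts them again before copying, so
  // compressed scans the values three times, while a direct toSparse call
  // scans them only twice.
  override def toSparse: SparseVector = {
    val nnz = numNonzeros                         // scan 2: counted a second time
    val ii = new Array[Int](nnz)
    val vv = new Array[Double](nnz)
    var k = 0
    foreachActive { (i, v) =>                     // scan 3: copy the non-zero entries
      if (v != 0.0) { ii(k) = i; vv(k) = v; k += 1 }
    }
    new SparseVector(size, ii, vv)
  }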

How was this patch tested?

The existing unit tests.

@SparkQA

SparkQA commented Aug 10, 2017

Test build #80473 has finished for PR 18899 at commit 5dc5c89.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@srowen
Member

srowen commented Aug 10, 2017

This isn't what was proposed in the JIRA?

@mpjlu
Author

mpjlu commented Aug 10, 2017

Yes. My concern is that if we add toSparse(size) and have to validate the size in the code (by comparing it with numNonzeros), there will be no performance gain. If we don't need to check the size, adding toSparse(size) is fine.
Thanks.

@srowen
Member

srowen commented Aug 10, 2017

Check what? You're just saving the extra call to numNonzeros. Change the declaration like so:

  def toSparse: SparseVector = toSparse(numNonzeros)

  private[linalg] def toSparse(nnz: Int): SparseVector

Then make the implementations override the new private method and use the given nnz arg, and change compressed to pass the nnz value it has already computed.
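
To make that concrete, a minimal sketch of the two follow-up edits described here (method bodies abridged; the compressed threshold is illustrative):

  // In DenseVector: override the private overload and trust the given nnz
  // instead of recomputing it.
  private[linalg] override def toSparse(nnz: Int): SparseVector = {
    val ii = new Array[Int](nnz)
    val vv = new Array[Double](nnz)
    var k = 0
    foreachActive { (i, v) =>
      if (v != 0.0) { ii(k) = i; vv(k) = v; k += 1 }
    }
    new SparseVector(size, ii, vv)
  }

  // In compressed: pass along the nnz that was just computed.
  val nnz = numNonzeros
  if (1.5 * (nnz + 1.0) < size) toSparse(nnz) else toDense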

@mpjlu
Author

mpjlu commented Aug 10, 2017

Thanks @srowen.
I will revise the code per your suggestion.
When I wrote the code, my concern was that a user might call toSparse(size) with a size that is too small.

@srowen
Member

srowen commented Aug 10, 2017

The user can't call toSparse(nnz). It will be private.

@SparkQA

SparkQA commented Aug 10, 2017

Test build #80485 has finished for PR 18899 at commit cebe600.

  • This patch fails MiMa tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Aug 10, 2017

Test build #80487 has finished for PR 18899 at commit d4f9e51.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.


-  override def toSparse: SparseVector = {
-    val nnz = numNonzeros
+  override def toSparse: SparseVector = toSparse(numNonzeros)
Member

This doesn't need to be overridden. Just define it in the superclass.

Author

If we define
def toSparse: SparseVector = toSparse(numNonzeros)
in the superclass, then calling dv.toSparse (there are calls like this in the code) produces the error:
Both toSparse in the DenseVector of type (nnz: Int)org.apache.spark.ml.linalg.SparseVector and toSparse in trait Vector of type => org.apache.spark.ml.linalg.SparseVector match.
So we should change the name of toSparse(nnz: Int), maybe to toSparseWithSize(nnz: Int).
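
A short sketch of that renamed shape in the trait, which keeps the public toSparse unchanged while sidestepping the ambiguity:

  def toSparse: SparseVector = toSparseWithSize(numNonzeros)

  private[linalg] def toSparseWithSize(nnz: Int): SparseVector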

val nnz = numNonzeros
override def toSparse: SparseVector = toSparse(numNonzeros)

@Since("2.3.0")
Member

This does not need @Since because it is private.

ProblemFilters.exclude[IncompatibleResultTypeProblem]("org.apache.spark.ml.regression.RandomForestRegressionModel.numTrees"),
ProblemFilters.exclude[IncompatibleResultTypeProblem]("org.apache.spark.ml.regression.RandomForestRegressionModel.setFeatureSubsetStrategy")
) ++ Seq(
// [SPARK-21680][ML][MLLIB]optimzie Vector coompress
Member

Hm, does this really cause a MiMa failure? What's the message? Is it about adding the new method to the interface? I think it could be OK because it's a sealed trait that user code can't implement. CC maybe @MLnick or @sethah or @jkbradley for a thought on that.

Author

The error message is: "method toSparse(nnz: Int) in trait is present only in current version".

@sethah
Contributor

sethah commented Aug 10, 2017

First suggestion is that there must be unit tests :)

@sethah
Contributor

sethah commented Aug 10, 2017

This approach doesn't feel right to me. The goal of the change is to avoid making a pass over the values to find out if there are any explicit zeros that need to be eliminated, which is fine. But instead of allowing the user to specify how many non-zero elements there are, we should allow them to specify a Boolean value for whether or not we should bother removing explicit zeros. Here's a small example to demonstrate why:

  val v = Vectors.dense(1, 2, 3)
  val sv = v.toSparse(2)

This raises a java.lang.ArrayIndexOutOfBoundsException, since I incorrectly specified the number of non-zero elements. If instead we use a Boolean like removeExplicitZeros, then when it is false we can just make the values array the size of the dense values in the dense case, and just return this in the sparse case. So I'm suggesting a method like:

  def toSparse(removeExplicitZeros: Boolean): SparseVector

That won't work as-is, though, because of ambiguous reference compile errors (another reason unit tests are so important). I ran into that problem before and never found a good solution, so you'll have to come up with a way around that.
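
For illustration only, a rough sketch of that Boolean variant on DenseVector (this is not what was merged, and as noted it runs into ambiguous-reference problems; `values` here is DenseVector's backing array):

  def toSparse(removeExplicitZeros: Boolean): SparseVector = {
    if (!removeExplicitZeros) {
      // Keep every entry, explicit zeros included: no counting pass is needed.
      // (A SparseVector implementation could simply return this.)
      new SparseVector(size, Array.range(0, size), values.clone())
    } else {
      // Otherwise drop the explicit zeros the usual way, with a counting pass.
      val nnz = numNonzeros
      val ii = new Array[Int](nnz)
      val vv = new Array[Double](nnz)
      var k = 0
      foreachActive { (i, v) => if (v != 0.0) { ii(k) = i; vv(k) = v; k += 1 } }
      new SparseVector(size, ii, vv)
    }
  }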

@sethah
Contributor

sethah commented Aug 10, 2017

Btw, I think the compile error is because v.toSparse(2) could resolve to either v.toSparse(nnz = 2) OR v.toSparse.apply(2).
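
For context, the element-access reading already works with the current public API, since Vector.toSparse returns a SparseVector, which has apply(i: Int); this illustrative snippet shows the call that the proposed overload would collide with:

  import org.apache.spark.ml.linalg.Vectors

  val v = Vectors.dense(1.0, 0.0, 3.0)
  val x: Double = v.toSparse(2)   // today this is v.toSparse.apply(2), i.e. element 2 == 3.0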

@srowen
Member

srowen commented Aug 10, 2017

The new method is private. Certainly the user is not intended to call it and supply nnz. This change shouldn't alter any semantics or functionality. It's just trying to avoid calculating nnz twice: to figure out if the vector is sparse, and then to convert it to sparse.

@sethah
Contributor

sethah commented Aug 10, 2017

OK, yes, I see it now. Though the point remains, to a lesser degree: we still have a method, albeit private, that indexes the array at potentially unsafe locations. It's probably OK, but at the very least we need a unit test to document the behavior.

@mpjlu
Author

mpjlu commented Aug 11, 2017

Hi @sethah, the unit test has been added. Thanks.

@SparkQA

SparkQA commented Aug 11, 2017

Test build #80517 has finished for PR 18899 at commit 4404b10.

  • This patch fails PySpark pip packaging tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@mpjlu
Author

mpjlu commented Aug 11, 2017

retest this please

@mpjlu
Author

mpjlu commented Aug 11, 2017

retest this please

@srowen
Member

srowen commented Aug 11, 2017

In theory there's no new functionality here so nothing new to test, but more tests never hurt.

This seems OK. Is there any other call site where nnz is already known?

It is a nontrivial bit of change though. How much does this speed things up, do you have any benchmark, for the record?

@mpjlu
Author

mpjlu commented Aug 11, 2017

For PR 18904: before this change, one iteration takes about 58s; after this change, one iteration takes about 40s.

@mpjlu
Author

mpjlu commented Aug 11, 2017

Hi @srowen, how about using our first version? It duplicates some code, but the change is small.

@srowen
Member

srowen commented Aug 11, 2017

No, duplicate code like that is bad.

@srowen
Member

srowen commented Aug 11, 2017

@mpjlu Sorry, which benchmark are you referring to? PR 18904 doesn't seem to benchmark just this change in isolation. I just want to be sure the gain is significant.

@mpjlu
Author

mpjlu commented Aug 11, 2017

I did not benchmark only this PR. I was working on PR 18904 and noticed this performance difference.

@sethah
Contributor

sethah commented Aug 11, 2017

I think there is new functionality: a new method whose behavior needs to be defined. As one specific example, we need a test like:

  test("toSparseWithSize") {
    val dv = Vectors.dense(1, 2, 3)
    withClue("toSparseWithSize fails on the wrong number of non-zeros") {
      intercept[java.lang.ArrayIndexOutOfBoundsException] {
        dv.toSparseWithSize(2)
      }
    }
  }

This is evidence to future developers that the potential failure of this method is known and intended. Also, we need a test for when we specify it incorrectly the other way, i.e. what is the expected outcome of:

  val dv = Vectors.dense(1, 2, 3)
  val sv = dv.toSparseWithSize(6)

Right now, I get the error java.lang.IllegalArgumentException: requirement failed: You provided 6 indices and values, which exceeds the specified vector size 3. These behaviors are undesirable IMO, but should at least be tested.

I don't believe there's any way to find a better solution without at least adding an O(nnz) operation. Honestly, some more specific performance results would be great to have here.
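
For completeness, a sketch of the complementary test for the over-specified case, mirroring the test shown earlier (the exception type matches the error reported above; the exact message is not asserted):

  test("toSparseWithSize with over-specified nnz") {
    val dv = Vectors.dense(1, 2, 3)
    withClue("toSparseWithSize fails when nnz exceeds the vector size") {
      intercept[IllegalArgumentException] {
        dv.toSparseWithSize(6)
      }
    }
  }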

@srowen
Member

srowen commented Aug 11, 2017

This isn't a public method though. The dv.toSparseWithSize(2) error will never come up unless Spark causes it, and there's no contract for its behavior in that case. It's probably over-specifying things to require it to throw AIOOBE, for example; nothing depends on that nor should.

It doesn't hurt to unit test though and the additional test seems fine.

@sethah
Contributor

sethah commented Aug 11, 2017

Ok, it's fairly safe since it's limited to private[linalg]. The confusion for me is that this method introduces all sorts of edge cases which have behavior that is not at all obvious or clear. If we don't think we need to test all of these edge cases, I'd like to at least be very explicit in the method doc, e.g.:

  /**
   * This method is used to avoid re-computing the number of non-zero elements when it is
   * already known. This method should only be called after computing the number of non-zero
   * elements via [[numNonzeros]]. e.g.
   *   {{{
   *     val nnz = this.numNonzeros
   *     val sv = toSparse(nnz)
   *   }}}
   *
   * If `nnz` is under-specified, a [[java.lang.ArrayIndexOutOfBoundsException]] is thrown.
   */

@SparkQA

SparkQA commented Aug 11, 2017

Test build #80529 has finished for PR 18899 at commit 4404b10.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@srowen
Copy link
Member

srowen commented Aug 13, 2017

OK, maybe include some of this text in the scaladoc for it, to make it clear it is always intended to be called with the value of numNonzeros.

@mpjlu
Author

mpjlu commented Aug 14, 2017

Thanks @sethah @srowen. The comment has been added.

@SparkQA

SparkQA commented Aug 14, 2017

Test build #80600 has finished for PR 18899 at commit d50de99.

  • This patch fails to generate documentation.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Aug 14, 2017

Test build #80602 has finished for PR 18899 at commit 83ac893.

  • This patch fails to generate documentation.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Aug 14, 2017

Test build #80606 has finished for PR 18899 at commit 7ab264d.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@mpjlu mpjlu changed the title [SPARK-21680][ML][MLLIB]optimzie Vector coompress [SPARK-21680][ML][MLLIB]optimize Vector compress Aug 15, 2017
@mpjlu
Author

mpjlu commented Aug 15, 2017

I have tested the performance of toSparse and toSparseWithSize separately. There is about a 35% performance improvement from this change.

@srowen
Member

srowen commented Aug 16, 2017

merged to master

@asfgit asfgit closed this in a0345cb Aug 16, 2017