@zhengruifeng (Contributor) commented Jan 28, 2020

What changes were proposed in this pull request?

1. Use blocks instead of vectors.
2. Use Level-2 BLAS for the binary case and Level-3 BLAS for the multinomial case (see the sketch below).
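A minimal sketch of the idea (note that org.apache.spark.ml.linalg.BLAS is private[spark], so this only compiles inside Spark's own source tree; the shapes and names here are illustrative, not from the PR):

import org.apache.spark.ml.linalg.{BLAS, DenseMatrix, DenseVector, Matrices, Vectors}

// Stack a block of instances into one matrix (blockSize x numFeatures).
val block = Matrices.dense(2, 3, Array(1.0, 4.0, 2.0, 5.0, 3.0, 6.0)) // column-major

// Binary: one Level-2 GEMV computes the margins of the whole block at once.
val coef = Vectors.dense(0.1, 0.2, 0.3)
val margins = new DenseVector(new Array[Double](2))
BLAS.gemv(1.0, block, coef, 0.0, margins) // margins = block * coef

// Multinomial: one Level-3 GEMM computes margins for all classes at once.
val coefMat = DenseMatrix.zeros(4, 3)   // numClasses x numFeatures
val marginMat = DenseMatrix.zeros(2, 4) // blockSize x numClasses
BLAS.gemm(1.0, block, coefMat.transpose, 0.0, marginMat) // marginMat = block * coefMat.T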

Why are the changes needed?

1. Less RAM to persist the training data (saves ~40%).
2. Faster than the existing implementation (40%~92% speedup).

Does this PR introduce any user-facing change?

Yes, it adds a new expert param, blockSize.

How was this patch tested?

Updated test suites.

@zhengruifeng (Contributor, Author):

Environment: bin/spark-shell --driver-memory=32G

Test code:

import org.apache.spark.ml.classification._
import org.apache.spark.sql.functions.col
import org.apache.spark.storage.StorageLevel


var df = spark.read.format("libsvm").load("/data1/Datasets/a9a/a9a").withColumn("label", (col("label")+1)/2)
df.persist(StorageLevel.MEMORY_AND_DISK)
df.count

(0 until 8).foreach{ _ => df = df.union(df) } // upsample 2^8 = 256x
df.count

new LogisticRegression().setMaxIter(10).fit(df) // warm-up run

val lr1 = new LogisticRegression().setMaxIter(100).setFamily("binomial")
val start = System.currentTimeMillis; val model1 = lr1.fit(df); val end = System.currentTimeMillis; end - start


val lr2 = new LogisticRegression().setMaxIter(100).setFitIntercept(false).setFamily("binomial")
val start = System.currentTimeMillis; val model2 = lr2.fit(df); val end = System.currentTimeMillis; end - start


val lr3 = new LogisticRegression().setMaxIter(100).setFamily("multinomial")
val start = System.currentTimeMillis; val model3 = lr3.fit(df); val end = System.currentTimeMillis; end - start


val lr4 = new LogisticRegression().setMaxIter(100).setFitIntercept(false).setFamily("multinomial")
val start = System.currentTimeMillis; val model4 = lr4.fit(df); val end = System.currentTimeMillis; end - start

Result:

Impl      RAM       Duration for lr1..lr4 (ms)
This PR   1418.9M   136217, 161194, 171625, 177116
Master    2.3G      217035, 218267, 239111, 250163

// If fitIntercept==false, gradientSumArray += mat.T X matrix
// GEMM requires block.matrix is dense
val gradSumMat = new DenseMatrix(numClasses, numFeatures, localGradientSumArray)
BLAS.gemm(1.0, mat.transpose, dm, 1.0, gradSumMat)
@zhengruifeng (Contributor, Author):

gradientSumArray backs a Matrix of shape C x FPI (numClasses x numFeaturesPlusIntercept), and BLAS.gemm requires that the output matrix not be transposed. So only when F (numFeatures) == FPI (numFeaturesPlusIntercept) and the input block is dense can I use BLAS.gemm to update gradientSumArray directly.
Otherwise, I need to write the result to a temp matrix, multinomialLinearGradSumMat, and then add its elements to gradientSumArray; see the sketch below.
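A sketch of that fallback path, reusing the names from the quoted diff (mat, dm) with hypothetical toy shapes; it works because both matrices are column-major with numClasses rows, so the scratch values line up with the leading C x F slots of the gradient array (BLAS here is again Spark's private[spark] ml.linalg.BLAS):

import org.apache.spark.ml.linalg.{BLAS, DenseMatrix}

val numClasses = 3
val numFeatures = 4
val gradientSumArray = new Array[Double](numClasses * (numFeatures + 1)) // C x FPI

val mat = DenseMatrix.zeros(2, numClasses)  // stands in for the multiplier block
val dm = DenseMatrix.zeros(2, numFeatures)  // stands in for the dense feature block

// GEMM into a C x F scratch matrix instead of the C x FPI gradient array.
val tmp = DenseMatrix.zeros(numClasses, numFeatures)
BLAS.gemm(1.0, mat.transpose, dm, 0.0, tmp) // tmp = mat.T * dm

// Fold the scratch values into the leading C x F slots of the gradient array.
var k = 0
while (k < numClasses * numFeatures) {
  gradientSumArray(k) += tmp.values(k)
  k += 1
}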

@zhengruifeng (Contributor, Author):

ping @srowen

}

// Helper vectors and matrices for binary:
@transient private lazy val binaryLinear = {
@srowen (Member):

So, are these lazy just to deal with recreating them after deserialization? They don't seem big, so can they just be non-transient and non-lazy? Unless it's a material problem, that might be simpler and faster.
Also, how much do you need to hold on to scratch vectors like auxiliaryVec, versus just using locals?

@zhengruifeng (Contributor, Author):

binaryLinear, binaryIntercept, multinomialLinear, and multinomialIntercept are the linear and bias parts of the coefficients, respectively.

binaryLinearGradSumVec (numFeatures) and multinomialLinearGradSumMat (numClasses x numFeatures) store the result of gemv/gemm when fitIntercept == true, since gradientSumArray also contains the gradient sums of the intercepts and cannot be used directly in gemv/gemm.

auxiliaryVec (blockSize) and multinomialAuxiliaryMat (blockSize x numClasses) store the intermediate multiplications (margins) and multipliers.

They can be reused across blocks, and when they are used multiple times in one call we can assign them to local variables. However, I am OK with making them local variables, since I guess they are not the bottleneck. (See the sketch after this comment for how binaryLinearGradSumVec is used.)
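For illustration, a sketch of the binary-with-intercept path this comment describes; only binaryLinearGradSumVec is a name from the PR, while the shapes and other values are made up, and BLAS is Spark's private[spark] ml.linalg.BLAS:

import org.apache.spark.ml.linalg.{BLAS, DenseVector, Matrices}

val numFeatures = 3
val gradientSumArray = new Array[Double](numFeatures + 1) // last slot: intercept
val binaryLinearGradSumVec = new DenseVector(new Array[Double](numFeatures))

val block = Matrices.dense(2, numFeatures, Array.fill(2 * numFeatures)(1.0))
val multiplier = new DenseVector(Array(0.5, -0.5)) // per-instance multipliers

// GEMV into the scratch vector, then fold it into the feature slots only,
// leaving the intercept slot to be updated separately.
BLAS.gemv(1.0, block.transpose, multiplier, 0.0, binaryLinearGradSumVec)
var i = 0
while (i < numFeatures) {
  gradientSumArray(i) += binaryLinearGradSumVec(i)
  i += 1
}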

@srowen (Member):

OK, up to your judgment. It'd be simpler to not even make them members, if it doesn't make much difference to performance.

@srowen (Member) commented Jan 28, 2020

Also, does this cause any appreciable slowdown at smaller scale? It's not a big deal if something fast becomes a little slower in order to make slow things much faster, but I just want to get a sense of what you know about the scale implications.

@SparkQA commented Jan 28, 2020

Test build #117483 has finished for PR 27374 at commit f753462.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@zhengruifeng (Contributor, Author):

@srowen
The original dataset a9a is not big: numFeatures=123, numInstances=32,561; after upsampling, numInstances=32,561x256=8,335,616.

I have run other performance tests; the performance seems related to numFeatures and blockSize. I guess it mostly comes down to: given an array of vectors, to what degree can Level-2/3 BLAS be faster than the existing Java implementation or Level-1 BLAS?

Thanks for reviewing!

@srowen (Member) commented Jan 28, 2020

Yeah, that's the question... Level-1 BLAS often isn't a win; Level-2/3, yes. It's probably a win, but I just don't want to take a perf hit on most use cases to help large ones. Even that's arguable.

@zhengruifeng (Contributor, Author):

Ok, I will test on small datasets.

@SparkQA commented Jan 29, 2020

Test build #117501 has finished for PR 27374 at commit 36245b6.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@zhengruifeng (Contributor, Author) commented Jan 29, 2020

@srowen I found that on small datasets, the speed up is even more significant.

data: a9a, numFeatures=123, numInstances=32,561

Test code:

import org.apache.spark.ml.classification._
import org.apache.spark.sql.functions.col
import org.apache.spark.storage.StorageLevel


val df = spark.read.format("libsvm").load("/data1/Datasets/a9a/a9a").withColumn("label", (col("label")+1)/2)
df.persist(StorageLevel.MEMORY_AND_DISK)
df.count

val lr4 = new LogisticRegression().setMaxIter(100).setFitIntercept(false).setFamily("multinomial")
val start = System.currentTimeMillis; val model4 = lr4.fit(df); val end = System.currentTimeMillis; end - start

Seq(64, 256, 1024, 4096, 8192).map { b => val start = System.currentTimeMillis; val model1 = new LogisticRegression().setBlockSize(b).fit(df); val end = System.currentTimeMillis; end - start } // this PR

Seq(64, 256, 1024, 4096, 8192).map { b => val start = System.currentTimeMillis; val model1 = new LogisticRegression().fit(df); val end = System.currentTimeMillis; end - start } // Master (b unused; same number of runs)

Result: about 77%~92% faster. I think that is because on a big dataset the communication overhead has a bigger impact on the whole procedure, while on small datasets like a9a, high-level BLAS dominates the performance.
This PR: List(1630, 1623, 1539, 1559, 1666)
Master: List(2985, 3037, 2957, 2994, 2959)

By the way, I set the default value to 1024 based on the above results. However, the best block size will depend on many factors, like whether native BLAS is used, numFeatures, sparsity, numInstances, etc.

@SparkQA commented Jan 29, 2020

Test build #117507 has finished for PR 27374 at commit f0e2e40.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@zhengruifeng (Contributor, Author):

Param blockSize was just added in #27360 and is only used in LinearSVC and LR, so I can safely change its default value.

@SparkQA commented Jan 29, 2020

Test build #117510 has finished for PR 27374 at commit a249fdf.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@srowen (Member) left a review:

Looking pretty OK to me

if (fitIntercept) {
val intercept = coefficientsArray.last
var i = 0
while (i < size) {
@srowen (Member):

Would it be faster to fill an array with this value and then make a DenseVector? Maybe I'm missing why not.

@zhengruifeng (Contributor, Author):

OK, I will update it.
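For reference, a minimal sketch of the suggested change; size and intercept stand in for the values in the quoted snippet, and vec is a hypothetical name:

import org.apache.spark.ml.linalg.Vectors

val size = 4        // the block size in the quoted loop
val intercept = 0.5 // coefficientsArray.last in the quoted snippet

// Fill an array once and wrap it (no copy) instead of writing element by element.
val vec = Vectors.dense(Array.fill(size)(intercept))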

@srowen (Member) commented Jan 30, 2020

I'll merge soon to unblock #27389 , but if you have any final thoughts on the above soon, that would be good to check.

@zhengruifeng (Contributor, Author):

Since LeastSquaresAggregator and HuberAggregator in LiR also mark the linear part of the coefficients (effectiveCoefficientsVector) transient and lazy, I followed that at the beginning.
LiR may also benefit from using blocks instead of vectors; I am working on it.

@SparkQA commented Jan 30, 2020

Test build #117538 has finished for PR 27374 at commit c49b379.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@srowen closed this in 073ce12 Jan 30, 2020
@srowen (Member) commented Jan 30, 2020

Merged to master

@zhengruifeng deleted the blockify_lor branch January 31, 2020 04:21
@WeichenXu123 (Contributor) commented Feb 5, 2020

@zhengruifeng Could you provide detailed benchmark results separately for:

  • dense features (all features are dense)
  • sparse features (such as 50%, 10%, 1% sparsity)

Thanks!

@mengxr (Contributor) commented Feb 6, 2020

+1 on @WeichenXu123's suggestion, and I would suggest temporarily reverting this change until we have a good solution.

@zhengruifeng @srowen This new approach will introduce a significant performance regression on sparse datasets with a large number of features, e.g., https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary.html#webspam (16,609,143 features). With block size 1024, it requires ~130GB RAM.
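(For scale, assuming one fully densified block of doubles: 1024 rows x 16,609,143 features x 8 bytes ≈ 136 GB, consistent with the ~130GB estimate.)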

@zhengruifeng (Contributor, Author):

@mengxr @WeichenXu123
I am OK with reverting this, since it causes a regression on high-dimensional sparse datasets. Then we may also consider reverting LinearRegression and LinearSVC, since all three are implemented in the same way.

@zhengruifeng restored the blockify_lor branch February 6, 2020 08:02
@WeichenXu123 (Contributor):

@zhengruifeng Thanks! Also note that your ongoing PR blockify GMM #27473, which does a similar thing, should be suspended for now as well.
We have found that java-BLAS introduces significant JNI overhead and in some scenarios causes regressions.

@zhengruifeng (Contributor, Author):

@WeichenXu123 Yes, I just marked GMM as WIP.
I am going to study this issue; thanks for pointing it out. I am sorry for failing to run a comprehensive test.

@zhengruifeng (Contributor, Author):

@WeichenXu123 @mengxr @srowen
I just made a quick test on webspam:
I drew the first 10,000 samples from webspam_wc_normalized_trigram.svm; numFeatures=8,289,919 in the sampled dataset.
Its sparsity (percentage of non-zero values) is about 0.4489%.

This PR fails due to OOM in standardization, so I used a patch:

// Standardize in place, scaling only the stored values, so sparse vectors
// stay sparse instead of being densified.
val vec = features match {
  case dv: DenseVector =>
    var i = 0
    while (i < dv.size) {
      val std = featuresStd(i)
      if (std != 0) {
        dv.values(i) /= std
      } else {
        dv.values(i) = 0.0
      }
      i += 1
    }
    dv
  case sv: SparseVector =>
    var j = 0
    while (j < sv.numActives) {
      val i = sv.indices(j)
      val std = featuresStd(i)
      if (std != 0) {
        sv.values(j) /= std
      } else {
        sv.values(j) = 0.0
      }
      j += 1
    }
    sv
}

After that, I used the following code to test performance:
spark-shell --driver-memory=32G --conf spark.driver.maxResultSize=4g

import org.apache.spark.ml.classification._
import org.apache.spark.sql.functions.col
import org.apache.spark.storage.StorageLevel

val df = spark.read.format("libsvm").load("/data1/Datasets/webspam/webspam_wc_normalized_trigram.svm.10k").withColumn("label", (col("label")+1)/2)


val lr1 = new LogisticRegression().setMaxIter(100).setFamily("binomial").setBlockSize(128) // this PR
val start = System.currentTimeMillis; val model1 = lr1.fit(df); val end = System.currentTimeMillis; end - start


val lr2 = new LogisticRegression().setMaxIter(100).setFamily("binomial").setBlockSize(1024) // this PR
val start = System.currentTimeMillis; val model2 = lr2.fit(df); val end = System.currentTimeMillis; end - start


val lr = new LogisticRegression().setMaxIter(100).setFamily("binomial") // 2.4.4
val start = System.currentTimeMillis; val model = lr.fit(df); val end = System.currentTimeMillis; end - start

Result:

Impl                      This PR (blockSize=128)   This PR (blockSize=1024)   2.4.4
summary.totalIterations   31                        31                         31
duration (ms)             298514                    133982                     108375
RAM (MB)                  425                       425                        396

For this sparse dataset, this PR (with the updated standardization) is about 23% slower and uses 7% more RAM.

So I agree with you to revert this PR and the related PRs (LinearSVC, LinearRegression).
Since ALS/MLP extending HasBlockSize depends on LinearSVC, it may also need to be reverted for now. @huaxingao

@srowen (Member) commented Feb 6, 2020

Ahhh, OK. I didn't think enough about whether sparse vectors would behave significantly differently; of course, it should have been checked. I agree. I'm happy to merge a revert PR, or @zhengruifeng you can too.

@huaxingao (Contributor):

I can revert my PR #27389. But before I revert, I want to check with you folks. My PR only has API changes:

  1. add HasBlockSize to ALS so user can specify the blockSize for method blockify
  2. make MLP extend HasBlockSize in sharedParams.scala instead of having its own param blockSize

It seems to me it's OK to keep these changes. Of course, there will be no HasBlockSize in sharedParams.scala any more after reverting #27360. I guess I can put HasBlockSize back in sharedParams.scala instead of reverting my PR?

@mengxr (Contributor) commented Feb 6, 2020

I think we can keep the API-only change and figure out a way to let the implementation automatically decide whether to blockify+densify for performance.
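A purely illustrative sketch of what such an automatic decision could look like; the name, thresholds, and inputs below are all hypothetical, not from any PR:

// Hypothetical heuristic: densified blocks only pay off when the data is
// dense enough and a single dense block stays small.
def shouldBlockify(numFeatures: Long, blockSize: Int, nnzRatio: Double): Boolean = {
  val blockBytes = numFeatures * blockSize * 8L // doubles in one dense block
  nnzRatio > 0.5 && blockBytes < 64L * 1024 * 1024
}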

@zhengruifeng (Contributor, Author):

OK, I will revert those PRs, and then @huaxingao can add back ALS/MLP extending HasBlockSize as a separate PR.

zhengruifeng added a commit that referenced this pull request Feb 8, 2020
### What changes were proposed in this pull request?
Revert
#27360
#27396
#27374
#27389

### Why are the changes needed?
BLAS needs more performance tests, especially on sparse datasets.
A performance test of LogisticRegression (#27374) on a sparse dataset shows that blockifying vectors into matrices and using BLAS causes a performance regression.
LinearSVC and LinearRegression were updated in the same way as LogisticRegression, so we need to revert them as well to make sure there is no regression.

### Does this PR introduce any user-facing change?
Removes the newly added param blockSize.

### How was this patch tested?
Reverted test suites.

Closes #27487 from zhengruifeng/revert_blockify_ii.

Authored-by: zhengruifeng <[email protected]>
Signed-off-by: zhengruifeng <[email protected]>
zhengruifeng added a commit that referenced this pull request Feb 25, 2020
sjincho pushed a commit to sjincho/spark that referenced this pull request Apr 15, 2020
@zhengruifeng deleted the blockify_lor branch April 27, 2020 02:10