@zhengruifeng (Contributor) commented Jan 28, 2020

What changes were proposed in this pull request?

1. Use blocks instead of vectors.
2. Use Level-2 BLAS for the binary case and Level-3 BLAS for the multinomial case (see the sketch below).
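A minimal sketch of the idea (note that org.apache.spark.ml.linalg.BLAS is private[spark], so this only compiles inside Spark's own source tree; the shapes and names here are illustrative, not from the PR):

import org.apache.spark.ml.linalg.{BLAS, DenseMatrix, DenseVector, Matrices, Vectors}

// Stack a block of instances into one matrix (blockSize x numFeatures).
val block = Matrices.dense(2, 3, Array(1.0, 4.0, 2.0, 5.0, 3.0, 6.0)) // column-major

// Binary: one Level-2 GEMV computes the margins of the whole block at once.
val coef = Vectors.dense(0.1, 0.2, 0.3)
val margins = new DenseVector(new Array[Double](2))
BLAS.gemv(1.0, block, coef, 0.0, margins) // margins = block * coef

// Multinomial: one Level-3 GEMM computes margins for all classes at once.
val coefMat = DenseMatrix.zeros(4, 3)   // numClasses x numFeatures
val marginMat = DenseMatrix.zeros(2, 4) // blockSize x numClasses
BLAS.gemm(1.0, block, coefMat.transpose, 0.0, marginMat) // marginMat = block * coefMat.T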

Why are the changes needed?

1. Less RAM to persist the training data (saves ~40%).
2. Faster than the existing implementation (40%~92% speedup).

Does this PR introduce any user-facing change?

Yes, it adds a new expert param, blockSize.

How was this patch tested?

Updated test suites.

@zhengruifeng (Contributor, Author):

Environment: bin/spark-shell --driver-memory=32G

Test code:

import org.apache.spark.ml.classification._
import org.apache.spark.sql.functions.col
import org.apache.spark.storage.StorageLevel


var df = spark.read.format("libsvm").load("/data1/Datasets/a9a/a9a").withColumn("label", (col("label")+1)/2)
df.persist(StorageLevel.MEMORY_AND_DISK)
df.count

(0 until 8).foreach{ _ => df = df.union(df) } // upsample 2^8 = 256x
df.count

new LogisticRegression().setMaxIter(10).fit(df) // warm-up run

val lr1 = new LogisticRegression().setMaxIter(100).setFamily("binomial")
val start = System.currentTimeMillis; val model1 = lr1.fit(df); val end = System.currentTimeMillis; end - start


val lr2 = new LogisticRegression().setMaxIter(100).setFitIntercept(false).setFamily("binomial")
val start = System.currentTimeMillis; val model2 = lr2.fit(df); val end = System.currentTimeMillis; end - start


val lr3 = new LogisticRegression().setMaxIter(100).setFamily("multinomial")
val start = System.currentTimeMillis; val model3 = lr3.fit(df); val end = System.currentTimeMillis; end - start


val lr4 = new LogisticRegression().setMaxIter(100).setFitIntercept(false).setFamily("multinomial")
val start = System.currentTimeMillis; val model4 = lr4.fit(df); val end = System.currentTimeMillis; end - start

Result:

Impl      RAM       Duration for lr1..lr4 (ms)
This PR   1418.9M   136217, 161194, 171625, 177116
Master    2.3G      217035, 218267, 239111, 250163

// If fitIntercept==false, gradientSumArray += mat.T X matrix
// GEMM requires block.matrix is dense
val gradSumMat = new DenseMatrix(numClasses, numFeatures, localGradientSumArray)
BLAS.gemm(1.0, mat.transpose, dm, 1.0, gradSumMat)
@zhengruifeng (Contributor, Author):

gradientSumArray backs a Matrix of shape C x FPI (numClasses x numFeaturesPlusIntercept), and BLAS.gemm requires that the output matrix not be transposed. So only when F (numFeatures) == FPI (numFeaturesPlusIntercept) and the input block is dense can I use BLAS.gemm to update gradientSumArray directly.
Otherwise, I need to write the result to a temp matrix, multinomialLinearGradSumMat, and then add its elements to gradientSumArray; see the sketch below.
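A sketch of that fallback path, reusing the names from the quoted diff (mat, dm) with hypothetical toy shapes; it works because both matrices are column-major with numClasses rows, so the scratch values line up with the leading C x F slots of the gradient array (BLAS here is again Spark's private[spark] ml.linalg.BLAS):

import org.apache.spark.ml.linalg.{BLAS, DenseMatrix}

val numClasses = 3
val numFeatures = 4
val gradientSumArray = new Array[Double](numClasses * (numFeatures + 1)) // C x FPI

val mat = DenseMatrix.zeros(2, numClasses)  // stands in for the multiplier block
val dm = DenseMatrix.zeros(2, numFeatures)  // stands in for the dense feature block

// GEMM into a C x F scratch matrix instead of the C x FPI gradient array.
val tmp = DenseMatrix.zeros(numClasses, numFeatures)
BLAS.gemm(1.0, mat.transpose, dm, 0.0, tmp) // tmp = mat.T * dm

// Fold the scratch values into the leading C x F slots of the gradient array.
var k = 0
while (k < numClasses * numFeatures) {
  gradientSumArray(k) += tmp.values(k)
  k += 1
}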

@zhengruifeng (Contributor, Author):

ping @srowen

}

// Helper vectors and matrices for binary:
@transient private lazy val binaryLinear = {
@srowen (Member):

So, are these lazy just to deal with recreating them after deserialization? They don't seem big, so can they just be non-transient and non-lazy? Unless it's a material problem, that might be simpler and faster.
Also, how much do you need to hold on to scratch vectors like auxiliaryVec, versus just using locals?

@zhengruifeng (Contributor, Author):

binaryLinear, binaryIntercept, multinomialLinear, and multinomialIntercept are the linear and bias parts of the coefficients, respectively.

binaryLinearGradSumVec (numFeatures) and multinomialLinearGradSumMat (numClasses x numFeatures) store the result of gemv/gemm when fitIntercept == true, since gradientSumArray also contains the gradient sums of the intercepts and cannot be used directly in gemv/gemm.

auxiliaryVec (blockSize) and multinomialAuxiliaryMat (blockSize x numClasses) store the intermediate multiplications (margins) and multipliers.

They can be reused across blocks, and when they are used multiple times in one call we can assign them to local variables. However, I am OK with making them local variables, since I guess they are not the bottleneck. (See the sketch after this comment for how binaryLinearGradSumVec is used.)
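For illustration, a sketch of the binary-with-intercept path this comment describes; only binaryLinearGradSumVec is a name from the PR, while the shapes and other values are made up, and BLAS is Spark's private[spark] ml.linalg.BLAS:

import org.apache.spark.ml.linalg.{BLAS, DenseVector, Matrices}

val numFeatures = 3
val gradientSumArray = new Array[Double](numFeatures + 1) // last slot: intercept
val binaryLinearGradSumVec = new DenseVector(new Array[Double](numFeatures))

val block = Matrices.dense(2, numFeatures, Array.fill(2 * numFeatures)(1.0))
val multiplier = new DenseVector(Array(0.5, -0.5)) // per-instance multipliers

// GEMV into the scratch vector, then fold it into the feature slots only,
// leaving the intercept slot to be updated separately.
BLAS.gemv(1.0, block.transpose, multiplier, 0.0, binaryLinearGradSumVec)
var i = 0
while (i < numFeatures) {
  gradientSumArray(i) += binaryLinearGradSumVec(i)
  i += 1
}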

@srowen (Member):

OK, up to your judgment. It'd be simpler to not even make them members, if it doesn't make much difference to performance.

@srowen (Member) commented Jan 28, 2020

Also, does this cause any appreciable slowdown at smaller scale? It's not a big deal if something fast becomes a little slower in order to make slow things much faster, but I just want to get a sense of what you know about the scale implications.

@SparkQA commented Jan 28, 2020

Test build #117483 has finished for PR 27374 at commit f753462.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@zhengruifeng (Contributor, Author):

@srowen
The original dataset a9a is not big: numFeatures=123, numInstances=32,561; after upsampling, numInstances=32,561x256=8,335,616.

I have run other performance tests; the performance seems related to numFeatures and blockSize. I guess it mostly comes down to: given an array of vectors, to what degree can Level-2/3 BLAS be faster than the existing Java implementation or Level-1 BLAS?

Thanks for reviewing!

@srowen (Member) commented Jan 28, 2020

Yeah, that's the question... Level-1 BLAS often isn't a win; Level-2/3, yes. It's probably a win, but I just don't want to take a perf hit on most use cases to help large ones. Even that's arguable.

@zhengruifeng (Contributor, Author):

Ok, I will test on small datasets.

@SparkQA commented Jan 29, 2020

Test build #117501 has finished for PR 27374 at commit 36245b6.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@zhengruifeng (Contributor, Author) commented Jan 29, 2020

@srowen I found that on small datasets, the speed up is even more significant.

data: a9a, numFeatures=123, numInstances=32,561

Test code:

import org.apache.spark.ml.classification._
import org.apache.spark.sql.functions.col
import org.apache.spark.storage.StorageLevel


val df = spark.read.format("libsvm").load("/data1/Datasets/a9a/a9a").withColumn("label", (col("label")+1)/2)
df.persist(StorageLevel.MEMORY_AND_DISK)
df.count

val lr4 = new LogisticRegression().setMaxIter(100).setFitIntercept(false).setFamily("multinomial")
val start = System.currentTimeMillis; val model4 = lr4.fit(df); val end = System.currentTimeMillis; end - start

Seq(64, 256, 1024, 4096, 8192).map { b => val start = System.currentTimeMillis; val model1 = new LogisticRegression().setBlockSize(b).fit(df); val end = System.currentTimeMillis; end - start } // this PR

Seq(64, 256, 1024, 4096, 8192).map { b => val start = System.currentTimeMillis; val model1 = new LogisticRegression().fit(df); val end = System.currentTimeMillis; end - start } // Master (b unused; same number of runs)

Result: about 77%~92% faster. I think that is because on a big dataset the communication overhead has a bigger impact on the whole procedure, while on small datasets like a9a, high-level BLAS dominates the performance.
This PR: List(1630, 1623, 1539, 1559, 1666)
Master: List(2985, 3037, 2957, 2994, 2959)

By the way, I set the default value to 1024 based on the above results. However, the best block size will depend on many factors, like whether native BLAS is used, numFeatures, sparsity, numInstances, etc.

@SparkQA commented Jan 29, 2020

Test build #117507 has finished for PR 27374 at commit f0e2e40.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@zhengruifeng (Contributor, Author):

Param blockSize was just added in #27360 and is only used in LinearSVC and LR, so I can safely change its default value.

@SparkQA commented Jan 29, 2020

Test build #117510 has finished for PR 27374 at commit a249fdf.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@srowen (Member) left a review:

Looking pretty OK to me

if (fitIntercept) {
val intercept = coefficientsArray.last
var i = 0
while (i < size) {
@srowen (Member):

Would it be faster to fill an array with this value and then make a DenseVector? Maybe I'm missing why not.

@zhengruifeng (Contributor, Author):

OK, I will update it.
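For reference, a minimal sketch of the suggested change; size and intercept stand in for the values in the quoted snippet, and vec is a hypothetical name:

import org.apache.spark.ml.linalg.Vectors

val size = 4        // the block size in the quoted loop
val intercept = 0.5 // coefficientsArray.last in the quoted snippet

// Fill an array once and wrap it (no copy) instead of writing element by element.
val vec = Vectors.dense(Array.fill(size)(intercept))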

@srowen (Member) commented Jan 30, 2020

I'll merge soon to unblock #27389 , but if you have any final thoughts on the above soon, that would be good to check.

@zhengruifeng (Contributor, Author):

Since LeastSquaresAggregator and HuberAggregator in LiR also mark the linear part of the coefficients (effectiveCoefficientsVector) transient and lazy, I followed that at the beginning.
LiR may also benefit from using blocks instead of vectors; I am working on it.

@SparkQA commented Jan 30, 2020

Test build #117538 has finished for PR 27374 at commit c49b379.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@srowen closed this in 073ce12 Jan 30, 2020
@srowen (Member) commented Jan 30, 2020

Merged to master

@zhengruifeng deleted the blockify_lor branch January 31, 2020 04:21
@WeichenXu123 (Contributor) commented Feb 5, 2020

@zhengruifeng Could you provide detailed benchmark results separately for:

  • dense features (all features are dense)
  • sparse features (such as 50%, 10%, 1% sparsity)

Thanks!

@mengxr (Contributor) commented Feb 6, 2020

+1 on @WeichenXu123's suggestion, and I would suggest temporarily reverting this change until we have a good solution.

@zhengruifeng @srowen This new approach will introduce a significant performance regression on sparse datasets with a large number of features, e.g., https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary.html#webspam (16,609,143 features). With block size 1024, it requires ~130GB RAM.
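(For scale, assuming one fully densified block of doubles: 1024 rows x 16,609,143 features x 8 bytes ≈ 136 GB, consistent with the ~130GB estimate.)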

@zhengruifeng (Contributor, Author):

@mengxr @WeichenXu123
I am OK with reverting this, since it causes a regression on high-dimensional sparse datasets. Then we may also consider reverting LinearRegression and LinearSVC, since all three are implemented in the same way.

@zhengruifeng restored the blockify_lor branch February 6, 2020 08:02
@WeichenXu123 (Contributor):

@zhengruifeng Thanks! Also note that your ongoing PR blockify GMM #27473, which does a similar thing, should be suspended for now as well.
We have found that java-BLAS introduces significant JNI overhead and in some scenarios causes regressions.

@zhengruifeng (Contributor, Author):

@WeichenXu123 Yes, I just marked GMM as WIP.
I am going to study this issue; thanks for pointing it out. I am sorry for failing to run a comprehensive test.

@zhengruifeng (Contributor, Author):

@WeichenXu123 @mengxr @srowen
I just made a quick test on webspam:
I drew the first 10,000 samples from webspam_wc_normalized_trigram.svm; numFeatures=8,289,919 in the sampled dataset.
Its sparsity (percentage of non-zero values) is about 0.4489%.

This PR fails due to OOM in standardization, so I used a patch:

// Standardize in place, scaling only the stored values, so sparse vectors
// stay sparse instead of being densified.
val vec = features match {
  case dv: DenseVector =>
    var i = 0
    while (i < dv.size) {
      val std = featuresStd(i)
      if (std != 0) {
        dv.values(i) /= std
      } else {
        dv.values(i) = 0.0
      }
      i += 1
    }
    dv
  case sv: SparseVector =>
    var j = 0
    while (j < sv.numActives) {
      val i = sv.indices(j)
      val std = featuresStd(i)
      if (std != 0) {
        sv.values(j) /= std
      } else {
        sv.values(j) = 0.0
      }
      j += 1
    }
    sv
}

After that, I used the following code to test performance:
spark-shell --driver-memory=32G --conf spark.driver.maxResultSize=4g

import org.apache.spark.ml.classification._
import org.apache.spark.sql.functions.col
import org.apache.spark.storage.StorageLevel

val df = spark.read.format("libsvm").load("/data1/Datasets/webspam/webspam_wc_normalized_trigram.svm.10k").withColumn("label", (col("label")+1)/2)


val lr1 = new LogisticRegression().setMaxIter(100).setFamily("binomial").setBlockSize(128) // this PR
val start = System.currentTimeMillis; val model1 = lr1.fit(df); val end = System.currentTimeMillis; end - start


val lr2 = new LogisticRegression().setMaxIter(100).setFamily("binomial").setBlockSize(1024) // this PR
val start = System.currentTimeMillis; val model2 = lr2.fit(df); val end = System.currentTimeMillis; end - start


val lr = new LogisticRegression().setMaxIter(100).setFamily("binomial") // 2.4.4
val start = System.currentTimeMillis; val model = lr.fit(df); val end = System.currentTimeMillis; end - start

Result:

Impl                      This PR (blockSize=128)   This PR (blockSize=1024)   2.4.4
summary.totalIterations   31                        31                         31
duration (ms)             298514                    133982                     108375
RAM (MB)                  425                       425                        396

For this sparse dataset, this PR (with the updated standardization) is about 23% slower and uses 7% more RAM.

So I agree with you to revert this PR and the related PRs (LinearSVC, LinearRegression).
Since ALS/MLP extending HasBlockSize depends on LinearSVC, it may also need to be reverted for now. @huaxingao

@srowen (Member) commented Feb 6, 2020

Ahhh, OK. I didn't think enough about whether sparse vectors would behave significantly differently; of course, it should have been checked. I agree. I'm happy to merge a revert PR, or @zhengruifeng you can too.

@huaxingao (Contributor):

I can revert my PR #27389. But before I revert, I want to check with you folks. My PR only has API changes:

  1. add HasBlockSize to ALS so user can specify the blockSize for method blockify
  2. make MLP extend HasBlockSize in sharedParams.scala instead of having its own param blockSize

It seems to me it's OK to keep these changes. Of course, there will be no HasBlockSize in sharedParams.scala any more after reverting #27360. I guess I can put HasBlockSize back in sharedParams.scala instead of reverting my PR?

@mengxr (Contributor) commented Feb 6, 2020

I think we can keep the API-only change and figure out a way to let the implementation automatically decide whether to blockify+densify for performance.
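A purely illustrative sketch of what such an automatic decision could look like; the name, thresholds, and inputs below are all hypothetical, not from any PR:

// Hypothetical heuristic: densified blocks only pay off when the data is
// dense enough and a single dense block stays small.
def shouldBlockify(numFeatures: Long, blockSize: Int, nnzRatio: Double): Boolean = {
  val blockBytes = numFeatures * blockSize * 8L // doubles in one dense block
  nnzRatio > 0.5 && blockBytes < 64L * 1024 * 1024
}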

@zhengruifeng (Contributor, Author):

OK, I will revert those PRs, and then @huaxingao can add back ALS/MLP extending HasBlockSize as a separate PR.

zhengruifeng added a commit that referenced this pull request Feb 8, 2020
### What changes were proposed in this pull request?
Revert
#27360
#27396
#27374
#27389

### Why are the changes needed?
BLAS needs more performance tests, especially on sparse datasets.
A performance test of LogisticRegression (#27374) on a sparse dataset shows that blockifying vectors into matrices and using BLAS causes a performance regression.
LinearSVC and LinearRegression were updated in the same way as LogisticRegression, so we need to revert them as well to make sure there is no regression.

### Does this PR introduce any user-facing change?
Removes the newly added param blockSize.

### How was this patch tested?
Reverted test suites.

Closes #27487 from zhengruifeng/revert_blockify_ii.

Authored-by: zhengruifeng <[email protected]>
Signed-off-by: zhengruifeng <[email protected]>
zhengruifeng added a commit that referenced this pull request Feb 25, 2020
sjincho pushed a commit to sjincho/spark that referenced this pull request Apr 15, 2020
@zhengruifeng deleted the blockify_lor branch April 27, 2020 02:10