[SPARK-30659][ML][PYSPARK] LogisticRegression blockify input vectors #27374
env: bin/spark-shell --driver-memory=32G

testCode:

```scala
import org.apache.spark.ml.classification._
import org.apache.spark.sql.functions.col
import org.apache.spark.storage.StorageLevel

var df = spark.read.format("libsvm").load("/data1/Datasets/a9a/a9a").withColumn("label", (col("label")+1)/2)
df.persist(StorageLevel.MEMORY_AND_DISK)
df.count
(0 until 8).foreach{ _ => df = df.union(df) }
df.count
new LogisticRegression().setMaxIter(10).fit(df)

val lr1 = new LogisticRegression().setMaxIter(100).setFamily("binomial")
val start = System.currentTimeMillis; val model1 = lr1.fit(df); val end = System.currentTimeMillis; end - start

val lr2 = new LogisticRegression().setMaxIter(100).setFitIntercept(false).setFamily("binomial")
val start = System.currentTimeMillis; val model2 = lr2.fit(df); val end = System.currentTimeMillis; end - start

val lr3 = new LogisticRegression().setMaxIter(100).setFamily("multinomial")
val start = System.currentTimeMillis; val model3 = lr3.fit(df); val end = System.currentTimeMillis; end - start

val lr4 = new LogisticRegression().setMaxIter(100).setFitIntercept(false).setFamily("multinomial")
val start = System.currentTimeMillis; val model4 = lr4.fit(df); val end = System.currentTimeMillis; end - start
```

result: this PR: … Master: …
```scala
// If fitIntercept == false, gradientSumArray += mat.T X matrix
// GEMM requires block.matrix is dense
val gradSumMat = new DenseMatrix(numClasses, numFeatures, localGradientSumArray)
BLAS.gemm(1.0, mat.transpose, dm, 1.0, gradSumMat)
```
Since gradientSumArray backs a matrix of shape C x FPI (numClasses x numFeaturesPlusIntercept), and BLAS.gemm requires that the output matrix not be transposed, I can only use BLAS.gemm to update gradientSumArray directly when F (numFeatures) == FPI (numFeaturesPlusIntercept) and the input block is dense.
Otherwise, I need to output the result to a temporary matrix multinomialLinearGradSumMat and then add its elements into gradientSumArray; a plain-Scala sketch of that fold-back step follows.
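A minimal plain-Scala sketch of the temp-matrix path, assuming the column-major C x FPI layout described above; the sizes and the fold-back loop are illustrative, not the PR's exact code:

```scala
// Illustrative sizes: C = numClasses, F = numFeatures, FPI = F + 1 (intercept column).
val C = 3; val F = 4; val FPI = F + 1

// Column-major C x FPI buffer; the intercept sums occupy the last C entries,
// so a C x F DenseMatrix cannot be constructed over this array directly.
val gradientSumArray = new Array[Double](C * FPI)

// GEMM writes into a separate C x F buffer (multinomialLinearGradSumMat in the PR)...
val tempGradSum = new Array[Double](C * F)

// ...whose entries are then folded back into the first F columns, which occupy
// the first C * F slots of the column-major C x FPI array.
var j = 0
while (j < C * F) {
  gradientSumArray(j) += tempGradSum(j)
  j += 1
}
```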
ping @srowen
```scala
}

// Helper vectors and matrices for binary:
@transient private lazy val binaryLinear = {
```
So, are these lazy just to deal with recreating them after deserialization? They don't seem big, so can they just be non-transient and non-lazy? Unless it's a material problem, that might be simpler and faster.
Or how much do you need to hold on to scratch vectors like auxiliaryVec, vs. just using locals?
binaryLinear, binaryIntercept, multinomialLinear, and multinomialIntercept are the linear and bias parts of the coefficients, respectively.
binaryLinearGradSumVec (numFeatures) and multinomialLinearGradSumMat (numClasses x numFeatures) are used to store the result of gemv/gemm when fitIntercept == true, since gradientSumArray also contains the gradient sums of the intercepts and so cannot be used directly as the gemv/gemm output.
auxiliaryVec (blockSize) and multinomialAuxiliaryMat (blockSize x numClasses) are used to store the intermediate products (margins) and the multipliers.
They can be reused across blocks, and when one is used multiple times within a single call we can assign it to a local variable; a sketch of this member-buffer pattern follows below.
However, I am OK with making them local variables, since I guess they are not the bottleneck.
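A minimal sketch of the pattern under discussion; the class skeleton and method body are assumed for illustration, not the PR's exact code. @transient lazy members are skipped during serialization and recreated once per executor on first use, then reused across blocks instead of being reallocated on every add() call:

```scala
// Hypothetical aggregator skeleton illustrating @transient lazy scratch buffers.
class AggregatorSketch(numFeatures: Int) extends Serializable {

  // Not shipped with the serialized task; allocated lazily after deserialization
  // and reused across all blocks processed by this aggregator instance.
  @transient private lazy val binaryLinearGradSumVec = new Array[Double](numFeatures)

  def add(/* block: InstanceBlock */): this.type = {
    // Reuse the member buffer: zero it out, then let gemv accumulate into it.
    java.util.Arrays.fill(binaryLinearGradSumVec, 0.0)
    // The simpler alternative discussed above: a per-call local buffer.
    // val localGradSumVec = new Array[Double](numFeatures)
    this
  }
}
```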
OK, up to your judgment. It'd be simpler to not even make them members, if it doesn't make much difference to performance.
Resolved review thread: mllib/src/test/scala/org/apache/spark/ml/optim/aggregator/LogisticAggregatorSuite.scala
Also, does this cause any appreciable slowdown at smaller scale? It's not a big deal if something that's fast gets a little slower in order to make things that are slow much faster, but I just want to get a sense of what you know about the scale implications.
Test build #117483 has finished for PR 27374 at commit …
@srowen I have made other performance tests; it seems that the performance is related to … Thanks for reviewing!
Yeah, that's the question... Level 1 BLAS often isn't a win; L2/L3, yes. It's probably a win, but I just don't want to take a perf hit on most use cases to help large ones. Even that's arguable.
OK, I will test on small datasets.
Test build #117501 has finished for PR 27374 at commit …
@srowen I found that on small datasets, the speed-up is even more significant.

data: a9a, numFeatures=123, numInstances=32,561

testCode:

```scala
import org.apache.spark.ml.classification._
import org.apache.spark.sql.functions.col
import org.apache.spark.storage.StorageLevel

val df = spark.read.format("libsvm").load("/data1/Datasets/a9a/a9a").withColumn("label", (col("label")+1)/2)
df.persist(StorageLevel.MEMORY_AND_DISK)
df.count

val lr4 = new LogisticRegression().setMaxIter(100).setFitIntercept(false).setFamily("multinomial")
val start = System.currentTimeMillis; val model4 = lr4.fit(df); val end = System.currentTimeMillis; end - start

Seq(64, 256, 1024, 4096, 8192).map { b => val start = System.currentTimeMillis; val model1 = new LogisticRegression().setBlockSize(b).fit(df); val end = System.currentTimeMillis; end - start } // this PR
Seq(64, 256, 1024, 4096, 8192).map { b => val start = System.currentTimeMillis; val model1 = new LogisticRegression().fit(df); val end = System.currentTimeMillis; end - start } // Master
```

result: about 77%~92% faster. I think that is because on a big dataset the communication overhead has a bigger impact on the whole procedure, while on small datasets like a9a the high-level BLAS dominates the performance. By the way, I set the default value to 1024 based on the above result. However, the best block size will depend on many factors, like whether native BLAS is used, numFeatures, sparsity, numInstances, etc.
Test build #117507 has finished for PR 27374 at commit …
Param …
Test build #117510 has finished for PR 27374 at commit …
srowen left a comment:
Looking pretty OK to me
```scala
if (fitIntercept) {
  val intercept = coefficientsArray.last
  var i = 0
  while (i < size) {
```
Would it be faster to fill an array with this value and then make a DenseVector? Maybe I'm missing why not.
OK, I will update it.
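A sketch of the suggested change, under the assumption that size and intercept correspond to the names in the quoted diff above (the values here are placeholders):

```scala
import org.apache.spark.ml.linalg.Vectors

val size = 1024      // placeholder for the block size in the quoted diff
val intercept = 0.5  // placeholder for coefficientsArray.last

// Bulk-fill the array with the intercept, then wrap it in a DenseVector
// without copying, instead of assigning element by element in a while loop.
val arr = Array.fill(size)(intercept)
val marginVec = Vectors.dense(arr)
```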
I'll merge soon to unblock #27389, but if you have any final thoughts on the above soon, that would be good to check.
since …
Test build #117538 has finished for PR 27374 at commit …
Merged to master
@zhengruifeng Could you provide detailed benchmark results separately for: …
Thanks!
+1 on @WeichenXu123's suggestion, and I would suggest temporarily reverting this change until we have a good solution. @zhengruifeng @srowen This new approach will introduce a significant performance regression on sparse datasets with a large number of features, e.g., https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary.html#webspam (16,609,143 features). With block size 1024, it requires ~130GB RAM. (A back-of-the-envelope check of that figure is below.)
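A quick back-of-the-envelope check of the ~130GB figure above; this is my arithmetic, assuming one fully densified block of doubles:

```scala
// One dense block: 1024 rows x 16,609,143 features x 8 bytes per double.
val bytes = 1024L * 16609143L * 8L
println(bytes)                    // 136062099456 bytes
println(bytes / math.pow(2, 30))  // ≈ 126.7 GiB, consistent with the quoted ~130GB
```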
@mengxr @WeichenXu123 …
@zhengruifeng Thanks! Also note that your ongoing PR blockify GMM #27473, which does a similar thing, should also be suspended for now.
@WeichenXu123 Yes, I just marked the GMM PR as WIP.
@WeichenXu123 @mengxr @srowen This PR will fail due to OOM in standardization, so I used a patch that standardizes each vector in place instead of densifying it:

```scala
// Divide each active value by its feature's std, in place, so sparse vectors
// stay sparse and no per-row dense copy of length numFeatures is allocated.
val vec = features match {
  case dv: DenseVector =>
    var i = 0
    while (i < dv.size) {
      val std = featuresStd(i)
      if (std != 0) {
        dv.values(i) /= std
      } else {
        dv.values(i) = 0.0
      }
      i += 1
    }
    dv
  case sv: SparseVector =>
    // Only the active (non-zero) entries need scaling.
    var j = 0
    while (j < sv.numActives) {
      val i = sv.indices(j)
      val std = featuresStd(i)
      if (std != 0) {
        sv.values(j) /= std
      } else {
        sv.values(j) = 0.0
      }
      j += 1
    }
    sv
}
```
After that, I used the following code to test performance:

```scala
import org.apache.spark.ml.classification._
import org.apache.spark.sql.functions.col
import org.apache.spark.storage.StorageLevel

val df = spark.read.format("libsvm").load("/data1/Datasets/webspam/webspam_wc_normalized_trigram.svm.10k").withColumn("label", (col("label")+1)/2)

val lr1 = new LogisticRegression().setMaxIter(100).setFamily("binomial").setBlockSize(128) // this PR
val start = System.currentTimeMillis; val model1 = lr1.fit(df); val end = System.currentTimeMillis; end - start

val lr2 = new LogisticRegression().setMaxIter(100).setFamily("binomial").setBlockSize(1024) // this PR
val start = System.currentTimeMillis; val model2 = lr2.fit(df); val end = System.currentTimeMillis; end - start

val lr = new LogisticRegression().setMaxIter(100).setFamily("binomial") // 2.4.4
val start = System.currentTimeMillis; val model = lr.fit(df); val end = System.currentTimeMillis; end - start
```
Result: for this sparse dataset, this PR (with the updated standardization) is about 23% slower and uses 7% more RAM. So I agree with you; let's revert this PR and the related PRs (LinearSVC, LinearRegression).
Ahhh, OK. I didn't think enough about whether sparse vectors would behave significantly differently; of course, that should have been checked. I agree. I'm happy to merge a revert PR, or @zhengruifeng you can too.
I can revert my PR #27389. But before I revert, I want to check with you folks. My PR only has API changes: …
It seems to me it's OK to keep these changes. Of course, there is no …
I think we can keep the API-only change and figure out a way to let the implementation automatically decide whether to blockify + densify for performance.
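A purely hypothetical sketch of what such an automatic decision might look like; the function, its signature, and the thresholds are invented for illustration and are not from the PR:

```scala
// Hypothetical heuristic: blockify + densify only when the data is dense enough
// that one densified block will not blow up executor memory.
def shouldBlockify(
    numFeatures: Long,
    avgNnzPerRow: Double,
    blockSize: Int,
    maxBlockBytes: Long = 64L << 20): Boolean = {
  val denseBlockBytes = numFeatures * blockSize * 8L // doubles in one dense block
  val density = avgNnzPerRow / numFeatures
  density > 0.1 && denseBlockBytes <= maxBlockBytes  // thresholds are illustrative
}
```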
OK, I will revert those PRs, and then @huaxingao can add back ALS/MLP extending HasBlockSize as a separate PR.
### What changes were proposed in this pull request?
Revert #27360 #27396 #27374 #27389

### Why are the changes needed?
BLAS needs more performance tests, especially on sparse datasets. A performance test of LogisticRegression (#27374) on a sparse dataset shows that blockifying vectors into matrices and using BLAS causes a performance regression. LinearSVC and LinearRegression were updated in the same way as LogisticRegression, so we need to revert them as well to make sure there is no regression.

### Does this PR introduce any user-facing change?
Removes the newly added param blockSize.

### How was this patch tested?
Reverted test suites.

Closes #27487 from zhengruifeng/revert_blockify_ii.
Authored-by: zhengruifeng <[email protected]>
Signed-off-by: zhengruifeng <[email protected]>
### What changes were proposed in this pull request?
1. Use blocks instead of vectors.
2. Use Level-2 BLAS for binary, and Level-3 BLAS for multinomial.

### Why are the changes needed?
1. Less RAM to persist training data (saves ~40%).
2. Faster than the existing impl (40% ~ 92%).

### Does this PR introduce any user-facing change?
Adds a new expert param blockSize.

### How was this patch tested?
Updated test suites.
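For context, a minimal sketch of the blockify idea using the public ml.linalg API; the block assembly and all values here are simplified illustrations, not the PR's actual InstanceBlock code. It stacks blockSize instance vectors into one row-per-instance matrix, then computes all margins with a single Level-2 BLAS call instead of one dot product per row:

```scala
import org.apache.spark.ml.linalg.{DenseMatrix, DenseVector, Vectors}

val blockSize = 4
val numFeatures = 3
val rows = Array(
  Vectors.dense(1.0, 0.0, 2.0),
  Vectors.dense(0.0, 1.0, 1.0),
  Vectors.dense(3.0, 0.0, 0.0),
  Vectors.dense(1.0, 1.0, 1.0))

// Row-major stacking: values are laid out row by row, hence isTransposed = true.
val block = new DenseMatrix(blockSize, numFeatures, rows.flatMap(_.toArray), true)

val coefficients = new DenseVector(Array(0.5, -0.25, 1.0))
val margins = block.multiply(coefficients) // one GEMV for the whole block
println(margins) // [2.5, 0.75, 1.5, 1.25]
```

A densified block of this shape is also what drives the memory math discussed earlier, which is why the approach regressed on very wide sparse data.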