-
Notifications
You must be signed in to change notification settings - Fork 29k
[SPARK-30662][ML][PySpark] ALS/MLP extend HasBlockSize #27389
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
Test build #117528 has finished for PR 27389 at commit
|
|
Test build #117530 has finished for PR 27389 at commit
|
srowen
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this is OK to get in as a further unification. The code freeze is coming shortly, so will unblock this one ASAP
|
|
||
| /** @group expertGetParam */ | ||
| @Since("1.5.0") | ||
| final def getBlockSize: Int = $(blockSize) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This doesn't actually go away because it's in HasBlockSize ? just checking the API doesn't change.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, getBlockSize doesn't go away because it is in HasBlockSize.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The default value of blocksize in MLP is 128, so explicitly setDefault(blockSize -> 128) in MLP?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is set in MultilayerPerceptronParams at line 83
setDefault(maxIter -> 100, tol -> 1e-6, blockSize -> 128,
solver -> LBFGS, stepSize -> 0.03)
|
|
||
| /** | ||
| * Set block size for stacking input data in matrices. | ||
| * Default is 4096. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh, I think this is about to be a default of 1024, after the very latest PR from @zhengruifeng is merged. I think it's probably good to go so will merge it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the comment. I actually saw the default changed to 1024 in that PR, but I want the default to be 4096, that's why I set it explicitly in line 675 in the Estimator
setDefault(blockSize -> 4096).
I want the default to be 4096 because the blockify has 4096 as default. I don't want to change the current default value.
private def blockify(
factors: Dataset[(Int, Array[Float])],
blockSize: Int = 4096): Dataset[Seq[(Int, Array[Float])]] = {
import factors.sparkSession.implicits._
factors.mapPartitions(_.grouped(blockSize))
}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK, sounds fine.
|
LGTM |
| * TODO: SPARK-20443 - expose blockSize as a param? | ||
| */ | ||
| private def blockify( | ||
| factors: Dataset[(Int, Array[Float])], |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we still need default blockSize in this method? i.e.,blockSize: Int = 4096.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You are right. No need to have the default blockSize any more. I will update the code. Thanks!
| /** @group expertGetParam */ | ||
| def getColdStartStrategy: String = $(coldStartStrategy).toLowerCase(Locale.ROOT) | ||
|
|
||
| setDefault(blockSize -> 4096) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I just realized that I should set Default of blockSize in ALSModelParams, so this will apply to both ALS and ALSModel.
|
Test build #117545 has finished for PR 27389 at commit
|
|
@huaxingao let me know when ready and I'll merge pending tests. I know the code freeze is tomorrow and looks good to get in before that. |
|
@srowen |
|
Merged to master |
|
Thank you all! |
### What changes were proposed in this pull request? Revert #27360 #27396 #27374 #27389 ### Why are the changes needed? BLAS need more performace tests, specially on sparse datasets. Perfermance test of LogisticRegression (#27374) on sparse dataset shows that blockify vectors to matrices and use BLAS will cause performance regression. LinearSVC and LinearRegression were also updated in the same way as LogisticRegression, so we need to revert them to make sure no regression. ### Does this PR introduce any user-facing change? remove newly added param blockSize ### How was this patch tested? reverted testsuites Closes #27487 from zhengruifeng/revert_blockify_ii. Authored-by: zhengruifeng <[email protected]> Signed-off-by: zhengruifeng <[email protected]>
### What changes were proposed in this pull request? Revert #27360 #27396 #27374 #27389 ### Why are the changes needed? BLAS need more performace tests, specially on sparse datasets. Perfermance test of LogisticRegression (#27374) on sparse dataset shows that blockify vectors to matrices and use BLAS will cause performance regression. LinearSVC and LinearRegression were also updated in the same way as LogisticRegression, so we need to revert them to make sure no regression. ### Does this PR introduce any user-facing change? remove newly added param blockSize ### How was this patch tested? reverted testsuites Closes #27487 from zhengruifeng/revert_blockify_ii. Authored-by: zhengruifeng <[email protected]> Signed-off-by: zhengruifeng <[email protected]>
### What changes were proposed in this pull request? Revert apache#27360 apache#27396 apache#27374 apache#27389 ### Why are the changes needed? BLAS need more performace tests, specially on sparse datasets. Perfermance test of LogisticRegression (apache#27374) on sparse dataset shows that blockify vectors to matrices and use BLAS will cause performance regression. LinearSVC and LinearRegression were also updated in the same way as LogisticRegression, so we need to revert them to make sure no regression. ### Does this PR introduce any user-facing change? remove newly added param blockSize ### How was this patch tested? reverted testsuites Closes apache#27487 from zhengruifeng/revert_blockify_ii. Authored-by: zhengruifeng <[email protected]> Signed-off-by: zhengruifeng <[email protected]>
What changes were proposed in this pull request?
Make ALS/MLP extend
HasBlockSizeWhy are the changes needed?
Currently, MLP has its own
blockSizeparam, we should make MLP extendHasBlockSizesinceHasBlockSizewas added insharedParams.scalarecently.ALS doesn't have
blockSizeparam now, we can make it extendHasBlockSize, so user can specify theblockSize.Does this PR introduce any user-facing change?
Yes
ALS.setBlockSizeandALS.getBlockSizeALSModel.setBlockSizeandALSModel.getBlockSizeHow was this patch tested?
Manually tested. Also added doctest.