[SPARK-30662][ML][PySpark] ALS/MLP extend HasBlockSize #27389

huaxingao · 2020-01-30T00:08:50Z

What changes were proposed in this pull request?

Make ALS/MLP extend HasBlockSize

Why are the changes needed?

Currently, MLP has its own blockSize param. Since HasBlockSize was recently added in sharedParams.scala, we should make MLP extend HasBlockSize.

ALS doesn't have a blockSize param now. We can make it extend HasBlockSize so that users can specify the blockSize.

Does this PR introduce any user-facing change?

Yes
ALS.setBlockSize and ALS.getBlockSize
ALSModel.setBlockSize and ALSModel.getBlockSize

How was this patch tested?

Manually tested. Also added doctest.

SparkQA · 2020-01-30T01:24:48Z

Test build #117528 has finished for PR 27389 at commit 1e0e4ca.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2020-01-30T01:53:52Z

Test build #117530 has finished for PR 27389 at commit abe6fe3.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

srowen

I think this is OK to get in as a further unification. The code freeze is coming shortly, so I will unblock this one ASAP.

srowen · 2020-01-30T02:14:06Z

mllib/src/main/scala/org/apache/spark/ml/classification/MultilayerPerceptronClassifier.scala

-
-  /** @group expertGetParam */
-  @Since("1.5.0")
-  final def getBlockSize: Int = $(blockSize)


This doesn't actually go away because it's in HasBlockSize, right? Just checking that the API doesn't change.

Yes, getBlockSize doesn't go away because it is in HasBlockSize.

The default value of blockSize in MLP is 128, so should we explicitly call setDefault(blockSize -> 128) in MLP?

It is set in MultilayerPerceptronParams at line 83:

setDefault(maxIter -> 100, tol -> 1e-6, blockSize -> 128, solver -> LBFGS, stepSize -> 0.03)
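The thread above relies on two things: the getter lives in the shared HasBlockSize trait, and the algorithm-specific params trait supplies the default. A minimal sketch of that pattern follows. The `Param` and `Params` types here are simplified, hypothetical stand-ins written for this illustration, not Spark's real classes, and the param doc string is only an example.

```scala
import scala.collection.mutable

// Hypothetical, simplified stand-ins for Spark ML's Params machinery.
final case class Param[T](name: String, doc: String)

trait Params {
  private val defaults = mutable.Map.empty[Param[_], Any]
  private val values = mutable.Map.empty[Param[_], Any]

  protected def setDefault[T](param: Param[T], value: T): Unit = {
    defaults(param) = value
  }

  protected def set[T](param: Param[T], value: T): this.type = {
    values(param) = value
    this
  }

  // An explicitly set value wins; otherwise fall back to the default.
  protected def $[T](param: Param[T]): T =
    values.getOrElse(param, defaults(param)).asInstanceOf[T]
}

// Shared trait: declares the param and its getter once.
trait HasBlockSize extends Params {
  final val blockSize: Param[Int] =
    Param[Int]("blockSize", "block size for stacking input data in matrices")

  final def getBlockSize: Int = $(blockSize)
}

// Algorithm-specific params trait: only supplies the default.
trait MultilayerPerceptronParams extends HasBlockSize {
  setDefault(blockSize, 128)
}

class MultilayerPerceptronClassifier extends MultilayerPerceptronParams {
  def setBlockSize(value: Int): this.type = set(blockSize, value)
}

object SharedParamDemo {
  def main(args: Array[String]): Unit = {
    val mlp = new MultilayerPerceptronClassifier
    println(mlp.getBlockSize)                   // default from the params trait
    println(mlp.setBlockSize(256).getBlockSize) // explicit value overrides it
  }
}
```

Because the getter is inherited rather than redeclared, removing it from the classifier leaves the public API unchanged in this sketch.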

srowen · 2020-01-30T02:14:36Z

mllib/src/main/scala/org/apache/spark/ml/recommendation/ALS.scala


+  /**
+   * Set block size for stacking input data in matrices.
+   * Default is 4096.


Oh, I think this is about to be a default of 1024, after the very latest PR from @zhengruifeng is merged. I think it's probably good to go so will merge it.

Thanks for the comment. I actually saw the default changed to 1024 in that PR, but I want the default to be 4096, which is why I set it explicitly at line 675 in the Estimator:
setDefault(blockSize -> 4096)

I want the default to be 4096 because blockify has 4096 as its default. I don't want to change the current default value.

  private def blockify(
      factors: Dataset[(Int, Array[Float])],
      blockSize: Int = 4096): Dataset[Seq[(Int, Array[Float])]] = {
    import factors.sparkSession.implicits._
    factors.mapPartitions(_.grouped(blockSize))
  }

OK, sounds fine.
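The blockify method quoted in this thread does its work with `grouped(blockSize)` on each partition's iterator. As a rough illustration of what that grouping produces, here is a sketch on a plain Scala iterator, without Spark. The object name, the sample factors, and the small block size of 4 are assumptions chosen for readability.

```scala
object BlockifySketch {
  // Plain-collection analogue of blockify: group consecutive elements
  // into blocks of at most blockSize; the last block may be shorter.
  def blockify[A](factors: Iterator[A], blockSize: Int): Iterator[Seq[A]] =
    factors.grouped(blockSize)

  def main(args: Array[String]): Unit = {
    // Ten made-up (id, factor vector) pairs standing in for one partition.
    val factors = (0 until 10).map(id => (id, Array.fill(2)(id.toFloat)))

    val blocks = blockify(factors.iterator, blockSize = 4).toList

    // Print the ids that landed in each block.
    blocks.foreach(block => println(block.map(_._1).mkString("[", ", ", "]")))
  }
}
```

With ten elements and a block size of 4, the grouping yields two full blocks and one partial block, which is why the block size only bounds the block length rather than fixing it.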

zhengruifeng · 2020-01-30T04:31:01Z

LGTM

viirya · 2020-01-30T05:33:57Z

mllib/src/main/scala/org/apache/spark/ml/recommendation/ALS.scala

-   * TODO: SPARK-20443 - expose blockSize as a param?
   */
  private def blockify(
      factors: Dataset[(Int, Array[Float])],


Do we still need the default blockSize in this method, i.e., blockSize: Int = 4096?

You are right. No need to have the default blockSize any more. I will update the code. Thanks!

huaxingao · 2020-01-30T06:14:31Z

mllib/src/main/scala/org/apache/spark/ml/recommendation/ALS.scala

  /** @group expertGetParam */
  def getColdStartStrategy: String = $(coldStartStrategy).toLowerCase(Locale.ROOT)
+
+  setDefault(blockSize -> 4096)


I just realized that I should set the default of blockSize in ALSModelParams, so that it applies to both ALS and ALSModel.
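The point about where the default is declared can be sketched as follows. The trait and class names mirror the ones mentioned in this PR, but the bodies are simplified, hypothetical stand-ins (in particular `BlockSizeParam` and `setDefaultBlockSize` are invented for this illustration), not Spark's implementation.

```scala
// Hypothetical, simplified holder for a single blockSize param.
trait BlockSizeParam {
  private var explicitValue: Option[Int] = None
  private var defaultValue: Option[Int] = None

  protected def setDefaultBlockSize(value: Int): Unit = {
    defaultValue = Some(value)
  }

  def setBlockSize(value: Int): this.type = {
    explicitValue = Some(value)
    this
  }

  def getBlockSize: Int =
    explicitValue.orElse(defaultValue).getOrElse(
      throw new NoSuchElementException("blockSize has no value or default"))
}

// Default declared in the params trait shared by the estimator and the model.
trait ALSModelParams extends BlockSizeParam {
  setDefaultBlockSize(4096)
}

trait ALSParams extends ALSModelParams

class ALS extends ALSParams
class ALSModel extends ALSModelParams

object DefaultPlacementDemo {
  def main(args: Array[String]): Unit = {
    println(new ALS().getBlockSize)      // inherits the shared default
    println(new ALSModel().getBlockSize) // so does the model
  }
}
```

Had the default been set only in the estimator, the model in this sketch would have no value to fall back on, which is the gap the comment describes.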

SparkQA · 2020-01-30T07:31:02Z

Test build #117545 has finished for PR 27389 at commit d5711e5.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

srowen · 2020-01-30T17:26:40Z

@huaxingao let me know when it's ready and I'll merge pending tests. I know the code freeze is tomorrow, and this looks good to get in before that.

huaxingao · 2020-01-30T17:35:47Z

@srowen
Hi Sean, it's ready. Thanks!

srowen · 2020-01-30T19:13:23Z

Merged to master

huaxingao · 2020-01-30T19:30:23Z

Thank you all!

### What changes were proposed in this pull request?

Revert #27360 #27396 #27374 #27389

### Why are the changes needed?

BLAS needs more performance tests, especially on sparse datasets. A performance test of LogisticRegression (#27374) on a sparse dataset shows that blockifying vectors into matrices and using BLAS causes a performance regression. LinearSVC and LinearRegression were also updated in the same way as LogisticRegression, so we need to revert them to make sure there is no regression.

### Does this PR introduce any user-facing change?

Removes the newly added param blockSize.

### How was this patch tested?

Reverted test suites.

Closes #27487 from zhengruifeng/revert_blockify_ii.

Authored-by: zhengruifeng <[email protected]>
Signed-off-by: zhengruifeng <[email protected]>

huaxingao added 3 commits January 29, 2020 13:26

[SPARK-30662][ML][PySpark] ALS/MLP extend HasBlockSize

3b375af

fix doctest

1e0e4ca

set default blockSize to 4096 for ALS

abe6fe3

srowen requested changes Jan 30, 2020

srowen mentioned this pull request Jan 30, 2020

[SPARK-30659][ML][PYSPARK] LogisticRegression blockify input vectors #27374

Closed

viirya approved these changes Jan 30, 2020

address comment

d5711e5

huaxingao commented Jan 30, 2020

srowen closed this in f59685a Jan 30, 2020

huaxingao deleted the spark-30662 branch January 30, 2020 19:30

zero323 mentioned this pull request Jan 30, 2020

Sync with changes merged after 6502c66025718bf45e0e2ee12398b7b92da41a0c zero323/pyspark-stubs#315

Closed

14 tasks

This was referenced Feb 7, 2020

Revert "[SPARK-30642][SPARK-30659][SPARK-30660][SPARK-30662]" #27486

Closed

Revert "[SPARK-30642][SPARK-30659][SPARK-30660][SPARK-30662]" #27487

Closed
