Skip to content

Conversation

@huaxingao
Copy link
Contributor

What changes were proposed in this pull request?

Make ALS/MLP extend HasBlockSize

Why are the changes needed?

Currently, MLP has its own blockSize param, we should make MLP extend HasBlockSize since HasBlockSize was added in sharedParams.scala recently.

ALS doesn't have blockSize param now, we can make it extend HasBlockSize, so user can specify the blockSize.

Does this PR introduce any user-facing change?

Yes
ALS.setBlockSize and ALS.getBlockSize
ALSModel.setBlockSize and ALSModel.getBlockSize

How was this patch tested?

Manually tested. Also added doctest.

@SparkQA
Copy link

SparkQA commented Jan 30, 2020

Test build #117528 has finished for PR 27389 at commit 1e0e4ca.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Jan 30, 2020

Test build #117530 has finished for PR 27389 at commit abe6fe3.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Copy link
Member

@srowen srowen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is OK to get in as a further unification. The code freeze is coming shortly, so will unblock this one ASAP


/** @group expertGetParam */
@Since("1.5.0")
final def getBlockSize: Int = $(blockSize)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This doesn't actually go away because it's in HasBlockSize ? just checking the API doesn't change.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, getBlockSize doesn't go away because it is in HasBlockSize.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The default value of blocksize in MLP is 128, so explicitly setDefault(blockSize -> 128) in MLP?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is set in MultilayerPerceptronParams at line 83

  setDefault(maxIter -> 100, tol -> 1e-6, blockSize -> 128,
    solver -> LBFGS, stepSize -> 0.03)


/**
* Set block size for stacking input data in matrices.
* Default is 4096.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, I think this is about to be a default of 1024, after the very latest PR from @zhengruifeng is merged. I think it's probably good to go so will merge it.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the comment. I actually saw the default changed to 1024 in that PR, but I want the default to be 4096, that's why I set it explicitly in line 675 in the Estimator
setDefault(blockSize -> 4096).

I want the default to be 4096 because the blockify has 4096 as default. I don't want to change the current default value.

  private def blockify(
      factors: Dataset[(Int, Array[Float])],
      blockSize: Int = 4096): Dataset[Seq[(Int, Array[Float])]] = {
    import factors.sparkSession.implicits._
    factors.mapPartitions(_.grouped(blockSize))
  }

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, sounds fine.

@zhengruifeng
Copy link
Contributor

LGTM

* TODO: SPARK-20443 - expose blockSize as a param?
*/
private def blockify(
factors: Dataset[(Int, Array[Float])],
Copy link
Member

@viirya viirya Jan 30, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we still need default blockSize in this method? i.e.,blockSize: Int = 4096.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You are right. No need to have the default blockSize any more. I will update the code. Thanks!

/** @group expertGetParam */
def getColdStartStrategy: String = $(coldStartStrategy).toLowerCase(Locale.ROOT)

setDefault(blockSize -> 4096)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I just realized that I should set Default of blockSize in ALSModelParams, so this will apply to both ALS and ALSModel.

@SparkQA
Copy link

SparkQA commented Jan 30, 2020

Test build #117545 has finished for PR 27389 at commit d5711e5.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@srowen
Copy link
Member

srowen commented Jan 30, 2020

@huaxingao let me know when ready and I'll merge pending tests. I know the code freeze is tomorrow and looks good to get in before that.

@huaxingao
Copy link
Contributor Author

@srowen
Hi Sean, it's ready. Thanks!

@srowen srowen closed this in f59685a Jan 30, 2020
@srowen
Copy link
Member

srowen commented Jan 30, 2020

Merged to master

@huaxingao
Copy link
Contributor Author

Thank you all!

@huaxingao huaxingao deleted the spark-30662 branch January 30, 2020 19:30
zhengruifeng added a commit that referenced this pull request Feb 8, 2020
### What changes were proposed in this pull request?
Revert
#27360
#27396
#27374
#27389

### Why are the changes needed?
BLAS need more performace tests, specially on sparse datasets.
Perfermance test of LogisticRegression (#27374) on sparse dataset shows that blockify vectors to matrices and use BLAS will cause performance regression.
LinearSVC and LinearRegression were also updated in the same way as LogisticRegression, so we need to revert them to make sure no regression.

### Does this PR introduce any user-facing change?
remove newly added param blockSize

### How was this patch tested?
reverted testsuites

Closes #27487 from zhengruifeng/revert_blockify_ii.

Authored-by: zhengruifeng <[email protected]>
Signed-off-by: zhengruifeng <[email protected]>
zhengruifeng added a commit that referenced this pull request Feb 25, 2020
### What changes were proposed in this pull request?
Revert
#27360
#27396
#27374
#27389

### Why are the changes needed?
BLAS need more performace tests, specially on sparse datasets.
Perfermance test of LogisticRegression (#27374) on sparse dataset shows that blockify vectors to matrices and use BLAS will cause performance regression.
LinearSVC and LinearRegression were also updated in the same way as LogisticRegression, so we need to revert them to make sure no regression.

### Does this PR introduce any user-facing change?
remove newly added param blockSize

### How was this patch tested?
reverted testsuites

Closes #27487 from zhengruifeng/revert_blockify_ii.

Authored-by: zhengruifeng <[email protected]>
Signed-off-by: zhengruifeng <[email protected]>
sjincho pushed a commit to sjincho/spark that referenced this pull request Apr 15, 2020
### What changes were proposed in this pull request?
Revert
apache#27360
apache#27396
apache#27374
apache#27389

### Why are the changes needed?
BLAS need more performace tests, specially on sparse datasets.
Perfermance test of LogisticRegression (apache#27374) on sparse dataset shows that blockify vectors to matrices and use BLAS will cause performance regression.
LinearSVC and LinearRegression were also updated in the same way as LogisticRegression, so we need to revert them to make sure no regression.

### Does this PR introduce any user-facing change?
remove newly added param blockSize

### How was this patch tested?
reverted testsuites

Closes apache#27487 from zhengruifeng/revert_blockify_ii.

Authored-by: zhengruifeng <[email protected]>
Signed-off-by: zhengruifeng <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants