Skip to content

Conversation

@srowen
Copy link
Member

@srowen srowen commented Aug 16, 2016

What changes were proposed in this pull request?

Allow centering / mean scaling of sparse vectors in StandardScaler, if requested. This is for compatibility with VectorAssembler in common usages.

How was this patch tested?

Jenkins tests, including new caes to reflect the new behavior.

@SparkQA
Copy link

SparkQA commented Aug 16, 2016

Test build #63840 has finished for PR 14663 at commit 496a8df.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@srowen
Copy link
Member Author

srowen commented Aug 19, 2016

Another one where I'd welcome comments from ... @holdenk @MLnick @davies et al

@srowen
Copy link
Member Author

srowen commented Aug 22, 2016

going once, going twice. This would simply let an operation proceed where it errored before, at the cost of giving a user a little more rope to hang him/herself. I think it unblocks a legitimate and common set of use cases, so I think it's worth changing.

@MLnick
Copy link
Contributor

MLnick commented Aug 22, 2016

As mentioned on the JIRA discussion, I'm neutral on this, though I tend to lean towards allowing the user to do what they want even if it might be "dangerous". I guess +0?

Though perhaps we may want to explicitly log a WARN message if sparse input is encountered and withMean=true, just so there is a very clear root cause message in the logs if things blow up with OOM (it's in the user guide and doc string, but still may be missed by many users).

@srowen
Copy link
Member Author

srowen commented Aug 22, 2016

Warning seems reasonable. I think you'd have to put in a flag to remember if the user has been warned in order to avoid spewing millions of them. Worth it, you think?

@MLnick
Copy link
Contributor

MLnick commented Aug 22, 2016

Ah right, good point. Actually I realised that the doc in ml.feature.StandardScaler needs updating for withMean:

  /**
   * Whether to center the data with mean before scaling.
   * It will build a dense output, so this does not work on sparse input
   * and will raise an exception.
   * Default: false
   * @group param
   */
  val withMean: BooleanParam = new BooleanParam(this, "withMean",
    "Whether to center data with mean")

... as good a place as any to add a warning about using with sparse input (this can then show up in explainParam doc as well as ScalaDoc. Python side will probably need it too.

@srowen
Copy link
Member Author

srowen commented Aug 23, 2016

If I understood you correctly @MLnick you favored just adding warnings in the doc? I added to three more places that needed it.

@SparkQA
Copy link

SparkQA commented Aug 23, 2016

Test build #64288 has finished for PR 14663 at commit e263084.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@srowen
Copy link
Member Author

srowen commented Aug 25, 2016

I'll go for this tomorrow if there are no other comments.

@srowen
Copy link
Member Author

srowen commented Aug 27, 2016

Merged to master

@srowen srowen closed this Aug 27, 2016
@srowen srowen deleted the SPARK-17001 branch August 27, 2016 07:49
srowen added a commit to srowen/spark that referenced this pull request Aug 27, 2016
… when withMean=True

## What changes were proposed in this pull request?

Allow centering / mean scaling of sparse vectors in StandardScaler, if requested. This is for compatibility with `VectorAssembler` in common usages.

## How was this patch tested?

Jenkins tests, including new caes to reflect the new behavior.

Author: Sean Owen <[email protected]>

Closes apache#14663 from srowen/SPARK-17001.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants