[SPARK-29566][ML] Imputer should support single-column input/output #26247

huaxingao · 2019-10-25T01:11:16Z

What changes were proposed in this pull request?

add single-column input/output support in Imputer

Why are the changes needed?

Currently, Imputer only has multi-column support. This PR adds single-column input/output support.

Does this PR introduce any user-facing change?

Yes. add single-column input/output support in Imputer
Imputer.setInputCol
Imputer.setOutputCol

How was this patch tested?

add unit tests

huaxingao · 2019-10-25T01:12:25Z

python/pyspark/ml/feature.py

+        Sets the value of :py:attr:`outputCol`.
+        """
+        return self._set(outputCol=value)
+


I added the above setters explicitly because I need to add these later on anyways.

SparkQA · 2019-10-25T02:23:13Z

Test build #112636 has finished for PR 26247 at commit 1800cdc.

This patch passes all tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
class _ImputerParams(HasInputCol, HasInputCols, HasOutputCol, HasOutputCols):

viirya · 2019-10-25T04:07:58Z

mllib/src/test/scala/org/apache/spark/ml/feature/ImputerSuite.scala

  }
+
+  def verifyTransformResult(strategy: String, inputCol: String, outputCol: String,
+                            resultDF: DataFrame): Unit = {


seems not usual indent.

viirya

Looks good in general. Do we need to copy each test of multi-column case to single column case like that? I am neutral on that.

python/pyspark/ml/feature.py

huaxingao · 2019-10-25T05:57:52Z

@viirya @zhengruifeng Thanks for the review. I made the changes. It's probably a overkill to
have a single column test for each of the multi-column test, but i will keep these if no objections.

SparkQA · 2019-10-25T06:57:42Z

Test build #112652 has finished for PR 26247 at commit ca866f0.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2019-10-26T02:20:49Z

Test build #112697 has finished for PR 26247 at commit 2ea9aec.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

srowen · 2019-10-26T20:23:25Z

Same comment -- it's OK for consistency, but I can also argue that one-hot-encoder and imputer more regularly are applied to many columns, so the current API is fine.

huaxingao · 2019-10-27T16:59:24Z

Same as the other one :) @srowen

zhengruifeng · 2019-10-28T05:10:59Z

@huaxingao You need to rebase this PR.

SparkQA · 2019-10-28T14:43:39Z

Test build #112775 has started for PR 26247 at commit f833397.

huaxingao · 2019-10-29T00:00:46Z

@zhengruifeng Rebased. I will rebase the other PR once this one is merged. Thanks!

zhengruifeng · 2019-10-29T01:39:04Z

retest this please

SparkQA · 2019-10-29T02:55:11Z

Test build #112812 has finished for PR 26247 at commit f833397.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

zhengruifeng · 2019-10-29T03:12:43Z

merged to master, thanks all

huaxingao · 2019-10-29T04:47:14Z

Thanks! @srowen @viirya @zhengruifeng

zero323 · 2020-01-13T23:35:25Z

mllib/src/main/scala/org/apache/spark/ml/feature/Imputer.scala


+  /** @group setParam */
+  @Since("3.0.0")
+  def setInputCol(value: String): this.type = set(inputCol, value)


What is the intended purpose of this method?

As it is implemented right now, it doesn't seem to have any practical applications:

If model has been created with single col, surrogate will contain only a single column, so there is nothing to set here.

If model has been created with multiple cols, setInputCol / setOutputCol should clear setInputCols and setOutputCols, otherwise it will fail to validate. I guess something like this:

@Since("3.0.0") def setInputCol(value: String): this.type = { clear(inputCols) clear(outputCols) set(inputCol, value) } @Since("3.0.0") def setOutputCol(value: String): this.type = { clear(inputCols) clear(outputCols) set(outputCol, value) }

I am asking, because these two are missing in Python (#27195).

@zero323 I actually realized I missed the two setters in python when I checked the parity between python and scala last night. I fixed it along with a few other problems.

@zero323 There is a check on scala side to make sure only setInputCol/setOutputCol or setInputCols/setOutputCols is set

@zero323 There is a check on scala side to make sure only setInputCol/setOutputCol or setInputCols/setOutputCols is set

That's is what confuses me. Let's say the workflow looks like this:

import org.apache.spark.ml.feature.Imputer val df = Seq((1, 2)).toDF("x1", "x2") val mm = new Imputer() .setInputCols(Array("x1", "x2")) .setOutputCols(Array("x1_", "x2_")) .fit(df)

You cannot switch to single col at the model level:

mm.setInputCol("x1").setOutputCol("x1_").transform(df) // java.lang.IllegalArgumentException: requirement failed: ImputerModel ImputerModel: uid=imputer_5923f59d0d3a, strategy=mean, missingValue=NaN, numInputCols=2, numOutputCols=2 requires exactly one of inputCol, inputCols Params to be set, but both are set.

without clearing cols explicitly:

mm.clear(mm.inputCols).clear(mm.outputCols).transform(df)

That's really not intuitive workflow, if this is what was intended.

If we only want to support Imupter.setInputCol -> ImputerModel.setInputcol, then there is no point in having this method at all:

val ms = new Imputer().setInputCol("x1").setOutputCol("x1_").fit(df) ms.setInputCol("x2").setOutputCol("x2_").transform(df) // org.apache.spark.sql.AnalysisException: cannot resolve '`x2`' given input columns: [x1];;

as surrogate contains only the column used for fit

ms.surrogateDF org.apache.spark.sql.DataFrame = [x1: double]

Do I miss something obvious here?

@zero323 It is a problem. I will have a follow up pr to fix this.

I think the sanity check in Scala side for inputCol/outputCol and inputCols/outputCols are more for preventing errors when mixing single and multiple columns at the same time, e.g. set both single and multiple column params, inputCol + outputCols...etc.

It sounds rarely switching between single/multiple column during fitting and transforming.

@viirya But then we're back to the question why we need setInputCol in Models. Should we support

new Estimator().setInputCols(...).fit(...).setInputCol(...).transform(...)

flow at all?

And what about overriding inputCols (taking a subset?)

new Estimator().setInputCols(...).fit(...).setInputCols(...).transform(...)

for that matter?

For a model fitted by an Estimator, I think we usually won't change input/output column(s). The setter is still useful, as there are still cases that we might create a model instance directly. For such cases, we need input column(s) setter.

Having option to overwrite outputCol(s) on can be useful to avoid name clashes on pre-trained models.

But providing setters for inputs seems to be more confusing than useful, and proliferation of Params that support both Col and Cols makes things even more fuzzy, as there is no way to tell which variant we have, without inspecting Param values.

In general I am asking because we seem to have cases like OneHotEncoderModel - which provide setInputCols / setOutputCols but no single column equivalents.

huaxingao · 2020-01-16T03:35:46Z

@zero323 That is the one I missed. I noticed that a couple of days ago, but thinking of fixing it along with all the muticolumns supported algorithms. Since now we will not fix other muticolumns algorithms, I will just have a followup to add the setInputCol/setOutputCol for OneHotEncoderModel on both scala and python sides.

zero323 · 2020-01-16T11:54:55Z

Thanks for the update @huaxingao, though I still think we should have some clear guidelines where such setters should be placed. As @viirya pointed out, Models with direct constructors are one place. But at least having inputCol setter in a fitted Model seems a bit like a piece dead code.

huaxingao commented Oct 25, 2019

View reviewed changes

dongjoon-hyun added ML PYSPARK labels Oct 25, 2019

viirya reviewed Oct 25, 2019

View reviewed changes

zhengruifeng requested changes Oct 25, 2019

View reviewed changes

python/pyspark/ml/feature.py Outdated Show resolved Hide resolved

python/pyspark/ml/feature.py Outdated Show resolved Hide resolved

python/pyspark/ml/feature.py Outdated Show resolved Hide resolved

python/pyspark/ml/feature.py Outdated Show resolved Hide resolved

zhengruifeng approved these changes Oct 25, 2019

View reviewed changes

huaxingao added 3 commits October 28, 2019 07:03

[SPARK-29566][ML] Imputer should support single-column input/ouput

0ee0d49

address comments

e3f3fd8

call existing method getInOutCols

f833397

huaxingao force-pushed the spark-29566 branch from 2ea9aec to f833397 Compare October 28, 2019 14:41

zhengruifeng closed this in c137acb Oct 29, 2019

huaxingao deleted the spark-29566 branch October 29, 2019 04:47

zero323 mentioned this pull request Jan 7, 2020

Sync with changes merged after 6378d4bc06cd1bb1a209bd5fb63d10ef52d75eb4 zero323/pyspark-stubs#230

Closed

47 tasks

zero323 reviewed Jan 13, 2020

View reviewed changes

zero323 mentioned this pull request Jan 14, 2020

[WIP][SPARK-29566][PYTHON][ML][FOLLOWUP] Add setInputCol and setOutputCol to ImputerModel #27195

Closed

[SPARK-29566][ML] Imputer should support single-column input/output #26247

[SPARK-29566][ML] Imputer should support single-column input/output #26247

Uh oh!

Conversation

huaxingao commented Oct 25, 2019

What changes were proposed in this pull request?

Why are the changes needed?

Does this PR introduce any user-facing change?

How was this patch tested?

Uh oh!

Choose a reason for hiding this comment

Uh oh!

SparkQA commented Oct 25, 2019

Uh oh!

Choose a reason for hiding this comment

Uh oh!

viirya left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

huaxingao commented Oct 25, 2019

Uh oh!

SparkQA commented Oct 25, 2019

Uh oh!

SparkQA commented Oct 26, 2019

Uh oh!

srowen commented Oct 26, 2019

Uh oh!

huaxingao commented Oct 27, 2019

Uh oh!

zhengruifeng commented Oct 28, 2019

Uh oh!

SparkQA commented Oct 28, 2019

Uh oh!

huaxingao commented Oct 29, 2019

Uh oh!

zhengruifeng commented Oct 29, 2019

Uh oh!

SparkQA commented Oct 29, 2019

Uh oh!

zhengruifeng commented Oct 29, 2019

Uh oh!

huaxingao commented Oct 29, 2019

Uh oh!

zero323 Jan 13, 2020 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

zero323 Jan 14, 2020 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

zero323 Jan 15, 2020 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

zero323 Jan 16, 2020 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

huaxingao commented Jan 16, 2020

Uh oh!

zero323 commented Jan 16, 2020

Uh oh!

Reviewers

zero323 Jan 13, 2020 •

edited

Loading

zero323 Jan 14, 2020 •

edited

Loading

zero323 Jan 15, 2020 •

edited

Loading

zero323 Jan 16, 2020 •

edited

Loading