
Conversation

@huaxingao
Contributor

What changes were proposed in this pull request?

Add single-column input/output support to Imputer.

Why are the changes needed?

Currently, Imputer only has multi-column support. This PR adds single-column input/output support.

Does this PR introduce any user-facing change?

Yes. This PR adds single-column input/output support to Imputer via two new setters:
Imputer.setInputCol
Imputer.setOutputCol
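
For illustration, a minimal sketch of the new single-column usage (a sketch only: the app name, data, and column names are made up, not taken from the patch):

import org.apache.spark.sql.SparkSession
import org.apache.spark.ml.feature.Imputer

object SingleColImputerSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("single-col-imputer-sketch").getOrCreate()
    import spark.implicits._

    // Hypothetical data: "age" has a missing value encoded as NaN (the default missingValue).
    val df = Seq(25.0, 30.0, Double.NaN).toDF("age")

    val model = new Imputer()
      .setStrategy("mean")
      .setInputCol("age")          // new single-column setter
      .setOutputCol("age_imputed") // new single-column setter
      .fit(df)

    model.transform(df).show()
    spark.stop()
  }
}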

How was this patch tested?

Added unit tests.

def setOutputCol(self, value):
    """
    Sets the value of :py:attr:`outputCol`.
    """
    return self._set(outputCol=value)

Contributor Author

I added the above setters explicitly because I will need them later on anyway.

@SparkQA

SparkQA commented Oct 25, 2019

Test build #112636 has finished for PR 26247 at commit 1800cdc.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • class _ImputerParams(HasInputCol, HasInputCols, HasOutputCol, HasOutputCols):

}

def verifyTransformResult(strategy: String, inputCol: String, outputCol: String,
resultDF: DataFrame): Unit = {

Member

This doesn't seem to be the usual indentation.
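
For reference, a sketch of the wrapped-parameter style typically used in the Spark codebase (just echoing the helper above; the exact preferred formatting is a judgment call):

import org.apache.spark.sql.DataFrame

object IndentSketch {
  // Sketch only: when a parameter list does not fit on one line, the Spark style
  // guide generally wraps the parameters with a 4-space continuation indent.
  def verifyTransformResult(
      strategy: String,
      inputCol: String,
      outputCol: String,
      resultDF: DataFrame): Unit = {
    // assertions on resultDF would go here in the real test helper
  }
}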

Member

@viirya left a comment

Looks good in general. Do we need to duplicate each multi-column test for the single-column case like this? I am neutral on that.

@huaxingao
Contributor Author

@viirya @zhengruifeng Thanks for the review. I made the changes. It's probably overkill to
have a single-column test for each multi-column test, but I will keep them if there are no objections.

@SparkQA

SparkQA commented Oct 25, 2019

Test build #112652 has finished for PR 26247 at commit ca866f0.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Oct 26, 2019

Test build #112697 has finished for PR 26247 at commit 2ea9aec.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@srowen
Member

srowen commented Oct 26, 2019

Same comment -- it's OK for consistency, but I can also argue that one-hot-encoder and imputer are more regularly applied to many columns, so the current API is fine.

@huaxingao
Contributor Author

Same as the other one :) @srowen

@zhengruifeng
Contributor

@huaxingao You need to rebase this PR.

@SparkQA

SparkQA commented Oct 28, 2019

Test build #112775 has started for PR 26247 at commit f833397.

@huaxingao
Contributor Author

@zhengruifeng Rebased. I will rebase the other PR once this one is merged. Thanks!

@zhengruifeng
Contributor

retest this please

@SparkQA

SparkQA commented Oct 29, 2019

Test build #112812 has finished for PR 26247 at commit f833397.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@zhengruifeng
Contributor

merged to master, thanks all

@huaxingao
Contributor Author

Thanks! @srowen @viirya @zhengruifeng


/** @group setParam */
@Since("3.0.0")
def setInputCol(value: String): this.type = set(inputCol, value)

Member

zero323 commented Jan 13, 2020

What is the intended purpose of this method?

As it is implemented right now, it doesn't seem to have any practical applications:

  • If the model has been created with a single col, the surrogate will contain only a single column, so there is nothing to set here.

  • If the model has been created with multiple cols, setInputCol / setOutputCol should clear inputCols and outputCols, otherwise it will fail to validate. I guess something like this:

    @Since("3.0.0")
    def setInputCol(value: String): this.type = {
      clear(inputCols)
      clear(outputCols)
      set(inputCol, value)
    }

    @Since("3.0.0")
    def setOutputCol(value: String): this.type = {
      clear(inputCols)
      clear(outputCols)
      set(outputCol, value)
    }
    

I am asking because these two are missing in Python (#27195).

Contributor Author

@zero323 I actually realized I missed the two setters in Python when I checked the parity between Python and Scala last night. I fixed it along with a few other problems.

Contributor Author

@zero323 There is a check on the Scala side to make sure only setInputCol/setOutputCol or setInputCols/setOutputCols is set.
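
(For context, the check being referred to is roughly of this shape; a sketch only, not the actual Spark implementation, with hypothetical names:)

object ExclusiveColsSketch {
  // Exactly one of each single/multi column pair may be set, but not both.
  def checkExclusiveCols(
      inputCol: Option[String], inputCols: Option[Array[String]],
      outputCol: Option[String], outputCols: Option[Array[String]]): Unit = {
    require(inputCol.isDefined != inputCols.isDefined,
      "Exactly one of inputCol, inputCols must be set, but not both.")
    require(outputCol.isDefined != outputCols.isDefined,
      "Exactly one of outputCol, outputCols must be set, but not both.")
  }
}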

Member

zero323 commented Jan 14, 2020

@zero323 There is a check on the Scala side to make sure only setInputCol/setOutputCol or setInputCols/setOutputCols is set.

That is what confuses me. Let's say the workflow looks like this:

import org.apache.spark.ml.feature.Imputer

val df = Seq((1, 2)).toDF("x1", "x2")

val mm = new Imputer()
  .setInputCols(Array("x1", "x2"))
  .setOutputCols(Array("x1_", "x2_"))
  .fit(df)

You cannot switch to single col at the model level:

mm.setInputCol("x1").setOutputCol("x1_").transform(df)

// java.lang.IllegalArgumentException: requirement failed: ImputerModel ImputerModel: uid=imputer_5923f59d0d3a, strategy=mean, missingValue=NaN, numInputCols=2, numOutputCols=2 requires exactly one of inputCol, inputCols Params to be set, but both are set.

without clearing cols explicitly:

mm.clear(mm.inputCols).clear(mm.outputCols).transform(df)

That's really not an intuitive workflow, if this is what was intended.

If we only want to support Imputer.setInputCol -> ImputerModel.setInputCol, then there is no point in having this method at all:

val ms = new Imputer().setInputCol("x1").setOutputCol("x1_").fit(df)

ms.setInputCol("x2").setOutputCol("x2_").transform(df)
// org.apache.spark.sql.AnalysisException: cannot resolve '`x2`' given input columns: [x1];;

as the surrogate contains only the column used for fitting:

ms.surrogateDF
// org.apache.spark.sql.DataFrame = [x1: double]

Am I missing something obvious here?

Contributor Author

@zero323 It is a problem. I will have a follow-up PR to fix this.

Member

I think the sanity check on the Scala side for inputCol/outputCol and inputCols/outputCols is more for preventing errors when mixing single and multiple columns at the same time, e.g. setting both single- and multiple-column params, inputCol + outputCols, etc.

Switching between single/multiple columns between fitting and transforming sounds rare.

Member

zero323 commented Jan 15, 2020

@viirya But then we're back to the question of why we need setInputCol in Models. Should we support the

new Estimator().setInputCols(...).fit(...).setInputCol(...).transform(...)

flow at all?

Member

And what about overriding inputCols (taking a subset)?

new Estimator().setInputCols(...).fit(...).setInputCols(...).transform(...)

for that matter?

Member

For a model fitted by an Estimator, I think we usually won't change the input/output column(s). The setter is still useful, though, as there are cases where we might create a model instance directly. For such cases, we need the input column(s) setter.

Member

zero323 commented Jan 16, 2020

Having the option to overwrite outputCol(s) can be useful to avoid name clashes on pre-trained models (see the sketch after this comment).

But providing setters for inputs seems to be more confusing than useful, and the proliferation of Params that support both Col and Cols makes things even more fuzzy, as there is no way to tell which variant we have without inspecting Param values.

In general I am asking because we have cases like OneHotEncoderModel, which provides setInputCols / setOutputCols but no single-column equivalents.
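
To illustrate the outputCol point above, a hypothetical sketch (the path, dataset, and column names are made up; it assumes a model previously persisted after fitting with the single-column API on an input column named "age"):

import org.apache.spark.sql.SparkSession
import org.apache.spark.ml.feature.ImputerModel

object ReuseImputerSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("reuse-imputer-sketch").getOrCreate()
    import spark.implicits._

    // Hypothetical new dataset that already contains a column named like the
    // persisted model's original outputCol, so writing to it would clash.
    val newDF = Seq((25.0, 1.0), (Double.NaN, 2.0)).toDF("age", "age_imputed")

    // Hypothetical path to the persisted single-column model.
    val pretrained = ImputerModel.load("/models/age_imputer")

    // Overwriting only the output column name avoids the clash.
    pretrained.setOutputCol("age_imputed_v2").transform(newDF).show()

    spark.stop()
  }
}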

@huaxingao
Contributor Author

@zero323 That is the one I missed. I noticed it a couple of days ago, but was thinking of fixing it along with all the multi-column-supporting algorithms. Since we will not fix the other multi-column algorithms now, I will just have a follow-up to add setInputCol/setOutputCol for OneHotEncoderModel on both the Scala and Python sides.

@zero323
Member

zero323 commented Jan 16, 2020

Thanks for the update @huaxingao, though I still think we should have some clear guidelines on where such setters should be placed. As @viirya pointed out, Models with direct constructors are one place. But at least having an inputCol setter in a fitted Model seems a bit like dead code.
