-
Notifications
You must be signed in to change notification settings - Fork 29k
[SPARK-29566][ML] Imputer should support single-column input/output #26247
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
| Sets the value of :py:attr:`outputCol`. | ||
| """ | ||
| return self._set(outputCol=value) | ||
|
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I added the above setters explicitly because I need to add these later on anyways.
|
Test build #112636 has finished for PR 26247 at commit
|
| } | ||
|
|
||
| def verifyTransformResult(strategy: String, inputCol: String, outputCol: String, | ||
| resultDF: DataFrame): Unit = { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
seems not usual indent.
viirya
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good in general. Do we need to copy each test of multi-column case to single column case like that? I am neutral on that.
|
@viirya @zhengruifeng Thanks for the review. I made the changes. It's probably a overkill to |
|
Test build #112652 has finished for PR 26247 at commit
|
|
Test build #112697 has finished for PR 26247 at commit
|
|
Same comment -- it's OK for consistency, but I can also argue that one-hot-encoder and imputer more regularly are applied to many columns, so the current API is fine. |
|
Same as the other one :) @srowen |
|
@huaxingao You need to rebase this PR. |
2ea9aec to
f833397
Compare
|
Test build #112775 has started for PR 26247 at commit |
|
@zhengruifeng Rebased. I will rebase the other PR once this one is merged. Thanks! |
|
retest this please |
|
Test build #112812 has finished for PR 26247 at commit
|
|
merged to master, thanks all |
|
Thanks! @srowen @viirya @zhengruifeng |
|
|
||
| /** @group setParam */ | ||
| @Since("3.0.0") | ||
| def setInputCol(value: String): this.type = set(inputCol, value) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What is the intended purpose of this method?
As it is implemented right now, it doesn't seem to have any practical applications:
-
If model has been created with single col, surrogate will contain only a single column, so there is nothing to set here.
-
If model has been created with multiple cols,
setInputCol/setOutputColshould clearsetInputColsandsetOutputCols, otherwise it will fail to validate. I guess something like this:@Since("3.0.0") def setInputCol(value: String): this.type = { clear(inputCols) clear(outputCols) set(inputCol, value) } @Since("3.0.0") def setOutputCol(value: String): this.type = { clear(inputCols) clear(outputCols) set(outputCol, value) }
I am asking, because these two are missing in Python (#27195).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@zero323 I actually realized I missed the two setters in python when I checked the parity between python and scala last night. I fixed it along with a few other problems.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@zero323 There is a check on scala side to make sure only setInputCol/setOutputCol or setInputCols/setOutputCols is set
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@zero323 There is a check on scala side to make sure only
setInputCol/setOutputColorsetInputCols/setOutputColsis set
That's is what confuses me. Let's say the workflow looks like this:
import org.apache.spark.ml.feature.Imputer
val df = Seq((1, 2)).toDF("x1", "x2")
val mm = new Imputer()
.setInputCols(Array("x1", "x2"))
.setOutputCols(Array("x1_", "x2_"))
.fit(df)You cannot switch to single col at the model level:
mm.setInputCol("x1").setOutputCol("x1_").transform(df)
// java.lang.IllegalArgumentException: requirement failed: ImputerModel ImputerModel: uid=imputer_5923f59d0d3a, strategy=mean, missingValue=NaN, numInputCols=2, numOutputCols=2 requires exactly one of inputCol, inputCols Params to be set, but both are set.without clearing cols explicitly:
mm.clear(mm.inputCols).clear(mm.outputCols).transform(df)
That's really not intuitive workflow, if this is what was intended.
If we only want to support Imupter.setInputCol -> ImputerModel.setInputcol, then there is no point in having this method at all:
val ms = new Imputer().setInputCol("x1").setOutputCol("x1_").fit(df)
ms.setInputCol("x2").setOutputCol("x2_").transform(df)
// org.apache.spark.sql.AnalysisException: cannot resolve '`x2`' given input columns: [x1];;
as surrogate contains only the column used for fit
ms.surrogateDF
org.apache.spark.sql.DataFrame = [x1: double]Do I miss something obvious here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@zero323 It is a problem. I will have a follow up pr to fix this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the sanity check in Scala side for inputCol/outputCol and inputCols/outputCols are more for preventing errors when mixing single and multiple columns at the same time, e.g. set both single and multiple column params, inputCol + outputCols...etc.
It sounds rarely switching between single/multiple column during fitting and transforming.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@viirya But then we're back to the question why we need setInputCol in Models. Should we support
new Estimator().setInputCols(...).fit(...).setInputCol(...).transform(...)
flow at all?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
And what about overriding inputCols (taking a subset?)
new Estimator().setInputCols(...).fit(...).setInputCols(...).transform(...)
for that matter?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For a model fitted by an Estimator, I think we usually won't change input/output column(s). The setter is still useful, as there are still cases that we might create a model instance directly. For such cases, we need input column(s) setter.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Having option to overwrite outputCol(s) on can be useful to avoid name clashes on pre-trained models.
But providing setters for inputs seems to be more confusing than useful, and proliferation of Params that support both Col and Cols makes things even more fuzzy, as there is no way to tell which variant we have, without inspecting Param values.
In general I am asking because we seem to have cases like OneHotEncoderModel - which provide setInputCols / setOutputCols but no single column equivalents.
|
@zero323 That is the one I missed. I noticed that a couple of days ago, but thinking of fixing it along with all the muticolumns supported algorithms. Since now we will not fix other muticolumns algorithms, I will just have a followup to add the setInputCol/setOutputCol for OneHotEncoderModel on both scala and python sides. |
|
Thanks for the update @huaxingao, though I still think we should have some clear guidelines where such setters should be placed. As @viirya pointed out, |
What changes were proposed in this pull request?
add single-column input/output support in Imputer
Why are the changes needed?
Currently, Imputer only has multi-column support. This PR adds single-column input/output support.
Does this PR introduce any user-facing change?
Yes. add single-column input/output support in Imputer
Imputer.setInputColImputer.setOutputColHow was this patch tested?
add unit tests