[SPARK-29566][ML] Imputer should support single-column input/output #26247
Closed
What is the intended purpose of this method?
As it is implemented right now, it doesn't seem to have any practical applications:
If the model has been created with a single col, the surrogate will contain only a single column, so there is nothing to set here.
If the model has been created with multiple cols, `setInputCol`/`setOutputCol` should clear `setInputCols` and `setOutputCols`, otherwise it will fail to validate. I guess something like this:
I am asking because these two are missing in Python (#27195).
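To make the second point concrete, here is a minimal plain-Python sketch of a single-column setter that clears the multi-column params before setting its own value. `ToyImputerModel`, its dict-based param store, and the method names are hypothetical stand-ins for illustration only; this is not Spark's `ImputerModel` and not necessarily what the reviewer's original snippet contained.

```python
class ToyImputerModel:
    """Toy stand-in for a fitted model; NOT Spark's ImputerModel."""

    def __init__(self, input_cols, output_cols):
        # Explicitly set params, keyed by name (hypothetical storage).
        self._params = {
            "inputCols": list(input_cols),
            "outputCols": list(output_cols),
        }

    def is_set(self, name):
        return name in self._params

    def set_input_col(self, value):
        # Assumed behavior: drop the multi-column param first, so the
        # single- and multi-column variants are never set together.
        self._params.pop("inputCols", None)
        self._params["inputCol"] = value
        return self

    def set_output_col(self, value):
        self._params.pop("outputCols", None)
        self._params["outputCol"] = value
        return self


# A model "fitted" with multiple columns, then switched to a single column.
model = ToyImputerModel(["a", "b"], ["a_out", "b_out"])
model.set_input_col("a").set_output_col("a_out")
print(sorted(model._params))  # only the single-column params remain
```

With setters like these, a check that allows only one variant at a time would keep passing after the switch, at the cost of silently discarding the multi-column settings.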
@zero323 I actually realized I missed the two setters in python when I checked the parity between python and scala last night. I fixed it along with a few other problems.
@zero323 There is a check on the Scala side to make sure that only one of `setInputCol`/`setOutputCol` or `setInputCols`/`setOutputCols` is set.
That is what confuses me. Let's say the workflow looks like this:
You cannot switch to a single `col` at the model level:
without clearing `cols` explicitly:
That's really not an intuitive workflow, if this is what was intended.
If we only want to support `Imputer.setInputCol` -> `ImputerModel.setInputCol`, then there is no point in having this method at all:
as the surrogate contains only the column used for fit.
Do I miss something obvious here?
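The workflow in question can be sketched with a toy model in plain Python. `ToyModel`, its `validate` check, and its `clear` method are hypothetical and only mimic the "exactly one of col/cols" rule discussed in this thread; they are not Spark's code, and the real error and method names may differ.

```python
class ToyModel:
    """Toy model fitted with multiple columns; NOT Spark's ImputerModel."""

    def __init__(self):
        # State after a hypothetical fit with inputCols=["a", "b"].
        self.params = {"inputCols": ["a", "b"]}

    def set_input_col(self, value):
        # Plain setter: it does not touch the multi-column param.
        self.params["inputCol"] = value
        return self

    def clear(self, name):
        self.params.pop(name, None)
        return self

    def validate(self):
        # Assumed rule: exactly one of inputCol / inputCols may be set.
        if ("inputCol" in self.params) == ("inputCols" in self.params):
            raise ValueError("exactly one of inputCol and inputCols must be set")


# Switching to a single column at the model level...
model = ToyModel().set_input_col("a")
try:
    model.validate()
    switched = True
except ValueError:
    switched = False  # both variants are set, so validation rejects the model

# ...only validates after the multi-column param is cleared explicitly.
model.clear("inputCols").validate()
```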
@zero323 It is a problem. I will have a follow up pr to fix this.
I think the sanity check on the Scala side for `inputCol`/`outputCol` and `inputCols`/`outputCols` is more for preventing errors when mixing single and multiple columns at the same time, e.g. setting both single- and multiple-column params, such as `inputCol` + `outputCols`.
It sounds rare to switch between single and multiple columns between fitting and transforming.
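A check of this kind might look like the following plain-Python sketch. The function `check_single_vs_multi` and the dict representation of the set params are assumptions made for illustration; the actual check on the Scala side is not shown in this thread and may behave differently.

```python
def check_single_vs_multi(params):
    """Reject param sets that mix single- and multi-column variants."""
    names = set(params)
    single = {"inputCol", "outputCol"} & names
    multi = {"inputCols", "outputCols"} & names
    if single and multi:
        raise ValueError(f"cannot mix {sorted(single)} with {sorted(multi)}")
    if not single and not multi:
        raise ValueError("no input/output column params are set")


def is_valid(params):
    """Return True if the (hypothetical) check accepts the param set."""
    try:
        check_single_vs_multi(params)
        return True
    except ValueError:
        return False


print(is_valid({"inputCol": "a", "outputCol": "a_out"}))        # True
print(is_valid({"inputCols": ["a"], "outputCols": ["a_out"]}))  # True
print(is_valid({"inputCol": "a", "outputCols": ["a_out"]}))     # False
```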
@viirya But then we're back to the question of why we need `setInputCol` in `Model`s. Should we support such a flow at all?
And what about overriding `inputCols` (taking a subset?) for that matter?
For a model fitted by an Estimator, I think we usually won't change the input/output column(s). The setter is still useful, as there are still cases where we might create a model instance directly. For such cases, we need an input column(s) setter.
Having the option to overwrite `outputCol(s)` can be useful to avoid name clashes on pre-trained models.
But providing setters for inputs seems to be more confusing than useful, and the proliferation of `Params` that support both `Col` and `Cols` makes things even fuzzier, as there is no way to tell which variant we have without inspecting `Param` values.
In general, I am asking because we seem to have cases like `OneHotEncoderModel`, which provides `setInputCols`/`setOutputCols` but no single-column equivalents.