Combined methods related to splitting data into one single method. Also fixed related issues.#5227
Conversation
Codecov Report
@@ Coverage Diff @@
## master #5227 +/- ##
==========================================
+ Coverage 73.49% 73.59% +0.10%
==========================================
Files 1014 1014
Lines 188680 188963 +283
Branches 20330 20326 -4
==========================================
+ Hits 138677 139074 +397
+ Misses 44493 44396 -97
+ Partials 5510 5493 -17
|
| new NormalizingEstimator.MinMaxColumnOptions(stratCol, stratificationColumn, ensureZeroUntouched: true)) | ||
| .Fit(data).Transform(data); | ||
| { | ||
| data = new ColumnCopyingEstimator(host,(stratCol,stratificationColumn)).Fit(data).Transform(data); |
There was a problem hiding this comment.
ColumnCopyingEstimator [](start = 35, length = 22)
Should this be added to EnsureGroupPreservationColumn as well?
And more generally, would it make sense to unify the two methods? There is also a method called GetSplitColumn in line 285 in CrossValidationCommand.cs that does more or less the same thing. I think the same utility method should be called for maml, entry points and the C# API. #Resolved
There was a problem hiding this comment.
So I took a quick look into the EnsureGroupPreservationColumn callers, and it doesn't seem that they drop the created column, so this wouldn't be an issue there. I'll look into it more later to confirm this.
About merging these methods, I think it would be a good idea to maintain consistency across the codebase, although it wouldn't be necessary to fix the issue opened by @ganik to unblock NimbusML update. I'll look into this idea.
In reply to: 438552643 [](ancestors = 438552643)
There was a problem hiding this comment.
I confirmed that when EnsureGroupPreservationColumn was called, the samplingKeyColumn was never dropped, so the issue didn't apply on that case.
Also, I've merged the methods you've mentioned, so now only CreateGroupPreservationColumn exists
In reply to: 438969558 [](ancestors = 438969558,438552643)
There was a problem hiding this comment.
Update: I've renamed CreateGroupPreservationColumn to CreateSplitColumn and made similar renames elsewhere to make the code easier to read. Now "splitColumn" is the name of the temporary column we create for splitting, regardless if it's based-off a "samplingKeyColumn" (ML.NET) or a "stratificationColumn" (NimbusML, Maml, legacy TLC naming conventions...)
Also now I do drop the new "splitColumn" when it's created while splitting through ML.NET's API, because not dropping it was causing other issues... see comment:
#5227 (comment)
Added 2 tests to cheked this is dropped, and updated my PR description to explain all the issues it's now fixing.
In reply to: 441992076 [](ancestors = 441992076,438969558,438552643)
|
Can you add a unit test that has a similar case to the NimbusML test that exposed this bug? #Resolved |
…xNormalizer for this assertion
| Assert.Contains(5, ids); | ||
| split = mlContext.Data.TrainTestSplit(input, 0.5, nameof(Input.FloatStrat)); | ||
| ids = split.TrainSet.GetColumn<int>(split.TrainSet.Schema[nameof(Input.Id)]); | ||
| ids = split.TestSet.GetColumn<int>(split.TestSet.Schema[nameof(Input.Id)]); |
There was a problem hiding this comment.
TestSet [](start = 24, length = 7)
why? #Resolved
There was a problem hiding this comment.
Because of the change to ensureZeroUntouched = false
In reply to: 443437139 [](ancestors = 443437139)
** Fix issue with AutoML ** Enforce having the same schema before and after splitting, avoiding future issues * Added tests for these * Enforce that samplingKeyColumnName shouldn't be null by the time we split stuff, since anyway it will throw in the RangeFilter if its null. Also change its name to tempSamplingKeyColumnName since it's going to be dropped.
| public void AutoFitWithPresplittedData() | ||
| { | ||
| // Models created in AutoML should work over the same data, | ||
| // no matter how that data is splitted before passing it to the experiment execution | ||
| // or to the model for prediction |
There was a problem hiding this comment.
This test abstracts the issue in #5256 (which is also the issue in #4048 and dotnet/machinelearning-modelbuilder#694 )
This test fails with a
System.ArgumentOutOfRangeException : Could not find input column 'SamplingKeyColumn' without the changes on this PR.
The particular change that fixed this issue is dropping the (temporary version of) SamplingKeyColumn (now called SplitColumn), on DataOperationsCatalog.CrossValidationSplit and DataOperationsCatalog.TrainTestSplit in ML.NET (no change was needed in AutoML, but I added this AutoML test since this issue was found repeatedly by AutoML users).
…r renames for consistency. Internally we'll call the new column SplitColumn, regardless if it's based off a "samplingKeyColumn" or a "stratificationColumn"
LittleLittleCloud
left a comment
There was a problem hiding this comment.
Approve on behalf of AutoML test change
| /// </summary> | ||
| internal static IEnumerable<TrainTestData> CrossValidationSplit(IHostEnvironment env, IDataView data, string splitColumn, int numberOfFolds = 5) | ||
| { | ||
| env.CheckValue(splitColumn, nameof(splitColumn)); |
There was a problem hiding this comment.
splitColumn [](start = 27, length = 11)
Should we check/assert that this column exists in the data?
There was a problem hiding this comment.
I did it because if the column doesn't exist (or if splitColumn was null) it is going to throw an exception anyway inside the RangeFilter. Since it's necessary that the column exists for the RangeFilter to work, I thought it was sensible to check for this here and don't let splitColumn to be null.
In reply to: 447414598 [](ancestors = 447414598)
| }, data); | ||
|
|
||
| yield return new TrainTestData(trainFilter, testFilter); | ||
| var trainDV = ColumnSelectingTransformer.CreateDrop(env, trainFilter, splitColumn); |
There was a problem hiding this comment.
CreateDrop [](start = 57, length = 10)
Is this a new behavior? Or is it needed now because of the other changes?
If it is a new behavior, why is it needed?
There was a problem hiding this comment.
Dropping the SplitColumn is something new to fix the situation I mentioned here:
#5227 (comment)
Not dropping it was causing an issue with AutoML. And also it didn't make sense that we didn't drop it, because if we didn't do it then the schema of a DataView was changed by splitting it, which I think should be considered unexpected behavior.
In reply to: 447415004 [](ancestors = 447415004)
This PR combines
CreateStratificationColumnandEnsureGroupPreservationColumn, and some code onGetSplitColumn, into a new method calledCreateSplitColumn. This method receives asamplingKeyColumnNameparameter and always creates a new column namedSplitColumnbased on thesamplingKeyColumn... the new column is elsewhere in the code used for splitting datasets for Train-Test splits or Cross-Validation Splits, and it's always safe to drop the new column as well. Note: thesamplingKeyColumnthatCreateSplitColumnreceives as parameter, or the column it creates, are calledsamplingKeyColumn,stratificationColumn,groupPreservationColumn, orsplitColumn(or maybe evenrowGroupColumn?) elsewhere in the code or docs, depending on where on the code you're / which API you're calling (ML.NET's TrainTestSplit, CrossValidationSplit, Evaluate or CrossValidate... AutoML, MAML, or NimbusML/Entrypoints). The inconsistency of the naming for this column is a long-standing problem, but since they're names used both in public APIs and for historical reasons, I guess there's not much we can do now (see #2536, and related issues and PR's).This PR also solves multiple issues related with the "samplingKeyColumn" when splitting data:
Issue: When
samplingKeyColumnis KeyType, dropping the original samplingKeyColumn when calling from entrypoints, makes it impossible to reuse that columnSolution: Use a copy column transformer to copy the original column, so that it's safe to drop the new column elsewhere without loosing the original column
Fixes #5221 : Issue with NimbusML test
Issue : a "samplingKeyColumn" is created, but not dropped, when users split data with ML.NET's API before training a model with AutoML, causing some problems when reusing the AutoML model
Solution: Always drop the new column when calling from ML.NET's APIs. Note: to distinguish between the "samplingKeyColumn" provided by the user, and the one we internally create, the new column is now called "SplitColumn".
Fixes #5256
Fixes #4048
Fixes dotnet/machinelearning-modelbuilder#694
Issue: If the "samplingKeyColumn" is float/double type, and includes negative values, or is normalized and outside the (0,1) range, then it might not work as expected.
Solution: Use
ensureZeroUntouched = falsein the normalizing estimator, and always normalize if it's float/doubleFixes #5227 (comment)