Combined methods related to splitting data into one single method. Also fixed related issues. by antoniovs1029 · Pull Request #5227 · dotnet/machinelearning

antoniovs1029 · 2020-06-10T23:57:14Z

This PR combines CreateStratificationColumn and EnsureGroupPreservationColumn, and some code on GetSplitColumn, into a new method called CreateSplitColumn. This method receives a samplingKeyColumnName parameter and always creates a new column named SplitColumn based on the samplingKeyColumn... the new column is elsewhere in the code used for splitting datasets for Train-Test splits or Cross-Validation Splits, and it's always safe to drop the new column as well. Note: the samplingKeyColumn that CreateSplitColumn receives as parameter, or the column it creates, are called samplingKeyColumn, stratificationColumn, groupPreservationColumn, or splitColumn (or maybe even rowGroupColumn ?) elsewhere in the code or docs, depending on where on the code you're / which API you're calling (ML.NET's TrainTestSplit, CrossValidationSplit, Evaluate or CrossValidate... AutoML, MAML, or NimbusML/Entrypoints). The inconsistency of the naming for this column is a long-standing problem, but since they're names used both in public APIs and for historical reasons, I guess there's not much we can do now (see #2536, and related issues and PR's).

This PR also solves multiple issues related with the "samplingKeyColumn" when splitting data:

Issue: When samplingKeyColumn is KeyType, dropping the original samplingKeyColumn when calling from entrypoints, makes it impossible to reuse that column
Solution: Use a copy column transformer to copy the original column, so that it's safe to drop the new column elsewhere without loosing the original column
Fixes #5221 : Issue with NimbusML test

Issue : a "samplingKeyColumn" is created, but not dropped, when users split data with ML.NET's API before training a model with AutoML, causing some problems when reusing the AutoML model
Solution: Always drop the new column when calling from ML.NET's APIs. Note: to distinguish between the "samplingKeyColumn" provided by the user, and the one we internally create, the new column is now called "SplitColumn".
Fixes #5256
Fixes #4048
Fixes dotnet/machinelearning-modelbuilder#694

Issue: If the "samplingKeyColumn" is float/double type, and includes negative values, or is normalized and outside the (0,1) range, then it might not work as expected.
Solution: Use ensureZeroUntouched = false in the normalizing estimator, and always normalize if it's float/double
Fixes #5227 (comment)

codecov · 2020-06-11T01:08:10Z

Codecov Report

Merging #5227 into master will increase coverage by 0.10%.
The diff coverage is 99.73%.

@@            Coverage Diff             @@
##           master    #5227      +/-   ##
==========================================
+ Coverage   73.49%   73.59%   +0.10%     
==========================================
  Files        1014     1014              
  Lines      188680   188963     +283     
  Branches    20330    20326       -4     
==========================================
+ Hits       138677   139074     +397     
+ Misses      44493    44396      -97     
+ Partials     5510     5493      -17

Flag	Coverage Δ
#Debug	`73.59% <99.73%> (+0.10%)`	⬆️
#production	`69.37% <98.27%> (+0.07%)`	⬆️
#test	`87.52% <100.00%> (+0.08%)`	⬆️

Impacted Files	Coverage Δ
...soft.ML.Data/DataLoadSave/DataOperationsCatalog.cs	`85.50% <97.72%> (+3.68%)`	⬆️
...crosoft.ML.Data/Commands/CrossValidationCommand.cs	`84.04% <100.00%> (+0.52%)`	⬆️
src/Microsoft.ML.Data/TrainCatalog.cs	`83.94% <100.00%> (ø)`
src/Microsoft.ML.EntryPoints/CVSplit.cs	`100.00% <100.00%> (ø)`
src/Microsoft.ML.EntryPoints/TrainTestSplit.cs	`94.44% <100.00%> (+17.52%)`	⬆️
test/Microsoft.ML.AutoML.Tests/AutoFitTests.cs	`92.27% <100.00%> (+1.96%)`	⬆️
...crosoft.ML.Core.Tests/UnitTests/TestEntryPoints.cs	`98.43% <100.00%> (+0.05%)`	⬆️
test/Microsoft.ML.Tests/Scenarios/Api/TestApi.cs	`94.16% <100.00%> (+1.31%)`	⬆️
...c/Microsoft.ML.FastTree/Utils/ThreadTaskManager.cs	`79.48% <0.00%> (-20.52%)`	⬇️
src/Microsoft.ML.FastTree/RegressionTree.cs	`75.51% <0.00%> (-8.17%)`	⬇️
... and 21 more

yaeldekel · 2020-06-11T05:36:56Z

src/Microsoft.ML.EntryPoints/TrainTestSplit.cs

-                        new NormalizingEstimator.MinMaxColumnOptions(stratCol, stratificationColumn, ensureZeroUntouched: true))
-                        .Fit(data).Transform(data);
+                    {
+                        data = new ColumnCopyingEstimator(host,(stratCol,stratificationColumn)).Fit(data).Transform(data);


ColumnCopyingEstimator [](start = 35, length = 22)

Should this be added to EnsureGroupPreservationColumn as well?

And more generally, would it make sense to unify the two methods? There is also a method called GetSplitColumn in line 285 in CrossValidationCommand.cs that does more or less the same thing. I think the same utility method should be called for maml, entry points and the C# API. #Resolved

So I took a quick look into the EnsureGroupPreservationColumn callers, and it doesn't seem that they drop the created column, so this wouldn't be an issue there. I'll look into it more later to confirm this.

About merging these methods, I think it would be a good idea to maintain consistency across the codebase, although it wouldn't be necessary to fix the issue opened by @ganik to unblock NimbusML update. I'll look into this idea.

In reply to: 438552643 [](ancestors = 438552643)

I confirmed that when EnsureGroupPreservationColumn was called, the samplingKeyColumn was never dropped, so the issue didn't apply on that case.

Also, I've merged the methods you've mentioned, so now only CreateGroupPreservationColumn exists

In reply to: 438969558 [](ancestors = 438969558,438552643)

Update: I've renamed CreateGroupPreservationColumn to CreateSplitColumn and made similar renames elsewhere to make the code easier to read. Now "splitColumn" is the name of the temporary column we create for splitting, regardless if it's based-off a "samplingKeyColumn" (ML.NET) or a "stratificationColumn" (NimbusML, Maml, legacy TLC naming conventions...)

Also now I do drop the new "splitColumn" when it's created while splitting through ML.NET's API, because not dropping it was causing other issues... see comment:
#5227 (comment)
Added 2 tests to cheked this is dropped, and updated my PR description to explain all the issues it's now fixing.

In reply to: 441992076 [](ancestors = 441992076,438969558,438552643)

yaeldekel · 2020-06-11T05:37:51Z

Can you add a unit test that has a similar case to the NimbusML test that exposed this bug? #Resolved

test/Microsoft.ML.Core.Tests/UnitTests/TestEntryPoints.cs

… a split

…xNormalizer for this assertion

…umn instead

…olumn

… failing test)

src/Microsoft.ML.Data/DataLoadSave/DataOperationsCatalog.cs

test/Microsoft.ML.Tests/Scenarios/Api/TestApi.cs

test/Microsoft.ML.Core.Tests/UnitTests/TestEntryPoints.cs

yaeldekel · 2020-06-22T09:38:55Z

test/Microsoft.ML.Tests/Scenarios/Api/TestApi.cs

            Assert.Contains(5, ids);
            split = mlContext.Data.TrainTestSplit(input, 0.5, nameof(Input.FloatStrat));
-            ids = split.TrainSet.GetColumn<int>(split.TrainSet.Schema[nameof(Input.Id)]);
+            ids = split.TestSet.GetColumn<int>(split.TestSet.Schema[nameof(Input.Id)]);


TestSet [](start = 24, length = 7)

why? #Resolved

Because of the change to ensureZeroUntouched = false

In reply to: 443437139 [](ancestors = 443437139)

** Fix issue with AutoML ** Enforce having the same schema before and after splitting, avoiding future issues * Added tests for these * Enforce that samplingKeyColumnName shouldn't be null by the time we split stuff, since anyway it will throw in the RangeFilter if its null. Also change its name to tempSamplingKeyColumnName since it's going to be dropped.

antoniovs1029 · 2020-06-27T04:59:20Z

test/Microsoft.ML.AutoML.Tests/AutoFitTests.cs

+        public void AutoFitWithPresplittedData()
+        {
+            // Models created in AutoML should work over the same data,
+            // no matter how that data is splitted before passing it to the experiment execution
+            // or to the model for prediction


This test abstracts the issue in #5256 (which is also the issue in #4048 and dotnet/machinelearning-modelbuilder#694 )

This test fails with a
System.ArgumentOutOfRangeException : Could not find input column 'SamplingKeyColumn' without the changes on this PR.

The particular change that fixed this issue is dropping the (temporary version of) SamplingKeyColumn (now called SplitColumn), on DataOperationsCatalog.CrossValidationSplit and DataOperationsCatalog.TrainTestSplit in ML.NET (no change was needed in AutoML, but I added this AutoML test since this issue was found repeatedly by AutoML users).

…r renames for consistency. Internally we'll call the new column SplitColumn, regardless if it's based off a "samplingKeyColumn" or a "stratificationColumn"

LittleLittleCloud

Approve on behalf of AutoML test change

yaeldekel · 2020-06-30T05:15:43Z

src/Microsoft.ML.Data/DataLoadSave/DataOperationsCatalog.cs

+        /// </summary>
+        internal static IEnumerable<TrainTestData> CrossValidationSplit(IHostEnvironment env, IDataView data, string splitColumn, int numberOfFolds = 5)
        {
+            env.CheckValue(splitColumn, nameof(splitColumn));


splitColumn [](start = 27, length = 11)

Should we check/assert that this column exists in the data?

I did it because if the column doesn't exist (or if splitColumn was null) it is going to throw an exception anyway inside the RangeFilter. Since it's necessary that the column exists for the RangeFilter to work, I thought it was sensible to check for this here and don't let splitColumn to be null.

In reply to: 447414598 [](ancestors = 447414598)

yaeldekel · 2020-06-30T05:17:07Z

src/Microsoft.ML.Data/DataLoadSave/DataOperationsCatalog.cs

                }, data);

-                yield return new TrainTestData(trainFilter, testFilter);
+                var trainDV = ColumnSelectingTransformer.CreateDrop(env, trainFilter, splitColumn);


CreateDrop [](start = 57, length = 10)

Is this a new behavior? Or is it needed now because of the other changes?
If it is a new behavior, why is it needed?

Dropping the SplitColumn is something new to fix the situation I mentioned here:
#5227 (comment)

Not dropping it was causing an issue with AutoML. And also it didn't make sense that we didn't drop it, because if we didn't do it then the schema of a DataView was changed by splitting it, which I think should be considered unexpected behavior.

In reply to: 447415004 [](ancestors = 447415004)

yaeldekel

Fixed issue

d2fb318

antoniovs1029 requested a review from a team as a code owner June 10, 2020 23:57

antoniovs1029 requested review from ganik, harishsk and yaeldekel June 10, 2020 23:57

yaeldekel reviewed Jun 11, 2020

View reviewed changes

Added test

e9454a0

antoniovs1029 commented Jun 16, 2020

View reviewed changes

test/Microsoft.ML.Core.Tests/UnitTests/TestEntryPoints.cs Show resolved Hide resolved

antoniovs1029 added 8 commits June 16, 2020 09:56

Added LightGBM Fact Attribute to test

8a747e5

Added test for Key type SamplingKeyColumn on TrainTestSplit

1f94584

Added test for CV and that SamplingKeyColumn isn't removed by running…

033ae83

… a split

Added assertion for non-emptiness and fixed mini-issue with the MinMa…

1638341

…xNormalizer for this assertion

Removed CreateStratificationColumn and use EnsureGroupPreservationCol…

5e9a8e4

…umn instead

Change MAML to also use EnsureGroupPreservationColumn method

cc9f6ed

Change name EnsureGroupPreservationColumn to CreateGroupPreservationC…

37332c1

…olumn

Fallback on EnvSeed when calling through TrainCatalog (fix Tensorflow…

177fa04

… failing test)

antoniovs1029 commented Jun 18, 2020

View reviewed changes

src/Microsoft.ML.Data/DataLoadSave/DataOperationsCatalog.cs Show resolved Hide resolved

antoniovs1029 commented Jun 18, 2020

View reviewed changes

test/Microsoft.ML.Tests/Scenarios/Api/TestApi.cs Show resolved Hide resolved

antoniovs1029 commented Jun 18, 2020

View reviewed changes

test/Microsoft.ML.Core.Tests/UnitTests/TestEntryPoints.cs Show resolved Hide resolved

antoniovs1029 changed the title ~~Fixed issue when running CrossValidation from an entrypoint~~ Combined methods related to splitting data into one single method. Also fixed related issues. Jun 18, 2020

yaeldekel reviewed Jun 22, 2020

View reviewed changes

antoniovs1029 added 3 commits June 22, 2020 20:09

Normalize even the column was already normalized

e2a76c7

Merge remote-tracking branch 'upstream/master' into forNimbusML

c7e0437

antoniovs1029 requested a review from a team as a code owner June 27, 2020 04:49

antoniovs1029 commented Jun 27, 2020

View reviewed changes

antoniovs1029 requested a review from yaeldekel June 27, 2020 05:17

Renamed CreatePreservationColumn to CreatSplitColumn and other simila…

3656236

…r renames for consistency. Internally we'll call the new column SplitColumn, regardless if it's based off a "samplingKeyColumn" or a "stratificationColumn"

antoniovs1029 requested a review from LittleLittleCloud June 27, 2020 06:22

LittleLittleCloud approved these changes Jun 27, 2020

View reviewed changes

yaeldekel reviewed Jun 30, 2020

View reviewed changes

yaeldekel approved these changes Jun 30, 2020

View reviewed changes

antoniovs1029 merged commit 709ec02 into dotnet:master Jun 30, 2020

ghost locked as resolved and limited conversation to collaborators Mar 18, 2022

Conversation

antoniovs1029 commented Jun 10, 2020 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

codecov bot commented Jun 11, 2020 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codecov Report

Uh oh!

yaeldekel Jun 11, 2020 • edited by antoniovs1029 Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

antoniovs1029 Jun 11, 2020

Choose a reason for hiding this comment

Uh oh!

antoniovs1029 Jun 18, 2020

Choose a reason for hiding this comment

Uh oh!

antoniovs1029 Jun 27, 2020 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

yaeldekel commented Jun 11, 2020 • edited by antoniovs1029 Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

yaeldekel Jun 22, 2020 • edited by antoniovs1029 Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

antoniovs1029 Jun 22, 2020

Choose a reason for hiding this comment

Uh oh!

antoniovs1029 Jun 27, 2020 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

LittleLittleCloud left a comment

Choose a reason for hiding this comment

Uh oh!

yaeldekel Jun 30, 2020

Choose a reason for hiding this comment

Uh oh!

antoniovs1029 Jun 30, 2020 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

yaeldekel Jun 30, 2020

Choose a reason for hiding this comment

Uh oh!

antoniovs1029 Jun 30, 2020

Choose a reason for hiding this comment

Uh oh!

yaeldekel left a comment

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

antoniovs1029 commented Jun 10, 2020 •

edited

Loading

codecov bot commented Jun 11, 2020 •

edited

Loading

yaeldekel Jun 11, 2020 •

edited by antoniovs1029

Loading

antoniovs1029 Jun 27, 2020 •

edited

Loading

yaeldekel commented Jun 11, 2020 •

edited by antoniovs1029

Loading

yaeldekel Jun 22, 2020 •

edited by antoniovs1029

Loading

antoniovs1029 Jun 27, 2020 •

edited

Loading

antoniovs1029 Jun 30, 2020 •

edited

Loading