-
Notifications
You must be signed in to change notification settings - Fork 1.9k
Allow TextLoader to load empty float/double fields as NaN instead of 0 #5198
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Changes from 16 commits
a5ac022
e7e7da7
a7a454f
7d10358
4dd081c
b952e71
c0cdbec
66c6c24
684d359
b1cc176
65fcb59
a68da25
df32ba1
f6912b1
3dd66ff
280a35e
8e05dff
1cc7eff
b074412
19241ef
00a972b
a417228
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -433,10 +433,9 @@ public class Options | |
| /// </summary> | ||
| [Argument(ArgumentType.AtMostOnce, | ||
| HelpText = | ||
| "Whether the input may include quoted values, which can contain separator characters, colons," + | ||
| " and distinguish empty values from missing values. When true, consecutive separators denote a" + | ||
| " missing value and an empty value is denoted by \"\". When false, consecutive separators" + | ||
| " denote an empty value.", | ||
| "Whether the input may include double-quoted values. This parameter is used to distinguish separator characters in an input value" + | ||
| "from actual separators. When true, separators within double quotes are treated as part of the input value. When false, all" + | ||
| "separators, even those within quotes, are treated as delimiting a new column.", | ||
| ShortName = "quote")] | ||
| public bool AllowQuoting = Defaults.AllowQuoting; | ||
|
|
||
|
|
@@ -533,6 +532,12 @@ public class Options | |
| [Argument(ArgumentType.AtMostOnce, HelpText = "Character to use to escape quotes inside quoted fields. It can't be a character used as separator.", ShortName = "escapechar")] | ||
| public char EscapeChar = Defaults.EscapeChar; | ||
|
|
||
| /// <summary> | ||
| /// If true, empty float fields will be loaded as NaN. If false, they'll be loaded as 0. Default is false. | ||
| /// </summary> | ||
| [Argument(ArgumentType.AtMostOnce, HelpText = "If true, empty float fields will be loaded as NaN. If false, they'll be loaded as 0. Default is false.", ShortName = "imputefloat")] | ||
| public bool ImputeEmptyFloats = Defaults.ImputeEmptyFloats; | ||
|
Contributor
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I'm not happy with the name I chose for this option, and for text I wrote as its doc. I think the name "field" can mean many things in the C# world (but inside the code of the TextLoader a "field" is the value found on a given column on a given row). Also, this option is supposed to be related to both Single and Doubles, so "float" is misleading. And "empty" would actually mean empty fields or fields with only whitespace. So I'm not sure how to name this and what to say on the public docs.
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I am not sure 'Impute' fits the terminology here. A few options to consider: ParseEmptyFloatsAsNan, ParseEmptyRealsAsNan, EmptyFieldsToNan, EmptyValuesToNan, etc. In reply to: 434309670 [](ancestors = 434309670)
Contributor
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I used the word "impute" as that was the original name @justinormont suggested here #4132 (comment) As for the names you provided, perhaps EmptyRealsAsNan would fit well, as it avoids the word Field, and "Reals" could imply Floats or Doubles. I will leave unaddressed right now, and see if we get any suggestion on this. It's only a variable name change, and I'll be able to make the change before merging the PR. In reply to: 434886717 [](ancestors = 434886717,434309670)
Contributor
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I've decided to go for MissingRealsAsNan , since it's not only the Empty fields that get mapped as NaNs, but also the ones that only have whitespace, and now also the ones were there are missing columns in a given row. In reply to: 435163479 [](ancestors = 435163479,434886717,434309670)
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. In that case you should fix the documentation above to read "If true, empty or missing float and double fields will be loaded as NaN" In reply to: 437077933 [](ancestors = 437077933,435163479,434886717,434309670)
Contributor
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Yeah, the documentation reads: In reply to: 437170658 [](ancestors = 437170658,437077933,435163479,434886717,434309670) |
||
|
|
||
| /// <summary> | ||
| /// Checks that all column specifications are valid (that is, ranges are disjoint and have min<=max). | ||
| /// </summary> | ||
|
|
@@ -552,6 +557,7 @@ internal static class Defaults | |
| internal const bool TrimWhitespace = false; | ||
| internal const bool ReadMultilines = false; | ||
| internal const char EscapeChar = '"'; | ||
| internal const bool ImputeEmptyFloats = false; | ||
| } | ||
|
|
||
| /// <summary> | ||
|
|
@@ -1078,7 +1084,8 @@ private static VersionInfo GetVersionInfo() | |
| //verWrittenCur: 0x0001000A, // Added ForceVector in Range | ||
| //verWrittenCur: 0x0001000B, // Header now retained if used and present | ||
| //verWrittenCur: 0x0001000C, // Removed Min and Contiguous from KeyType, and added ReadMultilines flag to OptionFlags | ||
| verWrittenCur: 0x0001000D, // Added escapeChar option and decimal marker option to allow for ',' to be a decimal marker | ||
| //verWrittenCur: 0x0001000D, // Added escapeChar and decimalMarker chars | ||
| verWrittenCur: 0x0001000E, // Added imputeEmptyFloats flag | ||
| verReadableCur: 0x0001000A, | ||
| verWeCanReadBack: 0x00010009, | ||
| loaderSignature: LoaderSignature, | ||
|
|
@@ -1097,7 +1104,8 @@ private enum OptionFlags : uint | |
| AllowQuoting = 0x04, | ||
| AllowSparse = 0x08, | ||
| ReadMultilines = 0x10, | ||
| All = TrimWhitespace | HasHeader | AllowQuoting | AllowSparse | ReadMultilines | ||
| ImputeEmptyFloats = 0x20, | ||
| All = TrimWhitespace | HasHeader | AllowQuoting | AllowSparse | ReadMultilines | ImputeEmptyFloats | ||
| } | ||
|
|
||
| // This is reserved to mean the range extends to the end (the segment is variable). | ||
|
|
@@ -1179,6 +1187,8 @@ internal TextLoader(IHostEnvironment env, Options options = null, IMultiStreamSo | |
| _flags |= OptionFlags.AllowSparse; | ||
| if (options.AllowQuoting && options.ReadMultilines) | ||
| _flags |= OptionFlags.ReadMultilines; | ||
| if (options.ImputeEmptyFloats) | ||
|
antoniovs1029 marked this conversation as resolved.
Outdated
|
||
| _flags |= OptionFlags.ImputeEmptyFloats; | ||
|
|
||
| // REVIEW: This should be persisted (if it should be maintained). | ||
| _maxRows = options.MaxRows ?? long.MaxValue; | ||
|
|
@@ -1407,7 +1417,25 @@ private TextLoader(IHost host, ModelLoadContext ctx) | |
| _maxRows = ctx.Reader.ReadInt64(); | ||
| host.CheckDecode(_maxRows > 0); | ||
| _flags = (OptionFlags)ctx.Reader.ReadUInt32(); | ||
| host.CheckDecode((_flags & ~OptionFlags.All) == 0); | ||
|
|
||
| // Flags introduced with the first ML.NET commit: | ||
| var acceptableFlags = OptionFlags.TrimWhitespace; | ||
| acceptableFlags |= OptionFlags.HasHeader; | ||
| acceptableFlags |= OptionFlags.AllowQuoting; | ||
| acceptableFlags |= OptionFlags.AllowSparse; | ||
|
|
||
| // Flags added on later versions of TextLoader: | ||
| if(ctx.Header.ModelVerWritten >= 0x0001000C) | ||
| { | ||
| acceptableFlags |= OptionFlags.ReadMultilines; | ||
| } | ||
| if(ctx.Header.ModelVerWritten >= 0x0001000E) | ||
| { | ||
| acceptableFlags |= OptionFlags.ImputeEmptyFloats; | ||
| } | ||
|
|
||
| host.CheckDecode((_flags & ~acceptableFlags) == 0); | ||
|
|
||
| _inputSize = ctx.Reader.ReadInt32(); | ||
| host.CheckDecode(0 <= _inputSize && _inputSize < SrcLim); | ||
|
|
||
|
|
||
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,18 @@ | ||
| int,description,num1,num2,date,num3,num4 | ||
| 0,"this is a description",0.12,0.34,01/01/2001,0.56,0.78 | ||
| 0,"this has an empty int and date", 1.1, 11.11,1/1/0001,111.111,1111.11111 | ||
| 0,"this has a quoted empty int and date", 1.1, 11.11,1/1/0001,111.111,1111.11111 | ||
| 1,"this has a quoted int and date", 1.1, 11.11,1/1/2001,111.111,1111.11111 | ||
| 2,"this has an empty num1 and a space in num3",NaN,22.22,2/2/2002,NaN,2222.2222 | ||
| 3,"this has an empty quoted num1 and a quoted space in num3",NaN,33.33,3/3/2003,NaN,3333.3333 | ||
| 4,"this has a space in num2 and a space in num4",4.4,NaN,4/4/2004,444.444,NaN | ||
| 5,"this has a quoted space num2 and quoted space in num4",5.5,NaN,5/5/2005,555.555,NaN | ||
| // The next two rows map the missing columns as 0, as it was decided not to impute with NaN | ||
| // in this case | ||
| 6,"this has no date, num3 or num4 (the separator corresponding to them is also missing)",6.6,66.66,1/1/0001,0,0 | ||
| 7,"this has no num4 (the separator corresponding to it is missing)",7.7,77.77,7/7/2007,777.777,0 | ||
|
Contributor
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. (appended to example to allow for threaded discussion, this is how it will actually load missing columns at the end of the row) The feature I implement on this PR means that "1,,dog" will be loaded as "1,NaN,dog" if the second column is a float/double. And to load "1,dog," as "1,dog,NaN" if the third column is a float/double. This was the behavior requested on #4132 What to do with incomplete rows was undefined in the request. I.e. what to do if there are supposed to be 6 columns but on one row there's only 2? (I.e. In such a case, I took the decision to simply load the missing columns with default values (i.e. 0 for Float and Double) even when using the new feature of this PR. The reason behind this is that the implementation required to load those missing columns as NaNs only for float/doubles would be somewhat hacky (and I didn't figure out how to make it work for vector types which included fields on missing columns on a given row). Besides, since it might be a fringe case. it's perhaps better not to support it. Any thoughts on this, @justinormont @harishsk ? #Resolved
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. What is the current behavior of the Textloader when it encounters incomplete rows? Does it thrown an error or does it load default values for missing columns? In reply to: 434313813 [](ancestors = 434313813)
Contributor
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. In the case of incomplete rows it returns default values, and doesn't throw, both with or without the new ImputeEmptyFloats option. I'll update this PR to have it return NaNs, but I won't close this thread because I'd still want to leave this discussion on. Once I update this PR with that fix, we can decide if this case is worth it the changes that would need to be introduced to cover it. In reply to: 434896173 [](ancestors = 434896173,434313813)
Contributor
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. To fix this case I'll also need to do some checks like The same concerns I've mentioned there apply to this, although in here I think I can change the code to do the cast only when there are missing float/doubles columns for a given row and the ImputeEmptyFloats option is enabled. In reply to: 435167058 [](ancestors = 435167058,434896173,434313813)
Contributor
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. It turned out fixing this case was much simpler than what I had originally expected, and no need to the In reply to: 435219809 [](ancestors = 435219809,435167058,434896173,434313813) |
||
| // In the next case we do impute with NaN because the separator is there | ||
| 8,"this has nothing in num4, but includes the last separator",8.8,88.88,8/8/2008,888.888,NaN | ||
| 9,,9.9,99.99,9/9/2009,999.999,NaN | ||
| 0,"",10.10,NaN,10/10/2010,101010.101010,NaN | ||
| 11,NaN,NaN,NaN,11/11/2011,NaN,Infinity | ||
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,15 @@ | ||
| int,description,num1,num2,date,num3,num4 | ||
| 0,"this is a description",0.12,0.34,01/01/2001,0.56,0.78 | ||
| 0,"this has an empty int and date", 1.1, 11.11,1/1/0001,111.111,1111.11111 | ||
| 0,"this has a quoted empty int and date", 1.1, 11.11,1/1/0001,111.111,1111.11111 | ||
| 1,"this has a quoted int and date", 1.1, 11.11,1/1/2001,111.111,1111.11111 | ||
| 2,"this has an empty num1 and a space in num3",0,22.22,2/2/2002,0,2222.2222 | ||
| 3,"this has an empty quoted num1 and a quoted space in num3",0,33.33,3/3/2003,0,3333.3333 | ||
| 4,"this has a space in num2 and a space in num4",4.4,0,4/4/2004,444.444,0 | ||
| 5,"this has a quoted space num2 and quoted space in num4",5.5,0,5/5/2005,555.555,0 | ||
| 6,"this has no date, num3 or num4 (the separator corresponding to them is also missing)",6.6,66.66,1/1/0001,0,0 | ||
| 7,"this has no num4 (the separator corresponding to it is missing)",7.7,77.77,7/7/2007,777.777,0 | ||
| 8,"this has nothing in num4, but includes the last separator",8.8,88.88,8/8/2008,888.888,0 | ||
| 9,,9.9,99.99,9/9/2009,999.999,NaN | ||
| 0,,10.10,NaN,10/10/2010,101010.101010,NaN | ||
| 11,NaN,NaN,NaN,11/11/2011,0,Infinity |
Uh oh!
There was an error while loading. Please reload this page.