fix TextLoader bug when there's newline between quotes #4584
fix TextLoader bug when there's newline between quotes #4584LittleLittleCloud wants to merge 4 commits intodotnet:masterfrom
Conversation
|
|
||
| public static string ReadEntry(this TextReader sr) | ||
| { | ||
| string strReturn = string.Empty; |
| // And get more lines until the number of quotes is even | ||
| while (strReturn.GetNumberOf("\"").IsOdd()) | ||
| { | ||
| string strNow = sr.ReadLine(); |
| } | ||
| } | ||
|
|
||
| public static int GetNumberOf(this string s, string strSearchString) |
There was a problem hiding this comment.
I would just leave this as a static method instead of an extension method. It's easier to read that way.
There was a problem hiding this comment.
I would agree, especially with completion for unimported extension methods coming from Roslyn. Specialized extension methods for common types will be uncomfortably noisy during development.
|
|
||
| public static int GetNumberOf(this string s, string strSearchString) | ||
| { | ||
| return s.Length - s.Replace(strSearchString, string.Empty).Length; |
There was a problem hiding this comment.
Actually there's a bug.... It should divided by the length of strSearchString to get the correct result. also if I do that, it won't work on string.Empty (div by 0)... so hard to write the correct code!
| entry += sr.ReadLine(); | ||
|
|
||
| // And get more lines until the number of quotes is even | ||
| while (GetNumberOf(entry, "\"") % 2 != 0 ) |
There was a problem hiding this comment.
Need to first check that that quoting is enabled, and need to check for escaped quotes. May need a flag/field to define the escaping type, defaulting to allow either none or both.
Some tests:
True,"asdf,123is perfectly ok without quoting enabledTrue,"12\" in a foot",123has an escaped quote w/ slash escapingTrue,"12"" in a foot",123has an escaped quote w/ double-double quotes (Excel style)True, "asdf,123is ok, if following Excel style as it ignores quoted fields with a space before
There was a problem hiding this comment.
Also need to not append to lines at the start of the file which begin with // or #:
if (entry[0] == '#' || entry[0] == '/' && entry[1] == '/') { ... }There was a problem hiding this comment.
So try to repeat the logic here
read line
if 'line' starts with '#' or starts with '\' => return 'line' # comments
if quote is disabled: => use the existing logic in TextLoader
else
if ( 'line' has only odd# real quote ) => read next 'line' until sum of real quote is even and return 'line' # real quote means " and it's not one of [ "", \" "]
else return 'line'
And BTW it seems the existing enable quote logic doesn't check if it's real quote or not. ( for example in lines: "blablabla""blablabla" will cause an error in FetchNextField function, but in Excel it's legal) Do we have a plan for adding support for that?
|
|
||
| public static int GetNumberOf(string s, string strSearchString) | ||
| { | ||
| if(strSearchString.Length == 0 || s.Length == 0) |
There was a problem hiding this comment.
| if(strSearchString.Length == 0 || s.Length == 0) | |
| if (strSearchString.Length == 0 || s.Length == 0) |
| { | ||
| return 0; | ||
| } | ||
| return (s.Length - s.Replace(strSearchString, string.Empty).Length) / strSearchString.Length; |
There was a problem hiding this comment.
String.Replace() is not efficient as it reallocates.
| return; | ||
|
|
||
| text = rdr.ReadLine(); | ||
| text = rdr.ReadEntry(); |
There was a problem hiding this comment.
Should check if moving to multi-line reading will interfere with multi-threaded reading.
There was a problem hiding this comment.
How to check that, could you expand a bit?
| // introducing a CharSpan type (similar to ReadOnlyMemory but based on char[] or StringBuilder) | ||
| // and implementing all the necessary conversion functionality on it. See task 3871. | ||
| text = rdr.ReadLine(); | ||
| text = rdr.ReadEntry(); |
There was a problem hiding this comment.
Recommend a flag for multi-line support.
There was a problem hiding this comment.
I think it can share the same flag with EnableQuote, If we allow quote, then multi-line support is necessary to make sure TextLoader function well
| while (GetNumberOf(entry, "\"") % 2 != 0 ) | ||
| { | ||
| string line = sr.ReadLine(); | ||
| entry += line; |
There was a problem hiding this comment.
Likely want to begin a string builder instead of a +=.
There was a problem hiding this comment.
You'll want to push thru a newline into the string builder. ReadLine() doesn't include the \n or \r or \r\n [ref].
Otherwise:
a,b,"c my missing
space",e,fWould be read as:
a,b,"c my missingspace",e,fInstead of:
a,b,"c my missing\nspace",e,fIt likely is not important, but it may be better to match the \n or \r or \r\n of the last line read. This would only be important if the user read the file w/ TextLoader but fed the model with an IEnumerable.
A quick workaround for this Issue
#4460
perhaps also this one
#4555