Added decimal marker option in TextLoader#5145
Conversation
…to TextLoaderCursor and TextLoaderParser
Codecov Report
@@ Coverage Diff @@
## master #5145 +/- ##
==========================================
+ Coverage 75.76% 75.78% +0.01%
==========================================
Files 993 993
Lines 180746 180915 +169
Branches 19463 19474 +11
==========================================
+ Hits 136944 137108 +164
- Misses 38516 38518 +2
- Partials 5286 5289 +3
|
|
This PR also opens up a possible case where datasets having different decimal separators being read in at the same time can result in an issue. For example, when on 2 separate threads we're loading two datasets where one has Edit: I have added a note in |
| } | ||
|
|
||
| [Fact] | ||
| public void TestCommaAsDecimalMarkerDouble() |
There was a problem hiding this comment.
This test is very similar to TestCommaAsDecimalMarkerFloat() above. This is because it is necessary to test decimal markers with floats and doubles. I was wishing to use InlineData to input a bool for using doubles and floats, however the specification of float/double is necessary to instantiate the VBuffer and ValueGetters, and there is no better way to specify varying types that I know.
With floats:
machinelearning/test/Microsoft.ML.Tests/TextLoaderTests.cs
Lines 1022 to 1023 in 6837f64
With doubles:
machinelearning/test/Microsoft.ML.Tests/TextLoaderTests.cs
Lines 1087 to 1088 in 6837f64
If anyone has a better way of prevention code repetition, I'm all eyes and ears! #Resolved
There was a problem hiding this comment.
As discussed offline, I agree with your approach of having 2 separate tests for each case, for the reasons you've mentioned above. #Resolved
There was a problem hiding this comment.
You can do this a bit better by refactoring the core test as TestCommaAsDecimalMarker<T>() and then calling TestCommaAsDecimalMarker<float> from within TestCommaAsDecimalMarkerFloat(). You will still have two tests, but the code will be easier to follow that this is the same test on two different types.
In reply to: 429001199 [](ancestors = 429001199)
There was a problem hiding this comment.
@harishsk now that in TestCommaAsDecimalMarkerFloat() I am testing both the .csv (with comma as both separator and decimal marker) and the .txt versions of the same dataset by using InlineDataAttribute, I think both tests would become harder to follow with the addition of a generic type parameter and calling TestCommaAsDecimalMarker<float> from within TestCommaAsDecimalMarkerFloat() #Resolved
There was a problem hiding this comment.
Please try it out. This is the pattern used in a lot of onnx conversion tests and it has improved readability there. #Resolved
There was a problem hiding this comment.
@harishsk thank you for your suggestion. I have now compressed the two tests, where I have a theory TestCommaAsDecimalMarker(bool useCsvVersion) test that calls a TestCommaAsDecimalMarkerHelper<T>(bool useCsvVersion) helper test. If only I knew a way to pass in the floating point value types float and double as inputs to TestCommaAsDecimalMarkerHelper<T>(bool useCsvVersion), I would have done the helper test this way to only have 1 unit test:
[Theory]
[InlineData(true, typeof(float))]
[InlineData(true, typeof(double))]
[InlineData(false, typeof(float))]
[InlineData(false, typeof(double))]
public void TestCommaAsDecimalMarkerHelper(bool useCsvVersion, GENERIC_TYPE featureType)
{
...
// Who knew compressing tests can be so beautiful 😄
}
#Resolved|
LGTM. There's only one minor nit comment I left. Also, there's the potential issue you've mentioned here. As discussed offline with @harishsk , we all agree that the scenario where it could be a problem (i.e. having different textloaders with different Decimal Marker options reading, at the same time, different datasets) would be fringe enough, to consider it an edge case that shouldn't stop us from adding this new feature you've worked in. Still, I will soon try to fix issue #4132 and as part of my work there I will try to explore other ways to better connect the options in textloader to the DoubleParser, without running with the potential issue I've mentioned. Aside from that, your tests do show that the feature works for the general cases where it would be used. So thanks for the thorough testing, @mstfbl 😄 #Resolved |
|
/azp run |
|
Azure Pipelines successfully started running 2 pipeline(s). |
This PR adds the decimal marker option in TextLoader, so that cultures where a comma is the decimal marker (as in 3,5 = 3.5 * 10^1) can use their appropriate datasets. This also updates
verWrittenCuras it is now writingdecimalMarkerduring serialization as well. In addition, this PR also adds in a unit test to check whether or not a dataset with','as its decimal marker is read in and processed correctly.Fix #4910