-
Notifications
You must be signed in to change notification settings - Fork 344
New M.D.A.IO proj and improved LoadCsv #2927
Conversation
...a.Analysis.IO.Tests/Microsoft.Data.Analysis.IO.Tests/Microsoft.Data.Analysis.IO.Tests.csproj
Outdated
Show resolved
Hide resolved
@@ -0,0 +1,266 @@ | |||
// Licensed to the .NET Foundation under one or more agreements. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This file is mostly a duplicate of M.D.A.DataFrameIO.cs. I've called out the minor changes as comments
return res; | ||
} | ||
|
||
public static DataFrame LoadCsv(Func<Stream> csvStream, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The LoadCsv method parameters are slightly modified here.
|
||
List<DataFrameColumn> columns; | ||
// First pass: schema and number of rows. | ||
using (TextFieldParser parser = new TextFieldParser(stream, defaultEncoding: encoding ?? Encoding.UTF8)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Uses TextFieldParser instead of reading a stream. TextFieldParser handles quotes correctly and is in-box
} | ||
} | ||
|
||
using (TextFieldParser parser = new TextFieldParser(csvStream(), defaultEncoding: encoding ?? Encoding.UTF8)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
TextFieldParser seems to be forward-only. So I'm making 2 calls here as a quick proof of concept
public void TestReadCsvWithQuotes() | ||
{ | ||
string data = @"vendor_id,rate_code,passenger_count,trip_time_in_secs,trip_distance,payment_type,fare_amount" + Environment.NewLine + | ||
"\"CMT, Comma\",1,1,1271,3.8,CRD,17.5" + Environment.NewLine + |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is the new unit test. Note the ,
inside the quotes
<Project Sdk="Microsoft.NET.Sdk"> | ||
|
||
<PropertyGroup> | ||
<TargetFrameworks>netcoreapp3.1;net461</TargetFrameworks> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
<TargetFrameworks>netcoreapp3.1;net461</TargetFrameworks> | |
<TargetFrameworks>netstandard2.1;net461</TargetFrameworks> |
Is this possible? Or even
<TargetFrameworks>netcoreapp3.1;net461</TargetFrameworks> | |
<TargetFrameworks>netstandard2.1;netstandard2.0</TargetFrameworks> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
netcoreapp -> netstandard I'm good with. There was a reason eerhardt put in net461 explicitly. I don't remember why exactly now, so I'll leave it in.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, I'd forgotten that Microsoft.VisualBasic.FileIO is not in netstandard. https://apisof.net/catalog/Microsoft.VisualBasic.FileIO.TextFieldParser. Have to go back to netcoreapp3.0 and net461
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Which brings me to the question: @eerhardt: MDA target netstandard2.0. MDA.IO (the new csproj in this PR) is targeting netcoreapp3.0 and net461. Technically someone could have .NET Core 2.2 on their system and will be able to use MDA, but not MDA.IO. I'm thinking this is OK and our answer here is to upgrade to .NET Core 3.0 at the least? Thoughts?
...a.Analysis.IO.Tests/Microsoft.Data.Analysis.IO.Tests/Microsoft.Data.Analysis.IO.Tests.csproj
Outdated
Show resolved
Hide resolved
tests/Microsoft.Data.Analysis.IO.Tests/Microsoft.Data.Analysis.IO.Tests/DataFrame.IOTests.cs
Outdated
Show resolved
Hide resolved
tests/Microsoft.Data.Analysis.IO.Tests/Microsoft.Data.Analysis.IO.Tests/DataFrame.IOTests.cs
Show resolved
Hide resolved
hey @pgovind , I pushed some changes on one of my branches (https://github.com/chriss2401/corefxlab/commit/293c9acfe996af897a73dc0d380433d4245d543d). Among other things I have:
Unfortunately when I try to do a PR to So not sure whether I should make a new PR or do something else ? C. |
@chriss2401: Go ahead and make a new PR. I'll merge the changes on my end :) |
@eerhardt: Can you take a look at this when you have some time please? |
Ok, done :) #2938 |
Closing in favor of #2938 |
This PR gets a new IO project started for DataFrame so all the IO issues/PRs can go into the right project. I don't have much time to work on it at the moment, so it's not super clean yet. I copied some code from DataFrame so it's duplicated in 2 projects at the moment. I'll clean it up in a follow-up PR when I have some time. We're also able to handle quotes in text fields now. See the new
TestReadCsvWithQuotes
unit test.I'm thinking we can put this new project up for now to accept PRs/fix IO issues and clean up our code as we go? Thoughts?