This repository has been archived by the owner on Aug 2, 2023. It is now read-only.

New M.D.A.IO proj and improved LoadCsv #2927

Closed
wants to merge 2 commits into from

Conversation

@pgovind pgovind commented Jun 1, 2020

This PR gets a new IO project started for DataFrame so all the IO issues/PRs can go into the right project. I don't have much time to work on it at the moment, so it's not super clean yet. I copied some code from DataFrame so it's duplicated in 2 projects at the moment. I'll clean it up in a follow-up PR when I have some time. We're also able to handle quotes in text fields now. See the new TestReadCsvWithQuotes unit test.

I'm thinking we can put this new project up for now to accept PRs/fix IO issues and clean up our code as we go? Thoughts?

@pgovind pgovind changed the title New M.D.A.IO proj and LoadCsv New M.D.A.IO proj and improved LoadCsv Jun 1, 2020
@pgovind pgovind requested review from eerhardt and 333fred and removed request for 333fred June 1, 2020 18:46
@@ -0,0 +1,266 @@
// Licensed to the .NET Foundation under one or more agreements.
Author

This file is mostly a duplicate of M.D.A.DataFrameIO.cs. I've called out the minor changes as comments

return res;
}

public static DataFrame LoadCsv(Func<Stream> csvStream,
Author

@pgovind pgovind Jun 1, 2020

The LoadCsv method parameters are slightly modified here.


List<DataFrameColumn> columns;
// First pass: schema and number of rows.
using (TextFieldParser parser = new TextFieldParser(stream, defaultEncoding: encoding ?? Encoding.UTF8))
Author

This uses TextFieldParser instead of reading the stream directly. TextFieldParser handles quoted fields correctly and is in-box.
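As a minimal sketch of the quote handling described above (not code from this PR), the following parses a small CSV whose first field contains a comma inside quotes. `QuotedCsvSketch` and `ReadFirstDataRow` are hypothetical names introduced for illustration; the `TextFieldParser` members used are from `Microsoft.VisualBasic.FileIO`.

```csharp
using System;
using System.IO;
using Microsoft.VisualBasic.FileIO;

public static class QuotedCsvSketch
{
    // Hypothetical helper for illustration only; not part of the PR.
    // Skips the header row and returns the fields of the first data row.
    public static string[] ReadFirstDataRow(string csv)
    {
        using (var parser = new TextFieldParser(new StringReader(csv)))
        {
            parser.TextFieldType = FieldType.Delimited;
            parser.SetDelimiters(",");
            // Quoted fields may contain the delimiter; set explicitly for clarity.
            parser.HasFieldsEnclosedInQuotes = true;

            parser.ReadFields();        // header row
            return parser.ReadFields(); // first data row
        }
    }

    public static void Main()
    {
        string csv = "vendor_id,rate_code,fare_amount" + Environment.NewLine +
                     "\"CMT, Comma\",1,17.5" + Environment.NewLine;

        string[] fields = ReadFirstDataRow(csv);

        // The quoted comma stays inside the first field instead of splitting it.
        Console.WriteLine(fields.Length);
        Console.WriteLine(fields[0]);
    }
}
```

A naive `line.Split(',')` on the same data row would produce four fields, which is the failure mode the switch to TextFieldParser avoids.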

}
}

using (TextFieldParser parser = new TextFieldParser(csvStream(), defaultEncoding: encoding ?? Encoding.UTF8))
Author

TextFieldParser seems to be forward-only, so I'm calling csvStream() twice here (once per pass) as a quick proof of concept.
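The two-pass idea can be sketched roughly as follows, assuming the caller supplies a `Func<Stream>` that returns a fresh stream on every call. `TwoPassSketch`, `ReadAllRows`, and `CreateParser` are hypothetical names for illustration, and the first pass here only counts rows, whereas the PR's first pass also infers the schema.

```csharp
using System;
using System.IO;
using System.Text;
using Microsoft.VisualBasic.FileIO;

public static class TwoPassSketch
{
    // Hypothetical helper: pass 1 counts rows, pass 2 reads them into
    // a pre-sized array. Each pass opens its own stream via csvStream().
    public static string[][] ReadAllRows(Func<Stream> csvStream, Encoding encoding = null)
    {
        int rowCount = 0;
        using (TextFieldParser parser = CreateParser(csvStream(), encoding))
        {
            while (!parser.EndOfData)
            {
                parser.ReadFields();
                rowCount++;
            }
        }

        var rows = new string[rowCount][];
        // The first parser cannot be rewound, so a second stream is requested.
        using (TextFieldParser parser = CreateParser(csvStream(), encoding))
        {
            for (int i = 0; i < rowCount; i++)
            {
                rows[i] = parser.ReadFields();
            }
        }
        return rows;
    }

    private static TextFieldParser CreateParser(Stream stream, Encoding encoding)
    {
        var parser = new TextFieldParser(stream, encoding ?? Encoding.UTF8);
        parser.TextFieldType = FieldType.Delimited;
        parser.SetDelimiters(",");
        return parser;
    }

    public static void Main()
    {
        byte[] bytes = Encoding.UTF8.GetBytes("a,b\n1,2\n3,4\n");
        string[][] rows = ReadAllRows(() => new MemoryStream(bytes));
        Console.WriteLine(rows.Length);
    }
}
```

For a file on disk, the factory could be `() => File.OpenRead(path)`. The cost of this design is that the source must be openable twice, so a one-shot network stream would not work without buffering.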

public void TestReadCsvWithQuotes()
{
string data = @"vendor_id,rate_code,passenger_count,trip_time_in_secs,trip_distance,payment_type,fare_amount" + Environment.NewLine +
"\"CMT, Comma\",1,1,1271,3.8,CRD,17.5" + Environment.NewLine +
Author

@pgovind pgovind Jun 1, 2020

This is the new unit test. Note the comma inside the quotes.

src/Microsoft.Data.Analysis.IO/DataFrameIO.cs Outdated
<Project Sdk="Microsoft.NET.Sdk">

<PropertyGroup>
<TargetFrameworks>netcoreapp3.1;net461</TargetFrameworks>
Member

Suggested change
<TargetFrameworks>netcoreapp3.1;net461</TargetFrameworks>
<TargetFrameworks>netstandard2.1;net461</TargetFrameworks>

Is this possible? Or even

Suggested change
<TargetFrameworks>netcoreapp3.1;net461</TargetFrameworks>
<TargetFrameworks>netstandard2.1;netstandard2.0</TargetFrameworks>

Author

I'm good with netcoreapp -> netstandard. There was a reason eerhardt put in net461 explicitly. I don't remember exactly why now, so I'll leave it in.

Author

@pgovind pgovind Jun 19, 2020

Ah, I'd forgotten that Microsoft.VisualBasic.FileIO is not in netstandard (see https://apisof.net/catalog/Microsoft.VisualBasic.FileIO.TextFieldParser). We have to go back to netcoreapp3.0 and net461.

Author

Which brings me to the question, @eerhardt: MDA targets netstandard2.0. MDA.IO (the new csproj in this PR) targets netcoreapp3.0 and net461. Technically, someone with only .NET Core 2.2 on their system would be able to use MDA, but not MDA.IO. I'm thinking this is OK, and our answer here is to upgrade to at least .NET Core 3.0? Thoughts?

@chriss2401

Hey @pgovind, I pushed some changes to one of my branches (https://github.com/chriss2401/corefxlab/commit/293c9acfe996af897a73dc0d380433d4245d543d). Among other things, I have:

  • Cleaned duplicate code
  • Added a CultureInfo parameter to the public API to handle commas and dots as decimal separators in Single/Double values (#2926)
  • Added handling of conversion exceptions and NaN values (when the data type is Single/Double, #2902)
  • Wrote unit tests

Unfortunately, when I try to open a PR against pgovind:DataFrameIO, I cannot find your fork in the list of forks (screenshot: pr_github).

So I'm not sure whether I should make a new PR or do something else?

C.

@pgovind
Author

pgovind commented Jun 19, 2020

@chriss2401: Go ahead and make a new PR. I'll merge the changes on my end :)

@pgovind
Author

pgovind commented Jun 19, 2020

@eerhardt: Can you take a look at this when you have some time please?

@chriss2401

@chriss2401: Go ahead and make a new PR. I'll merge the changes on my end :)

Ok, done :) #2938

@pgovind
Author

pgovind commented Jul 28, 2020

Closing in favor of #2938

@pgovind pgovind closed this Jul 28, 2020