[DataFrame] Fully implement read_csv#2108

Closed
pschafhalter wants to merge 2 commits into ray-project:master from pschafhalter:df-read-csv

@pschafhalter
Contributor

Changes

  • Updates the read_csv API to match Pandas 0.23
  • Refactors the code related to read_csv
  • Reads CSV files more efficiently
  • Gives more detailed warnings when defaulting to Pandas
  • Bug fixes
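
The "more detailed warnings" item can be pictured with a small sketch. Everything in it is hypothetical: the names SUPPORTED_DEFAULTS, unsupported_arguments, and read_csv_or_fallback, and the choice of which parameters count as supported, are assumptions for illustration and are not taken from this PR. The idea is simply to name the exact arguments that force the fallback instead of emitting a generic message.

```python
import warnings

# Hypothetical: parameters the parallel reader handles, with their defaults.
# Which parameters the PR actually supports is not stated in this thread.
SUPPORTED_DEFAULTS = {"sep": ",", "header": "infer", "encoding": None}


def unsupported_arguments(kwargs):
    """Return the sorted names of arguments the parallel path cannot honor."""
    return sorted(name for name in kwargs if name not in SUPPORTED_DEFAULTS)


def read_csv_or_fallback(path, parallel_reader, fallback_reader, **kwargs):
    """Use parallel_reader unless some argument forces the fallback.

    Both readers are injected callables taking (path, **kwargs), so the
    sketch stays independent of any particular library.
    """
    blocked = unsupported_arguments(kwargs)
    if not blocked:
        return parallel_reader(path, **kwargs)
    # Name every offending argument so the user can see why the fallback ran.
    warnings.warn(
        "Defaulting to the fallback reader because these arguments are "
        "not supported: " + ", ".join(blocked),
        UserWarning,
        stacklevel=2,
    )
    return fallback_reader(path, **kwargs)
```

In a real setting the fallback_reader argument would presumably be pandas.read_csv; it is left as a parameter here so the sketch runs without pandas installed.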

Notes

Performance benchmarks

Benchmarks were run on a 144 MB CSV file (creditcard.csv).

Time to read

Current master

In [3]: %time df = rdf.read_csv("creditcard.csv")
CPU times: user 27.5 ms, sys: 5.23 ms, total: 32.7 ms
Wall time: 76.7 ms

PR:

%time df = rdf.read_csv("creditcard.csv")
CPU times: user 32.7 ms, sys: 12.3 ms, total: 45 ms
Wall time: 63.5 ms

Time to read and display

Pandas 0.23:

In [3]: %time df = pd.read_csv("creditcard.csv")
CPU times: user 3.1 s, sys: 123 ms, total: 3.22 s
Wall time: 3.22 s

Current master (Pandas 0.22):

In [2]: %%time
   ...: df = rdf.read_csv("creditcard.csv")
   ...: result = repr(df)
   ...: 
CPU times: user 160 ms, sys: 84.2 ms, total: 245 ms
Wall time: 2.63 s

PR:

In [5]: %%time
   ...: df = rdf.read_csv("creditcard.csv")
   ...: result = repr(df)
   ...: 
CPU times: user 153 ms, sys: 85.9 ms, total: 239 ms
Wall time: 2.18 s
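
In the two Ray DataFrame runs above, wall time (2.63 s and 2.18 s) far exceeds the reported CPU total (245 ms and 239 ms), while the Pandas run shows the two roughly equal. One likely explanation, offered here as an assumption rather than something stated in this thread, is that the reported CPU time covers the calling process only, so work done in other processes shows up in wall time alone. The standard-library sketch below, built around a hypothetical helper time_call, shows how the two figures diverge when the caller mostly waits.

```python
import time


def time_call(func, *args, **kwargs):
    """Run func once and return (result, wall_seconds, cpu_seconds).

    wall_seconds is elapsed real time. cpu_seconds is CPU time consumed by
    this process only, so sleeping or waiting on other processes adds to the
    first figure but not to the second.
    """
    wall_start = time.perf_counter()
    cpu_start = time.process_time()
    result = func(*args, **kwargs)
    cpu = time.process_time() - cpu_start
    wall = time.perf_counter() - wall_start
    return result, wall, cpu


if __name__ == "__main__":
    # time.sleep stands in for waiting on work that happens elsewhere.
    _, wall, cpu = time_call(time.sleep, 0.2)
    print(f"wall={wall:.3f}s cpu={cpu:.3f}s")
```

When comparing readers this way, wall time is the figure that reflects what a user actually waits for.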

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/5514/

@pschafhalter force-pushed the df-read-csv branch 2 times, most recently from 7470e6b to 1570871 (May 24, 2018 00:00)
Refactor read_csv

Fix bug where py._path is inaccessible

Return iterator when passing chunksize

Fix encoding errors

Correct index name
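
The commit note "Return iterator when passing chunksize" mirrors pandas, where read_csv called with chunksize returns an iterator over chunks rather than a single frame. As a rough illustration of that contract only, here is a standard-library sketch; iter_csv_chunks is a hypothetical helper and does not represent how this PR implements the feature.

```python
import csv
import io
from itertools import islice


def iter_csv_chunks(text_stream, chunksize):
    """Yield lists of at most `chunksize` row dicts from an open CSV stream."""
    if chunksize < 1:
        raise ValueError("chunksize must be a positive integer")
    reader = csv.DictReader(text_stream)
    while True:
        # Pull the next batch of rows without reading the whole file.
        chunk = list(islice(reader, chunksize))
        if not chunk:
            return
        yield chunk


if __name__ == "__main__":
    data = io.StringIO("a,b\n1,2\n3,4\n5,6\n")
    for chunk in iter_csv_chunks(data, chunksize=2):
        print(len(chunk), chunk[0]["a"])
```

Because the helper is a generator, rows are read only as the caller iterates, which is the property that makes chunked reading useful for files that do not fit in memory.
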
@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/5607/

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/5608/

@pschafhalter
Contributor Author

Moved to modin-project/modin#4.

Development

Successfully merging this pull request may close these issues.

Deprecated parameters in read_csv method for pandas 0.23
