add multi-proc in `to_csv` #2896

bhavitvyamalik · 2021-09-10T21:35:09Z

This PR extends the multi-proc method used in #2747 for `to_json` to `to_csv` as well.

Benchmark results on my machine with the ascent_kb dataset (giving ~45% improvement compared to num_proc = 1):

| num_proc | batch_size | Time taken |
|---|---|---|
| 1 | 10000 | 674.2055702209473 |
| 4 | 10000 | 425.6553490161896 |
| 1 | 50000 | 623.5897650718689 |
| 4 | 50000 | 380.0402421951294 |
| 4 | 100000 | 361.7168130874634 |

This is a WIP as writing tests is pending for this PR.

I'm also exploring this approach for which I'm using pyarrow-5.0.0.
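
For readers unfamiliar with the pattern, here is a minimal sketch of batched multi-process CSV writing. It is not the code from this PR: it swaps in the standard-library `csv` module for the pandas writer so it runs without dependencies, and `batch_to_csv` and `write_csv` are hypothetical names chosen for illustration. The idea it shows is that each worker serializes one batch to a string, and the parent process writes the strings to the file in batch order.

```python
import csv
import io
import multiprocessing


def batch_to_csv(args):
    """Serialize one batch of rows to a CSV string; only the first batch gets the header."""
    offset, header, rows = args
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    if offset == 0:
        writer.writerow(header)
    writer.writerows(rows)
    return buffer.getvalue()


def write_csv(rows, header, file_obj, batch_size=10_000, num_proc=1):
    """Write rows to file_obj in batches and return the number of characters written."""
    tasks = [
        (offset, header, rows[offset : offset + batch_size])
        for offset in range(0, len(rows), batch_size)
    ]
    written = 0
    if num_proc <= 1:
        # Single-process path: serialize and write each batch in turn.
        for task in tasks:
            written += file_obj.write(batch_to_csv(task))
    else:
        with multiprocessing.Pool(num_proc) as pool:
            # imap yields results in task order, so batches reach the file in row order.
            for csv_str in pool.imap(batch_to_csv, tasks):
                written += file_obj.write(csv_str)
    return written
```

Calling `write_csv(rows, header, f, num_proc=4)` should produce the same file as `num_proc=1`, since only the serialization is parallel and the writes stay sequential. On platforms that spawn worker processes, the call needs to sit under an `if __name__ == "__main__":` guard.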

lhoestq

Thanks ! This is going in the right direction :)

I think it's better if we stick with the pandas CSV writer rather than Arrow's, for consistency with JSON but also because the pandas one may be more mature

lhoestq · 2021-10-08T16:15:41Z

I think you can just add a test test_dataset_to_csv_multiproc in tests/io/test_csv.py and we'll be good

bhavitvyamalik · 2021-10-14T14:09:41Z

Hi @lhoestq,
I've added `test_dataset_to_csv` in addition to `test_dataset_to_csv_multiproc`, as there was no test checking the generated CSV file when num_proc=1. Please let me know if anything else is required!
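
The kind of check such tests can perform is sketched below: read the source CSV and the written CSV back row by row and confirm they match. This is an illustrative sketch using only the standard library, not the contents of tests/io/test_csv.py; `iter_csv_rows` and `csv_files_match` are hypothetical helper names.

```python
import csv
import itertools


def iter_csv_rows(path):
    """Yield the rows of a CSV file as lists of strings."""
    with open(path, newline="", encoding="utf-8") as f:
        yield from csv.reader(f)


def csv_files_match(expected_path, output_path):
    """Return True if both files contain the same rows in the same order."""
    # The sentinel pads the shorter file so a length mismatch compares unequal.
    sentinel = object()
    pairs = itertools.zip_longest(
        iter_csv_rows(expected_path), iter_csv_rows(output_path), fillvalue=sentinel
    )
    return all(expected == actual for expected, actual in pairs)
```

Running the same comparison against output written with num_proc=1 and with num_proc=2 covers both code paths with one helper.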

tests/io/test_csv.py

lhoestq

Looks all good now, thanks @bhavitvyamalik @mariosasko :)

add multi-proc in to_csv

f1c70e8

lhoestq reviewed Oct 8, 2021

bhavitvyamalik added 5 commits October 9, 2021 23:11

Merge remote-tracking branch 'origin/master' into to_csv

4a825cb

add tests for dataset to csv

c2eef04

Merge remote-tracking branch 'origin/master' into to_csv

2fbf680

make style

254d73e

fix imports

09a04a8

bhavitvyamalik marked this pull request as ready for review October 14, 2021 14:14

bhavitvyamalik added 2 commits October 14, 2021 20:06

fix path for windows

b9deb18

make style

9fe735f

bhavitvyamalik commented Oct 15, 2021

tests/io/test_csv.py

Fix tests

0927a0c

lhoestq approved these changes Oct 26, 2021

lhoestq merged commit 12b7e13 into huggingface:master Oct 26, 2021

bhavitvyamalik deleted the to_csv branch October 28, 2021 05:47
