Skip to content

Conversation

@bhavitvyamalik
Copy link
Contributor

This PR extends the multi-proc method used in #2747 forto_json to to_csv as well.

Results on my machine post benchmarking on ascent_kb dataset (giving ~45% improvement when compared to num_proc = 1):

Time taken on 1 num_proc, 10000 batch_size  674.2055702209473
Time taken on 4 num_proc, 10000 batch_size  425.6553490161896

Time taken on 1 num_proc, 50000 batch_size  623.5897650718689
Time taken on 4 num_proc, 50000 batch_size  380.0402421951294

Time taken on 4 num_proc, 100000 batch_size  361.7168130874634

This is a WIP as writing tests is pending for this PR.

I'm also exploring this approach for which I'm using pyarrow-5.0.0.

Copy link
Member

@lhoestq lhoestq left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks ! This is going in the right direction :)

I think it's better if we stick with the pandas CSV writer rather than Arrow's, for consistency with JSON but also because the pandas one may be more mature

@lhoestq
Copy link
Member

lhoestq commented Oct 8, 2021

I think you can just add a test test_dataset_to_csv_multiproc in tests/io/test_csv.py and we'll be good

@bhavitvyamalik
Copy link
Contributor Author

bhavitvyamalik commented Oct 14, 2021

Hi @lhoestq,
I've added test_dataset_to_csv apart from test_dataset_to_csv_multiproc as no test was there to check generated CSV file when num_proc=1. Please let me know if anything is also required!

@bhavitvyamalik bhavitvyamalik marked this pull request as ready for review October 14, 2021 14:14
Copy link
Member

@lhoestq lhoestq left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks all good now, thanks @bhavitvyamalik @mariosasko :)

@lhoestq lhoestq merged commit 12b7e13 into huggingface:master Oct 26, 2021
@bhavitvyamalik bhavitvyamalik deleted the to_csv branch October 28, 2021 05:47
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants