ENH/PERF: provide pyarrow engine option for to_csv #53618

@jorisvandenbossche

We added the engine="pyarrow" option to read_csv(), but we could also do the equivalent for writing CSVs with to_csv.

The pyarrow.csv writer can also give a significant speed-up when writing CSVs (especially because our own writer is pure Python). A quick showcase with a fully numeric DataFrame:

In [1]: import numpy as np
   ...: import pandas as pd
   ...: df = pd.DataFrame(np.random.randn(1_000_000, 10), columns=list("abcdefghij"))

In [2]: %time df.to_csv("test_pandas.csv", index=False)
CPU times: user 10.7 s, sys: 418 ms, total: 11.1 s
Wall time: 12.2 s

In [3]: import pyarrow as pa
   ...: from pyarrow.csv import write_csv

In [4]: %timeit write_csv(pa.table(df), "test_arrow.csv")
1.88 s ± 295 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

As with reading CSVs, we will need to map our keywords to pyarrow's keywords and set matching defaults.
For the example above (without any custom settings), there are some small differences: the default float precision appears to be the same, but the quoting of strings differs, as the header rows below show (this might be worth reporting to Arrow to make it more configurable; currently quoting for strings is all or nothing):

$ head -3 test_pandas.csv
a,b,c,d,e,f,g,h,i,j
0.9524744196076045,0.21913063081328743,-1.3427643339799094,-2.66862972948282,0.09749875898199477,1.2870849976054641,-0.4253992607503571,0.7960922061946342,-0.5458462139415978,1.797736594226238
-0.9861117158157412,-0.14856474665751174,0.7884605447776409,0.5774211281637796,-0.024799957231053778,0.2859682446685537,1.0508204680473783,-1.1513513705558094,0.3334435129938111,-0.28739104967528223
$ head -3 test_arrow.csv
"a","b","c","d","e","f","g","h","i","j"
0.9524744196076045,0.21913063081328743,-1.3427643339799094,-2.66862972948282,0.09749875898199477,1.2870849976054641,-0.4253992607503571,0.7960922061946342,-0.5458462139415978,1.797736594226238
-0.9861117158157412,-0.14856474665751174,0.7884605447776409,0.5774211281637796,-0.024799957231053778,0.2859682446685537,1.0508204680473783,-1.1513513705558094,0.3334435129938111,-0.28739104967528223

Labels: Arrow (pyarrow functionality), Enhancement, IO CSV (read_csv, to_csv), Performance (memory or execution speed performance)
