Open
Labels
Arrow (pyarrow functionality), Enhancement, IO CSV (read_csv, to_csv), Performance (memory or execution speed performance)
Description
We added the engine="pyarrow" option to read_csv(), but we could also do the equivalent for writing CSVs with to_csv.
For writing CSVs too, the pyarrow.csv writer can give a significant speed-up (especially because our own writer is pure Python). A quick showcase with a fully numeric DataFrame:
# Assumes: import numpy as np; import pandas as pd; import pyarrow as pa
In [1]: df = pd.DataFrame(np.random.randn(1_000_000, 10), columns=list("abcdefghij"))
In [2]: %time df.to_csv("test_pandas.csv", index=False)
CPU times: user 10.7 s, sys: 418 ms, total: 11.1 s
Wall time: 12.2 s
In [3]: from pyarrow.csv import write_csv
In [4]: %timeit write_csv(pa.table(df), "test_arrow.csv")
1.88 s ± 295 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
As with reading CSVs, we will need to map our keywords to pyarrow's keywords and set some matching defaults.
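A minimal sketch of what such a keyword mapping might look like, written as a pure function so it runs without pyarrow installed. The helper name map_to_csv_keywords is hypothetical and not part of pandas. The sketch assumes a pyarrow version in which pyarrow.csv.WriteOptions accepts include_header and delimiter, and it relies on the preserve_index argument of pa.Table.from_pandas for index handling. Every other to_csv keyword is rejected rather than guessed at.

```python
def map_to_csv_keywords(sep=",", header=True, index=True, **unsupported):
    """Hypothetical helper: split a few to_csv keywords into pyarrow keyword dicts."""
    if unsupported:
        # Refuse options that this sketch cannot translate.
        names = ", ".join(sorted(unsupported))
        raise ValueError(f"not supported in this sketch: {names}")
    if not isinstance(header, bool):
        # to_csv also accepts a list of column aliases; left out of this sketch.
        raise ValueError("header must be a bool in this sketch")
    if len(sep) != 1:
        raise ValueError("sep must be a single character")
    # Keywords for pa.Table.from_pandas (the index is handled at conversion time).
    table_kwargs = {"preserve_index": index}
    # Keywords for pyarrow.csv.WriteOptions (assumed to accept both names).
    write_kwargs = {"include_header": header, "delimiter": sep}
    return table_kwargs, write_kwargs


table_kwargs, write_kwargs = map_to_csv_keywords(sep=";", index=False)
print(table_kwargs)
print(write_kwargs)
```

The two dictionaries would then be passed along the lines of pa.Table.from_pandas(df, **table_kwargs) and write_csv(table, path, write_options=WriteOptions(**write_kwargs)).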
For the example above (without any custom settings), there are some small differences: the default float format precision seems to be the same, but the quoting of strings differs. That might be something to report to Arrow so that it becomes more configurable; currently, quoting of strings is all or nothing:
$ head -3 test_pandas.csv
a,b,c,d,e,f,g,h,i,j
0.9524744196076045,0.21913063081328743,-1.3427643339799094,-2.66862972948282,0.09749875898199477,1.2870849976054641,-0.4253992607503571,0.7960922061946342,-0.5458462139415978,1.797736594226238
-0.9861117158157412,-0.14856474665751174,0.7884605447776409,0.5774211281637796,-0.024799957231053778,0.2859682446685537,1.0508204680473783,-1.1513513705558094,0.3334435129938111,-0.28739104967528223
$ head -3 test_arrow.csv
"a","b","c","d","e","f","g","h","i","j"
0.9524744196076045,0.21913063081328743,-1.3427643339799094,-2.66862972948282,0.09749875898199477,1.2870849976054641,-0.4253992607503571,0.7960922061946342,-0.5458462139415978,1.797736594226238
-0.9861117158157412,-0.14856474665751174,0.7884605447776409,0.5774211281637796,-0.024799957231053778,0.2859682446685537,1.0508204680473783,-1.1513513705558094,0.3334435129938111,-0.28739104967528223
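The quoting difference above changes the bytes on disk but should not change what a CSV reader sees. A small check with the standard-library csv module illustrates this. The two strings are shortened, illustrative stand-ins for the files shown above, not their real contents, and parse_rows is a hypothetical helper name.

```python
import csv
import io


def parse_rows(text):
    """Parse CSV text into a list of rows (each row is a list of strings)."""
    return list(csv.reader(io.StringIO(text)))


# Shortened stand-ins for the two outputs (values are illustrative only).
pandas_style = "a,b\n0.95,0.21\n-0.98,-0.14\n"
arrow_style = '"a","b"\n0.95,0.21\n-0.98,-0.14\n'

# Quoting is transparent to the reader, so both texts parse to the same rows.
same_content = parse_rows(pandas_style) == parse_rows(arrow_style)
print(same_content)
```

A check like this only covers parsed content; a byte-for-byte comparison of the two files would still fail because of the quoted header.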