
Add PyExcelerate as Excel Writer Engine #4517

Closed
rhstanton opened this issue Aug 9, 2013 · 31 comments
Labels
IO Data (IO issues that don't fit into a more specific label), IO Excel (read_excel, to_excel), Performance (memory or execution speed)

Comments

@rhstanton
Contributor

Here's an example with just 65,000 elements (it's much worse with 250,000) so I can compare xls and xlsx:

from pandas import *

df = DataFrame({'col1': [0.0] * 65000,
                'col2': 1,
                'col3': 2})

%timeit df.to_csv('sample.csv')
10 loops, best of 3: 109 ms per loop

%timeit df.to_excel('sample.xls')
1 loops, best of 3: 8.35 s per loop

%timeit df.to_excel('sample.xlsx')
1 loops, best of 3: 1min 31s per loop

@jtratner
Contributor

jtratner commented Aug 9, 2013

I agree this is a frustrating issue (and if you get a big enough file, you can actually run out of memory entirely). I've had the thought in the back of my head that we should try to use PyExcelerate, which should be faster and use much less memory.

@jreback
Contributor

jreback commented Aug 9, 2013

These use xlwt for writing. An option would be to use openpyxl for writing instead (it's already used by default for .xlsx), passing optimize_write=True to the Workbook constructor. I don't know why this isn't used (maybe xlwt exposes more functionality?).
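
For reference, a minimal sketch of the streaming write path being referred to (assuming a reasonably recent openpyxl, where the old optimize-write flag is exposed as write_only=True):

from openpyxl import Workbook

def dump_rows(path, rows):
    # Write-only workbooks stream rows to disk instead of keeping every
    # cell object in memory; rows must be appended strictly in order.
    wb = Workbook(write_only=True)
    ws = wb.create_sheet()
    for row in rows:
        ws.append(row)
    wb.save(path)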

@jreback
Contributor

jreback commented Sep 27, 2013

@jtratner should we close this as a result of #4542?

@jtratner
Contributor

Yes, xlsxwriter is likely faster (and may become even faster in the future). PyExcelerate may also be added. (Cheers to @jmcnamara for this.)

@jreback
Contributor

jreback commented Sep 28, 2013

closed by #4542

@jreback jreback closed this as completed Sep 28, 2013
@brydavis

Update: the default writer for df.to_excel(...) is still super slow for large data.

Finally gave up and used df.to_csv(...).

Worked for my purposes. Thanks!

@sancau

sancau commented Mar 24, 2017

Using PyExcelerate helps a lot when it comes to dumping lots of data.

With a (120000, 120) DataFrame of real mixed data (not just ones and zeros :)) it took 4 minutes to write an .xlsx.

Another test I did was a (189121, 27) DataFrame that took only 2min 33s (.xlsx). I also tested pandas to_excel(), and it took 5min 23s. PyExcelerate was more than 2 times faster.

I also noticed that it consumes much less memory during the process.

PyExcelerate might require some manual data preparation in some cases (NaNs, NaTs, and so on).
Personally, I faced an error with an empty value in a datetime column and had to fix it manually.
Anyway, if one needs a bulk Excel dump, PyExcelerate is the way to go. Cheers to its authors!

For trivial cases I use something like this and it works fine:

from pyexcelerate import Workbook

def df_to_excel(df, path, sheet_name='Sheet 1'):
    data = [df.columns.tolist(), ] + df.values.tolist()
    wb = Workbook()
    wb.new_sheet(sheet_name, data=data)
    wb.save(path)
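
For completeness, here is a hedged variant of the helper above that also works around the NaN/NaT issue mentioned earlier (just a sketch; swapping missing values for None is one workaround, not the only one):

import pandas as pd
from pyexcelerate import Workbook

def df_to_excel_safe(df, path, sheet_name='Sheet 1'):
    # Cast to object and replace NaN/NaT with None so PyExcelerate writes
    # empty cells instead of failing on missing datetime values.
    cleaned = df.astype(object).where(pd.notnull(df), None)
    data = [cleaned.columns.tolist()] + cleaned.values.tolist()
    wb = Workbook()
    wb.new_sheet(sheet_name, data=data)
    wb.save(path)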

@jreback
Contributor

jreback commented Mar 24, 2017

@sancau if you would like to add this as a supported engine, that would be fine (most of the work would be making sure dtypes are correct and round-trippable).

@ghost

ghost commented Jul 21, 2017

Is there any update on this?
xls does not support more than 256 columns, while saving as xlsx in pandas is really slow.
I tried using pyexcelerate, but that requires copying the entire pandas DataFrame to a multi-dimensional list in memory.
Is there any plan to support pyexcelerate natively, for example?

@gfyoung
Member

gfyoung commented Jul 21, 2017

Is there any update on this?

Unfortunately, no AFAIK 😢

Is there any plan to support pyexcelerate natively, for example?

Judging from @jreback's comment, we would certainly be happy to support it, but the PR needs to be done right for it to be incorporated. If you would like to jumpstart that effort, go for it!

@ghost

ghost commented Jul 22, 2017

I wouldn't even know where to start :(

@gfyoung
Member

gfyoung commented Jul 22, 2017

PyExcelerate might require some manual data preparation in some cases (NaNs, NaTs, and so on). Personally, I faced an error with an empty value in a datetime column and had to fix it manually.

@sancau @jreback: this makes me skeptical about the performance gains, because data preparation is very important for us to ensure round-trippability. I think this might be why we've had issues implementing this.

@gfyoung gfyoung modified the milestones: Next Major Release, 0.13 Jul 22, 2017
@gfyoung
Member

gfyoung commented Jul 22, 2017

I'll re-open for now, just so that people know we have this on our radar.

@gfyoung gfyoung reopened this Jul 22, 2017
@gfyoung
Member

gfyoung commented Jul 22, 2017

I tried using pyexcelerate, but that requires copying the entire pandas DataFrame to a multi-dimensional list in memory. Is there any plan to support pyexcelerate natively, for example?

For us to avoid the in-memory issue, you would need to be able to write with pyexcelerate in chunks. As our Excel code is largely in Python, it would otherwise be difficult to avoid that problem.

@raffam: do you know if such functionality is possible?

@ghost

ghost commented Aug 10, 2017

@gfyoung I don't know. What do you mean exactly by "writing in chunks"? The way to write an xlsx file with pyexcelerate seems to be like this:

from pyexcelerate import Workbook

data = [[1, 2, 3], [4, 5, 6], [7, 8, 9]] # data is a 2D array

wb = Workbook()
wb.new_sheet("sheet name", data=data)
wb.save("output.xlsx")

This is more or less what I had done (I have switched to CSV since then). So I, too, converted the pandas DataFrame to a multi-dimensional array and passed that to pyexcelerate. I don't have deep enough knowledge of pandas' internals to assess whether it would be possible to avoid that step.

@gfyoung
Member

gfyoung commented Aug 10, 2017

@raffam: What I mean by that is that instead of writing the entire DataFrame (and having to make a full copy of it in memory, as you mentioned above), you would take a portion of the DataFrame on each iteration and write it to your Excel file.
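
To illustrate, a rough sketch of what chunk-at-a-time writing could look like with pyexcelerate's range API (an assumption-laden example: pyexcelerate still holds the whole workbook in memory until save(), so this only avoids the extra full values.tolist() copy):

import pandas as pd
from pyexcelerate import Workbook

def to_excel_in_chunks(df, path, sheet_name='Sheet 1', chunk_size=10000):
    wb = Workbook()
    ws = wb.new_sheet(sheet_name)
    n_cols = df.shape[1]
    for start in range(0, len(df), chunk_size):
        chunk = df.iloc[start:start + chunk_size]
        # pyexcelerate ranges are 1-indexed and inclusive
        ws.range((start + 1, 1), (start + len(chunk), n_cols)).value = chunk.values.tolist()
    wb.save(path)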

@jmcnamara
Contributor

It is probably worth pointing out that, according to the benchmark in the pyexcelerate docs, it is only 2x faster than XlsxWriter. Also, openpyxl with lxml is probably as fast as XlsxWriter now.

So although pyexcelerate may be faster, it isn't going to get the end user anywhere close to CSV speed, due to the verbose nature of the xlsx format and the fact that it needs to be zipped. Even the C version of XlsxWriter (libxlsxwriter) is only about 10x faster than the Python version.

So it is questionable if any of this is worth the effort.

If the pandas xlsx writer were restructured to write the data row by row, then I might be able to optimise XlsxWriter up to pyexcelerate's speed. But again, I'm not sure it is worth the effort.

Final note: I've seen a 10x speed increase running long XlsxWriter programs under PyPy.
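
As a rough illustration of the row-by-row pattern mentioned above, here is what a direct XlsxWriter write in its constant_memory mode might look like (a sketch assuming plain, unformatted data; this is not how the pandas engine is structured today):

import xlsxwriter

def to_excel_row_by_row(df, path):
    # constant_memory mode streams each row to a temp file as it is
    # written, so rows must be emitted strictly in order.
    wb = xlsxwriter.Workbook(path, {'constant_memory': True})
    ws = wb.add_worksheet()
    ws.write_row(0, 0, [str(c) for c in df.columns])
    for i, row in enumerate(df.itertuples(index=False, name=None), start=1):
        ws.write_row(i, 0, row)
    wb.close()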

@ghost

ghost commented Aug 10, 2017

I have put together a quick test case.
I built two functions that save a pandas DataFrame to Excel using pyexcelerate: the first iterates over rows, the second over columns.
The test is run against two matrices: one with lots of rows, and one with lots of columns.
It's highly likely that my code can be optimized further.
Anyway, the results of the second implementation are consistent with what is claimed in the pyexcelerate docs: approximately 2 times faster in every case.

The code was run under the Intel distribution of Python 3.5.2; similar results are obtained under "normal" CPython 3.6.2.

FRAME WITH LOTS OF ROWS
to_excel took 2.97 secs
pyexcelerate1 took 1.94 secs
34.68% faster than base time
pyexcelerate2 took 1.73 secs
41.75% faster than base time
FRAME WITH LOTS OF COLUMNS
to_excel took 2.13 secs
pyexcelerate1 took 1.15 secs
46.01% faster than base time
pyexcelerate2 took 1.06 secs
50.23% faster than base time

Here is the code:

from pyexcelerate import Workbook, Range
import pandas as pd
import timeit
import math

def toPandasExcel(file, frame):
    start = timeit.default_timer()
    frame.to_excel(file, index=False, header=False)
    end = timeit.default_timer()
    delta = round(end-start,2)
    print("to_excel took "+str(delta)+" secs")
    return delta

# Iterate over rows
def toExcelerate1(file, frame):
    start = timeit.default_timer()
    wb = Workbook()
    ws = wb.new_sheet("test")    
    col_num = frame.shape[1]
    row_num = 1
    for row_tup in frame.itertuples(name=None, index=False):
        ws.range((row_num,1), (row_num,col_num)).value = [[*row_tup]]
        row_num += 1        
    wb.save(file)
    end = timeit.default_timer()
    delta = round(end-start,2)
    print("pyexcelerate1 took "+str(delta)+" secs")
    return delta

# Iterate over columns
def toExcelerate2(file, frame):
    start_time = timeit.default_timer()
    wb = Workbook()
    ws = wb.new_sheet("test")    
    row_num = frame.shape[0]    
    col_num = 1
    for col_name, col_series in frame.iteritems():
        ws.range((1,col_num), (row_num,col_num)).value = [[x] for x in col_series]
        col_num += 1        
    wb.save(file)
    end_time = timeit.default_timer()
    delta = round(end_time-start_time,2)
    print("pyexcelerate2 took "+str(delta)+" secs")
    return delta
    
def perf(base, eps):
    perc = ((base-eps)/base)*100
    perc = round(perc, 2)
    print(str(perc)+"% faster than base time")
    
print("FRAME WITH LOTS OF ROWS")
size = 50000
dic = {'a':[50]*size, 'b':[1]*size, 'c':['hi']*size}
#print(dic)
frame_rows = pd.DataFrame(dic)
del dic
base = toPandasExcel("rows_pandas.xlsx", frame_rows)
e1 = toExcelerate1("rows_pyexcelerate1.xlsx", frame_rows)
perf(base,e1)
e2 = toExcelerate2("rows_pyexcelerate2.xlsx", frame_rows)
perf(base,e2)

print("FRAME WITH LOTS OF COLUMNS")
dic = dict()
for x in range(10000):
    dic.update({"A"+str(x):['c']*5})
frame_cols = pd.DataFrame(dic)
del dic
base = toPandasExcel("cols_pandas.xlsx", frame_cols)
e1 = toExcelerate1("cols_pyexcelerate1.xlsx", frame_cols)
perf(base,e1)
e2 = toExcelerate2("cols_pyexcelerate2.xlsx", frame_cols)
perf(base,e2)

EDIT: it was not a fair comparison, since to_excel wrote the index and headers. Now it is a fair comparison.
EDIT2: performance improvements

@ghost

ghost commented Aug 13, 2017

@gfyoung would one of the two implementations be OK?
I don't think it is possible to actually "write" the Excel file in chunks since, as has already been said, it needs to be compressed.
It may be possible to "build" the precursor data structure in chunks, so you effectively have two copies of the DataFrame in memory (one as a pandas DataFrame, one as a pyexcelerate Workbook) instead of three (pandas, pyexcelerate, and the temporary two-dimensional array you need if you build the pyexcelerate Workbook in one step).

@gfyoung
Member

gfyoung commented Aug 13, 2017

@raffam: Hmm... your timing results are encouraging, but not convincing enough. Can you try with even larger sizes (think millions 😉)?

@ghost

ghost commented Aug 13, 2017

The limits are 1,048,576 rows by 16,384 columns

@gfyoung
Member

gfyoung commented Aug 13, 2017

@raffam: Right, so my point is that you can test with WAY MORE than 50,000 rows.

@ghost

ghost commented Aug 16, 2017

I understand, but unfortunately my PC stalls with very large matrix sizes :(
Here is the code I have written, in case someone wants to test it:

from pyexcelerate import Workbook, Range
import pandas as pd
import timeit
import math
import numpy as np

def toPandasExcel(file, frame):
    start = timeit.default_timer()
    frame.to_excel(file, index=False, header=False)
    end = timeit.default_timer()
    delta = round(end-start,2)
    print("to_excel took "+str(delta)+" secs")
    return delta

# Iterate over rows
def toExcelerate1(file, frame):
    start = timeit.default_timer()
    wb = Workbook()
    ws = wb.new_sheet("test")    
    col_num = frame.shape[1]
    row_num = 1
    for row_tup in frame.itertuples(name=None, index=False):
        ws.range((row_num,1), (row_num,col_num)).value = [[*row_tup]]
        row_num += 1        
    wb.save(file)
    end = timeit.default_timer()
    delta = round(end-start,2)
    print("pyexcelerate1 took "+str(delta)+" secs")
    return delta

# Iterate over columns
def toExcelerate2(file, frame):
    start_time = timeit.default_timer()
    wb = Workbook()
    ws = wb.new_sheet("test")    
    row_num = frame.shape[0]    
    col_num = 1
    for col_name, col_series in frame.iteritems():
        ws.range((1,col_num), (row_num,col_num)).value = list(map(lambda a:[a],col_series))
        col_num += 1        
    wb.save(file)
    end_time = timeit.default_timer()
    delta = round(end_time-start_time,2)
    print("pyexcelerate2 took "+str(delta)+" secs")
    return delta
    
# Iterate over columns
def toExcelerate3(file, frame):
    start_time = timeit.default_timer()
    wb = Workbook()
    ws = wb.new_sheet("test")    
    row_num = frame.shape[0]    
    col_num = frame.shape[1]
    ws.range((1,1), (row_num,col_num)).value = frame.values.tolist()
    wb.save(file)
    end_time = timeit.default_timer()
    delta = round(end_time-start_time,2)
    print("pyexcelerate3 took "+str(delta)+" secs")
    return delta
    
def perf(base, eps):
    perc = (base-eps)/base
    print("{0:.2%} over base time".format(perc))
    
print("BUILDING DATAFRAME")
frame = pd.DataFrame()
rows = int(math.pow(10,5))
cols = 5000
dic = dict()
for x in range(cols):
    frame[str(x)] = pd.Series(np.random.randn(rows))

print("to_excel()")
base = toPandasExcel("pandas.xlsx", frame)
print("pyexcelerate1")
e1 = toExcelerate1("pyexcelerate1.xlsx", frame)
perf(base,e1)
print("pyexcelerate2")
e2 = toExcelerate2("pyexcelerate2.xlsx", frame)
perf(base,e2)
print("pyexcelerate3")
e3 = toExcelerate3("pyexcelerate3.xlsx", frame)
perf(base,e3)

@gfyoung
Member

gfyoung commented Aug 16, 2017

@raffam: Thanks for this! This will be very helpful.

@cryptotvync

I used the toExcelerate3 function and the default xlsx writer to write a 333243 × 34 DataFrame to file. Results:
to_excel (xlsx) default: 1360 seconds
toExcelerate3: 773 seconds

@mapa17

mapa17 commented Sep 4, 2018

For whoever is interested, I created a simple helper function to write DataFrames to Excel (including headers and index) using pyexcelerate: https://gist.github.com/mapa17/bc04be36e447cab0746a0ec8903cc49f

I thought about adding an Excel writer engine to pandas/pandas/io/excel.py, but I am a bit worried after looking through the other already implemented engines. They support all kinds of fancy cell formatting.

Do you think it would be sufficient to provide a minimalistic Excel writer engine using pyexcelerate, writing only unformatted Excel files?

@gfyoung
Member

gfyoung commented Sep 4, 2018

Hmm... from a maintenance perspective, I think I would rather maintain fewer engines than more, because of the compatibility issues down the road (e.g. maintaining consistency between the Python and C parsers for CSV has been quite difficult).

I could potentially see this as a "fast track," but it sounds like a bit of a corner case. Thus, I'm -0.5 on this overall, because I'm leaning towards it being more maintenance burden than value for the broader audience.

cc @jreback

@Masterxilo

This thing is indeed too slow; I'm not surprised to find others thinking the same. Five minutes to produce a 50 MB Excel file is too much.

@WillAyd WillAyd changed the title df.to_excel() is *really* slow with large datasets Add PyExcelerate as Excel Writer Engine Jan 21, 2019
@WillAyd
Member

WillAyd commented Feb 13, 2019

Based on the conversation above, I think we'll close this for now, as it's unclear whether there's really anything to be gained here. If anyone disagrees, feel free to reopen.

@WillAyd WillAyd closed this as completed Feb 13, 2019
@CrimsonVex

CrimsonVex commented Jan 18, 2023

My console logs make me sad about to_excel performance, especially given that it has been almost 10 years since this issue was opened; not much has improved.

df.to_csv(_filename, index = False)
CSV created with 36 columns, 255,489 rows  (took 3.66 seconds)

df.to_excel(_filename, index = False, engine = 'xlsxwriter', freeze_panes = (1, 0))
XLSX created with 36 columns, 255,489 rows  (took 99.35 seconds)

@olivekt

olivekt commented Apr 30, 2024

Looked at this a bit, as I was considering trying to implement this. It seems like (at least in 2024) the speed difference is primarily within Pandas itself.
When testing pure xlsxwriter against pyexcelerate using the benchmark code shared above by the (now-deleted) ghost account, the times are close with a small dataset (1000 rows, 100 columns) [output below].

to_excel()
to_excel took 0.68 secs
xlsxwriter
xlsxwriter took 0.37 secs
45.59% over base time
pyexcelerate1
pyexcelerate1 took 0.34 secs
50.00% over base time
pyexcelerate3
pyexcelerate3 took 0.34 secs
50.00% over base time

However, when increasing even to just 10000 rows, pure xlsxwriter is the fastest of all implementations [output below].

to_excel()
to_excel took 6.95 secs
xlsxwriter
xlsxwriter took 4.24 secs
38.99% over base time
pyexcelerate1
pyexcelerate1 took 7.39 secs
-6.33% over base time
pyexcelerate3
pyexcelerate3 took 7.51 secs
-8.06% over base time

Looking at a cProfile run, it seems the biggest time sink in the process is this call:
stylekey = json.dumps(cell.style)

I'd have to look more into exactly why the styles are pulled in this way, but it seems worth investigating. You still wouldn't see to_csv()-level performance, but a 20% speed increase seems possible.

Adding a flag for whether the call is coming from DataFrame().to_excel() vs. Styler().to_excel() also seems like it would be enough to get this speed boost for all values-only usages, which I would expect to be the vast majority.
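
For anyone who wants to reproduce that profile, a minimal cProfile run along these lines (sizes and file names are arbitrary) shows where the time goes:

import cProfile
import pstats
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randn(10000, 100))
# Profile the write and dump the stats to a file for inspection.
cProfile.run("df.to_excel('profiled.xlsx', engine='xlsxwriter')", 'to_excel.prof')
pstats.Stats('to_excel.prof').sort_stats('cumulative').print_stats(20)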
