
Add PyExcelerate as Excel Writer Engine #4517

Closed
rhstanton opened this issue Aug 9, 2013 · 31 comments
Labels
IO Data (IO issues that don't fit into a more specific label), IO Excel (read_excel, to_excel), Performance (memory or execution speed)

Comments

@rhstanton
Contributor

Here's an example with just 65,000 elements (it's much worse with 250,000) so I can compare xls and xlsx:

from pandas import *

df = DataFrame({'col1': [0.0] * 65000,
                'col2': 1,
                'col3': 2})

%timeit df.to_csv('sample.csv')
10 loops, best of 3: 109 ms per loop

%timeit df.to_excel('sample.xls')
1 loops, best of 3: 8.35 s per loop

%timeit df.to_excel('sample.xlsx')
1 loops, best of 3: 1min 31s per loop

@jtratner
Contributor

jtratner commented Aug 9, 2013

I agree this is a frustrating issue (and if you get a big enough file, you can actually run out of memory entirely). I've had the thought in the back of my head that we should try to use PyExcelerate, which should be faster and use much less memory.

@jreback
Contributor

jreback commented Aug 9, 2013

These use xlwt for writing. An option would be to use openpyxl for writing instead (it's already used by default for .xlsx), passing optimize_write=True to the Workbook constructor. I don't know why this isn't used (maybe xlwt exposes more functionality?).
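
For reference, a minimal sketch of the streaming write path being referred to (assuming a reasonably recent openpyxl, where the old optimize-write flag is exposed as write_only=True):

from openpyxl import Workbook

def dump_rows(path, rows):
    # Write-only workbooks stream rows to disk instead of keeping every
    # cell object in memory; rows must be appended strictly in order.
    wb = Workbook(write_only=True)
    ws = wb.create_sheet()
    for row in rows:
        ws.append(row)
    wb.save(path)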

@jreback
Contributor

jreback commented Sep 27, 2013

@jtratner should we close this as a result of #4542?

@jtratner
Contributor

Yes, xlsxwriter is likely faster (and may become even faster in the future). PyExcelerate may also be added. (Cheers to @jmcnamara for this.)

@jreback
Contributor

jreback commented Sep 28, 2013

closed by #4542

@jreback jreback closed this as completed Sep 28, 2013
@brydavis

Update: the default writer for df.to_excel(...) is still super slow for large data.

Finally gave up and used df.to_csv(...).

Worked for my purposes. Thanks!

@sancau

sancau commented Mar 24, 2017

Using PyExcelerate helps a lot when it comes to dumping lots of data.

With a (120000, 120) DataFrame of real mixed data (not just ones and zeros :)) it took 4 minutes to write an .xlsx.

Another test I did was a (189121, 27) DataFrame that took only 2min 33s (.xlsx). I also tested pandas to_excel(), and it took 5min 23s. PyExcelerate was more than 2 times faster.

I also noticed that it consumes much less memory during the process.

PyExcelerate might require some manual data preparation in some cases (NaNs, NaTs, and so on).
Personally, I faced an error with an empty value in a datetime column and had to fix it manually.
Anyway, if one needs a bulk Excel dump, PyExcelerate is the way to go. Cheers to its authors!

For trivial cases I use something like this and it works fine:

from pyexcelerate import Workbook

def df_to_excel(df, path, sheet_name='Sheet 1'):
    data = [df.columns.tolist(), ] + df.values.tolist()
    wb = Workbook()
    wb.new_sheet(sheet_name, data=data)
    wb.save(path)
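
For completeness, here is a hedged variant of the helper above that also works around the NaN/NaT issue mentioned earlier (just a sketch; swapping missing values for None is one workaround, not the only one):

import pandas as pd
from pyexcelerate import Workbook

def df_to_excel_safe(df, path, sheet_name='Sheet 1'):
    # Cast to object and replace NaN/NaT with None so PyExcelerate writes
    # empty cells instead of failing on missing datetime values.
    cleaned = df.astype(object).where(pd.notnull(df), None)
    data = [cleaned.columns.tolist()] + cleaned.values.tolist()
    wb = Workbook()
    wb.new_sheet(sheet_name, data=data)
    wb.save(path)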

@jreback
Contributor

jreback commented Mar 24, 2017

@sancau if you would like to add this as a supported engine, that would be fine (most of the work would be making sure dtypes are correct and round-trippable).

@ghost

ghost commented Jul 21, 2017

Is there any update on this?
xls does not support more than 256 columns, while saving as xlsx in pandas is really slow.
I tried using pyexcelerate, but that requires copying the entire pandas DataFrame to a multi-dimensional list in memory.
Is there any plan to support pyexcelerate natively, for example?

@gfyoung
Member

gfyoung commented Jul 21, 2017

Is there any update on this?

Unfortunately, no AFAIK 😢

Is there any plan to support pyexcelerate natively, for example?

Judging from @jreback's comment, we would certainly be happy to support it, but the PR needs to be done right for it to be incorporated. If you would like to jumpstart that effort, go for it!

@ghost

ghost commented Jul 22, 2017

I wouldn't even know where to start :(

@gfyoung
Member

gfyoung commented Jul 22, 2017

PyExcelerate might require some manual data preparation in some cases (NaNs, NaTs, and so on). Personally, I faced an error with an empty value in a datetime column and had to fix it manually.

@sancau @jreback: this makes me skeptical about the performance gains, because data preparation is very important for us to ensure round-trippability. I think this might be why we've had issues implementing this.

@gfyoung gfyoung modified the milestones: Next Major Release, 0.13 Jul 22, 2017
@gfyoung
Member

gfyoung commented Jul 22, 2017

I'll re-open for now, just so that people know we have this on our radar.

@gfyoung gfyoung reopened this Jul 22, 2017
@gfyoung
Member

gfyoung commented Jul 22, 2017

I tried using pyexcelerate, but that requires copying the entire pandas DataFrame to a multi-dimensional list in memory. Is there any plan to support pyexcelerate natively, for example?

For us to avoid the in-memory issue, you would need to be able to write with pyexcelerate in chunks. As our Excel code is largely in Python, it would otherwise be difficult to avoid that problem.

@raffam: do you know if such functionality is possible?

@ghost

ghost commented Aug 10, 2017

@gfyoung I don't know. What do you mean exactly by "writing in chunks"? The way to write an xlsx file with pyexcelerate seems to be like this:

from pyexcelerate import Workbook

data = [[1, 2, 3], [4, 5, 6], [7, 8, 9]] # data is a 2D array

wb = Workbook()
wb.new_sheet("sheet name", data=data)
wb.save("output.xlsx")

This is more or less what I had done (I have switched to CSV since then). So I, too, converted the pandas DataFrame to a multi-dimensional array and passed that to pyexcelerate. I don't have deep enough knowledge of pandas' internals to assess whether it would be possible to avoid that step.

@gfyoung
Member

gfyoung commented Aug 10, 2017

@raffam: What I mean by that is that instead of writing the entire DataFrame (and having to make a full copy of it in memory, as you mentioned above), you would take a portion of the DataFrame on each iteration and write it to your Excel file.
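
To illustrate, a rough sketch of what chunk-at-a-time writing could look like with pyexcelerate's range API (an assumption-laden example: pyexcelerate still holds the whole workbook in memory until save(), so this only avoids the extra full values.tolist() copy):

import pandas as pd
from pyexcelerate import Workbook

def to_excel_in_chunks(df, path, sheet_name='Sheet 1', chunk_size=10000):
    wb = Workbook()
    ws = wb.new_sheet(sheet_name)
    n_cols = df.shape[1]
    for start in range(0, len(df), chunk_size):
        chunk = df.iloc[start:start + chunk_size]
        # pyexcelerate ranges are 1-indexed and inclusive
        ws.range((start + 1, 1), (start + len(chunk), n_cols)).value = chunk.values.tolist()
    wb.save(path)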

@jmcnamara
Contributor

It is probably worth pointing out that, according to the benchmark in the pyexcelerate docs, it is only 2x faster than XlsxWriter. Also, openpyxl with lxml is probably as fast as XlsxWriter now.

So although pyexcelerate may be faster, it isn't going to get the end user anywhere close to CSV speed, due to the verbose nature of the xlsx format and the fact that it needs to be zipped. Even the C version of XlsxWriter (libxlsxwriter) is only about 10x faster than the Python version.

So it is questionable if any of this is worth the effort.

If the pandas xlsx writer were restructured to write the data row by row, then I might be able to optimise XlsxWriter up to pyexcelerate's speed. But again, I'm not sure it is worth the effort.

Final note: I've seen a 10x speed increase running long XlsxWriter programs under PyPy.
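
As a rough illustration of the row-by-row pattern mentioned above, here is what a direct XlsxWriter write in its constant_memory mode might look like (a sketch assuming plain, unformatted data; this is not how the pandas engine is structured today):

import xlsxwriter

def to_excel_row_by_row(df, path):
    # constant_memory mode streams each row to a temp file as it is
    # written, so rows must be emitted strictly in order.
    wb = xlsxwriter.Workbook(path, {'constant_memory': True})
    ws = wb.add_worksheet()
    ws.write_row(0, 0, [str(c) for c in df.columns])
    for i, row in enumerate(df.itertuples(index=False, name=None), start=1):
        ws.write_row(i, 0, row)
    wb.close()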

@ghost

ghost commented Aug 10, 2017

I have put together a quick test case.
I built two functions that save a pandas DataFrame to Excel using pyexcelerate: the first iterates over rows, the second over columns.
The test is run against two matrices: one with lots of rows, and one with lots of columns.
It's highly likely that my code can be optimized further.
Anyway, the results of the second implementation are consistent with what is claimed in the pyexcelerate docs: approximately 2 times faster in every case.

The code was run under the Intel distribution of Python 3.5.2; similar results are obtained under "normal" CPython 3.6.2.

FRAME WITH LOTS OF ROWS
to_excel took 2.97 secs
pyexcelerate1 took 1.94 secs
34.68% faster than base time
pyexcelerate2 took 1.73 secs
41.75% faster than base time
FRAME WITH LOTS OF COLUMNS
to_excel took 2.13 secs
pyexcelerate1 took 1.15 secs
46.01% faster than base time
pyexcelerate2 took 1.06 secs
50.23% faster than base time

Here is the code:

from pyexcelerate import Workbook, Range
import pandas as pd
import timeit
import math

def toPandasExcel(file, frame):
    start = timeit.default_timer()
    frame.to_excel(file, index=False, header=False)
    end = timeit.default_timer()
    delta = round(end-start,2)
    print("to_excel took "+str(delta)+" secs")
    return delta

# Iterate over rows
def toExcelerate1(file, frame):
    start = timeit.default_timer()
    wb = Workbook()
    ws = wb.new_sheet("test")    
    col_num = frame.shape[1]
    row_num = 1
    for row_tup in frame.itertuples(name=None, index=False):
        ws.range((row_num,1), (row_num,col_num)).value = [[*row_tup]]
        row_num += 1        
    wb.save(file)
    end = timeit.default_timer()
    delta = round(end-start,2)
    print("pyexcelerate1 took "+str(delta)+" secs")
    return delta

# Iterate over columns
def toExcelerate2(file, frame):
    start_time = timeit.default_timer()
    wb = Workbook()
    ws = wb.new_sheet("test")    
    row_num = frame.shape[0]    
    col_num = 1
    for col_name, col_series in frame.iteritems():
        ws.range((1,col_num), (row_num,col_num)).value = [[x] for x in col_series]
        col_num += 1        
    wb.save(file)
    end_time = timeit.default_timer()
    delta = round(end_time-start_time,2)
    print("pyexcelerate2 took "+str(delta)+" secs")
    return delta
    
def perf(base, eps):
    perc = ((base-eps)/base)*100
    perc = round(perc, 2)
    print(str(perc)+"% faster than base time")
    
print("FRAME WITH LOTS OF ROWS")
size = 50000
dic = {'a':[50]*size, 'b':[1]*size, 'c':['hi']*size}
#print(dic)
frame_rows = pd.DataFrame(dic)
del dic
base = toPandasExcel("rows_pandas.xlsx", frame_rows)
e1 = toExcelerate1("rows_pyexcelerate1.xlsx", frame_rows)
perf(base,e1)
e2 = toExcelerate2("rows_pyexcelerate2.xlsx", frame_rows)
perf(base,e2)

print("FRAME WITH LOTS OF COLUMNS")
dic = dict()
for x in range(10000):
    dic.update({"A"+str(x):['c']*5})
frame_cols = pd.DataFrame(dic)
del dic
base = toPandasExcel("cols_pandas.xlsx", frame_cols)
e1 = toExcelerate1("cols_pyexcelerate1.xlsx", frame_cols)
perf(base,e1)
e2 = toExcelerate2("cols_pyexcelerate2.xlsx", frame_cols)
perf(base,e2)

EDIT: it was not a fair comparison, since to_excel wrote the index and headers. Now it is a fair comparison.
EDIT2: performance improvements

@ghost

ghost commented Aug 13, 2017

@gfyoung would one of the two implementations be OK?
I don't think it is possible to actually "write" the Excel file in chunks since, as has already been said, it needs to be compressed.
It may be possible to "build" the precursor data structure in chunks, so you effectively have two copies of the DataFrame in memory (one as a pandas DataFrame, one as a pyexcelerate Workbook) instead of three (pandas, pyexcelerate, and the temporary two-dimensional array you need if you build the pyexcelerate Workbook in one step).

@gfyoung
Member

gfyoung commented Aug 13, 2017

@raffam: Hmm... your timing results are encouraging, but not convincing enough. Can you try with even larger sizes (think millions 😉)?

@ghost

ghost commented Aug 13, 2017

The limits are 1,048,576 rows by 16,384 columns

@gfyoung
Member

gfyoung commented Aug 13, 2017

@raffam: Right, so my point is that you can test with WAY MORE than 50,000 rows.

@ghost

ghost commented Aug 16, 2017

I understand, but unfortunately my PC stalls with very large matrix sizes :(
Here is the code I have written, in case someone wants to test it:

from pyexcelerate import Workbook, Range
import pandas as pd
import timeit
import math
import numpy as np

def toPandasExcel(file, frame):
    start = timeit.default_timer()
    frame.to_excel(file, index=False, header=False)
    end = timeit.default_timer()
    delta = round(end-start,2)
    print("to_excel took "+str(delta)+" secs")
    return delta

# Iterate over rows
def toExcelerate1(file, frame):
    start = timeit.default_timer()
    wb = Workbook()
    ws = wb.new_sheet("test")    
    col_num = frame.shape[1]
    row_num = 1
    for row_tup in frame.itertuples(name=None, index=False):
        ws.range((row_num,1), (row_num,col_num)).value = [[*row_tup]]
        row_num += 1        
    wb.save(file)
    end = timeit.default_timer()
    delta = round(end-start,2)
    print("pyexcelerate1 took "+str(delta)+" secs")
    return delta

# Iterate over columns
def toExcelerate2(file, frame):
    start_time = timeit.default_timer()
    wb = Workbook()
    ws = wb.new_sheet("test")    
    row_num = frame.shape[0]    
    col_num = 1
    for col_name, col_series in frame.iteritems():
        ws.range((1,col_num), (row_num,col_num)).value = list(map(lambda a:[a],col_series))
        col_num += 1        
    wb.save(file)
    end_time = timeit.default_timer()
    delta = round(end_time-start_time,2)
    print("pyexcelerate2 took "+str(delta)+" secs")
    return delta
    
# Iterate over columns
def toExcelerate3(file, frame):
    start_time = timeit.default_timer()
    wb = Workbook()
    ws = wb.new_sheet("test")    
    row_num = frame.shape[0]    
    col_num = frame.shape[1]
    ws.range((1,1), (row_num,col_num)).value = frame.values.tolist()
    wb.save(file)
    end_time = timeit.default_timer()
    delta = round(end_time-start_time,2)
    print("pyexcelerate3 took "+str(delta)+" secs")
    return delta
    
def perf(base, eps):
    perc = (base-eps)/base
    print("{0:.2%} over base time".format(perc))
    
print("BUILDING DATAFRAME")
frame = pd.DataFrame()
rows = int(math.pow(10,5))
cols = 5000
dic = dict()
for x in range(cols):
    frame[str(x)] = pd.Series(np.random.randn(rows))

print("to_excel()")
base = toPandasExcel("pandas.xlsx", frame)
print("pyexcelerate1")
e1 = toExcelerate1("pyexcelerate1.xlsx", frame)
perf(base,e1)
print("pyexcelerate2")
e2 = toExcelerate2("pyexcelerate2.xlsx", frame)
perf(base,e2)
print("pyexcelerate3")
e3 = toExcelerate3("pyexcelerate3.xlsx", frame)
perf(base,e3)

@gfyoung
Member

gfyoung commented Aug 16, 2017

@raffam: Thanks for this! This will be very helpful.

@cryptotvync

I used the toExcelerate3 function and the default xlsx writer to write a 333243 × 34 DataFrame to file. Results:
to_excel (xlsx) default: 1360 seconds
toExcelerate3: 773 seconds

@mapa17

mapa17 commented Sep 4, 2018

For whoever is interested, I created a simple helper function to write DataFrames to Excel (including headers and index) using pyexcelerate: https://gist.github.com/mapa17/bc04be36e447cab0746a0ec8903cc49f

I thought about adding an Excel writer engine to pandas/pandas/io/excel.py, but I am a bit worried after looking through the other already implemented engines. They support all kinds of fancy cell formatting.

Do you think it would be sufficient to provide a minimalistic Excel writer engine using pyexcelerate, writing only unformatted Excel files?

@gfyoung
Member

gfyoung commented Sep 4, 2018

Hmm... from a maintenance perspective, I think I would rather maintain fewer engines than more, because of the compatibility issues down the road (e.g. maintaining consistency between the Python and C parsers for CSV has been quite difficult).

I could potentially see this as a "fast track," but it sounds like a bit of a corner case. Thus, I'm -0.5 on this overall, because I'm leaning towards it being more maintenance burden than value for the broader audience.

cc @jreback

@Masterxilo

This thing is indeed too slow; I'm not surprised to find others thinking the same. Five minutes to produce a 50 MB Excel file is too much.

@WillAyd WillAyd changed the title df.to_excel() is *really* slow with large datasets Add PyExcelerate as Excel Writer Engine Jan 21, 2019
@WillAyd
Member

WillAyd commented Feb 13, 2019

Based on the conversation above, I think we'll close this for now, as it's unclear whether there's really anything to be gained here. If anyone disagrees, feel free to reopen.

@WillAyd WillAyd closed this as completed Feb 13, 2019
@CrimsonVex

CrimsonVex commented Jan 18, 2023

My console logs make me sad about to_excel performance, especially given that it has been almost 10 years since this issue was opened; not much has improved.

df.to_csv(_filename, index = False)
CSV created with 36 columns, 255,489 rows  (took 3.66 seconds)

df.to_excel(_filename, index = False, engine = 'xlsxwriter', freeze_panes = (1, 0))
XLSX created with 36 columns, 255,489 rows  (took 99.35 seconds)

@olivekt

olivekt commented Apr 30, 2024

Looked at this a bit, as I was considering trying to implement this. It seems like (at least in 2024) the speed difference is primarily within Pandas itself.
When testing pure xlsxwriter against pyexcelerate using the benchmark code shared above by the (now-deleted) ghost account, the times are close with a small dataset (1000 rows, 100 columns) [output below].

to_excel()
to_excel took 0.68 secs
xlsxwriter
xlsxwriter took 0.37 secs
45.59% over base time
pyexcelerate1
pyexcelerate1 took 0.34 secs
50.00% over base time
pyexcelerate3
pyexcelerate3 took 0.34 secs
50.00% over base time

However, when increasing even to just 10000 rows, pure xlsxwriter is the fastest of all implementations [output below].

to_excel()
to_excel took 6.95 secs
xlsxwriter
xlsxwriter took 4.24 secs
38.99% over base time
pyexcelerate1
pyexcelerate1 took 7.39 secs
-6.33% over base time
pyexcelerate3
pyexcelerate3 took 7.51 secs
-8.06% over base time

Looking at a cProfile run, it seems the biggest time sink in the process is this call:
stylekey = json.dumps(cell.style)

I'd have to look more into exactly why the styles are pulled in this way, but it seems worth investigating. You still wouldn't see to_csv()-level performance, but a 20% speed increase seems possible.

Adding a flag for whether the call is coming from DataFrame().to_excel() vs. Styler().to_excel() also seems like it would be enough to get this speed boost for all values-only usages, which I would expect to be the vast majority.
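
For anyone who wants to reproduce that profile, a minimal cProfile run along these lines (sizes and file names are arbitrary) shows where the time goes:

import cProfile
import pstats
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randn(10000, 100))
# Profile the write and dump the stats to a file for inspection.
cProfile.run("df.to_excel('profiled.xlsx', engine='xlsxwriter')", 'to_excel.prof')
pstats.Stats('to_excel.prof').sort_stats('cumulative').print_stats(20)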
