Add PyExcelerate as Excel Writer Engine #4517
I agree this is a frustrating issue (and if you get a big enough file, you can actually run out of memory entirely). I've had the thought in the back of my head that we should try to use PyExcelerate, which should be faster and use much less memory.
Yes, xlsxwriter is likely faster (and may become even faster in the future).
Closed by #4542
Update, default writer for Finally gave up and used Worked for my purposes. Thanks!
Using PyExcelerate helps a lot when it comes to dumping lots of data. With a (120000, 120) DataFrame of real mixed data (not just ones and zeros) it took 4 minutes to write an .xlsx. Another test was a (189121, 27) DataFrame that took only 2min 33s (.xlsx), while pandas' to_excel() took 5min 23s, so PyExcelerate was more than twice as fast. I also noticed that it consumes much less memory during the process. PyExcelerate might require some manual data preparation in some cases (NaNs, NaTs, and so on). For trivial cases I use something like this and it works fine:

```python
from pyexcelerate import Workbook

def df_to_excel(df, path, sheet_name='Sheet 1'):
    # Prepend the header row to the cell values.
    data = [df.columns.tolist()] + df.values.tolist()
    wb = Workbook()
    wb.new_sheet(sheet_name, data=data)
    wb.save(path)
```
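The caveat above about NaNs and NaTs can be handled with a small sanitizing pass over the list-of-lists payload before it reaches `Workbook.new_sheet()`. This is a minimal sketch; `clean_cell` and `clean_rows` are hypothetical helpers (not part of pyexcelerate or pandas), and they only cover `None` and float NaN:

```python
import math

def clean_cell(value):
    """Replace values an Excel writer may choke on (None, float NaN)
    with an empty string; pass everything else through unchanged."""
    if value is None:
        return ""
    if isinstance(value, float) and math.isnan(value):
        return ""
    return value

def clean_rows(rows):
    """Sanitize a list-of-lists payload before handing it to a writer."""
    return [[clean_cell(v) for v in row] for row in rows]
```

For example, `clean_rows([[1.0, float("nan")], [None, "x"]])` yields `[[1.0, ""], ["", "x"]]`.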
@sancau if you would like to add this to the supported engines, that would be fine (most of the work would be making sure dtypes are correct and round-trippable).
Is there any update on this?
Unfortunately, no AFAIK 😢
Judging from the comment from @jreback, we would certainly be happy to support it, but the PR needs to be done right for it to be incorporated. If you would like to jumpstart that effort, go for it!
I wouldn't even know where to start :(
@sancau @jreback : this makes me skeptical about performance, because data preparation is very important for us to ensure round-trippability. I think this might be why we've had issues implementing this.
I'll re-open for now, just so that people know we have this on our radar.
For us to avoid the in-memory issue, you would need to be able to write in chunks. @raffam : do you know if such functionality is possible?
@gfyoung I don't know. What do you mean exactly by "writing in chunks"? The way to write an xlsx file with pyexcelerate seems to be like this
This is more or less what I had done (I have switched to CSV since then). So I too converted the pandas DataFrame to a multi-dimensional array and passed that to pyexcelerate. I don't have deep enough knowledge of pandas internals to assess whether it would be possible to avoid that step.
@raffam : What I mean by that is that instead of writing the entire DataFrame in one go, you would write it out in smaller batches of rows.
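As a sketch of what writing in chunks could look like: the hypothetical `iter_chunks` helper below slices the row payload so the whole table never has to sit in one list at once. The `sheet.write_row` call in the comment is an assumed row-oriented writer API, not something the thread confirms pyexcelerate exposes:

```python
def iter_chunks(rows, chunk_size=10_000):
    """Yield successive slices of rows so the full table does not need
    to be materialized as one giant list before writing."""
    for start in range(0, len(rows), chunk_size):
        yield rows[start:start + chunk_size]

# Hypothetical usage against a row-oriented writer:
# for chunk in iter_chunks(df.values.tolist()):
#     for row in chunk:
#         sheet.write_row(row)
```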
It is probably worth pointing out that, according to the benchmark in the pyexcelerate docs, it is only 2x faster than XlsxWriter. Also, openpyxl with lxml is probably as fast as XlsxWriter now. So although pyexcelerate may be faster, it isn't going to get the end user anywhere close to CSV speed, due to the verbose nature of the xlsx format and the fact that it needs to be zipped. Even the C version of XlsxWriter (libxlsxwriter) is only about 10x faster than the Python version, so it is questionable whether any of this is worth the effort.

If the pandas xlsx writer was restructured to write the data row by row, I might be able to optimise XlsxWriter up to pyexcelerate's speed. But again, I'm not sure it is worth the effort. Final note: I've seen a 10x increase in speed running long XlsxWriter programs under PyPy.
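For comparisons like the ones quoted in this thread, a small standard-library harness approximates IPython's `%timeit` (best of several runs, one call per run). This is a sketch; `bench` is a hypothetical helper, and the workload shown is a stand-in for the real `df.to_excel(...)` or pyexcelerate calls:

```python
import timeit

def bench(label, fn, repeat=3, number=1):
    """Time fn the way %timeit reports it: run it `number` times per
    trial, `repeat` trials, and report the best trial."""
    best = min(timeit.repeat(fn, repeat=repeat, number=number))
    print(f"{label}: {best:.3f} s per loop")
    return best

# Stand-in workload; swap in the actual writer calls to reproduce
# the comparisons quoted in this thread.
bench("sum of squares", lambda: sum(i * i for i in range(100_000)))
```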
I have made a quick test case. The code was run under the Intel distribution of Python 3.5.2. Similar results are obtained under "normal" CPython 3.6.2
Here is the code
EDIT: it was not a fair comparison, since to_excel wrote indexes and headers. Now it is a fair comparison.
@gfyoung would either of the two implementations be OK?
@raffam : Hmm...your timing results are encouraging but not convincing enough. Can you try with even larger sizes (think millions 😉 )?
The limits are 1,048,576 rows by 16,384 columns.
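The format limits quoted above can be turned into a quick pre-flight check before attempting a write. `fits_in_xlsx` is a hypothetical helper, not a pandas or XlsxWriter API:

```python
# Hard limits of the xlsx format, as quoted in the comment above.
XLSX_MAX_ROWS = 1_048_576
XLSX_MAX_COLS = 16_384

def fits_in_xlsx(n_rows, n_cols):
    """Return True if a table of this shape fits on a single xlsx sheet."""
    return n_rows <= XLSX_MAX_ROWS and n_cols <= XLSX_MAX_COLS
```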
@raffam : Right, so my point is that you can test with WAY MORE than 50,000 rows.
I understand, but unfortunately my PC stalls with very large matrices :(
@raffam : Thanks for this! This will be very helpful.
Used the toExcelerate3 function and the default xlsx writer to write a 333243×34 dataframe to file. Results:
For whoever is interested, I created a simple helper function to write DataFrames to Excel (including headers and index) using pyexcelerate ( https://gist.github.com/mapa17/bc04be36e447cab0746a0ec8903cc49f ). I thought about adding an Excel writer engine to pandas/pandas/io/excel.py, but I am a bit worried looking through the other already implemented engines: they support all kinds of fancy cell formatting. Do you think it would be sufficient to provide a minimalistic Excel writer engine using pyexcelerate, writing only unformatted Excel files?
Hmm...from a maintenance perspective, I think I would want to maintain fewer engines rather than more, because of compatibility issues down the road (e.g. maintaining consistency between the Python and C parsers for CSV has been quite difficult). I could potentially see this as a "fast track," but it sounds like a bit of a corner case. Thus, I'm -0.5 on this overall, because I'm leaning towards this being more maintenance than useful for the broader audience. cc @jreback
This thing is indeed too slow; I'm not surprised to find others thinking the same. Five minutes to produce a 50 MB Excel file is too much.
Based on the conversation above, I think we'll close this for now, as it's unclear whether there's really anything to be gained here. If anyone disagrees, feel free to reopen.
My console logs make me sad about to_excel performance, especially given that it has been almost 10 years since this issue was opened; not much has improved.
|
Looked at this a bit, as I was considering trying to implement this. It seems like (at least in 2024) the speed difference is primarily within Pandas itself.
However, when increasing even to just 10000 rows, pure xlsxwriter is the fastest of all implementations [output below].
Looking at a cProfile run, it seems the biggest time sink in the process is this call: I'd have to look more into exactly why the styles are pulled in this way, but it seems worth investigating. We still wouldn't see to_csv()-level performance, but a 20% speed increase seems possible. Adding a flag for whether the call is coming from DataFrame().to_excel() vs Styler().to_excel() seems like it would also be enough to get this speed boost for all values-only usages, which I would expect to be the vast majority.
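The cProfile run mentioned above can be reproduced with only the standard library. `profile_call` is a hypothetical wrapper, and the workload in the example is a stand-in for the real `df.to_excel(...)` call:

```python
import cProfile
import io
import pstats

def profile_call(fn, *args, top=10):
    """Run fn under cProfile and return a report of the top
    cumulative-time entries, which is where per-cell style lookups
    would show up as hot spots."""
    pr = cProfile.Profile()
    pr.enable()
    fn(*args)
    pr.disable()
    buf = io.StringIO()
    pstats.Stats(pr, stream=buf).sort_stats("cumulative").print_stats(top)
    return buf.getvalue()

# Stand-in workload; replace with a lambda calling df.to_excel(...).
report = profile_call(lambda: sum(i * i for i in range(100_000)))
```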
Here's an example with just 65,000 elements (it's much worse with 250,000) so I can compare xls and xlsx:
```python
from pandas import *

df = DataFrame({'col1': [0.0] * 65000,
                'col2': 1,
                'col3': 2})

%timeit df.to_csv('sample.csv')
# 10 loops, best of 3: 109 ms per loop

%timeit df.to_excel('sample.xls')
# 1 loops, best of 3: 8.35 s per loop

%timeit df.to_excel('sample.xlsx')
# 1 loops, best of 3: 1min 31s per loop
```