save and load parquet with MultiIndex (row) index and columns #1476

Open
ikravets opened this issue May 10, 2020 · 1 comment
Labels
enhancement New feature or request

Comments

@ikravets

I'm experimenting with Koalas. My pandas DataFrames use a MultiIndex for both rows and columns. Such DataFrames can be saved to and loaded from parquet files using PyArrow, and Koalas can successfully convert them to and from pandas. However, Koalas cannot save or load them directly to or from parquet. Having to go through pandas just to load or store the data severely limits the data size that can be handled and largely defeats the purpose of using Koalas.

PyArrow stores the information needed to reconstruct a MultiIndex in the parquet metadata. It would be nice for Koalas to use the same approach for better compatibility, and maybe even reuse the PyArrow library. Pointers to the PyArrow implementation:

Right now Koalas supports MultiIndex save/load for rows, but it requires specifying the index_col parameter on every to_parquet()/read_parquet() call, which is inferior to the PyArrow approach.

@HyukjinKwon HyukjinKwon added the enhancement New feature or request label May 11, 2020
@ueshin
Collaborator

ueshin commented Aug 4, 2020

FYI: the read path was resolved in #1695.
