pyarrow latest parquet map column type isn't supported #2262

Closed
den-rifiniti opened this issue Jul 13, 2018 · 7 comments

Comments

@den-rifiniti

den-rifiniti commented Jul 13, 2018

Hello
When I try to read a Parquet file with a column of type map, this exception is thrown: pyarrow.lib.ArrowNotImplementedError: lists with structs are not supported.

It seems this was fixed in #1530. Is a new release all that is required?
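
For context on the wording of the error: the Parquet format spec describes a MAP column as a repeated key_value group holding a key field and a value field, so structurally it is a list of key/value structs. That is most likely why a map column surfaces as "lists with structs are not supported". Below is a minimal plain-Python sketch of that layout. The helper names `map_to_key_value_list` and `key_value_list_to_map` are made up for illustration and are not pyarrow APIs.

```python
# Illustration only: these helpers are hypothetical and not part of pyarrow.
# They mimic how Parquet lays out one MAP value as a list of key/value structs.


def map_to_key_value_list(mapping):
    """Turn a dict into a list of {"key": ..., "value": ...} structs."""
    return [{"key": k, "value": v} for k, v in mapping.items()]


def key_value_list_to_map(entries):
    """Rebuild a dict from a list of key/value structs."""
    return {entry["key"]: entry["value"] for entry in entries}


row = {"a": 1, "b": 2}
entries = map_to_key_value_list(row)
print(entries)
print(key_value_list_to_map(entries) == row)
```

A reader that cannot yet materialize lists of structs therefore cannot materialize maps either, even when the schema only declares a map.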

@xhochy
Member

xhochy commented Jul 14, 2018

No, this would also require apache/parquet-cpp#462.

@wesm
Member

wesm commented Jul 19, 2018

Assistance with this would be much appreciated. Unfortunately, we haven't been able to get this done in time for 0.10, so it will have to be later this year.

wesm closed this as completed Jul 19, 2018
@damache

damache commented Dec 7, 2018

Was this fixed? I installed the following:

!conda install -c conda-forge pyarrow

Fetching package metadata .............
Solving package specifications: .

Package plan for installation in environment /opt/conda/envs/DSX-Python35:

The following NEW packages will be INSTALLED:

    arrow-cpp:   0.10.0-py35h70250a7_0 conda-forge
    boost-cpp:   1.67.0-h3a22d5f_0     conda-forge
    parquet-cpp: 1.5.0.pre-h83d4a3d_0  conda-forge
    pyarrow:     0.10.0-py35hfc679d8_0 conda-forge

boost-cpp-1.67 100% |################################| Time: 0:00:00  90.87 MB/s
arrow-cpp-0.10 100% |################################| Time: 0:00:00  47.85 MB/s
parquet-cpp-1. 100% |################################| Time: 0:00:00  68.96 MB/s
pyarrow-0.10.0 100% |################################| Time: 0:00:00  62.08 MB/s

Then I tried this code:

import io
import pandas as pd
import pyarrow.parquet as pq

# Read the parquet file
buffer = io.BytesIO()
object = cos.Object('*********','*****************')
object.download_fileobj(buffer)
table = pq.read_table(buffer)
df = table.to_pandas()
print(df.head())

But I get this error:

ArrowNotImplementedError                  Traceback (most recent call last)
<ipython-input-11-a1e8748910ba> in <module>()
      7 object = cos.Object('********','*************')
      8 object.download_fileobj(buffer)
----> 9 table = pq.read_table(buffer)
     10 df = table.to_pandas()
     11 print(df.head())

/opt/conda/envs/DSX-Python35/lib/python3.5/site-packages/pyarrow/parquet.py in read_table(source, columns, nthreads, metadata, use_pandas_metadata)
   1048     pf = ParquetFile(source, metadata=metadata)
   1049     return pf.read(columns=columns, nthreads=nthreads,
-> 1050                    use_pandas_metadata=use_pandas_metadata)
   1051 
   1052 

/opt/conda/envs/DSX-Python35/lib/python3.5/site-packages/pyarrow/parquet.py in read(self, columns, nthreads, use_pandas_metadata)
    150             columns, use_pandas_metadata=use_pandas_metadata)
    151         return self.reader.read_all(column_indices=column_indices,
--> 152                                     nthreads=nthreads)
    153 
    154     def scan_contents(self, columns=None, batch_size=65536):

/opt/conda/envs/DSX-Python35/lib/python3.5/site-packages/pyarrow/_parquet.pyx in pyarrow._parquet.ParquetReader.read_all()

/opt/conda/envs/DSX-Python35/lib/python3.5/site-packages/pyarrow/error.pxi in pyarrow.lib.check_status()

ArrowNotImplementedError: lists with structs are not supported.

@wesm
Member

wesm commented Dec 8, 2018

No, it has not been implemented yet.

@wesm
Member

wesm commented Dec 9, 2018

@damache would anyone from IBM like to get involved with Parquet development? We could really use the help.

@sujayramaiah

Most of the data files in our data lake have map columns. Not being able to read Parquet files with map columns using pyarrow creates a dependency on Spark. Is there a plan to support map columns?

@wesm
Member

wesm commented Jan 28, 2020

Yes, but someone has to do the implementation work. See ARROW-1644 and related issues.

5 participants