[Python] Schema definition for dictionaries #4045

Closed
MasterSalami opened this issue Mar 27, 2019 · 4 comments

Comments

@MasterSalami

MasterSalami commented Mar 27, 2019

This issue is similar to #3804 but I couldn't find a corresponding JIRA issue.
I have a pandas DataFrame (created from a JSON file) in which one column (type) contains a Python dictionary:

[ { "type": {"type_1" : "test_1", "type_2":"test_2"}, "id": "foo", "name": "Bar" },  { "type": {"type_1" : "test_1", "type_2":"test_2"}, "id": "foo_1", "name": "Bar_1" } ]

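The schema I wrote is:

table_schema = pa.schema([
    # Struct fields are passed as a list so that their order is preserved
    # (a set literal in braces has no guaranteed order).
    ("type", pa.struct([
        ("type_1", pa.string()),
        ("type_2", pa.string()),
    ])),
    ("id", pa.string()),
    ("name", pa.string()),
])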

The type pyarrow detects with pa.DictionaryArray.from_pandas(df['type']).type is also struct<type_1: string, type_2: string>.
However, building the pyarrow table with table = pa.Table.from_pandas(df, table_schema) always raises an error:

Traceback (most recent call last):
File "test_pyarrow.py", line 29, in <module>
table = pa.Table.from_pandas(df, table_schema)
File "pyarrow\table.pxi", line 1141, in pyarrow.lib.Table.from_pandas
File "C:\Users\####\AppData\Local\Programs\Python\Python37\lib\site-packages\pyarrow\pandas_compat.py", line 443,
types
File "C:\Users\####\AppData\Local\Programs\Python\Python37\lib\site-packages\pyarrow\pandas_compat.py", line 217,
df.columns, column_names, df_types
File "C:\Users\####\AppData\Local\Programs\Python\Python37\lib\site-packages\pyarrow\pandas_compat.py", line 216,
) for col_name, sanitized_name, arrow_type in zip(
File "C:\Users\####\AppData\Local\Programs\Python\Python37\lib\site-packages\pyarrow\pandas_compat.py", line 160,
logical_type = get_logical_type(arrow_type)
File "C:\Users\####\AppData\Local\Programs\Python\Python37\lib\site-packages\pyarrow\pandas_compat.py", line 87, i
raise NotImplementedError(str(arrow_type))
NotImplementedError: struct<type_2: string, type_1: string>

Are nested dictionaries not implemented yet? Or is there a way to declare a dictionary in the schema (with DictionaryType)?

@MasterSalami
Author

I found a way to do it, so here is a quick summary of the code. I admit it looks like a hack, but it works:

import pandas as pd
import pyarrow as pa

# Data is just as in my previous comment
data = [ { "type": {"type_1" : "test_1", "type_2":"test_2"}, "id": "foo", "name": "Bar" },  { "type": {"type_1" : "test_1", "type_2":"test_2"}, "id": "foo_1", "name": "Bar_1" } ]
df = pd.DataFrame(data)

# Create a first table without the nested type
table = pa.Table.from_pandas(df[['id','name']])

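# Get your nested pandas column and put it in a pyarrow column
# (pa.column is available in the pyarrow release used here; later releases
# dropped it, and add_column takes the position, the name, and the array)
column_to_add = pa.column('type', pa.array(df["type"]))
# Insert it at the position you want (mine was 0)
table = table.add_column(0, column_to_add)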

print(table.to_pandas())
                                       type     id   name
0  {'type_1': 'test_1', 'type_2': 'test_2'}    foo    Bar
1  {'type_1': 'test_1', 'type_2': 'test_2'}  foo_1  Bar_1

It keeps the nested structure in the "type" column. So it works, but only partly:

  1. Getting the schema of that table and trying to reuse it doesn't work:
# Removed 3rd because it is the __index_level_0__ column and doesn't belong
schema_to_use = table.schema.remove(3)
pa_table_from_df = pa.Table.from_pandas(df, schema = schema_to_use)

This yields the same error as before: NotImplementedError: struct<type_1: string, type_2: string>

  2. My goal was to save the table as Parquet, and that didn't work either:
import pyarrow.parquet as pq
pq.write_table(table, 'test.parquet')

File "C:\Users\####\AppData\Local\Programs\Python\Python37\lib\site-packages\pyarro
writer.write_table(table, row_group_size=row_group_size)
File "C:\Users\####\AppData\Local\Programs\Python\Python37\lib\site-packages\pyarro
self.writer.write_table(table, row_group_size=row_group_size)
File "pyarrow\_parquet.pyx", line 924, in pyarrow._parquet.ParquetWriter.write_table
File "pyarrow\error.pxi", line 81, in pyarrow.lib.check_status
pyarrow.lib.ArrowInvalid: Nested column branch had multiple children

Following https://jira.apache.org/jira/browse/ARROW-2587 and apache/parquet-cpp#462, it seems that saving to parquet with nested columns isn't implemented yet.
If I'm wrong, I'd love to be corrected on this 😅
Thanks in advance

@pitrou
Member

pitrou commented Apr 11, 2019

Sorry for the delay in answering this. As you noticed, a JIRA issue already exists for the Parquet problem.
As for the Pandas-to-Arrow conversion issue, I opened ARROW-5161.

pitrou closed this as completed Apr 11, 2019

@harikrishnan-dev

harikrishnan-dev commented Sep 27, 2019

I still get this error when I try to save nested JSON in Parquet format. I followed these steps:

from pyarrow import json
import pyarrow.parquet as pq
r = json.read_json('example.jl')
pq.write_table(r,'example.parquet')

My JSON has a few struct types.
I get this error: ArrowInvalid: Nested column branch had multiple children

@wesm
Member

wesm commented Sep 27, 2019

Can you please open a JIRA issue with a reproducible example?
