[Python] Schema definition for dictionaries #4045

MasterSalami · 2019-03-27T10:48:40Z

This issue is similar to #3804 but I couldn't find a corresponding JIRA issue.
I have a pandas DataFrame (created from a JSON file) in which one column ("type") holds Python dictionaries:

[ { "type": {"type_1" : "test_1", "type_2":"test_2"}, "id": "foo", "name": "Bar" },  { "type": {"type_1" : "test_1", "type_2":"test_2"}, "id": "foo_1", "name": "Bar_1" } ]

The schema I wrote is :

table_schema = pa.schema([
    # A list (not a set) of fields keeps type_1 before type_2
    ("type", pa.struct([
        ("type_1", pa.string()),
        ("type_2", pa.string())
    ])),
    ("id", pa.string()),
    ("name", pa.string())
])

The type pyarrow detects with pa.DictionaryArray.from_pandas(df['type']).type is also struct<type_1: string, type_2: string>.
However, when building the pyarrow table with table = pa.Table.from_pandas(df, table_schema), an error always appears :

Traceback (most recent call last):
File "test_pyarrow.py", line 29, in table = pa.Table.from_pandas(df, table_schema)
File "pyarrow\table.pxi", line 1141, in pyarrow.lib.Table.from_pandas
File "C:\Users####\AppData\Local\Programs\Python\Python37\lib\site-packages\pyarrow\pandas_compat.py", line 443,
types
File "C:\Users####\AppData\Local\Programs\Python\Python37\lib\site-packages\pyarrow\pandas_compat.py", line 217,
df.columns, column_names, df_types
File "C:\Users####\AppData\Local\Programs\Python\Python37\lib\site-packages\pyarrow\pandas_compat.py", line 216,
) for col_name, sanitized_name, arrow_type in zip(
File "C:\Users####\AppData\Local\Programs\Python\Python37\lib\site-packages\pyarrow\pandas_compat.py", line 160,
logical_type = get_logical_type(arrow_type)
File "C:\Users####\AppData\Local\Programs\Python\Python37\lib\site-packages\pyarrow\pandas_compat.py", line 87, i
raise NotImplementedError(str(arrow_type))
NotImplementedError: struct<type_2: string, type_1: string>

Are nested dictionaries not implemented yet? Or is there a way to declare a dictionary in the schema (with DictionaryType)?


MasterSalami · 2019-03-28T18:03:44Z

Found a way to do it, so here's a quick summary of the code. I'll admit it looks quite like a hack but it still works :

import pandas as pd
import pyarrow as pa

# Data is just as in my previous comment
data = [ { "type": {"type_1" : "test_1", "type_2":"test_2"}, "id": "foo", "name": "Bar" },  { "type": {"type_1" : "test_1", "type_2":"test_2"}, "id": "foo_1", "name": "Bar_1" } ]
df = pd.DataFrame(data)

# Create a first table without the nested type
table = pa.Table.from_pandas(df[['id','name']])

# Get your nested pandas column and put it in a pyarrow column
column_to_add = pa.column('type', pa.array(df["type"]))
# Insert it at the position you want (mine was 0)
table = table.add_column(0, column_to_add)

print(table.to_pandas())

	type	id	name
0	{'type_1': 'test_1', 'type_2': 'test_2'}	foo	Bar
1	{'type_1': 'test_1', 'type_2': 'test_2'}	foo_1	Bar_1

And it keeps the nested structure in the "type" column. So it kind of works, but only kind of :

When getting the schema of that table and trying to re-use it, it doesn't work :

# Removed 3rd because it is the __index_level_0__ column and doesn't belong
schema_to_use = table.schema.remove(3)
pa_table_from_df = pa.Table.from_pandas(df, schema = schema_to_use)

Yields the same error as before : NotImplementedError: struct<type_1: string, type_2: string>

My goal was to save the table as parquet, and that didn't work either :

import pyarrow.parquet as pq
pq.write_table(table, 'test.parquet')

File "C:\Users\####\AppData\Local\Programs\Python\Python37\lib\site-packages\pyarro
writer.write_table(table, row_group_size=row_group_size)
File "C:\Users\####\AppData\Local\Programs\Python\Python37\lib\site-packages\pyarro
self.writer.write_table(table, row_group_size=row_group_size)
File "pyarrow_parquet.pyx", line 924, in pyarrow._parquet.ParquetWriter.write_table
File "pyarrow\error.pxi", line 81, in pyarrow.lib.check_status
pyarrow.lib.ArrowInvalid: Nested column branch had multiple children

Following https://jira.apache.org/jira/browse/ARROW-2587 and apache/parquet-cpp#462, it seems that saving to parquet with nested columns isn't implemented yet.
If I'm wrong, I'd love to be corrected on this 😅
Thanks in advance

pitrou · 2019-04-11T15:22:48Z

Sorry for the delay in answering this. As you noticed, an issue already exists for the Parquet issue.
As for the Pandas-to-Arrow conversion issue, I opened ARROW-5161.

harikrishnan-dev · 2019-09-27T09:33:12Z

I am still getting this error when I try to save nested JSON in Parquet format. I followed these steps:

from pyarrow import json
import pyarrow.parquet as pq
r = json.read_json('example.jl')
pq.write_table(r,'example.parquet')

My JSON has a few struct types.
I get this error: ArrowInvalid: Nested column branch had multiple children

wesm · 2019-09-27T17:37:25Z

Can you please open a JIRA issue with a reproducible example?

pitrou closed this as completed Apr 11, 2019

This was referenced Sep 6, 2019

[Python] Cannot convert struct type from Pandas object column #21640

Closed

Nested column branch had multiple children #23078

Closed
