[Python] Column metadata is not saved or loaded in parquet #20926

Closed
asfimport opened this issue Jan 24, 2019 · 7 comments


asfimport commented Jan 24, 2019

Hi all,

A while ago I posted this issue: ARROW-3866.

While working with pyarrow I encountered another potential bug related to column metadata: if I create a table containing columns with metadata, everything is fine. But after I save the table to parquet and load it back as a table using pq.read_table, the column metadata is gone.

At this point I cannot say whether the metadata is not saved correctly or not loaded correctly, as I have no idea how to verify it. Unfortunately I also don't have the time to investigate much, but I wanted to let you know anyway.

import pyarrow as pa
import pyarrow.parquet as pq

# Reproducer as originally posted (uses the pyarrow Column API that existed at the time);
# 'path' is any parquet file path.
field0 = pa.field('field1', pa.int64(), metadata=dict(a="A", b="B"))
field1 = pa.field('field2', pa.int64(), nullable=False)
columns = [
    pa.column(field0, pa.array([1, 2])),
    pa.column(field1, pa.array([3, 4]))
]
table = pa.Table.from_arrays(columns)

pq.write_table(table, path)

tab2 = pq.read_table(path)
tab2.column(0).field.metadata  # the field metadata is gone after the roundtrip (the reported bug)
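
One way to check whether the metadata survives the write step at all (a sketch added here, not part of the original report; it assumes a recent pyarrow where ParquetFile.schema_arrow is available and reuses the 'path' variable from above) is to inspect the written file's metadata directly instead of going through pq.read_table:

import pyarrow.parquet as pq

pf = pq.ParquetFile(path)
print(pf.schema_arrow.field('field1').metadata)  # arrow schema reconstructed from the file
print(pf.metadata.metadata)                      # file-level key/value metadata stored in the parquet footer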

 

Reporter: Seb Fru

Note: This issue was originally created as ARROW-4359. Please see the migration documentation for further details.


Joris Van den Bossche / @jorisvandenbossche:
The arrow field metadata could in principle map to parquet's ColumnChunkMetaData->key_value_metadata, but I don't think this is implemented.

The key_value_metadata on the full schema is implemented, and this roundtrips to parquet.
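
For contrast, a minimal sketch (not from the issue, assuming a recent pyarrow; the file name is made up) of the schema-level metadata roundtrip mentioned above:

import pyarrow as pa
import pyarrow.parquet as pq

table = pa.table({'col': [1, 2, 3]})
table = table.replace_schema_metadata({'source': 'example'})  # schema-level key/value metadata

pq.write_table(table, 'schema_meta.parquet')
restored = pq.read_table('schema_meta.parquet')
print(restored.schema.metadata)  # the b'source': b'example' entry survives the roundtrip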


Wes McKinney / @wesm:
This looks kinda buggy, maybe it's fixed now. I added it to 0.15.0 so we can see.


Joris Van den Bossche / @jorisvandenbossche:
I converted the example to be compatible with the latest pyarrow:

field1 = pa.field('field1', pa.int64(), metadata=dict(a="A", b="B")) 
field2 = pa.field('field2', pa.int64(), nullable=False)
table = pa.Table.from_arrays([pa.array([1, 2]), pa.array([3, 4])], schema=pa.schema([field1, field2]))

pq.write_table(table, path)
table2 = pq.read_table(path)

table2.schema.field_by_name('field1').metadata

It's not yet fixed.


Wes McKinney / @wesm:
Thanks


Wes McKinney / @wesm:
This is tricky, since in parquet the field metadata is found in each row group.
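
To illustrate that point, a hedged sketch (assuming a pyarrow-written file named 'example.parquet'): parquet keeps column-chunk metadata per row group, so a single arrow field would correspond to one metadata slot per row group rather than one per file.

import pyarrow.parquet as pq

meta = pq.read_metadata('example.parquet')
for rg in range(meta.num_row_groups):
    chunk = meta.row_group(rg).column(0)  # ColumnChunkMetaData for this column in this row group
    print(rg, chunk.path_in_schema, chunk.num_values)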


Joris Van den Bossche / @jorisvandenbossche:
It could also be an option to use the serialized arrow schema for this. That would be sufficient for an arrow <-> parquet roundtrip, but it is of course not a general parquet mechanism (so the metadata would not be in a format reusable by other parquet readers).
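
For reference, pyarrow already embeds the serialized arrow schema in the file-level key/value metadata under the 'ARROW:schema' key; a small sketch (assuming a pyarrow-written file named 'example.parquet') showing where that lives:

import pyarrow.parquet as pq

file_meta = pq.read_metadata('example.parquet')
print(b'ARROW:schema' in file_meta.metadata)  # True for files written by pyarrow with default settings
print(pq.read_schema('example.parquet'))      # arrow schema, reconstructed using that key when present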


Antoine Pitrou / @pitrou:
If I try @jorisvandenbossche's reproducer, it seems to work now:

>>> path = "xxx.pq"
>>> field1 = pa.field('field1', pa.int64(), metadata=dict(a="A", b="B"))
>>> field2 = pa.field('field2', pa.int64(), nullable=False)
>>> table = pa.Table.from_arrays([pa.array([1, 2]), pa.array([3, 4])], schema=pa.schema([field1, field2]))
>>> 
>>> pq.write_table(table, path)
>>> table2 = pq.read_table(path)
>>> 
>>> table2.schema.field('field1')
pyarrow.Field<field1: int64>
>>> table2.schema.field('field2')
pyarrow.Field<field2: int64 not null>
>>> table2.schema.field('field1').metadata
{b'PARQUET:field_id': b'1', b'a': b'A', b'b': b'B'}
>>> table2.schema.field('field2').metadata
{b'PARQUET:field_id': b'2'}
