
[C++][Parquet] Read and write nested Parquet data with a mix of struct and list nesting levels #17654

Closed
asfimport opened this issue Oct 5, 2017 · 44 comments

Comments


asfimport commented Oct 5, 2017

Reporter: Uwe Korn / @xhochy

Related issues:

Note: This issue was originally created as PARQUET-911. Please see the migration documentation for further details.


Wes McKinney / @wesm:
At the moment we do not have mixed-nesting reading and writing implemented. If the nesting levels are all repeated (lists) or all groups (structs) vs. a mix (structs and lists/repeated fields) then we can read and write them. I recently wrote an important patch to help with this (PARQUET-1100 apache/parquet-cpp@4b09ac7), but we could really use some help with the encoding and decoding of nested data. I will eventually get to it if no one else does but that could be anytime from 1 month from now to 6 months from now given how many other projects I have before me.


Wes McKinney / @wesm:
I updated the issue metadata


DB Tsai:
@wesm Thanks for the detailed reply. Are you saying that parquet-cpp already supports this, but some work is required on the Python side? It would be really nice to see this supported in pyarrow, since most of the data in my team is in this mixed-nesting format.


Wes McKinney / @wesm:
The work is pretty much all on the parquet-cpp side, so it's strictly an Arrow <-> Parquet nested encoding conversion problem in C++. We'll want unit tests in pyarrow to verify that we can faithfully round-trip the data, of course. I think Python will be useful for generating schemas and synthetic data sets to find edge cases.


Joshua Storck / @joshuastorck:
The reading half of this issue is addressed by this: apache/parquet-cpp#462. Perhaps we should split this into two separate issues?


Charles Pritchard:
I came across this use case with some code that uses nested data with namedtuple, looking to use pyarrow as part of serialization/deserialization to Parquet. context.register of course works for namedtuple, but there's no clean way to get to a pa.Table.


Francisco Sanchez:
Any future plans for this feature?


Wes McKinney / @wesm:
It's in the issue tracker. The work will eventually get done but it's hard to say when


Wes McKinney / @wesm:
It's doubtful this will be completed in 0.13, but I plan to start working on it, with target completion in time for 0.14 (~end of May 2019).


David Lee / @davlee1972:
I've been able to write Parquet columns which are lists, but I haven't been able to write a column which is a list of structs.

This works:

schema = pa.schema([
    pa.field('test_id', pa.string()),
    pa.field('a', pa.list_(pa.string())),
    pa.field('b', pa.list_(pa.int32()))
])

This structure isn't supported yet:

schema = pa.schema([
    pa.field('test_id', pa.string()),
    pa.field('testlist', pa.list_(pa.struct([('a', pa.string()), ('b', pa.int32())])))
])

new_records = list()
new_records.append({'test_id': '123', 'testlist': [{'a': 'xyz', 'b': 22}]})
new_records.append({'test_id': '789', 'testlist': [{'a': 'aaa', 'b': 33}]})

arrow_columns = list()

for column in schema.names:
    arrow_columns.append(pa.array([v[column] for v in new_records], type=schema.types[schema.get_field_index(column)]))

arrow_table = pa.Table.from_arrays(arrow_columns, schema.names)

arrow_table
arrow_table[0]
arrow_table[1]
arrow_table[1][0]
arrow_table[1][1]

>>> pq.write_table(arrow_table, "test.parquet")
Traceback (most recent call last):
  File "packages/pyarrow/parquet.py", line 1160, in write_table
    writer.write_table(table, row_group_size=row_group_size)
    self.writer.write_table(table, row_group_size=row_group_size)
  File "pyarrow/_parquet.pyx", line 924, in pyarrow._parquet.ParquetWriter.write_table
  File "pyarrow/error.pxi", line 81, in pyarrow.lib.check_status
pyarrow.lib.ArrowInvalid: Nested column branch had multiple children

Supporting structs is the missing piece for saving structured JSON as columnar Parquet, which would make JSON searchable.
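
In the meantime, one possible workaround (just a sketch with hypothetical flattened column names, not anything pyarrow provides) is to split each list-of-struct field into parallel list columns, which the flat writer above already handles:

import pyarrow as pa
import pyarrow.parquet as pq

new_records = [
    {'test_id': '123', 'testlist': [{'a': 'xyz', 'b': 22}]},
    {'test_id': '789', 'testlist': [{'a': 'aaa', 'b': 33}]},
]

# Flattened schema: one list column per struct field ('testlist_a' and
# 'testlist_b' are made-up names for this sketch).
flat_schema = pa.schema([
    pa.field('test_id', pa.string()),
    pa.field('testlist_a', pa.list_(pa.string())),
    pa.field('testlist_b', pa.list_(pa.int32()))
])

arrow_columns = [
    pa.array([v['test_id'] for v in new_records], type=pa.string()),
    pa.array([[s['a'] for s in v['testlist']] for v in new_records],
             type=pa.list_(pa.string())),
    pa.array([[s['b'] for s in v['testlist']] for v in new_records],
             type=pa.list_(pa.int32())),
]

arrow_table = pa.Table.from_arrays(arrow_columns, flat_schema.names)
pq.write_table(arrow_table, "test_flat.parquet")  # writes, since no struct remains

Readers would then have to zip the parallel lists back into records themselves.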


Wes McKinney / @wesm:
[~[email protected]] please keep in mind there are multiple styles of nested data encoding (1-, 2-, and 3-level list encoding); this can be determined from the schema, so we'll probably need to support all three kinds.


Wes McKinney / @wesm:
Yes, essentially.

1-level encoding

group schema {
  optional INT32 some_other_value;
  repeated T list_item;
}

2-level encoding (I think? need to confirm my understanding)

group schema {
  optional INT32 some_other_value;
  repeated group list_value {
    optional/required T list_item;
  }
}

3-level encoding

group schema {
  optional INT32 some_other_value;
  optional/required group list_value (LIST) {
    repeated group box {
      optional/required T list_item;
    }
  }
}

Note:

  • The 1-level encoding can only encode array<item NOT NULL> NOT NULL.

  • The 2-level encoding can only encode array<item> NOT NULL.

  • The 3-level encoding can encode nullability of the list items, the lists themselves, or both: array<item [nullable?]> [nullable?].

The decode path is slightly different for the 1- and 2-level cases.
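
To see which encoding a given writer produced, you can inspect the file's Parquet schema. A minimal pyarrow sketch (the file name is made up, and the exact group/field names pyarrow prints vary by version):

import pyarrow as pa
import pyarrow.parquet as pq

table = pa.table({
    'some_other_value': pa.array([1], type=pa.int32()),
    'list_value': pa.array([[1, 2, 3]], type=pa.list_(pa.int32())),
})
pq.write_table(table, 'lists.parquet')

# Prints the physical Parquet schema; pyarrow emits the 3-level form, with
# a repeated inner group (the "box" above) wrapping the list items.
print(pq.ParquetFile('lists.parquet').schema)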


Micah Kornfield / @emkornfield:
Since there has been some interest on the old PR, I'll give a quick status update.

I'm about 50% done with the write path; I hope to have it finished by the end of this week or next. I'll then start on the read path. I will likely try to leverage some code from the old PR or #4066, but I'll have a better idea once I take a close look.


Brendan Hogan:
Hello, I'm interested in the latest status here and when this may be available for use. I know there is work in progress by [~[email protected]] and maybe others, but as it's been four weeks since the last update I thought it was worth a check. Thanks in advance for any info (and for the work on this)!


Wes McKinney / @wesm:
I'm not aware of any updates; there are no patches available yet


Micah Kornfield / @emkornfield:
[~bhogan-mitre] there isn't much of an update. I put this off a little because @wesm was doing some major refactoring. If you want to contribute, we can probably divide the work for read and write. (@wesm, are you planning anything else major with the Arrow Parquet code?)


Wes McKinney / @wesm:
Nope, the large projects I had planned are done. The only further work I'd be interested in would be expanding the encoders / low-level column reader/writer classes to handle more dictionary-encoded types. None of that should affect the nested data disassembly / reassembly logic. One of my goals with these recent refactors was actually to move the "flat" serialization/deserialization code "out of the way" (since the prior effort on this caused performance regressions on flat data)


Brian Phillips:
My main use case for (py)arrow is converting very nested protobuf data to Parquet for storage. Currently I'm forced to store it as JSON instead because there is no nested data support. I would love to see this implemented, but unfortunately I can't be much help as I don't know C++.


Wes McKinney / @wesm:
Note that contributing to other parts of the project helps free up developers to work on larger projects like this.


William Young:
Are there plans to merge this code? I have a use-case.


Micah Kornfield / @emkornfield:
The code isn't really usable anymore, since it is based on the old repo and a lot of changes have been made since (and it had a performance regression). I haven't had time to work on this, but I still hope to get some bandwidth in the next month or so. If there are motivated parties, though, I'm happy to remove my name from the assignment.


Rinke Hoekstra:
I was just trying this with the example found in the pyarrow docs at http://arrow.apache.org/docs/python/json.html

The documented example does not work. Is this related to this issue, or is it another matter?

It says to load the following JSON file:

{"a": [1, 2], "b": {"c": true, "d": "1991-02-03"}}
{"a": [3, 4, 5], "b": {"c": false, "d": "2019-04-01"}}

I fixed this to make it valid (but that's another issue):

[{"a": [1, 2], "b": {"c": true, "d": "1991-02-03",}}
{"a": [3, 4, 5], "b": {"c": false, "d": "2019-04-01"]}}

Then reading the JSON from a file called my_data.json:

from pyarrow import json
table = json.read_json("my_data.json")

Gives the following error:

---------------------------------------------------------------------------
ArrowInvalid                              Traceback (most recent call last)
<ipython-input-69-f974c21f0941> in <module>()
      1 from pyarrow import json
----> 2 table = json.read_json('test.json')

~/.local/share/virtualenvs/parquet-ifRxINoC/lib/python3.7/site-packages/pyarrow/_json.pyx in pyarrow._json.read_json()

~/.local/share/virtualenvs/parquet-ifRxINoC/lib/python3.7/site-packages/pyarrow/error.pxi in pyarrow.lib.check_status()

ArrowInvalid: JSON parse error: A column changed from object to array


Joris Van den Bossche / @jorisvandenbossche:
[~RinkeHoekstra] that looks unrelated (the json reader is mostly independent from the parquet IO). Can you open a separate JIRA ticket?


David Lee / @davlee1972:
The format is valid: http://jsonlines.org. Line-delimited JSON is a better format for data since you can leverage threads to speed up read operations.

You also added the comma and bracket incorrectly, which turned valid JSONL into invalid JSON. They should be outside the curly braces:

[{"a": [1, 2], "b": {"c": true, "d": "1991-02-03"}},
{"a": [3, 4, 5], "b": {"c": false, "d": "2019-04-01"}} ]


Rinke Hoekstra:
[~[email protected]] good point about the multi-threaded loading, but (at the risk of being pedantic) it is valid JSON Lines, but not valid JSON: most if not all JSON parsers will raise an exception at the missing enclosing brackets and missing comma.

In any case, the issue is now raised at: https://issues.apache.org/jira/browse/ARROW-7226


Zack Gancarz:
Hi Wes, any progress on this one? It seems to be a common need, as a lot of people want to save nested protobufs to .parquet.

Thank you kindly


Micah Kornfield / @emkornfield:
For anyone interested in reading mixed-level data: if you can provide sample Parquet files (probably no more than 5-10 MB of data) to run microbenchmarks against, it would help ensure we are writing code with the right trade-offs.


Dmitry Kalinkin / @veprbl:
Here: https://transfer.sh/w4IQ0/test_nested.parquet
This was written with rust/parquet and tested to be readable with parquet-tools.


Eric Czech:
This may be another useful example: https://storage.googleapis.com/open-targets-data-releases/20.04/input/evidence-files/progeny-2018-07-23.json.gz

It's a 620K (uncompressed) set of json records with gene pathways that regulate various types of cancer.  It has a good mix of structs within structs, arrays with structs, arrays of structs that are themselves in other structs, etc.  


Wes McKinney / @wesm:
Thanks. We should follow up on the mailing list discussion and see what is the latest game plan for implementing the Parquet nested reader. Some of my colleagues should be able to help


Luke Higgins:
Has there been any more discussion of this on the mailing list? This feature would let me move a lot more to pyarrow and Parquet that is sadly JSON today. Thanks!


Micah Kornfield / @emkornfield:
We've discussed it, and there has been some work done, but there is still more to do. It's been a slow process due to personal bandwidth constraints.


Wes McKinney / @wesm:
I know that it's going to be a priority for my colleagues and me between now and the next release (2.0.0) to help get this done. Some people are on vacation right now, but we should have more bandwidth to assist with writing test cases and benchmarks (and helping with the implementation) in August.


Hui Gao:
Any update? We are really blocked by this. The write path works fine, but we can't read any struct/map type from the file. Thanks.


Micah Kornfield / @emkornfield:
[~fulluey] I'm glad the write path works fine; there are no further updates. Please follow the pull requests for the child issues here. I hope you are confirming the ability to read back your data in another system: a bug was recently discovered if you have nullable structs as direct ancestors of leaf columns (ARROW-9598).
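
For concreteness, a sketch (with made-up field names) of the shape ARROW-9598 concerns, i.e. a nullable struct that is the direct ancestor of a leaf column:

import pyarrow as pa

# 's' is an optional (nullable) struct directly wrapping the leaf column
# 'leaf'; ARROW-9598 affects round-tripping shapes like this.
schema = pa.schema([
    pa.field('s', pa.struct([pa.field('leaf', pa.int32())]), nullable=True)
])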


shadowdsp:
[~fulluey] Hello, I'd like to know how you solved this problem. Thanks.

I found that Spark uses parquet-mr, and I tested it successfully there. But it is Java-based; to use it from C++ I would have to call it through JNI, which is not efficient enough.


Martin Durant / @martindurant:
Just to chime in here, nice and late, to say that this feature would be immense when implemented. Given the similarity of the parquet storage layout and the arrow model, it ought not to be too hard - but C++ is never easy. As a python-first programmer, is there any way I can help?


(note that I have great hopes for awkward-arrays https://github.com/scikit-hep/awkward-1.0 to be a perfect pure-python way to iterate/filter/aggregate over deeply nested data loaded from parquet at C speeds)


Micah Kornfield / @emkornfield:
@martindurant https://issues.apache.org/jira/browse/ARROW-8492 could potentially be done at the Python layer. If you are interested, maybe chime in there and we can discuss whether there is a hard requirement for C++ (the nice thing about C++ tests is not having to deal with dual-language development).


Hui Gao:
[~shadowdsp], we still don't have a solution to the problem. Right now we write to disk and store the data in Hive, which works, but we can't read from a local file. I checked the C++ code base; it is really complex, and it does not look like an easy update for outside developers.

Parquet read/write from Spark works, but it is super slow compared with Arrow. We also hit memory consumption issues and frequent OOM errors. I would NOT recommend it unless you are already using Java.


Wes McKinney / @wesm:
Can this issue be resolved for 2.0.0, or perhaps it should be renamed to be "Parquet nested parent JIRA"?


Micah Kornfield / @emkornfield:
Yes; I pushed all remaining unresolved subtasks out to their own issues.


jiang,longshan:
[~fulluey], is there any solution for reading a MapArray Parquet file in C++? This issue is closed, but I still don't know the method, and it blocks our project, in which we want to use C++ in place of a Java application. Is there any advice? Thanks a lot.
