
[C++][Parquet] Read and write nested Parquet data with a mix of struct and list nesting levels #17654

Closed
asfimport opened this issue Oct 5, 2017 · 44 comments

Comments


asfimport commented Oct 5, 2017

Reporter: Uwe Korn / @xhochy

Related issues:

Note: This issue was originally created as PARQUET-911. Please see the migration documentation for further details.


Wes McKinney / @wesm:
At the moment we do not have mixed-nesting reading and writing implemented. If the nesting levels are all repeated (lists) or all groups (structs) vs. a mix (structs and lists/repeated fields) then we can read and write them. I recently wrote an important patch to help with this (PARQUET-1100 apache/parquet-cpp@4b09ac7), but we could really use some help with the encoding and decoding of nested data. I will eventually get to it if no one else does but that could be anytime from 1 month from now to 6 months from now given how many other projects I have before me.


Wes McKinney / @wesm:
I updated the issue metadata


DB Tsai:
@wesm Thanks for the detailed reply. Are you saying that parquet-cpp already supports this, but some work is required on the Python side? It would be really nice to see this supported in pyarrow, since most of the data in my team is in this mixed-nesting format.


Wes McKinney / @wesm:
The work is pretty much all on the parquet-cpp side, so it's strictly an Arrow <-> Parquet nested encoding conversion problem in C++. We'll want unit tests in pyarrow to verify that we can faithfully round-trip the data, of course. I think Python will be useful for generating schemas and synthetic data sets to find edge cases.


Joshua Storck / @joshuastorck:
The reading half of this issue is addressed by this: apache/parquet-cpp#462. Perhaps we should split this into two separate issues?


Charles Pritchard:
I came across this use case with some code that uses nested data with namedtuple, looking to use pyarrow as part of serialization/deserialization to Parquet. context.register of course works for namedtuple, but there's no clean way to get to a pa.Table.


Francisco Sanchez:
Any future plans for this feature?


Wes McKinney / @wesm:
It's in the issue tracker. The work will eventually get done but it's hard to say when


Wes McKinney / @wesm:
It's doubtful this will be completed in 0.13, but I plan to start working on it, with target completion in time for 0.14 (~end of May 2019).


David Lee / @davlee1972:
I've been able to write Parquet columns which are lists, but I haven't been able to write a column which is a list of structs.

This works:

schema = pa.schema([
    pa.field('test_id', pa.string()),
    pa.field('a', pa.list_(pa.string())),
    pa.field('b', pa.list_(pa.int32()))
])

This structure isn't supported yet:

schema = pa.schema([
    pa.field('test_id', pa.string()),
    pa.field('testlist', pa.list_(pa.struct([('a', pa.string()), ('b', pa.int32())])))
])

new_records = list()
new_records.append({'test_id': '123', 'testlist': [{'a': 'xyz', 'b': 22}]})
new_records.append({'test_id': '789', 'testlist': [{'a': 'aaa', 'b': 33}]})

arrow_columns = list()

for column in schema.names:
    arrow_columns.append(pa.array([v[column] for v in new_records], type=schema.types[schema.get_field_index(column)]))

arrow_table = pa.Table.from_arrays(arrow_columns, schema.names)

arrow_table
arrow_table[0]
arrow_table[1]
arrow_table[1][0]
arrow_table[1][1]

>>> pq.write_table(arrow_table, "test.parquet")
Traceback (most recent call last):
  File "packages/pyarrow/parquet.py", line 1160, in write_table
    writer.write_table(table, row_group_size=row_group_size)
    self.writer.write_table(table, row_group_size=row_group_size)
  File "pyarrow/_parquet.pyx", line 924, in pyarrow._parquet.ParquetWriter.write_table
  File "pyarrow/error.pxi", line 81, in pyarrow.lib.check_status
pyarrow.lib.ArrowInvalid: Nested column branch had multiple children

Supporting structs is the missing piece for saving structured JSON as columnar Parquet, which would make JSON searchable.
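
In the meantime, one possible workaround (just a sketch with hypothetical flattened column names, not anything pyarrow provides) is to split each list-of-struct field into parallel list columns, which the flat writer above already handles:

import pyarrow as pa
import pyarrow.parquet as pq

new_records = [
    {'test_id': '123', 'testlist': [{'a': 'xyz', 'b': 22}]},
    {'test_id': '789', 'testlist': [{'a': 'aaa', 'b': 33}]},
]

# Flattened schema: one list column per struct field ('testlist_a' and
# 'testlist_b' are made-up names for this sketch).
flat_schema = pa.schema([
    pa.field('test_id', pa.string()),
    pa.field('testlist_a', pa.list_(pa.string())),
    pa.field('testlist_b', pa.list_(pa.int32()))
])

arrow_columns = [
    pa.array([v['test_id'] for v in new_records], type=pa.string()),
    pa.array([[s['a'] for s in v['testlist']] for v in new_records],
             type=pa.list_(pa.string())),
    pa.array([[s['b'] for s in v['testlist']] for v in new_records],
             type=pa.list_(pa.int32())),
]

arrow_table = pa.Table.from_arrays(arrow_columns, flat_schema.names)
pq.write_table(arrow_table, "test_flat.parquet")  # writes, since no struct remains

Readers would then have to zip the parallel lists back into records themselves.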


Wes McKinney / @wesm:
[~[email protected]] please keep in mind there are multiple styles of nested data encoding (1-, 2-, and 3-level list encoding); this can be determined from the schema, so we'll probably need to support all three kinds.


Wes McKinney / @wesm:
Yes, essentially.

1-level encoding

group schema {
  optional INT32 some_other_value;
  repeated T list_item;
}

2-level encoding (I think? need to confirm my understanding)

group schema {
  optional INT32 some_other_value;
  repeated group list_value {
    optional/required T list_item;
  }
}

3-level encoding

group schema {
  optional INT32 some_other_value;
  optional/required group list_value (LIST) {
    repeated group box {
      optional/required T list_item;
    }
  }
}

Note:

  • The 1-level encoding can only encode array<item NOT NULL> NOT NULL.

  • The 2-level encoding can only encode array<item> NOT NULL.

  • The 3-level encoding can encode nullability of the list items, the lists themselves, or both: array<item [nullable?]> [nullable?].

The decode path is slightly different for the 1- and 2-level cases.
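
To see which encoding a given writer produced, you can inspect the file's Parquet schema. A minimal pyarrow sketch (the file name is made up, and the exact group/field names pyarrow prints vary by version):

import pyarrow as pa
import pyarrow.parquet as pq

table = pa.table({
    'some_other_value': pa.array([1], type=pa.int32()),
    'list_value': pa.array([[1, 2, 3]], type=pa.list_(pa.int32())),
})
pq.write_table(table, 'lists.parquet')

# Prints the physical Parquet schema; pyarrow emits the 3-level form, with
# a repeated inner group (the "box" above) wrapping the list items.
print(pq.ParquetFile('lists.parquet').schema)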


Micah Kornfield / @emkornfield:
Since there has been some interest on the old PR, I'll give a quick status update.

I'm about 50% done with the write path; I hope to have it finished by the end of this week or next. I'll then start on the read path. I will likely try to leverage some code from the old PR or #4066, but I'll have a better idea once I take a close look.


Brendan Hogan:
Hello, I'm interested in the latest status here and when this may be available for use. I know there is work in progress by [~[email protected]] and maybe others, but as it's been four weeks since the last update I thought it was worth a check. Thanks in advance for any info (and for the work on this)!


Wes McKinney / @wesm:
I'm not aware of any updates; there are no patches available yet


Micah Kornfield / @emkornfield:
[~bhogan-mitre] there isn't much of an update. I put this off a little because @wesm was doing some major refactoring. If you want to contribute, we can probably divide the work for read and write. (@wesm, are you planning anything else major with the Arrow Parquet code?)


Wes McKinney / @wesm:
Nope, the large projects I had planned are done. The only further work I'd be interested in would be expanding the encoders / low-level column reader/writer classes to handle more dictionary-encoded types. None of that should affect the nested data disassembly / reassembly logic. One of my goals with these recent refactors was actually to move the "flat" serialization/deserialization code "out of the way" (since the prior effort on this caused performance regressions on flat data)


Brian Phillips:
My main use case for (py)arrow is converting very nested protobuf data to Parquet for storage. Currently I'm forced to store it as JSON instead because there is no nested data support. I would love to see this implemented, but unfortunately I can't be much help as I don't know C++.


Wes McKinney / @wesm:
Note that contributing to other parts of the project helps free up developers to work on larger projects like this.


William Young:
Are there plans to merge this code? I have a use-case.


Micah Kornfield / @emkornfield:
The code isn't really usable anymore, since it is based on the old repo and a lot of changes have been made since (and it had a performance regression). I haven't had time to work on this, but I still hope to get some bandwidth in the next month or so. If there are motivated parties, though, I'm happy to remove my name from the assignment.


Rinke Hoekstra:
I was just trying this with the example found in the pyarrow docs at http://arrow.apache.org/docs/python/json.html

The documented example does not work. Is this related to this issue, or is it another matter?

It says to load the following JSON file:

{"a": [1, 2], "b": {"c": true, "d": "1991-02-03"}}
{"a": [3, 4, 5], "b": {"c": false, "d": "2019-04-01"}}

I fixed this to make it valid (but that's another issue):

[{"a": [1, 2], "b": {"c": true, "d": "1991-02-03",}}
{"a": [3, 4, 5], "b": {"c": false, "d": "2019-04-01"]}}

Then reading the JSON from a file called my_data.json:

from pyarrow import json
table = json.read_json("my_data.json")

Gives the following error:

---------------------------------------------------------------------------
ArrowInvalid                              Traceback (most recent call last)
<ipython-input-69-f974c21f0941> in <module>()
      1 from pyarrow import json
----> 2 table = json.read_json('test.json')

~/.local/share/virtualenvs/parquet-ifRxINoC/lib/python3.7/site-packages/pyarrow/_json.pyx in pyarrow._json.read_json()

~/.local/share/virtualenvs/parquet-ifRxINoC/lib/python3.7/site-packages/pyarrow/error.pxi in pyarrow.lib.check_status()

ArrowInvalid: JSON parse error: A column changed from object to array


Joris Van den Bossche / @jorisvandenbossche:
[~RinkeHoekstra] that looks unrelated (the json reader is mostly independent from the parquet IO). Can you open a separate JIRA ticket?


David Lee / @davlee1972:
The format is valid: http://jsonlines.org. Line-delimited JSON is a better format for data since you can leverage threads to speed up read operations.

You also added the comma and bracket incorrectly, which turned valid JSONL into invalid JSON. They should be outside the curly braces:

[{"a": [1, 2], "b": {"c": true, "d": "1991-02-03"}},
{"a": [3, 4, 5], "b": {"c": false, "d": "2019-04-01"}} ]


Rinke Hoekstra:
[~[email protected]] good point about the multi-threaded loading, but (at the risk of being pedantic) it is valid JSON Lines, but not valid JSON: most if not all JSON parsers will raise an exception at the missing enclosing brackets and missing comma.

In any case, the issue is now raised at: https://issues.apache.org/jira/browse/ARROW-7226


Zack Gancarz:
Hi Wes, any progress on this one? It seems to be a common need, as a lot of people want to save nested protobufs to .parquet.

Thank you kindly


Micah Kornfield / @emkornfield:
For anyone interested in reading mixed-level data: if you can provide sample Parquet files (probably no more than 5-10 MB of data) to run microbenchmarks against, it would help ensure we are writing code with the right trade-offs.


Dmitry Kalinkin / @veprbl:
Here: https://transfer.sh/w4IQ0/test_nested.parquet
This was written with rust/parquet and tested to be readable with parquet-tools.


Eric Czech:
This may be another useful example: https://storage.googleapis.com/open-targets-data-releases/20.04/input/evidence-files/progeny-2018-07-23.json.gz

It's a 620K (uncompressed) set of json records with gene pathways that regulate various types of cancer.  It has a good mix of structs within structs, arrays with structs, arrays of structs that are themselves in other structs, etc.  


Wes McKinney / @wesm:
Thanks. We should follow up on the mailing list discussion and see what is the latest game plan for implementing the Parquet nested reader. Some of my colleagues should be able to help


Luke Higgins:
Has there been any more discussion of this on the mailing list? This feature would let me move a lot more to pyarrow and Parquet that is sadly JSON today. Thanks!


Micah Kornfield / @emkornfield:
We've discussed it, and there has been some work done, but there is still more to do. It's been a slow process due to personal bandwidth constraints.


Wes McKinney / @wesm:
I know that it's going to be a priority for my colleagues and me between now and the next release (2.0.0) to help get this done. Some people are on vacation right now, but we should have more bandwidth to assist with writing test cases and benchmarks (and helping with the implementation) in August.


Hui Gao:
Any update? We are really blocked by this. The write path works fine, but we can't read any struct/map type from the file. Thanks.


Micah Kornfield / @emkornfield:
[~fulluey] I'm glad the write path works fine; there are no further updates. Please follow the pull requests for the child issues here. I hope you are confirming the ability to read back your data in another system: a bug was recently discovered if you have nullable structs as direct ancestors of leaf columns (ARROW-9598).
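
For concreteness, a sketch (with made-up field names) of the shape ARROW-9598 concerns, i.e. a nullable struct that is the direct ancestor of a leaf column:

import pyarrow as pa

# 's' is an optional (nullable) struct directly wrapping the leaf column
# 'leaf'; ARROW-9598 affects round-tripping shapes like this.
schema = pa.schema([
    pa.field('s', pa.struct([pa.field('leaf', pa.int32())]), nullable=True)
])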


shadowdsp:
[~fulluey] Hello, I'd like to know how you solved this problem. Thanks.

I found that Spark uses parquet-mr, and I tested it successfully there. But it is Java-based; to use it from C++ I would have to call it through JNI, which is not efficient enough.


Martin Durant / @martindurant:
Just to chime in here, nice and late, to say that this feature would be immense when implemented. Given the similarity of the parquet storage layout and the arrow model, it ought not to be too hard - but C++ is never easy. As a python-first programmer, is there any way I can help?


(note that I have great hopes for awkward-arrays https://github.com/scikit-hep/awkward-1.0 to be a perfect pure-python way to iterate/filter/aggregate over deeply nested data loaded from parquet at C speeds)


Micah Kornfield / @emkornfield:
@martindurant https://issues.apache.org/jira/browse/ARROW-8492 could potentially be done at the Python layer. If you are interested, maybe chime in there and we can discuss whether there is a hard requirement for C++ (the nice thing about C++ tests is not having to deal with dual-language development).


Hui Gao:
[~shadowdsp], we still don't have a solution to the problem. Right now we write to disk and store the data in Hive, which works, but we can't read from a local file. I checked the C++ code base; it is really complex, and it does not look like an easy update for outside developers.

Parquet read/write from Spark works, but it is super slow compared with Arrow. We also hit memory consumption issues and frequent OOM errors. I would NOT recommend it unless you are already using Java.


Wes McKinney / @wesm:
Can this issue be resolved for 2.0.0, or perhaps it should be renamed to be "Parquet nested parent JIRA"?


Micah Kornfield / @emkornfield:
Yes; I pushed all remaining unresolved subtasks out to their own issues.


jiang,longshan:
[~fulluey], is there any solution for reading a MapArray Parquet file in C++? This issue is closed, but I still don't know the method, and it blocks our project, in which we want to use C++ in place of a Java application. Is there any advice? Thanks a lot.
