This repository has been archived by the owner on May 10, 2024. It is now read-only.

ARROW-1644: [C++] Initial cut of implementing deserialization of arbitrary nested groups from Parquet to Arrow #462

Conversation

@joshuastorck (Contributor) commented May 10, 2018

I've implemented deserialization of Parquet into Arrow, supporting arbitrary nested groups. This is basically a re-write of parquet/arrow/reader.cc, with most of the code moved into a separate file named deserializer.cc.

All of the unit tests in arrow-read-write-test pass on my local build (in conjunction with these PRs: apache/arrow#2034, apache/arrow#2036), as does an additional test that reads a nested Parquet file based on the example message from the Dremel paper.
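
As a minimal sketch of the kind of nesting involved, here is a simplified, hypothetical version of the Dremel paper's Document record, built and round-tripped with today's pyarrow API (which gained nested-read support long after this PR; Table.from_pylist needs pyarrow >= 7.0):

import pyarrow as pa
import pyarrow.parquet as pq

# A Dremel-style "Document" schema: a repeated group ("Name") that
# itself contains a repeated group ("Language").
schema = pa.schema([
    ("DocId", pa.int64()),
    ("Name", pa.list_(pa.struct([
        ("Language", pa.list_(pa.struct([
            ("Code", pa.string()),
            ("Country", pa.string()),
        ]))),
        ("Url", pa.string()),
    ]))),
])

# Simplified values modeled on record r1 from the paper.
table = pa.Table.from_pylist([
    {"DocId": 10,
     "Name": [{"Language": [{"Code": "en-us", "Country": "us"},
                            {"Code": "en", "Country": None}],
               "Url": "http://A"},
              {"Language": [], "Url": "http://B"}]},
], schema=schema)

pq.write_table(table, "dremel.parquet")
print(pq.read_table("dremel.parquet"))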

Joshua Storck added 2 commits May 10, 2018 15:43
…ger types for definition/repetition levels, and strong type for primitive serializer type
@joshuastorck joshuastorck changed the title [WIP]: Initial cut of implementing deserialization of arbitrary nested groups from Parquet to Arrow Initial cut of implementing deserialization of arbitrary nested groups from Parquet to Arrow May 11, 2018
@wesm (Member) commented May 12, 2018

Thanks @joshuastorck -- can you create a JIRA for this? It will take me some time to review, so bear with me

@joshuastorck joshuastorck changed the title Initial cut of implementing deserialization of arbitrary nested groups from Parquet to Arrow ARROW-1644: [C++] Initial cut of implementing deserialization of arbitrary nested groups from Parquet to Arrow May 12, 2018
@wesm (Member) commented Jun 13, 2018

I have been on the road a lot lately, but I hope to spend some time reviewing this in the next 10 days. I'm sorry for the hold-up.

@snir commented Jul 3, 2018

We are very interested in this PR, and we've been testing it with our tools and data, with good results :)
Functionality-wise it works great and we didn't encounter any issues.
However, we noticed an odd performance issue: it takes about 2x as long to load Parquet files, both simple and complex.
For example, the following code reproduces the slowdown (compared with pyarrow 0.9.0):

import pyarrow as pa
import pyarrow.parquet as pq
import time

# Write a single-column table of 10 million integers.
t = pa.Table.from_arrays([pa.array(range(0, 10000000))], ["col"])
with pq.ParquetWriter("./p.parquet", t.schema) as p:
    p.write_table(t)

# Time a 4-thread read back (nthreads was the pyarrow 0.9/0.10-era
# parameter; current releases use use_threads instead).
start = time.time()
with open("./p.parquet", "rb") as p:  # Parquet must be opened in binary mode
    pq.ParquetFile(p).read(nthreads=4)
print(time.time() - start)

We also noticed, when running under perf, that it incurs about 2x the page faults, so that might be related.

Thanks!
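
For reference, one way to observe the page-fault difference from inside the process (instead of via perf) is Python's Unix-only resource module; this is a hypothetical sketch, not the exact measurement described above:

import resource
import time
import pyarrow.parquet as pq

start = time.time()
before = resource.getrusage(resource.RUSAGE_SELF)
table = pq.read_table("./p.parquet")
after = resource.getrusage(resource.RUSAGE_SELF)

print("elapsed:", time.time() - start)
print("minor page faults:", after.ru_minflt - before.ru_minflt)
print("major page faults:", after.ru_majflt - before.ru_majflt)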

…eads. Refactored parts of TableDeserializer into a sub-class so that a new ParallelTableDeserializer could be created. Fixed a bug in FileArrayDeserializer that was using an unnecessary and uninitialized member variable.
@joshuastorck (Contributor, Author)

Since this was a complete re-write, I was focused on correctness rather than speed and forgot to add the multi-threaded reading back in. My latest commit supports multi-threaded reading. @snir, please let me know what the performance looks like with the latest code. Because the code has to handle the arbitrarily nested case, it can't translate chunks of data as easily as the previous implementation and relies on the builder classes; that may be the cause of the extra page faults. It would be possible to identify cases with only primitive columns and swap in a more optimized implementation, but there's already a lot of code in this PR to review, so I wanted to minimize the surface area.
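
A minimal sketch of checking the threading effect with pq.read_table (illustrative only; use_threads is the current parameter name, whereas the benchmark above used the older nthreads):

import time
import pyarrow.parquet as pq

def timed_read(path, use_threads):
    # Read the whole file once and return the wall-clock time.
    start = time.time()
    pq.read_table(path, use_threads=use_threads)
    return time.time() - start

print("single-threaded:", timed_read("./p.parquet", False))
print("multi-threaded: ", timed_read("./p.parquet", True))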

@snir commented Jul 9, 2018

The benchmark I used had only one column, so it didn't change. When running with multiple threads and multiple (flat) columns, the slowdown seems to converge to about 30%, versus about 100% with a single thread. The page-fault count stays at about 2x regardless.

@wesm (Member) commented Jul 16, 2018

I'm sorry I haven't been able to review yet; I've been buried with Arrow 0.10.0 stuff. That's going to last for another 1-2 weeks, but I would like to make getting this in a priority after that

wesm pushed a commit to apache/arrow that referenced this pull request Jul 24, 2018
…r's children from unique_ptr to shared_ptr so that it can support deserialization from Parquet to Arrow with arbitrary nesting

This allows the class that is responsible for deserializing to hold a shared_ptr to the concrete type without having to static_cast from the (List|Struct)Builder's getters.

This was needed for apache/parquet-cpp#462.

Author: Joshua Storck <[email protected]>

Closes #2034 from joshuastorck/shared_ptr_in_builders and squashes the following commits:

d5aac22 <Joshua Storck> Fixing format errors
d6c6945 <Joshua Storck> Changing the type of ListBuilder's and StructBuilder's children from unique_ptr to shared_ptr so that it can support deserialization from Parquet to Arrow with arbitrary nesting. This allows the class that is responsible for deserializing to hold a shared_ptr to the concrete type without having to static_cast from the (List|Struct)Builder's getters.
@yupbank commented Aug 16, 2018

great job !!!

@joshuastorck (Contributor, Author) commented Aug 16, 2018 via email

@wesm (Member) commented Sep 9, 2018

This PR needs to be moved to https://github.com/apache/arrow (and rebased) -- please let us know if you require assistance

@fj-sanchez

Any updates here? I think it is a very important feature...

@wesm (Member) commented Dec 20, 2018

No updates. If you have funding available to support this work, please get in touch with me offline. Short of that, I would say the timeline for getting this work done is sometime in 2019.

@emkornfield

@wesm @joshuastorck what is blocking this? I would be happy to try to help moving this forward if it would be helpful.

@wesm (Member) commented Jan 17, 2019

Well, it needs to be rebased on the merged codebase. It also has performance regressions; it might have overreached in some of its refactoring. I haven't looked closely yet, but I planned to invest some time in it this quarter. I estimate there's a solid ~50 hours of work involved in getting both read and write of nested data working, with good performance and thorough unit testing.

@emkornfield

OK, I think this potentially breaks down into at least some things I could manage (if these make sense, I will add subtasks to the JIRA and hopefully get to a few of them):

  1. Performance test so we can measure regressions (see the sketch after this list).
  2. Set up code in the mono-repo to allow for different implementations of column reading, so this code path can be enabled experimentally to verify performance.
  3. Incorporate/rebase this PR into the new repo using 2.

Does this sound reasonable, at least for the read end?
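
As a hypothetical illustration of item 1, a regression benchmark needs little more than repeated timed reads and a robust statistic; a minimal sketch in Python (file name and repeat count are arbitrary):

import statistics
import time
import pyarrow.parquet as pq

def bench(path, repeat=10):
    # Read the file `repeat` times and report the median wall time, so a
    # regression between two builds stands out above run-to-run noise.
    samples = []
    for _ in range(repeat):
        start = time.perf_counter()
        pq.read_table(path)
        samples.append(time.perf_counter() - start)
    return statistics.median(samples)

print("median read time: %.3fs" % bench("./p.parquet"))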

@wesm (Member) commented Jan 17, 2019

Yes, that sounds reasonable. I'm indisposed this week with a conference and the forthcoming 0.12 release but I would like to find some time to hack on this later in the month

@emkornfield commented Jan 17, 2019 via email

@wesm (Member) commented Jan 20, 2019

I am thinking I might put 2-3 full days into this before the end of the month -- I wanted to make sure I won't be stepping on your toes. Let me know

@emkornfield

@wesm
I haven't started anything of substance yet. I can commit to getting the benchmark subtask I created in JIRA done by mid-week, if that still works with your timeline. Depending on when this month you have time, I might be able to help with the rebase before you get started, but it sounds like maybe you should just own that piece?

@joshuastorck (Contributor, Author) commented Jan 20, 2019 via email

@wesm (Member) commented Jan 21, 2019

I was thinking I would start by implementing the write path for nested data and use that to drive the testing process. I don't want to step on anyone's toes, but I'd like to have this completely done, nail in the coffin, by the end of Q1 this year. I started working on Parquet in C++ in January 2016, and I've been feeling more and more embarrassed that this part of the project still is not done =/

We now have tools (that @pitrou wrote) for converting JSON to Arrow (including arbitrarily nested types), which should make the test cases much easier to write. We might have to tweak the JSON converter a little bit so it can handle non-nullable fields (since we have to exercise the non-nullable paths here).
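
For example, pyarrow's own JSON reader (a Python-level cousin of the C++ tooling being referenced, used here purely as an illustration with made-up data) infers nested Arrow types directly from line-delimited JSON:

import io
import pyarrow.json as pj

# Line-delimited JSON with a nested list-of-struct column; the reader
# infers the corresponding nested Arrow types automatically.
data = b'''{"DocId": 10, "Name": [{"Url": "http://A", "Code": "en-us"}]}
{"DocId": 20, "Name": []}'''

table = pj.read_json(io.BytesIO(data))
print(table.schema)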

@emkornfield commented Jul 20, 2019 via email

@itamarst commented Jul 22, 2019

This branch is of interest to G-Research, who I'm helping with some open source work, so I tried rebasing it against the Arrow version. Since I am new to this codebase and there were quite a few conflicts, I don't really trust the resulting code...

If you feel it would be helpful I can at least finish fixing the compilation bugs, but I suspect a better approach than rebasing might be copy/pasting based on knowledge of the existing code base. Or maybe a rebase by someone who understands the code would be fine.

@wesm (Member) commented Jul 22, 2019

I think @emkornfield is looking at this but likely starting from scratch. I'm closing this -- let's discuss on JIRA or the Arrow mailing list

@wesm closed this Jul 22, 2019
@davecap commented Jul 22, 2019

Can you point to the JIRA ticket for this?

@wesm (Member) commented Jul 22, 2019

It's ARROW-1644
