ARROW-1644: [C++] Initial cut of implementing deserialization of arbitrary nested groups from Parquet to Arrow #462
Conversation
…s from Parquet to Arrow
…ger types for definition/repetition levels, and strong type for primitive serializer type
Thanks @joshuastorck -- can you create a JIRA for this? It will take me some time to review, so bear with me |
I have been on the road a lot lately but I hope to spend some time reviewing this in the next 10 days. I'm sorry for the hold up |
We are very interested in this PR - and we've been testing it with our tools+data, with good results :)

import pyarrow as pa
import pyarrow.parquet as pq
import os, time

t = pa.Table.from_arrays([pa.array(range(0, 10000000))], ["col"])
with pq.ParquetWriter("./p.parquet", t.schema) as p:
    p.write_table(t)

start = time.time()
with open("./p.parquet", "rb") as p:
    pq.ParquetFile(p).read(nthreads=4)
print(time.time() - start)

We also noticed when running with … Thanks! |
…eads. Refactored parts of TableDeserializer into a sub-class so that a new ParallelTableDeserializer could be created. Fixed a bug in FileArrayDeserializer that was using an unnecessary and uninitialized member variable.
Since this was a complete re-write, I was focused on accuracy and not speed and forgot to add in the multi-threaded reading. My latest commit supports multi-threaded reading. @snir, please let me know what the performance looks like with the latest code. Since the code has to handle the arbitrary nested case, the code can't translate chunks of data as easily as the previous implementation and relies on the builder classes. That may be the cause of more page faults. It would be possible to identify cases where there are only primitive columns and swap out for a more optimized implementation. However, there's already a lot of code in this PR to be reviewed, so I wanted to minimize the surface area. |
The benchmark I used had only 1 column, so it didn't change. Now, when running with multiple threads and (flat) columns, the slowdown seems to converge to about 30% rather than 100% with 1 thread. The page fault count stays at about 2x regardless. |
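The per-column threading model being discussed can be sketched in plain Python. This is only an illustration of the idea, not the actual C++ deserializer; `decode_column` is a hypothetical stand-in for the per-column Parquet-to-Arrow conversion work:

```python
from concurrent.futures import ThreadPoolExecutor


def decode_column(name, encoded):
    # Hypothetical stand-in for decoding one column chunk.
    return name, [v * 2 for v in encoded]


def read_columns_parallel(columns, num_threads=4):
    # Column chunks are independent of each other, so they can be
    # decoded concurrently and the results reassembled afterwards.
    with ThreadPoolExecutor(max_workers=num_threads) as pool:
        futures = [pool.submit(decode_column, name, data)
                   for name, data in columns.items()]
        return dict(f.result() for f in futures)


table = read_columns_parallel({"a": [1, 2], "b": [3, 4]})
```
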
I'm sorry I haven't been able to review yet; I've been buried with Arrow 0.10.0 stuff. That's going to last for another 1-2 weeks, but I would like to make getting this in a priority after that |
…r's children from unique_ptr to shared_ptr so that it can support deserialization from Parquet to Arrow with arbitrary nesting

This allows the class that is responsible for deserializing to hold a shared_ptr to the concrete type without having to static_cast from the (List|Struct)Builder's getters. This was needed for apache/parquet-cpp#462.

Author: Joshua Storck <[email protected]>

Closes #2034 from joshuastorck/shared_ptr_in_builders and squashes the following commits:

d5aac22 <Joshua Storck> Fixing format errors
d6c6945 <Joshua Storck> Changing the type of ListBuilder's and StructBuilder's children from unique_ptr to shared_ptr so that it can support deserialization from Parquet to Arrow with arbitrary nesting. This allows the class that is responsible for deserializing to hold a shared_ptr to the concrete type without having to static_cast from the (List|Struct)Builder's getters.
great job !!! |
The last bit of work on this is to stop using the builder classes since
there is an unnecessary copy made in the cases where the size and types of
the arrow and parquet types match. I'm on vacation, and Wes and I agreed
that we'll get this in after the monorepo merge.
|
This PR needs to be moved to https://github.com/apache/arrow (and rebased) -- please let us know if you require assistance |
Any updates here? I think that it is a very important feature... |
No updates. If you have funding available to support this work, please get in touch with me offline. Short of that I would say the timeline for this work getting done is sometime in 2019 |
@wesm @joshuastorck what is blocking this? I would be happy to try to help moving this forward if it would be helpful. |
Well, it needs to be rebased on the merged codebase. It also has performance regressions; it might have overreached in some of its refactoring. I haven't looked closely yet but planned to invest some time in this this quarter. I estimate there's a solid ~50 hours of work involved in getting both read and write of nested data working with good performance and thorough unit testing. |
OK, I think this potentially breaks down into at least some things I could manage (if these make sense, I will add subtasks to the JIRA and hopefully get to at least a few of them):
|
Yes, that sounds reasonable. I'm indisposed this week with a conference and the forthcoming 0.12 release but I would like to find some time to hack on this later in the month |
Sounds good, I'll only assign things to myself that I can commit to
(probably just a performance test for the time being) and we can coordinate
from there.
|
I am thinking I might put 2-3 full days into this before the end of the month -- I wanted to make sure I won't be stepping on your toes. Let me know |
@wesm I haven't started anything of substance yet. I can commit to getting the benchmark sub-task I created in JIRA done by mid-week if that still works with your timeline. Depending on when this month you have time, I might be able to help with the rebase before you get started, but it sounds like maybe you should just own that piece? |
I think I can get to this now. I know how to fix the performance regression, which is related to the use of the builder classes. |
I was thinking I would start by implementing the write path for nested data and use that to drive the testing process. I don't want to step on anyone's toes, but I'd like to have this completely done and a nail in the coffin by the end of Q1 this year. I started working on Parquet in C++ in January 2016 and I've been feeling more and more embarrassed that this part of the project still is not done =/

We have tools (that @pitrou wrote) now for converting JSON to Arrow (including arbitrarily nested types), so that should make the test cases way easier to write now. We might have to tweak the JSON converter a little bit so it can handle non-nullable fields (since we have to exercise the non-nullable paths here). |
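For readers unfamiliar with the nesting machinery being discussed: Parquet encodes nested data using the Dremel paper's repetition and definition levels, and the nested reader's job is to turn those levels back into Arrow lists and structs. A deliberately simplified Python sketch for a single nullable list-of-ints column shows the shape of the encoding (real Parquet lists use the three-level LIST structure with higher maximum definition levels; this two-level variant is just for intuition):

```python
def encode_levels(rows):
    """Flatten a list-of-lists column into (value, def_level, rep_level)
    triples, Dremel-style, for one nullable repeated int field.

    Definition levels (simplified): 0 = null list, 1 = empty list,
    2 = a present value. Repetition levels: 0 = starts a new row,
    1 = continues the current row's list.
    """
    out = []
    for row in rows:
        if row is None:
            out.append((None, 0, 0))
        elif len(row) == 0:
            out.append((None, 1, 0))
        else:
            for i, v in enumerate(row):
                out.append((v, 2, 0 if i == 0 else 1))
    return out


levels = encode_levels([[1, 2], None, [], [3]])
```
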
I've been looking at it; probably about 1 month out, I would guesstimate, based on my current bandwidth.
…On Fri, Jul 19, 2019 at 1:26 PM Austin wrote:
> @wesm, is there a new timeline for this? |
This branch is of interest to G-Research, who I'm helping with some open source work, so I tried rebasing it against the Arrow repository. Since I am new to this codebase and there were quite a few conflicts, though, I don't really trust the resulting code... If you feel it'd be helpful I can at least finish fixing the compilation bugs, but I suspect a better approach than rebasing might be copying/pasting based on knowledge of the existing code base. Or maybe a rebase by someone who understands the code would be fine. |
I think @emkornfield is looking at this but likely starting from scratch. I'm closing this -- let's discuss on JIRA or the Arrow mailing list |
Can you point to the JIRA ticket for this? |
It's ARROW-1644 |
I've implemented deserialization of Parquet into Arrow, supporting arbitrary nested groups. This is basically a re-write of parquet/arrow/reader.cc, with most of the code moved into a separate file named deserializer.cc.
All of the unit tests passed on my local build for arrow-read-write-test (in conjunction with these PRs: apache/arrow#2034, apache/arrow#2036), as well as an additional one that reads a nested parquet file that is based on the message from the Dremel paper.