This repository has been archived by the owner on May 10, 2024. It is now read-only.

ARROW-1644: [C++] Initial cut of implementing deserialization of arbitrary nested groups from Parquet to Arrow #462

Conversation

@joshuastorck (Contributor) commented May 10, 2018

I've implemented deserialization of Parquet into Arrow, supporting arbitrary nested groups. This is basically a re-write of parquet/arrow/reader.cc, with most of the code moved into a separate file named deserializer.cc.

All of the unit tests in arrow-read-write-test pass on my local build (in conjunction with these PRs: apache/arrow#2034, apache/arrow#2036), as does an additional test that reads a nested Parquet file based on the example message from the Dremel paper.
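
As a minimal sketch of the kind of nesting involved, here is a simplified, hypothetical version of the Dremel paper's Document record, built and round-tripped with today's pyarrow API (which gained nested-read support long after this PR; Table.from_pylist needs pyarrow >= 7.0):

import pyarrow as pa
import pyarrow.parquet as pq

# A Dremel-style "Document" schema: a repeated group ("Name") that
# itself contains a repeated group ("Language").
schema = pa.schema([
    ("DocId", pa.int64()),
    ("Name", pa.list_(pa.struct([
        ("Language", pa.list_(pa.struct([
            ("Code", pa.string()),
            ("Country", pa.string()),
        ]))),
        ("Url", pa.string()),
    ]))),
])

# Simplified values modeled on record r1 from the paper.
table = pa.Table.from_pylist([
    {"DocId": 10,
     "Name": [{"Language": [{"Code": "en-us", "Country": "us"},
                            {"Code": "en", "Country": None}],
               "Url": "http://A"},
              {"Language": [], "Url": "http://B"}]},
], schema=schema)

pq.write_table(table, "dremel.parquet")
print(pq.read_table("dremel.parquet"))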

Joshua Storck added 2 commits May 10, 2018 15:43
…ger types for definition/repetition levels, and strong type for primitive serializer type
@joshuastorck joshuastorck changed the title [WIP]: Initial cut of implementing deserialization of arbitrary nested groups from Parquet to Arrow Initial cut of implementing deserialization of arbitrary nested groups from Parquet to Arrow May 11, 2018
@wesm (Member) commented May 12, 2018

Thanks @joshuastorck -- can you create a JIRA for this? It will take me some time to review, so bear with me

@joshuastorck joshuastorck changed the title Initial cut of implementing deserialization of arbitrary nested groups from Parquet to Arrow ARROW-1644: [C++] Initial cut of implementing deserialization of arbitrary nested groups from Parquet to Arrow May 12, 2018
@wesm (Member) commented Jun 13, 2018

I have been on the road a lot lately, but I hope to spend some time reviewing this in the next 10 days. I'm sorry for the hold-up.

@snir commented Jul 3, 2018

We are very interested in this PR, and we've been testing it with our tools and data, with good results :)
Functionality-wise it works great and we didn't encounter any issues.
However, we noticed an odd performance issue: it takes about 2x as long to load Parquet files, both simple and complex.
For example, the following code reproduces the slowdown (compared with pyarrow 0.9.0):

import pyarrow as pa
import pyarrow.parquet as pq
import time

# Write a single-column table of 10 million integers.
t = pa.Table.from_arrays([pa.array(range(0, 10000000))], ["col"])
with pq.ParquetWriter("./p.parquet", t.schema) as p:
    p.write_table(t)

# Time a 4-thread read back (nthreads was the pyarrow 0.9/0.10-era
# parameter; current releases use use_threads instead).
start = time.time()
with open("./p.parquet", "rb") as p:  # Parquet must be opened in binary mode
    pq.ParquetFile(p).read(nthreads=4)
print(time.time() - start)

We also noticed, when running under perf, that it incurs about 2x the page faults, so that might be related.

Thanks!
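
For reference, one way to observe the page-fault difference from inside the process (instead of via perf) is Python's Unix-only resource module; this is a hypothetical sketch, not the exact measurement described above:

import resource
import time
import pyarrow.parquet as pq

start = time.time()
before = resource.getrusage(resource.RUSAGE_SELF)
table = pq.read_table("./p.parquet")
after = resource.getrusage(resource.RUSAGE_SELF)

print("elapsed:", time.time() - start)
print("minor page faults:", after.ru_minflt - before.ru_minflt)
print("major page faults:", after.ru_majflt - before.ru_majflt)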

…eads. Refactored parts of TableDeserializer into a sub-class so that a new ParallelTableDeserializer could be created. Fixed a bug in FileArrayDeserializer that was using an unnecessary and uninitialized member variable.
@joshuastorck (Contributor, Author)

Since this was a complete re-write, I was focused on correctness rather than speed and forgot to add the multi-threaded reading back in. My latest commit supports multi-threaded reading. @snir, please let me know what the performance looks like with the latest code. Because the code has to handle the arbitrarily nested case, it can't translate chunks of data as easily as the previous implementation and relies on the builder classes; that may be the cause of the extra page faults. It would be possible to identify cases with only primitive columns and swap in a more optimized implementation, but there's already a lot of code in this PR to review, so I wanted to minimize the surface area.
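
A minimal sketch of checking the threading effect with pq.read_table (illustrative only; use_threads is the current parameter name, whereas the benchmark above used the older nthreads):

import time
import pyarrow.parquet as pq

def timed_read(path, use_threads):
    # Read the whole file once and return the wall-clock time.
    start = time.time()
    pq.read_table(path, use_threads=use_threads)
    return time.time() - start

print("single-threaded:", timed_read("./p.parquet", False))
print("multi-threaded: ", timed_read("./p.parquet", True))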

@snir commented Jul 9, 2018

The benchmark I used had only one column, so it didn't change. When running with multiple threads and multiple (flat) columns, the slowdown seems to converge to about 30%, versus about 100% with a single thread. The page-fault count stays at about 2x regardless.

@wesm (Member) commented Jul 16, 2018

I'm sorry I haven't been able to review yet; I've been buried with Arrow 0.10.0 stuff. That's going to last for another 1-2 weeks, but I would like to make getting this in a priority after that

wesm pushed a commit to apache/arrow that referenced this pull request Jul 24, 2018
…r's children from unique_ptr to shared_ptr so that it can support deserialization from Parquet to Arrow with arbitrary nesting

This allows the class that is responsible for deserializing to hold a shared_ptr to the concrete type without having to static_cast from the (List|Struct)Builder's getters.

This was needed for apache/parquet-cpp#462.

Author: Joshua Storck <[email protected]>

Closes #2034 from joshuastorck/shared_ptr_in_builders and squashes the following commits:

d5aac22 <Joshua Storck> Fixing format errors
d6c6945 <Joshua Storck> Changing the type of ListBuilder's and StructBuilder's children from unique_ptr to shared_ptr so that it can support deserialization from Parquet to Arrow with arbitrary nesting. This allows the class that is responsible for deserializing to hold a shared_ptr to the concrete type without having to static_cast from the (List|Struct)Builder's getters.
@yupbank commented Aug 16, 2018

great job !!!

@joshuastorck (Contributor, Author) commented Aug 16, 2018 via email

@wesm (Member) commented Sep 9, 2018

This PR needs to be moved to https://github.com/apache/arrow (and rebased) -- please let us know if you require assistance

@fj-sanchez

Any updates here? I think it is a very important feature...

@wesm (Member) commented Dec 20, 2018

No updates. If you have funding available to support this work, please get in touch with me offline. Short of that, I would say the timeline for getting this work done is sometime in 2019.

@emkornfield

@wesm @joshuastorck what is blocking this? I would be happy to try to help moving this forward if it would be helpful.

@wesm (Member) commented Jan 17, 2019

Well, it needs to be rebased on the merged codebase. It also has performance regressions; it might have overreached in some of its refactoring. I haven't looked closely yet, but I planned to invest some time in it this quarter. I estimate there's a solid ~50 hours of work involved in getting both read and write of nested data working, with good performance and thorough unit testing.

@emkornfield

OK, I think this potentially breaks down into at least some things I could manage (if these make sense, I will add subtasks to the JIRA and hopefully get to a few of them):

  1. Performance test so we can measure regressions (see the sketch after this list).
  2. Set up code in the mono-repo to allow for different implementations of column reading, so this code path can be enabled experimentally to verify performance.
  3. Incorporate/rebase this PR into the new repo using 2.

Does this sound reasonable, at least for the read end?
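
As a hypothetical illustration of item 1, a regression benchmark needs little more than repeated timed reads and a robust statistic; a minimal sketch in Python (file name and repeat count are arbitrary):

import statistics
import time
import pyarrow.parquet as pq

def bench(path, repeat=10):
    # Read the file `repeat` times and report the median wall time, so a
    # regression between two builds stands out above run-to-run noise.
    samples = []
    for _ in range(repeat):
        start = time.perf_counter()
        pq.read_table(path)
        samples.append(time.perf_counter() - start)
    return statistics.median(samples)

print("median read time: %.3fs" % bench("./p.parquet"))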

@wesm (Member) commented Jan 17, 2019

Yes, that sounds reasonable. I'm indisposed this week with a conference and the forthcoming 0.12 release but I would like to find some time to hack on this later in the month

@emkornfield commented Jan 17, 2019 via email

@wesm (Member) commented Jan 20, 2019

I am thinking I might put 2-3 full days into this before the end of the month -- I wanted to make sure I won't be stepping on your toes. Let me know

@emkornfield

@wesm
I haven't started anything of substance yet. I can commit to getting the benchmark subtask I created in JIRA done by mid-week, if that still works with your timeline. Depending on when this month you have time, I might be able to help with the rebase before you get started, but it sounds like maybe you should just own that piece?

@joshuastorck (Contributor, Author) commented Jan 20, 2019 via email

@wesm (Member) commented Jan 21, 2019

I was thinking I would start by implementing the write path for nested data and use that to drive the testing process. I don't want to step on anyone's toes, but I'd like to have this completely done, nail in the coffin, by the end of Q1 this year. I started working on Parquet in C++ in January 2016, and I've been feeling more and more embarrassed that this part of the project still is not done =/

We now have tools (that @pitrou wrote) for converting JSON to Arrow (including arbitrarily nested types), which should make the test cases much easier to write. We might have to tweak the JSON converter a little bit so it can handle non-nullable fields (since we have to exercise the non-nullable paths here).
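
For example, pyarrow's own JSON reader (a Python-level cousin of the C++ tooling being referenced, used here purely as an illustration with made-up data) infers nested Arrow types directly from line-delimited JSON:

import io
import pyarrow.json as pj

# Line-delimited JSON with a nested list-of-struct column; the reader
# infers the corresponding nested Arrow types automatically.
data = b'''{"DocId": 10, "Name": [{"Url": "http://A", "Code": "en-us"}]}
{"DocId": 20, "Name": []}'''

table = pj.read_json(io.BytesIO(data))
print(table.schema)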

@emkornfield commented Jul 20, 2019 via email

@itamarst commented Jul 22, 2019

This branch is of interest to G-Research, who I'm helping with some open source work, so I tried rebasing it against the Arrow version. Since I am new to this codebase and there were quite a few conflicts, I don't really trust the resulting code...

If you feel it would be helpful I can at least finish fixing the compilation bugs, but I suspect a better approach than rebasing might be copy/pasting based on knowledge of the existing code base. Or maybe a rebase by someone who understands the code would be fine.

@wesm (Member) commented Jul 22, 2019

I think @emkornfield is looking at this but likely starting from scratch. I'm closing this -- let's discuss on JIRA or the Arrow mailing list

@wesm closed this Jul 22, 2019
@davecap commented Jul 22, 2019

Can you point to the JIRA ticket for this?

@wesm (Member) commented Jul 22, 2019

It's ARROW-1644
