
Conversation


@andygrove andygrove commented Mar 31, 2020

This is a quick proof of concept representing an hour's work. I have a test that writes an Arrow batch to a Parquet file, and I can successfully read the Parquet file from Apache Spark.

```scala
scala> val df = spark.read.parquet("/home/andy/git/andygrove/arrow/rust/parquet/test.parquet")
df: org.apache.spark.sql.DataFrame = [a: int, b: int]

scala> df.show
+---+---+
|  a|  b|
+---+---+
|  1|  1|
|  2|  2|
|  3|  3|
|  4|  4|
|  5|  5|
+---+---+
```

Next steps:

  • Fix error handling (return `Result`s instead of unwrapping, etc.)
  • Add support for all primitive columns
  • Add unit tests
  • Add integration tests (separate PR most likely)
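The error-handling item above mostly amounts to replacing `unwrap()` calls with `Result` propagation via `?`. A minimal sketch of the pattern, using a hypothetical `write_batch` stand-in rather than the actual parquet crate API:

```rust
use std::fmt;

// Hypothetical error type, standing in for the crate's ParquetError.
#[derive(Debug)]
struct WriteError(String);

impl fmt::Display for WriteError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        write!(f, "write error: {}", self.0)
    }
}

// Stand-in for a column writer's write_batch; fails on an empty slice.
fn write_batch(values: &[i32]) -> Result<i64, WriteError> {
    if values.is_empty() {
        return Err(WriteError("empty batch".to_string()));
    }
    Ok(values.len() as i64)
}

// Instead of `write_batch(...).unwrap()`, propagate the error with `?`
// so the caller decides how to handle it.
fn write_column(values: &[i32]) -> Result<i64, WriteError> {
    let rows_written = write_batch(values)?;
    Ok(rows_written)
}

fn main() {
    assert_eq!(write_column(&[1, 2, 3]).unwrap(), 3);
    assert!(write_column(&[]).is_err());
    println!("ok");
}
```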


```rust
    .write_batch(array.value_slice(0, array.len()), None, None)
    .unwrap() as i64;
}
// TODO: add other types
```
@wesm wesm Mar 31, 2020


A couple of initial things to keep in mind:

  • Support writes of both nullable (OPTIONAL) and non-nullable (REQUIRED) fields.
  • You can optimize the special case where a nullable field's data contains no nulls.
  • A good amount of code is required to convert the Arrow physical form of various logical types to the Parquet equivalent; see https://github.com/apache/arrow/blob/master/cpp/src/parquet/column_writer.cc for details.
  • It is worth thinking up front about how dictionary-encoded data is handled on both the Arrow write and Arrow read paths. In parquet-cpp we initially discarded Arrow DictionaryArrays on write (casting e.g. Dictionary<String> to dense String), and through real-world need I was forced to revisit this (quite painfully) to let Arrow dictionaries survive round trips to Parquet format, and to achieve better performance and memory use in both reads and writes. You can certainly do a dictionary-to-dense conversion like we did, but you may someday find yourselves doing the same painful refactor I did to make dictionary write and read not only more efficient but also dictionary-order preserving.
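To make the first two bullets concrete: for a flat OPTIONAL column the definition level is 1 for a present value and 0 for a null, a REQUIRED column needs no levels at all, and when an OPTIONAL column happens to contain no nulls the levels degenerate to a constant 1 that can be emitted without inspecting the validity bitmap. A self-contained sketch of that rule in plain Rust (not the parquet crate's API):

```rust
// Definition levels for a flat column, represented here as Option<i32>.
// OPTIONAL column: level 1 = value present, level 0 = null.
fn definition_levels(values: &[Option<i32>]) -> Vec<i16> {
    let null_count = values.iter().filter(|v| v.is_none()).count();
    if null_count == 0 {
        // Fast path for a nullable column with no nulls:
        // a constant level, no per-value checks needed.
        return vec![1; values.len()];
    }
    values.iter().map(|v| if v.is_some() { 1 } else { 0 }).collect()
}

fn main() {
    assert_eq!(definition_levels(&[Some(1), None, Some(3)]), vec![1, 0, 1]);
    assert_eq!(definition_levels(&[Some(1), Some(2)]), vec![1, 1]);
    println!("ok");
}
```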

```rust
}
}

struct ArrowWriter {
```

Would it be more appropriate to put this in `src/arrow/writer`?


sunchao commented Apr 1, 2020

Thanks @andygrove. Do you intend to continue working on this, or do you intend to commit this as a first step? This looks more like an example, with lots of pieces still missing (as you already pointed out).

@andygrove

@sunchao I don't really have a plan; I just wanted to get the conversation started. Maybe it makes sense to get this to the point where it can accurately write Parquet files with support for i32 only, and then create JIRAs for all the other data types?

@paddyhoran

> @sunchao I don't really have a plan. I just wanted to get the conversation started. Maybe it makes sense to get this to the point where it can accurately write parquet files with support for i32 only and then we create JIRAs for all other data types?

Although it's not good to have "half-support", I agree with this. It's hard to get larger PRs reviewed and merged, and this would also make it easier for others to jump in. I don't know Parquet well enough to take this on as-is, but I might be able to help out if the tasks are broken down.

@sunchao would you be up for breaking the work down on JIRA into bite size pieces?


sunchao commented Apr 1, 2020

Yes, agreed. I'd be happy to take a stab at this, and also to help break the work down (naively, we could start with the basic primitive types). We can begin by looking at other implementations (e.g., C++) to get familiar with the scope.

@andygrove

Thanks @sunchao. That sounds great. Let me know how I (or other committers) can help.


sunchao commented Apr 8, 2020

Thanks @andygrove. I skimmed through the C++ implementation, and at a high level I think we need to do the following:

  1. Implement a method similar to WriteArrow in column_writer.cc. We can break this up into smaller pieces: dictionary/non-dictionary, primitive types, booleans, timestamps, dates, and so on.
  2. Implement an Arrow writer in the parquet crate. It should offer APIs similar to those in writer.h.

We should also pay close attention to performance in this part. I'll spend some more time on this, and afterwards I'll create JIRAs under the parent JIRA you created.
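For the second step, the C++ writer.h surface boils down to "open a writer against a schema, write record batches, close". A hypothetical shape for a Rust equivalent, with illustrative stand-in types (these names and signatures are not the eventual crate API):

```rust
// Illustrative stand-in for arrow::record_batch::RecordBatch.
struct RecordBatch {
    num_rows: usize,
}

// Hypothetical writer; loosely mirrors parquet::arrow::FileWriter in C++.
struct ArrowWriter {
    rows_written: usize,
    closed: bool,
}

impl ArrowWriter {
    fn try_new() -> Result<Self, String> {
        Ok(ArrowWriter { rows_written: 0, closed: false })
    }

    fn write(&mut self, batch: &RecordBatch) -> Result<(), String> {
        if self.closed {
            return Err("writer already closed".to_string());
        }
        self.rows_written += batch.num_rows;
        Ok(())
    }

    // Finalizes the file footer; further writes must fail.
    fn close(&mut self) -> Result<(), String> {
        self.closed = true;
        Ok(())
    }
}

fn main() {
    let mut writer = ArrowWriter::try_new().unwrap();
    writer.write(&RecordBatch { num_rows: 5 }).unwrap();
    writer.close().unwrap();
    assert!(writer.write(&RecordBatch { num_rows: 1 }).is_err());
    assert_eq!(writer.rows_written, 5);
    println!("ok");
}
```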

@andygrove

I went ahead and created a new parent JIRA with some subtasks: https://issues.apache.org/jira/browse/ARROW-8421


nevi-me commented May 19, 2020

I'll push the minor changes that I made here to use the schema conversion. I got distracted trying to figure out how to compute repetition levels to support nulls, but I can tackle that as part of a separate JIRA.

@andygrove

Closing this since it is replaced by #7319

@andygrove andygrove closed this Jun 4, 2020
nevi-me added a commit that referenced this pull request Aug 13, 2020
**Note**: I started making changes to #6785, and ended up deviating a lot, so I opted for making a new draft PR in case my approach is not suitable.
___

This is a draft implementation of an Arrow writer for Parquet. It supports the following (without complete test coverage yet):

* writing primitives except for booleans and binary
* nested structs
* null values (via definition levels)

It does not yet support:

- Boolean arrays (have to be handled differently from numeric values)
- Binary arrays
- Dictionary arrays
- Union arrays (are they even possible?)

So far I have only added a test that creates a nested schema, which I verified with pyarrow:

```jupyter
# schema of test_complex.parquet

a: int32 not null
b: int32
c: struct<d: double, e: struct<f: float>> not null
  child 0, d: double
  child 1, e: struct<f: float>
      child 0, f: float
```

This PR potentially addresses:

* https://issues.apache.org/jira/browse/ARROW-8289
* https://issues.apache.org/jira/browse/ARROW-8423
* https://issues.apache.org/jira/browse/ARROW-8424
* https://issues.apache.org/jira/browse/ARROW-8425

I would like to propose either opening new JIRAs for the incomplete items above, or renaming the last three JIRAs listed.

___

**Help Needed**

I implemented the definition and repetition levels from first principles, based on an old Parquet post on the Twitter engineering blog. It's likely that I'm not getting some concepts right, so I would appreciate help with:

* Checking if my logic is correct
* Guidance or suggestions on how to more efficiently extract levels from arrays
* Adding tests: I suspect we'll need a lot of them. So far we only test writing one batch, so I don't know how paging would behave when writing a large enough file

I also don't know whether the various encodings (dictionary, RLE, etc.) and compression are applied automatically, or whether that's something we need to enable explicitly.
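For the nested schema shown above, each leaf column's maximum definition level is the number of nullable fields on its path from the root: `a` (not null) has max level 0, `b` has 1, `c.d` has 1 (c is required, d is nullable), and `c.e.f` has 2. A first-principles sketch of that rule in plain Rust, independent of the crate (repeated fields would add repetition levels too, but this schema has none):

```rust
// Max definition level for a leaf = count of nullable ancestors (and the
// leaf itself) on its path; a REQUIRED field contributes nothing.
fn max_definition_level(path_nullability: &[bool]) -> i16 {
    path_nullability.iter().filter(|&&nullable| nullable).count() as i16
}

fn main() {
    assert_eq!(max_definition_level(&[false]), 0);             // a: not null
    assert_eq!(max_definition_level(&[true]), 1);              // b
    assert_eq!(max_definition_level(&[false, true]), 1);       // c.d
    assert_eq!(max_definition_level(&[false, true, true]), 2); // c.e.f
    println!("ok");
}
```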

CC @sunchao @sadikovi @andygrove @paddyhoran

Might be of interest to @mcassels @maxburke

Closes #7319 from nevi-me/arrow-parquet-writer

Lead-authored-by: Neville Dipale <[email protected]>
Co-authored-by: Max Burke <[email protected]>
Co-authored-by: Andy Grove <[email protected]>
Co-authored-by: Max Burke <[email protected]>
Signed-off-by: Neville Dipale <[email protected]>
nevi-me added a commit that referenced this pull request Aug 18, 2020
**Note**: I started making changes to #6785, and ended up deviating a lot, so I opted for making a new draft PR in case my approach is not suitable.
___

This is a draft to implement an arrow writer for parquet. It supports the following (no complete test coverage yet):

* writing primitives except for booleans and binary
* nested structs
* null values (via definition levels)

It does not yet support:

- Boolean arrays (have to be handled differently from numeric values)
- Binary arrays
- Dictionary arrays
- Union arrays (are they even possible?)

I have only added a test by creating a nested schema, which I tested on pyarrow.

```jupyter
# schema of test_complex.parquet

a: int32 not null
b: int32
c: struct<d: double, e: struct<f: float>> not null
  child 0, d: double
  child 1, e: struct<f: float>
      child 0, f: float
```

This PR potentially addresses:

* https://issues.apache.org/jira/browse/ARROW-8289
* https://issues.apache.org/jira/browse/ARROW-8423
* https://issues.apache.org/jira/browse/ARROW-8424
* https://issues.apache.org/jira/browse/ARROW-8425

And I would like to propose either opening new JIRAs for the above incomplete items, or renaming the last 3 above.

___

**Help Needed**

I'm implementing the definition and repetition levels on first principle from an old Parquet blog post from the Twitter engineering blog. It's likely that I'm not getting some concepts correct, so I would appreciate help with:

* Checking if my logic is correct
* Guidance or suggestions on how to more efficiently extract levels from arrays
* Adding tests - I suspect we might need a lot of tests, so far we only test writing 1 batch, so I don't know how paging would work when writing a large enough file

I also don't know if the various encoding levels (dictionary, RLE, etc.) and compression levels are applied automagically, or if that'd be something we need to explicitly enable.

CC @sunchao @sadikovi @andygrove @paddyhoran

Might be of interest to @mcassels @maxburke

Closes #7319 from nevi-me/arrow-parquet-writer

Lead-authored-by: Neville Dipale <[email protected]>
Co-authored-by: Max Burke <[email protected]>
Co-authored-by: Andy Grove <[email protected]>
Co-authored-by: Max Burke <[email protected]>
Signed-off-by: Neville Dipale <[email protected]>
nevi-me added a commit that referenced this pull request Aug 21, 2020
**Note**: I started making changes to #6785, and ended up deviating a lot, so I opted for making a new draft PR in case my approach is not suitable.
___

This is a draft to implement an arrow writer for parquet. It supports the following (no complete test coverage yet):

* writing primitives except for booleans and binary
* nested structs
* null values (via definition levels)

It does not yet support:

- Boolean arrays (have to be handled differently from numeric values)
- Binary arrays
- Dictionary arrays
- Union arrays (are they even possible?)

I have only added a test by creating a nested schema, which I tested on pyarrow.

```jupyter
# schema of test_complex.parquet

a: int32 not null
b: int32
c: struct<d: double, e: struct<f: float>> not null
  child 0, d: double
  child 1, e: struct<f: float>
      child 0, f: float
```

This PR potentially addresses:

* https://issues.apache.org/jira/browse/ARROW-8289
* https://issues.apache.org/jira/browse/ARROW-8423
* https://issues.apache.org/jira/browse/ARROW-8424
* https://issues.apache.org/jira/browse/ARROW-8425

And I would like to propose either opening new JIRAs for the above incomplete items, or renaming the last 3 above.

___

**Help Needed**

I'm implementing the definition and repetition levels on first principle from an old Parquet blog post from the Twitter engineering blog. It's likely that I'm not getting some concepts correct, so I would appreciate help with:

* Checking if my logic is correct
* Guidance or suggestions on how to more efficiently extract levels from arrays
* Adding tests - I suspect we might need a lot of tests, so far we only test writing 1 batch, so I don't know how paging would work when writing a large enough file

I also don't know if the various encoding levels (dictionary, RLE, etc.) and compression levels are applied automagically, or if that'd be something we need to explicitly enable.

CC @sunchao @sadikovi @andygrove @paddyhoran

Might be of interest to @mcassels @maxburke

Closes #7319 from nevi-me/arrow-parquet-writer

Lead-authored-by: Neville Dipale <[email protected]>
Co-authored-by: Max Burke <[email protected]>
Co-authored-by: Andy Grove <[email protected]>
Co-authored-by: Max Burke <[email protected]>
Signed-off-by: Neville Dipale <[email protected]>
nevi-me added a commit that referenced this pull request Aug 25, 2020
**Note**: I started making changes to #6785, and ended up deviating a lot, so I opted for making a new draft PR in case my approach is not suitable.
___

This is a draft to implement an arrow writer for parquet. It supports the following (no complete test coverage yet):

* writing primitives except for booleans and binary
* nested structs
* null values (via definition levels)

It does not yet support:

- Boolean arrays (have to be handled differently from numeric values)
- Binary arrays
- Dictionary arrays
- Union arrays (are they even possible?)

I have only added a test by creating a nested schema, which I tested on pyarrow.

```jupyter
# schema of test_complex.parquet

a: int32 not null
b: int32
c: struct<d: double, e: struct<f: float>> not null
  child 0, d: double
  child 1, e: struct<f: float>
      child 0, f: float
```

This PR potentially addresses:

* https://issues.apache.org/jira/browse/ARROW-8289
* https://issues.apache.org/jira/browse/ARROW-8423
* https://issues.apache.org/jira/browse/ARROW-8424
* https://issues.apache.org/jira/browse/ARROW-8425

And I would like to propose either opening new JIRAs for the above incomplete items, or renaming the last 3 above.

___

**Help Needed**

I'm implementing the definition and repetition levels on first principle from an old Parquet blog post from the Twitter engineering blog. It's likely that I'm not getting some concepts correct, so I would appreciate help with:

* Checking if my logic is correct
* Guidance or suggestions on how to more efficiently extract levels from arrays
* Adding tests - I suspect we might need a lot of tests, so far we only test writing 1 batch, so I don't know how paging would work when writing a large enough file

I also don't know if the various encoding levels (dictionary, RLE, etc.) and compression levels are applied automagically, or if that'd be something we need to explicitly enable.

CC @sunchao @sadikovi @andygrove @paddyhoran

Might be of interest to @mcassels @maxburke

Closes #7319 from nevi-me/arrow-parquet-writer

Lead-authored-by: Neville Dipale <[email protected]>
Co-authored-by: Max Burke <[email protected]>
Co-authored-by: Andy Grove <[email protected]>
Co-authored-by: Max Burke <[email protected]>
Signed-off-by: Neville Dipale <[email protected]>
nevi-me added a commit that referenced this pull request Sep 13, 2020
**Note**: I started making changes to #6785, and ended up deviating a lot, so I opted for making a new draft PR in case my approach is not suitable.
___

This is a draft to implement an arrow writer for parquet. It supports the following (no complete test coverage yet):

* writing primitives except for booleans and binary
* nested structs
* null values (via definition levels)

It does not yet support:

- Boolean arrays (have to be handled differently from numeric values)
- Binary arrays
- Dictionary arrays
- Union arrays (are they even possible?)

I have only added a test by creating a nested schema, which I tested on pyarrow.

```jupyter
# schema of test_complex.parquet

a: int32 not null
b: int32
c: struct<d: double, e: struct<f: float>> not null
  child 0, d: double
  child 1, e: struct<f: float>
      child 0, f: float
```

This PR potentially addresses:

* https://issues.apache.org/jira/browse/ARROW-8289
* https://issues.apache.org/jira/browse/ARROW-8423
* https://issues.apache.org/jira/browse/ARROW-8424
* https://issues.apache.org/jira/browse/ARROW-8425

And I would like to propose either opening new JIRAs for the above incomplete items, or renaming the last 3 above.

___

**Help Needed**

I'm implementing the definition and repetition levels on first principle from an old Parquet blog post from the Twitter engineering blog. It's likely that I'm not getting some concepts correct, so I would appreciate help with:

* Checking if my logic is correct
* Guidance or suggestions on how to more efficiently extract levels from arrays
* Adding tests - I suspect we might need a lot of tests, so far we only test writing 1 batch, so I don't know how paging would work when writing a large enough file

I also don't know if the various encoding levels (dictionary, RLE, etc.) and compression levels are applied automagically, or if that'd be something we need to explicitly enable.

CC @sunchao @sadikovi @andygrove @paddyhoran

Might be of interest to @mcassels @maxburke

Closes #7319 from nevi-me/arrow-parquet-writer

Lead-authored-by: Neville Dipale <[email protected]>
Co-authored-by: Max Burke <[email protected]>
Co-authored-by: Andy Grove <[email protected]>
Co-authored-by: Max Burke <[email protected]>
Signed-off-by: Neville Dipale <[email protected]>
nevi-me added a commit that referenced this pull request Sep 16, 2020
**Note**: I started making changes to #6785, and ended up deviating a lot, so I opted for making a new draft PR in case my approach is not suitable.
___

This is a draft to implement an arrow writer for parquet. It supports the following (no complete test coverage yet):

* writing primitives except for booleans and binary
* nested structs
* null values (via definition levels)

It does not yet support:

- Boolean arrays (have to be handled differently from numeric values)
- Binary arrays
- Dictionary arrays
- Union arrays (are they even possible?)

I have only added a test by creating a nested schema, which I tested on pyarrow.

```jupyter
# schema of test_complex.parquet

a: int32 not null
b: int32
c: struct<d: double, e: struct<f: float>> not null
  child 0, d: double
  child 1, e: struct<f: float>
      child 0, f: float
```

This PR potentially addresses:

* https://issues.apache.org/jira/browse/ARROW-8289
* https://issues.apache.org/jira/browse/ARROW-8423
* https://issues.apache.org/jira/browse/ARROW-8424
* https://issues.apache.org/jira/browse/ARROW-8425

And I would like to propose either opening new JIRAs for the above incomplete items, or renaming the last 3 above.

___

**Help Needed**

I'm implementing the definition and repetition levels on first principle from an old Parquet blog post from the Twitter engineering blog. It's likely that I'm not getting some concepts correct, so I would appreciate help with:

* Checking if my logic is correct
* Guidance or suggestions on how to more efficiently extract levels from arrays
* Adding tests - I suspect we might need a lot of tests, so far we only test writing 1 batch, so I don't know how paging would work when writing a large enough file

I also don't know if the various encoding levels (dictionary, RLE, etc.) and compression levels are applied automagically, or if that'd be something we need to explicitly enable.

CC @sunchao @sadikovi @andygrove @paddyhoran

Might be of interest to @mcassels @maxburke

Closes #7319 from nevi-me/arrow-parquet-writer

Lead-authored-by: Neville Dipale <[email protected]>
Co-authored-by: Max Burke <[email protected]>
Co-authored-by: Andy Grove <[email protected]>
Co-authored-by: Max Burke <[email protected]>
Signed-off-by: Neville Dipale <[email protected]>
nevi-me added a commit that referenced this pull request Sep 25, 2020
**Note**: I started making changes to #6785, and ended up deviating a lot, so I opted for making a new draft PR in case my approach is not suitable.
___

This is a draft to implement an arrow writer for parquet. It supports the following (no complete test coverage yet):

* writing primitives except for booleans and binary
* nested structs
* null values (via definition levels)

It does not yet support:

- Boolean arrays (have to be handled differently from numeric values)
- Binary arrays
- Dictionary arrays
- Union arrays (are they even possible?)

I have only added a test by creating a nested schema, which I tested on pyarrow.

```jupyter
# schema of test_complex.parquet

a: int32 not null
b: int32
c: struct<d: double, e: struct<f: float>> not null
  child 0, d: double
  child 1, e: struct<f: float>
      child 0, f: float
```

This PR potentially addresses:

* https://issues.apache.org/jira/browse/ARROW-8289
* https://issues.apache.org/jira/browse/ARROW-8423
* https://issues.apache.org/jira/browse/ARROW-8424
* https://issues.apache.org/jira/browse/ARROW-8425

And I would like to propose either opening new JIRAs for the above incomplete items, or renaming the last 3 above.

___

**Help Needed**

I'm implementing the definition and repetition levels on first principle from an old Parquet blog post from the Twitter engineering blog. It's likely that I'm not getting some concepts correct, so I would appreciate help with:

* Checking if my logic is correct
* Guidance or suggestions on how to more efficiently extract levels from arrays
* Adding tests - I suspect we might need a lot of tests, so far we only test writing 1 batch, so I don't know how paging would work when writing a large enough file

I also don't know if the various encoding levels (dictionary, RLE, etc.) and compression levels are applied automagically, or if that'd be something we need to explicitly enable.

CC @sunchao @sadikovi @andygrove @paddyhoran

Might be of interest to @mcassels @maxburke

Closes #7319 from nevi-me/arrow-parquet-writer

Lead-authored-by: Neville Dipale <[email protected]>
Co-authored-by: Max Burke <[email protected]>
Co-authored-by: Andy Grove <[email protected]>
Co-authored-by: Max Burke <[email protected]>
Signed-off-by: Neville Dipale <[email protected]>
nevi-me added a commit that referenced this pull request Oct 3, 2020
**Note**: I started making changes to #6785, and ended up deviating a lot, so I opted for making a new draft PR in case my approach is not suitable.
___

This is a draft to implement an arrow writer for parquet. It supports the following (no complete test coverage yet):

* writing primitives except for booleans and binary
* nested structs
* null values (via definition levels)

It does not yet support:

- Boolean arrays (have to be handled differently from numeric values)
- Binary arrays
- Dictionary arrays
- Union arrays (are they even possible?)

I have only added a test by creating a nested schema, which I tested on pyarrow.

```jupyter
# schema of test_complex.parquet

a: int32 not null
b: int32
c: struct<d: double, e: struct<f: float>> not null
  child 0, d: double
  child 1, e: struct<f: float>
      child 0, f: float
```

This PR potentially addresses:

* https://issues.apache.org/jira/browse/ARROW-8289
* https://issues.apache.org/jira/browse/ARROW-8423
* https://issues.apache.org/jira/browse/ARROW-8424
* https://issues.apache.org/jira/browse/ARROW-8425

And I would like to propose either opening new JIRAs for the above incomplete items, or renaming the last 3 above.

___

**Help Needed**

I'm implementing the definition and repetition levels on first principle from an old Parquet blog post from the Twitter engineering blog. It's likely that I'm not getting some concepts correct, so I would appreciate help with:

* Checking if my logic is correct
* Guidance or suggestions on how to more efficiently extract levels from arrays
* Adding tests - I suspect we might need a lot of tests, so far we only test writing 1 batch, so I don't know how paging would work when writing a large enough file

I also don't know if the various encoding levels (dictionary, RLE, etc.) and compression levels are applied automagically, or if that'd be something we need to explicitly enable.

CC @sunchao @sadikovi @andygrove @paddyhoran

Might be of interest to @mcassels @maxburke

Closes #7319 from nevi-me/arrow-parquet-writer

Lead-authored-by: Neville Dipale <[email protected]>
Co-authored-by: Max Burke <[email protected]>
Co-authored-by: Andy Grove <[email protected]>
Co-authored-by: Max Burke <[email protected]>
Signed-off-by: Neville Dipale <[email protected]>
nevi-me added a commit that referenced this pull request Oct 7, 2020
**Note**: I started making changes to #6785, and ended up deviating a lot, so I opted for making a new draft PR in case my approach is not suitable.
___

This is a draft to implement an arrow writer for parquet. It supports the following (no complete test coverage yet):

* writing primitives except for booleans and binary
* nested structs
* null values (via definition levels)

It does not yet support:

- Boolean arrays (have to be handled differently from numeric values)
- Binary arrays
- Dictionary arrays
- Union arrays (are they even possible?)

I have only added a test by creating a nested schema, which I tested on pyarrow.

```jupyter
# schema of test_complex.parquet

a: int32 not null
b: int32
c: struct<d: double, e: struct<f: float>> not null
  child 0, d: double
  child 1, e: struct<f: float>
      child 0, f: float
```

This PR potentially addresses:

* https://issues.apache.org/jira/browse/ARROW-8289
* https://issues.apache.org/jira/browse/ARROW-8423
* https://issues.apache.org/jira/browse/ARROW-8424
* https://issues.apache.org/jira/browse/ARROW-8425

And I would like to propose either opening new JIRAs for the above incomplete items, or renaming the last 3 above.

___

**Help Needed**

I'm implementing the definition and repetition levels on first principle from an old Parquet blog post from the Twitter engineering blog. It's likely that I'm not getting some concepts correct, so I would appreciate help with:

* Checking if my logic is correct
* Guidance or suggestions on how to more efficiently extract levels from arrays
* Adding tests - I suspect we might need a lot of tests, so far we only test writing 1 batch, so I don't know how paging would work when writing a large enough file

I also don't know if the various encoding levels (dictionary, RLE, etc.) and compression levels are applied automagically, or if that'd be something we need to explicitly enable.

CC @sunchao @sadikovi @andygrove @paddyhoran

Might be of interest to @mcassels @maxburke

Closes #7319 from nevi-me/arrow-parquet-writer

Lead-authored-by: Neville Dipale <[email protected]>
Co-authored-by: Max Burke <[email protected]>
Co-authored-by: Andy Grove <[email protected]>
Co-authored-by: Max Burke <[email protected]>
Signed-off-by: Neville Dipale <[email protected]>
nevi-me added a commit that referenced this pull request Oct 12, 2020
**Note**: I started making changes to #6785, and ended up deviating a lot, so I opted for making a new draft PR in case my approach is not suitable.
___

This is a draft to implement an arrow writer for parquet. It supports the following (no complete test coverage yet):

* writing primitives except for booleans and binary
* nested structs
* null values (via definition levels)

It does not yet support:

- Boolean arrays (have to be handled differently from numeric values)
- Binary arrays
- Dictionary arrays
- Union arrays (are they even possible?)

I have only added a test by creating a nested schema, which I tested on pyarrow.

```jupyter
# schema of test_complex.parquet

a: int32 not null
b: int32
c: struct<d: double, e: struct<f: float>> not null
  child 0, d: double
  child 1, e: struct<f: float>
      child 0, f: float
```

This PR potentially addresses:

* https://issues.apache.org/jira/browse/ARROW-8289
* https://issues.apache.org/jira/browse/ARROW-8423
* https://issues.apache.org/jira/browse/ARROW-8424
* https://issues.apache.org/jira/browse/ARROW-8425

And I would like to propose either opening new JIRAs for the above incomplete items, or renaming the last 3 above.

___

**Help Needed**

I'm implementing the definition and repetition levels on first principle from an old Parquet blog post from the Twitter engineering blog. It's likely that I'm not getting some concepts correct, so I would appreciate help with:

* Checking if my logic is correct
* Guidance or suggestions on how to more efficiently extract levels from arrays
* Adding tests - I suspect we might need a lot of tests, so far we only test writing 1 batch, so I don't know how paging would work when writing a large enough file

I also don't know if the various encoding levels (dictionary, RLE, etc.) and compression levels are applied automagically, or if that'd be something we need to explicitly enable.

CC @sunchao @sadikovi @andygrove @paddyhoran

Might be of interest to @mcassels @maxburke

Closes #7319 from nevi-me/arrow-parquet-writer

Lead-authored-by: Neville Dipale <[email protected]>
Co-authored-by: Max Burke <[email protected]>
Co-authored-by: Andy Grove <[email protected]>
Co-authored-by: Max Burke <[email protected]>
Signed-off-by: Neville Dipale <[email protected]>
nevi-me added a commit that referenced this pull request Oct 16, 2020
**Note**: I started making changes to #6785, and ended up deviating a lot, so I opted for making a new draft PR in case my approach is not suitable.
___

This is a draft to implement an arrow writer for parquet. It supports the following (no complete test coverage yet):

* writing primitives except for booleans and binary
* nested structs
* null values (via definition levels)

It does not yet support:

- Boolean arrays (have to be handled differently from numeric values)
- Binary arrays
- Dictionary arrays
- Union arrays (are they even possible?)

I have only added a test by creating a nested schema, which I tested on pyarrow.

```jupyter
# schema of test_complex.parquet

a: int32 not null
b: int32
c: struct<d: double, e: struct<f: float>> not null
  child 0, d: double
  child 1, e: struct<f: float>
      child 0, f: float
```

This PR potentially addresses:

* https://issues.apache.org/jira/browse/ARROW-8289
* https://issues.apache.org/jira/browse/ARROW-8423
* https://issues.apache.org/jira/browse/ARROW-8424
* https://issues.apache.org/jira/browse/ARROW-8425

And I would like to propose either opening new JIRAs for the above incomplete items, or renaming the last 3 above.

___

**Help Needed**

I'm implementing the definition and repetition levels on first principle from an old Parquet blog post from the Twitter engineering blog. It's likely that I'm not getting some concepts correct, so I would appreciate help with:

* Checking if my logic is correct
* Guidance or suggestions on how to more efficiently extract levels from arrays
* Adding tests - I suspect we might need a lot of tests, so far we only test writing 1 batch, so I don't know how paging would work when writing a large enough file

I also don't know if the various encoding levels (dictionary, RLE, etc.) and compression levels are applied automagically, or if that'd be something we need to explicitly enable.

CC @sunchao @sadikovi @andygrove @paddyhoran

Might be of interest to @mcassels @maxburke

Closes #7319 from nevi-me/arrow-parquet-writer

Lead-authored-by: Neville Dipale <[email protected]>
Co-authored-by: Max Burke <[email protected]>
Co-authored-by: Andy Grove <[email protected]>
Co-authored-by: Max Burke <[email protected]>
Signed-off-by: Neville Dipale <[email protected]>
nevi-me added a commit to nevi-me/arrow that referenced this pull request Oct 17, 2020
nevi-me added a commit that referenced this pull request Oct 25, 2020
nevi-me added a commit that referenced this pull request Oct 27, 2020
nevi-me added a commit that referenced this pull request Oct 28, 2020