
Conversation

@nevi-me
Contributor

@nevi-me nevi-me commented Aug 8, 2020

This will allow preserving Arrow-specific metadata when writing or reading Parquet files created from C++ or Rust.
If the schema can't be deserialised, the normal Parquet > Arrow schema conversion is performed.
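
As a rough illustration of that fallback (hypothetical helper names; not the actual functions added in this PR):

```rust
use arrow::datatypes::{DataType, Field, Schema};

// Hypothetical sketch of the fallback described above: prefer an Arrow schema
// serialized under the "ARROW:schema" metadata key; if it is absent or can't
// be decoded, fall back to the ordinary Parquet -> Arrow schema conversion.
fn try_decode_arrow_schema(_encoded: &str) -> Option<Schema> {
    // Placeholder: the real code decodes an IPC-encoded schema here.
    None
}

fn resolve_arrow_schema(
    metadata: &[(String, String)],
    parquet_to_arrow_fallback: impl Fn() -> Schema,
) -> Schema {
    metadata
        .iter()
        .find(|(key, _)| key.as_str() == "ARROW:schema")
        .and_then(|(_, value)| try_decode_arrow_schema(value))
        .unwrap_or_else(parquet_to_arrow_fallback)
}

fn main() {
    // With no serialized schema present, the fallback conversion is used.
    let schema = resolve_arrow_schema(&[], || {
        Schema::new(vec![Field::new("a", DataType::Int32, false)])
    });
    println!("{:?}", schema);
}
```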

Contributor Author

Not using a specific version as we already transitively depend on an older version, but I'll change this when the feature changes are merged

Contributor Author

@vertexclique arrow and base64 would have to go together; should we create a feature that refers to both crates?

Contributor

Would be super nice. While looking at parquet in this PR, I realized that the "snap", "brotli", "flate2", "lz4", and "zstd" dependencies are not always needed, since these compression codecs are not additive features (I am not sure; we might need to check this).

What about feature gating like this? wdyt?

default = ["core"]
core = ["arrow", "base64"]
all = ["arrow", "snap", "brotli", "flate2", "lz4", "zstd", "base64"]

Contributor Author

I assume the various compression types could depend on what the parquet source being read supports. In that case, could a user run the risk of not being able to read a file because it uses a compression codec they haven't enabled? I like core; arrow would have been more intuitive, but I suppose it clashes with the arrow crate.

Contributor

> I assume the various compression types could depend on what the parquet source being read supports. In that case, could a user run the risk of not being able to read a file because it uses a compression codec they haven't enabled?

Yes, that's true. To keep the default feature set's build dependencies smaller, here's yet another idea (FWIW, bikeshedding):

default = ["core"]
core = ["arrow", "base64"]

comp_snap = ["snap"]
comp_brotli = ["brotli"]
comp_flate = ["flate2"]
comp_lz4 = ["lz4"]
comp_zstd = ["zstd"]

and users enable whichever compression codecs they need. I just tested this; the parquet tests pass on master configured like that (without base64).

Contributor Author

I think that's fine. If users expect to read parquet data which they didn't write (and thus might not know whether it's compressed), they can turn on the various compression features.

Member

Sounds fine. What will the error message be like in this case?

Contributor Author

I'm not sure; removing the compression features and running the tests doesn't produce any failures, so I can't see what the behaviour should be.

@github-actions

github-actions bot commented Aug 8, 2020

**Note**: I started making changes to apache#6785, and ended up deviating a lot, so I opted for making a new draft PR in case my approach is not suitable.
___

This is a draft to implement an arrow writer for parquet. It supports the following (no complete test coverage yet):

* writing primitives except for booleans and binary
* nested structs
* null values (via definition levels)

It does not yet support:

- Boolean arrays (have to be handled differently from numeric values)
- Binary arrays
- Dictionary arrays
- Union arrays (are they even possible?)

I have only added a test that creates a nested schema, which I verified with pyarrow.

```jupyter
# schema of test_complex.parquet

a: int32 not null
b: int32
c: struct<d: double, e: struct<f: float>> not null
  child 0, d: double
  child 1, e: struct<f: float>
      child 0, f: float
```
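
For reference, a minimal sketch of how the draft writer is meant to be driven (assuming an `ArrowWriter`-style API as drafted here; exact names and signatures may still change, and the nested struct column from the schema above is omitted for brevity):

```rust
use std::{fs::File, sync::Arc};
use arrow::array::Int32Array;
use arrow::datatypes::{DataType, Field, Schema};
use arrow::record_batch::RecordBatch;
use parquet::arrow::ArrowWriter;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Simplified version of the test schema above: a non-null int32 and a
    // nullable int32 (the struct column "c" is left out to keep this short).
    let schema = Arc::new(Schema::new(vec![
        Field::new("a", DataType::Int32, false),
        Field::new("b", DataType::Int32, true),
    ]));
    let batch = RecordBatch::try_new(
        schema.clone(),
        vec![
            Arc::new(Int32Array::from(vec![1, 2, 3])),
            Arc::new(Int32Array::from(vec![Some(10), None, Some(30)])),
        ],
    )?;

    // Write a single batch and close the file (footer written on close).
    let file = File::create("test_simple.parquet")?;
    let mut writer = ArrowWriter::try_new(file, schema, None)?;
    writer.write(&batch)?;
    writer.close()?;
    Ok(())
}
```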

This PR potentially addresses:

* https://issues.apache.org/jira/browse/ARROW-8289
* https://issues.apache.org/jira/browse/ARROW-8423
* https://issues.apache.org/jira/browse/ARROW-8424
* https://issues.apache.org/jira/browse/ARROW-8425

And I would like to propose either opening new JIRAs for the above incomplete items, or renaming the last 3 above.

___

**Help Needed**

I'm implementing the definition and repetition levels from first principles, based on an old Parquet post on the Twitter engineering blog. It's likely that I'm not getting some concepts correct, so I would appreciate help with:

* Checking if my logic is correct
* Guidance or suggestions on how to more efficiently extract levels from arrays
* Adding tests - I suspect we might need a lot of them; so far we only test writing 1 batch, so I don't know how paging would behave when writing a large enough file
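
To show where my understanding currently stands (re: the first point above), here's a toy sketch of definition levels for the simplest case, a flat nullable column. This is based on my reading of the Dremel/Twitter material and is not necessarily what the writer should end up doing:

```rust
use arrow::array::{Array, Int32Array};

/// Toy definition-level computation for a flat, optional (nullable) column:
/// the maximum definition level is 1, so a present value gets 1 and a null
/// gets 0. Nested and repeated fields need deeper levels than this.
fn definition_levels(array: &Int32Array) -> Vec<i16> {
    (0..array.len())
        .map(|i| if array.is_null(i) { 0 } else { 1 })
        .collect()
}

fn main() {
    let col = Int32Array::from(vec![Some(1), None, Some(3)]);
    assert_eq!(definition_levels(&col), vec![1, 0, 1]);
}
```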

I also don't know if the various encodings (dictionary, RLE, etc.) and compression settings are applied automagically, or if that's something we need to explicitly enable.
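
From a quick look at the parquet crate, these seem to be driven by WriterProperties rather than applied automagically. A rough sketch of what configuring them explicitly might look like (assuming the existing builder API; I haven't wired this into the arrow writer):

```rust
use parquet::basic::{Compression, Encoding};
use parquet::file::properties::WriterProperties;

// Explicitly configure compression and encodings through WriterProperties;
// the arrow writer could accept these rather than applying anything by magic.
fn example_properties() -> WriterProperties {
    WriterProperties::builder()
        .set_compression(Compression::SNAPPY)
        .set_dictionary_enabled(true)
        .set_encoding(Encoding::PLAIN)
        .build()
}
```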

CC @sunchao @sadikovi @andygrove @paddyhoran

Might be of interest to @mcassels @maxburke

Closes apache#7319 from nevi-me/arrow-parquet-writer

Lead-authored-by: Neville Dipale <[email protected]>
Co-authored-by: Max Burke <[email protected]>
Co-authored-by: Andy Grove <[email protected]>
Co-authored-by: Max Burke <[email protected]>
Signed-off-by: Neville Dipale <[email protected]>
This will allow preserving Arrow-specific metadata when writing or reading Parquet files created from C++ or Rust.
If the schema can't be deserialised, the normal Parquet > Arrow schema conversion is performed.
// Serialize the Arrow schema to an IPC flatbuffer message
let mut serialized_schema = arrow::ipc::writer::schema_to_bytes(&arrow_schema);
let schema_len = serialized_schema.len();
// Reserve room for the 4-byte continuation marker and 4-byte length prefix
let mut len_prefix_schema = Vec::with_capacity(schema_len + 8);
// Prepend the IPC continuation marker (0xFFFFFFFF)
len_prefix_schema.append(&mut vec![255u8, 255, 255, 255]);
Contributor Author

Our IPC implementation is still based on the legacy format, so we have to manually prepend the continuation marker and message length.
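
For reference, a self-contained sketch of the framing as I understand it (continuation marker, then little-endian message length, then the flatbuffer bytes); the helper name is just for illustration:

```rust
// Illustrative only: frame an already-serialized Arrow schema as an
// encapsulated IPC message:
// [0xFF 0xFF 0xFF 0xFF][u32 little-endian length][flatbuffer message bytes]
fn length_prefix_schema(mut serialized_schema: Vec<u8>) -> Vec<u8> {
    let schema_len = serialized_schema.len();
    let mut framed = Vec::with_capacity(schema_len + 8);
    framed.extend_from_slice(&[0xFF, 0xFF, 0xFF, 0xFF]); // continuation marker
    framed.extend_from_slice(&(schema_len as u32).to_le_bytes()); // message length
    framed.append(&mut serialized_schema); // flatbuffer-encoded schema body
    framed
}
```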

- fix failing writer test
- only convert schema if there is no projection
- fix fsl conversion error
- return descriptive error if empty arrow struct encountered
@nevi-me nevi-me marked this pull request as ready for review August 14, 2020 14:12
Contributor Author

@nevi-me nevi-me left a comment

@sunchao this is ready for review

We can round-trip all Rust-supported Arrow data types except for Duration and Dictionary types.
The reader and writer don't yet support all types, so I tested the schema round-trip with an empty (schema-only) file.
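
For illustration, the kind of schema round-trip this enables, sketched against reader/writer APIs along the lines of what's on this branch (names may differ slightly, and this isn't the exact test in the PR):

```rust
use std::{fs::File, sync::Arc};
use arrow::datatypes::{DataType, Field, Schema};
use parquet::arrow::{ArrowReader, ArrowWriter, ParquetFileArrowReader};
use parquet::file::reader::SerializedFileReader;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // A type whose exact Arrow flavour is ambiguous from the Parquet schema
    // alone, so recovering it relies on the serialized ARROW:schema metadata.
    let schema = Arc::new(Schema::new(vec![
        Field::new("a", DataType::Date64, true),
    ]));

    // Write an empty file: only the schema (and its encoded metadata) matter here.
    let file = File::create("roundtrip.parquet")?;
    let writer = ArrowWriter::try_new(file, schema.clone(), None)?;
    writer.close()?;

    // Read the schema back; it should match the original Arrow schema.
    let reader = SerializedFileReader::new(File::open("roundtrip.parquet")?)?;
    let mut arrow_reader = ParquetFileArrowReader::new(Arc::new(reader));
    assert_eq!(arrow_reader.get_schema()?, *schema);
    Ok(())
}
```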

}
}

/// Convert parquet schema to arrow schema including optional metadata, only preserving some leaf columns.
Contributor Author

@sunchao @liurenjie1024 I think I'll need help with correctly filtering an Arrow schema if the user supplies column indices. Not sure if there's a 1:1 mapping between Arrow's fields and Parquet leaves.

We can either do it as part of this PR or as a follow-up
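
For illustration of why the mapping isn't 1:1, a hypothetical schema where one Arrow field fans out into several Parquet leaves:

```rust
use arrow::datatypes::{DataType, Field, Schema};

// Hypothetical example: the single Arrow struct field "c" corresponds to two
// Parquet leaf columns ("c.d" and "c.e"), so a projection given as leaf
// indices [1, 2] selects parts of one Arrow field rather than whole fields.
fn main() {
    let schema = Schema::new(vec![
        Field::new("a", DataType::Int32, false), // Parquet leaf 0: a
        Field::new(
            "c",
            DataType::Struct(vec![
                Field::new("d", DataType::Float64, true), // leaf 1: c.d
                Field::new("e", DataType::Float32, true), // leaf 2: c.e
            ]),
            false,
        ),
    ]);
    println!("{:?}", schema);
}
```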

@nevi-me
Contributor Author

nevi-me commented Aug 14, 2020

Current failures will be resolved by rust-lang/socket2#96

@sunchao
Member

sunchao commented Aug 14, 2020

@nevi-me sg - I think this is stacked on another PR though: do we need to get the other one reviewed first?

@nevi-me
Contributor Author

nevi-me commented Aug 14, 2020

> ... I think this is stacked on another PR though: do we need to get the other one reviewed first?

@sunchao it's a bit messy, but it's one PR.

The changes are below:


Let me refactor the above quickly, as I now see that I can move both serialisation and deserialisation to schema.rs

@nevi-me
Contributor Author

nevi-me commented Aug 18, 2020

@sunchao I forgot to mention, I made the changes, so this is ready for review

@sunchao
Member

sunchao commented Aug 18, 2020

@nevi-me ah OK - I'll take a look soon!


/// Mutates writer metadata by encoding the Arrow schema and storing it in the metadata.
/// If there is an existing Arrow schema metadata, it is replaced.
pub fn add_encoded_arrow_schema_to_metadata(
Member

Not sure if this should be public.

It might be useful to split this into util methods:

* encode a Schema
* replace occurrences of a key with a value in a vector.

But perhaps a better way is to replace Vec<KeyValue> with HashMap in WriterProperties, but that is another story.
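
For illustration, the second helper could look something like this (the `KeyValue` below is a stand-in for the parquet_format struct, and the helper name is made up):

```rust
// Stand-in for parquet_format::KeyValue, just for illustration.
#[derive(Debug, Clone, PartialEq)]
struct KeyValue {
    key: String,
    value: Option<String>,
}

/// Replace every occurrence of `key` in the vector with the new value,
/// appending a new entry if the key wasn't present at all.
fn upsert_key_value(meta: &mut Vec<KeyValue>, key: &str, value: String) {
    meta.retain(|kv| kv.key != key);
    meta.push(KeyValue { key: key.to_string(), value: Some(value) });
}

fn main() {
    let mut meta = vec![KeyValue { key: "ARROW:schema".to_string(), value: Some("old".to_string()) }];
    upsert_key_value(&mut meta, "ARROW:schema", "new".to_string());
    assert_eq!(meta.len(), 1);
    assert_eq!(meta[0].value.as_deref(), Some("new"));
}
```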

Contributor Author

I can change the visibility to crate-only. We'll likely only ever use the function in one place, which is why I hadn't split it.

I'd also prefer replacing Vec<KeyValue> with a HashMap, partly because it's a parquet_format detail and it's more convenient to work with hashmaps. I can open a JIRA for this.

Member

np - my comments are mostly cosmetic - it will be good to change the visibility though.

let schema_meta = meta
.iter()
.enumerate()
.find(|(_, kv)| kv.key.as_str() == super::ARROW_SCHEMA_META_KEY);
Member

this only finds/replaces the first occurrence, but I guess the chance is pretty low of having multiple KeyValues with ARROW:schema in the vector...

Contributor Author

Yes, I'm making the presumption that it would only exist once. I don't expect it to already be populated, so this is mostly redundant. If we change Vec<KeyValue> to a HashMap, it'll address this.

@nevi-me nevi-me changed the title from "ARROW-8243: [Rust] [Parquet] Serialize Arrow schema metadata" to "ARROW-8423: [Rust] [Parquet] Serialize Arrow schema metadata" Aug 18, 2020

Member

@sunchao sunchao left a comment

LGTM - thanks @nevi-me !

nevi-me added a commit that referenced this pull request Aug 18, 2020
This will allow preserving Arrow-specific metadata when writing or reading Parquet files created from C++ or Rust.
If the schema can't be deserialised, the normal Parquet > Arrow schema conversion is performed.

Closes #7917 from nevi-me/ARROW-8243

Authored-by: Neville Dipale <[email protected]>
Signed-off-by: Neville Dipale <[email protected]>
@nevi-me nevi-me force-pushed the rust-parquet-arrow-writer branch from 2bfa2d3 to 7afa648 on August 18, 2020 16:44
@nevi-me
Contributor Author

nevi-me commented Aug 18, 2020

merged in the parquet branch

@nevi-me nevi-me closed this Aug 18, 2020
@nevi-me nevi-me deleted the ARROW-8243 branch October 17, 2020 19:19