Support writing GeospatialStatistics in Parquet writer #8524
Conversation
paleolimbot
left a comment
etseidl
left a comment
Thanks @paleolimbot, looks pretty good on a first pass. I just want to make sure that the size statistics are written properly when geo stats are enabled.
parquet/src/column/writer/encoder.rs
Outdated
if let Some(var_bytes) = T::T::variable_length_bytes(slice) {
    *self.variable_length_bytes.get_or_insert(0) += var_bytes;
}
I think this should execute regardless of whether geo stats are enabled. The variable_length_bytes are ultimately written to the SizeStatistics which are useful even without min/max statistics.
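To make the intent concrete, here is a minimal standalone sketch of that separation (PageStats and geo_enabled are illustrative names for this comment, not the real encoder types):

// Sketch only: variable-length byte counts always feed the size statistics,
// while geo statistics are updated only when enabled for the column.
struct PageStats {
    variable_length_bytes: Option<i64>,
}

impl PageStats {
    fn update(&mut self, values: &[Vec<u8>], geo_enabled: bool) {
        // Size statistics: accumulate unconditionally
        let var_bytes: i64 = values.iter().map(|v| v.len() as i64).sum();
        *self.variable_length_bytes.get_or_insert(0) += var_bytes;

        // Geo statistics: optional, independent of the size accounting
        if geo_enabled {
            // ... update the geometry bounding box from the WKB values (omitted)
        }
    }
}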
Done!
parquet/tests/geospatial.rs
Outdated
drop(file_writer);

// Check that statistics exist in thrift output
thrift_metadata.row_groups[0].columns[0]
Heads up that when the thrift stuff merges this will no longer be a format::FileMetaData but a file::metadata::ParquetMetaData.
Got it! I removed these assertions so that they won't break when the thrift stuff merges (although there will be a few logical type constructors that will need to be updated).
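For anyone updating this later, a hedged sketch of what the same check might look like against the reader-facing metadata (the geo_statistics() accessor name is an assumption about the eventual API, not something from this diff):

// Sketch only: verify that geospatial statistics survived the write by
// inspecting ColumnChunkMetaData instead of the raw thrift structs.
use parquet::file::reader::{FileReader, SerializedFileReader};

fn assert_geo_stats_written(file: std::fs::File) {
    let reader = SerializedFileReader::new(file).unwrap();
    let metadata = reader.metadata();
    let column = metadata.row_group(0).column(0);
    assert!(column.geo_statistics().is_some()); // assumed accessor name
}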
Thank you for the review! I will clean this up on Monday and add a few more tests.

@paleolimbot I took a stab at resolving the merge conflicts. They are mostly trivial, but I wasn't sure how to resolve the tests. I'll leave that up to you 😄.
alamb
left a comment
Thank you @paleolimbot and @etseidl
I reviewed this PR for test coverage and structure, and from my perspective it is good to go. I had a few minor comments / suggestions, but nothing I think would prevent merging
    }
}

/// Explicitly specify the Parquet schema to be used
this is a nice API addition I think
So this API actually ends up being a bit problematic, the reason being the type inference and coercion machinery are supposed to mirror each other.
With this change:
- You can write files that won't roundtrip correctly, as the reader doesn't understand the types in the arrow schema (and will just ignore them)
- You can end up with incorrect type coercion for types, e.g. unsigned types not being handled correctly
Further this interferes with removing arrow_cast as a dependency - #9077
I'm not sure what the intention of this API is. Why can't the arrays just be cast before being written? Why does this logic need to live within the parquet writer itself?
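For context, the alternative being suggested is roughly the following: cast to the destination type before handing arrays to the writer, rather than overriding the writer's inferred schema (a sketch; the Int32-to-Int64 coercion is just an example):

// Sketch: coerce the array with the public cast kernel before writing.
use std::sync::Arc;

use arrow_array::{ArrayRef, Int32Array};
use arrow_cast::cast;
use arrow_schema::DataType;

fn coerce_before_write(values: Vec<i32>) -> ArrayRef {
    let array: ArrayRef = Arc::new(Int32Array::from(values));
    // A real caller would pick whatever type the file is supposed to contain
    cast(&array, &DataType::Int64).expect("cast failed")
}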
I think one rationale was to put the appropriate metadata on the field (so the parquet writer knew what logical type to add), but I may be mistaken.
I don't fully understand the concerns about type coercion, but at least part of this API I think is designed to allow interoperability with other arrow implementations (aka not just reading back arrays that were written in Rust, but writing arrays that other implementations will accept).
I put a note on the other PR as well, but the intention was really just to be able to add the test that I needed to add at the time.
I don't have opinions about how this kind of thing should work here in particular, but a schema request across a type boundary (e.g. pyarrow.table(xxx, schema=xxx)) is quite common and nicely separates the destination type inference (usually lossy, with some choices to be made) from the conversion (either write the source type or error if this is not possible). The API here was basically an escape hatch in the event that the built-in Parquet schema inference did the wrong thing (which it did for spatial types at the time that I added it).
/// ```
#[derive(Clone, Debug, PartialEq, Default)]
pub struct GeospatialStatistics {
    /// Optional bounding defining the spatial extent, where None represents a lack of information.
I wonder why remove these comments?
I moved them to the accessor methods in a previous change...I'm not sure why they're showing up in this diff. My theory was that they'd be more likely to be read there but I don't mind copying them back.
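For anyone reading along, the shape being discussed looks roughly like this, with the doc comments living on the accessors rather than the fields (a sketch based on this diff; field and method names may not match the final code exactly):

// Sketch only; BoundingBox is a placeholder stand-in for the real type.
#[derive(Clone, Debug, PartialEq, Default)]
pub struct BoundingBox {
    pub xmin: f64,
    pub xmax: f64,
    pub ymin: f64,
    pub ymax: f64,
}

#[derive(Clone, Debug, PartialEq, Default)]
pub struct GeospatialStatistics {
    bbox: Option<BoundingBox>,
    geospatial_types: Option<Vec<i32>>,
}

impl GeospatialStatistics {
    /// Optional bounding box defining the spatial extent, where `None`
    /// represents a lack of information.
    pub fn bounding_box(&self) -> Option<&BoundingBox> {
        self.bbox.as_ref()
    }

    /// Optional list of geometry type codes observed in the column, where
    /// `None` represents a lack of information.
    pub fn geospatial_types(&self) -> Option<&Vec<i32>> {
        self.geospatial_types.as_ref()
    }
}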
    fn new_accumulator(&self, descr: &ColumnDescPtr) -> Box<dyn GeoStatsAccumulator>;
}

/// Dynamic [`GeospatialStatistics`] accumulator
this is a nice API for optional statistics encoding
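The pattern here is a factory that hands out one accumulator per column chunk, which the column writer drives as values are encoded. Only new_accumulator's signature appears in the diff above; the rest of this sketch (trait and method names, placeholder types) is an assumption about the shape of the API:

// Placeholders so the sketch stands alone; the parquet crate defines its own
// ColumnDescPtr and GeospatialStatistics.
pub struct ColumnDescriptor;
pub type ColumnDescPtr = std::sync::Arc<ColumnDescriptor>;
pub struct GeospatialStatistics;

/// Factory installed on the writer; called once per geometry/geography column chunk.
pub trait GeoStatsAccumulatorFactory: Send + Sync {
    fn new_accumulator(&self, descr: &ColumnDescPtr) -> Box<dyn GeoStatsAccumulator>;
}

/// Per-chunk accumulator driven by the column writer (method names assumed).
pub trait GeoStatsAccumulator: Send {
    /// Feed one encoded WKB value.
    fn update(&mut self, wkb: &[u8]);
    /// Produce the final statistics, or `None` if nothing useful was observed.
    fn finish(&mut self) -> Option<GeospatialStatistics>;
}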
# Enable parquet variant support
variant_experimental = ["arrow", "parquet-variant", "parquet-variant-json", "parquet-variant-compute"]
# Enable geospatial support
geospatial = ["parquet-geospatial"]
Could you please also add the new feature flag to the main crate readme as well?
https://github.com/apache/arrow-rs/blob/main/parquet/README.md#feature-flags
🤖: Benchmark completed
etseidl
left a comment
Looks good to me, thanks @paleolimbot.
One question I have concerns the column chunk Statistics and the column index. Am I correct that if geo stats are written, the column chunk stats should be None? And should the column index for such a column also be None? If so, could you add a test that verifies this? 🙏 Could be in a later PR.
The min/max value should be absent but the null count should still be there. I added a test!
I actually have no idea what a column index is, which suggests to me that it should be None 🙂
It's a version of the page statistics available without having to parse the individual page headers. It has the unfortunate(*) property that min and max are mandatory, so if either min or max is unavailable the column index can't be written for that column.

(*) Unfortunate because there is other information in the column index beyond min and max statistics that can still be of use for page pruning, null pages and level histograms among them.
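A sketch of the kind of check being discussed, run against a file written with a Geometry column (min_bytes_opt/max_bytes_opt and geo_statistics are assumed accessor names; checking the column index itself would additionally require reading the file with the page index enabled):

// Sketch only: a Geometry column should carry geo statistics and a null
// count, but no min/max in the regular chunk statistics.
use parquet::file::reader::{FileReader, SerializedFileReader};

fn check_geo_column_metadata(file: std::fs::File) {
    let reader = SerializedFileReader::new(file).unwrap();
    let metadata = reader.metadata();
    let column = metadata.row_group(0).column(0);

    if let Some(stats) = column.statistics() {
        assert!(stats.min_bytes_opt().is_none()); // assumed accessor names
        assert!(stats.max_bytes_opt().is_none());
    }
    assert!(column.geo_statistics().is_some()); // assumed accessor name
}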
I think this is ready to merge. @alamb have your concerns been addressed?
pub fn wkb_point_xy(x: f64, y: f64) -> Vec<u8> {
    // 21 bytes: 1-byte byte-order flag + 4-byte geometry type + two f64 coordinates
    let mut item: [u8; 21] = [0; 21];
    item[0] = 0x01; // little-endian
    item[1] = 0x01; // geometry type 1 (Point); remaining type bytes stay zero
    item[5..13].copy_from_slice(x.to_le_bytes().as_slice());
    item[13..21].copy_from_slice(y.to_le_bytes().as_slice());
    item.to_vec()
}
It's not a huge deal for XY and XYZM points, but if we want more complex helpers for more complex geometries, I think it would be more maintainable and more understandable for future people to use an existing crate to generate the WKB buffers. (In my own projects I use wkt::types as simple extended-dimension geometry types that I pass into wkb APIs)
Definitely! The GeometryBounder is tested with those here (where we have wkt as a dev dependency). I don't mind how these are implemented (I just needed something for the parquet/geospatial tests).
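For reference, a hand-rolled XYZM helper in the same style as the XY one above would look roughly like this (using the ISO WKB type code 3001 for a point with Z and M; a sketch, not code from the PR):

/// Build a little-endian ISO WKB POINT ZM buffer: 1-byte byte-order flag,
/// 4-byte geometry type (3001), then x, y, z, m as f64.
pub fn wkb_point_xyzm(x: f64, y: f64, z: f64, m: f64) -> Vec<u8> {
    let mut item = vec![0u8; 37];
    item[0] = 0x01; // little-endian
    item[1..5].copy_from_slice(&3001u32.to_le_bytes());
    item[5..13].copy_from_slice(&x.to_le_bytes());
    item[13..21].copy_from_slice(&y.to_le_bytes());
    item[21..29].copy_from_slice(&z.to_le_bytes());
    item[29..37].copy_from_slice(&m.to_le_bytes());
    item
}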
The only thing I want to make sure of is that this doesn't impact writing performance. The benchmark results above seem to suggest there might be an impact. However, I tried to reproduce this locally and it looks fine to me.

cargo bench --bench arrow_writer -- "list_primitive/parquet_2"

list_primitive/parquet_2
                        time:   [994.65 µs 1.0011 ms 1.0083 ms]
                        thrpt:  [2.0649 GiB/s 2.0797 GiB/s 2.0931 GiB/s]
                 change:
                        time:   [+0.0649% +2.0194% +3.9738%] (p = 0.04 < 0.05)
                        thrpt:  [-3.8219% -1.9795% -0.0649%]
                        Change within noise threshold.
Found 4 outliers among 100 measurements (4.00%)
  3 (3.00%) high mild
  1 (1.00%) high severe
Three approvals so let's get this one in and we can iterate if necessary in follow-on PRs!

Thanks again @paleolimbot @etseidl and @kylebarron!
# Rationale for this change
Arrow_cast is a fairly heavy dependency, especially now that it bundles in arrow-ord for RunEndEncodedArrays (#8708). Removing this dependency has been discussed as far back as 2024; let's finally actually do it (#4764).
# Are there any user-facing changes?
Yes, unfortunately #8524 added an API that allows overriding the inferred schema, which in turn allows the coercion machinery to traverse somewhat unintended paths. I personally think this API shouldn't exist, but...
Co-authored-by: Andrew Lamb <[email protected]>
Which issue does this PR close?
Rationale for this change
One of the primary reasons the GeoParquet community was excited about first-class Parquet Geometry/Geography support was the built-in column chunk statistics (we had a workaround that involved adding a struct column, but it was difficult for non-spatial readers to use it and very difficult for non-spatial writers to write it). This PR ensures it is possible for arrow-rs to write files that include those statistics.
What changes are included in this PR?
This PR inserts the minimum required change to enable this support.
Are these changes tested?
Yes!
Are there any user-facing changes?
There are several new functions (which include documentation). Previously it was difficult or impossible to actually write Geometry or Geography logical types, and so it is unlikely any previous usage would be affected.