Support writing GeospatialStatistics in Parquet writer #8524
Conversation
paleolimbot
left a comment
etseidl
left a comment
Thanks @paleolimbot, looks pretty good on a first pass. I just want to make sure that the size statistics are written properly when geo stats are enabled.
parquet/src/column/writer/encoder.rs
Outdated
if let Some(var_bytes) = T::T::variable_length_bytes(slice) {
    *self.variable_length_bytes.get_or_insert(0) += var_bytes;
}
I think this should execute regardless of whether geo stats are enabled. The variable_length_bytes are ultimately written to the SizeStatistics which are useful even without min/max statistics.
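To make the intent concrete, here is a minimal standalone sketch of that separation (PageStats and geo_enabled are illustrative names for this comment, not the real encoder types):

// Sketch only: variable-length byte counts always feed the size statistics,
// while geo statistics are updated only when enabled for the column.
struct PageStats {
    variable_length_bytes: Option<i64>,
}

impl PageStats {
    fn update(&mut self, values: &[Vec<u8>], geo_enabled: bool) {
        // Size statistics: accumulate unconditionally
        let var_bytes: i64 = values.iter().map(|v| v.len() as i64).sum();
        *self.variable_length_bytes.get_or_insert(0) += var_bytes;

        // Geo statistics: optional, independent of the size accounting
        if geo_enabled {
            // ... update the geometry bounding box from the WKB values (omitted)
        }
    }
}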
Done!
parquet/tests/geospatial.rs
Outdated
drop(file_writer);

// Check that statistics exist in thrift output
thrift_metadata.row_groups[0].columns[0]
Heads up that when the thrift stuff merges this will no longer be a format::FileMetaData but a file::metadata::ParquetMetaData.
Got it! I removed these assertions so that they won't break when the thrift stuff merges (although there will be a few logical type constructors that will need to be updated).
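For anyone updating this later, a hedged sketch of what the same check might look like against the reader-facing metadata (the geo_statistics() accessor name is an assumption about the eventual API, not something from this diff):

// Sketch only: verify that geospatial statistics survived the write by
// inspecting ColumnChunkMetaData instead of the raw thrift structs.
use parquet::file::reader::{FileReader, SerializedFileReader};

fn assert_geo_stats_written(file: std::fs::File) {
    let reader = SerializedFileReader::new(file).unwrap();
    let metadata = reader.metadata();
    let column = metadata.row_group(0).column(0);
    assert!(column.geo_statistics().is_some()); // assumed accessor name
}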
Thank you for the review! I will clean this up on Monday and add a few more tests.

@paleolimbot I took a stab at resolving the merge conflicts. They are mostly trivial, but I wasn't sure how to resolve the tests. I'll leave that up to you 😄.
alamb
left a comment
Thank you @paleolimbot and @etseidl
I reviewed this PR for test coverage and structure, and from my perspective it is good to go. I had a few minor comments / suggestions, but nothing I think would prevent merging
    }
}

/// Explicitly specify the Parquet schema to be used
this is a nice API addition I think
So this API actually ends up being a bit problematic, the reason being the type inference and coercion machinery are supposed to mirror each other.
With this change:
- You can write files that won't roundtrip correctly, as the reader doesn't understand the types in the arrow schema (and will just ignore them)
- You can end up with incorrect type coercion for types, e.g. unsigned types not being handled correctly
Further this interferes with removing arrow_cast as a dependency - #9077
I'm not sure what the intention of this API is. Why can't the arrays just be cast before being written? Why does this logic need to live within the parquet writer itself?
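For context, the alternative being suggested is roughly the following: cast to the destination type before handing arrays to the writer, rather than overriding the writer's inferred schema (a sketch; the Int32-to-Int64 coercion is just an example):

// Sketch: coerce the array with the public cast kernel before writing.
use std::sync::Arc;

use arrow_array::{ArrayRef, Int32Array};
use arrow_cast::cast;
use arrow_schema::DataType;

fn coerce_before_write(values: Vec<i32>) -> ArrayRef {
    let array: ArrayRef = Arc::new(Int32Array::from(values));
    // A real caller would pick whatever type the file is supposed to contain
    cast(&array, &DataType::Int64).expect("cast failed")
}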
I think one rationale was to put the appropriate metadata on the field (so the parquet writer knew what logical type to add), but I may be mistaken.
I don't fully understand the concerns about type coercion, but at least part of this API I think is designed to allow interoperability with other arrow implementations (aka not just reading back arrays that were written in Rust, but writing arrays that other implementations will accept).
I put a note on the other PR as well, but the intention was really just to be able to add the test that I needed to add at the time.
I don't have opinions about how this kind of thing should work here in particular, but a schema request across a type boundary (e.g. pyarrow.table(xxx, schema=xxx)) is quite common and nicely separates the destination type inference (usually lossy, with some choices to be made) from the conversion (either write the source type or error if this is not possible). The API here was basically an escape hatch in the event that the built-in Parquet schema inference did the wrong thing (which it did for spatial types at the time that I added it).
/// ```
#[derive(Clone, Debug, PartialEq, Default)]
pub struct GeospatialStatistics {
    /// Optional bounding defining the spatial extent, where None represents a lack of information.
I wonder why remove these comments?
I moved them to the accessor methods in a previous change...I'm not sure why they're showing up in this diff. My theory was that they'd be more likely to be read there but I don't mind copying them back.
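For anyone reading along, the shape being discussed looks roughly like this, with the doc comments living on the accessors rather than the fields (a sketch based on this diff; field and method names may not match the final code exactly):

// Sketch only; BoundingBox is a placeholder stand-in for the real type.
#[derive(Clone, Debug, PartialEq, Default)]
pub struct BoundingBox {
    pub xmin: f64,
    pub xmax: f64,
    pub ymin: f64,
    pub ymax: f64,
}

#[derive(Clone, Debug, PartialEq, Default)]
pub struct GeospatialStatistics {
    bbox: Option<BoundingBox>,
    geospatial_types: Option<Vec<i32>>,
}

impl GeospatialStatistics {
    /// Optional bounding box defining the spatial extent, where `None`
    /// represents a lack of information.
    pub fn bounding_box(&self) -> Option<&BoundingBox> {
        self.bbox.as_ref()
    }

    /// Optional list of geometry type codes observed in the column, where
    /// `None` represents a lack of information.
    pub fn geospatial_types(&self) -> Option<&Vec<i32>> {
        self.geospatial_types.as_ref()
    }
}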
    fn new_accumulator(&self, descr: &ColumnDescPtr) -> Box<dyn GeoStatsAccumulator>;
}

/// Dynamic [`GeospatialStatistics`] accumulator
this is a nice API for optional statistics encoding
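The pattern here is a factory that hands out one accumulator per column chunk, which the column writer drives as values are encoded. Only new_accumulator's signature appears in the diff above; the rest of this sketch (trait and method names, placeholder types) is an assumption about the shape of the API:

// Placeholders so the sketch stands alone; the parquet crate defines its own
// ColumnDescPtr and GeospatialStatistics.
pub struct ColumnDescriptor;
pub type ColumnDescPtr = std::sync::Arc<ColumnDescriptor>;
pub struct GeospatialStatistics;

/// Factory installed on the writer; called once per geometry/geography column chunk.
pub trait GeoStatsAccumulatorFactory: Send + Sync {
    fn new_accumulator(&self, descr: &ColumnDescPtr) -> Box<dyn GeoStatsAccumulator>;
}

/// Per-chunk accumulator driven by the column writer (method names assumed).
pub trait GeoStatsAccumulator: Send {
    /// Feed one encoded WKB value.
    fn update(&mut self, wkb: &[u8]);
    /// Produce the final statistics, or `None` if nothing useful was observed.
    fn finish(&mut self) -> Option<GeospatialStatistics>;
}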
# Enable parquet variant support
variant_experimental = ["arrow", "parquet-variant", "parquet-variant-json", "parquet-variant-compute"]
# Enable geospatial support
geospatial = ["parquet-geospatial"]
Could you please also add the new feature flag to the main crate readme as well?
https://github.com/apache/arrow-rs/blob/main/parquet/README.md#feature-flags
🤖: Benchmark completed
etseidl
left a comment
Looks good to me, thanks @paleolimbot.
One question I have concerns the column chunk Statistics and the column index. Am I correct that if geo stats are written, the column chunk stats should be None? And should the column index for such a column also be None? If so, could you add a test that verifies this? 🙏 Could be in a later PR.
The min/max value should be absent but the null count should still be there. I added a test!
I actually have no idea what a column index is, which suggests to me that it should be None 🙂
It's a version of the page statistics available without having to parse the individual page headers. It has the unfortunate(*) property that min and max are mandatory, so if either min or max is unavailable the column index can't be written for that column.

(*) Unfortunate because there is other information in the column index beyond min and max statistics that can still be of use for page pruning, null pages and level histograms among them.
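A sketch of the kind of check being discussed, run against a file written with a Geometry column (min_bytes_opt/max_bytes_opt and geo_statistics are assumed accessor names; checking the column index itself would additionally require reading the file with the page index enabled):

// Sketch only: a Geometry column should carry geo statistics and a null
// count, but no min/max in the regular chunk statistics.
use parquet::file::reader::{FileReader, SerializedFileReader};

fn check_geo_column_metadata(file: std::fs::File) {
    let reader = SerializedFileReader::new(file).unwrap();
    let metadata = reader.metadata();
    let column = metadata.row_group(0).column(0);

    if let Some(stats) = column.statistics() {
        assert!(stats.min_bytes_opt().is_none()); // assumed accessor names
        assert!(stats.max_bytes_opt().is_none());
    }
    assert!(column.geo_statistics().is_some()); // assumed accessor name
}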
I think this is ready to merge. @alamb have your concerns been addressed?
pub fn wkb_point_xy(x: f64, y: f64) -> Vec<u8> {
    // 21 bytes: 1-byte byte-order flag + 4-byte geometry type + two f64 coordinates
    let mut item: [u8; 21] = [0; 21];
    item[0] = 0x01; // little-endian
    item[1] = 0x01; // geometry type 1 (Point); remaining type bytes stay zero
    item[5..13].copy_from_slice(x.to_le_bytes().as_slice());
    item[13..21].copy_from_slice(y.to_le_bytes().as_slice());
    item.to_vec()
}
It's not a huge deal for XY and XYZM points, but if we want more complex helpers for more complex geometries, I think it would be more maintainable and more understandable for future people to use an existing crate to generate the WKB buffers. (In my own projects I use wkt::types as simple extended-dimension geometry types that I pass into wkb APIs)
Definitely! The GeometryBounder is tested with those here (where we have wkt as a dev dependency). I don't mind how these are implemented (I just needed something for the parquet/geospatial tests).
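For reference, a hand-rolled XYZM helper in the same style as the XY one above would look roughly like this (using the ISO WKB type code 3001 for a point with Z and M; a sketch, not code from the PR):

/// Build a little-endian ISO WKB POINT ZM buffer: 1-byte byte-order flag,
/// 4-byte geometry type (3001), then x, y, z, m as f64.
pub fn wkb_point_xyzm(x: f64, y: f64, z: f64, m: f64) -> Vec<u8> {
    let mut item = vec![0u8; 37];
    item[0] = 0x01; // little-endian
    item[1..5].copy_from_slice(&3001u32.to_le_bytes());
    item[5..13].copy_from_slice(&x.to_le_bytes());
    item[13..21].copy_from_slice(&y.to_le_bytes());
    item[21..29].copy_from_slice(&z.to_le_bytes());
    item[29..37].copy_from_slice(&m.to_le_bytes());
    item
}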
The only thing I want to make sure of is that this doesn't impact writing performance. The benchmark results above seem to suggest there might be an impact. However, I tried to reproduce this locally and it looks fine to me.

cargo bench --bench arrow_writer -- "list_primitive/parquet_2"

list_primitive/parquet_2
                        time:   [994.65 µs 1.0011 ms 1.0083 ms]
                        thrpt:  [2.0649 GiB/s 2.0797 GiB/s 2.0931 GiB/s]
                 change:
                        time:   [+0.0649% +2.0194% +3.9738%] (p = 0.04 < 0.05)
                        thrpt:  [-3.8219% -1.9795% -0.0649%]
                        Change within noise threshold.
Found 4 outliers among 100 measurements (4.00%)
  3 (3.00%) high mild
  1 (1.00%) high severe
Three approvals so let's get this one in and we can iterate if necessary in follow-on PRs!

Thanks again @paleolimbot @etseidl and @kylebarron!
# Rationale for this change
Arrow_cast is a fairly heavy dependency, especially now that it bundles in arrow-ord for RunEndEncodedArrays (#8708). Removing this dependency has been discussed as far back as 2024; let's finally actually do it (#4764).
# Are there any user-facing changes?
Yes, unfortunately #8524 added an API that allows overriding the inferred schema, which in turn allows the coercion machinery to traverse somewhat unintended paths. I personally think this API shouldn't exist, but...
Co-authored-by: Andrew Lamb <[email protected]>
Which issue does this PR close?
Rationale for this change
One of the primary reasons the GeoParquet community was excited about first-class Parquet Geometry/Geography support was the built-in column chunk statistics (we had a workaround that involved adding a struct column, but it was difficult for non-spatial readers to use it and very difficult for non-spatial writers to write it). This PR ensures it is possible for arrow-rs to write files that include those statistics.
What changes are included in this PR?
This PR inserts the minimum required change to enable this support.
Are these changes tested?
Yes!
Are there any user-facing changes?
There are several new functions (which include documentation). Previously it was difficult or impossible to actually write Geometry or Geography logical types, and so it is unlikely any previous usage would be affected.