Revert to pyarrow v20 for compatibility with stale Kaggle geopandas #4589

Merged
zaneselvans merged 1 commit into main from downgrade-pyarrow on Sep 2, 2025

@zaneselvans
Member

Almost immediately after adding GeoParquet outputs to PUDL, we updated to pyarrow 21.0,
which provides native support for the GEOMETRY and GEOGRAPHY data types. That's great,
since it means the geoparquet / geoarrow extensions that supported these (previously)
non-standard data types are no longer necessary.

See:

* apache/arrow#45459
* apache/arrow#45522

Unfortunately, Kaggle is stuck on geopandas 0.14.1 (released in April of 2024) due to
what was at least at some point an incompatibility with the scikit-learn package.

I created an issue asking them to update to modern geopandas or at least check whether
the incompatibility still exists:

Kaggle/docker-python#1491

For the moment I think the easiest way back to working notebooks is to downgrade our
pyarrow to v20.0.0.
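
As a rough illustration of the pin described above, here is a minimal sketch of a
version guard. The helper names are hypothetical and not part of PUDL, and the "below
21" threshold is an assumption drawn only from the behavior reported in this PR. It uses
only the standard library, so it runs whether or not pyarrow is installed.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Optional


def major_version(version_string: str) -> int:
    """Return the leading integer of a version string such as '20.0.0'."""
    return int(version_string.split(".")[0])


def is_kaggle_compatible(pyarrow_version: str) -> bool:
    """Hypothetical check: treat pyarrow majors below 21 as readable on Kaggle.

    The threshold is an assumption based on this PR's description, not a
    documented compatibility guarantee.
    """
    return major_version(pyarrow_version) < 21


def installed_pyarrow_version() -> Optional[str]:
    """Return the installed pyarrow version, or None if it is not installed."""
    try:
        return version("pyarrow")
    except PackageNotFoundError:
        return None


if __name__ == "__main__":
    found = installed_pyarrow_version()
    if found is None:
        print("pyarrow is not installed")
    else:
        print(f"pyarrow {found}: Kaggle-compatible = {is_kaggle_compatible(found)}")
```

A check like this could run before writing GeoParquet outputs, though the dependency
pin itself remains the actual fix.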

It might also be the case that we no longer need to add the bespoke `b"geo"` metadata in
our IO manager with pyarrow v21.0.0 and native GeoParquet support? But that would
require more investigation.
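
For context on what that metadata looks like, here is a minimal sketch of a
GeoParquet-style `b"geo"` entry built with the standard library. The function name and
the default column name are hypothetical, and this is not the code in our IO manager;
the keys follow the GeoParquet 1.0.0 file metadata layout (`version`, `primary_column`,
`columns`).

```python
import json
from typing import Dict


def build_geo_metadata(geometry_column: str = "geometry") -> Dict[bytes, bytes]:
    """Build a schema-metadata mapping with a GeoParquet-style b"geo" entry.

    Hypothetical helper for illustration. An empty geometry_types list means
    the geometry types in the column are not specified.
    """
    geo = {
        "version": "1.0.0",
        "primary_column": geometry_column,
        "columns": {
            geometry_column: {
                "encoding": "WKB",
                "geometry_types": [],
            }
        },
    }
    # Parquet key-value metadata is bytes on both sides.
    return {b"geo": json.dumps(geo).encode("utf-8")}


if __name__ == "__main__":
    metadata = build_geo_metadata("geometry")
    print(json.loads(metadata[b"geo"])["primary_column"])
```

A mapping of this shape is the kind of thing that can be attached to an Arrow schema's
metadata before writing. Whether pyarrow v21.0.0 makes it redundant is the open question
above.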

I recreated the GeoParquet outputs locally with pyarrow v20, and the stale versions of
geopandas from Kaggle could read them. Those same stale versions couldn't read the local
GeoParquet outputs written with pyarrow v21.

zaneselvans self-assigned this on Sep 2, 2025
zaneselvans added labels on Sep 2, 2025:

* parquet: Issues related to the Apache Parquet file format which we use for long tables.
* geospatial: Spatial data and transformations. Anything related to mapping.
* kaggle: Sharing our data and analysis with the Kaggle community
* dependencies: Pull requests that update a dependency file

zaneselvans moved this from New to In review in Catalyst Megaproject on Sep 2, 2025
zaneselvans requested a review from jdangerx on September 2, 2025 16:54
zaneselvans added this pull request to the merge queue on Sep 2, 2025
Merged via the queue into main with commit 1b516ef on Sep 2, 2025
21 of 23 checks passed
zaneselvans deleted the downgrade-pyarrow branch on September 2, 2025 18:48
github-project-automation bot moved this from In review to Done in Catalyst Megaproject on Sep 2, 2025