round trip from extension type to polars series of list of lists of array #731

Open
deanm0000 opened this issue Aug 27, 2024 · 1 comment

Comments

@deanm0000

I'm hoping to make a polars plugin that converts geoarrow extension types into (I'm thinking) a struct with one field per geometry type, perhaps with metadata stored in the field names. Each field would be represented by something like a list(list(array(f64, 2))), depending on the geometry, of course. I'm not sure whether any part of this is something you'd like to have in geoarrow, or whether you see an obvious improvement on what I'm doing, but I thought I'd reach out.

What I've done so far is the one-way trip from a geoarrow PolygonArray to a Series of list(list(array)). I'll experiment with the other direction later; it will of course be necessary for the geo functions.

Starting from the PolygonArray generated here, I managed to get a Polygon into a polars Series with the following steps. It's super clunky because it goes through an intermediate Series before making the final Series. It does work, though.

let poly: PolygonArray<i32, 2> = p_array();
let b: arrow::array::GenericListArray<i32> = poly.into_arrow();
let (_field, offsetbuffer, arcarray, _nullbuffer) = b.into_parts();
let offsets_32: Vec<i32> = offsetbuffer.to_vec();
let offsets: Vec<i64> = offsets_32.into_iter().map(|x| x as i64).collect();
let s = Series::try_from((
    "polys",
    Box::<dyn polars_arrow::array::Array>::from(arcarray),
))
.unwrap().rechunk();
let values = s.array_ref(0);
// If I try to skip making the above Series and just do
// let values = Box::<dyn polars_arrow::array::Array>::from(arcarray);
// then I get a panic.

let data_type = polars_arrow::array::ListArray::<i64>::default_datatype(values.data_type().clone());
let arr = unsafe {
    polars_arrow::array::ListArray::new(
        data_type,
        polars_arrow::offset::Offsets::new_unchecked(offsets).into(),
        values.clone(),
        None,
    )
};
let mut ca = ListChunked::with_chunk("polys", arr);
unsafe { ca.to_logical(DataType::List(Box::new(DataType::Array(Box::new(DataType::Float64), 2)))) };
ca.set_fast_explode();
let s = Series::try_from(ca).unwrap();
eprintln!("{}", s);

which results in

shape: (2,)
Series: 'polys' [list[list[array[f64, 2]]]]
[
        [[[-111.0, 45.0], [-111.0, 41.0], … [-111.0, 45.0]]]
        [[[-111.0, 45.0], [-111.0, 41.0], … [-111.0, 45.0]], [[-110.0, 44.0], [-110.0, 42.0], … [-110.0, 44.0]]]
]

BTW: I copied most of the above from the polars implode method.

Just to recap, I'm reaching out to see whether you have any pointers on doing the above better, and whether you have any interest in a feature in geoarrow for converting to and from polars (I suppose it could be feature gated).
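For readers following the steps above, the layout that list(list(array(f64, 2))) describes can be sketched in plain Rust with no arrow or polars dependency. This is only an illustration under stated assumptions: `widen_offsets` and `unflatten` are hypothetical helper names, not functions from geoarrow-rs or polars, and the real arrays also carry validity bitmaps that are ignored here.

```rust
/// Widen 32-bit list offsets to 64-bit, mirroring the `x as i64` step above.
/// (Hypothetical helper, not part of any library.)
fn widen_offsets(offsets: &[i32]) -> Vec<i64> {
    offsets.iter().map(|&o| i64::from(o)).collect()
}

/// Rebuild nested polygons from flat buffers (hypothetical helper).
/// `geom_offsets[i]..geom_offsets[i + 1]` selects the rings of polygon `i`;
/// `ring_offsets[j]..ring_offsets[j + 1]` selects the coordinates of ring `j`.
fn unflatten(
    geom_offsets: &[i64],
    ring_offsets: &[i64],
    coords: &[[f64; 2]],
) -> Vec<Vec<Vec<[f64; 2]>>> {
    geom_offsets
        .windows(2)
        .map(|g| {
            (g[0] as usize..g[1] as usize)
                .map(|r| {
                    let start = ring_offsets[r] as usize;
                    let end = ring_offsets[r + 1] as usize;
                    // Copying here is for readability only; the Arrow
                    // arrays share one coordinate buffer instead.
                    coords[start..end].to_vec()
                })
                .collect()
        })
        .collect()
}
```

Calling `unflatten` with geometry offsets `[0, 1, 3]` yields two polygons, the first with one ring and the second with two, which is the same shape as the Series printed above.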

@kylebarron
Member

I think there are two related questions here: how to handle Arrow data conversion between arrow-rs and polars-arrow, and how to handle geospatial data in Polars.

For the first, this is very general and not specific to geoarrow data or geoarrow-rs. I think you want to use polars_arrow::array::Arrow2Arrow. Going through a Vec is less performant because it copies all of the Arrow data. You may have to vendor Arrow2Arrow into your own project so that it uses the same version of arrow-rs that geoarrow uses.
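The cost difference between sharing a buffer and going through a Vec can be sketched with the standard library alone. This is an analogy, not the arrow-rs or polars-arrow API: it assumes a reference-counted `Arc<[i32]>` as a stand-in for an Arrow buffer, and `share`, `copy_via_vec` and `same_allocation` are hypothetical names.

```rust
use std::sync::Arc;

/// Share an existing buffer: only the reference count changes.
fn share(buffer: &Arc<[i32]>) -> Arc<[i32]> {
    Arc::clone(buffer)
}

/// Go through a Vec: allocates new memory and copies every element.
fn copy_via_vec(buffer: &Arc<[i32]>) -> Arc<[i32]> {
    let copied: Vec<i32> = buffer.to_vec();
    Arc::from(copied)
}

/// True when both handles point at the same allocation.
fn same_allocation(a: &Arc<[i32]>, b: &Arc<[i32]>) -> bool {
    Arc::ptr_eq(a, b)
}
```

With this stand-in, `same_allocation` holds for the result of `share` but not for the result of `copy_via_vec`, even though both contain the same values.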

For the second, you know I've been interested in geospatial support in polars for a long time.

> geoarrow extension types into a struct with a field for each geometry type

Sure, this is like a custom implementation of an Arrow sparse union. You could have a struct of {"Point": PointArray, "LineString": LineStringArray<i64> ...} etc. Note, though, that this will not be zero-copy with a geoarrow-rs MixedGeometryArray, because the geoarrow Geometry array is an Arrow dense union.

In particular, the PointArray in your struct will need to have the same number of rows as every other field, which means allocating slots for points on every row even when the data contains no points.
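The row-count difference between the two layouts can be sketched without any Arrow dependency. The `Kind` enum and the three functions below are hypothetical and only count rows; they assume two geometry types and ignore the actual coordinate storage.

```rust
/// Hypothetical geometry kinds, for illustration only.
#[derive(Clone, Copy, PartialEq)]
enum Kind {
    Point,
    LineString,
}

/// Child lengths (points, linestrings) under a dense-union layout:
/// each child stores only the rows of its own type.
fn dense_child_lens(kinds: &[Kind]) -> (usize, usize) {
    let points = kinds.iter().filter(|k| **k == Kind::Point).count();
    (points, kinds.len() - points)
}

/// Per-row offsets into the matching child, as a dense union records them.
fn dense_offsets(kinds: &[Kind]) -> Vec<i32> {
    let (mut next_point, mut next_line) = (0, 0);
    kinds
        .iter()
        .map(|k| match k {
            Kind::Point => {
                let o = next_point;
                next_point += 1;
                o
            }
            Kind::LineString => {
                let o = next_line;
                next_line += 1;
                o
            }
        })
        .collect()
}

/// Child lengths under a struct-of-fields (sparse-union-like) layout:
/// every child has one slot per row, most of them null.
fn struct_child_lens(kinds: &[Kind]) -> (usize, usize) {
    (kinds.len(), kinds.len())
}
```

For three rows that are all linestrings, the dense layout has a point child of length 0, while the struct layout still needs a point child of length 3. That mismatch in child lengths is why one layout cannot be reinterpreted as the other without copying.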

You'll still have to figure out the CRS storage somehow...

In a general sense, I'm most focused on growing the GeoArrow ecosystem. I would like to connect that to Polars, but I'm not going to make a geopolars implementation that isn't zero-copy with GeoArrow. If Polars allows extension types, I'll make a GeoArrow-based GeoPolars; otherwise it's more likely that I'll build on DataFusion when they merge user-defined types (which they plan to do: apache/datafusion#11513).
