Conversation

@2010YOUY01
Contributor

Hi 👋🏼, I’m new to the project and still learning my way around. sedona-db looks great, and I’d really appreciate any feedback.

Rationale

Previously, st_geometrytype() handled each row by first parsing the WKB binary into a WKB object and then extracting the base type from that object. This approach parses fields in the WKB binary that are never used, since only the geometry type is needed.

This PR lets it iterate over the raw WKB bytes and parse the geometry type directly from them.

Implementation

  1. Extend GenericExecutor with a new API, execute_wkb_bytes_void(), to iterate over raw WKB bytes.
  2. Implement a util to parse the geometry type from the WKB binary according to the spec.
  3. Update st_geometrytype() to use 1 and 2.

I think it would be better to move 2 to the wkb crate, but it doesn't have such a public interface yet 🤔
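For illustration, step 2 could look roughly like the sketch below. This is not the code from this PR: the names WkbBaseType and wkb_base_type are hypothetical. It assumes the standard WKB header layout (one byte-order byte followed by a 4-byte geometry type code), ISO dimension offsets (+1000 for Z, +2000 for M, +3000 for ZM), and EWKB flags stored in the top three bits of the type code.

```rust
/// Hypothetical enum for this sketch; not a type from sedona-db or the wkb crate.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum WkbBaseType {
    Point,
    LineString,
    Polygon,
    MultiPoint,
    MultiLineString,
    MultiPolygon,
    GeometryCollection,
}

/// Read only the 5-byte WKB header and return the base geometry type.
/// Returns None if the buffer is too short, the byte-order marker is
/// invalid, or the type code is not one of the seven basic types.
pub fn wkb_base_type(bytes: &[u8]) -> Option<WkbBaseType> {
    // Byte 0 is the byte order, bytes 1..5 are the geometry type code.
    let header = bytes.get(..5)?;
    let raw = [header[1], header[2], header[3], header[4]];
    let code = match header[0] {
        0 => u32::from_be_bytes(raw), // big endian (XDR)
        1 => u32::from_le_bytes(raw), // little endian (NDR)
        _ => return None,
    };
    // Clear the EWKB Z/M/SRID flag bits, then drop the ISO dimension offset.
    let base = (code & 0x1FFF_FFFF) % 1000;
    match base {
        1 => Some(WkbBaseType::Point),
        2 => Some(WkbBaseType::LineString),
        3 => Some(WkbBaseType::Polygon),
        4 => Some(WkbBaseType::MultiPoint),
        5 => Some(WkbBaseType::MultiLineString),
        6 => Some(WkbBaseType::MultiPolygon),
        7 => Some(WkbBaseType::GeometryCollection),
        _ => None,
    }
}
```

Because only five bytes are read, the cost per row does not depend on how many coordinates or nested geometries the value contains, which is consistent with the larger speedup on complex collections.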

Benchmark

Command

pytest --benchmark-group-by=param:table --benchmark-columns=median,mean,stddev test_functions.py::TestBenchFunctions::test_st_geometrytype

Result:

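Roughly 5x faster for complex collections and 30% faster for simple collections. The first two tables (times in ms) are the baseline; the last two (times in us) are with this PR: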

-------------------------------- benchmark 'table=collections_complex': 3 tests -------------------------------
Name (time in ms)                                        Median                Mean            StdDev
---------------------------------------------------------------------------------------------------------------
test_st_geometrytype[collections_complex-SedonaDB]       2.3656 (1.0)        2.4929 (1.0)      0.3857 (1.0)
test_st_geometrytype[collections_complex-DuckDB]        34.2037 (14.46)     34.3980 (13.80)    0.8402 (2.18)
test_st_geometrytype[collections_complex-PostGIS]      304.6275 (128.77)   306.7333 (123.04)   5.8908 (15.27)
---------------------------------------------------------------------------------------------------------------

------------------------------ benchmark 'table=collections_simple': 3 tests -------------------------------
Name (time in ms)                                      Median               Mean            StdDev
------------------------------------------------------------------------------------------------------------
test_st_geometrytype[collections_simple-SedonaDB]      1.3585 (1.0)       1.7419 (1.0)      1.2142 (9.41)
test_st_geometrytype[collections_simple-DuckDB]        5.1103 (3.76)      5.1443 (2.95)     0.1291 (1.0)
test_st_geometrytype[collections_simple-PostGIS]      46.8870 (34.51)    46.9021 (26.93)    0.3712 (2.88)
------------------------------------------------------------------------------------------------------------

-------------------------------------- benchmark 'table=collections_complex': 3 tests -------------------------------------
Name (time in us)                                            Median                    Mean                StdDev
---------------------------------------------------------------------------------------------------------------------------
test_st_geometrytype[collections_complex-SedonaDB]         419.2500 (1.0)          450.9272 (1.0)        124.1193 (1.0)
test_st_geometrytype[collections_complex-DuckDB]        32,422.7921 (77.34)     32,917.7395 (73.00)    2,088.4215 (16.83)
test_st_geometrytype[collections_complex-PostGIS]      295,752.0001 (705.43)   294,866.8750 (653.91)   3,872.8562 (31.20)
---------------------------------------------------------------------------------------------------------------------------

------------------------------------ benchmark 'table=collections_simple': 3 tests -------------------------------------
Name (time in us)                                          Median                   Mean                StdDev
------------------------------------------------------------------------------------------------------------------------
test_st_geometrytype[collections_simple-SedonaDB]        613.2090 (1.0)       1,144.3652 (1.0)      1,073.4389 (3.42)
test_st_geometrytype[collections_simple-DuckDB]        5,502.5411 (8.97)      5,556.3829 (4.86)       314.2311 (1.0)
test_st_geometrytype[collections_simple-PostGIS]      36,191.1250 (59.02)    36,322.7638 (31.74)      730.0613 (2.32)
------------------------------------------------------------------------------------------------------------------------
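As a quick check of the numbers above, the ratios can be computed directly from the SedonaDB rows. This sketch assumes the millisecond tables are the baseline and the microsecond tables are with this PR; the helper names speedup and time_saved_percent are made up for illustration.

```rust
/// How many times faster the optimized run is than the baseline.
/// Both arguments must use the same time unit.
pub fn speedup(baseline: f64, optimized: f64) -> f64 {
    baseline / optimized
}

/// Percentage of the baseline running time that was eliminated.
pub fn time_saved_percent(baseline: f64, optimized: f64) -> f64 {
    (1.0 - optimized / baseline) * 100.0
}

/// SedonaDB medians from the tables, converted to microseconds:
/// (baseline, with this PR).
pub const COMPLEX_MEDIAN_US: (f64, f64) = (2365.6, 419.25);
pub const SIMPLE_MEDIAN_US: (f64, f64) = (1358.5, 613.209);
/// SedonaDB means for the simple collections, in microseconds.
pub const SIMPLE_MEAN_US: (f64, f64) = (1741.9, 1144.3652);
```

With these inputs, the median speedup is about 5.6x for complex collections and about 2.2x for simple collections, while the mean for simple collections drops by roughly 34%.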

@2010YOUY01 2010YOUY01 marked this pull request as draft September 16, 2025 12:48
Member

@paleolimbot paleolimbot left a comment

Small comment on the executor but in general this looks great...thank you!

> I think it's better to move 2 to wkb crate, it doesn't have such a public interface yet

We're using a fork of it for our first release (coming shortly!) that makes more fields public for approximately this reason, and we're hoping to upstream those changes for our second release. I think that your approach here is a good one...there are some other places where we do byte munging with WKB for speed, so this is great.

Comment on lines 128 to 131
pub fn execute_wkb_bytes_void<F: FnMut(Option<&'b [u8]>) -> Result<()>>(
    &self,
    mut func: F,
) -> Result<()> {
Member

It would probably be better to use the GeometryFactory/generic executor to do this. I believe something like this works (and is what we do for iterating over things that aren't wkb::Wkb from other libraries).

struct WkbBytesFactory {}

impl GeometryFactory for WkbBytesFactory {
    fn try_from_wkb<'a>(&self, wkb_bytes: &'a [u8]) -> Result<&'a [u8]> { Ok(wkb_bytes) }
}

type WkbBytesExecutor = GenericExecutor<WkbBytesFactory, WkbBytesFactory>;
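To make the idea concrete, here is a self-contained sketch of the factory pattern. The GeometryFactory trait below is a simplified, hypothetical stand-in (with a String error type and an associated Geom type) and may not match the real trait in sedona-db; for_each_item is likewise an invented helper standing in for the executor's iteration.

```rust
/// Simplified, hypothetical stand-in for the real GeometryFactory trait.
pub trait GeometryFactory {
    /// The per-row item handed to the kernel; it may borrow from the WKB buffer.
    type Geom<'a>;

    fn try_from_wkb<'a>(&self, wkb_bytes: &'a [u8]) -> Result<Self::Geom<'a>, String>;
}

/// A factory whose "geometry" is just the raw, unparsed WKB bytes.
#[derive(Default)]
pub struct WkbBytesFactory {}

impl GeometryFactory for WkbBytesFactory {
    type Geom<'a> = &'a [u8];

    fn try_from_wkb<'a>(&self, wkb_bytes: &'a [u8]) -> Result<&'a [u8], String> {
        // No parsing or validation: hand the bytes straight to the kernel.
        Ok(wkb_bytes)
    }
}

/// Invented helper: convert each non-null row with the factory, then call `func`.
/// Null rows are passed through as None.
pub fn for_each_item<'a, Fac, F>(
    factory: &Fac,
    rows: &[Option<&'a [u8]>],
    mut func: F,
) -> Result<(), String>
where
    Fac: GeometryFactory,
    F: FnMut(Option<Fac::Geom<'a>>) -> Result<(), String>,
{
    for row in rows.iter().copied() {
        let item = row.map(|bytes| factory.try_from_wkb(bytes)).transpose()?;
        func(item)?;
    }
    Ok(())
}
```

The design point is that the iteration logic is written once and is generic over the factory, so a kernel that only needs raw bytes swaps in a different factory rather than adding a new executor method.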

@2010YOUY01
Contributor Author

@paleolimbot Thank you for the review!

  • I've addressed the feedback by moving execute_wkb_bytes_void() to a new WkbBytesExecutor, a generic instantiation of GenericExecutor
  • Added a TODO in the wkb parsing util for now

@2010YOUY01 2010YOUY01 marked this pull request as ready for review September 17, 2025 05:54
Member

@paleolimbot paleolimbot left a comment

Thank you! Just a note on the executor, which I think can eliminate the duplication there.

Comment on lines 267 to 300

impl<'b> GenericExecutor<'_, 'b, WkbBytesFactory, WkbBytesFactory> {
    /// Execute a function by iterating over [Wkb]'s raw binary representation.
    /// The provided `func` can assume its input bytes is a valid [Wkb] binary.
    pub fn execute_wkb_bytes_void<F: FnMut(Option<&'b [u8]>) -> Result<()>>(
        &self,
        mut func: F,
    ) -> Result<()> {
        // Ensure the first argument of the executor is either Wkb, WkbView, or
        // a Null type (to support columns of all-null values)
        match &self.arg_types[0] {
            SedonaType::Wkb(_, _)
            | SedonaType::WkbView(_, _)
            | SedonaType::Arrow(DataType::Null) => {}
            other => {
                return sedona_internal_err!(
                    "Expected SedonaType::Wkb or SedonaType::WkbView or SedonaType::Arrow(DataType::Null) for the first arg, got {}",
                    other
                )
            }
        }

        match &self.args[0] {
            ColumnarValue::Array(array) => {
                array.iter_as_wkb_bytes(&self.arg_types[0], self.num_iterations, func)
            }
            ColumnarValue::Scalar(scalar_value) => {
                let maybe_bytes = scalar_value.scalar_as_wkb_bytes()?;
                func(maybe_bytes)
            }
        }
    }
}

Member

(See the below change...I am almost positive that executor.execute_wkb_void() does exactly this already unless I'm missing something!)

Suggested change: remove the entire impl<'b> GenericExecutor<'_, 'b, WkbBytesFactory, WkbBytesFactory> block quoted above (lines 267 to 300), including execute_wkb_bytes_void().

Contributor Author

You're right. Updated in df88be5

Some(item) => {
builder.append_option(invoke_scalar(&item)?);
// Iterate over raw WKB bytes for faster type inference
executor.execute_wkb_bytes_void(|maybe_bytes| {
Member

Suggested change
- executor.execute_wkb_bytes_void(|maybe_bytes| {
+ executor.execute_wkb_void(|maybe_bytes| {

Comment on lines 92 to 93
///
/// TODO: Move it to `Wkb` crate
Member

Suggested change
- ///
- /// TODO: Move it to `Wkb` crate

(We have a lot of stuff like this, both in this crate and the internal sedona-geometry)

Contributor Author

Addressed in df88be5

Comment on lines 249 to 250
#[derive(Default)]
pub(crate) struct WkbBytesFactory {}
Member

Optional, but this will probably be useful in a number of places!

Suggested change
- #[derive(Default)]
- pub(crate) struct WkbBytesFactory {}
+ /// A [GeometryFactory] whose geometry type is raw WKB bytes
+ ///
+ /// Using this geometry factory iterates over items as references to the raw underlying
+ /// bytes, which is useful for writing optimized kernels that do not need the full buffer to
+ /// be validated and/or parsed.
+ #[derive(Default)]
+ pub struct WkbBytesFactory {}

Contributor Author

Addressed in df88be5

Member

@paleolimbot paleolimbot left a comment

Thank you!

@jiayuasu jiayuasu merged commit 8d7d778 into apache:main Sep 17, 2025
11 checks passed

3 participants