-
Notifications
You must be signed in to change notification settings - Fork 28
perf: Faster st_geometrytype() function
#90
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Small comment on the executor but in general this looks great...thank you!
I think it's better to move 2 to wkb crate, it doesn't have such a public interface yet
We're using a fork of it for our first (shortly!) release that makes more fields public for approximately this reason and we're hoping to upstream those changes for our second release. I think that your approach here is a good one...there are some other places where we do byte munging with WKB for speed and so this is great.
| pub fn execute_wkb_bytes_void<F: FnMut(Option<&'b [u8]>) -> Result<()>>( | ||
| &self, | ||
| mut func: F, | ||
| ) -> Result<()> { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It would probably be better to use the GeometryFactory/generic executor to do this. I believe something like this works (and is what we do for iterating over things that aren't wkb::Wkb from other libraries).
struct WkbBytesFactory {}
impl GeometryFactory for WkbBytesFactory {
fn try_from_wkb<'a>(&self, wkb_bytes: &'a [u8]) -> Result<&'a [u8]> { Ok(wkb_bytes) }
}
type WkbBytesExecutor = GenericExecutor<WkbBytesFactory, WkbBytesFactory>;|
@paleolimbot Thank you for the review!
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you! Just a note on the executor, which I think can eliminate the duplication there.
|
|
||
| impl<'b> GenericExecutor<'_, 'b, WkbBytesFactory, WkbBytesFactory> { | ||
| /// Execute a function by iterating over [Wkb]'s raw binary representation. | ||
| /// The provided `func` can assume its input bytes is a valid [Wkb] binary. | ||
| pub fn execute_wkb_bytes_void<F: FnMut(Option<&'b [u8]>) -> Result<()>>( | ||
| &self, | ||
| mut func: F, | ||
| ) -> Result<()> { | ||
| // Ensure the first argument of the executor is either Wkb, WkbView, or | ||
| // a Null type (to support columns of all-null values) | ||
| match &self.arg_types[0] { | ||
| SedonaType::Wkb(_, _) | ||
| | SedonaType::WkbView(_, _) | ||
| | SedonaType::Arrow(DataType::Null) => {} | ||
| other => { | ||
| return sedona_internal_err!( | ||
| "Expected SedonaType::Wkb or SedonaType::WkbView or SedonaType::Arrow(DataType::Null) for the first arg, got {}", | ||
| other | ||
| ) | ||
| } | ||
| } | ||
|
|
||
| match &self.args[0] { | ||
| ColumnarValue::Array(array) => { | ||
| array.iter_as_wkb_bytes(&self.arg_types[0], self.num_iterations, func) | ||
| } | ||
| ColumnarValue::Scalar(scalar_value) => { | ||
| let maybe_bytes = scalar_value.scalar_as_wkb_bytes()?; | ||
| func(maybe_bytes) | ||
| } | ||
| } | ||
| } | ||
| } | ||
|
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(See the below change...I am almost positive that executor.execute_wkb_void() does exactly this already unless I'm missing something!)
| impl<'b> GenericExecutor<'_, 'b, WkbBytesFactory, WkbBytesFactory> { | |
| /// Execute a function by iterating over [Wkb]'s raw binary representation. | |
| /// The provided `func` can assume its input bytes is a valid [Wkb] binary. | |
| pub fn execute_wkb_bytes_void<F: FnMut(Option<&'b [u8]>) -> Result<()>>( | |
| &self, | |
| mut func: F, | |
| ) -> Result<()> { | |
| // Ensure the first argument of the executor is either Wkb, WkbView, or | |
| // a Null type (to support columns of all-null values) | |
| match &self.arg_types[0] { | |
| SedonaType::Wkb(_, _) | |
| | SedonaType::WkbView(_, _) | |
| | SedonaType::Arrow(DataType::Null) => {} | |
| other => { | |
| return sedona_internal_err!( | |
| "Expected SedonaType::Wkb or SedonaType::WkbView or SedonaType::Arrow(DataType::Null) for the first arg, got {}", | |
| other | |
| ) | |
| } | |
| } | |
| match &self.args[0] { | |
| ColumnarValue::Array(array) => { | |
| array.iter_as_wkb_bytes(&self.arg_types[0], self.num_iterations, func) | |
| } | |
| ColumnarValue::Scalar(scalar_value) => { | |
| let maybe_bytes = scalar_value.scalar_as_wkb_bytes()?; | |
| func(maybe_bytes) | |
| } | |
| } | |
| } | |
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You're right. Updated in df88be5
| Some(item) => { | ||
| builder.append_option(invoke_scalar(&item)?); | ||
| // Iterate over raw WKB bytes for faster type inference | ||
| executor.execute_wkb_bytes_void(|maybe_bytes| { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| executor.execute_wkb_bytes_void(|maybe_bytes| { | |
| executor.execute_wkb_void(|maybe_bytes| { |
| /// | ||
| /// TODO: Move it to `Wkb` crate |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| /// | |
| /// TODO: Move it to `Wkb` crate |
(We have a lot of stuff like this, both in this crate and the internal sedona-geometry)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Addressed in df88be5
| #[derive(Default)] | ||
| pub(crate) struct WkbBytesFactory {} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Optional, but this will probably be useful in a number of places!
| #[derive(Default)] | |
| pub(crate) struct WkbBytesFactory {} | |
| /// A [GeometryFactory] whose geometry type are raw WKB bytes | |
| /// | |
| /// Using this geometry factory iterates over items as references to the raw underlying | |
| /// bytes, which is useful for writing optimized kernels that do not need the full buffer to | |
| /// be validated and/or parsed. | |
| #[derive(Default)] | |
| pub struct WkbBytesFactory {} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Addressed in df88be5
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you!
Hi 👋🏼 , I’m new to the project and still learning my way around.
sedona-dblooks great, and I’d really appreciate any feedbacks.Rationale
Before, the execution logic for
st_geometrytype()function is, for each row, first parse theWKBbinary into aWKBobject, then extract the base type from the object. This approach includes parsing unused fields in theWKBbinary, since only the geometry type is needed.This PR let it iterate through the raw
WKBbytes, and directly parse the bytes to get the geometry type.Implementation
GenericExecutorwith a new APIexecute_wkb_bytes_void()to iterate on rawWKBbytes.WKBbinary according to the spec.st_geometrytype()with 1 and 2I think it's better to move
2towkbcrate, it doesn't have such a public interface yet 🤔Benchmark
Command
Result:
5x faster for complex collections, 30% faster for simple collections: