-
Notifications
You must be signed in to change notification settings - Fork 729
feat: add proto serialization for FilteredReadExec #5954
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Changes from 5 commits
14527fa
cc80861
d0f24d7
3348756
cdf75dd
bd82778
2fd5742
6e4edc2
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
Some generated files are not rendered by default. Learn more about how customized files appear on GitHub.
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,87 @@ | ||
| // SPDX-License-Identifier: Apache-2.0 | ||
| // SPDX-FileCopyrightText: Copyright The Lance Authors | ||
|
|
||
| syntax = "proto3"; | ||
|
|
||
| package lance.datafusion; | ||
|
|
||
| import "table_identifier.proto"; | ||
|
|
||
| message U64Range { | ||
| uint64 start = 1; | ||
| uint64 end = 2; | ||
| } | ||
|
|
||
| message ProjectionProto { | ||
| repeated int32 field_ids = 1; | ||
| bool with_row_id = 2; | ||
| bool with_row_addr = 3; | ||
| bool with_row_last_updated_at_version = 4; | ||
| bool with_row_created_at_version = 5; | ||
| BlobHandlingProto blob_handling = 6; | ||
| } | ||
|
|
||
| message BlobHandlingProto { | ||
| oneof mode { | ||
| // All blobs read as binary | ||
| bool all_binary = 1; | ||
| // Blobs as descriptions, other binary as binary (default) | ||
| bool blobs_descriptions = 2; | ||
| // All binary columns as descriptions | ||
| bool all_descriptions = 3; | ||
| // Specific blobs read as binary, rest as descriptions (non-blob binary stays binary) | ||
| FieldIdSet some_blobs_binary = 4; | ||
| // Specific columns as binary, all other binary as descriptions | ||
| FieldIdSet some_binary = 5; | ||
| } | ||
| } | ||
|
|
||
| message FieldIdSet { | ||
| repeated uint32 field_ids = 1; | ||
| } | ||
|
|
||
| message FilteredReadThreadingModeProto { | ||
| oneof mode { | ||
| uint64 one_partition_multiple_threads = 1; | ||
| uint64 multiple_partitions = 2; | ||
| } | ||
| } | ||
|
|
||
| // Serializable form of FilteredReadOptions. | ||
| message FilteredReadOptionsProto { | ||
| optional U64Range scan_range_before_filter = 1; | ||
| optional U64Range scan_range_after_filter = 2; | ||
| bool with_deleted_rows = 3; | ||
| optional uint32 batch_size = 4; | ||
| optional uint64 fragment_readahead = 5; | ||
| repeated uint64 fragment_ids = 6; | ||
| ProjectionProto projection = 7; | ||
| optional bytes refine_filter_substrait = 8; | ||
| optional bytes full_filter_substrait = 9; | ||
| FilteredReadThreadingModeProto threading_mode = 10; | ||
| optional uint64 io_buffer_size_bytes = 11; | ||
| // Arrow IPC schema for decoding Substrait filters (may be wider than projection). | ||
| optional bytes filter_schema_ipc = 12; | ||
| } | ||
|
|
||
| // Serializable form of FilteredReadPlan (planned/distributed mode). | ||
| // RowAddrTreeMap serialized via its built-in serialize_into/deserialize_from. | ||
| // Per-fragment filters are Substrait-encoded and deduplicated. | ||
| message FilteredReadPlanProto { | ||
| bytes row_addr_tree_map = 1; | ||
| optional U64Range scan_range_after_filter = 2; | ||
| // Arrow IPC schema for decoding Substrait filters (matches the schema used at encode time). | ||
| optional bytes filter_schema_ipc = 3; | ||
| // Deduplicated filter storage: frag_id → index into filter_expressions. | ||
| map<uint32, uint32> fragment_filter_ids = 4; | ||
|
Member
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Can you expand on what the key and value represent? I think the key is fragment id and the value is an index into |
||
| // Unique Substrait-encoded filter expressions (indexed by fragment_filter_ids values). | ||
| repeated bytes filter_expressions = 5; | ||
| } | ||
|
|
||
| // Top-level wrapper for FilteredReadExec serialization. | ||
| message FilteredReadExecProto { | ||
| TableIdentifier table = 1; | ||
| FilteredReadOptionsProto options = 2; | ||
| optional FilteredReadPlanProto plan = 3; | ||
|
Member
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Can you explain why this is optional?
Contributor
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. FilteredReadExec can run in two modes:
|
||
| // index_input (child plan) handled by DataFusion's codec via inputs[] | ||
|
Member
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I'm not entirely sure what this means
Contributor
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. // Top-level wrapper for FilteredReadExec serialization. basically datafusion will give the deserialized index input to FilteredReadExec when constructing it, we don't need to deal with that. |
||
| } | ||
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,18 @@ | ||
| // SPDX-License-Identifier: Apache-2.0 | ||
| // SPDX-FileCopyrightText: Copyright The Lance Authors | ||
|
|
||
| syntax = "proto3"; | ||
|
|
||
| package lance.datafusion; | ||
|
|
||
| // Identifies a Lance dataset for remote reconstruction. | ||
| // | ||
| // Two modes: | ||
| // 1. uri + serialized_manifest (fast): remote executor skips manifest read. | ||
| // 2. uri + version + etag (lightweight): remote executor loads manifest from storage. | ||
| message TableIdentifier { | ||
| string uri = 1; | ||
| uint64 version = 2; | ||
| optional string manifest_etag = 3; | ||
| optional bytes serialized_manifest = 4; | ||
|
Contributor
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. @wjones127 do you think we need to serialize the manifest_etag and even send the serialized manifest itself? |
||
| } | ||
Some generated files are not rendered by default. Learn more about how customized files appear on GitHub.
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,25 @@ | ||
| // SPDX-License-Identifier: Apache-2.0 | ||
| // SPDX-FileCopyrightText: Copyright The Lance Authors | ||
|
|
||
| use std::io::Result; | ||
|
|
||
| fn main() -> Result<()> { | ||
| println!("cargo:rerun-if-changed=protos"); | ||
|
|
||
| #[cfg(feature = "protoc")] | ||
| // Use vendored protobuf compiler if requested. | ||
| std::env::set_var("PROTOC", protobuf_src::protoc()); | ||
|
|
||
| let mut prost_build = prost_build::Config::new(); | ||
| prost_build.protoc_arg("--experimental_allow_proto3_optional"); | ||
| prost_build.enable_type_names(); | ||
| prost_build.compile_protos( | ||
| &[ | ||
| "./protos/table_identifier.proto", | ||
| "./protos/filtered_read.proto", | ||
| ], | ||
| &["./protos"], | ||
| )?; | ||
|
|
||
| Ok(()) | ||
| } |
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1 @@ | ||
| ../../protos |
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -1,7 +1,7 @@ | ||
| // SPDX-License-Identifier: Apache-2.0 | ||
| // SPDX-FileCopyrightText: Copyright The Lance Authors | ||
|
|
||
| use arrow_schema::Schema as ArrowSchema; | ||
| use arrow_schema::{DataType, Schema as ArrowSchema}; | ||
| use datafusion::{execution::SessionState, logical_expr::Expr}; | ||
|
|
||
| use crate::aggregate::Aggregate; | ||
|
|
@@ -27,6 +27,31 @@ use snafu::location; | |
| use std::collections::HashMap; | ||
| use std::sync::Arc; | ||
|
|
||
| /// Substrait doesn't yet support all data types. | ||
| fn is_substrait_compatible(data_type: &DataType) -> bool { | ||
| match data_type { | ||
| DataType::Null | DataType::FixedSizeList(_, _) | DataType::Float16 => false, | ||
|
Contributor
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. @westonpace do you know if this is still the limitation for substrait?
Member
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Pretty sure this is correct though maybe there is support for Float16 now in DF if I recall correctly. |
||
| DataType::List(inner) => is_substrait_compatible(inner.data_type()), | ||
| DataType::Struct(fields) => fields | ||
| .iter() | ||
| .all(|f| is_substrait_compatible(f.data_type())), | ||
| _ => true, | ||
| } | ||
| } | ||
|
|
||
| /// Removes top-level fields that contain data types that Substrait | ||
| /// is not capable of serializing. | ||
| pub fn prune_schema_for_substrait(schema: &ArrowSchema) -> ArrowSchema { | ||
| ArrowSchema::new( | ||
| schema | ||
| .fields() | ||
| .iter() | ||
| .filter(|f| is_substrait_compatible(f.data_type())) | ||
| .cloned() | ||
| .collect::<Vec<_>>(), | ||
| ) | ||
| } | ||
|
|
||
| /// Convert a DF Expr into a Substrait ExtendedExpressions message | ||
| /// | ||
| /// The schema needs to contain all of the fields that are referenced in the expression. | ||
|
|
||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
TL;DR does it make sense to make this a
oneofoverbytesandrepeated U64Range? Then we could just use the smaller representation on serialization (easy computation).In my plan_splits PR I merged the
FilteredReadPlanandFilteredReadIntenralPlanbecause IIUC the only reason we have differentiation is that serializing a bitmap is smaller than ranges. I would challenge that under the intuition (or maybe "hope") that we have many dense splits where the seriaization is actually smaller for ranges (ex. all 4k values a range is 16 bytes (u64 start, u64 end) whereas a bitmap is 4k / 8 = 500 bytes). If this is the case, then we would do a series of inefficient serializations to move to bitmap for every query.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The reason that having FilteredReadPlan as a outside plan is mainly not for serialization / deserialization. I keep FilteredReadPlan with external plan using RoaringBitmap for the following reasons
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm, I don't have a good sense for where the trade-offs are for the two approaches but it sounds like you've both thought about it. My preference is to move forward with what we've got written and revisit if serialization time seems to be a bottleneck?