Commit ed2b80b

don't error if more fields exist than expected in a struct expression (#267)
This allows us to evaluate a `struct` expression where the base data has more fields than the specified schema.

This is an alternative to #264, and matches more closely with the spec, which states:

> clients can assume that unrecognized actions, fields, and/or metadata domains are never required in order to correctly interpret the transaction log. Clients must ignore such unrecognized fields, and should not produce an error when reading a table that contains unrecognized fields.

NB: This does NOT actually _drop_ the data from the returned expression; it just does no validation on columns that aren't specified. So if the parquet file contains:

```json
{
  "a": 1,
  "b": 2
}
```

and kernel passes a struct schema like `{"name": "a", "type": "int"}`, the returned data will be:

```json
{
  "a": 1,
  "b": 2
}
```

This works for metadata, since we can then call `extract` and things will "just work", but it might not be good enough when we do the final "fix-up" via expressions. However, actually dropping the column from the arrow data requires a lot more code change, so I'm proposing we do this for now to fix things like #261, and then figure out the best way to "do the right thing" when a subset of fields is specified in a struct expression.

Verified that things work as expected with this change:

```
D select * from delta_scan('/home/nick/databricks/delta-kernel-rs/acceptance/tests/dat/out/reader_tests/generated/with_checkpoint/delta/');
┌─────────┬───────┬────────────┐
│ letter  │  int  │    date    │
│ varchar │ int64 │    date    │
├─────────┼───────┼────────────┤
│ a       │    93 │ 1975-06-01 │
│ b       │   753 │ 2012-05-01 │
│ c       │   620 │ 1983-10-01 │
│ a       │   595 │ 2013-03-01 │
│         │   653 │ 1995-12-01 │
└─────────┴───────┴────────────┘
```

Closes #261
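To make the "validated but not dropped" behaviour concrete, here is a minimal sketch using toy types. `Row`, `evaluate_struct`, and the toy `extract` are hypothetical names invented for this illustration; they are not the kernel's or arrow's API, and the real code operates on arrow arrays rather than name/value pairs.

```rust
/// A toy "row": ordered (column name, value) pairs.
/// Hypothetical stand-in for arrow struct data.
type Row = Vec<(String, i64)>;

/// Validate that every requested column exists, then return the row
/// unchanged. Columns that were not requested are neither checked nor dropped.
fn evaluate_struct(row: Row, requested: &[&str]) -> Result<Row, String> {
    for name in requested {
        if !row.iter().any(|(n, _)| n.as_str() == *name) {
            return Err(format!("Missing struct field: {}", name));
        }
    }
    Ok(row)
}

/// Toy by-name extraction: pick the requested columns out of the row,
/// in the requested order. Extra columns are simply never looked at.
fn extract(row: &Row, requested: &[&str]) -> Vec<i64> {
    requested
        .iter()
        .filter_map(|name| {
            row.iter()
                .find(|(n, _)| n.as_str() == *name)
                .map(|(_, v)| *v)
        })
        .collect()
}
```

With a row holding `a` and `b` and a request for only `a`, `evaluate_struct` hands back both columns, and a later by-name `extract` still sees just `a`. That is why the approach is enough for metadata reads, while a final fix-up step that consumes the whole struct would still see `b`.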
1 parent 4b25f40 commit ed2b80b

1 file changed

+23
-10
lines changed

kernel/src/engine/arrow_expression.rs

+23-10
@@ -168,7 +168,9 @@ fn make_arrow_error(s: String) -> Error {
 /// Ensure a kernel data type matches an arrow data type. This only ensures that the actual "type"
 /// is the same, but does so recursively into structs, and ensures lists and maps have the correct
 /// associated types as well. This returns an `Ok(())` if the types are compatible, or an error if
-/// the types do not match.
+/// the types do not match. If there is a `struct` type included, we only ensure that the named
+/// fields that the kernel is asking for exist, and that for those fields the types
+/// match. Un-selected fields are ignored.
 fn ensure_data_types(kernel_type: &DataType, arrow_type: &ArrowDataType) -> DeltaResult<()> {
     match (kernel_type, arrow_type) {
         (DataType::Primitive(_), _) if arrow_type.is_primitive() => Ok(()),
@@ -215,17 +217,28 @@ fn ensure_data_types(kernel_type: &DataType, arrow_type: &ArrowDataType) -> Delt
             }
         }
         (DataType::Struct(kernel_fields), ArrowDataType::Struct(arrow_fields)) => {
-            require!(
-                kernel_fields.fields.len() == arrow_fields.len(),
-                make_arrow_error(format!(
-                    "Struct types have different numbers of fields. Expected {}, got {}",
-                    kernel_fields.fields.len(),
-                    arrow_fields.len()
-                ))
-            );
-            for (kernel_field, arrow_field) in kernel_fields.fields().zip(arrow_fields.iter()) {
+            // build a list of kernel fields that matches the order of the arrow fields
+            let mapped_fields = arrow_fields
+                .iter()
+                .flat_map(|f| kernel_fields.fields.get(f.name()));
+
+            // keep track of how many fields we matched up
+            let mut found_fields = 0;
+            // ensure that for the fields that we found, the types match
+            for (kernel_field, arrow_field) in mapped_fields.zip(arrow_fields) {
                 ensure_data_types(&kernel_field.data_type, arrow_field.data_type())?;
+                found_fields += 1;
             }
+
+            // require that we found the number of fields that we requested.
+            require!(kernel_fields.fields.len() == found_fields, {
+                let kernel_field_names = kernel_fields.fields.keys().join(", ");
+                let arrow_field_names = arrow_fields.iter().map(|f| f.name()).join(", ");
+                make_arrow_error(format!(
+                    "Missing Struct fields. Requested: {}, found: {}",
+                    kernel_field_names, arrow_field_names,
+                ))
+            });
             Ok(())
         }
         _ => Err(make_arrow_error(format!(
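
The subset check can also be sketched with a lookup by name, so each requested field is compared against the arrow field of the same name regardless of position. This is a simplified illustration, not the kernel's implementation: `ToyType`, `ensure_types`, and `ensure_requested_fields` are hypothetical stand-ins for the kernel and arrow types. One observation from reading the diff, offered tentatively: zipping the filtered `mapped_fields` iterator against the full `arrow_fields` list appears to pair fields correctly only when every unrequested arrow field comes after the requested ones. The sketch below avoids relying on that ordering.

```rust
use std::collections::HashMap;

// Toy stand-in for the kernel/arrow type enums (hypothetical, illustration only).
#[derive(Debug, Clone, PartialEq)]
enum ToyType {
    Int,
    Str,
    Struct(Vec<(String, ToyType)>),
}

/// Check that `got` is compatible with `want`, recursing into structs.
fn ensure_types(want: &ToyType, got: &ToyType) -> Result<(), String> {
    match (want, got) {
        (ToyType::Struct(w), ToyType::Struct(g)) => ensure_requested_fields(w, g),
        (w, g) if w == g => Ok(()),
        (w, g) => Err(format!("Type mismatch: expected {:?}, got {:?}", w, g)),
    }
}

/// Check that every requested field exists in `actual` with a matching type.
/// Fields present only in `actual` are ignored.
fn ensure_requested_fields(
    requested: &[(String, ToyType)],
    actual: &[(String, ToyType)],
) -> Result<(), String> {
    // Index the actual fields by name so the check does not depend on field order.
    let by_name: HashMap<&str, &ToyType> =
        actual.iter().map(|(n, t)| (n.as_str(), t)).collect();

    for (name, want) in requested {
        match by_name.get(name.as_str()) {
            Some(got) => ensure_types(want, *got)?,
            None => {
                let found: Vec<&str> = actual.iter().map(|(n, _)| n.as_str()).collect();
                return Err(format!(
                    "Missing struct field. Requested: {}, found: {}",
                    name,
                    found.join(", ")
                ));
            }
        }
    }
    Ok(())
}
```

Requesting `a: Int` from data laid out as `b: Str, a: Int` passes here because `a` is found by name. Requesting a field that is absent, or one whose type differs, returns an error.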
