-
Notifications
You must be signed in to change notification settings - Fork 47
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
don't error if more fields exist than expected in a struct expression #267
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
lgtm little nit and for now DAT test seems fine but unit test maybe good follow up
)) | ||
); | ||
for (kernel_field, arrow_field) in kernel_fields.fields().zip(arrow_fields.iter()) { | ||
// build a list of kernel fields that matches the order of the arrow fields |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: can we update docstring on ensure_data_types
above (can't comment on it since it isn't in the diff); just maybe a more rigorous definition of 'matching' saying that we will check kernel_fields is a subset?
// keep track of how many fields we matched up | ||
let mut found_fields = 0; | ||
// ensure that for the fields that we found, the types match | ||
for (kernel_field, arrow_field) in mapped_fields.zip(arrow_fields) { | ||
ensure_data_types(&kernel_field.data_type, arrow_field.data_type())?; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we originally had this function returning the actual schema to use... but dropped that because it wasn't actually changing the schema. Ironically we may need to reinstate that control flow? That way, unexpected struct fields just get ignored because they're not in the returned schema?
(but agree that can be a follow-up after this PR fixes the immediate DAT issue)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yep, exactly. I started writing that for this PR and realized that munging arrow structs is... complicated, so decided to just do this quick fix.
We will need to make a deeper PR that will have to pass the actual <dyn Array>
rather than just the schema, and then if the struct is having only a subset of fields selected, deconstruct each element, and put it back together without the dropped fields, and then return a new array to return as the result.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Wait a minute... shouldn't we be filtering the read schema so it never even gets fetched from disk? Why would we need to filter arrow data at all?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Great point! The issue is because we don't push down selection into leaf fields, and only support selection on the top level fields. Probably the right thing to do is properly push that down, and then we don't have to filter the arrow data, as you note, and this can probably get more simple again...
This allows us to evaluate a
struct
expression where the base data has more fields than the specified schema.This is an alternative to #264, and matches more closely with the spec which states:
NB: This does NOT actually drop the data from the returned expression, it just does no validation on columns that aren't specified. So if the parquet files contains:
and kernel passes a struct schema like
{"name": "a", "type": "int"}
, the returned data will be:This works for metadata since we can then call
extract
and things will "just work", but might not be good enough for when we do final "fix-up" via expressions. However, actually dropping the column from the arrow data requires a lot more code change, so I'm proposing we do this for now to fix things like #261, and then figure out the best way to "do the right thing" when a subset of fields are specified in a struct expression.Verified that things work as expected with this change:
Closes #261