-
Notifications
You must be signed in to change notification settings - Fork 1.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
DataFusion does not validate that Substrait NamedScan schemas match registered tables #12223
Comments
I'm in the process of preparing a PR for this issue. |
From conversations with @Blizzara, the requirement that the DataFusion and Substrait schemas match exactly is stricter than it needs to be. In practice, if the Substrait schema is a subset of the DataFusion schema, the consumer can adapt the plan as it consumes it to make it match the shape expected by Substrait. For example, if DataFusion has a schema |
* fix: producer did not emit base_schema struct field for ReadRel Substrait plans are not valid without this, and it is generally useful for round trip testing * feat: include field_qualifier param for from_substrait_named_struct * feat: verify that Substrait and DataFusion agree on NamedScan schemas If the schema registered with DataFusion and the schema as given by the Substrait NamedScan do not have the same names and types, DataFusion should reject it * test: update existing substrait test + substrait validation test * added substrait_validation test * extracted useful test utilities The utils::test::TestSchemaCollector::generate_context_from_plan function can be used to dynamically generate a SessionContext from a Substrait plan, which will include the schemas for NamedTables as given in the Substrait plan. This helps us avoid the issue of DataFusion test schemas and Substrait plan schemas not being in sync. * feat: expose from_substrait_named_struct * refactor: remove unused imports * docs: add missing licenses * refactor: deal with unused code warnings * remove optional qualifier from from_substrait_named_struct * return DFSchema from from_substrait_named_struct * one must imagine clippy happy * accidental blah * loosen the validation for schemas allow cases where the Substrait schema is a subset of the DataFusion schema * minor doc tweaks * update test data to deal with case issues in tests * fix error message * improve readability of field compatability check * make TestSchemaCollector more flexible * fix doc typo Co-authored-by: Arttu <[email protected]> * remove unecessary TODO * handle ReadRel projection on top of mismatched schema --------- Co-authored-by: Arttu <[email protected]>
Describe the bug
As written, the test assertion in
datafusion/datafusion/substrait/tests/cases/logical_plans.rs
Lines 46 to 50 in 1fce2a9
should fail because DataFusion registers the
data
table with 5 fields [a, b, c, d, e] but the schema for the table in the Substrait plan only has a single field [D].To Reproduce
No response
Expected behavior
DataFusion should reject Substrait plans in which NamedScan schemas do not match the corresponding table that is is registered.
Additional context
Generally speaking, if the plan consumer (DataFusion) and the producer do not agree on column names and types, it is unlikely that execution will be meaningful.
The text was updated successfully, but these errors were encountered: