Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

DataFusion does not validate that Substrait NamedScan schemas match registered tables #12223

Closed
vbarua opened this issue Aug 28, 2024 · 2 comments · Fixed by #12245
Closed

DataFusion does not validate that Substrait NamedScan schemas match registered tables #12223

vbarua opened this issue Aug 28, 2024 · 2 comments · Fixed by #12245
Assignees
Labels
bug Something isn't working

Comments

@vbarua
Copy link
Contributor

vbarua commented Aug 28, 2024

Describe the bug

As written, the test assertion in

assert_eq!(
format!("{}", plan),
"Projection: NOT DATA.a AS EXPR$0\
\n TableScan: DATA projection=[a, b, c, d, e, f]"
);

should fail because DataFusion registers the data table with 5 fields [a, b, c, d, e] but the schema for the table in the Substrait plan only has a single field [D].

To Reproduce

No response

Expected behavior

DataFusion should reject Substrait plans in which NamedScan schemas do not match the corresponding table that is is registered.

Additional context

Generally speaking, if the plan consumer (DataFusion) and the producer do not agree on column names and types, it is unlikely that execution will be meaningful.

@vbarua vbarua added the bug Something isn't working label Aug 28, 2024
@vbarua
Copy link
Contributor Author

vbarua commented Aug 28, 2024

I'm in the process of preparing a PR for this issue.

@vbarua
Copy link
Contributor Author

vbarua commented Sep 4, 2024

From conversations with @Blizzara, the requirement that the DataFusion and Substrait schemas match exactly is stricter than it needs to be. In practice, if the Substrait schema is a subset of the DataFusion schema, the consumer can adapt the plan as it consumes it to make it match the shape expected by Substrait.

For example, if DataFusion has a schema [a, b, c] for table t, and Substrait has a schema [b, c] for table t, as DataFusion consumes the plan it may add a project for fields [b,c] immediately after the read from table t to bring it in line with what the Substrait plan expects.

alamb pushed a commit that referenced this issue Sep 10, 2024
* fix: producer did not emit base_schema struct field for ReadRel

Substrait plans are not valid without this, and it is generally useful
for round trip testing

* feat: include field_qualifier param for from_substrait_named_struct

* feat: verify that Substrait and DataFusion agree on NamedScan schemas

If the schema registered with DataFusion and the schema as given by the
Substrait NamedScan do not have the same names and types, DataFusion
should reject it

* test: update existing substrait test + substrait validation test

* added substrait_validation test
* extracted useful test utilities

The utils::test::TestSchemaCollector::generate_context_from_plan
function can be used to dynamically generate a SessionContext from a
Substrait plan, which will include the schemas for NamedTables as given
in the Substrait plan.

This helps us avoid the issue of DataFusion test schemas and Substrait
plan schemas not being in sync.

* feat: expose from_substrait_named_struct

* refactor: remove unused imports

* docs: add missing licenses

* refactor: deal with unused code warnings

* remove optional qualifier from from_substrait_named_struct

* return DFSchema from from_substrait_named_struct

* one must imagine clippy happy

* accidental blah

* loosen the validation for schemas

allow cases where the Substrait schema is a subset of the DataFusion
schema

* minor doc tweaks

* update test data to deal with case issues in tests

* fix error message

* improve readability of field compatability check

* make TestSchemaCollector more flexible

* fix doc typo

Co-authored-by: Arttu <[email protected]>

* remove unecessary TODO

* handle ReadRel projection on top of mismatched schema

---------

Co-authored-by: Arttu <[email protected]>
This issue was closed.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

Successfully merging a pull request may close this issue.

1 participant