Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

validate and adjust Substrait NamedTable schemas (#12223) #12245

Merged
merged 21 commits into from
Sep 10, 2024

Conversation

vbarua
Copy link
Contributor

@vbarua vbarua commented Aug 29, 2024

Closes #12223

Rationale for this change

When DataFusion consumes a Substrait plan, if the schema it has a for a table is incompatible with what is given/expected by the Substrait plan it should reject the plan. A Substrait schema is compatible with a DataFusion schema if the Substrait schema is a subset of the DataFusion schema.

Attempting to execute a plan when DataFusion and Substrait disagree on the schema is unlikely to lead to meaningful results, so in cases where the schemas are not compatible DataFusion should reject the plan.

What changes are included in this PR?

This PR:

  • Adds a validation to the the Substrait plan consumer for NamedScans that rejects Substrait plans when the schema in Substrait is incompatible with that in DataFusion.
  • Updates existing tests that fail because of the above validation.
  • Updates the Substrait plan producer to include base schemas for ReadRels, which helps with round trip testing (this also caused test failures with the validation).
  • Updates from_substrait_named_struct to return a DFSchema instead of a DFSchemaRef to aid with re-use.
  • Makes the from_substrait_named_struct public as it is generally useful, and also assists with testing.

Are these changes tested?

Additional tests were added in substrait_validations.rs for the validation functionality.

Existing tests failed because of the validation (correctly) and where updated to test their original functionality.

Are there any user-facing changes?

The added validation could potentially cause calls to from_substrait_plan that worked in prior versions to fail. The plans that fail though are unlikely to have been meaningful given that they have major field differences between DataFusion and Substrait.

Substrait plans are not valid without this, and it is generally useful
for round trip testing
If the schema registered with DataFusion and the schema as given by the
Substrait NamedScan do not have the same names and types, DataFusion
should reject it
* added substrait_validation test
* extracted useful test utilities

The utils::test::TestSchemaCollector::generate_context_from_plan
function can be used to dynamically generate a SessionContext from a
Substrait plan, which will include the schemas for NamedTables as given
in the Substrait plan.

This helps us avoid the issue of DataFusion test schemas and Substrait
plan schemas not being in sync.
datafusion/substrait/src/logical_plan/consumer.rs Outdated Show resolved Hide resolved
\n TableScan: nation projection=[a, b, c, d, e, f]"
"Projection: nation.n_name\
\n Filter: contains(nation.n_name, Utf8(\"IA\"))\
\n TableScan: nation projection=[n_nationkey, n_name, n_regionkey, n_comment]"
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can see in this change how the DataFusion and Substrait plans had different schemas.


pub(crate) struct TestSchemaCollector {
ctx: SessionContext,
}
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This collector is based on a similar bit of tooling in Isthmus, a Substrait library used for integrating with Apache Calcite.

Copy link
Contributor

@Blizzara Blizzara left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The addition in producer is great, and I feel like some validation is fine to add for consumer - but I'd prefer it to be more robust (ie. accept plans where the DF schema is a superset of the Substrait schema)

datafusion/substrait/src/logical_plan/consumer.rs Outdated Show resolved Hide resolved
datafusion/substrait/src/logical_plan/consumer.rs Outdated Show resolved Hide resolved
datafusion/substrait/src/logical_plan/consumer.rs Outdated Show resolved Hide resolved
datafusion/substrait/tests/utils.rs Outdated Show resolved Hide resolved
datafusion/substrait/tests/utils.rs Outdated Show resolved Hide resolved
@vbarua vbarua changed the title validate Substrait NamedScan schemas (#12223) validate Substrait NamedTable schemas (#12223) Sep 4, 2024
@vbarua vbarua changed the title validate Substrait NamedTable schemas (#12223) validate and adjust Substrait NamedTable schemas (#12223) Sep 4, 2024
substrait_err!(
"Field '{}' is nullable in the Substrait schema but not nullable in the DataFusion schema.",
substrait_field.name()
)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Because of what's mentioned in the TODO, this code never fires. I'm opting not to fix that as part of this PR.

datafusion/substrait/src/logical_plan/consumer.rs Outdated Show resolved Hide resolved
@@ -1,2 +1,2 @@
n_nationkey,n_name,n_regionkey,n_comment
N_NATIONKEY,N_NAME,N_REGIONKEY,N_COMMENT
0,ALGERIA,0, haggle. carefully final deposits detect slyly agai
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Blizzara this is based on what you did in this PR to fix the tests that fail due to case difference in fields name.

.replace_qualifier(table_reference.clone());

let t = ctx.table(table_reference.clone()).await?;
let t = ensure_schema_compatability(t, substrait_schema)?;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we should maybe add this for the local file reads below as well (I didn't have it in my branch yet as I didn't need it immediately). Somewhat annoyingly it'll cause the rest of the TPCH tests to fail...

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this worth doing as one big swoop in this PR, or would it make sense to do it as a followup?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fine by me to do it separately!

.replace_qualifier(table_reference.clone());

let t = ctx.table(table_reference.clone()).await?;
let t = ensure_schema_compatability(t, substrait_schema)?;
let t = t.into_optimized_plan()?;
extract_projection(t, &read.projection)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I started wondering if the ensure_schema_compatability can now conflict with extract_projection - and I think it can, either by failing if DF doesn't optimize the select into a projection, or if DF does, then by overriding the select's projection with the Substrait projection...

I guess a fix would be something like in extract_projection, if there is an existing scan.projection, then apply columnIndices on it first

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That did indeed cause problems. It triggered an error of unexpected plan for table in extract_projection.

I added some code for this case in d571eb2 (#12245). Is something like this what you had in mind?

I am noticing that the plans generated look a little weird/bad with a lot of redundant projects

                "Projection: DATA.a, DATA.b\
                \n  Projection: DATA.a, DATA.b\
                \n    Projection: DATA.a, DATA.b, DATA.c\
                \n      TableScan: DATA projection=[b, a, c]"

but they are at least correct for now.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is something like this what you had in mind?

What I had in mind was manipulating the scan.projection directly - kinda like it is alreadydone in extract_projection, we could do it that way also for ensure_schema_compatibility. That way there wouldn't be additional Projections, and maybe it'd be a bit more efficient if the current setup doesn't push the column-pruning into the scan level (though I'm a bit surprised they don't get optimized anyways).

But I don't think it's necessary - the way you've done it here seems correct, and we (I?) can do the project-mangling as a followup, unless you want to take a stab at it :)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think project unmangling would be better as a follow-up. Possible as part of #12347 because supporting remaps is going to add yet another layer of Projects 😅

@Blizzara
Copy link
Contributor

Blizzara commented Sep 6, 2024

I think this is good by me - @alamb would you (or someone else) be able to do the official review, please? :)

Only note I have is that I think this change makes Substrait consumer case-sensitive wrt column names, which it wasn't before. I don't strictly have opinion on whether that's a good or a bad thing. I think for my usecase it would be fine, but dunno about others.

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for the contribution @vbarua and for the review @Blizzara 🙏

I kicked off the CI tests and quickly skimmed the PR . Once they pass I think this PR is ready to go from my perspective

cc @waynexia and @Lordworms

@alamb
Copy link
Contributor

alamb commented Sep 10, 2024

Thanks again everyone

@alamb alamb merged commit 41c5f4e into apache:main Sep 10, 2024
25 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

DataFusion does not validate that Substrait NamedScan schemas match registered tables
3 participants