Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Implement TPCH substrait integration teset, support tpch_1 #10842

Merged
merged 2 commits into from
Jun 11, 2024

Conversation

Lordworms
Copy link
Contributor

Which issue does this PR close?

part of #10710

Closes #.

Rationale for this change

What changes are included in this PR?

Are these changes tested?

Are there any user-facing changes?

@@ -1253,6 +1335,8 @@ fn from_substrait_type(
r#type::Kind::Struct(s) => Ok(DataType::Struct(from_substrait_struct_type(
s, dfs_names, name_idx,
)?)),
r#type::Kind::Varchar(_) => Ok(DataType::Utf8),
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Currently directly use Utf8

@Lordworms
Copy link
Contributor Author

Lordworms commented Jun 9, 2024

another problem here is that the json file seems to do aggregation after projection
image

so in the actual logical plan generated from this proto, we do aggregate after projection

Also we do not have the alias support, so we would recalculate the elements in Aggregate again, which causes the logical plan like this

image

we have done twice calculations in both projection and aggregation.

but since we just generated a plan from a JSON file. I don't know whether we need to integrate the optimizer in our test here

@@ -0,0 +1,2 @@
l_orderkey,l_partkey,l_suppkey,l_linenumber,l_quantity,l_extendedprice,l_discount,l_tax,l_returnflag,l_linestatus,l_shipdate,l_commitdate,l_receiptdate,l_shipinstruct,l_shipmode,l_comment
1,1,1,1,17,21168.23,0.04,0.02,'N','O','1996-03-13','1996-02-12','1996-03-22','DELIVER IN PERSON','TRUCK','egular courts above the'
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

in order to minimize the repo size, just upload a CSV version of lineitem table

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think a single line row is fine 👍

@Lordworms Lordworms marked this pull request as ready for review June 9, 2024 22:23
Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you @Lordworms -- this is very cool.

I have some readability / organization / documentation suggestions I think would help this PR but we could also make them as a follow on PR too

cc @waynexia and @Blizzara

Once we merge this PR what do you think about creating tickets to track supporting the other queries in TPCH? Or maybe we can just use #10710 🤔

@@ -0,0 +1,2 @@
l_orderkey,l_partkey,l_suppkey,l_linenumber,l_quantity,l_extendedprice,l_discount,l_tax,l_returnflag,l_linestatus,l_shipdate,l_commitdate,l_receiptdate,l_shipinstruct,l_shipmode,l_comment
1,1,1,1,17,21168.23,0.04,0.02,'N','O','1996-03-13','1996-02-12','1996-03-22','DELIVER IN PERSON','TRUCK','egular courts above the'
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think a single line row is fine 👍

@@ -19,3 +19,4 @@ mod logical_plans;
mod roundtrip_logical_plan;
mod roundtrip_physical_plan;
mod serialize;
mod tpch;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What do you think about renaming this module to consumer_integration to make it clearer that this is an integration test of existing substrait plans?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sure

@@ -0,0 +1,810 @@
{
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you please

  1. Move this file into a directory that makes it clearer where it came from. Perhaps datafusion/substrait/tests/testdata/tpch_substrait_plans/query_1.json
  2. add a README.md file in datafusion/substrait/tests/testdata/tpch_substrait_plans that explains the files came from https://github.com/substrait-io/consumer-testing/tree/main/substrait_consumer/tests/integration/queries/tpch_substrait_plans?

// specific language governing permissions and limitations
// under the License.

//! tests contains in <https://github.com/substrait-io/consumer-testing/tree/main/substrait_consumer/tests/integration/queries/tpch_substrait_plans>
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is very cool -- thank you . I think the context of this PR may be lost after merge so some more documentation might help

Something like

Suggested change
//! tests contains in <https://github.com/substrait-io/consumer-testing/tree/main/substrait_consumer/tests/integration/queries/tpch_substrait_plans>
//! TPCH `substrait_consumer` tests
//!
//! This module tests that substrait plans as json encoded protobuf can be
//! correctly read as DataFusion plans.
//!
//! The input data comes from <https://github.com/substrait-io/consumer-testing/tree/main/substrait_consumer/tests/integration/queries/tpch_substrait_plans>

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

//! TPCH substrait_consumer tests
//!
//! This module tests that substrait plans as json encoded protobuf can be
//! correctly read as DataFusion plans.
//!
//! The input data comes from https://github.com/substrait-io/consumer-testing/tree/main/substrait_consumer/tests/integration/queries/tpch_substrait_plans

Got it

fn extract_filename(name: &str) -> Option<String> {
let corrected_url =
if name.starts_with("file://") && !name.starts_with("file:///") {
name.replacen("file://", "file:///", 1)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this makes all URLs absolute (is that intended)?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the file name in those json files are all starts with FILE:// which makes it impossible for url librarary to parse so I did the transformation, or we could try direct string parse otherwise

}

// we could use the file name to check the original table provider
// TODO: currently does not support multiple local files
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we file at ticket for this feature?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Filed #10864

.iter()
.map(|item| item.field as usize)
.collect();
match &t {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think if you matched on t you could avoid the scan.clone() later on

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

my bad, gotta fix it

Copy link
Contributor

@Blizzara Blizzara left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, this makes sense to me - mostly left some nits!

@@ -22,6 +22,9 @@ use datafusion::arrow::datatypes::{
use datafusion::common::{
not_impl_err, substrait_datafusion_err, substrait_err, DFSchema, DFSchemaRef,
};
use substrait::proto::expression::literal::IntervalDayToSecond;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: can we combine/move these into the other substrait imports below?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sure


if lf.items.len() > 1 || filename.is_none() {
return not_impl_err!(
"Only NamedTable and VirtualTable reads are supported"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
"Only NamedTable and VirtualTable reads are supported"
"Only single file reads are supported"

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

got it!

let table_reference = TableReference::Bare { table: name.into() };
let t = ctx.table(table_reference).await?;
let t = t.into_optimized_plan()?;
match &read.projection {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this logic same as for NamedTable? If so, maybe extract into a function / reuse in some way?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure

Comment on lines 645 to 647
_ => {
not_impl_err!("Only NamedTable and VirtualTable reads are supported")
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
_ => {
not_impl_err!("Only NamedTable and VirtualTable reads are supported")
}
_ => not_impl_err!("Unsupported ReadType: {:?}", &read.as_ref().read_type),

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

got it

@@ -810,14 +885,21 @@ pub async fn from_substrait_agg_func(
f.function_reference
);
};

let function_name = function_name.split(':').next().unwrap_or(function_name);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess this is same idea as in

let name = match name.rsplit_once(':') {
? Might be worth consolidating those into some helper function as well

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

absolutely


async fn create_context() -> datafusion::common::Result<SessionContext> {
let ctx = SessionContext::new();
ctx.register_csv(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the substrait plan indicates the files would be parquet, I wonder if that'll cause trouble now or later given we use csv (and it makes sense to use CSV here, I think)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes, the only concern for using csv file is that we could not parse if there are some partitioned parquet filenames in the json file. But right now I think it is fine.

@Lordworms
Copy link
Contributor Author

I have resolved all the reviews provided. Thanks so much for your help, really appreciate it. @alamb @Blizzara

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks great -- thank you so much @Lordworms 🙏

}

// we could use the file name to check the original table provider
// TODO: currently does not support multiple local files
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Filed #10864

@alamb alamb changed the title support tpch_1 consumer_producer_test Implement TPCH substrait integration teset, support tpch_1 Jun 11, 2024
@alamb alamb merged commit 76f5110 into apache:main Jun 11, 2024
26 checks passed
findepi pushed a commit to findepi/datafusion that referenced this pull request Jul 16, 2024
)

* support tpch_1 consumer_producer_test

* refactor and optimize code
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants