Skip to content

Conversation

@houqp
Copy link
Member

@houqp houqp commented Feb 20, 2021

  • Added contains method for arrow::datatypes::Schema and
    arrow::datatypes::Field
  • Relax batch schema validation using contains check when creating a
    MemTable in datafusion

…rged schema

* Added `contains` method for `arrow::datatypes::Schema` and
`arrow::datatypes::Field`
* Relax batch schema validation using `contains` check when creating a
MemTable in datafusion
@github-actions
Copy link

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a cool change @houqp -- thanks.

I am not sure if DataFusion will work with different schemas without some additional modifications. Specifically, when the schemas are actually subsets of each other with different numbers of columns -- here is what I came up with: https://github.com/influxdata/influxdb_iox/blob/main/query/src/provider/adapter.rs#L44-L70

I think the contains logic makes sense, and is actually quite interesting -- in IOx, we have similar code to effectively merge schemas. This implements compatible definitions of merge.


/// Check to see if `self` is a superset of `other` schema Here are the comparision rules:
///
/// * for every field `f` in other, the field in self with corresponding index should be a
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍 thank you for the clear comments

@houqp
Copy link
Member Author

houqp commented Feb 21, 2021

@alamb good call, I only assumed that logically makes sense, but never checked to see if it's actually implemented in datafusion myself. I have pushed a commit to check for fields count in the contains method for now. My original use-case was mainly to use a schema with extra metadata for a memtable, which is the case for if schema and record batches are created from our parquet crate.

I am a fan of your SchemaAdapterStream implementation, looks like it would be useful to include the core of that logic in datafusion as well.

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this looks great @houqp - thank you. @nevi-me / @jorgecarleitao / @andygrove any comments?

],
)?;

match MemTable::try_new(schema2, vec![vec![batch]]) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

@alamb alamb closed this in f7cf157 Feb 28, 2021
Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry for the delay in merging @houqp -- it has been a busy week!

@houqp houqp deleted the qp_schema branch February 28, 2021 18:44
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants