-
Notifications
You must be signed in to change notification settings - Fork 1.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Implement GetIndexedField for map-typed columns #7825
Conversation
22aa582
to
545548d
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you @swgillespie - this looks very nice.
I also played around with the file locally with datafusion-cli
and it worked great.
Thank you 🙏
❯ describe 'parquet_map.parquet';
+-------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-------------+
| column_name | data_type | is_nullable |
+-------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-------------+
| ints | Map(Field { name: "entries", data_type: Struct([Field { name: "key", data_type: Utf8, nullable: false, dict_id: 0, dict_is_ordered: false, metadata: {} }, Field { name: "value", data_type: Int64, nullable: false, dict_id: 0, dict_is_ordered: false, metadata: {} }]), nullable: false, dict_id: 0, dict_is_ordered: false, metadata: {} }, false) | NO |
| strings | Map(Field { name: "entries", data_type: Struct([Field { name: "key", data_type: Utf8, nullable: false, dict_id: 0, dict_is_ordered: false, metadata: {} }, Field { name: "value", data_type: Utf8, nullable: false, dict_id: 0, dict_is_ordered: false, metadata: {} }]), nullable: false, dict_id: 0, dict_is_ordered: false, metadata: {} }, false) | NO |
| timestamp | Utf8 | NO |
+-------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-------------+
3 rows in set. Query took 0.018 seconds.
(DataType::Map(fields, _), _) => { | ||
match fields.data_type() { | ||
DataType::Struct(fields) if fields.len() == 2 => { | ||
// Arrow's MapArray is essentially a ListArray of structs with two columns. They are |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍
(DataType::Map(_, _), ScalarValue::Utf8(Some(k))) => { | ||
let map_array = as_map_array(array.as_ref())?; | ||
let key_scalar = Scalar::new(StringArray::from(vec![k.clone()])); | ||
let keys = arrow_ord::cmp::eq(&key_scalar, map_array.keys())?; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think arrow::compute::eq
is probably the more standard way to compare two arrays (though they call the same underlying kernels). That would let you avoid having to add the (newly explict) arrow-ord
dependency.
@swgillespie I think you could avoid adding the arrow-ord dependency, but I don't think that is critical (as arrow-ord is already a transitive dependency via arrow anyways). Let me know if you are willing to make this change |
@alamb no problem - done! |
@@ -183,6 +186,14 @@ impl PhysicalExpr for GetIndexedFieldExpr { | |||
let array = self.arg.evaluate(batch)?.into_array(batch.num_rows()); | |||
match &self.field { | |||
GetFieldAccessExpr::NamedStructField{name} => match (array.data_type(), name) { | |||
(DataType::Map(_, _), ScalarValue::Utf8(Some(k))) => { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👨🍳 👌
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Really nice @swgillespie -- thank you 🙏
Which issue does this PR close?
Closes #7824.
Rationale for this change
It's impossible to write a logical plan to query a column from a Parquet data source whose type is
Map
. TheMap
type is not explicitly supported byGetIndexedField
. Maps are a useful column, are already supported by both Arrow and Parquet, and it makes sense to support it here.What changes are included in this PR?
This commit extends the
NamedStructField
FieldAccess type to understand theMap
data type. I chose this because the DataFusion SQL frontend parses the expressionx['y']
into aNamedStructField
, which is a reasonable thing to do if we require that the argument tox
be a constant scalar (which it is, in this implementation).The Arrow Map array is essentially a list of structs, where each struct is a two-field struct. The first field of the struct is the key, and the second field of the struct is the value. Arrow traditionally names these
key
andvalue
, though this implementation does not assume what they are named and instead assumes that the second column is thevalue
column and the first is thekey
column, which is the same assumption made by the Arrow implementation we use.To execute a mapped index access, we first scan the key column to identify entries that match the key that we are indexing, and again to gather the values corresponding to the keys that were selected.
Are these changes tested?
This PR adds a new test,
map.slt
, which includes a Parquet file with twoMap
columns (one mapping strings to strings, the other mapping strings to ints) and writes some queries that use them.Are there any user-facing changes?
This change allows for the
GetIndexedField
type to now be usable with columns of typeMap
, which was not possible before.