Implement GetIndexedField for map-typed columns #7825

swgillespie · 2023-10-14T19:38:42Z

Which issue does this PR close?

Closes #7824.

Rationale for this change

It is currently impossible to write a logical plan that queries a column of type Map from a Parquet data source, because GetIndexedField does not explicitly support the Map type. Maps are a useful column type and are already supported by both Arrow and Parquet, so it makes sense to support them here.

What changes are included in this PR?

This commit extends the NamedStructField FieldAccess type to understand the Map data type. I chose this because the DataFusion SQL frontend parses the expression x['y'] into a NamedStructField, which is a reasonable thing to do if we require that the index applied to x be a constant scalar (which it is, in this implementation).

The Arrow Map array is essentially a list of structs, where each struct has exactly two fields: the first is the key and the second is the value. Arrow conventionally names these fields key and value, but this implementation does not rely on the names. It assumes only that the first column is the key column and the second is the value column, which is the same assumption made by the Arrow implementation we use.
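As a rough illustration of the layout described above, here is a minimal sketch using only the Rust standard library. It is a toy model, not Arrow's MapArray: the names ToyMapColumn, from_rows and entries are hypothetical, and the key and value types are fixed to strings and i64 for brevity. It assumes the usual list-style encoding in which an offsets buffer marks where each row's entries start and end in the two child columns.

```rust
/// Toy stand-in for a map column with string keys and i64 values.
/// Row `i` owns the entries in `offsets[i]..offsets[i + 1]`.
/// (Hypothetical type for illustration; not part of Arrow or DataFusion.)
struct ToyMapColumn {
    offsets: Vec<usize>,
    keys: Vec<String>, // first child column: the keys
    values: Vec<i64>,  // second child column: the values
}

impl ToyMapColumn {
    /// Flatten per-row key/value pairs into the offsets + child-columns layout.
    fn from_rows(rows: &[Vec<(&str, i64)>]) -> Self {
        let mut offsets = vec![0];
        let mut keys = Vec::new();
        let mut values = Vec::new();
        for row in rows {
            for (k, v) in row {
                keys.push(k.to_string());
                values.push(*v);
            }
            // Each row ends where the flattened child columns currently end.
            offsets.push(keys.len());
        }
        ToyMapColumn { offsets, keys, values }
    }

    /// Return the (key, value) entries belonging to one row.
    fn entries(&self, row: usize) -> Vec<(&str, i64)> {
        let (start, end) = (self.offsets[row], self.offsets[row + 1]);
        (start..end)
            .map(|i| (self.keys[i].as_str(), self.values[i]))
            .collect()
    }
}
```

Note that the sketch identifies the key and value columns purely by position, mirroring the point above that the field names are not relied on.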

To execute a map index access, we first scan the key column to identify the entries that match the key being looked up, and then scan again to gather the values corresponding to the matching keys.
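The two passes described above can be sketched on plain slices as follows. This is a simplified model rather than the code in this PR: toy_map_get is a hypothetical helper, the offsets/keys/values slices stand in for the Arrow buffers, and two behaviours are assumptions made for the sketch only: a row with no matching key yields None, and if a row contains the key more than once the first match wins.

```rust
/// Look up `key` in every row of a toy map column.
/// `offsets[i]..offsets[i + 1]` is the entry range for row `i`.
/// (Hypothetical helper for illustration; not the DataFusion implementation.)
fn toy_map_get(
    offsets: &[usize],
    keys: &[&str],
    values: &[i64],
    key: &str,
) -> Vec<Option<i64>> {
    // Pass 1: compare every stored key against the lookup key, producing a mask.
    let mask: Vec<bool> = keys.iter().map(|k| *k == key).collect();

    // Pass 2: for each row, gather the value at the first matching entry.
    // Assumption: rows without a match produce None.
    offsets
        .windows(2)
        .map(|w| (w[0]..w[1]).find(|&i| mask[i]).map(|i| values[i]))
        .collect()
}
```

For example, with three rows {a: 1, b: 2}, {} and {a: 3}, the offsets are [0, 2, 2, 3], and looking up "a" gives one result per row, with the empty row producing None.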

Are these changes tested?

This PR adds a new test, map.slt, which includes a Parquet file with two Map columns (one mapping strings to strings, the other mapping strings to ints) and writes some queries that use them.

Are there any user-facing changes?

This change makes GetIndexedField usable with columns of type Map, which was not possible before.

alamb

Thank you @swgillespie - this looks very nice.

I also played around with the file locally with datafusion-cli and it worked great.

Thank you 🙏

❯ describe 'parquet_map.parquet';
+-------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-------------+
| column_name | data_type                                                                                                                                                                                                                                                                                                                                              | is_nullable |
+-------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-------------+
| ints        | Map(Field { name: "entries", data_type: Struct([Field { name: "key", data_type: Utf8, nullable: false, dict_id: 0, dict_is_ordered: false, metadata: {} }, Field { name: "value", data_type: Int64, nullable: false, dict_id: 0, dict_is_ordered: false, metadata: {} }]), nullable: false, dict_id: 0, dict_is_ordered: false, metadata: {} }, false) | NO          |
| strings     | Map(Field { name: "entries", data_type: Struct([Field { name: "key", data_type: Utf8, nullable: false, dict_id: 0, dict_is_ordered: false, metadata: {} }, Field { name: "value", data_type: Utf8, nullable: false, dict_id: 0, dict_is_ordered: false, metadata: {} }]), nullable: false, dict_id: 0, dict_is_ordered: false, metadata: {} }, false)  | NO          |
| timestamp   | Utf8                                                                                                                                                                                                                                                                                                                                                   | NO          |
+-------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-------------+
3 rows in set. Query took 0.018 seconds.

alamb · 2023-10-15T09:59:21Z

datafusion/expr/src/field_util.rs

+                    (DataType::Map(fields, _), _) => {
+                        match fields.data_type() {
+                            DataType::Struct(fields) if fields.len() == 2 => {
+                                // Arrow's MapArray is essentially a ListArray of structs with two columns. They are


alamb · 2023-10-15T10:07:25Z

datafusion/physical-expr/src/expressions/get_indexed_field.rs

+                (DataType::Map(_, _), ScalarValue::Utf8(Some(k))) => {
+                    let map_array = as_map_array(array.as_ref())?;
+                    let key_scalar = Scalar::new(StringArray::from(vec![k.clone()]));
+                    let keys = arrow_ord::cmp::eq(&key_scalar, map_array.keys())?;


I think arrow::compute::eq is probably the more standard way to compare two arrays (though they call the same underlying kernels). That would let you avoid having to add the (newly explicit) arrow-ord dependency.

alamb · 2023-10-15T10:12:11Z

@swgillespie I think you could avoid adding the arrow-ord dependency, but I don't think that is critical (as arrow-ord is already a transitive dependency via arrow anyway). Let me know if you are willing to make this change.

swgillespie · 2023-10-15T18:24:43Z

@alamb no problem - done!

alamb · 2023-10-16T14:28:54Z

datafusion/physical-expr/src/expressions/get_indexed_field.rs

@@ -183,6 +186,14 @@ impl PhysicalExpr for GetIndexedFieldExpr {
        let array = self.arg.evaluate(batch)?.into_array(batch.num_rows());
        match &self.field {
            GetFieldAccessExpr::NamedStructField{name} => match (array.data_type(), name) {
+                (DataType::Map(_, _), ScalarValue::Utf8(Some(k))) => {


👨‍🍳 👌

alamb

Really nice @swgillespie -- thank you 🙏

github-actions bot added the logical-expr (Logical plan and expressions), physical-expr (Physical Expressions), core (Core DataFusion crate), and sqllogictest (SQL Logic Tests (.slt)) labels Oct 14, 2023

Implement GetIndexedField for map-typed columns

545548d

swgillespie force-pushed the swgillespie/maps-3 branch from 22aa582 to 545548d October 14, 2023 19:43

alamb approved these changes Oct 15, 2023

Drop explicit dep on arrow-ord and use re-exported kernel

dd0b68d

alamb reviewed Oct 16, 2023

alamb approved these changes Oct 16, 2023

alamb merged commit e84b999 into apache:main Oct 16, 2023
23 checks passed

swgillespie deleted the swgillespie/maps-3 branch October 16, 2023 18:19

andygrove added the enhancement (New feature or request) label Nov 5, 2023

matthewgapp mentioned this pull request Jan 11, 2024

matt/feat/recursive ctes/config flag matthewgapp/arrow-datafusion#3

Closed
