
Support parquet statistics for struct columns #8334

Open
alamb opened this issue Nov 27, 2023 · 7 comments
Labels
enhancement New feature or request

Comments

@alamb
Contributor

alamb commented Nov 27, 2023

Is your feature request related to a problem or challenge?

While working on #8294, @tustvold noted that the statistics extraction code does not do the right thing with Structs that were written to parquet.

I think the use case is something like the following query:

```sql
SELECT *
FROM my_table
WHERE struct_column = struct('foo', 'bar')
```
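For context, here is a minimal sketch (with hypothetical types, not the DataFusion pruning API) of why such a predicate needs per-leaf statistics: a struct equality predicate can only skip a row group if at least one leaf field's `[min, max]` range excludes that field's literal.

```rust
// Sketch only: `LeafStats` and `can_skip_row_group` are illustrative,
// not existing DataFusion types.
struct LeafStats {
    min: &'static str,
    max: &'static str,
}

/// A row group can be skipped iff at least one leaf field's
/// [min, max] range excludes the literal bound to that field.
fn can_skip_row_group(leaves: &[(LeafStats, &str)]) -> bool {
    leaves
        .iter()
        .any(|(stats, literal)| *literal < stats.min || *literal > stats.max)
}

fn main() {
    // struct_column = struct('foo', 'bar') against a row group whose
    // second leaf only spans ["x", "z"]: 'bar' falls outside, so skip.
    let leaves = [
        (LeafStats { min: "a", max: "m" }, "foo"),
        (LeafStats { min: "x", max: "z" }, "bar"),
    ];
    assert!(can_skip_row_group(&leaves));

    // If every literal lies inside its leaf range, the group must be read.
    let leaves = [
        (LeafStats { min: "a", max: "m" }, "foo"),
        (LeafStats { min: "a", max: "z" }, "bar"),
    ];
    assert!(!can_skip_row_group(&leaves));
}
```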

Describe the solution you'd like

I would like the parquet statistics reading code to handle structs (I think).

Describe alternatives you've considered

No response

Additional context

No response

@alamb
Contributor Author

alamb commented Nov 27, 2023

It appears that statistics for such types are read as all nulls

@tustvold
Contributor

Yes, this is what I would expect in the absence of a column name collision. My greater concern is that the presence of the struct column will mess up the ordinals of the other columns

@alamb
Copy link
Contributor Author

alamb commented Nov 27, 2023

> Yes, this is what I would expect in the absence of a column name collision. My greater concern is that the presence of the struct column will mess up the ordinals of the other columns

I added tests for this case here: #8294 (comment) (TLDR is it seems to do the right thing, though it would be good to get a second set of eyes)

@edmondop
Contributor

edmondop commented Apr 6, 2024

Reading the statistics code, I found:

```rust
/// Looks up the parquet column by name
///
/// Returns the parquet column index and the corresponding arrow field
pub(crate) fn parquet_column<'a>(
    parquet_schema: &SchemaDescriptor,
    arrow_schema: &'a Schema,
    name: &str,
) -> Option<(usize, &'a FieldRef)> {
    let (root_idx, field) = arrow_schema.fields.find(name)?;
    if field.data_type().is_nested() {
        // Nested fields are not supported and require non-trivial logic
        // to correctly walk the parquet schema accounting for the
        // logical type rules - <https://github.com/apache/parquet-format/blob/master/LogicalTypes.md>
        //
        // For example a ListArray could correspond to anything from 1 to 3 levels
        // in the parquet schema
        return None;
    }

    // This could be made more efficient (#TBD)
    let parquet_idx = (0..parquet_schema.columns().len())
        .find(|x| parquet_schema.get_column_root_idx(*x) == root_idx)?;
    Some((parquet_idx, field))
}
```
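To illustrate the ordinal concern discussed above, here is a toy sketch (illustrative types only, not the arrow-rs `SchemaDescriptor` API): nested fields expand into one parquet leaf column per primitive descendant, so top-level arrow field indices and parquet leaf indices diverge as soon as a struct appears.

```rust
// Toy model of parquet schema flattening; illustrative only.
enum Field {
    Primitive(&'static str),
    Struct(&'static str, Vec<Field>),
}

/// Number of parquet leaf columns a field expands into.
fn leaf_count(field: &Field) -> usize {
    match field {
        Field::Primitive(_) => 1,
        Field::Struct(_, children) => children.iter().map(leaf_count).sum(),
    }
}

/// Parquet leaf index of the first leaf belonging to the root field
/// at `root_idx` (the mapping `parquet_column` has to get right).
fn first_leaf_index(schema: &[Field], root_idx: usize) -> usize {
    schema[..root_idx].iter().map(leaf_count).sum()
}

fn main() {
    // Schema: struct_col { bool_col, int_col } followed by plain_col
    let schema = vec![
        Field::Struct(
            "struct_col",
            vec![Field::Primitive("bool_col"), Field::Primitive("int_col")],
        ),
        Field::Primitive("plain_col"),
    ];
    // plain_col is arrow field 1 but parquet leaf 2, because
    // struct_col contributes two leaf columns before it.
    assert_eq!(first_leaf_index(&schema, 1), 2);
}
```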

`git blame` shows @alamb as an author of those lines... I'll look into the rules.  I suppose aggregation functions will need to be updated to walk the schema correctly? 

@edmondop
Contributor

@alamb I realized the last part of the code is actually the question I have for you, apologies

I'll look into the rules. I suppose aggregation functions will need to be updated to walk the schema correctly?

@alamb
Contributor Author

alamb commented Apr 17, 2024

> I'll look into the rules. I suppose aggregation functions will need to be updated to walk the schema correctly?

That sounds likely.

I think the first thing to do would be to write a test case showing the incorrect / lack of behavior. Then we can work out how to solve the problem

@edmondop
Contributor

@alamb I have made some progress on a test, but I need guidance here:

```rust
let (idx, _) = parquet_column(parquet_schema, &schema, "struct_col").unwrap();
assert_eq!(idx, 0);

let row_groups = metadata.row_groups();
let iter = row_groups.iter().map(|x| x.column(idx).statistics());
```

The test now fails because `row_groups.iter().map(|x| x.column(idx).statistics())` doesn't return the statistics of the first column of the RecordBatch (the struct), but of the first column of the struct. I can certainly build the statistics myself with something like this:

```rust
let datatype = DataType::Struct(Fields::from(vec![
    Field::new("bool_col", DataType::Boolean, false),
    Field::new("int_col", DataType::Int32, false),
]));
// Comment out the lines below, change `iter` above to iterate the two
// fields of the struct, and build the min as a struct
let min_field1 = min_statistics(&datatype, iter1.clone()).unwrap();
let min_field2 = min_statistics(&datatype, iter2.clone()).unwrap();
let min = /* find out how to do this */;
assert_eq!(
    &min,
    &expected_min,
    "Min. Statistics\n\n{}\n\n",
    DisplayStats(row_groups)
);
```

but I am not sure that's the right way to proceed
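One possible direction for the question above, purely a sketch with hypothetical types (DataFusion's real `ScalarValue` is much richer, and `min_i32`/`min_bool`/`struct_min` are not existing helpers): compute the min per leaf column, then package the per-field minima into a struct-typed value.

```rust
// Hypothetical scalar type for illustration only.
#[derive(Debug, PartialEq)]
enum Scalar {
    Boolean(bool),
    Int32(i32),
    Struct(Vec<Scalar>),
}

/// Min over one integer leaf column's per-row-group minima.
fn min_i32(row_group_mins: &[i32]) -> Option<Scalar> {
    row_group_mins.iter().copied().min().map(Scalar::Int32)
}

/// Min over one boolean leaf column's per-row-group minima.
fn min_bool(row_group_mins: &[bool]) -> Option<Scalar> {
    row_group_mins.iter().copied().min().map(Scalar::Boolean)
}

/// Package per-field minima into a struct-typed minimum; `None` if any
/// leaf is missing statistics.
fn struct_min(field_mins: Vec<Option<Scalar>>) -> Option<Scalar> {
    field_mins
        .into_iter()
        .collect::<Option<Vec<_>>>()
        .map(Scalar::Struct)
}

fn main() {
    let bool_mins = [true, false];
    let int_mins = [3, 1, 7];
    let min = struct_min(vec![min_bool(&bool_mins), min_i32(&int_mins)]);
    assert_eq!(
        min,
        Some(Scalar::Struct(vec![Scalar::Boolean(false), Scalar::Int32(1)]))
    );
}
```

Note that a field-wise min is a lower bound on the lexicographic struct ordering rather than an exact struct minimum, which is still sound for pruning but is a semantic choice worth confirming.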
