Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

StatisticsConverter::row_group_null_counts incorrect for missing column #10926

Closed
Tracked by #10922
alamb opened this issue Jun 15, 2024 · 4 comments · Fixed by #10946
Closed
Tracked by #10922

StatisticsConverter::row_group_null_counts incorrect for missing column #10926

alamb opened this issue Jun 15, 2024 · 4 comments · Fixed by #10946
Assignees
Labels
bug Something isn't working

Comments

@alamb
Copy link
Contributor

alamb commented Jun 15, 2024

Describe the bug

I noticed this while working on #10852 with @marvinlanhenke

Basially, when generating statistics for a non existent column, the StatisticsExtractor will return a null array of the type of the column not a UInt64Array

Specifically

pub fn row_group_null_counts<I>(&self, metadatas: I) -> Result<ArrayRef>
where
I: IntoIterator<Item = &'a RowGroupMetaData>,
{
let data_type = self.arrow_field.data_type();
let Some(parquet_index) = self.parquet_index else {
return Ok(self.make_null_array(data_type, metadatas));
};
let null_counts = metadatas
.into_iter()
.map(|x| x.column(parquet_index).statistics())
.map(|s| s.map(|s| s.null_count()));
Ok(Arc::new(UInt64Array::from_iter(null_counts)))
}

The same problem exists for data_page_null_counts and data_page_row_counts (not for row_group_row_counts

To Reproduce

Try to call row_group_null_counts for a column that isn't in the parquet file

Expected behavior

  1. row_group_null_counts should always return an UInt64Array (not an ArrayRef)
  2. If there is not a column, the UInt64Array should be all nulls

Additional context

No response

@marvinlanhenke
Copy link
Contributor

take

@marvinlanhenke
Copy link
Contributor

marvinlanhenke commented Jun 16, 2024

@alamb
I have two questions here:

Should we also change the signature and return Result<Uint64Array>?
This however, would require to downcast_ref the result from new_null_array and clone the result.

Something like:

let Some(parquet_index) = self.parquet_index else {
            return Ok(self
                .make_null_array(&DataType::UInt64, metadatas)
                .as_any()
                .downcast_ref::<UInt64Array>()
                .expect("failed to downcast array")
                .clone());
        };

If we change the signature we also have to make some changes downstream, e.g. in row_groups.rs fn null_counts

Yet another question, what is the correct behaviour in data_page_row_counts if we cannot find a matching parquet_index, should we return a null_array (like we do already) or provide the row_group_row_count instead, since we cannot access the pages via parquet_index anyway? Is it even correct to not return a null-array in row_group_row_counts even if we can't match the column?

EDIT: I guess this answers the second question:

If there is not a column, the UInt64Array should be all nulls

@alamb
Copy link
Contributor Author

alamb commented Jun 16, 2024

Should we also change the signature and return Result?

Yes I think so

Should we also change the signature and return Result?

I actually think using new_null_array is wrong here. It returns the wrong type -- for example if we are getting row counts for a Utf8 column, using new_null_arrray will return a StringArray rather than a UInt64Array.

I actually hit this issue when working on #10924

And I had to change it to

        let Some(parquet_index) = self.parquet_index else {
            let num_row_groups = metadatas.into_iter().count();
            return Ok(Arc::new(UInt64Array::from_iter(
                std::iter::repeat(None).take(num_row_groups),
            )));
        };

To make a test pass

@marvinlanhenke
Copy link
Contributor

marvinlanhenke commented Jun 16, 2024

I actually think using new_null_array is wrong here. It returns the wrong type

I passed &DataType::UInt64 into new_null_array(...). However, your approach solves the issue with downcasting the result from new_null_array so it makes more sense here. Thank you.

@alamb just to confirm my understanding (regarding my second question) row_group_row_counts should also return a null_array when we reference a non-existing column; thus regardless if row group or data page both row_counts behave the same way when encountering a missing col?

This issue was closed.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants