DataFusion ignores "column order" parquet statistics specification #10586

Open
Tracked by #10453 ...
alamb opened this issue May 20, 2024 · 0 comments

alamb commented May 20, 2024

Describe the bug

As @tustvold points out, Parquet defines a column_order API that DataFusion currently ignores entirely.

It is not entirely clear to me what the implications of ignoring this field are, or what other Parquet writers populate it with, but we should probably not ignore it.

To Reproduce

No response

Expected behavior

No response

Additional context

To emphasise the point I made when this API was originally proposed, you need more than just the ParquetStatistics in order to correctly interpret the data. You need at least the FileMetaData to get the column_order (https://docs.rs/parquet/latest/parquet/file/metadata/struct.FileMetaData.html#method.column_order) in order to even be able to interpret what the statistics mean for a given column.
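
To illustrate why the column order matters, here is a minimal sketch of how the same min/max bytes can compare differently under signed and unsigned ordering. The `SortOrder` and `ColumnOrder` enums below are local stand-ins that only mirror the shape of the types in the `parquet` crate, and `compare_stat_bytes` and `min_max_usable` are hypothetical helpers. The "skip statistics when the order is undefined" policy is an assumption for illustration, not something this issue or DataFusion prescribes.

```rust
use std::cmp::Ordering;

// Local stand-ins mirroring the shape of `parquet::basic::SortOrder` and
// `parquet::basic::ColumnOrder`. They are assumptions so that the sketch
// compiles without the `parquet` crate.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum SortOrder {
    Signed,
    Unsigned,
    Undefined,
}

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum ColumnOrder {
    TypeDefinedOrder(SortOrder),
    Undefined,
}

/// Hypothetical helper: compare two single-byte statistics values under the
/// given column order. Returns `None` when the order defines no comparison.
fn compare_stat_bytes(order: ColumnOrder, a: u8, b: u8) -> Option<Ordering> {
    match order {
        // Reinterpret the raw bytes as signed before comparing.
        ColumnOrder::TypeDefinedOrder(SortOrder::Signed) => Some((a as i8).cmp(&(b as i8))),
        // Compare the raw bytes directly.
        ColumnOrder::TypeDefinedOrder(SortOrder::Unsigned) => Some(a.cmp(&b)),
        ColumnOrder::TypeDefinedOrder(SortOrder::Undefined) | ColumnOrder::Undefined => None,
    }
}

/// Hypothetical conservative policy (an assumption): only use min/max
/// statistics for pruning when the column order defines a comparison.
fn min_max_usable(order: ColumnOrder) -> bool {
    matches!(
        order,
        ColumnOrder::TypeDefinedOrder(SortOrder::Signed | SortOrder::Unsigned)
    )
}
```

Under these assumptions, the byte 0x80 sorts below 0x01 with signed ordering but above it with unsigned ordering, so a reader that guesses the wrong order could misread which value is the minimum. With the real crate, the order for column `i` would presumably come from the `FileMetaData::column_order` method linked above rather than being constructed by hand.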
