Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ParquetExec::statistics() does not read statistics for many column types (like timstamps, strings, etc) #8295

Open
alamb opened this issue Nov 21, 2023 · 6 comments
Labels
bug Something isn't working

Comments

@alamb
Copy link
Contributor

alamb commented Nov 21, 2023

Describe the bug

While working on #8229 I found another bug that is non obvious, but that can be clearly seen now thanks to #8110 and #8111 from @NGA-TRAN

To Reproduce

❯ copy (values ('foo'), ('bar'), ('baz')) to '/tmp/strings.parquet';
+-------+
| count |
+-------+
| 3     |
+-------+
1 row in set. Query took 0.023 seconds.

And then look at the explain verbose up can see there are no min/max statisics shown:

❯ explain verbose select * from '/tmp/strings.parquet';

|                                                            |                                                                                                                                                                |
| physical_plan_with_stats                                   | ParquetExec: file_groups={1 group: [[private/tmp/strings.parquet]]}, projection=[column1], statistics=[Rows=Exact(3), Bytes=Absent, [(Col[0]: Null=Exact(0))]] |
|                                                            |                                                                                                                                                                |
+------------------------------------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------+
80 rows in set. Query took 0.002 seconds.

Expected behavior

I expect there to be min/max values extracted in the statistics for the strings, as there are for integers ((Col[0]: Min=Exact(Int64(1)) Max=Exact(Int64(3)))

❯ copy (values (1), (2), (3)) to '/tmp/ints.parquet';
+-------+
| count |
+-------+
| 3     |
+-------+
1 row in set. Query took 0.023 seconds.
❯ explain verbose select * from '/tmp/ints.parquet';
...
                                                                                                               |
| physical_plan                                              | ParquetExec: file_groups={1 group: [[private/tmp/ints.parquet]]}, projection=[column1]                                                                                                              |
|                                                            |                                                                                                                                                                                                     |
| physical_plan_with_stats                                   | ParquetExec: file_groups={1 group: [[private/tmp/ints.parquet]]}, projection=[column1], statistics=[Rows=Exact(3), Bytes=Absent, [(Col[0]: Min=Exact(Int64(1)) Max=Exact(Int64(3)) Null=Exact(0))]] |
|                                                            |                                                                                                                                                                                                     |
+------------------------------------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+

Additional context

No response

@alamb alamb added the bug Something isn't working label Nov 21, 2023
@alamb
Copy link
Contributor Author

alamb commented Nov 21, 2023

Note that the pruning predicate code does correctly read the statistics for other strings and timestamps, because it uses a different code path

@alamb
Copy link
Contributor Author

alamb commented Nov 21, 2023

I plan to fix this

@Weijun-H
Copy link
Member

Weijun-H commented Feb 4, 2024

Could I pick this ticket up?

@Weijun-H
Copy link
Member

Weijun-H commented Feb 4, 2024

In fn summarize_min_max, it cannot handle ByteArray(ValueStatistics<ByteArray>) well. Do we need to convert it to a different type like timestamps, strings, etc 🤔 ?

@alamb
Copy link
Contributor Author

alamb commented Feb 4, 2024

In fn summarize_min_max, it cannot handle ByteArray(ValueStatistics<ByteArray>) well. Do we need to convert it to a different type like timestamps, strings, etc 🤔 ?

I think there is some subtly related to decimals as well -- the best thing to do is probably to study what the existing code in row_groups does -- I think it is here https://github.com/apache/arrow-datafusion/blob/main/datafusion/core/src/datasource/physical_plan/parquet/statistics.rs#L57

@alamb
Copy link
Contributor Author

alamb commented Feb 4, 2024

At some point there were multiple code paths to extract statistics in parquet (one for file level and one for row group level) that should likely be combined

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants