Don't scan first column on empty projection #3214
Comments
👍 this is an important optimization as |
I find this comment: https://github.com/apache/arrow-datafusion/blob/master/datafusion/optimizer/src/projection_push_down.rs#L98-L100 It says that |
The reason is that several Arrow readers don't support empty projections. I added a PR for CSV / JSON upstream: apache/arrow-rs#2604
Thank you, @Dandandan. I could reproduce the error when reading CSV with an empty projection.
If this depends on the support of arrow-rs, should we add a new label such as |
Might |
You're right, for a schema provider that has statistics available, we can skip scanning. You're right that we could also use the Parquet statistics for files instead of skipping reading the columns. I think we don't support this yet. At least for min/max statistics, this avoids having to scan the entire column and compute the min/max.
I think @tustvold has been thinking of this in the context of the various Parquet reader improvements.
I think there are two different optimisations being discussed here:
Parquet has supported the latter since apache/arrow-rs#1560, and CSV/JSON will support it once apache/arrow-rs#2604 is released. I think it should then be possible to remove the workaround, as it will no longer be necessary. As to the former, I think it should be fairly straightforward to implement a physical optimiser pass that uses statistics, if available, to simplify counts into projections. I had thought we had already implemented this tbh... 🤔 Edit: Yup, AggregateStatistics
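To make the statistics-based optimisation concrete, here is a minimal sketch of answering a count from per-file row counts without scanning. It is not how AggregateStatistics is implemented in DataFusion; `FileStats` and `count_from_stats` are hypothetical names invented for illustration, and the sketch assumes the recorded row counts are exact.

```rust
/// Hypothetical per-file statistics (not a DataFusion type).
/// `num_rows` is `None` when no exact row count was recorded.
#[derive(Debug, Clone, Copy)]
struct FileStats {
    num_rows: Option<usize>,
}

/// Try to answer `COUNT(*)` from statistics alone.
///
/// Returns `Some(total)` only if every file has a known row count.
/// Returns `None` as soon as one file lacks it, in which case the
/// caller has to fall back to actually scanning the data.
fn count_from_stats(files: &[FileStats]) -> Option<usize> {
    // Summing an iterator of `Option<usize>` short-circuits to `None`
    // on the first missing value.
    files.iter().map(|f| f.num_rows).sum()
}
```

For two files with 100 and 25 known rows this yields `Some(125)`; if either count is missing it yields `None` and the plan would keep its scan.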
Yes, this is what I was talking about. https://docs.rs/datafusion/latest/datafusion/physical_optimizer/aggregate_statistics/struct.AggregateStatistics.html is very cool 👍 (thanks @rdettai!)
Draft PR here: |
Maybe we can teach https://docs.rs/arrow/22.0.0/arrow/datatypes/struct.Schema.html#method.project and https://docs.rs/arrow/22.0.0/arrow/record_batch/struct.RecordBatch.html#method.project about empty projections?
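As a rough illustration of what "supporting empty projections" would mean, the sketch below uses a hypothetical `Batch` type (a stand-in, not arrow's `RecordBatch`) that stores its row count explicitly. The assumption is that a batch with zero columns can only report its length if the count is carried separately from the columns.

```rust
/// Hypothetical stand-in for a record batch: named i64 columns plus an
/// explicit row count, so the count survives when no column is selected.
#[derive(Debug, Clone, PartialEq)]
struct Batch {
    columns: Vec<(String, Vec<i64>)>,
    num_rows: usize,
}

impl Batch {
    /// Keep only the columns at `indices`, in the given order.
    /// An empty `indices` slice is valid: the result has zero columns
    /// but still reports the original number of rows.
    fn project(&self, indices: &[usize]) -> Result<Batch, String> {
        let mut columns = Vec::with_capacity(indices.len());
        for &i in indices {
            let col = self
                .columns
                .get(i)
                .ok_or_else(|| format!("column index {} out of bounds", i))?;
            columns.push(col.clone());
        }
        Ok(Batch {
            columns,
            num_rows: self.num_rows,
        })
    }
}
```

With this shape, `batch.project(&[])` on a three-row batch returns a batch with no columns and `num_rows == 3`, which is all a `COUNT(1)` needs.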
Thanks, I did just that yesterday, for |
Closed by #7920
Is your feature request related to a problem or challenge? Please describe what you are trying to do.
Depends on: #2603
When we perform a query that does not need any column data, like
SELECT COUNT(1) FROM table
the plan always reads the first column (whichever that is). This is inefficient: for formats like Parquet we can avoid scanning or reading the column and just produce the row counts. For non-columnar formats we can avoid unnecessary parsing (or implement a fast path, i.e. only counting lines).
Should become:
Describe the solution you'd like
We can push the responsibility for producing an array with a given number of rows into the individual readers and other parts of the plan. They should produce RecordBatches with the number of rows.
We should remove the line
projection.insert(0);
from projection push down.
Describe alternatives you've considered
Additional context
Some queries in the ClickBench benchmark show this performance issue (https://benchmark.clickhouse.com/ ):