Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Don't scan first column on empty projection #3214

Closed
Dandandan opened this issue Aug 21, 2022 · 13 comments · Fixed by #7920
Closed

Don't scan first column on empty projection #3214

Dandandan opened this issue Aug 21, 2022 · 13 comments · Fixed by #7920
Labels
enhancement New feature or request performance Make DataFusion faster

Comments

@Dandandan
Copy link
Contributor

Dandandan commented Aug 21, 2022

Is your feature request related to a problem or challenge? Please describe what you are trying to do.
Depends on: #2603

When we perform without needing the like SELECT COUNT(1) FROM table, the plan always reads the first column (whatever this is). This is inefficient: in case of formats like Parquet we can avoid scanning / reading the column and just produce the row counts. For non-columnar formats it can avoid unnecessary parsing (or implementing a fast path, i.e. only counting lines).

Projection: Count(1)
  TableScan: test projection=[a]

Should become:

Projection: Count(1)
  TableScan: test projection=[]

Describe the solution you'd like
We can push the responsibility of dealing with producing an array with a certain number of rows into the individual readers / other parts of the plans. They should produce RecordBatches with the number of rows.
We should remove the line projection.insert(0); from projection push down.

Describe alternatives you've considered

Additional context
Some queries in the ClickBench benchmark show this performance issue (https://benchmark.clickhouse.com/ ):

| logical_plan  | Projection: #COUNT(UInt8(1))                                                                                                       |
|               |   Aggregate: groupBy=[[]], aggr=[[COUNT(UInt8(1))]]                                                                                |
|               |     TableScan: hits projection=[WatchID]                                                                                           |
@Dandandan Dandandan added enhancement New feature or request performance Make DataFusion faster labels Aug 21, 2022
@alamb
Copy link
Contributor

alamb commented Aug 21, 2022

👍 this is an important optimization as select count(*) type queries are so common

@HaoYang670
Copy link
Contributor

I find this comment: https://github.com/apache/arrow-datafusion/blob/master/datafusion/optimizer/src/projection_push_down.rs#L98-L100

It says that Ensure that we are reading at least one column from the table. Is there any reason or background of why we need to do this?

@Dandandan
Copy link
Contributor Author

I find this comment: https://github.com/apache/arrow-datafusion/blob/master/datafusion/optimizer/src/projection_push_down.rs#L98-L100

It says that Ensure that we are reading at least one column from the table. Is there any reason or background of why we need to do this?

The reason is that several Arrow readers don´t support empty projections. I added a PR for csv / json upstream apache/arrow-rs#2604

@HaoYang670
Copy link
Contributor

HaoYang670 commented Aug 29, 2022

The reason is that several Arrow readers don´t support empty projections.

Thank you, @Dandandan. I could reproduce the error when reading csv with empty projection

Arrow error: Invalid argument error: must either specify a row count or at least one column

If this depends on the support of arrow-rs, should we add a new label such as arrow-dependency for this issue?

@Dandandan Dandandan added the waiting-on-upstream PR is waiting on an upstream dependency to be updated label Aug 29, 2022
@avantgardnerio
Copy link
Contributor

Might count(*) be as simple as a stats lookup in Parquet or DeltaLake? Reading a billion values just to count them seems sub-optimal, but that can definitely be addressed with a TODO and a future PR.

@Dandandan
Copy link
Contributor Author

Might count(*) be as simple as a stats lookup in Parquet or DeltaLake? Reading a billion values just to count them seems sub-optimal, but that can definitely be addressed with a TODO and a future PR.

You're right, for a schema provider that has statistics available, we can skip scanning.
AFAIK DataFusion already has support for using the statistics-provided count/min/max from the provider (e.g. delta lake).

You're right that we could also use the parquet statistics for files instead of skipping reading the columns. I think we don't support this yet. At least for min/max statisticd his avoids having to scan the entire column and compute the min/max.

@alamb
Copy link
Contributor

alamb commented Aug 31, 2022

I think @tustvold has been thinking of this in the context of the various parquet reader improvements

@tustvold
Copy link
Contributor

tustvold commented Aug 31, 2022

I think there are two different optimisations being discussed here:

  • Skip interacting with the file based on catalog statistics if available
  • Remove projection "hack" and delegate to file readers

Parquet has supported the latter since apache/arrow-rs#1560, and CSV/JSON will support it once apache/arrow-rs#2604 is released. I think it should be then be possible to remove the workaround, as it will be no longer necessary.

As to the former, I think it should be fairly straightforward to implement a physical optimiser pass that uses statistics to simplify counts into projections based on statistics if available. I had thought we had already implemented this tbh... 🤔 Edit: Yup AggregateStatistics

@alamb
Copy link
Contributor

alamb commented Aug 31, 2022

Remove projection "hack" and delegate to file readers

Yes, this is what I was talking about. https://docs.rs/datafusion/latest/datafusion/physical_optimizer/aggregate_statistics/struct.AggregateStatistics.html is very cool 👍 (thanks @rdettai !)

@Dandandan Dandandan removed the waiting-on-upstream PR is waiting on an upstream dependency to be updated label Sep 7, 2022
@Dandandan
Copy link
Contributor Author

Draft PR here:
#3382
It turns out it is a bit more complex than removing a line, as every Exec node should support producing records without columns/empty schema. I think the only thing we can do is hunting every RecordBatch::try_new and adapting it for projections without columns 🤔

@alamb
Copy link
Contributor

alamb commented Sep 8, 2022

@Dandandan
Copy link
Contributor Author

Maybe we can teach https://docs.rs/arrow/22.0.0/arrow/datatypes/struct.Schema.html#method.project and https://docs.rs/arrow/22.0.0/arrow/record_batch/struct.RecordBatch.html#method.project about empty projections?

Thanks, I did just that yesterday, for RecordBach::project: apache/arrow-rs#2691. Schema::project already seems to handle empty projections just fine 🎉

@Dandandan
Copy link
Contributor Author

Closed by #7920

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request performance Make DataFusion faster
Projects
None yet
5 participants