Is your feature request related to a problem or challenge? Please describe what you are trying to do.
When running the benchmarks with DataFusion, I noticed that we scan statistics for all tables early on (even tables not referenced in the query). This happens in ExecutionContext::register_table. We then scan statistics again later for the tables that are actually used in the query.
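For reference, the eager scan is a side effect of registration itself. Below is a minimal sketch of what the benchmark setup effectively does (assuming the synchronous ExecutionContext::register_parquet convenience method from the DataFusion release used here, which wraps register_table; the helper name is hypothetical):

```rust
use datafusion::error::Result;
use datafusion::prelude::*;

// Hypothetical helper: the real benchmark constructs its table providers
// directly, but the effect is the same -- every registration reads the
// Parquet footers to gather statistics, whether or not the table is later
// referenced by the query.
fn register_tpch_tables(ctx: &mut ExecutionContext, path: &str) -> Result<()> {
    let tables = [
        "part", "supplier", "partsupp", "customer",
        "orders", "lineitem", "nation", "region",
    ];
    for table in tables.iter() {
        // Statistics for every file under <path>/<table> are scanned here,
        // before any query has been planned.
        ctx.register_parquet(table, &format!("{}/{}", path, table))?;
    }
    Ok(())
}
```

The debug output below shows the effect for query 3: all eight tables are scanned at registration time, and the three tables the query actually touches are scanned a second time when the physical plan is built.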
../target/release/tpch benchmark datafusion --path /mnt/bigdata/tpch-sf1000-parquet/ --format parquet --iterations 1 --debug --concurrency 24 --query 3
Running benchmarks with the following options: DataFusionBenchmarkOpt { query: 3, debug: true, iterations: 1, concurrency: 24, batch_size: 8192, path: "/mnt/bigdata/tpch-sf1000-parquet/", file_format: "parquet", mem_table: false, partitions: 8 }
ParquetExec::try_from_path(/mnt/bigdata/tpch-sf1000-parquet//part)
Scanned 48 Parquet files for statistics in 0 seconds
ParquetExec::try_from_path(/mnt/bigdata/tpch-sf1000-parquet//supplier)
Scanned 48 Parquet files for statistics in 0 seconds
ParquetExec::try_from_path(/mnt/bigdata/tpch-sf1000-parquet//partsupp)
Scanned 48 Parquet files for statistics in 1 seconds
ParquetExec::try_from_path(/mnt/bigdata/tpch-sf1000-parquet//customer)
Scanned 48 Parquet files for statistics in 0 seconds
ParquetExec::try_from_path(/mnt/bigdata/tpch-sf1000-parquet//orders)
Scanned 48 Parquet files for statistics in 4 seconds
ParquetExec::try_from_path(/mnt/bigdata/tpch-sf1000-parquet//lineitem)
Scanned 48 Parquet files for statistics in 30 seconds
ParquetExec::try_from_path(/mnt/bigdata/tpch-sf1000-parquet//nation)
Scanned 1 Parquet files for statistics in 0 seconds
ParquetExec::try_from_path(/mnt/bigdata/tpch-sf1000-parquet//region)
Scanned 1 Parquet files for statistics in 0 seconds
=== Logical plan ===
Sort: #revenue DESC NULLS FIRST, #orders.o_orderdate ASC NULLS FIRST
  Projection: #lineitem.l_orderkey, #SUM(lineitem.l_extendedprice Multiply Int64(1) Minus lineitem.l_discount) AS revenue, #orders.o_orderdate, #orders.o_shippriority
    Aggregate: groupBy=[[#lineitem.l_orderkey, #orders.o_orderdate, #orders.o_shippriority]], aggr=[[SUM(#lineitem.l_extendedprice Multiply Int64(1) Minus #lineitem.l_discount)]]
      Filter: #customer.c_mktsegment Eq Utf8("BUILDING") And #orders.o_orderdate Lt CAST(Utf8("1995-03-15") AS Date32) And #lineitem.l_shipdate Gt CAST(Utf8("1995-03-15") AS Date32)
        Join: #orders.o_orderkey = #lineitem.l_orderkey
          Join: #customer.c_custkey = #orders.o_custkey
            TableScan: customer projection=None
            TableScan: orders projection=None
          TableScan: lineitem projection=None
=== Optimized logical plan ===
Sort: #revenue DESC NULLS FIRST, #orders.o_orderdate ASC NULLS FIRST
  Projection: #lineitem.l_orderkey, #SUM(lineitem.l_extendedprice Multiply Int64(1) Minus lineitem.l_discount) AS revenue, #orders.o_orderdate, #orders.o_shippriority
    Aggregate: groupBy=[[#lineitem.l_orderkey, #orders.o_orderdate, #orders.o_shippriority]], aggr=[[SUM(#lineitem.l_extendedprice Multiply Int64(1) Minus #lineitem.l_discount)]]
      Join: #orders.o_orderkey = #lineitem.l_orderkey
        Join: #customer.c_custkey = #orders.o_custkey
          Filter: #customer.c_mktsegment Eq Utf8("BUILDING")
            TableScan: customer projection=Some([0, 6]), filters=[#customer.c_mktsegment Eq Utf8("BUILDING")]
          Filter: #orders.o_orderdate Lt Date32("9204")
            TableScan: orders projection=Some([0, 1, 4, 7]), filters=[#orders.o_orderdate Lt Date32("9204")]
        Filter: #lineitem.l_shipdate Gt Date32("9204")
          TableScan: lineitem projection=Some([0, 5, 6, 10]), filters=[#lineitem.l_shipdate Gt Date32("9204")]
ParquetExec::try_from_path(/mnt/bigdata/tpch-sf1000-parquet//customer)
Scanned 48 Parquet files for statistics in 0 seconds
ParquetExec::try_from_path(/mnt/bigdata/tpch-sf1000-parquet//orders)
Scanned 48 Parquet files for statistics in 4 seconds
ParquetExec::try_from_path(/mnt/bigdata/tpch-sf1000-parquet//lineitem)
Scanned 48 Parquet files for statistics in 30 seconds
=== Physical plan ===
SortExec: [revenue@1 DESC,o_orderdate@2 ASC]
  CoalescePartitionsExec
    ProjectionExec: expr=[l_orderkey@0 as l_orderkey, SUM(lineitem.l_extendedprice Multiply Int64(1) Minus lineitem.l_discount)@3 as revenue, o_orderdate@1 as o_orderdate, o_shippriority@2 as o_shippriority]
      HashAggregateExec: mode=FinalPartitioned, gby=[l_orderkey@0 as l_orderkey, o_orderdate@1 as o_orderdate, o_shippriority@2 as o_shippriority], aggr=[SUM(l_extendedprice Multiply Int64(1) Minus l_discount)]
        CoalesceBatchesExec: target_batch_size=4096
          RepartitionExec: partitioning=Hash([Column { name: "l_orderkey", index: 0 }, Column { name: "o_orderdate", index: 1 }, Column { name: "o_shippriority", index: 2 }], 24)
            HashAggregateExec: mode=Partial, gby=[l_orderkey@6 as l_orderkey, o_orderdate@4 as o_orderdate, o_shippriority@5 as o_shippriority], aggr=[SUM(l_extendedprice Multiply Int64(1) Minus l_discount)]
              CoalesceBatchesExec: target_batch_size=4096
                HashJoinExec: mode=Partitioned, join_type=Inner, on=[(Column { name: "o_orderkey", index: 2 }, Column { name: "l_orderkey", index: 0 })]
                  CoalesceBatchesExec: target_batch_size=4096
                    RepartitionExec: partitioning=Hash([Column { name: "o_orderkey", index: 2 }], 24)
                      CoalesceBatchesExec: target_batch_size=4096
                        HashJoinExec: mode=Partitioned, join_type=Inner, on=[(Column { name: "c_custkey", index: 0 }, Column { name: "o_custkey", index: 1 })]
                          CoalesceBatchesExec: target_batch_size=4096
                            RepartitionExec: partitioning=Hash([Column { name: "c_custkey", index: 0 }], 24)
                              CoalesceBatchesExec: target_batch_size=4096
                                FilterExec: c_mktsegment@1 = BUILDING
                                  ParquetExec: batch_size=8192, limit=None, partitions=[...]
                          CoalesceBatchesExec: target_batch_size=4096
                            RepartitionExec: partitioning=Hash([Column { name: "o_custkey", index: 1 }], 24)
                              CoalesceBatchesExec: target_batch_size=4096
                                FilterExec: o_orderdate@2 < 9204
                                  ParquetExec: batch_size=8192, limit=None, partitions=[...]
                  CoalesceBatchesExec: target_batch_size=4096
                    RepartitionExec: partitioning=Hash([Column { name: "l_orderkey", index: 0 }], 24)
                      CoalesceBatchesExec: target_batch_size=4096
                        FilterExec: l_shipdate@3 > 9204
                          ParquetExec: batch_size=8192, limit=None, partitions=[...]
Describe the solution you'd like
We should only scan statistics for tables that are used in the query
We should only scan statistics once
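One possible shape for this (purely a sketch; LazyStatistics, Statistics, and scan_parquet_footers below are illustrative names, not DataFusion APIs) is to defer the footer scan until a query actually needs the statistics and to cache the result so later planning phases reuse it:

```rust
use std::sync::{Arc, Mutex};

/// Illustrative statistics container, standing in for whatever the table
/// provider collects from the Parquet footers.
#[derive(Clone, Debug, Default)]
struct Statistics {
    num_rows: Option<usize>,
    total_byte_size: Option<usize>,
}

/// Computes statistics on first use and caches them, so registering a
/// table does no I/O and a query pays the footer-scan cost at most once.
struct LazyStatistics {
    path: String,
    cached: Mutex<Option<Arc<Statistics>>>,
}

impl LazyStatistics {
    fn new(path: impl Into<String>) -> Self {
        Self {
            path: path.into(),
            cached: Mutex::new(None),
        }
    }

    fn get(&self) -> Arc<Statistics> {
        let mut guard = self.cached.lock().unwrap();
        if let Some(stats) = guard.as_ref() {
            // Later callers (e.g. physical planning, after the logical
            // optimizer already asked for statistics) reuse the result.
            return Arc::clone(stats);
        }
        let stats = Arc::new(scan_parquet_footers(&self.path));
        *guard = Some(Arc::clone(&stats));
        stats
    }
}

/// Placeholder for the expensive per-file footer scan.
fn scan_parquet_footers(_path: &str) -> Statistics {
    Statistics::default()
}
```

With something along these lines, registering the eight TPC-H tables would cost nothing up front, and query 3 would pay the ~30 second lineitem statistics scan once instead of twice.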
Describe alternatives you've considered
N/A
Additional context
N/A