Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Bad Join Order for TPCH Q17 results in slow performance #7949

Closed
alamb opened this issue Oct 27, 2023 · 2 comments · Fixed by #8126
Closed

Bad Join Order for TPCH Q17 results in slow performance #7949

alamb opened this issue Oct 27, 2023 · 2 comments · Fixed by #8126
Assignees
Labels
bug Something isn't working

Comments

@alamb
Copy link
Contributor

alamb commented Oct 27, 2023

Describe the bug

The join order chosen for TPCH query 17 is bad making datafusion take 50% longer to execute the query.

To Reproduce

Step 1: Create data:

cd arrow-datafusion/benchmarks
./bench.sh  data tpch10

Step 2: Run query with datafusion-cli:

cd arrow-datafusion/benchmarks/data/tpch_sf10
datafusion-cli -c "select	sum(l_extendedprice) / 7.0 as avg_yearly	from	lineitem,	part	where	p_partkey = l_partkey	and p_brand = 'Brand#23'	and p_container = 'MED BOX'	and l_quantity < (	select	0.2 * avg(l_quantity)	from	lineitem	where	l_partkey = p_partkey	);"

Takes 7.52 seconds

+-------------------+
| avg_yearly        |
+-------------------+
| 3295493.512857143 |
+-------------------+
1 row in set. Query took 7.525 seconds.

However, if we change the query slightly (swap the table order) it is much faster (4.5 seconds)

cd arrow-datafusion/benchmarks/data/tpch_sf10
datafusion-cli -c "select   sum(l_extendedprice) / 7.0 as avg_yearly        from    part,lineitem   where   p_partkey = l_partkey   and p_brand = 'Brand#23'        and p_container = 'MED BOX' and l_quantity < (      select  0.2 * avg(l_quantity)   from    lineitem        where   l_partkey = p_partkey   );"
DataFusion CLI v32.0.0
+-------------------+
| avg_yearly        |
+-------------------+
| 3295493.512857143 |
+-------------------+
1 row in set. Query took 4.560 seconds.

Here is the difference:

select
sum(l_extendedprice) / 7.0 as avg_yearly from
- lineitem, part
+ part, lineitem
where
  p_partkey = l_partkey
  and p_brand = 'Brand#23'
  and p_container = 'MED BOX'
  and l_quantity < (	select	0.2 * avg(l_quantity)	from	lineitem where	l_partkey = p_partkey	);

Expected behavior

DataFusion should pick the correct join order

Analysis

What is going on? The answer is in the order of the joins. Here is the plan DataFusion makes for Q17, annotated from output row counts (I used the output of EXPLAIN ANALYZE):

                                       DataFusion TPCH SF10 Q17 Plan                  
                                      annotated with output row counts                
                                                                                      
                                                                                      
                                                                                      
                                                            1 Row                     
                                                                                      
                                                ┌────────────────┐                    
                                                │   Aggregate    │                    
                                                │                │                    
This join is building a 60M row hash            │                │                    
 table and probing 2M rows, rather              └────────────────┘                    
 than building a 2M row hash table                       │                            
    and probing a 60M row table                          ▼        2M Rows             
                  │                             ┌────────────────┐                    
                                                │  HashJoin (1)  │                    
                  │                             │   p_partkey =  │                    
                                                │   l_partkey    │                    
                  │         2044 Rows           └───┬────────┬───┘                    
                                                    │        │            2M Rows     
                  │                ┌────────────────┘        └──────────────┐         
                                   │                                        │         
                  │                ▼                                        ▼         
                          ┌────────────────┐                       ┌────────────────┐ 
                  │       │  HashJoin (2)  │                       │   Aggregate    │ 
                   ─ ─ ─ ▶│   p_partkey =  │                       │                │ 
                          │   l_partkey    │                       │                │ 
                          └───┬────────┬───┘                       └────────────────┘ 
           60M Rows           │        │        2M Rows                     │         
                    ┌─────────┘        └──────────┐                         │         
                    │                             │                         │         
                    ▼                             ▼                         ▼         
          ┌──────────────────┐          ┌──────────────────┐      ┌──────────────────┐
          │Scan: lineitem    │          │Scan: part        │      │Scan: part        │
          │(no filters)      │          │p_brand = ..      │      │(no filters)      │
          │                  │          │p_container = ..  │      │                  │
          └──────────────────┘          └──────────────────┘      └──────────────────┘

Background: DataFusion Joins (I will also add this as documentation to datafusion)

Why does the order matter so much? To understand it fully, we need to understand how Hash Joins in DataFusion work.

The HashJoin operator in DataFusion takes two inputs:

         ┌───────────┐         
         │ HashJoin  │         
         │           │         
         └───────────┘         
             │   │             
       ┌─────┘   └─────┐       
       ▼               ▼       
┌────────────┐  ┌─────────────┐
│   Input    │  │    Input    │
│    [0]     │  │     [1]     │
└────────────┘  └─────────────┘
                               
 "build side"    "probe side"  

Execution proceeds in 2 stages:

                ┌────────────────┐          ┌────────────────┐                        
                │ ┌─────────┐    │          │ ┌─────────┐    │                        
                │ │  Hash   │    │          │ │  Hash   │    │                        
                │ │  Table  │    │          │ │  Table  │    │                        
                │ │(keys are│    │          │ │(keys are│    │                        
                │ │equi join│    │          │ │equi join│    │  Stage 2: batches from 
 Stage 1: the   │ │columns) │    │          │ │columns) │    │    the probe side are  
*entire* build  │ │         │    │          │ │         │    │  streamed through, and 
 side is read   │ └─────────┘    │          │ └─────────┘    │   checked against the  
into the hash   │      ▲         │          │          ▲     │   contents of the hash 
    table       │       HashJoin │          │  HashJoin      │          table         
                └──────┼─────────┘          └──────────┼─────┘                        
            ─ ─ ─ ─ ─ ─                                 ─ ─ ─ ─ ─ ─ ─                 
           │                                                         │                
                                                                                      
           │                                                         │                
    ┌────────────┐                                            ┌────────────┐          
    │RecordBatch │                                            │RecordBatch │          
    └────────────┘                                            └────────────┘          
    ┌────────────┐                                            ┌────────────┐          
    │RecordBatch │                                            │RecordBatch │          
    └────────────┘                                            └────────────┘          
          ...                                                       ...               
    ┌────────────┐                                            ┌────────────┐          
    │RecordBatch │                                            │RecordBatch │          
    └────────────┘                                            └────────────┘          
                                                                                      
       build side                                                probe side           
                                                                                      

This asymmetry in behavior has the important consequence that it is very important that the smaller side is hashed

So this means in a classic "Star Schema Query", the optimal plan will be a "Right Deep Tree" , where there is one large table and several smaller "dimension" tables, with predicates. The optimal DataFusion will put this large table as the probe side on the lowest join:

            ┌───────────┐                                    
            │ HashJoin  │                                    
            │           │                                    
            └───────────┘                                    
                │   │                                        
        ┌───────┘   └──────────┐                             
        ▼                      ▼                             
┌───────────────┐        ┌───────────┐                       
│ small table 3 │        │ HashJoin  │                       
│  "dimension"  │        │           │                       
└───────────────┘        └───┬───┬───┘                       
                  ┌──────────┘   └───────┐                   
                  │                      │                   
                  ▼                      ▼                   
          ┌───────────────┐        ┌───────────┐             
          │ small table 2 │        │ HashJoin  │             
          │  "dimension"  │        │           │             
          └───────────────┘        └───┬───┬───┘             
                              ┌────────┘   └────────┐        
                              │                     │        
                              ▼                     ▼        
                      ┌───────────────┐     ┌───────────────┐
                      │ small table 1 │     │  large table  │
                      │  "dimension"  │     │    "fact"     │
                      └───────────────┘     └───────────────┘

Additional context

This is likely one of the root causes of #5646

@artorias1024
Copy link

If we only focus on Query 17, is the ideal execution plan for this SQL statement supposed to be like the following?

                ┌───────────────────────────────────┐                     
                │             Aggregate             │                     
                │                                   │                     
                │                                   │                     
                └──────────────────┬────────────────┘                     
                                   │                                      
                                   │                                      
                                   ▼                                      
                 ┌──────────────────────────────────┐                     
                 │  Filter: lineitem.l_quantity <   │                     
                 │((0.2 * (avg(lineitem.l_quantity) │                     
                 │           OVER (?))))            │                     
                 │                                  │                     
                 └──────────────────────────────────┘                     
                                   │                                      
                                   │                                      
                                   │                                      
                                   ▼                                      
                 ┌──────────────────────────────────┐                     
                 │         Window Aggregate         │                     
                 │ output: avg(lineitem.l_quantity) │                     
                 │  OVER (?), lineitem.l_quantity,  │                     
                 │    lineitem.l_extendedprice,     │                     
                 └──────────────────────────────────┘                     
                                   │                                      
                                   │                                      
                                   │                                      
                                   ▼                                      
                 ┌──────────────────────────────────┐                     
                 │            Hash Join             │                     
                 │                                  │                     
                 │                                  │                     
                 │                                  │                     
                 └─────────┬───────────────┬────────┘                     
                                                                          
                           │               │                              
                ┌ ─ ─ ─ ─ ─                 ─ ─ ─ ─ ─ ─ ─                 
                                                         │                
                │                                                         
                                                         │                
                ▼                                        ▼                
┌───────────────────────────────┐        ┌───────────────────────────────┐
│       Table Scan: part        │        │     Table Scan: lineitem      │
│ Filter: part.p_brand = .. AND │        │                               │
│     part.p_container = ..     │        │                               │
│                               │        │                               │
└───────────────────────────────┘        └───────────────────────────────┘

@alamb
Copy link
Contributor Author

alamb commented Nov 6, 2023

If we only focus on Query 17, is the ideal execution plan for this SQL statement supposed to be like the following?

Here is what I think a much better plan would be:

Screenshot 2023-11-06 at 7 12 38 AM

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants