Improve performance for grouping by variable length columns (strings) #9403

Open · alamb opened this issue Feb 29, 2024 · 13 comments
Labels: enhancement (New feature or request)

@alamb (Contributor) commented Feb 29, 2024

Is your feature request related to a problem or challenge?

As always, I would like faster aggregation performance.

Describe the solution you'd like

ClickBench Q17, Q18, and Q19 include:

SELECT "UserID", "SearchPhrase", COUNT(*) FROM hits GROUP BY "UserID", "SearchPhrase" ORDER BY COUNT(*) DESC LIMIT 10;
SELECT "UserID", "SearchPhrase", COUNT(*) FROM hits GROUP BY "UserID", "SearchPhrase" LIMIT 10;
SELECT "UserID", extract(minute FROM to_timestamp_seconds("EventTime")) AS m, "SearchPhrase", COUNT(*) FROM hits GROUP BY "UserID", m, "SearchPhrase" ORDER BY COUNT(*) DESC LIMIT 10;

These queries group on an Int64 column and a Utf8 (string) column:

DataFusion CLI v36.0.0
❯ describe 'hits.parquet';
+-----------------------+-----------+-------------+
| column_name           | data_type | is_nullable |
+-----------------------+-----------+-------------+
...
| UserID                | Int64     | NO          |
...
| SearchPhrase          | Utf8      | NO          |
...
+-----------------------+-----------+-------------+
105 rows in set. Query took 0.035 seconds.

In some profiling of Q19 (SELECT "UserID", "SearchPhrase", COUNT(*) FROM hits GROUP BY "UserID", "SearchPhrase" ORDER BY COUNT(*) DESC LIMIT 10;), I found that 20-30% of the time is spent converting Array --> Row or Row --> Array.

Thus I think adding special handling for variable length data vs. fixed length data in the group management may help.

Background

GroupValuesRows, used for the queries above, is here:
https://github.com/apache/arrow-datafusion/blob/edec4189242ab07ac65967490537d77e776aad5c/datafusion/physical-plan/src/aggregates/group_values/row.rs#L32

Given a query like SELECT ... GROUP BY i1, i2, s1, where i1 and i2 are integer columns and s1 is a string column, the input looks like this:

                       ┌─────┐ ┌─────────┐                                                       
                       │  0  │ │TheQuickB│                                                       
┌─────┐   ┌─────┐      ├─────┤ │rownFox..│                                                       
│  1  │   │ 10  │      │ 100 │ │.FooSomeO│               In the input Arrow Arrays, variable     
├─────┤   ├─────┤      ├─────┤ │therVeryL│               length columns have offsets into other  
│  2  │   │ 20  │      │ 103 │ │argeStrin│               buffers                                 
├─────┤   ├─────┤      ├─────┤ │gs       │                                                       
│  5  │   │ 50  │      │ 300 │ │         │                                                       
├─────┤   ├─────┤      ├─────┤ └─────────┘                                                       
│ ... │   │ ... │      │ ... │  data (s1)                                                        
├─────┤   ├─────┤      ├─────┤                                                                   
│  6  │   │ 60  │      │ 600 │                                                                   
└─────┘   └─────┘      └─────┘                                                                   
                       offsets (s1)                                                              
                                                                                                 
   i1        i2                s1                                                                

GroupValuesRows will then do:

┌────────────────────────────┐                                                                   
│1|10|TheQuickBrownFox....   │                                                                   
└────────────────────────────┘                           In GroupValuesRows, each input row is   
                                                         copied into Row format (simplified      
┌───────────┐                                            version shown here), including the      
│2|20|Foo   │                                            entire string content                   
└───────────┘                                                                                    
                                                                                                 
┌────────────────────────────────────┐                                                           
│3|30|SomeOtherVeryLargeString...    │                                                           
└────────────────────────────────────┘                                                           

One downside of this approach is that for "large" strings, a substantial amount of copying is required simply to check whether the group is already present.
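
To make the copying concrete, here is a minimal sketch (my illustration, not DataFusion's actual code) using arrow-rs's RowConverter, the same machinery GroupValuesRows builds on. Each converted Row carries the full string bytes, so probing the hash table for a long string means materializing all of those bytes first:

use std::sync::Arc;

use arrow::array::{ArrayRef, Int64Array, StringArray};
use arrow::datatypes::DataType;
use arrow::row::{RowConverter, SortField};

fn main() {
    let i1: ArrayRef = Arc::new(Int64Array::from(vec![1, 2, 3]));
    let s1: ArrayRef = Arc::new(StringArray::from(vec![
        "TheQuickBrownFox...",
        "Foo",
        "SomeOtherVeryLargeString...",
    ]));

    let converter = RowConverter::new(vec![
        SortField::new(DataType::Int64),
        SortField::new(DataType::Utf8),
    ])
    .unwrap();

    // each Row owns a copy of the entire string content
    let rows = converter.convert_columns(&[i1, s1]).unwrap();
    for row in rows.iter() {
        println!("row is {} bytes", row.as_ref().len());
    }
}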

Describe alternatives you've considered

The idea is to use a modified version of the group keys where the fixed length part still uses the Row format, but the variable length columns use an approach like the one in GroupValuesByes.

Something like

 ┌────┐   ┌────────────────┐                        
 │1|10│   │    offset 0    │             ┌─────────┐
 └────┘   │     len 3      │             │FooTheQui│
          └────────────────┘             │ckBrownFo│
 ┌────┐   ┌────────────────┐             │x...SomeO│
 │2|20│   │    offset 3    │             │therVeryL│
 └────┘   │    len 100     │             │argeStrin│
          └────────────────┘             │gs       │
 ┌────┐                                  │         │
 │3|30│   ┌────────────────┐             └─────────┘
 └────┘   │   offset 103   │                data    
          │    len 200     │                        
Use Rows  └────────────────┘               out of   
  for                                       line    
 fixed        offsets +                    buffer   
  part     lengths for each                 for     
            variable part                           
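
In code, such a key layout might look like the following sketch (hypothetical types for illustration, not an actual DataFusion API): the fixed-width part of each key stays in Row format, while each variable length value is copied once into a shared out-of-line buffer and the key stores only its offset and length.

struct VarLenGroupKey {
    /// arrow Row bytes for the fixed-width columns (i1, i2)
    fixed: Vec<u8>,
    /// offset of the string bytes in the shared buffer
    offset: usize,
    /// length of the string bytes; None for null
    len: Option<usize>,
}

struct GroupValuesSketch {
    /// all variable length bytes, appended once per distinct group
    buffer: Vec<u8>,
    keys: Vec<VarLenGroupKey>,
}

impl GroupValuesSketch {
    /// append a new group, copying the string bytes exactly once,
    /// and return its group id
    fn insert(&mut self, fixed: Vec<u8>, s: Option<&[u8]>) -> usize {
        let offset = self.buffer.len();
        let len = s.map(|bytes| {
            self.buffer.extend_from_slice(bytes);
            bytes.len()
        });
        self.keys.push(VarLenGroupKey { fixed, offset, len });
        self.keys.len() - 1
    }
}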
                                                    

Additional context

No response

@yjshen (Member) commented Feb 29, 2024

String/binary prefix stored in place similar to ArrowBytesMap might still be a valid plus since it allows us to avoid chasing pointers sometimes.

@alamb (Contributor, Author) commented Feb 29, 2024

> String/binary prefix stored in place similar to ArrowBytesMap might still be a valid plus since it allows us to avoid chasing pointers sometimes.

Yes, indeed -- I think an approach similar to or maybe even using ArrowBytesMap would be valuable to explore. The approach in ArrowBytesMap minimizes copies (each output string is copied once, and the final result is emitted without copying)
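
For reference, here is a minimal sketch of that inline-prefix trick (my own illustration of the general technique, not ArrowBytesMap's actual code): values short enough to fit in the entry are stored inline, so comparing short group keys never has to follow a pointer into the out-of-line buffer.

const SHORT_LEN: usize = 8; // illustrative inline capacity

enum BytesEntry {
    /// the value fits in the entry itself
    Inline { len: u8, data: [u8; SHORT_LEN] },
    /// longer values live in a shared out-of-line buffer
    Spilled { offset: usize, len: usize },
}

fn entry_matches(entry: &BytesEntry, buffer: &[u8], candidate: &[u8]) -> bool {
    match entry {
        // short values: compare inline bytes, no pointer chasing
        BytesEntry::Inline { len, data } => &data[..*len as usize] == candidate,
        // long values: fall back to the shared buffer
        BytesEntry::Spilled { offset, len } => {
            &buffer[*offset..*offset + *len] == candidate
        }
    }
}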

@alamb (Contributor, Author) commented Feb 29, 2024

BTW I may play around with this approach as a fun side project if/when I have time. In general, my high level strategy would be to hack up GroupValuesRows with this approach enough to validate that it would actually improve performance.

If it did, then I would spend the time obsessing over / optimizing the corner cases

@alamb (Contributor, Author) commented Apr 2, 2024

BTW the DuckDB paper (which I have not yet read) seems to describe a very similar layout for variable length strings: https://duckdb.org/2024/03/29/external-aggregation

[screenshot: layout figure from the DuckDB post]

@jayzhan211 (Contributor) commented

I will take a look at this first 👀

@jayzhan211 (Contributor) commented Jun 16, 2024

I tried an approach that takes multiple columns into consideration, but found that the time spent in insert_accounted increased significantly. The rough idea is to split the columns into two groups: fixed width (primitives) and variable length (string, binary). Fixed width columns are converted to arrow::Rows, while the others follow the ArrowBytesMap idea.

self.map.insert_accounted(
    new_header,
    |header| header.hash,
    &mut self.map_size,
);

The hash entry includes Vecs, like:

struct Entry<O>
where
    O: OffsetSizeTrait,
{
    /// hash of the value (stored to avoid recomputing it in hash table check)
    hash: u64,

    /// each variable length column's offset or inline (short string)
    offset_or_inline: Vec<usize>,

    /// each variable length column's length. None for null.
    len: Vec<Option<O>>,

    /// id of the group this entry belongs to
    group_id: usize,
}

It seems that we need to avoid adding Vecs to the hash entry, so ArrowBytesMap's idea couldn't help much :(

Ref: #10937

I will try converting the variable length column info (maybe the group values index) to ArrayRef, and convert them to Rows together 🤔 .
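
One general way to avoid the per-entry Vecs is to keep the hash entry fixed-size and index into columnar side storage. The sketch below illustrates the technique with hypothetical names; it is not the code from #10937:

struct FixedEntry {
    /// hash of the full group key
    hash: u64,
    /// index into the per-column storage below
    group_id: u32,
}

/// one Vec per variable length *column*, not per hash entry
struct VarLenColumns {
    /// offsets[col][group_id] -> offset into that column's byte buffer
    offsets: Vec<Vec<usize>>,
    /// lens[col][group_id] -> byte length; None for null
    lens: Vec<Vec<Option<usize>>>,
    /// buffers[col] -> all values for that column, concatenated
    buffers: Vec<Vec<u8>>,
}

impl VarLenColumns {
    /// fetch the bytes of one column for one group, or None for null
    fn value(&self, col: usize, group_id: u32) -> Option<&[u8]> {
        let len = self.lens[col][group_id as usize]?;
        let offset = self.offsets[col][group_id as usize];
        Some(&self.buffers[col][offset..offset + len])
    }
}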

@jayzhan211 (Contributor) commented

Another approach is #10976; it is still beaten by the trivial Row approach, but the time is close compared to #10937.

@alamb (Contributor, Author) commented Jun 19, 2024

#10976 is a cool approach. I have been thinking about this more, especially in the context of @jayzhan211's comment here: #10918 (comment)

Here is one potential (intermediate) phase that might be worth exploring: change the output type of the first phase of grouping to StringView -- this might help performance while keeping the required changes localized to HashAggregateExec 🤔

                              output DataType from the    
                              final phase could remain    
┌─────────────────────────┐   String                      
│HashAggregateExec        │                               
│(AggregateMode::Final)   │                               
│                         │                               
└─────────────────────────┘                               
             ▲                                            
             │                                            
┌─────────────────────────┐                               
│CoalesceBatches          │   Implement                   
└─────────────────────────┘   StringView/BinaryView       
             ▲                support in RepartitionExec  
             │                and CoalesceBatches         
             │                                            
┌─────────────────────────┐                               
│RepartitionExec          │                               
└─────────────────────────┘                               
             ▲               output DataType of all       
             │               String/Binary *GROUP*        
             │               columns becomes              
             │               StringView/BinaryView        
┌─────────────────────────┐                               
│    HashAggregateExec    │                               
│(AggregateMode::Partial) │                               
└─────────────────────────┘                               
             ▲                                            
             │                                            
             │                                            
             │                                            
       .───────────.                                      
     ,'             `.                                    
    (      Input      )                                   
     '─.           ,─'                                    
        `─────────'                                       
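
A sketch of what that intermediate step could look like at the Arrow level (assuming an arrow-rs version with StringViewArray and Utf8View --> Utf8 cast support; illustrative, not DataFusion code):

use std::sync::Arc;

use arrow::array::{Array, ArrayRef, StringViewArray};
use arrow::compute::cast;
use arrow::datatypes::DataType;

fn main() {
    // partial phase output: fixed 16-byte views over shared buffers
    let partial: ArrayRef = Arc::new(StringViewArray::from_iter_values([
        "Foo",
        "SomeOtherVeryLargeString...",
    ]));

    // final phase: cast back to Utf8 for the query's output schema
    let final_output = cast(&partial, &DataType::Utf8).unwrap();
    assert_eq!(final_output.data_type(), &DataType::Utf8);
}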

@XiangpengHao (Contributor) commented Jul 16, 2024

Want to share some of my findings here.

My approach is very similar to #10976, except that it emits StringViewArray and uses the ArrowBytesViewMap from #11421 to build the hash index of the string column.

The performance is similar to (slightly slower than) first converting to Row format and then converting back. The flamegraph shows that building the ArrowBytesViewMap takes quite a lot of time (computing hashes, hash lookups, etc.).

====

update: I tried longer strings, i.e., the "URL" column, and my approach is ~10% faster than the baseline

@XiangpengHao (Contributor) commented

One crazy idea I've been thinking about: if the string view is loaded from dictionary-encoded parquet, then the underlying buffers are unique (i.e., no duplicated values), and the view values are essentially a hash of the string -> if two strings are the same, they share the same view value; if they are different, they must have different view values.
Then we can just pass the view array to the arrow-row converter instead of the potentially large StringViewArray.

I'm not sure how fast we can get from this, my guess is that the performance gain may not justify the changes it requires.
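
For context, here is a plain-Rust illustration of the Utf8View layout behind this idea (following the Arrow spec, not arrow-rs internals): each view is a fixed 16-byte value, and for strings longer than 12 bytes it packs (len, 4-byte prefix, buffer index, offset), so under the no-duplicates assumption above the view value uniquely identifies the string.

/// pack a "long string" view per the Arrow spec: length, 4-byte
/// prefix, buffer index, and offset into that buffer
fn long_string_view(len: u32, prefix: [u8; 4], buffer_index: u32, offset: u32) -> u128 {
    (len as u128)
        | ((u32::from_le_bytes(prefix) as u128) << 32)
        | ((buffer_index as u128) << 64)
        | ((offset as u128) << 96)
}

fn main() {
    // two distinct strings in a deduplicated buffer can never produce
    // the same (buffer_index, offset, len), so their views must differ
    let v = long_string_view(100, *b"Some", 0, 3);
    println!("view value: {v:#034x}");
}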

@alamb (Contributor, Author) commented Jul 16, 2024

> dictionary-encoded parquet, then the underlying buffers are unique (i.e., no duplicated values)

That is probably correct for arrays that share the same data page. But once the next page is emitted, the dictionary changes (I think), and therefore the dictionary values may be different and the views are no longer unique 🤔

@jayzhan211 (Contributor) commented

The bottleneck I found in #10976 is also hashing.

@alamb (Contributor, Author) commented Jul 29, 2024

FWIW I filed #11680 to track some ideas for reducing hash overhead.
