Improve performance for grouping by variable length columns (strings) #9403

Open · alamb opened this issue Feb 29, 2024 · 13 comments
Labels: enhancement (New feature or request)

@alamb (Contributor) commented Feb 29, 2024

Is your feature request related to a problem or challenge?

As always, I would like faster aggregation performance.

Describe the solution you'd like

ClickBench Q17, Q18, and Q19 include:

SELECT "UserID", "SearchPhrase", COUNT(*) FROM hits GROUP BY "UserID", "SearchPhrase" ORDER BY COUNT(*) DESC LIMIT 10;
SELECT "UserID", "SearchPhrase", COUNT(*) FROM hits GROUP BY "UserID", "SearchPhrase" LIMIT 10;
SELECT "UserID", extract(minute FROM to_timestamp_seconds("EventTime")) AS m, "SearchPhrase", COUNT(*) FROM hits GROUP BY "UserID", m, "SearchPhrase" ORDER BY COUNT(*) DESC LIMIT 10;

These queries group on an Int64 column and a Utf8 (string) column:

DataFusion CLI v36.0.0
❯ describe 'hits.parquet';
+-----------------------+-----------+-------------+
| column_name           | data_type | is_nullable |
+-----------------------+-----------+-------------+
...
| UserID                | Int64     | NO          |
...
| SearchPhrase          | Utf8      | NO          |
...
+-----------------------+-----------+-------------+
105 rows in set. Query took 0.035 seconds.

In some profiling of Q19 (SELECT "UserID", "SearchPhrase", COUNT(*) FROM hits GROUP BY "UserID", "SearchPhrase" ORDER BY COUNT(*) DESC LIMIT 10;), I found that 20-30% of the time is spent converting Array --> Row or Row --> Array.

Thus I think adding special handling for variable length data vs. fixed length data in the group management may help.

Background

GroupValuesRows, used for the queries above, is here:
https://github.com/apache/arrow-datafusion/blob/edec4189242ab07ac65967490537d77e776aad5c/datafusion/physical-plan/src/aggregates/group_values/row.rs#L32

Given a query like SELECT ... GROUP BY i1, i2, s1, where i1 and i2 are integer columns and s1 is a string column, the input looks like this:

                       ┌─────┐ ┌─────────┐                                                       
                       │  0  │ │TheQuickB│                                                       
┌─────┐   ┌─────┐      ├─────┤ │rownFox..│                                                       
│  1  │   │ 10  │      │ 100 │ │.FooSomeO│               In the input Arrow Arrays, variable     
├─────┤   ├─────┤      ├─────┤ │therVeryL│               length columns have offsets into other  
│  2  │   │ 20  │      │ 103 │ │argeStrin│               buffers                                 
├─────┤   ├─────┤      ├─────┤ │gs       │                                                       
│  5  │   │ 50  │      │ 300 │ │         │                                                       
├─────┤   ├─────┤      ├─────┤ └─────────┘                                                       
│ ... │   │ ... │      │ ... │  data (s1)                                                        
├─────┤   ├─────┤      ├─────┤                                                                   
│  6  │   │ 60  │      │ 600 │                                                                   
└─────┘   └─────┘      └─────┘                                                                   
                       offsets (s1)                                                              
                                                                                                 
   i1        i2                s1                                                                

GroupValuesRows will then do:

┌────────────────────────────┐                                                                   
│1|10|TheQuickBrownFox....   │                                                                   
└────────────────────────────┘                           In GroupValuesRows, each input row is   
                                                         copied into Row format (simplified      
┌───────────┐                                            version shown here), including the      
│2|20|Foo   │                                            entire string content                   
└───────────┘                                                                                    
                                                                                                 
┌────────────────────────────────────┐                                                           
│3|30|SomeOtherVeryLargeString...    │                                                           
└────────────────────────────────────┘                                                           

One downside of this approach is that for "large" strings, a substantial amount of copying is required simply to check whether the group is already present.
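
To make the copying concrete, here is a minimal sketch (my illustration, not DataFusion's actual code) using arrow-rs's RowConverter, the same machinery GroupValuesRows builds on. Each converted Row carries the full string bytes, so probing the hash table for a long string means materializing all of those bytes first:

use std::sync::Arc;

use arrow::array::{ArrayRef, Int64Array, StringArray};
use arrow::datatypes::DataType;
use arrow::row::{RowConverter, SortField};

fn main() {
    let i1: ArrayRef = Arc::new(Int64Array::from(vec![1, 2, 3]));
    let s1: ArrayRef = Arc::new(StringArray::from(vec![
        "TheQuickBrownFox...",
        "Foo",
        "SomeOtherVeryLargeString...",
    ]));

    let converter = RowConverter::new(vec![
        SortField::new(DataType::Int64),
        SortField::new(DataType::Utf8),
    ])
    .unwrap();

    // each Row owns a copy of the entire string content
    let rows = converter.convert_columns(&[i1, s1]).unwrap();
    for row in rows.iter() {
        println!("row is {} bytes", row.as_ref().len());
    }
}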

Describe alternatives you've considered

The idea is to use a modified version of the group keys where the fixed length part still uses the Row format, but the variable length columns use an approach like the one in GroupValuesByes.

Something like

 ┌────┐   ┌────────────────┐                        
 │1|10│   │    offset 0    │             ┌─────────┐
 └────┘   │     len 3      │             │FooTheQui│
          └────────────────┘             │ckBrownFo│
 ┌────┐   ┌────────────────┐             │x...SomeO│
 │2|20│   │    offset 3    │             │therVeryL│
 └────┘   │    len 100     │             │argeStrin│
          └────────────────┘             │gs       │
 ┌────┐                                  │         │
 │3|30│   ┌────────────────┐             └─────────┘
 └────┘   │   offset 103   │                data    
          │    len 200     │                        
Use Rows  └────────────────┘               out of   
  for                                       line    
 fixed        offsets +                    buffer   
  part     lengths for each                 for     
            variable part                           
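
In code, such a key layout might look like the following sketch (hypothetical types for illustration, not an actual DataFusion API): the fixed-width part of each key stays in Row format, while each variable length value is copied once into a shared out-of-line buffer and the key stores only its offset and length.

struct VarLenGroupKey {
    /// arrow Row bytes for the fixed-width columns (i1, i2)
    fixed: Vec<u8>,
    /// offset of the string bytes in the shared buffer
    offset: usize,
    /// length of the string bytes; None for null
    len: Option<usize>,
}

struct GroupValuesSketch {
    /// all variable length bytes, appended once per distinct group
    buffer: Vec<u8>,
    keys: Vec<VarLenGroupKey>,
}

impl GroupValuesSketch {
    /// append a new group, copying the string bytes exactly once,
    /// and return its group id
    fn insert(&mut self, fixed: Vec<u8>, s: Option<&[u8]>) -> usize {
        let offset = self.buffer.len();
        let len = s.map(|bytes| {
            self.buffer.extend_from_slice(bytes);
            bytes.len()
        });
        self.keys.push(VarLenGroupKey { fixed, offset, len });
        self.keys.len() - 1
    }
}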
                                                    

Additional context

No response

@yjshen (Member) commented Feb 29, 2024

String/binary prefix stored in place similar to ArrowBytesMap might still be a valid plus since it allows us to avoid chasing pointers sometimes.

@alamb (Contributor, Author) commented Feb 29, 2024

> String/binary prefix stored in place similar to ArrowBytesMap might still be a valid plus since it allows us to avoid chasing pointers sometimes.

Yes, indeed -- I think an approach similar to or maybe even using ArrowBytesMap would be valuable to explore. The approach in ArrowBytesMap minimizes copies (each output string is copied once, and the final result is emitted without copying)
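
For reference, here is a minimal sketch of that inline-prefix trick (my own illustration of the general technique, not ArrowBytesMap's actual code): values short enough to fit in the entry are stored inline, so comparing short group keys never has to follow a pointer into the out-of-line buffer.

const SHORT_LEN: usize = 8; // illustrative inline capacity

enum BytesEntry {
    /// the value fits in the entry itself
    Inline { len: u8, data: [u8; SHORT_LEN] },
    /// longer values live in a shared out-of-line buffer
    Spilled { offset: usize, len: usize },
}

fn entry_matches(entry: &BytesEntry, buffer: &[u8], candidate: &[u8]) -> bool {
    match entry {
        // short values: compare inline bytes, no pointer chasing
        BytesEntry::Inline { len, data } => &data[..*len as usize] == candidate,
        // long values: fall back to the shared buffer
        BytesEntry::Spilled { offset, len } => {
            &buffer[*offset..*offset + *len] == candidate
        }
    }
}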

@alamb (Contributor, Author) commented Feb 29, 2024

BTW I may play around with this approach as a fun side project if/when I have time. In general, my high level strategy would be to hack up GroupValuesRows with this approach enough to validate that it would actually improve performance.

If it did, then I would spend the time obsessing over / optimizing the corner cases

@alamb (Contributor, Author) commented Apr 2, 2024

BTW the DuckDB paper (which I have not yet read) seems to describe a very similar layout for variable length strings: https://duckdb.org/2024/03/29/external-aggregation

[screenshot: layout figure from the DuckDB post]

@jayzhan211 (Contributor) commented

I will take a look at this first 👀

@jayzhan211 (Contributor) commented Jun 16, 2024

I tried an approach that takes multiple columns into consideration, but found that the time spent in insert_accounted increased significantly. The rough idea is to split the columns into two groups: fixed width (primitives) and variable length (string, binary). Fixed width columns are converted to arrow::Rows, while the others follow the ArrowBytesMap idea.

self.map.insert_accounted(
    new_header,
    |header| header.hash,
    &mut self.map_size,
);

The hash entry includes Vecs, like:

struct Entry<O>
where
    O: OffsetSizeTrait,
{
    /// hash of the value (stored to avoid recomputing it in hash table check)
    hash: u64,

    /// each variable length column's offset or inline (short string)
    offset_or_inline: Vec<usize>,

    /// each variable length column's length. None for null.
    len: Vec<Option<O>>,

    /// id of the group this entry belongs to
    group_id: usize,
}

It seems that we need to avoid adding Vecs to the hash entry, so ArrowBytesMap's idea couldn't help much :(

Ref: #10937

I will try converting the variable length column info (maybe the group values index) to ArrayRef, and convert them to Rows together 🤔 .
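
One general way to avoid the per-entry Vecs is to keep the hash entry fixed-size and index into columnar side storage. The sketch below illustrates the technique with hypothetical names; it is not the code from #10937:

struct FixedEntry {
    /// hash of the full group key
    hash: u64,
    /// index into the per-column storage below
    group_id: u32,
}

/// one Vec per variable length *column*, not per hash entry
struct VarLenColumns {
    /// offsets[col][group_id] -> offset into that column's byte buffer
    offsets: Vec<Vec<usize>>,
    /// lens[col][group_id] -> byte length; None for null
    lens: Vec<Vec<Option<usize>>>,
    /// buffers[col] -> all values for that column, concatenated
    buffers: Vec<Vec<u8>>,
}

impl VarLenColumns {
    /// fetch the bytes of one column for one group, or None for null
    fn value(&self, col: usize, group_id: u32) -> Option<&[u8]> {
        let len = self.lens[col][group_id as usize]?;
        let offset = self.offsets[col][group_id as usize];
        Some(&self.buffers[col][offset..offset + len])
    }
}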

@jayzhan211 (Contributor) commented

Another approach is #10976; it is still beaten by the trivial Row approach, but the time is close compared to #10937.

@alamb (Contributor, Author) commented Jun 19, 2024

#10976 is a cool approach. I have been thinking about this more, especially in the context of @jayzhan211's comment here: #10918 (comment)

Here is one potential (intermediate) phase that might be worth exploring: change the output type of the first phase of grouping to StringView -- this might help performance while keeping the required changes localized to HashAggregateExec 🤔

                              output DataType from the    
                              final phase could remain    
┌─────────────────────────┐   String                      
│HashAggregateExec        │                               
│(AggregateMode::Final)   │                               
│                         │                               
└─────────────────────────┘                               
             ▲                                            
             │                                            
┌─────────────────────────┐                               
│CoalesceBatches          │   Implement                   
└─────────────────────────┘   StringView/BinaryView       
             ▲                support in RepartitionExec  
             │                and CoalesceBatches         
             │                                            
┌─────────────────────────┐                               
│RepartitionExec          │                               
└─────────────────────────┘                               
             ▲               output DataType of all       
             │               String/Binary *GROUP*        
             │               columns becomes              
             │               StringView/BinaryView        
┌─────────────────────────┐                               
│    HashAggregateExec    │                               
│(AggregateMode::Partial) │                               
└─────────────────────────┘                               
             ▲                                            
             │                                            
             │                                            
             │                                            
       .───────────.                                      
     ,'             `.                                    
    (      Input      )                                   
     '─.           ,─'                                    
        `─────────'                                       
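
A sketch of what that intermediate step could look like at the Arrow level (assuming an arrow-rs version with StringViewArray and Utf8View --> Utf8 cast support; illustrative, not DataFusion code):

use std::sync::Arc;

use arrow::array::{Array, ArrayRef, StringViewArray};
use arrow::compute::cast;
use arrow::datatypes::DataType;

fn main() {
    // partial phase output: fixed 16-byte views over shared buffers
    let partial: ArrayRef = Arc::new(StringViewArray::from_iter_values([
        "Foo",
        "SomeOtherVeryLargeString...",
    ]));

    // final phase: cast back to Utf8 for the query's output schema
    let final_output = cast(&partial, &DataType::Utf8).unwrap();
    assert_eq!(final_output.data_type(), &DataType::Utf8);
}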

@XiangpengHao (Contributor) commented Jul 16, 2024

Want to share some of my findings here.

My approach is very similar to #10976, except that it emits StringViewArray and uses the ArrowBytesViewMap from #11421 to build the hash index of the string column.

The performance is similar to (slightly slower than) first converting to Row format and then converting back. The flamegraph shows that building the ArrowBytesViewMap takes quite a lot of time (computing hashes, hash lookups, etc.).

====

update: I tried longer strings, i.e., the "URL" column, and my approach is ~10% faster than the baseline

@XiangpengHao (Contributor) commented

One crazy idea I've been thinking about: if the string view is loaded from dictionary-encoded parquet, then the underlying buffers are unique (i.e., no duplicated values), and the view values are essentially a hash of the string -> if two strings are the same, they share the same view value; if they are different, they must have different view values.
Then we can just pass the view array to the arrow-row converter instead of the potentially large StringViewArray.

I'm not sure how fast we can get from this, my guess is that the performance gain may not justify the changes it requires.
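
For context, here is a plain-Rust illustration of the Utf8View layout behind this idea (following the Arrow spec, not arrow-rs internals): each view is a fixed 16-byte value, and for strings longer than 12 bytes it packs (len, 4-byte prefix, buffer index, offset), so under the no-duplicates assumption above the view value uniquely identifies the string.

/// pack a "long string" view per the Arrow spec: length, 4-byte
/// prefix, buffer index, and offset into that buffer
fn long_string_view(len: u32, prefix: [u8; 4], buffer_index: u32, offset: u32) -> u128 {
    (len as u128)
        | ((u32::from_le_bytes(prefix) as u128) << 32)
        | ((buffer_index as u128) << 64)
        | ((offset as u128) << 96)
}

fn main() {
    // two distinct strings in a deduplicated buffer can never produce
    // the same (buffer_index, offset, len), so their views must differ
    let v = long_string_view(100, *b"Some", 0, 3);
    println!("view value: {v:#034x}");
}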

@alamb (Contributor, Author) commented Jul 16, 2024

> dictionary-encoded parquet, then the underlying buffers are unique (i.e., no duplicated values)

That is probably correct for arrays that share the same data page. But once the next page is emitted, the dictionary changes (I think), and therefore the dictionary values may be different and the views are no longer unique 🤔

@jayzhan211 (Contributor) commented

The bottleneck I found in #10976 is also hashing.

@alamb (Contributor, Author) commented Jul 29, 2024

FWIW I filed #11680 to track some ideas for reducing hash overhead.
