Generate GroupByHash output in multiple `RecordBatch`es rather than one large one #9562

alamb · 2024-03-11T20:17:03Z

Is your feature request related to a problem or challenge?

The AggregateExec generates one single (giant) RecordBatch on output (source),
which is then emitted in parts via RecordBatch::slice(), which does not actually allocate any additional memory (source).

This has at least two potential downsides:

No memory is freed until the GroupByHash has output every output row
As we see in Further refine the Top K sort operator #9417, if there are upstream operators like TopK that hold references to any of these sliced RecordBatches, those slices are treated as though they were an additional allocation that needs to be tracked (source)

Something like this in pictures:

                                                 Output             
                           ▲               RecordBatches are        
                           │                 slices into a          
                  ┌────────────────┐          single large          
                  │  RecordBatch   │─ ─ ─ ┐   output batch          
                  └────────────────┴ ─ ─ ┐                          
                           ▲              │                         
                           │             │        ┌────────────────┐
                  ┌────────────────┐      └ ─ ─ ─▶│                │
  output stream   │  RecordBatch   │     │        ├ ─ ─ ─ ─ ─ ─ ─ ─│
                  └────────────────┘      ─ ─ ─ ─▶│                │
                                                  ├ ─ ─ ─ ─ ─ ─ ─ ─│
                         ...                      │                │
                                                  │                │
                           ▲                      │      ...       │
                           │                      │                │
                  ┌────────────────┐              │                │
                  │  RecordBatch   ├ ─ ─ ┐        │                │
                  └────────────────┘              │                │
                           ▲             │        ├ ─ ─ ─ ─ ─ ─ ─ ─│
                           │              ─ ─ ─ ─▶│                │
                           │                      └────────────────┘
                           │                                        
               ┏━━━━━━━━━━━━━━━━━━━━━━━┓                            
               ┃                       ┃       Single RecordBatch   
               ┃                       ┃                            
               ┃                       ┃                            
               ┃                       ┃                            
               ┃                       ┃                            
               ┃    GroupByHashExec    ┃                            
               ┃                       ┃                            
               ┃                       ┃                            
               ┃                       ┃                            
               ┃                       ┃                            
               ┃                       ┃                            
               ┃                       ┃                            
               ┗━━━━━━━━━━━━━━━━━━━━━━━┛
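
The sharing shown in the picture can be modeled with a small, standard-library-only sketch. `SliceView` and `slice_output` are hypothetical names used for illustration and are not the arrow-rs API; the sketch only assumes that a slice is a reference-counted view (offset plus length) into one shared allocation, which is the behavior described above.

```rust
use std::sync::Arc;

/// Toy stand-in for a sliced batch: a view (offset, len) into a shared buffer.
/// Hypothetical type; it only mimics how slices share one allocation.
#[derive(Clone)]
struct SliceView {
    buffer: Arc<Vec<i64>>,
    offset: usize,
    len: usize,
}

impl SliceView {
    /// The rows visible through this view.
    fn values(&self) -> &[i64] {
        &self.buffer[self.offset..self.offset + self.len]
    }
}

/// Cut one large buffer into views of at most `batch_size` rows each.
/// No rows are copied; every view holds a reference to the whole buffer.
fn slice_output(buffer: Arc<Vec<i64>>, batch_size: usize) -> Vec<SliceView> {
    assert!(batch_size > 0);
    (0..buffer.len())
        .step_by(batch_size)
        .map(|offset| SliceView {
            buffer: Arc::clone(&buffer),
            offset,
            len: batch_size.min(buffer.len() - offset),
        })
        .collect()
}
```

In this model, a consumer that keeps even one `SliceView` alive keeps the entire underlying buffer alive, which mirrors both downsides listed above: nothing is freed until every slice is gone, and a holder of a small slice effectively pins the large allocation.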

Describe the solution you'd like

If we had infinite time / engineering hours I think a better approach would actually be to change GroupByHash so it didn't create a single giant contiguous RecordBatch

Instead it would be better if GroupByHash produced a Vec<RecordBatch> and then incrementally fed those batches out

Doing this would allow the GroupByHash to release memory incrementally as it produces output. This is analogous to how @korowa made join output incremental in #8658

Perhaps something like

                            ▲                                        
                            │                                        
                   ┌────────────────┐                    Output      
  output stream    │  RecordBatch   │              RecordBatches are 
                   └────────────────┘              created in smaller
                            ▲                      chunks and emitted
                            │                          one by one    
                            │                                        
                            │                                        
                ┏━━━━━━━━━━━━━━━━━━━━━━━┓          ┌────────────────┐
                ┃                       ┃          │  RecordBatch   │
                ┃                       ┃          └────────────────┘
                ┃                       ┃          ┌────────────────┐
                ┃                       ┃          │  RecordBatch   │
                ┃                       ┃          └────────────────┘
                ┃    GroupByHashExec    ┃          ┌────────────────┐
                ┃                       ┃          │  RecordBatch   │
                ┃                       ┃          └────────────────┘
                ┃                       ┃                 ...        
                ┃                       ┃          ┌────────────────┐
                ┃                       ┃          │  RecordBatch   │
                ┃                       ┃          └────────────────┘
                ┗━━━━━━━━━━━━━━━━━━━━━━━┛                            
                                                    Vec<RecordBatch>
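
As a rough illustration of the idea in the picture, the following standard-library-only sketch builds the output as independently allocated chunks and hands them out one at a time. `OwnedBatch`, `build_chunks`, and `next_batch` are hypothetical names, and a plain `Vec<i64>` stands in for a real `RecordBatch`; this is a sketch under those assumptions, not the DataFusion implementation.

```rust
use std::collections::VecDeque;

/// Toy stand-in for a batch that owns its rows (not the arrow-rs `RecordBatch`).
struct OwnedBatch {
    rows: Vec<i64>,
}

/// Build the output as separate, independently allocated chunks of at most
/// `batch_size` rows instead of one contiguous allocation.
fn build_chunks(
    rows: impl IntoIterator<Item = i64>,
    batch_size: usize,
) -> VecDeque<OwnedBatch> {
    assert!(batch_size > 0);
    let mut chunks = VecDeque::new();
    let mut current = Vec::with_capacity(batch_size);
    for row in rows {
        current.push(row);
        if current.len() == batch_size {
            // Seal the full chunk and start a fresh allocation for the next one.
            let full = std::mem::replace(&mut current, Vec::with_capacity(batch_size));
            chunks.push_back(OwnedBatch { rows: full });
        }
    }
    if !current.is_empty() {
        chunks.push_back(OwnedBatch { rows: current });
    }
    chunks
}

/// Hand out the next chunk in order. Once the caller drops it, its memory is
/// released independently of the chunks still pending.
fn next_batch(pending: &mut VecDeque<OwnedBatch>) -> Option<OwnedBatch> {
    pending.pop_front()
}
```

Because each chunk owns its own allocation, dropping an emitted chunk frees its memory immediately, while the remaining chunks stay queued.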

Describe alternatives you've considered

No response

Additional context

@yjshen notes:

To improve AggExec's mono output pattern, #7065 might be similar to the idea of incremental output.


guojidan · 2024-03-12T10:19:22Z

take

guojidan · 2024-03-12T10:21:27Z

This issue is interesting, let me try to implement it

alamb · 2024-03-13T00:05:35Z

FYI this issue may be tricky -- as it will be performance critical -- I will be happy to assist

guojidan · 2024-03-22T08:57:56Z

This issue is a bit tricky for me 😢. I can only think of the following approach:
Change GroupedHashAggregateStream::emit() to return Result<Vec<RecordBatch>>, where each RecordBatch has num_rows equal to GroupedHashAggregateStream::batch_size, and change ExecutionState::ProducingOutput(RecordBatch) to ExecutionState::ProducingOutput(Vec<RecordBatch>). Then GroupedHashAggregateStream::poll_next() returns one element of the vector per loop iteration, like let output = batchs.pop() (source)
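
A minimal, synchronous sketch of this proposal might look like the following. All types here (`Batch`, `ToyAggregateStream`, and the simplified `ExecutionState`) are hypothetical stand-ins, and `next_batch` is a plain function rather than a real `poll_next`. One detail that is an assumption of this sketch rather than part of the proposal: `Vec::pop()` removes from the end, so the batches are stored in reverse to come out in their original order.

```rust
/// Toy stand-in for a record batch (hypothetical; not the arrow-rs type).
#[derive(Debug, PartialEq)]
struct Batch(Vec<i64>);

/// Simplified execution state, loosely modeled on the proposal.
enum ExecutionState {
    ReadingInput,
    /// Batches still waiting to be returned, stored in reverse so that
    /// `pop()` yields them in their original order.
    ProducingOutput(Vec<Batch>),
    Done,
}

struct ToyAggregateStream {
    state: ExecutionState,
    groups: Vec<i64>,
    batch_size: usize,
}

impl ToyAggregateStream {
    fn new(groups: Vec<i64>, batch_size: usize) -> Self {
        assert!(batch_size > 0);
        Self { state: ExecutionState::ReadingInput, groups, batch_size }
    }

    /// Emit all accumulated groups as batches of at most `batch_size` rows.
    fn emit(&mut self) -> Vec<Batch> {
        let groups = std::mem::take(&mut self.groups);
        groups.chunks(self.batch_size).map(|c| Batch(c.to_vec())).collect()
    }

    /// Synchronous analogue of `poll_next`: return one batch per call.
    fn next_batch(&mut self) -> Option<Batch> {
        loop {
            // Take the state out, leaving `Done` as the default to fall back on.
            match std::mem::replace(&mut self.state, ExecutionState::Done) {
                ExecutionState::ReadingInput => {
                    let mut batches = self.emit();
                    batches.reverse();
                    self.state = ExecutionState::ProducingOutput(batches);
                }
                ExecutionState::ProducingOutput(mut batches) => {
                    if let Some(batch) = batches.pop() {
                        self.state = ExecutionState::ProducingOutput(batches);
                        return Some(batch);
                    }
                    // No batches left: the state stays `Done`.
                }
                ExecutionState::Done => return None,
            }
        }
    }
}
```

For example, a stream built with `ToyAggregateStream::new((0..5).collect(), 2)` returns batches of 2, 2, and 1 rows on successive `next_batch()` calls and `None` afterwards. A `VecDeque` with `pop_front()` would be an alternative to reversing the vector.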

alamb · 2024-03-22T14:24:00Z

This issue is a bit tricky for me 😢. I can only think of the following approach: Change GroupedHashAggregateStream::emit() to return Result<Vec<RecordBatch>>, where each RecordBatch has num_rows equal to GroupedHashAggregateStream::batch_size, and change ExecutionState::ProducingOutput(RecordBatch) to ExecutionState::ProducingOutput(Vec<RecordBatch>). Then GroupedHashAggregateStream::poll_next() returns one element of the vector per loop iteration, like let output = batchs.pop() (source)

I think this approach sounds good -- nice proposal

One thing that could help keep the PRs small and manageable would be to switch the APIs as described above, but avoid having to change all the GroupAccumulators in one PR by returning a Vec<> of size 1 for most of them.

Then we can make subsequent PRs to switch over other group accumulators as needed
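
One way to picture this incremental migration is a trait with a default method that wraps the existing single result in a `Vec` of size 1, so only the accumulators that have been switched over need to override it. The trait and method names below (`ToyGroupsAccumulator`, `evaluate_chunked`) are hypothetical and much simpler than the real DataFusion accumulator API; this is only a sketch of the pattern.

```rust
/// Toy stand-in for an output array (hypothetical; not an arrow-rs type).
type Column = Vec<i64>;

/// Hypothetical, simplified accumulator trait; the real API differs.
trait ToyGroupsAccumulator {
    /// Existing-style API: produce all results as one column.
    fn evaluate(&mut self) -> Column;

    /// New-style API: produce results as columns of at most `batch_size` rows.
    /// Default: wrap the single result in a `Vec` of size 1, ignoring `batch_size`.
    fn evaluate_chunked(&mut self, _batch_size: usize) -> Vec<Column> {
        vec![self.evaluate()]
    }
}

/// Not yet migrated: relies on the default size-1 behavior.
struct SumAccumulator {
    sums: Vec<i64>,
}

impl ToyGroupsAccumulator for SumAccumulator {
    fn evaluate(&mut self) -> Column {
        std::mem::take(&mut self.sums)
    }
}

/// Migrated in a later step: overrides the default to emit real chunks.
struct CountAccumulator {
    counts: Vec<i64>,
}

impl ToyGroupsAccumulator for CountAccumulator {
    fn evaluate(&mut self) -> Column {
        std::mem::take(&mut self.counts)
    }

    fn evaluate_chunked(&mut self, batch_size: usize) -> Vec<Column> {
        assert!(batch_size > 0);
        let counts = std::mem::take(&mut self.counts);
        counts.chunks(batch_size).map(|c| c.to_vec()).collect()
    }
}
```

Callers can use `evaluate_chunked` everywhere from the first PR, and each accumulator can be converted to true chunked output independently.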

Rachelint · 2024-07-30T15:33:10Z

Actually, I found that the slice function is not trivial, because of the eager computation of the null count.
One way I can think of to solve it is to make the computation lazy; see the arrow-rs PR (q32 is 1.10~1.15x faster without the null count computation):
apache/arrow-rs#6155

The alternative may be the way mentioned in this issue.
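
The lazy approach can be sketched with the standard library's `OnceLock`: slicing does no counting, and the null count is computed and cached only when first requested. `ToyColumn` and its fields are hypothetical and do not reflect how arrow-rs represents arrays or null buffers, nor how apache/arrow-rs#6155 is implemented.

```rust
use std::sync::{Arc, OnceLock};

/// Toy nullable column (hypothetical; not an arrow-rs array type).
struct ToyColumn {
    values: Arc<Vec<Option<i64>>>,
    offset: usize,
    len: usize,
    /// Filled in on the first call to `null_count()`.
    cached_null_count: OnceLock<usize>,
}

impl ToyColumn {
    fn new(values: Vec<Option<i64>>) -> Self {
        let len = values.len();
        Self {
            values: Arc::new(values),
            offset: 0,
            len,
            cached_null_count: OnceLock::new(),
        }
    }

    /// Zero-copy slice: shares the values and does not count nulls up front.
    fn slice(&self, offset: usize, len: usize) -> Self {
        assert!(offset + len <= self.len);
        Self {
            values: Arc::clone(&self.values),
            offset: self.offset + offset,
            len,
            cached_null_count: OnceLock::new(),
        }
    }

    /// Count nulls in this view on first use and cache the result.
    fn null_count(&self) -> usize {
        *self.cached_null_count.get_or_init(|| {
            self.values[self.offset..self.offset + self.len]
                .iter()
                .filter(|v| v.is_none())
                .count()
        })
    }
}
```

With this shape, slices whose null count is never inspected pay no scanning cost, which is the effect the lazy computation aims for.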

Rachelint · 2024-07-30T15:38:50Z

Hi @guojidan, seems there is no update for some months, can I pick this up?
I want to try it and compare the effect with lazy null count computation.

guojidan · 2024-07-31T00:44:30Z

Hi @guojidan, seems there is no update for some months, can I pick this up? I want to try it and compare the effect with lazy null count computation.

Thanks. You can take this. I am working on another project now.

alamb added the enhancement New feature or request label Mar 11, 2024

This was referenced Mar 11, 2024

Further refine the Top K sort operator #9417

Open

[EPIC] (Even More) Grouping / Group By / Aggregation Performance #7000

Open

github-actions bot assigned guojidan Mar 12, 2024

guojidan mentioned this issue Mar 27, 2024

refactor: Generate GroupByHash output in multiple RecordBatches #9818

Closed

alamb mentioned this issue Jul 26, 2024

Bug(arrow-row): calling convert_raw function cause "offset overflow" panic apache/arrow-rs#6112

Open

JasonLi-cn linked a pull request Aug 1, 2024 that will close this issue

Generate GroupByHash output in multiple RecordBatches #11758

Draft

3 tasks

alamb mentioned this issue Aug 9, 2024

lazily compute for null count(seems help to high cardinality aggr) apache/arrow-rs#6155

Draft

Rachelint mentioned this issue Aug 13, 2024

Sketch for aggregation intermediate results blocked management #11943

Open
