
lazily compute for null count(seems help to high cardinality aggr) #6155

Draft · wants to merge 3 commits into master
Conversation

Rachelint (Contributor)

Which issue does this PR close?

Closes #6146

Rationale for this change

See #6146

What changes are included in this PR?

Lazily compute the null count when it is expensive to calculate.

Are there any user-facing changes?

No.
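For context, a minimal sketch of the idea, simplified from the diff in this PR. The NullCount enum variants and count_set_bits come from the changed code below; the get method, the closure parameter, the -1 sentinel value, and the orderings here are illustrative assumptions, not necessarily the exact code in this PR.

use std::sync::atomic::{AtomicI64, Ordering};

// Sentinel meaning "not computed yet" (illustrative value).
const UNINITIALIZED_NULL_COUNT: i64 = -1;

#[derive(Debug)]
pub enum NullCount {
    /// Null count known up front (cheap to compute or provided by the caller).
    Eager(usize),
    /// Null count computed on first access and then cached.
    Lazy(AtomicI64),
}

impl NullCount {
    /// Returns the null count, computing and caching it on first use.
    /// `compute` would typically count the unset bits of the validity bitmap,
    /// i.e. `buffer.len() - buffer.count_set_bits()`.
    pub fn get(&self, compute: impl FnOnce() -> usize) -> usize {
        match self {
            NullCount::Eager(n) => *n,
            NullCount::Lazy(cached) => {
                let v = cached.load(Ordering::Acquire);
                if v != UNINITIALIZED_NULL_COUNT {
                    return v as usize;
                }
                let computed = compute();
                cached.store(computed as i64, Ordering::Release);
                computed
            }
        }
    }
}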

github-actions bot added the arrow (Changes to the arrow crate) label Jul 29, 2024
Rachelint marked this pull request as ready for review July 29, 2024 21:54
Rachelint changed the title from "lazily compute for null count when it is expansive to calculate it" to "lazily compute for null count (seems help to high cardinality aggr)" Jul 29, 2024
alamb (Contributor) commented Jul 29, 2024

Thank you @Rachelint -- is there any chance we can get a benchmark showing the effect of this PR?

Rachelint (Contributor, Author) commented Jul 29, 2024

Thank you @Rachelint -- is there any chance we can get a benchmark showing the effect of this PR?

I ran ClickBench in DataFusion with this PR locally; Q32 seems 1.10~1.15x faster.
See this branch: https://github.com/Rachelint/arrow-datafusion/tree/try-opt-agg

After profiling, I found it is because the slice() function is called many times, and the eager computation of the null count seems to cost a lot of CPU.
see #6146
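(For illustration only, a hypothetical sketch of that slicing pattern, using the public arrow-rs Array::slice / null_count APIs; the array contents, sizes, and batch size are made up:)

use arrow_array::{Array, Int32Array};

fn main() {
    // A nullable array standing in for one large aggregation output batch.
    let values: Int32Array = (0..1_000_000)
        .map(|i| if i % 3 == 0 { None } else { Some(i) })
        .collect();

    // Emitting the output in small batches slices the parent array many times.
    // With an eagerly computed null count, each slice() pays a popcount over its
    // portion of the validity bitmap, even if the count is never read afterwards.
    for start in (0..values.len()).step_by(1024) {
        let len = 1024.min(values.len() - start);
        let sliced = values.slice(start, len);
        assert!(sliced.null_count() <= len);
    }
}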

┏━━━━━━━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┓
┃ Query        ┃       main ┃ try-opt-agg ┃        Change ┃
┡━━━━━━━━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━┩
│ QQuery 0     │     0.81ms │      0.79ms │     no change │
│ QQuery 1     │    69.11ms │     70.56ms │     no change │
│ QQuery 2     │   169.47ms │    168.31ms │     no change │
│ QQuery 3     │   182.60ms │    187.44ms │     no change │
│ QQuery 4     │  1589.41ms │   1604.52ms │     no change │
│ QQuery 5     │  1597.02ms │   1587.10ms │     no change │
│ QQuery 6     │    57.92ms │     63.17ms │  1.09x slower │
│ QQuery 7     │    72.33ms │     71.48ms │     no change │
│ QQuery 8     │  2415.02ms │   2434.26ms │     no change │
│ QQuery 9     │  1928.42ms │   1918.16ms │     no change │
│ QQuery 10    │   544.39ms │    547.84ms │     no change │
│ QQuery 11    │   605.79ms │    602.19ms │     no change │
│ QQuery 12    │  1767.57ms │   1758.03ms │     no change │
│ QQuery 13    │  4073.33ms │   4077.78ms │     no change │
│ QQuery 14    │  2583.14ms │   2605.21ms │     no change │
│ QQuery 15    │  1784.13ms │   1792.49ms │     no change │
│ QQuery 16    │  5028.55ms │   5087.98ms │     no change │
│ QQuery 17    │  4956.14ms │   4939.23ms │     no change │
│ QQuery 18    │ 10436.51ms │  10515.59ms │     no change │
│ QQuery 19    │   144.11ms │    149.23ms │     no change │
│ QQuery 20    │  3310.77ms │   3299.29ms │     no change │
│ QQuery 21    │  3887.09ms │   3892.00ms │     no change │
│ QQuery 22    │  9398.96ms │   9370.21ms │     no change │
│ QQuery 23    │ 23087.26ms │  23049.61ms │     no change │
│ QQuery 24    │  1168.15ms │   1174.31ms │     no change │
│ QQuery 25    │  1046.92ms │   1046.00ms │     no change │
│ QQuery 26    │  1352.80ms │   1362.05ms │     no change │
│ QQuery 27    │  4711.92ms │   4694.96ms │     no change │
│ QQuery 28    │ 21891.92ms │  21818.50ms │     no change │
│ QQuery 29    │   920.19ms │    897.67ms │     no change │
│ QQuery 30    │  2075.81ms │   2077.62ms │     no change │
│ QQuery 31    │  2961.03ms │   2934.88ms │     no change │
│ QQuery 32    │ 16167.05ms │  14340.64ms │ +1.13x faster │
│ QQuery 33    │  9418.20ms │   9414.53ms │     no change │
│ QQuery 34    │  9388.74ms │   9394.28ms │     no change │
│ QQuery 35    │  3108.34ms │   3117.84ms │     no change │
│ QQuery 36    │   270.02ms │    273.27ms │     no change │
│ QQuery 37    │   166.63ms │    170.44ms │     no change │
│ QQuery 38    │   158.33ms │    155.67ms │     no change │
│ QQuery 39    │   834.47ms │    837.39ms │     no change │
│ QQuery 40    │    63.22ms │     63.04ms │     no change │
│ QQuery 41    │    59.97ms │     58.82ms │     no change │
│ QQuery 42    │    70.34ms │     69.78ms │     no change │
└──────────────┴────────────┴─────────────┴───────────────┘

let computed_null_count = self.buffer.len() - self.buffer.count_set_bits();
v.store(computed_null_count as i64, Ordering::Release);
Contributor:
I wonder if using Ordering::Relaxed https://doc.rust-lang.org/std/sync/atomic/enum.Ordering.html#variant.Relaxed would make this PR potentially faster

Specifically, it might make the generated code simpler

Also, was there a reason to remove #[inline] - maybe that could account for the slow down 🤔

Member:

(They would be similar on x86, just affecting reordering?)

Contributor:

I agree -- I was thinking about ways to reduce the overhead of using atomic instructions, inlining, etc. -- it may not be a good idea

mapleFU (Member) Aug 9, 2024:

The atomic is just a load and a store instruction? But count_set_bits() may not be suitable for inlining (or is it just a __builtin bit-counting intrinsic)?

Maybe letting the compiler decide is ok.

Rachelint (Contributor, Author) Aug 9, 2024:

I wonder if using Ordering::Relaxed https://doc.rust-lang.org/std/sync/atomic/enum.Ordering.html#variant.Relaxed would make this PR potentially faster

Specifically, it might make the generated code simpler

Also, was there a reason to remove #[inline] - maybe that could account for the slow down 🤔

I am not sure whether we should use Relaxed or Acquire + Release. I switched to Acquire + Release because I am worried about the following situation:

The ArrayRef is held by multiple threads, and they call the null_count function concurrently; if we use Relaxed, the count_set_bits computation could be performed many times...

Another reason is that I found the strongest ordering, SeqCst, used in arrow cpp:
https://github.com/apache/arrow/blob/187197c369058f7d1377c1b161c469a9e4542caf/cpp/src/arrow/array/data.cc#L206-L218

tustvold (Contributor) Aug 9, 2024:

The ArrayRef is held by multiple threads, and they call the null_count function concurrently; if we use Relaxed, the count_set_bits computation could be performed many times...

No memory ordering will prevent this, you'd need some mutual exclusion primitive if this was the goal. However, it doesn't really matter if two threads race to compute the null count as they'll just set it to the same value. As we aren't using the write to synchronise other memory operations, Relaxed would be perfectly valid here.

I do think the proposal to simplify the state is worth considering, as it will potentially remove a conditional load
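(A sketch of that racy-but-correct get-or-compute pattern with Relaxed ordering; the LazyNullCount type, get method, and -1 sentinel are illustrative assumptions. Both threads may run the computation, but they store the same value:)

use std::sync::atomic::{AtomicI64, Ordering};

const UNINIT: i64 = -1; // illustrative sentinel: null count not computed yet

pub struct LazyNullCount {
    cached: AtomicI64,
}

impl LazyNullCount {
    pub fn new() -> Self {
        Self { cached: AtomicI64::new(UNINIT) }
    }

    pub fn get(&self, compute: impl FnOnce() -> usize) -> usize {
        let v = self.cached.load(Ordering::Relaxed);
        if v != UNINIT {
            return v as usize;
        }
        // Two threads may both reach this point and both run `compute`; they
        // race to store the same value, so the result is still correct without
        // any stronger ordering or mutual exclusion.
        let computed = compute();
        self.cached.store(computed as i64, Ordering::Relaxed);
        computed
    }
}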

Rachelint (Contributor, Author) Aug 13, 2024:

The ArrayRef is held by multiple threads, and they call the null_count function concurrently; if we use Relaxed, the count_set_bits computation could be performed many times...

No memory ordering will prevent this, you'd need some mutual exclusion primitive if this was the goal. However, it doesn't really matter if two threads race to compute the null count as they'll just set it to the same value. As we aren't using the write to synchronise other memory operations, Relaxed would be perfectly valid here.

I do think the proposal to simplify the state is worth considering, as it will potentially remove a conditional load

Sorry, maybe I didn't state it clearly. Is it possible that, due to compiler reordering or a CPU cache-coherence problem, the actual execution order becomes:

  if cached_null_count != UNINITIALIZED_NULL_COUNT {
      return cached_null_count as usize;
  }
  let cached_null_count = v.load(Ordering::Relaxed);

  let computed_null_count = self.buffer.len() - self.buffer.count_set_bits();

And finally, the computation will be performed one more time? Although it actually doesn't affect correctness, and it may be rare to happen.

cc @XiangpengHao

tustvold (Contributor) Aug 13, 2024:

No, all processors have coherent caches and compilers respect single threaded data dependencies. You can't observe two different values for the same memory locations simultaneously. The ordering concerns whether operations performed to other memory locations are guaranteed to be visible following this operation, we aren't using this as a synchronisation primitive and so this doesn't matter

Rachelint (Contributor, Author):

No, all processors have coherent caches and compilers respect single threaded data dependencies. You can't observe two different values for the same memory locations simultaneously. The ordering concerns whether operations performed to other memory locations are guaranteed to be visible following this operation, we aren't using this as a synchronisation primitive and so this doesn't matter

Ok, thanks, I misunderstood it.

Member:

It may hurt the CPU cache, too? But I know little about the details.

Compared to Relaxed, Acq-Rel would just affect reordering on x86, but on a weakly ordered machine it might affect the cache and would require a memory barrier or other instructions.

#[derive(Debug)]
pub enum NullCount {
    Eager(usize),
    Lazy(AtomicI64),
}
XiangpengHao (Contributor) Aug 9, 2024:

I feel the state here is a bit complicated: we have three states, Eager, Lazy (initialized), and Lazy (uninitialized), and we use both an enum and a sentinel value to differentiate them.

I wonder if we can simplify this with just two states, uninitialized and initialized; when we try to read an uninitialized value, we compute the count and set it to the initialized state.

struct NullCount {
  val: AtomicI64,
}

impl NullCount {
  fn get(&self, ...) -> i64 {
     let val = self.val.load(...);
     if val == UNINIT {
         val.store(cal_null_count);
     } else {
         return val
     }
  }
}

This way we only have two states to manage, and we also keep NullCount at 8 bytes instead of the current 24 bytes, which might help with performance.

Rachelint (Contributor, Author):

I feel the state here is a bit complicated: we have three states, Eager, Lazy (initialized), and Lazy (uninitialized), and we use both an enum and a sentinel value to differentiate them.

I wonder if we can simplify this with just two states, uninitialized and initialized; when we try to read an uninitialized value, we compute the count and set it to the initialized state.

struct NullCount {
  val: AtomicI64,
}

impl NullCount {
  fn get(&self, ...) -> i64 {
     let val = self.val.load(...);
     if val == UNINIT {
         val.store(cal_null_count);
     } else {
         return val
     }
  }
}

This way we only have two states to manage, and we also keep NullCount at 8 bytes instead of the current 24 bytes, which might help with performance.

I didn't want to make it so complicated either... I implemented it with two states at the beginning.

But the first version of this PR seemed to make some queries slower (maybe noise?), and I encountered a strange performance problem, maybe related to atomics, in another PR of mine, so to try my best to avoid the possible cost of atomics, I refactored it to this.

Contributor:

so to try my best to avoid the possible cost of atomics

Atomic with Ordering::Relaxed has no extra cost compared to directly loading/storing the value, as shown here: https://godbolt.org/z/4oreernqh, they compile to the same instructions. The regressions are probably not from atomics.

Relaxed ordering is sufficient here; stricter ordering won't help with correctness and is slower. If we really want to prevent multiple threads from counting the null count, we have to use a Mutex or RwLock, which is a bit overkill for our use case, imo.

Rachelint (Contributor, Author):

so to try my best to avoid the possible cost of atomics

Atomic with Ordering::Relaxed has no extra cost compared to directly loading/storing the value, as shown here: https://godbolt.org/z/4oreernqh, they compile to the same instructions. The regressions are probably not from atomics.

Relaxed ordering is sufficient here; stricter ordering won't help with correctness and is slower. If we really want to prevent multiple threads from counting the null count, we have to use a Mutex or RwLock, which is a bit overkill for our use case, imo.

😄 Thanks, I indeed don't know enough about atomics.

tustvold (Contributor) commented Aug 9, 2024

I think this is an interesting angle but I do wonder if there is something fishy in what DataFusion is doing here. In particular, whilst lazily computing null masks will save cycles, array slicing is still a non-trivial operation. I wonder if the issue might be that the algorithm in DF is relying on slicing on a hot path where it would be better off tracking offsets and then using one of the selection kernels.

I worry it is going to be hard to reason about the downstream implications of introducing an additional indirection for accessing the null buffer, and there might be less disruptive ways to get the same or better return in DF. Most of the kernels in arrow-rs make use of the null counts.

alamb (Contributor) commented Aug 9, 2024

I think this is an interesting angle but I do wonder if there is something fishy in what DataFusion is doing here.

I would say there is a lot of room for improvement. What is happening is that for high cardinality aggregates, the output of the hash aggregate operation is currently one giant contiguous RecordBatch, which is then sliced.

There is more detail in apache/datafusion#9562 (and @JasonLi-cn was looking into improving it); however, it is tricky as doing so would imply the intermediate state of the group keys, the hash table, and the aggregates would need to be chunked. This isn't impossible, just non-trivial.

tustvold (Contributor) commented Aug 9, 2024

Right so the question becomes whether the proposed optimisation is overfitting to a particular implementation that is known to be at least somewhat degenerate? Is this a general purpose improvement, or just something that helps a single use-case potentially to the detriment of others?

alamb (Contributor) commented Aug 10, 2024

Right so the question becomes whether the proposed optimisation is overfitting to a particular implementation that is known to be at least somewhat degenerate? Is this a general purpose improvement, or just something that helps a single use-case potentially to the detriment of others?

Yes, I agree I am not yet convinced that this is a good general optimization for arrow-rs.

I realize my last message was ambiguous about what I thought the best solution was.

In particular, I think we would need to show this PR doesn't slow anything else down. There is at least one small cost: it increases the size of arrays by 8 bytes, I think.

In general I do think making the change in DataFusion is the better option, and we would pursue that change regardless of this particular PR in arrow.

alamb (Contributor) commented Aug 13, 2024

In my opinion we should evaluate this change on its own (would it be good for all arrow-rs users).

Making DataFusion aggregates faster is a parallel track of work (for which @Rachelint is also exploring other interesting ideas such as apache/datafusion#11943), which, while related, is not a sufficient justification in itself for this work.

Rachelint (Contributor, Author) commented Aug 13, 2024

We should indeed evaluate whether it brings general benefits for arrow-rs.

Maybe I can try to find the related issues about why arrow c++ supports it, and try to reproduce the related cases?

alamb (Contributor) commented Aug 13, 2024

Running the arrow benchmarks (cargo bench ... in this repo) is probably also something good to try

alamb marked this pull request as draft August 21, 2024 13:05
alamb (Contributor) commented Aug 21, 2024

Marking this as draft to signify it isn't waiting for more review. @Rachelint has some very promising other PRs downstream in DataFusion that might make an optimization such as this less relevant

Successfully merging this pull request may close these issues.

Lazily compute the null count of Array to reduce cpu cost in high cardinality aggr