
lazily compute for null count(seems help to high cardinality aggr) #6155

Draft · wants to merge 3 commits into master
Conversation

Rachelint (Contributor)

Which issue does this PR close?

Closes #6146

Rationale for this change

See #6146

What changes are included in this PR?

Lazily compute the null count when it is expensive to calculate.

Are there any user-facing changes?

No.
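For context, a minimal sketch of the idea, simplified from the diff in this PR. The NullCount enum variants and count_set_bits come from the changed code below; the get method, the closure parameter, the -1 sentinel value, and the orderings here are illustrative assumptions, not necessarily the exact code in this PR.

use std::sync::atomic::{AtomicI64, Ordering};

// Sentinel meaning "not computed yet" (illustrative value).
const UNINITIALIZED_NULL_COUNT: i64 = -1;

#[derive(Debug)]
pub enum NullCount {
    /// Null count known up front (cheap to compute or provided by the caller).
    Eager(usize),
    /// Null count computed on first access and then cached.
    Lazy(AtomicI64),
}

impl NullCount {
    /// Returns the null count, computing and caching it on first use.
    /// `compute` would typically count the unset bits of the validity bitmap,
    /// i.e. `buffer.len() - buffer.count_set_bits()`.
    pub fn get(&self, compute: impl FnOnce() -> usize) -> usize {
        match self {
            NullCount::Eager(n) => *n,
            NullCount::Lazy(cached) => {
                let v = cached.load(Ordering::Acquire);
                if v != UNINITIALIZED_NULL_COUNT {
                    return v as usize;
                }
                let computed = compute();
                cached.store(computed as i64, Ordering::Release);
                computed
            }
        }
    }
}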

github-actions bot added the arrow (Changes to the arrow crate) label Jul 29, 2024
Rachelint marked this pull request as ready for review July 29, 2024 21:54
Rachelint changed the title from "lazily compute for null count when it is expansive to calculate it" to "lazily compute for null count (seems help to high cardinality aggr)" Jul 29, 2024
alamb (Contributor) commented Jul 29, 2024

Thank you @Rachelint -- is there any chance we can get a benchmark showing the effect of this PR?

Rachelint (Contributor, Author) commented Jul 29, 2024

Thank you @Rachelint -- is there any chance we can get a benchmark showing the effect of this PR?

I ran ClickBench in DataFusion with this PR locally; Q32 seems 1.10~1.15x faster.
See this branch: https://github.com/Rachelint/arrow-datafusion/tree/try-opt-agg

After profiling, I found it is because the slice() function is called many times, and the eager computation of the null count seems to cost a lot of CPU.
see #6146
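(For illustration only, a hypothetical sketch of that slicing pattern, using the public arrow-rs Array::slice / null_count APIs; the array contents, sizes, and batch size are made up:)

use arrow_array::{Array, Int32Array};

fn main() {
    // A nullable array standing in for one large aggregation output batch.
    let values: Int32Array = (0..1_000_000)
        .map(|i| if i % 3 == 0 { None } else { Some(i) })
        .collect();

    // Emitting the output in small batches slices the parent array many times.
    // With an eagerly computed null count, each slice() pays a popcount over its
    // portion of the validity bitmap, even if the count is never read afterwards.
    for start in (0..values.len()).step_by(1024) {
        let len = 1024.min(values.len() - start);
        let sliced = values.slice(start, len);
        assert!(sliced.null_count() <= len);
    }
}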

┏━━━━━━━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┓
┃ Query        ┃       main ┃ try-opt-agg ┃        Change ┃
┡━━━━━━━━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━┩
│ QQuery 0     │     0.81ms │      0.79ms │     no change │
│ QQuery 1     │    69.11ms │     70.56ms │     no change │
│ QQuery 2     │   169.47ms │    168.31ms │     no change │
│ QQuery 3     │   182.60ms │    187.44ms │     no change │
│ QQuery 4     │  1589.41ms │   1604.52ms │     no change │
│ QQuery 5     │  1597.02ms │   1587.10ms │     no change │
│ QQuery 6     │    57.92ms │     63.17ms │  1.09x slower │
│ QQuery 7     │    72.33ms │     71.48ms │     no change │
│ QQuery 8     │  2415.02ms │   2434.26ms │     no change │
│ QQuery 9     │  1928.42ms │   1918.16ms │     no change │
│ QQuery 10    │   544.39ms │    547.84ms │     no change │
│ QQuery 11    │   605.79ms │    602.19ms │     no change │
│ QQuery 12    │  1767.57ms │   1758.03ms │     no change │
│ QQuery 13    │  4073.33ms │   4077.78ms │     no change │
│ QQuery 14    │  2583.14ms │   2605.21ms │     no change │
│ QQuery 15    │  1784.13ms │   1792.49ms │     no change │
│ QQuery 16    │  5028.55ms │   5087.98ms │     no change │
│ QQuery 17    │  4956.14ms │   4939.23ms │     no change │
│ QQuery 18    │ 10436.51ms │  10515.59ms │     no change │
│ QQuery 19    │   144.11ms │    149.23ms │     no change │
│ QQuery 20    │  3310.77ms │   3299.29ms │     no change │
│ QQuery 21    │  3887.09ms │   3892.00ms │     no change │
│ QQuery 22    │  9398.96ms │   9370.21ms │     no change │
│ QQuery 23    │ 23087.26ms │  23049.61ms │     no change │
│ QQuery 24    │  1168.15ms │   1174.31ms │     no change │
│ QQuery 25    │  1046.92ms │   1046.00ms │     no change │
│ QQuery 26    │  1352.80ms │   1362.05ms │     no change │
│ QQuery 27    │  4711.92ms │   4694.96ms │     no change │
│ QQuery 28    │ 21891.92ms │  21818.50ms │     no change │
│ QQuery 29    │   920.19ms │    897.67ms │     no change │
│ QQuery 30    │  2075.81ms │   2077.62ms │     no change │
│ QQuery 31    │  2961.03ms │   2934.88ms │     no change │
│ QQuery 32    │ 16167.05ms │  14340.64ms │ +1.13x faster │
│ QQuery 33    │  9418.20ms │   9414.53ms │     no change │
│ QQuery 34    │  9388.74ms │   9394.28ms │     no change │
│ QQuery 35    │  3108.34ms │   3117.84ms │     no change │
│ QQuery 36    │   270.02ms │    273.27ms │     no change │
│ QQuery 37    │   166.63ms │    170.44ms │     no change │
│ QQuery 38    │   158.33ms │    155.67ms │     no change │
│ QQuery 39    │   834.47ms │    837.39ms │     no change │
│ QQuery 40    │    63.22ms │     63.04ms │     no change │
│ QQuery 41    │    59.97ms │     58.82ms │     no change │
│ QQuery 42    │    70.34ms │     69.78ms │     no change │
└──────────────┴────────────┴─────────────┴───────────────┘

let computed_null_count = self.buffer.len() - self.buffer.count_set_bits();
v.store(computed_null_count as i64, Ordering::Release);
Contributor:
I wonder if using Ordering::Relaxed https://doc.rust-lang.org/std/sync/atomic/enum.Ordering.html#variant.Relaxed would make this PR potentially faster

Specifically, it might make the generated code simpler

Also, was there a reason to remove #[inline] - maybe that could account for the slow down 🤔

Member:

(They would be similar on x86, just affecting reordering?)

Contributor:

I agree -- I was thinking about ways to reduce the overhead of using atomic instructions, inlining, etc. -- it may not be a good idea

mapleFU (Member) Aug 9, 2024:

The atomic is just a load and a store instruction? But count_set_bits() may not be suitable for inlining (or is it just a __builtin bit-counting intrinsic)?

Maybe letting the compiler decide is ok.

Rachelint (Contributor, Author) Aug 9, 2024:

I wonder if using Ordering::Relaxed https://doc.rust-lang.org/std/sync/atomic/enum.Ordering.html#variant.Relaxed would make this PR potentially faster

Specifically, it might make the generated code simpler

Also, was there a reason to remove #[inline] - maybe that could account for the slow down 🤔

I am not sure whether we should use Relaxed or Acquire + Release. I switched to Acquire + Release because I am worried about the following situation:

The ArrayRef is held by multiple threads, and they call the null_count function concurrently; if we use Relaxed, the count_set_bits computation could be performed many times...

Another reason is that I found the strongest ordering, SeqCst, used in arrow cpp:
https://github.com/apache/arrow/blob/187197c369058f7d1377c1b161c469a9e4542caf/cpp/src/arrow/array/data.cc#L206-L218

tustvold (Contributor) Aug 9, 2024:

The ArrayRef is held by multiple threads, and they call the null_count function concurrently; if we use Relaxed, the count_set_bits computation could be performed many times...

No memory ordering will prevent this, you'd need some mutual exclusion primitive if this was the goal. However, it doesn't really matter if two threads race to compute the null count as they'll just set it to the same value. As we aren't using the write to synchronise other memory operations, Relaxed would be perfectly valid here.

I do think the proposal to simplify the state is worth considering, as it will potentially remove a conditional load
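(A sketch of that racy-but-correct get-or-compute pattern with Relaxed ordering; the LazyNullCount type, get method, and -1 sentinel are illustrative assumptions. Both threads may run the computation, but they store the same value:)

use std::sync::atomic::{AtomicI64, Ordering};

const UNINIT: i64 = -1; // illustrative sentinel: null count not computed yet

pub struct LazyNullCount {
    cached: AtomicI64,
}

impl LazyNullCount {
    pub fn new() -> Self {
        Self { cached: AtomicI64::new(UNINIT) }
    }

    pub fn get(&self, compute: impl FnOnce() -> usize) -> usize {
        let v = self.cached.load(Ordering::Relaxed);
        if v != UNINIT {
            return v as usize;
        }
        // Two threads may both reach this point and both run `compute`; they
        // race to store the same value, so the result is still correct without
        // any stronger ordering or mutual exclusion.
        let computed = compute();
        self.cached.store(computed as i64, Ordering::Relaxed);
        computed
    }
}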

Rachelint (Contributor, Author) Aug 13, 2024:

The ArrayRef is held by multiple threads, and they call the null_count function concurrently; if we use Relaxed, the count_set_bits computation could be performed many times...

No memory ordering will prevent this, you'd need some mutual exclusion primitive if this was the goal. However, it doesn't really matter if two threads race to compute the null count as they'll just set it to the same value. As we aren't using the write to synchronise other memory operations, Relaxed would be perfectly valid here.

I do think the proposal to simplify the state is worth considering, as it will potentially remove a conditional load

Sorry, maybe I didn't state it clearly. Is it possible that, due to compiler reordering or a CPU cache-coherence problem, the actual execution order becomes:

  if cached_null_count != UNINITIALIZED_NULL_COUNT {
      return cached_null_count as usize;
  }
  let cached_null_count = v.load(Ordering::Relaxed);

  let computed_null_count = self.buffer.len() - self.buffer.count_set_bits();

And finally, the computation will be performed one more time? Although it actually doesn't affect correctness, and it may be rare to happen.

cc @XiangpengHao

tustvold (Contributor) Aug 13, 2024:

No, all processors have coherent caches and compilers respect single threaded data dependencies. You can't observe two different values for the same memory locations simultaneously. The ordering concerns whether operations performed to other memory locations are guaranteed to be visible following this operation, we aren't using this as a synchronisation primitive and so this doesn't matter

Rachelint (Contributor, Author):

No, all processors have coherent caches and compilers respect single threaded data dependencies. You can't observe two different values for the same memory locations simultaneously. The ordering concerns whether operations performed to other memory locations are guaranteed to be visible following this operation, we aren't using this as a synchronisation primitive and so this doesn't matter

Ok, thanks, I misunderstood it.

Member:

It may hurt the CPU cache, too? But I know little about the details.

Compared to Relaxed, Acq-Rel would just affect reordering on x86, but on a weakly ordered machine it might affect the cache and would require a memory barrier or other instructions.

#[derive(Debug)]
pub enum NullCount {
    Eager(usize),
    Lazy(AtomicI64),
}
XiangpengHao (Contributor) Aug 9, 2024:

I feel the state here is a bit complicated: we have three states, Eager, Lazy (initialized), and Lazy (uninitialized), and we use both an enum and a sentinel value to differentiate them.

I wonder if we can simplify this with just two states, uninitialized and initialized; when we try to read an uninitialized value, we compute the count and set it to the initialized state.

struct NullCount {
  val: AtomicI64,
}

impl NullCount {
  fn get(&self, ...) -> i64 {
     let val = self.val.load(...);
     if val == UNINIT {
         val.store(cal_null_count);
     } else {
         return val
     }
  }
}

This way we only have two states to manage, and we also keep NullCount at 8 bytes instead of the current 24 bytes, which might help with performance.

Rachelint (Contributor, Author):

I feel the state here is a bit complicated: we have three states, Eager, Lazy (initialized), and Lazy (uninitialized), and we use both an enum and a sentinel value to differentiate them.

I wonder if we can simplify this with just two states, uninitialized and initialized; when we try to read an uninitialized value, we compute the count and set it to the initialized state.

struct NullCount {
  val: AtomicI64,
}

impl NullCount {
  fn get(&self, ...) -> i64 {
     let val = self.val.load(...);
     if val == UNINIT {
         val.store(cal_null_count);
     } else {
         return val
     }
  }
}

This way we only have two states to manage, and we also keep NullCount at 8 bytes instead of the current 24 bytes, which might help with performance.

I didn't want to make it so complicated either... I implemented it with two states at the beginning.

But the first version of this PR seemed to make some queries slower (maybe noise?), and I encountered a strange performance problem, maybe related to atomics, in another PR of mine, so to try my best to avoid the possible cost of atomics, I refactored it to this.

Contributor:

so to try my best to avoid the possible cost of atomics

Atomic with Ordering::Relaxed has no extra cost compared to directly loading/storing the value, as shown here: https://godbolt.org/z/4oreernqh, they compile to the same instructions. The regressions are probably not from atomics.

Relaxed ordering is sufficient here; stricter ordering won't help with correctness and is slower. If we really want to prevent multiple threads from counting the null count, we have to use a Mutex or RwLock, which is a bit overkill for our use case, imo.

Rachelint (Contributor, Author):

so to try my best to avoid the possible cost of atomics

Atomic with Ordering::Relaxed has no extra cost compared to directly loading/storing the value, as shown here: https://godbolt.org/z/4oreernqh, they compile to the same instructions. The regressions are probably not from atomics.

Relaxed ordering is sufficient here; stricter ordering won't help with correctness and is slower. If we really want to prevent multiple threads from counting the null count, we have to use a Mutex or RwLock, which is a bit overkill for our use case, imo.

😄 Thanks, I indeed don't know enough about atomics.

tustvold (Contributor) commented Aug 9, 2024

I think this is an interesting angle but I do wonder if there is something fishy in what DataFusion is doing here. In particular, whilst lazily computing null masks will save cycles, array slicing is still a non-trivial operation. I wonder if the issue might be that the algorithm in DF is relying on slicing on a hot path where it would be better off tracking offsets and then using one of the selection kernels.

I worry it is going to be hard to reason about the downstream implications of introducing an additional indirection for accessing the null buffer, and there might be less disruptive ways to get the same or better return in DF. Most of the kernels in arrow-rs make use of the null counts.

alamb (Contributor) commented Aug 9, 2024

I think this is an interesting angle but I do wonder if there is something fishy in what DataFusion is doing here.

I would say there is a lot of room for improvement. What is happening is that for high cardinality aggregates, the output of the hash aggregate operation is currently one giant contiguous RecordBatch, which is then sliced.

There is more detail in apache/datafusion#9562 (and @JasonLi-cn was looking into improving it); however, it is tricky as doing so would imply the intermediate state of the group keys, the hash table, and the aggregates would need to be chunked. This isn't impossible, just non-trivial.

tustvold (Contributor) commented Aug 9, 2024

Right so the question becomes whether the proposed optimisation is overfitting to a particular implementation that is known to be at least somewhat degenerate? Is this a general purpose improvement, or just something that helps a single use-case potentially to the detriment of others?

alamb (Contributor) commented Aug 10, 2024

Right so the question becomes whether the proposed optimisation is overfitting to a particular implementation that is known to be at least somewhat degenerate? Is this a general purpose improvement, or just something that helps a single use-case potentially to the detriment of others?

Yes, I agree I am not yet convinced that this is a good general optimization for arrow-rs.

I realize my last message was ambiguous about what I thought the best solution was.

In particular, I think we would need to show this PR doesn't slow anything else down. There is at least one small cost: it increases the size of arrays by 8 bytes, I think.

In general I do think making the change in DataFusion is the better option, and we would pursue that change regardless of this particular PR in arrow.

alamb (Contributor) commented Aug 13, 2024

In my opinion we should evaluate this change on its own (would it be good for all arrow-rs users).

Making DataFusion aggregates faster is a parallel track of work (for which @Rachelint is also exploring other interesting ideas such as apache/datafusion#11943), which, while related, is not a sufficient justification in itself for this work.

Rachelint (Contributor, Author) commented Aug 13, 2024

We should indeed evaluate whether it brings general benefits for arrow-rs.

Maybe I can try to find the related issues about why arrow c++ supports it, and try to reproduce the related cases?

alamb (Contributor) commented Aug 13, 2024

Running the arrow benchmarks (cargo bench ... in this repo) is probably also something good to try

alamb marked this pull request as draft August 21, 2024 13:05
alamb (Contributor) commented Aug 21, 2024

Marking this as draft to signify it isn't waiting for more review. @Rachelint has some very promising other PRs downstream in DataFusion that might make an optimization such as this less relevant

Successfully merging this pull request may close these issues.

Lazily compute the null count of Array to reduce cpu cost in high cardinality aggr