ARROW-10043: [Rust][DataFusion] Implement COUNT(DISTINCT col) #8222
Conversation
jorgecarleitao
left a comment
I went through this PR closely, and I like it. Really good work, @drusso!
I agree with the design:
- distinct: bool on the logical plan, as part of the Expr
- a new mode for a single partition (I also saw the need for a third partition mode in #8172, which is the reason why it currently forces us to always perform 2 passes)
I think that we may be able to use GroupByKey or ScalarValue instead of the enum; I am unsure how we work out these two different expressions depending on the mode, but I think that this is the best we can do with the current API.
@drusso, it may be worth taking a look at #8172, where we are trying to improve how aggregate expressions are declared and run. Many parts of this PR conflict with that one (at the API level; all the implementation here is valid).
There are two additions in this PR that IMO should be incorporated into the API ASAP:
- The distinct "mode" of an aggregate expression
- The NoPartial/SinglePartition mode
Out of curiosity: is the DISTINCT something that is applicable to an aggregate expression in general, or is something that is only applicable to a subset of aggregations (such as count, sum)?
Wouldn't it be possible to use ScalarValue, instead of declaring a new enum?
These are used as the values in the iterator's HashSet, and unfortunately it's not possible to derive Hash for ScalarValue since it nests f32 and f64.
As you suggested, GroupByScalar is a good candidate here. (I assumed you meant GroupByScalar rather than GroupByKey?)
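(For context, a minimal sketch of the issue with a hypothetical Key enum — not the actual GroupByScalar definition: deriving Hash only works while every nested type is hashable, and f32/f64 are not.)

```rust
use std::collections::HashSet;

// Deriving Hash/Eq works while every nested type is hashable...
#[derive(PartialEq, Eq, Hash)]
enum Key {
    Int64(i64),
    Utf8(String),
    // Float64(f64), // ...but this variant would break the derive:
    //               // f64 implements neither Hash nor Eq.
}

fn main() {
    let mut distinct: HashSet<Key> = HashSet::new();
    distinct.insert(Key::Int64(1));
    distinct.insert(Key::Int64(1)); // duplicate: ignored by the set
    distinct.insert(Key::Utf8("a".to_string()));
    assert_eq!(distinct.len(), 2);
}
```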
... this way, we do not need to go ScalarValue -> DistinctScalarValue...
... and back: DistinctScalarValue -> ScalarValue
Wouldn't it be possible to implement this one here? Loop through the elements one by one.
Certainly. This path isn't reachable in the scenarios this PR addresses, but I think this will be relevant to SELECT COUNT(DISTINCT col) queries that do not include a GROUP BY.
Note that this will not apply any coercion to the arguments. We are trying to make aggregates::create_aggregate_expr be the entry point to create the physical expressions, to guarantee that the signature is correct and coercion rules apply.
However, I see that this depends on the AggregateMode, which the current create_aggregate_expr does not support. #8172 addresses exactly this. :)
Good catch, thanks for pointing that out. I'll have a look at this as soon as I integrate with #8172.
A naming idea: SinglePartition
Spark uses the term Complete. Our existing names Partial and Final were based on Spark's naming.
/**
* An [[AggregateFunction]] with [[Complete]] mode is used to evaluate this function directly
* from original input rows without any partial aggregation.
* This function updates the given aggregation buffer with the original input of this
* function. When it has processed all input rows, the final result of this function is returned.
*/
case object Complete extends AggregateMode
Both suggestions sound good. Let me know if there's a preference one way or the other. For now I will go ahead with Complete for consistency with Spark.
A more efficient implementation for distinct integers could be a bitset
Good suggestion. A couple of questions:
- Would this generalize to (or be easily adapted for) non-integers, like floats or strings?
- Is there DataFusion tooling for benchmarking different implementations?
Given that, perhaps this is an optimization we can explore at a later time?
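(For reference, a rough sketch of the bitset idea for u8 values — a hand-rolled illustration, not DataFusion code: a 256-bit set indexed directly by the value.)

```rust
/// Sketch: a fixed-size bitset tracking distinct u8 values.
struct U8DistinctSet {
    bits: [u64; 4], // 4 x 64 = 256 bits, one per possible u8 value
}

impl U8DistinctSet {
    fn new() -> Self {
        Self { bits: [0; 4] }
    }

    fn insert(&mut self, v: u8) {
        // Set the bit for value v: word v / 64, bit v % 64.
        self.bits[(v >> 6) as usize] |= 1u64 << (v & 63);
    }

    fn count(&self) -> u32 {
        self.bits.iter().map(|w| w.count_ones()).sum()
    }
}

fn main() {
    let mut set = U8DistinctSet::new();
    for &v in [1u8, 2, 2, 200, 200].iter() {
        set.insert(v);
    }
    assert_eq!(set.count(), 3);
}
```

Floats and strings would need a different structure (e.g. hashing), which is part of why a benchmark-driven follow-up makes sense.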
alamb
left a comment
I think this approach (of keeping the distinct values as a set) will work. Given that @drusso coded it up, and it has tests, I think it would be good enough to merge after addressing comments.
One potential downside of the approach in this PR is that it will likely struggle when there are a large number of distinct values per group (as the FnvHashSet will be huge), and it needs special support for each datatype.
Another approach we could consider is to use a HashAggregateExpr operation as a pre-filter to remove duplicates. The example of:
SELECT c1, COUNT(DISTINCT c2) FROM t1 GROUP BY c1
Could be expressed as a (LogicalPlan) like the following (no physical plan changes needed):
HashAggregateExec: // this second phase then counts
group_expr:
Column(c1)
aggr_expr:
CountReduce(Column(c2))
input:
HashAggregateExec: // this first agg expr finds all distinct values of (c1,c2)
group_expr:
Column(c1), Column(c2)
input:
CsvExec:
The primary advantage of this strategy is that it can take advantage of both the existing GroupBy machinery as well as any future enhancements that are made. For example, it would handle all datatypes without any special-case code today. Also, if we added any other group-by types (e.g. GroupByMerge, if the data was sorted by group keys), those could also be used for distinct queries.
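To make the rewrite concrete, the plan above corresponds roughly to the following SQL (an equivalent-rewrite sketch using the table and columns from the example, not text from the PR):

```sql
-- The inner GROUP BY deduplicates (c1, c2) pairs; the outer query then
-- counts the surviving c2 values per c1, i.e. COUNT(DISTINCT c2).
SELECT c1, COUNT(c2)
FROM (SELECT c1, c2 FROM t1 GROUP BY c1, c2) AS distinct_pairs
GROUP BY c1;
```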
nit: I'd prefer that we use expect rather than unwrap so we can add a meaningful message.
Sounds good, will update.
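(For illustration, the style change being requested — a generic sketch, not the actual call site in this PR:)

```rust
use arrow::array::{Array, Int32Array};

fn first_value(array: &dyn Array) -> i32 {
    // `expect` documents the invariant and shows up in the panic
    // message; a bare `unwrap` panics with no context.
    array
        .as_any()
        .downcast_ref::<Int32Array>()
        .expect("first_value requires an Int32Array")
        .value(0)
}
```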
Thanks @drusso this looks great. I agree with Jorge's comments about ScalarValue and I also added some minor comments.
Thanks for the review/feedback all!
I will get the changes here updated and integrated with #8172. It looks like they've already landed 👍
To my knowledge […]. (Also note that […])
Agreed. I'm happy to continue iterating and improving on the work here.
I had this thought as well, however – and correct me if I'm wrong – it doesn't generalize to a scenario like: […]. For the […]
@drusso I think you are correct that we would need a separate group by operator for each count distinct and then combine them together: so […]. Or something. I like your suggestion to get an implementation in (this one) and then iterate as needed.
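(For intuition, one way such a combination could look for two distinct aggregates — a hypothetical sketch, not the plan elided from this comment:)

```sql
-- Hypothetical: SELECT c1, COUNT(DISTINCT c2), COUNT(DISTINCT c3) FROM t1 GROUP BY c1
-- needs one deduplicating pre-aggregation per distinct column, joined on c1.
SELECT a.c1, a.cnt_c2, b.cnt_c3
FROM (
    SELECT c1, COUNT(c2) AS cnt_c2
    FROM (SELECT c1, c2 FROM t1 GROUP BY c1, c2) AS d2
    GROUP BY c1
) AS a
JOIN (
    SELECT c1, COUNT(c3) AS cnt_c3
    FROM (SELECT c1, c3 FROM t1 GROUP BY c1, c3) AS d3
    GROUP BY c1
) AS b ON a.c1 = b.c1;
```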
Hi @drusso I would like to review and merge this DF PR next. Would you mind rebasing?
@andygrove (cc @jorgecarleitao): My apologies, I don't have the changes ready yet. Though I did have some time today to look into integrating this with #8172/master. I'm still digesting the upstream changes, and I see accumulators now have partial "state" that they emit and ingest across stages. This is great, and I think it would work well for a distinct count accumulator implementation. However, I think there's a small roadblock with porting the current implementation. I believe the approach would be to implement […]

As an example, given a table: […] And given a query: […] Then the set of output […]

Assuming this is all looking correct so far, then the issue is […]

There are a number of paths forward. Brainstorming some solutions:
enum Value {
Single(ScalarValue),
Multi(Vec<ScalarValue>),
}

Let me know what you think and how we can proceed. Thanks!
I agree with your reasoning. AFAIK, […]. With this in mind, another idea is to map your second idea to: […]
I think that this was the original intention of those structures in the Arrow specs: to compose complex structures. This was also the reasoning behind emitting […]. This would keep us using Arrow as the medium of holding and transmitting (immutable) data, and would keep the execution's schema a bit more predictable.
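(A simplified sketch of the composition idea — not DataFusion's actual ScalarValue, which also carries a DataType alongside the list:)

```rust
// A scalar variant can itself hold a list of scalars, so a whole
// set of distinct values can travel as one ScalarValue.
#[derive(Clone, Debug)]
enum ScalarValue {
    Int32(Option<i32>),
    Utf8(Option<String>),
    List(Option<Vec<ScalarValue>>),
}

fn main() {
    // Accumulator state for one group: the distinct values seen so far.
    let state = ScalarValue::List(Some(vec![
        ScalarValue::Int32(Some(1)),
        ScalarValue::Int32(Some(3)),
    ]));
    println!("{:?}", state);
}
```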
Thanks for clarifying Arrow scalars, I was under the wrong impression they were strictly for primitive types and not composite types. An Arrow scalar as a single item/element in any […]. With that in mind, and following the lead from pyarrow, the corresponding scalar variants would be: […]

In the scope of this pull request, I would need to add […]. Let me know if that sounds good and I can proceed.
Sounds good to me 🚀
Just a quick update here, I'll have some time today and over the weekend to work through the changes.
I've rebased against the latest changes on the master branch. Some notes and questions about the changes: […]

Let me know what you think and if any changes are needed. Thanks!
jorgecarleitao
left a comment
I went through this. Really good work, @drusso. The design and implementation look good. My main comments revolve around tests, as IMO we need a bit more coverage.
IMO it is ok to push to both arrow and datafusion in a single PR. Adding kernels to the arrow project is a good thing IMO, so that consumers of arrow (but not datafusion) also benefit from them. If needed, we can create a separate PR for it also. Feel free to add take for LargeList to Jira, in case someone feels motivated to work on it.
I do not think that there is an implementation of Vec<Vec<_>> to List yet, no. :(
I think that it is fine keeping ListArray for now.
The state of DistinctCountAccumulator uses one field per input argument. [...]
I agree, and I think that we may be able to place all states in a single StructArray, which would simplify things. The trade-off is the vectorization: I imagine we benefit when merge_batch uses an existing horizontal operation, and having common, aligned types helps.
As discussed, I've replaced DistinctScalarValue with GroupByScalar. To do so, I promoted GroupByScalar to its own module.
👍
impl Accumulator for DistinctCountAccumulator {
    fn update_batch(&mut self, arrays: &Vec<ArrayRef>) -> Result<()> {
this expression seems similar to the default update_batch (implemented by the trait Accumulator). Do you think there's a way to generalize the default to take this Accumulator into account, so that we can DRY some code? (same for merge_batch)
I will take a look and get back to you sometime today or tomorrow.
The update_batch() implementations were actually equivalent, so I removed the specialized one. See 3c87e89.
There's also a way to remove the specialized merge_batch() implementation if merge() is implemented as:
fn merge(&mut self, states: &Vec<ScalarValue>) -> Result<()> {
if states.len() == 0 {
return Ok(());
}
let col_values = states
.iter()
.map(|state| match state {
ScalarValue::List(Some(values), _) => Ok(values),
_ => Err(ExecutionError::InternalError(
"Unexpected accumulator state".to_string(),
)),
})
.collect::<Result<Vec<_>>>()?;
(0..col_values[0].len())
.map(|row_index| {
let row_values = col_values
.iter()
.map(|col| col[row_index].clone())
.collect::<Vec<_>>();
self.update(&row_values)
})
.collect::<Result<_>>()
}

I think that works out well. I don't have any preference one way or the other, should I make the swap?
That is great! Less code for the update :)
For the merge: if they are equivalent speed-wise, I think that it would make it a bit easier to understand: this is how we merge a single element; this is how we update a single element. But both are LGTM for me.
I made the switch (see d61ec97). For now I think the DRY approach makes sense, and as a follow-up I can explore how the different options perform. I can use benches/aggregate_query_sql.rs as a starting point for the comparisons, if that makes sense?
Regarding the impact of d61ec97:
I used the benchmarks in #8606 to compare 57893b4 vs. 57893b4 + the following patch, and there was indeed no noticeable difference between the implementations 👍
diff --git a/rust/datafusion/src/physical_plan/distinct_expressions.rs b/rust/datafusion/src/physical_plan/distinct_expressions.rs
index c2183ca3b..d7745b0fa 100644
--- a/rust/datafusion/src/physical_plan/distinct_expressions.rs
+++ b/rust/datafusion/src/physical_plan/distinct_expressions.rs
@@ -23,6 +23,7 @@ use std::hash::Hash;
use std::sync::Arc;
use arrow::datatypes::{DataType, Field};
+use arrow::array::{ArrayRef, ListArray};
use fnv::FnvHashSet;
@@ -123,29 +124,27 @@ impl Accumulator for DistinctCountAccumulator {
}
fn merge(&mut self, states: &Vec<ScalarValue>) -> Result<()> {
- if states.len() == 0 {
- return Ok(());
- }
+ self.update(states)
+ }
- let col_values = states
+ fn merge_batch(&mut self, states: &Vec<ArrayRef>) -> Result<()> {
+ let list_arrays = states
.iter()
- .map(|state| match state {
- ScalarValue::List(Some(values), _) => Ok(values),
- _ => Err(DataFusionError::Internal(
- "Unexpected accumulator state".to_string(),
- )),
+ .map(|state_array| {
+ state_array.as_any().downcast_ref::<ListArray>().ok_or(
+ DataFusionError::Internal(
+ "Failed to downcast ListArray".to_string(),
+ ),
+ )
})
.collect::<Result<Vec<_>>>()?;
- (0..col_values[0].len())
- .map(|row_index| {
- let row_values = col_values
- .iter()
- .map(|col| col[row_index].clone())
- .collect::<Vec<_>>();
- self.update(&row_values)
- })
- .collect::<Result<_>>()
+ let values_arrays = list_arrays
+ .iter()
+ .map(|list_array| list_array.values())
+ .collect::<Vec<_>>();
+
+ self.update_batch(&values_arrays)
}
fn state(&self) -> Result<Vec<ScalarValue>> {
.downcast_ref::<array::StringArray>()
.unwrap()
.value(row_index),
),
I think that we are also missing a test in this module.
See 15e6f0b.
I am really sorry, I misread this. What I meant was not to test the format_batch function, but to test count distinct in integration.
However, I see that this was not the module I thought it was. The code in this module is perfect as is now.
I meant adding a test here, which is where we perform integration tests.
Something along the lines of:
#[tokio::test]
async fn query_count_distinct() -> Result<()> {
let schema = Arc::new(Schema::new(vec![
Field::new("c1", DataType::Int32, true),
]));
let data = RecordBatch::try_new(
schema.clone(),
vec![
Arc::new(Int32Array::from(vec![Some(0), Some(1), None, Some(3), Some(3)])),
],
)?;
let table = MemTable::new(schema, vec![vec![data]])?;
let mut ctx = ExecutionContext::new();
ctx.register_table("test", Box::new(table));
let sql = "SELECT COUNT(DISTINCT c1) FROM test";
let actual = execute(&mut ctx, sql).await;
let expected = vec![
vec!["0", "1"],
vec!["1", "1"],
vec!["3", "2"],
];
assert_eq!(expected, actual);
Ok(())
}

does this make sense?
No problem!
This makes perfect sense, and I added the test (2d0999a). However, I also had to make the count(distinct) field nullable (4cdb951), otherwise I ran into an assertion error here due to a mismatch of the field's nullability between the logical plan and physical plan.
The logical plan nullability is set here, and is always nullable. All of the regular aggregate expressions mark their field as nullable, for example, here for the regular Count.
I might be mistaken, but I think regular and distinct counts should be non-nullable? In any case, I went with making count(distinct) nullable for consistency with count(). Perhaps there's a follow-up here?
I think that count(distinct) should be null when all entries are null, for consistency with count. I do not think that it blocks this, though. It is an edge case IMO.
jorgecarleitao
left a comment
LGTM. Really great work, @drusso: the implementation, the approach to the problem, and the follow-up through to completion.
Thanks @jorgecarleitao! And also, thank you for taking the time for the thorough review. This was a fun feature to work on.
[ARROW-10510](https://issues.apache.org/jira/browse/ARROW-10510) This change adds benchmarks for `COUNT(DISTINCT)` queries. This is a small follow-up to [ARROW-10043](https://issues.apache.org/jira/browse/ARROW-10043) / #8222. In that PR, a number of implementation ideas were discussed for follow-ups, and having benchmarks will help evaluate them.

There are two benchmarks added:
* wide: all of the values are distinct; this is looking at worst-case performance
* narrow: only a handful of distinct values; this is closer to best-case performance

The wide benchmark runs ~7x slower than the narrow benchmark.

Closes #8606 from drusso/ARROW-10510

Authored-by: Daniel Russo <[email protected]>
Signed-off-by: Neville Dipale <[email protected]>
This is a proposal for an initial and partial implementation of the DISTINCT keyword. Only COUNT(DISTINCT) is supported, with the following conditions:

(a) only one argument, i.e. COUNT(DISTINCT col), but not COUNT(DISTINCT col, other),
(b) the argument is an integer type, and
(c) the query must have a GROUP BY clause.

Implementation Overview:

The Expr::AggregateFunction variant has a new field, distinct, which mirrors the distinct flag from SQLExpr::Function (up until now this flag was unused). Any Expr::AggregateFunction may have its distinct flag switched to true if the keyword is present in the SQL query. However, the physical planner respects it only for COUNT expressions.
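(A simplified sketch of the logical-plan change — a hypothetical, trimmed-down Expr; the real enum in DataFusion has many more variants and a different field layout:)

```rust
// Hypothetical, trimmed-down Expr illustrating the new `distinct` flag.
#[allow(dead_code)]
enum Expr {
    AggregateFunction {
        name: String,   // e.g. "COUNT"
        args: Vec<Expr>,
        distinct: bool, // mirrors the SQL DISTINCT keyword
    },
    Column(String),
    // ...other variants elided
}

fn main() {
    // What the SQL planner would build for COUNT(DISTINCT col):
    let _count_distinct = Expr::AggregateFunction {
        name: "COUNT".to_string(),
        args: vec![Expr::Column("col".to_string())],
        distinct: true,
    };
}
```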
The count distinct aggregation slots into the existing physical plans as a new set of AggregateExpr. To demonstrate, below are examples of the physical plans for the following query, where c1 may be any data type, and c2 is a UInt8 column: […]

(a) Multiple Partitions:

The DistinctCount accumulates each UInt8 into a list of distinct UInt8 values. No counts are collected yet; this is a partial result: lists of distinct values. In the RecordBatch, this is a LargeListArray<UInt8> column. After the MergeExec, each list in the LargeListArray<UInt8> is accumulated by DistinctCountReduce (via accumulate_batch()), producing the final sets of distinct values. Finally, given the finalized sets of distinct values, the counts are computed (always as UInt64).
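(An illustrative sketch of this two-stage plan, reconstructed from the description above rather than copied from the PR:)

```
HashAggregateExec (mode: Final)          // DistinctCountReduce merges the
  group_expr: c1                         // partial lists, then computes the
  aggr_expr: DistinctCountReduce(c2)     // UInt64 counts
  MergeExec                              // gathers all partitions into one
    HashAggregateExec (mode: Partial)    // DistinctCount builds per-group
      group_expr: c1                     // LargeListArray<UInt8> of
      aggr_expr: DistinctCount(c2)       // distinct values
      input (one per partition, e.g. CsvExec)
```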
(b) Single Partition:

This scenario is unlike the multiple partition scenario: DistinctCount is not used, and there are no partial sets of distinct values. Rather, in a single HashAggregateExec stage, each UInt8 is accumulated into a distinct value set, then the counts are computed at the end of the stage. DistinctCountReduce is used, but note that unlike the multiple partition case, it accumulates scalars via accumulate_scalar().

There is a new aggregation mode: NoPartial. In summary, the modes are:

- NoPartial: used in single-stage aggregations
- Partial: used as the first stage of two-stage aggregations
- Final: used as the second stage of two-stage aggregations

Prior to the new NoPartial mode, Partial was handling both of what are now the responsibilities of Partial and NoPartial. No distinction was required, because non-distinct aggregations (such as count, sum, min, max, and avg) do not need the distinction: the first aggregation stage is always the same, regardless of whether the aggregation is one-stage or two-stage. This is not the case for a distinct count aggregation, and we can see that in the physical plans above.
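(A minimal sketch of the resulting mode enum, assuming the three variants named above; the PR's actual definition may differ in derives and details:)

```rust
/// Sketch of the three aggregation modes described above.
#[derive(Debug, Clone, Copy, PartialEq)]
enum AggregateMode {
    /// Single-stage aggregation: accumulate raw input and finalize in one pass.
    NoPartial,
    /// First stage of a two-stage aggregation: emit partial state.
    Partial,
    /// Second stage of a two-stage aggregation: merge partial states, finalize.
    Final,
}

fn main() {
    // A single-partition COUNT(DISTINCT) plan runs one NoPartial stage;
    // a multi-partition plan runs Partial stages followed by a Final stage.
    let single = AggregateMode::NoPartial;
    assert_ne!(single, AggregateMode::Partial);
}
```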