Skip to content

Conversation

@drusso
Copy link
Contributor

@drusso drusso commented Sep 19, 2020

This is a proposal for an initial and partial implementation of the DISTINCT keyword. Only COUNT(DISTINCT) is supported, with the following conditions:

(a) only one argument, i.e. COUNT(DISTINCT col), but not COUNT(DISTINCT col, other),
(b) the argument is an integer type, and
(c) the query must have a GROUP BY clause.

Implementation Overview:

The Expr::AggregateFunction variant has a new field, distinct, which mirrors the distinct flag from SQLExpr::Function (up until now this flag was unused). Any Expr::AggregateFunction may have its distinct flag switched to true if the keyword is present in the SQL query. However, the physical planner respects it only for COUNT expressions.

The count distinct aggregation slots into the existing physical plans as a new set of AggregateExpr. To demonstrate, below are examples of the physical plans for the following query, where c1 may be any data type, and c2 is a UInt8 column:

SELECT c1, COUNT(DISTINCT c2) FROM t1 GROUP BY c1

(a) Multiple Partitions:

HashAggregateExec:
  mode: Final
  group_expr:
    Column(c1)
  aggr_expr:
    DistinctCountReduce(Column(c2))
  schema:
    c1: any
    c2: UInt64
  input:
    MergeExec:
      input:
        HashAggregateExec:
          mode: Partial
          group_expr:
            Column(c1)
          aggr_expr:
            DistinctCount(Column(c2))
          schema:
            c1: any
            c2: LargeList(UInt8)
          input:
            CsvExec:
              schema:
                c1: any
                c2: UInt8

The DistinctCount accumulates each UInt8 into a list of distinct UInt8. No counts are collected yet, this is a partial result: lists of distinct values. In the RecordBatch, this is a LargeListArray<UInt8> column. After the MergeExec, each list in LargeListArray<UInt8> is accumulated by DistinctCountReduce (via accumulate_batch()), producing the final sets of distinct values. Finally, given the finalized sets of distinct values, the counts are computed (always as UInt64).

(b) Single Partition:

HashAggregateExec:
  mode: NoPartial
  group_expr:
    Column(c1)
  aggr_expr:
    DistinctCountReduce(Column(c2))
  schema:
    c1: any
    c2: UInt64
  input:
    CsvExec:
      schema:
        c1: any
        c2: UInt8

This scenario is unlike the multiple partition scenario: DistinctCount is not used, and there are no partial sets of distinct values. Rather, in a single HashAggregateExec stage, each UInt8 is accumulated into a distinct value set, then the counts are computed at the end of the stage. DistinctCountReduce is used, but note that unlike the multiple partition case, it accumulates scalars via accumulate_scalar().

There is a new aggregation mode: NoPartial. In summary, the modes are:

  • NoPartial: used in single-stage aggregations
  • Partial: used as the first stage of two-stage aggregations
  • Final: used as the second stage of two-stage aggregaions

Prior to the new NoPartial mode, Partial was handling both of what are now the responsibilities of Partial and NoPartial. No distinction was required, because non-distinct aggregations (such as count, sum, min, max, and avg) do not need the distinction: the first aggregation stage is always the same, regardless of whether the aggregation is one-stage or two-stage. This is not the case for a distinct count aggregation, and we can see that in the physical plans above.

@github-actions
Copy link

@andygrove
Copy link
Member

@jorgecarleitao @alamb fyi

Copy link
Member

@jorgecarleitao jorgecarleitao left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I went through this PR closely, and I like it. Really good work, @drusso !

I agree with the design:

  1. distinct: bool on the logical plan, as part of the Expr
  2. a new mode for a single partition (I also saw the need for a third partition mode in #8172 , which is the reason why it currently forces to always perform 2 passes)

I think that we may be able to use the GroupByKey or the ScalarValue instead of the enum; I am unsure how we work out these two different expressions depending on the mode, but I think that this is the best we can do with the current API.

@drusso , it may be worth take a look at #8172 , where we are trying to improve how to declare and run aggregate expressions. Many parts of this PR are in conflict with that one (at API level, all the implementation here is valid).

There are two additions of this PR that IMO should be incorporated into the API ASAP:

  1. The distinct "mode" of an aggregate expression
  2. The NoPartial/SinglePartition mode

Out of curiosity: is the DISTINCT something that is applicable to an aggregate expression in general, or is something that is only applicable to a subset of aggregations (such as count, sum)?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wouldn't it be possible to use ScalarValue, instead of declaring a new enum?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These are used as the values in the iterator's HashSet, and unfortunately it's not possible to derive Hash for ScalarValue since it nests f32 and f64.

As you suggested, GroupByScalar is a good candidate here. (I assumed you meant GroupByScalar rather than GroupByKey?)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

... this way, we do not need to go ScalarValue -> DistinctScalarValue...

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

... and back: DistinctScalarValue -> ScalarValue

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wouldn't it be possible to implement this one here? Loop through the elements one by one.

Copy link
Contributor Author

@drusso drusso Sep 22, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Certainly. This path isn't reachable in the scenarios this PR addresses, but I think this will be relevant to SELECT COUNT(DISTINCT col) queries that do not include a GROUP BY.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note that this will not apply any coercion to the arguments. We are trying to make aggregates::create_aggregate_expr be the entry point to create the physical expressions, to guarantee that the signature is correct and coercion rules apply.

However, I see that this depends on the AggregateMode, which the current create_aggregate_expr does not support. #8172 addresses exactly this. :)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch, thanks for pointing that out. I'll have a look at this as soon as I integrate with #8172.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A naming idea: SinglePartition

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Spark uses the term Complete. Our existing names Partial and Final were based on Spark's naming.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/**
 * An [[AggregateFunction]] with [[Complete]] mode is used to evaluate this function directly
 * from original input rows without any partial aggregation.
 * This function updates the given aggregation buffer with the original input of this
 * function. When it has processed all input rows, the final result of this function is returned.
 */
case object Complete extends AggregateMode

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Both suggestions sound good. Let me know if there's a preference one way or the other. For now I will go ahead with Complete for consistency with Spark.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A more efficient implementation for distinct integers could be a bitset

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good suggestion. A couple of questions:

  • Would this generalize to (or be easily adapted for) non-integers, like floats or strings?
  • Is there DataFusion tooling for benchmarking different implementations?

Given that, perhaps this is an optimization we can explore at a later time?

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this approach (of keeping the distinct values as set) will work; Given that @drusso coded it up, and it has tests I think it would be good enough to merge after addressing comments.

One potential downside of the approach in this PR is that it will likely struggle when there are a large number of distinct values per group (as the FnvHashSet will be huge) as well as needing special support for each datatype.

Another approach we could consider is to use a HashAggregateExpr operation as a pre-filter to remove duplicates. The example of:

SELECT c1, COUNT(DISTINCT c2) FROM t1 GROUP BY c1

Could be expressed as a (LogicalPlan) like the following (no physical plan changes needed):

HashAggregateExec: // this second phase then counts
  group_expr:
    Column(c1)
  aggr_expr:
    CountReduce(Column(c2))
  input:
    HashAggregateExec: // this first agg expr finds all distinct values of (c1,c2)
      group_expr:
        Column(c1), Column(c2)
        input:
          CsvExec:

The primary advantage of this strategy is that it can take advantage of both the existing GroupBy machinery as well as any future enhancements that are made. For example it would handle all datatypes without any special case code today. Also if we added any other group by types (e.g. GroupByMerge (if the data was sorted by group keys) that could also be used for distinct queries.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: I'd prefer that we use expect rather than unwrap so we can add a meaningful message.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sounds good, will update.

@andygrove
Copy link
Member

Thanks @drusso this looks great. I agree with Jorge's comments about ScalarValue and I also added some minor comments.

@drusso
Copy link
Contributor Author

drusso commented Sep 22, 2020

Thanks for the review/feedback all!

@jorgecarleitao:

it may be worth take a look at #8172 , where we are trying to improve how to declare and run aggregate expressions. Many parts of this PR are in conflict with that one (at API level, all the implementation here is valid).

I will get the changes here updated and integrated with #8172. It looks like they've already landed 👍

Out of curiosity: is the DISTINCT something that is applicable to an aggregate expression in general, or is something that is only applicable to a subset of aggregations (such as count, sum)?

To my knowledge DISTINCT is applicable to all aggregations. Though there are some cases where DISTINCT won't have any effect on the result, for example the results of MIN(col) and MIN(DISTINCT col) are always the same. This is something the logical (or physical) planner can optimize for – there's no need to do the work to compute the distinct set in those cases.

(Also note that DISTINCT is applicable outside the context of aggregations, for example SELECT DISTINCT c1, c2 FROM t1.)

@alamb:

One potential downside of the approach in this PR is that it will likely struggle when there are a large number of distinct values per group (as the FnvHashSet will be huge) as well as needing special support for each datatype.

Agreed. I'm happy to continue iterating and improving on the work here.

Another approach we could consider is to use a HashAggregateExpr operation as a pre-filter to remove duplicates.

I had this thought as well, however – and correct me if I'm wrong – it doesn't generalize to a scenario like:

SELECT c1, COUNT(DISTINCT c2), COUNT(DISTINCT c3) FROM t1 GROUP BY c1

For the COUNT(DISTINCT c2) expression we would want distinct pairs of (c1, c2); for the COUNT(DISTINCT c3) expression we would want distinct pairs of (c1, c3). But we can't do both simultaneously. The existing implementation handles that scenario. Perhaps there's some ideas here we can use for further optimizations, though.

@alamb
Copy link
Contributor

alamb commented Sep 22, 2020

@drusso I think you are correct that we would need a separate group by operator for each count distinct and then combine them together:

so SELECT c1, COUNT(DISTINCT c2), COUNT(DISTINCT c3) FROM t1 GROUP BY c1 might look like

HashAggregateExec: // this second phase then counts
  group_expr:
    Column(c1)
  aggr_expr:
    CountReduce(Column(c2))
  input:
    HashAggregateExec: // this first agg expr finds all distinct values of (c1,c2)
      group_expr:
        Column(c1), Column(c2)
        input:
          CsvExec:

JOIN ON (c1):

HashAggregateExec: // this second phase then counts
  group_expr:
    Column(c1)
  aggr_expr:
    CountReduce(Column(c3))
  input:
    HashAggregateExec: // this first agg expr finds all distinct values of (c1,c2)
      group_expr:
        Column(c1), Column(c3)
        input:
          CsvExec:

Or something. I like your suggestion to get an implementation in (this one) and then iterate as needed

@andygrove
Copy link
Member

Hi @drusso I would like to review and merge this DF PR next. Would you mind rebasing?

@drusso
Copy link
Contributor Author

drusso commented Sep 25, 2020

@andygrove (cc @jorgecarleitao):

My apologies, I don't have the changes ready yet. Though I did have some time today to look into integrating this with #8172/master.

I'm still digesting the upstream changes, and I see accumulators now have partial "state" that they emit and ingest across stages. This is great, and I think it would work well for a distinct count accumulator implementation.

However, I think there's a small roadblock with porting the current implementation.

I believe the approach would be to implement state_fields() to return one field for each input field. Each state field is a LargeList, and each list's subtype mirrors the type from the corresponding input field. Then, state() returns, for each field, the distinct values.

As an example, given a table:

| c1 | c2 | c3 |
|----|----|----|
| a  |  1 |  5 |
| a  |  1 |  5 |
| a  |  2 |  5 |
| a  |  1 |  6 |
| a  |  1 |  5 |
| b  |  1 |  8 |

And given a query:

SELECT c1, COUNT(DISTINCT c2, c3) as d FROM t1 GROUP BY c1

Then the set of output Arrayfrom the first HashAggregateExec would be:

c1   = [ a         b ]
d_c2 = [ [ 1 2 1 ] [ 1 ] ]
d_c3 = [ [ 5 5 6 ] [ 8 ] ]

Assuming this is all looking correct so far, then the issue is state() is limited to returning a single scalar per field. However, we would need to allow for a list of scalars per field.

There are a number of paths forward. Brainstorming some solutions:

  • What if the responsibility of Array building is moved away from HashAggregateExec (currently here) to somewhere more specific to an aggregate and its accumulator. Then each aggrgate has the flexibility to create its own arrays as needed. Maybe finalize_aggregation() would supply the accumulators, and recieve back ArrayRef (one per field)?

  • A new enum to distinguish between single and multiple values. Then state()'s return is Result<Vec<Value>>, assuming Value is defined as:

enum Value {
    Single(ScalarValue),
    Multi(Vec<ScalarValue>),
}

Let me know what you think and how we can proceed. Thanks!

@jorgecarleitao
Copy link
Member

I agree with your reasoning.

AFAIK, scalar in the context of Arrow is not a mathematical scalar like f64: scalar is a "1-element representation of an Arrow Array" (see pyarrow). In this context, ScalarValue::List(Arc<ScalarValue>) is not implemented in Rust Arrow, but IMO we should have one, because there is a corresponding ListArray in the spec.

With this in mind, another idea is to map your second idea to:

  • wrap those 2 (N in general) fields in a ScalarValue::List(ScalarValue::List).

I think that this was the original intention of those structures in the Arrow specs: to compose complex structures. This was also the reasoning behind emitting ScalarValue: an accumulator's state is a single row in the group of accumulators, but each state can hold arbitrarily complex datum.

This would keep us using Arrow as the medium of holding and transmitting (immutable) data, and would keep the execution's schema a bit more predictable.

@drusso
Copy link
Contributor Author

drusso commented Sep 26, 2020

Thanks for clarifying Arrow scalars, I was under the wrong impression they were strictly for primitive types and not composite types. An Arrow scalar as a single item/element in any Array (ListArray included) makes sense to me.

With that in mind, and following the lead from pyarrow, the corresponding scalar variants would be:

  • ScalarValue::List(Vec<ScalarValue>)
  • ScalarValue::LargeList(Vec<ScalarValue>)

In the scope of this pull request, I would need to add ScalarValue::LargeList (I opted for a LargeListArray instead of a ListArray). No changes to state() necessary.

Let me know if that sounds good and I can proceed.

@jorgecarleitao
Copy link
Member

Sounds good to me 🚀

@drusso
Copy link
Contributor Author

drusso commented Oct 2, 2020

Just a quick update here, I'll have some time today and over the weekend to work through the changes.

@drusso
Copy link
Contributor Author

drusso commented Oct 4, 2020

I've rebased against the latest changes on the master branch. Some notes and questions about the changes:

  • Most of the changes are scoped to DataFusion, but an update to the concat kernel was required. Can the Arrow changes be bundled with the DataFusion changes, or should this be a separate issue/pull request?
  • I kept the changes to the concat kernel as small as possible, scoped strictly to support the count distinct feature in this pull request. However, there is room for improvement here, because we can support lists of any data type.
  • Is there an implementation to go from Vec<Vec<_>> to ListArray? I wasn't aware of one, but something along those lines I think would improve the readability of the test_concat_primitive_list_arrays test added in the concat kernel.
  • Made the switch from LargeListArray to ListArray. This was motivated by the take kernel already supporting ListArray, but not LargeListArray. Supporting LargeListArray would require handling of i64 offsets – not for addressing the array's lists (i32 is fine for that), but for addressing the array's lists' values (there is currently a recursive call to take values here). Nothing that can't be solved, but I thought it best to leave that out of scope for this particular pull request.
  • Following the ARROW-9937: [Rust] [DataFusion] Improved aggregations #8172 changes, and since aggretations now have their own state, there was no longer a need for a new AggregateMode variant. This change was removed.
  • As discussed, I've replaced DistinctScalarValue with GroupByScalar. To do so, I promoted GroupByScalar to its own module.
  • As described in the issue, the goal is to implement COUNT(DISTINCT col) for a single argument. Much of this implementation should work as-is for multiple arguments, but I haven't yet tested or tried that.
  • The state of DistinctCountAccumulator uses one field per input input argument. Each field is emitted as a ListArray, i.e., a vector of N arrays of lists of primitives. An alternate implementation, which may be worth exploring, would be to use a single state field, and it emits a vector on 1 array of lists of structs of primitives.

Let me know what you think and if any changes are needed. Thanks!

Copy link
Member

@jorgecarleitao jorgecarleitao left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I went through this. Really good work, @drusso . The design and implementation looks good. My main comments revolve around tests, as IMO we need a bit more coverage.

IMO it is ok to push to both arrow and datafusion in a single PR. Kernels to the arrow project is a good thing IMO, so that consumers of the arrow (but not datafusion) also benefit from them. If needed, we can create a separate PR for it also. Feel free to add take for LargeList to Jira, in case someone feels motivated to work on it.

I do not think that there is an implementation of Vec<Vec<_>> to List yet, no. :(

I think that it is fine keeping ListArray for now.

The state of DistinctCountAccumulator uses one field per input input argument. [...]

I agree, and I think that we may be able to place all states in a single StructArray, which would simplify things. The trade-off is the vectorization. I imagine we benefit when merge_batch uses an existing horizontal operation, and having a common, aligned types, helps.

As discussed, I've replaced DistinctScalarValue with GroupByScalar. To do so, I promoted GroupByScalar to its own module.

👍

}

impl Accumulator for DistinctCountAccumulator {
fn update_batch(&mut self, arrays: &Vec<ArrayRef>) -> Result<()> {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this expression seems similar to the default update_batch (implemented by the trait Accumulator). Do you think there's a way to generalize the default to take this Accumulator into account, so that we can DRY some code? (same for merge_batch)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I will take a look and get back to you sometime today or tomorrow.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The update_batch() were actually equivalent, removed it. See 3c87e89.

There's also a way to remove the specialized merge_batch() implementation if merge() is implemented as:

fn merge(&mut self, states: &Vec<ScalarValue>) -> Result<()> {
    if states.len() == 0 {
        return Ok(());
    }

    let col_values = states
        .iter()
        .map(|state| match state {
            ScalarValue::List(Some(values), _) => Ok(values),
            _ => Err(ExecutionError::InternalError(
                "Unexpected accumulator state".to_string(),
            )),
        })
        .collect::<Result<Vec<_>>>()?;

    (0..col_values[0].len())
        .map(|row_index| {
            let row_values = col_values
                .iter()
                .map(|col| col[row_index].clone())
                .collect::<Vec<_>>();
            self.update(&row_values)
        })
        .collect::<Result<_>>()
}

I think that works out well. I don't have any preference one way or the other, should I make the swap?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That is great! Less code for the update :)

For the merge, If they are equivalent speed-wise, I think that it would make it a bit easier to understand: this is how we merge a single element; this is how we update a single element. But both are LGTM for me

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I made the switch (see d61ec97). For now I think the DRY approach makes sense, and as a follow-up I can explore how the different options perform. I can use benches/aggregate_query_sql.rs as a starting point for the comparisons, if that makes sense?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Regarding the impact of d61ec97:

I used the benchmarks in #8606 to compare 57893b4 vs. 57893b4+the following patch, and there was indeed no noticeable difference in the implementations 👍

diff --git a/rust/datafusion/src/physical_plan/distinct_expressions.rs b/rust/datafusion/src/physical_plan/distinct_expressions.rs
index c2183ca3b..d7745b0fa 100644
--- a/rust/datafusion/src/physical_plan/distinct_expressions.rs
+++ b/rust/datafusion/src/physical_plan/distinct_expressions.rs
@@ -23,6 +23,7 @@ use std::hash::Hash;
 use std::sync::Arc;
 
 use arrow::datatypes::{DataType, Field};
+use arrow::array::{ArrayRef, ListArray};
 
 use fnv::FnvHashSet;
 
@@ -123,29 +124,27 @@ impl Accumulator for DistinctCountAccumulator {
     }
 
     fn merge(&mut self, states: &Vec<ScalarValue>) -> Result<()> {
-        if states.len() == 0 {
-            return Ok(());
-        }
+        self.update(states)
+    }
 
-        let col_values = states
+    fn merge_batch(&mut self, states: &Vec<ArrayRef>) -> Result<()> {
+        let list_arrays = states
             .iter()
-            .map(|state| match state {
-                ScalarValue::List(Some(values), _) => Ok(values),
-                _ => Err(DataFusionError::Internal(
-                    "Unexpected accumulator state".to_string(),
-                )),
+            .map(|state_array| {
+                state_array.as_any().downcast_ref::<ListArray>().ok_or(
+                    DataFusionError::Internal(
+                        "Failed to downcast ListArray".to_string(),
+                    ),
+                )
             })
             .collect::<Result<Vec<_>>>()?;
 
-        (0..col_values[0].len())
-            .map(|row_index| {
-                let row_values = col_values
-                    .iter()
-                    .map(|col| col[row_index].clone())
-                    .collect::<Vec<_>>();
-                self.update(&row_values)
-            })
-            .collect::<Result<_>>()
+        let values_arrays = list_arrays
+            .iter()
+            .map(|list_array| list_array.values())
+            .collect::<Vec<_>>();
+
+        self.update_batch(&values_arrays)
     }
 
     fn state(&self) -> Result<Vec<ScalarValue>> {

.downcast_ref::<array::StringArray>()
.unwrap()
.value(row_index),
),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that I also miss a test in this module

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See 15e6f0b.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am really sorry, I misread this. What I meant was not to test the format_batch function, but to test count distinct in integration.

However, I see that this was not the module I though it was. The code in this module is perfect as is now.

I meant adding a test here, which is where we perform integration tests.

Something along the lines of:

#[tokio::test]
async fn query_count_distinct() -> Result<()> {
    let schema = Arc::new(Schema::new(vec![
        Field::new("c1", DataType::Int32, true),
    ]));

    let data = RecordBatch::try_new(
        schema.clone(),
        vec![
            Arc::new(Int32Array::from(vec![Some(0), Some(1), None, Some(3), Some(3)])),
        ],
    )?;

    let table = MemTable::new(schema, vec![vec![data]])?;

    let mut ctx = ExecutionContext::new();
    ctx.register_table("test", Box::new(table));
    let sql = "SELECT COUNT(DISTINCT c1) FROM test";
    let actual = execute(&mut ctx, sql).await;
    let expected = vec![
        vec!["0", "1"],
        vec!["1", "1"],
        vec!["3", "2"],
    ];
    assert_eq!(expected, actual);
    Ok(())
}

does this make sense?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No problem!

This makes perfect sense, and I added the test (2d0999a). However, I also had to make the count(distinct) field nullable (4cdb951), otherwise I ran into an assertion error here due to a mismatch of the field's nullability between the logical plan and physical plan.

The logical plan nullability is set here, and is always nullable. All of the regular aggregate expressions mark their field as nullable, for example, here for the regular Count.

I might be mistaken, but I think regular and distinct counts should be non-nullable? In any case, I went with making count(distinct) nullable for consistency with count(). Perhaps there's a follow-up here?

Copy link
Member

@jorgecarleitao jorgecarleitao Oct 8, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that count(distinct) should be null when all entries are null, for consistency with count. I do not think that it blocks this, though. It is an edge case IMO.

Copy link
Member

@jorgecarleitao jorgecarleitao left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM Really great work, @drusso , both the implementation, the approach to the problem, and the follow-up up to completion.

@andygrove andygrove closed this in 1c7581c Oct 8, 2020
@drusso
Copy link
Contributor Author

drusso commented Oct 9, 2020

Thanks @jorgecarleitao! And also, thank you for taking the time for the thorough review. This was a fun feature to work on.

nevi-me pushed a commit that referenced this pull request Nov 7, 2020
[ARROW-10510](https://issues.apache.org/jira/browse/ARROW-10510)

This change adds benchmarks for `COUNT(DISTINCT)` queries. This is a small follow-up to [ARROW-10043](https://issues.apache.org/jira/browse/ARROW-10043) / #8222. In that PR, a number of implementation ideas were discussed for follow-ups, and having benchmarks will help evaluate them.

---

There are two benchmarks added:

* wide: all of the values are distinct; this is looking at worst-case performance
* narrow: only a handful of distinct values; this is closer to best-case performance

The wide benchmark runs ~ 7x slower than the narrow benchmark.

Closes #8606 from drusso/ARROW-10510

Authored-by: Daniel Russo <[email protected]>
Signed-off-by: Neville Dipale <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants