I'd like to propose a rare_terms aggregation that collects the set of terms which have n or fewer occurrences.  This is essentially the opposite of "top-n" queries.
The motivation is that today, the only way to accurately collect the "lowest-n" is to execute a terms aggregation with size: INT_MAX so that all terms are collected, then filter client-side or with bucket_selector.  This is obviously expensive: it requires the memory overhead of all the terms, plus executing leaf aggregations on all terms despite only caring about the "lowest-n".
Sorting by count ascending on the terms agg without setting size INT_MAX is trappy and we'd love to remove it :)
Algorithm
The algorithm uses a map and a bloom filter. The map tracks the current set of rare terms and their counts. The bloom filter tracks the set of terms whose counts have exceeded max_doc_count.
Pseudocode:
Map<String, int> map = new Map();
BloomFilter bloom = new BloomFilter();

function collect(term) {
  // Check to see if this term is in our bloom filter
  if (bloom.get(term) == false) {
    int value = map.get(term);
    if (value == null) {
      // We've never seen this term before, initialize its counter to 1
      map.put(term, 1);
    } else {
      value += 1;
      if (value <= max_doc_count) {
        // We've seen this term before, but not past the threshold,
        // so just update its counter
        map.put(term, value);
      } else {
        // Otherwise we've breached the threshold: remove from
        // the map and add to the bloom filter
        map.remove(term);
        bloom.add(term);
      }
    }
  }
  /*
    There is no else clause, because inclusion in the bloom filter
    means the term has been seen more than max_doc_count times and so
    is no longer "rare".
    Note that due to the nature of bloom filters, there will be some
    false positives, which translate into not adding a term to the
    "rare list" when it should be added.
  */
}

Note: this ignores all the technical bits about replaying cached docs, etc.
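For concreteness, here is a minimal runnable sketch of the collect routine in Java. The choice of Guava's BloomFilter, the string funnel, and the expected-insertions sizing are illustrative assumptions on my part, not part of the proposal:

```java
import com.google.common.hash.BloomFilter;
import com.google.common.hash.Funnels;

import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;

class RareTermsCollector {
    private final int maxDocCount;
    // Tracks the current set of rare terms and their counts
    private final Map<String, Integer> counts = new HashMap<>();
    // Tracks terms that have breached maxDocCount (sizing is a guess for illustration)
    private final BloomFilter<String> bloom =
            BloomFilter.create(Funnels.stringFunnel(StandardCharsets.UTF_8), 1_000_000);

    RareTermsCollector(int maxDocCount) {
        this.maxDocCount = maxDocCount;
    }

    void collect(String term) {
        if (bloom.mightContain(term)) {
            return; // already seen more than maxDocCount times (modulo false positives)
        }
        int count = counts.getOrDefault(term, 0) + 1;
        if (count <= maxDocCount) {
            counts.put(term, count); // still rare: remember the new count
        } else {
            counts.remove(term);     // breached the threshold: evict from the map...
            bloom.put(term);         // ...and remember it in the bloom filter
        }
    }

    Map<String, Integer> rareTerms() {
        return counts;
    }
}
```

Note that once a bloom false positive causes a genuinely rare term to be skipped, it stays skipped, which is the false-negative property described under Properties below.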
Some potential extensions:
- If max_doc_count: 1, we don't need to store the count and could instead just use a Set to further reduce the size (see the sketch after this list)
- We could potentially be clever and not use a full int for each counter and instead do some bit-packing (e.g. max_doc_count: 16 only needs 4 bits), but we wouldn't be able to use a nice, convenient map :)
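As a sketch of the first extension (reusing the hypothetical Guava-backed collector above, with `bloom` as the same field), the max_doc_count: 1 case reduces the counting map to a plain Set:

```java
import java.util.HashSet;
import java.util.Set;

// Drop-in variant of collect() for max_doc_count: 1.
// A term seen once is rare; a second sighting evicts it.
private final Set<String> rare = new HashSet<>();

void collect(String term) {
    if (bloom.mightContain(term)) {
        return; // already seen at least twice (modulo false positives)
    }
    if (!rare.add(term)) {   // add() returns false if the term was already present
        rare.remove(term);   // second occurrence: no longer rare
        bloom.put(term);     // remember it so later occurrences are skipped cheaply
    }
}
```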
Properties
- Size is linear in the number of "rare" terms, plus some constant overhead.
- Antagonistic input (where the majority of terms are "rare") is still undeniably expensive, but better than a terms aggregation
- Worst case where all terms are rare is basically equivalent to a terms agg with size: INT_MAX
- Best we can tell, in all cases it is no worse than a terms aggregation in space or time.
 
- Must be executed as a deferred bucket agg (e.g. collect_mode: breadth_first)
  - As a benefit, leaf aggregations are only executed on the actually rare terms
- Resulting rare terms are guaranteed to satisfy max_doc_count (i.e. no false positives)
- Some terms may be rare but not represented in the list (i.e. it has false negatives)
- The max_doc_count threshold is configurable, although the tradeoff between this and the terms agg becomes less clear-cut as max_doc_count increases (e.g. max_doc_count: 1000 will likely include a large portion of a zipfian distribution, so a terms agg coming from the other direction may be better)
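To make the configurable threshold concrete, a request using the proposed agg might look something like the following. The request shape, agg name, and field are all hypothetical at this point, since nothing is implemented yet:

```
POST /logs/_search
{
  "aggs": {
    "rare_urls": {
      "rare_terms": {
        "field": "url",
        "max_doc_count": 2
      }
    }
  }
}
```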
Potentially this is sufficient that #17614 can be unblocked and we can make the terms agg less trappy, while providing better accuracy / space / time on rare-term aggregations.
I'm going to start poking at an implementation.
/cc @colings86 @jpountz