I'd like to propose a rare_terms aggregation that collects the set of terms which have n or fewer occurrences.  This is essentially the opposite of "top-n" queries.
The motivation is that today, the only way to accurately collect the "lowest-n" is to execute a terms aggregation with size: INT_MAX so that all terms are collected, then filter client-side or with bucket_selector.  This is obviously expensive: it requires the memory overhead of all the terms, plus executing leaf aggregations on all terms despite only caring about the "lowest-n".
Sorting by count ascending on the terms agg without setting size INT_MAX is trappy and we'd love to remove it :)
Algorithm
The algorithm uses a map and a bloom filter. The map tracks the current set of rare terms and their counts. The bloom filter tracks the set of terms whose counts have exceeded max_doc_count.
Pseudocode:
Map<String, int> map = new Map();
BloomFilter bloom = new BloomFilter();

function collect(term) {
  // Check to see if this term is in our bloom filter
  if (bloom.get(term) == false) {
    int value = map.get(term);
    if (value == null) {
      // We've never seen this term before, initialize its counter to 1
      map.put(term, 1);
    } else {
      value += 1;
      if (value <= max_doc_count) {
        // We've seen this term before, but not past the threshold,
        // so just update its counter
        map.put(term, value);
      } else {
        // Otherwise we've breached the threshold: remove from
        // the map and add to the bloom filter
        map.remove(term);
        bloom.add(term);
      }
    }
  }
  /*
    There is no else clause, because inclusion in the bloom filter
    means the term has been seen more than max_doc_count times and so
    is no longer "rare".
    Note that due to the nature of bloom filters, there will be some
    false positives, which translate into not adding a term to the
    "rare list" when it should be added.
  */
}

Note: this ignores all the technical bits about replaying cached docs, etc.
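For concreteness, here is a minimal runnable sketch of the collect routine in Java. The choice of Guava's BloomFilter, the string funnel, and the expected-insertions sizing are illustrative assumptions on my part, not part of the proposal:

```java
import com.google.common.hash.BloomFilter;
import com.google.common.hash.Funnels;

import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;

class RareTermsCollector {
    private final int maxDocCount;
    // Tracks the current set of rare terms and their counts
    private final Map<String, Integer> counts = new HashMap<>();
    // Tracks terms that have breached maxDocCount (sizing is a guess for illustration)
    private final BloomFilter<String> bloom =
            BloomFilter.create(Funnels.stringFunnel(StandardCharsets.UTF_8), 1_000_000);

    RareTermsCollector(int maxDocCount) {
        this.maxDocCount = maxDocCount;
    }

    void collect(String term) {
        if (bloom.mightContain(term)) {
            return; // already seen more than maxDocCount times (modulo false positives)
        }
        int count = counts.getOrDefault(term, 0) + 1;
        if (count <= maxDocCount) {
            counts.put(term, count); // still rare: remember the new count
        } else {
            counts.remove(term);     // breached the threshold: evict from the map...
            bloom.put(term);         // ...and remember it in the bloom filter
        }
    }

    Map<String, Integer> rareTerms() {
        return counts;
    }
}
```

Note that once a bloom false positive causes a genuinely rare term to be skipped, it stays skipped, which is the false-negative property described under Properties below.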
Some potential extensions:
- If max_doc_count: 1, we don't need to store the count and could instead just use a Set to further reduce the size (see the sketch after this list)
- We could potentially be clever and not use a full int for each counter and instead do some bit-packing (e.g. max_doc_count: 16 only needs 4 bits), but we wouldn't be able to use a nice, convenient map :)
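As a sketch of the first extension (reusing the hypothetical Guava-backed collector above, with `bloom` as the same field), the max_doc_count: 1 case reduces the counting map to a plain Set:

```java
import java.util.HashSet;
import java.util.Set;

// Drop-in variant of collect() for max_doc_count: 1.
// A term seen once is rare; a second sighting evicts it.
private final Set<String> rare = new HashSet<>();

void collect(String term) {
    if (bloom.mightContain(term)) {
        return; // already seen at least twice (modulo false positives)
    }
    if (!rare.add(term)) {   // add() returns false if the term was already present
        rare.remove(term);   // second occurrence: no longer rare
        bloom.put(term);     // remember it so later occurrences are skipped cheaply
    }
}
```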
Properties
- Size is linear in the number of "rare" terms, plus some constant overhead.
- Antagonistic input (where the majority of terms are "rare") is still undeniably expensive, but better than a terms aggregation
- Worst case where all terms are rare is basically equivalent to a terms agg with size: INT_MAX
- Best we can tell, in all cases it is no worse than a terms aggregation in space or time.
 
- Must be executed as a deferred bucket agg (e.g. collect_mode: breadth_first)
  - As a benefit, leaf aggregations are only executed on the actually rare terms
- Resulting rare terms are guaranteed to satisfy max_doc_count (i.e. no false positives)
- Some terms may be rare but not represented in the list (i.e. it has false negatives)
- The max_doc_count threshold is configurable, although the tradeoff between this and the terms agg becomes less clear-cut as max_doc_count increases (e.g. max_doc_count: 1000 will likely include a large portion of a zipfian distribution, so a terms agg coming from the other direction may be better)
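To make the configurable threshold concrete, a request using the proposed agg might look something like the following. The request shape, agg name, and field are all hypothetical at this point, since nothing is implemented yet:

```
POST /logs/_search
{
  "aggs": {
    "rare_urls": {
      "rare_terms": {
        "field": "url",
        "max_doc_count": 2
      }
    }
  }
}
```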
Potentially this is sufficient that #17614 can be unblocked and we can make the terms agg less trappy, while providing better accuracy / space / time on rare-term aggregations.
I'm going to start poking at an implementation.
/cc @colings86 @jpountz