elastic · polyfractal · Jul 1, 2019 · Nov 19, 2018 · Jan 22, 2019 · Jan 22, 2019
diff --git a/docs/reference/aggregations/bucket/rare-terms-aggregation.asciidoc b/docs/reference/aggregations/bucket/rare-terms-aggregation.asciidoc
@@ -174,7 +174,23 @@ which match the `max_doc_count` criteria will be returned.  The aggregation func
 the order-by-ascending issues that afflict the `terms` aggregation.
 
 This does, however, mean that  a large number of results can be returned if chosen incorrectly.
-To limit the danger of this setting, the maximum `max_doc_count` is 10.
+To limit the danger of this setting, the maximum `max_doc_count` is 100.
+
+[[search-aggregations-bucket-rare-terms-aggregation-max-buckets]]
+==== `search.max_buckets`
+
+The Rare Terms aggregation is more liable to trip the `search.max_buckets` soft limit than other aggregations due
+to how it works.  The `max_buckets` soft limit is evaluated on a per-shard basis while the aggregation is collecting
+results.  It is possible for a term to be "rare" on a shard but become "not rare" once all the shard results are
+merged together.  This means that individual shards tend to collect more buckets than are truly rare, because
+they only have their own local view.  This list is ultimately pruned to the correct, smaller list of rare
+terms on the coordinating node... but a shard may have already tripped the `max_buckets` soft limit and aborted
+the request.
+
+When aggregating on fields that have potentially many "rare" terms, you may need to increase the `max_buckets` soft
+limit.  Alternatively, you might need to find a way to filter the results to return fewer rare values (smaller time
+span, filter by category, etc.), or re-evaluate your definition of "rare" (e.g. if something
+appears 100,000 times, is it truly "rare"?).
 
 [[search-aggregations-bucket-rare-terms-aggregation-approximate-counts]]
 ==== Document counts are approximate
@@ -200,8 +216,10 @@ a different shard's CuckooFilter) the term is removed from the merged list.  The
 to the user as the "rare" terms.
 
 CuckooFilters have the possibility of returning false positives (they can say a value exists in their collection when
-it does not actually).  Since the CuckooFilter is being used to see if a term is over threshold, this means a false positive
+it actually does not).  Since the CuckooFilter is being used to see if a term is over threshold, this means a false positive
 from the CuckooFilter will mistakenly say a value is common when it is not (and thus exclude it from it final list of buckets).
+Practically, this means the aggregation exhibits false-negative behavior since the filter is being used "in reverse"
+of how people generally think of approximate set membership sketches.
 
 CuckooFilters are described in more detail in the paper:
 
@@ -210,12 +228,44 @@ Proceedings of the 10th ACM International on Conference on emerging Networking E
 
 ==== Precision
 
-Although the internal CuckooFilter is approximate in nature, the false-positive rate can be controlled with a
+Although the internal CuckooFilter is approximate in nature, the false-negative rate can be controlled with a
 `precision` parameter.  This allows the user to trade more runtime memory for more accurate results.
 
-The default precision is `0.01`, and the smallest (e.g. most accurate and largest memory overhead) is `0.00001`.
+The default precision is `0.001`, and the smallest (i.e. most accurate and largest memory overhead) is `0.00001`.
+Below are some charts which demonstrate how the accuracy of the aggregation is affected by precision and number
+of distinct terms.
+
+The X-axis shows the number of distinct values the aggregation has seen, and the Y-axis shows the percent error.
+Each line series represents one "rarity" condition (ranging from one rare item to 100,000 rare items).  For example,
+the orange "10" line means ten of the values were "rare" (`doc_count == 1`), out of 1-20m distinct values (where the
+rest of the values had `doc_count > 1`).
+
+This first chart shows precision `0.01`:
+
+image:images/rare_terms/accuracy_01.png[]
+
+And precision `0.001` (the default):
+
+image:images/rare_terms/accuracy_001.png[]
+
+And finally precision `0.0001`:
+
+image:images/rare_terms/accuracy_0001.png[]
+
+The default precision of `0.001` keeps the error below 2.5% for the tested conditions, and accuracy slowly
+degrades in a controlled, linear fashion as the number of distinct values increases.
+
+The default precision of `0.001` has a memory profile of `1.748e-6 * n` megabytes, where `n` is the number
+of distinct values the aggregation has seen (it can also be roughly eyeballed, e.g. 20 million unique values is about
+30mb of memory).  The memory usage is linear in the number of distinct values regardless of which precision is chosen;
+the precision only affects the slope of the memory profile, as seen in this chart:
+
+image:images/rare_terms/memory.png[]
 
-TODO charts here
+For comparison, an equivalent terms aggregation at 20 million buckets would be roughly
+`20m * 69b == ~1.38gb` (with 69 bytes being a very optimistic estimate of an empty bucket cost, far lower than what
+the circuit breaker accounts for).  So although the `rare_terms` agg is relatively heavy, it is still orders of
+magnitude smaller than the equivalent terms aggregation.
 
 ==== Filtering Values
 

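The `search.max_buckets` section added above rests on the difference between a shard's local view and the merged view. The following is a minimal sketch of that difference, not the aggregation's actual implementation: it assumes exact per-shard counts held in plain maps (the real aggregation uses CuckooFilters rather than exact counts), and the class and method names are hypothetical.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

public class ShardRareTermsSketch {

    /** Terms whose count is at or below maxDocCount in the given view. */
    static Set<String> locallyRare(Map<String, Long> counts, long maxDocCount) {
        Set<String> rare = new TreeSet<>();
        for (Map.Entry<String, Long> entry : counts.entrySet()) {
            if (entry.getValue() <= maxDocCount) {
                rare.add(entry.getKey());
            }
        }
        return rare;
    }

    /** Terms that are still rare once the counts of all shards are summed. */
    static Set<String> globallyRare(List<Map<String, Long>> shards, long maxDocCount) {
        Map<String, Long> merged = new HashMap<>();
        for (Map<String, Long> shard : shards) {
            shard.forEach((term, count) -> merged.merge(term, count, Long::sum));
        }
        return locallyRare(merged, maxDocCount);
    }

    public static void main(String[] args) {
        // Hypothetical per-shard document counts
        Map<String, Long> shardA = Map.of("alpha", 1L, "beta", 1L, "gamma", 5L);
        Map<String, Long> shardB = Map.of("alpha", 1L, "delta", 1L, "gamma", 7L);
        long maxDocCount = 1;

        // Each shard collects a bucket for "alpha" because it looks rare locally
        System.out.println("shard A: " + locallyRare(shardA, maxDocCount));
        System.out.println("shard B: " + locallyRare(shardB, maxDocCount));

        // After merging, "alpha" has two documents and is pruned
        System.out.println("merged:  " + globallyRare(List.of(shardA, shardB), maxDocCount));
    }
}
```

In this toy data set each shard holds two locally rare buckets, four in total, while the merged result also contains only two; with many shards and many near-rare terms the per-shard totals can grow much faster than the final list, which is when the soft limit is likely to trip.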
diff --git a/docs/reference/images/rare_terms/accuracy_0001.png b/docs/reference/images/rare_terms/accuracy_0001.png
diff --git a/docs/reference/images/rare_terms/accuracy_001.png b/docs/reference/images/rare_terms/accuracy_001.png
diff --git a/docs/reference/images/rare_terms/accuracy_01.png b/docs/reference/images/rare_terms/accuracy_01.png
diff --git a/docs/reference/images/rare_terms/memory.png b/docs/reference/images/rare_terms/memory.png
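The memory comparison in the Precision section of the documentation change can be reproduced with simple arithmetic. The sketch below only restates the figures quoted there (69 bytes per empty terms bucket, about 30mb for `rare_terms` at 20 million distinct values) and assumes decimal units (1gb = 10^9 bytes); the class and method names are hypothetical.

```java
public class RareTermsMemorySketch {

    // Optimistic per-bucket cost of a terms aggregation, as quoted in the docs
    static final long BYTES_PER_TERMS_BUCKET = 69L;

    /** Estimated bytes for a terms aggregation holding the given number of buckets. */
    static long termsAggBytes(long buckets) {
        return buckets * BYTES_PER_TERMS_BUCKET;
    }

    public static void main(String[] args) {
        long distinctValues = 20_000_000L;

        long termsBytes = termsAggBytes(distinctValues);
        // "About 30mb" is the rough rare_terms figure quoted in the docs
        long rareTermsBytes = 30_000_000L;

        System.out.println("terms agg bytes:      " + termsBytes);
        System.out.println("rare_terms bytes:     " + rareTermsBytes);
        System.out.println("ratio (integer part): " + (termsBytes / rareTermsBytes));
    }
}
```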
diff --git a/server/src/main/java/org/elasticsearch/common/util/CuckooFilter.java b/server/src/main/java/org/elasticsearch/common/util/CuckooFilter.java
@@ -21,13 +21,11 @@
 import org.apache.lucene.store.DataInput;
 import org.apache.lucene.store.DataOutput;
 import org.apache.lucene.util.packed.PackedInts;
-import org.elasticsearch.common.hash.MurmurHash3;
 import org.elasticsearch.common.io.stream.StreamInput;
 import org.elasticsearch.common.io.stream.StreamOutput;
 import org.elasticsearch.common.io.stream.Writeable;
 
 import java.io.IOException;
-import java.util.Arrays;
 import java.util.Iterator;
 import java.util.Objects;
 import java.util.Random;
@@ -52,6 +50,10 @@
  * (do not need to waste slots on duplicate fingerprints), and we do not need to worry
  * about inserts "overflowing" a bucket because the same item has been repeated repeatedly
  *
+ * NOTE: this CuckooFilter exposes a number of Expert APIs which assume the caller has
+ * intimate knowledge about how the algorithm works.  It is recommended to avoid these
+ * APIs, or better yet, use {@link SetBackedScalingCuckooFilter} instead.
+ *
  * Based on the paper:
  *
  * Fan, Bin, et al. "Cuckoo filter: Practically better than bloom."
@@ -87,7 +89,7 @@ public class CuckooFilter implements Writeable {
         this.bitsPerEntry = bitsPerEntry(fpp, entriesPerBucket);
         this.numBuckets = getNumBuckets(capacity, loadFactor, entriesPerBucket);
 
-        if (numBuckets * entriesPerBucket > Integer.MAX_VALUE) {
+        if ((long) numBuckets * (long) entriesPerBucket > Integer.MAX_VALUE) {
             throw new IllegalArgumentException("Attempted to create [" + numBuckets * entriesPerBucket
                 + "] entries which is > Integer.MAX_VALUE");
         }
@@ -97,6 +99,9 @@ public class CuckooFilter implements Writeable {
         this.fingerprintMask = (0x80000000 >> (bitsPerEntry - 1)) >>> (Integer.SIZE - bitsPerEntry);
     }
 
+    /**
+     * This ctor is likely slow and should only be used for testing
+     */
     CuckooFilter(CuckooFilter other) {
         this.numBuckets = other.numBuckets;
         this.bitsPerEntry = other.bitsPerEntry;
@@ -107,7 +112,7 @@ public class CuckooFilter implements Writeable {
         this.fingerprintMask = other.fingerprintMask;
 
         // This shouldn't happen, but as a sanity check
-        if (numBuckets * entriesPerBucket > Integer.MAX_VALUE) {
+        if ((long) numBuckets * (long) entriesPerBucket > Integer.MAX_VALUE) {
             throw new IllegalArgumentException("Attempted to create [" + numBuckets * entriesPerBucket
                 + "] entries which is > Integer.MAX_VALUE");
         }
@@ -169,14 +174,44 @@ public int getCount() {
         return count;
     }
 
+    /**
+     * Returns the number of buckets that has been chosen based
+     * on the initial configuration
+     *
+     * Expert-level API
+     */
+    int getNumBuckets() {
+        return numBuckets;
+    }
+
+    /**
+     * Returns the number of bits used per entry
+     *
+     * Expert-level API
+     */
+    int getBitsPerEntry() {
+        return bitsPerEntry;
+    }
+
+    /**
+     * Returns the cached fingerprint mask.  This is simply a mask for the
+     * first bitsPerEntry bits, used by {@link CuckooFilter#fingerprint(int, int, int)}
+     * to generate the fingerprint of a hash
+     *
+     * Expert-level API
+     */
+    int getFingerprintMask() {
+        return fingerprintMask;
+    }
+
     /**
      * Returns an iterator that returns the long[] representation of each bucket.  The value
      * inside each long will be a fingerprint (or 0L, representing empty).
      *
      * Expert-level API
      */
     Iterator<long[]> getBuckets() {
-        return new Iterator<long[]>() {
+        return new Iterator<>() {
             int current = 0;
 
             @Override
@@ -199,21 +234,21 @@ public long[] next() {
      * Returns true if the set might contain the provided value, false otherwise.  False values are
      * 100% accurate, while true values may be a false-positive.
      */
-    boolean mightContain(MurmurHash3.Hash128 hash) {
-        int bucket = hashToIndex((int) hash.h1);
-        int fingerprint = fingerprint((int) hash.h2);
+    boolean mightContain(long hash) {
+        int bucket = hashToIndex((int) hash, numBuckets);
+        int fingerprint = fingerprint((int) (hash >>> 32), bitsPerEntry, fingerprintMask);
+        int alternateIndex = alternateIndex(bucket, fingerprint, numBuckets);
 
-        return mightContainFingerprint(bucket, fingerprint);
+        return mightContainFingerprint(bucket, fingerprint, alternateIndex);
     }
 
     /**
      * Returns true if the bucket or it's alternate bucket contains the fingerprint.
      *
-     * Expert-level API, use {@link CuckooFilter#mightContain(MurmurHash3.Hash128)} to check if
+     * Expert-level API, use {@link CuckooFilter#mightContain(long)} to check if
      * a value is in the filter.
      */
-    boolean mightContainFingerprint(int bucket, int fingerprint) {
-        int alternateBucket = alternateIndex(bucket, fingerprint);
+    boolean mightContainFingerprint(int bucket, int fingerprint, int alternateBucket) {
 
         // check all entries for both buckets and the evicted slot
         return hasFingerprint(bucket, fingerprint) || hasFingerprint(alternateBucket, fingerprint) || evictedFingerprint == fingerprint;
@@ -227,26 +262,30 @@ private boolean hasFingerprint(int bucket, long fingerprint) {
         int offset = getOffset(bucket, 0);
         data.get(offset, values, 0, entriesPerBucket);
 
-        return Arrays.stream(values).anyMatch(value -> value == fingerprint);
+        for (int i = 0; i < entriesPerBucket; i++) {
+            if (values[i] == fingerprint) {
+                return true;
+            }
+        }
+        return false;
     }
 
     /**
      * Add's the hash to the bucket or alternate bucket.  Returns true if the insertion was
      * successful, false if the filter is saturated.
      */
-    boolean add(MurmurHash3.Hash128 hash) {
-        // can only use 64 of 128 bytes unfortunately (32 for each bucket), simplest
-        // to just truncate h1 and h2 appropriately
-        int bucket = hashToIndex((int) hash.h1);
-        int fingerprint = fingerprint((int) hash.h2);
+    boolean add(long hash) {
+        // The bucket index and fingerprint each need 32 bits, so we truncate for the bucket and shift/truncate for the fingerprint
+        int bucket = hashToIndex((int) hash, numBuckets);
+        int fingerprint = fingerprint((int) (hash >>> 32), bitsPerEntry, fingerprintMask);
         return mergeFingerprint(bucket, fingerprint);
     }
 
     /**
      * Attempts to merge the fingerprint into the specified bucket or it's alternate bucket.
      * Returns true if the insertion was successful, false if the filter is saturated.
      *
-     * Expert-level API, use {@link CuckooFilter#add(MurmurHash3.Hash128)} to insert
+     * Expert-level API, use {@link CuckooFilter#add(long)} to insert
      * values into the filter
      */
     boolean mergeFingerprint(int bucket, int fingerprint) {
@@ -255,7 +294,7 @@ boolean mergeFingerprint(int bucket, int fingerprint) {
             return false;
         }
 
-        int alternateBucket = alternateIndex(bucket, fingerprint);
+        int alternateBucket = alternateIndex(bucket, fingerprint, numBuckets);
         if (tryInsert(bucket, fingerprint) || tryInsert(alternateBucket, fingerprint)) {
             count += 1;
             return true;
@@ -270,7 +309,7 @@ boolean mergeFingerprint(int bucket, int fingerprint) {
             // replace details and start again
             fingerprint = oldFingerprint;
             bucket = alternateBucket;
-            alternateBucket = alternateIndex(bucket, fingerprint);
+            alternateBucket = alternateIndex(bucket, fingerprint, numBuckets);
 
             // Only try to insert into alternate bucket
             if (tryInsert(alternateBucket, fingerprint)) {
@@ -317,28 +356,28 @@ private boolean tryInsert(int bucket, int fingerprint) {
      *
      * If the hash is negative, this flips the bits.  The hash is then modulo numBuckets
      * to get the final index.
+     *
+     * Expert-level API
      */
-    private int hashToIndex(int hash) {
-        // invert the bits if we're negative
-        if (hash < 0) {
-            hash = ~hash;
-        }
-        return hash % numBuckets;
+    static int hashToIndex(int hash, int numBuckets) {
+        return hash & (numBuckets - 1);
     }
 
     /**
      * Calculates the alternate bucket for a given bucket:fingerprint tuple
      *
      * The alternate bucket is the fingerprint multiplied by a mixing constant,
      * then xor'd against the bucket.  This new value is modulo'd against
-     * the buckets via {@link CuckooFilter#hashToIndex(int)} to get the final
+     * the buckets via {@link CuckooFilter#hashToIndex(int, int)} to get the final
      * index.
      *
      * Note that the xor makes this operation reversible as long as we have the
      * fingerprint and current bucket (regardless of if that bucket was the primary
      * or alternate).
+     *
+     * Expert-level API
      */
-    private int alternateIndex(int bucket, int fingerprint) {
+    static int alternateIndex(int bucket, int fingerprint, int numBuckets) {
         /*
             Reference impl uses murmur2 mixing constant:
             https://github.com/efficient/cuckoofilter/blob/master/src/cuckoofilter.h#L78
@@ -349,7 +388,7 @@ private int alternateIndex(int bucket, int fingerprint) {
                 return IndexHash((uint32_t)(index ^ (tag * 0x5bd1e995)));
          */
         int index = bucket ^ (fingerprint * 0x5bd1e995);
-        return hashToIndex(index);
+        return hashToIndex(index, numBuckets);
     }
 
     /**
@@ -365,16 +404,18 @@ private int getOffset(int bucket, int position) {
      *
      * The fingerprint is simply the first `bitsPerEntry` number of bits that are non-zero.
      * If the entire hash is zero, `(int) 1` is used
+     *
+     * Expert-level API
      */
-    private int fingerprint(int hash) {
+    static int fingerprint(int hash, int bitsPerEntry, int fingerprintMask) {
         if (hash == 0) {
             // we use 0 as "empty" so if the hash actually hashes to zero... return 1
             // Some other impls will re-hash with a salt but this seems simpler
             return 1;
         }
 
         for (int i = 0; i + bitsPerEntry <= Long.SIZE; i += bitsPerEntry) {
-            int v = (hash >> i) & this.fingerprintMask;
+            int v = (hash >> i) & fingerprintMask;
             if (v != 0) {
                 return v;
             }
@@ -420,8 +461,8 @@ private double getLoadFactor(int b) {
         }
         /*
           Empirical constants from the paper:
-            "With k = 2 hash functions, the load factor	α is 50% when bucket size b = 1 (i.e
-            the hash table is directly mapped), bu tincreases to 84%, 95%, 98% respectively
+            "With k = 2 hash functions, the load factor α is 50% when bucket size b = 1 (i.e
+            the hash table is directly mapped), but increases to 84%, 95%, 98% respectively
             using bucket size b = 2, 4, 8"
          */
         if (b == 2) {
@@ -477,4 +518,13 @@ public boolean equals(Object other) {
             && Objects.equals(this.count, that.count)
             && Objects.equals(this.evictedFingerprint, that.evictedFingerprint);
     }
+
+    static long murmur64(long h) {
+        h ^= h >>> 33;
+        h *= 0xff51afd7ed558ccdL;
+        h ^= h >>> 33;
+        h *= 0xc4ceb9fe1a85ec53L;
+        h ^= h >>> 33;
+        return h;
+    }
 }
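The index arithmetic introduced in this diff can be illustrated in isolation. The sketch below copies the bodies of `hashToIndex` and `alternateIndex` from the patch and adds a standalone version of the widened overflow check. It assumes `numBuckets` is a power of two, which the mask-based `hashToIndex` needs in order to behave like a modulo; the diff itself does not show where that is guaranteed. The class name, the `exceedsIntRange` helper and the simplified 8-bit fingerprint are hypothetical.

```java
public class CuckooIndexingSketch {

    // Same expression as the patched hashToIndex; only a valid "modulo"
    // when numBuckets is a power of two (assumption of this sketch)
    static int hashToIndex(int hash, int numBuckets) {
        return hash & (numBuckets - 1);
    }

    // Same expression as the patched alternateIndex
    static int alternateIndex(int bucket, int fingerprint, int numBuckets) {
        int index = bucket ^ (fingerprint * 0x5bd1e995);
        return hashToIndex(index, numBuckets);
    }

    // Standalone form of the widened check: multiply as long so the
    // product cannot wrap around before the comparison
    static boolean exceedsIntRange(int numBuckets, int entriesPerBucket) {
        return (long) numBuckets * (long) entriesPerBucket > Integer.MAX_VALUE;
    }

    public static void main(String[] args) {
        int numBuckets = 1 << 10;          // assumed power of two
        long hash = 0x9E3779B97F4A7C15L;   // arbitrary 64-bit hash value

        // Low 32 bits select the bucket, high 32 bits feed the fingerprint
        int bucket = hashToIndex((int) hash, numBuckets);
        int fingerprint = ((int) (hash >>> 32)) & 0xFF; // simplified 8-bit fingerprint
        if (fingerprint == 0) {
            fingerprint = 1; // 0 is reserved for "empty"
        }

        // Applying alternateIndex twice leads back to the original bucket
        int alternate = alternateIndex(bucket, fingerprint, numBuckets);
        int roundTrip = alternateIndex(alternate, fingerprint, numBuckets);
        System.out.println("round trip matches: " + (roundTrip == bucket));

        // In int arithmetic (1 << 30) * 4 wraps to 0, so the old check would pass
        int big = 1 << 30;
        int wrapped = big * 4;
        System.out.println("int product: " + wrapped);
        System.out.println("long check trips: " + exceedsIntRange(big, 4));
    }
}
```

The round trip works because the xor with the mixed fingerprint is applied twice and the mask keeps only the low bits in both steps, so a fingerprint can be relocated between its two buckets without the original hash.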