
Exponential Histogram implementation #568

Merged: 18 commits into develop on Nov 14, 2016

Conversation

@sritchie (Collaborator) commented Oct 29, 2016:

This PR implements the exponential histogram, or DGIM, algorithm from this paper: http://ilpubs.stanford.edu:8090/504/1/2001-34.pdf

The code for ExpHist and ExpHist.Canonical has a good description of what's going on and how the algorithm works. The supplied tut-based documentation describes the l-canonical representation procedure.
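To make the API concrete, here is a minimal usage sketch based on the signatures visible in the diff hunks below (Config(epsilon, windowSize), ExpHist.from, add, guess); the import path and the exact location of Config are assumptions and may differ from the merged code:

import com.twitter.algebird.ExpHist

// roughly 1% relative error over a sliding window of 100 time ticks
val conf = ExpHist.Config(epsilon = 0.01, windowSize = 100L)

val h = ExpHist.from(0L, 0L, conf) // empty histogram whose timestamp is 0
  .add(5L, 10L)                    // 5 items seen at timestamp 10
  .add(3L, 50L)                    // 3 more items at timestamp 50

h.guess // approximate count of the items inside the current window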

Next Steps

  • Add a Monoid implementation based on the algorithm presented here http://megaslides.com/doc/1974/ecm-sketches
  • Wrap this implementation with a queue that buffers N items before pushing them all into the ExpHist, and update the relevant error calculations.
  • Get some benchmarks going. I'm going to propose that we do this in the next PR.

cc @non @johnynek

@sritchie changed the title from "[WIP] Exponential Histogram implementation" to "Exponential Histogram implementation" on Nov 1, 2016
* of 2 matches `s`'s l-canonical representation (for the supplied
* l).
*/
def bucketsFromLong(s: Long, l: Int): Vector[Long] = {
@sritchie (Collaborator, Author):
@non @johnynek would love some advice on how to efficiently share the calculations going on here and in fromLong, without creating an intermediate return tuple for the values that both functions need.

That said... bucketsFromLong is the only thing we're actually using in the implementation. I could just rebuild fromLong from the results of bucketsFromLong.

@sritchie (Collaborator, Author):

this is exactly toBuckets(fromLong(s, k)) btw

Review comment:

can you add that comment: bucketsFromLong(s, l) == toBuckets(fromLong(s, k))

This is a different representation, right? Can we have a different AnyVal wrapper to distinguish?

@sritchie (Collaborator, Author):

For sure, I'll add some AnyVal wrappers now.

total = total + delta)
}

def oldestBucketSize: Long = if (total == 0) 0L else buckets.last.size
@sritchie (Collaborator, Author):

I wonder if it'd be a good idea to skip the zero checking by adding a trait ExpHist and a case object EmptyExpHist.

* window of size `windowSize`.
*/
case class Config(epsilon: Double, windowSize: Long) {
val k: Int = math.ceil(1 / epsilon).toInt
@sritchie (Collaborator, Author):

we never really need k, but it shows up in the paper.

val (b @ Bucket(count, _)) +: tail = input
(toDrop - count) match {
case 0 => tail
case x if x < 0 => b.copy(size = -x) +: tail
@sritchie (Collaborator, Author):

This pushing back into the vector is the only reason I'm not able to use an iterator. I wonder if we can implement this better with some carry Option. Thoughts?
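For reference, a rough sketch of the carry idea (not the merged code): walk the buckets with a plain loop, threading the count still to be dropped, and re-emit a partially drained bucket at most once. Bucket here stands in for the case class used in this PR.

case class Bucket(size: Long, timestamp: Long) // stand-in for the PR's Bucket

// Drain `toDrop` total count from the front of `input`, mirroring the
// pattern match above.
def dropCounts(toDrop: Long, input: Vector[Bucket]): Vector[Bucket] = {
  var remaining = toDrop
  val out = Vector.newBuilder[Bucket]
  input.foreach { b =>
    if (remaining <= 0L) out += b              // nothing left to drop
    else if (b.size > remaining) {             // partially drain this bucket
      out += b.copy(size = b.size - remaining)
      remaining = 0L
    } else remaining -= b.size                 // drop the whole bucket
  }
  out.result()
}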

if (delta == 0)
step(timestamp)
else {
addAllWithoutStep(sorted, delta).step(timestamp)
@sritchie (Collaborator, Author):

this was a little subtle. We have to step AFTER adding because the range of timestamps of the new items we're pushing in might be larger than windowSize.

def add(delta: Long, timestamp: Long): ExpHist = {
val self = step(timestamp)
if (delta == 0) self
else self.addAllWithoutStep(Vector(Bucket(delta, timestamp)), delta)
@sritchie (Collaborator, Author) commented Nov 1, 2016:

here we can step BEFORE, because we know that there's only a single timestamp coming in (vs addAll, see my comment below)

@oscar-stripe left a comment:

Can we add the benchmarks before committing? We don't need to optimize them, but at least add them.

Thanks for working so much on this. It could be really nice if we can figure out how to control the growth of error in the associative context (the free tricks, etc.).

def guess: Double =
if (total == 0) 0.0
else (total - (oldestBucketSize - 1) / 2.0)

Review comment:

can you add an Approximate[Long] return value? Those compose somewhat.

@sritchie (Collaborator, Author):

boom, will do

@sritchie (Collaborator, Author):

don't we want Approximate[Double], since we want the estimate to fall in the middle of the range?

@sritchie (Collaborator, Author):

let's talk about this one before I add it, so I can add appropriate tests as well.

@sritchie (Collaborator, Author):

okay, added!
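For the record, a hedged sketch of what such a method might look like (the merged name and signature may differ): in DGIM the true count lies between total - oldestBucketSize + 1 and total, so those bounds hold with probability 1, and guess above is exactly their midpoint.

def approximateSum: Approximate[Long] =
  if (total == 0) Approximate(0L, 0L, 0L, 1.0)
  else {
    val lower = total - oldestBucketSize + 1
    Approximate(lower, (lower + total) / 2, total, 1.0)
  }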

@inline private[this] def floorPowerOfTwo(x: Long): Int =
JLong.numberOfTrailingZeros(JLong.highestOneBit(x))

@inline private[this] def modPow2(i: Int, exp2: Int): Int = i & ((1 << exp2) - 1)

Review comment:

what about the inlining bugs you found? Are we sure none of these exhibits them? Downstream folks could still hit them if they compile with -optimize.
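Worked by hand (not part of the PR), to show what the two helpers compute:

floorPowerOfTwo(12L) // == 3: highestOneBit(12) == 8, whose trailing-zero count is 3, i.e. floor(log2(12))
modPow2(13, 3)       // == 5: 13 & ((1 << 3) - 1) == 13 & 7, i.e. 13 % 8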

* - ret(i) for all i < j == l or l + 1
* - ret(j) < l + 1
*/
def fromLong(s: Long, l: Int): Vector[Int] = {

Review comment:

no curlies...

@sritchie (Collaborator, Author):

removed

*
* the "l" means that
*
* - ret(i) for all i < j == l or l + 1

Review comment:

what is ret? The returned array? Can we add a comment saying so?

@sritchie (Collaborator, Author):

fixed up

* (i = vector index, j = index of last entry)
*
* returns a vector of the the coefficients of s^i in the
* l-canonical representation of s.

Review comment:

this confuses me. If you are returning the coefficients of s^i in the i'th position of the vector, then why isn't the vector infinite? We can take infinite powers?

@sritchie (Collaborator, Author):

huge typo! should have been 2^i. Fixed up the docs too.
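A worked example, derived from the constraints quoted in the doc above rather than taken from the PR: with l = 1, every entry but the last must be 1 or 2 and the last entry must be 1, so the only valid representation of s = 7 is Vector(1, 1, 1), since 1 * (1 << 0) + 1 * (1 << 1) + 1 * (1 << 2) == 7.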

* - ret(i) for all i < j == l or l + 1
* - ret(j) < l + 1
*/
def fromLong(s: Long, l: Int): Vector[Int] = {

Review comment:

what about final case class CanonicalRepresentation(toVector: Vector[Int]) extends AnyVal

Also, maybe add these specialized types under object ExpHist.

@sritchie (Collaborator, Author):

I like it. I moved the whole object inside ExpHist - will add these as well.

* @param rep l-canonical representation of some number s for some l
* @return The original s
*/
def toLong(rep: Vector[Int]): Long =

Review comment:

can this be a method on an AnyVal class?

@sritchie (Collaborator, Author):

boom, moved that and toBuckets
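A hedged sketch of that wrapper (names are illustrative and may differ from the merged code); per the earlier suggestion it would live under object ExpHist:

final case class Canonical(rep: Vector[Int]) extends AnyVal {
  // Reassemble the original s: the sum of rep(i) * 2^i.
  def toLong: Long =
    rep.iterator.zipWithIndex.map { case (i, exp) => i.toLong << exp }.sum

  // Expand each coefficient into that many buckets of size 2^i.
  def toBuckets: Vector[Long] =
    rep.iterator.zipWithIndex
      .flatMap { case (i, exp) => Iterator.fill(i)(1L << exp) }
      .toVector
}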


}
}

def isPowerOfTwo(i: Long): Boolean = (i & -i) == i

Review comment:

fancy! :) Hacker's Delight in the house.
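For readers who haven't seen the trick: in two's complement, i & -i isolates the lowest set bit of i, so it equals i exactly when i has a single set bit. Not from the PR, just a quick check:

isPowerOfTwo(8L) // (8 & -8) == 8 -> true
isPowerOfTwo(6L) // (6 & -6) == 2 -> false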

@codecov-io commented Nov 1, 2016:

Current coverage is 64.32% (diff: 96.62%)

Merging #568 into develop will increase coverage by 0.76%

@@            develop       #568   diff @@
==========================================
  Files           110        111     +1   
  Lines          4435       4524    +89   
  Methods        4041       4111    +70   
  Messages          0          0          
  Branches        355        374    +19   
==========================================
+ Hits           2819       2910    +91   
+ Misses         1616       1614     -2   
  Partials          0          0          

Last update 2c5aa7a...84b299b

@sritchie force-pushed the sritchie/exponential_histogram branch from 1679c94 to 04bcf47 on November 3, 2016 at 18:33
@sritchie (Collaborator, Author) commented Nov 4, 2016:

Btw, the test failure in the "update paper link" push is fixed by the sbt-mima-plugin bump to 0.1.11 in #556.

@sritchie force-pushed the sritchie/exponential_histogram branch from f50f1ae to 45f8c5a on November 8, 2016 at 18:57
@johnynek (Collaborator) left a comment:

A couple of comments; let me know what you think. I think it looks great.

Do you want to add the monoid or wait on that? It is a bummer that the monoid increases the error.

def addAll(unsorted: Vector[Bucket]): ExpHist =
if (unsorted.isEmpty) this
else {
val sorted = unsorted.sorted(Ordering[Bucket].reverse)
@johnynek (Collaborator):

can we move this inside the else branch below? We can compute delta before sorting and, if it is zero, just step to the max time.

@sritchie (Collaborator, Author):

done
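For reference, a hedged sketch of what the reworked addAll might look like (assuming the Bucket, step, and addAllWithoutStep members shown in the other hunks; the merged code may differ):

def addAll(unsorted: Vector[Bucket]): ExpHist =
  if (unsorted.isEmpty) this
  else {
    val delta = unsorted.iterator.map(_.size).sum
    val maxTimestamp = unsorted.iterator.map(_.timestamp).max
    if (delta == 0) step(maxTimestamp)
    else {
      val sorted = unsorted.sorted(Ordering[Bucket].reverse)
      addAllWithoutStep(sorted, delta).step(maxTimestamp)
    }
  }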

* the returned ExpHist will have the same timestamp, equal to
* `ts`.
*/
def from(i: Long, ts: Long, conf: Config): ExpHist = {
@johnynek (Collaborator):

I wonder if we should make a case class Timestamp(toLong: Long) extends AnyVal to prevent value/timestamp confusion. Seems really easy to get those wrong. What do you think?

@sritchie (Collaborator, Author):

yeah, let me thread it through and see what it looks like

else {
rep.iterator.zipWithIndex
.map { case (i, exp) => i.toLong << exp }
.reduce(_ + _)
@johnynek (Collaborator):

there is going to be a fair amount of boxing cost here. Actually calling Monoid.sum will avoid some of that, since Monoid is specialized on Long.

@sritchie (Collaborator, Author):

how about

Monoid.sum {
        rep.iterator.zipWithIndex
          .map { case (i, exp) => i.toLong << exp }
      }

@johnynek (Collaborator):

👍

* @return vector of powers of 2 (where ret.sum == the original s)
*/
def toBuckets: Vector[Long] =
rep.zipWithIndex.flatMap { case (i, exp) => List.fill(i)(1L << exp) }
@johnynek (Collaborator):

what about:

rep.iterator
  .zipWithIndex
  .flatMap { case (i, exp) => Iterator.fill(i)(1L << exp) }
  .toVector

@sritchie (Collaborator, Author):

done

* into this exponential histogram instance.
*/
def fold: Fold[Bucket, ExpHist] =
Fold.foldLeft(this) {
@johnynek (Collaborator):

I wonder about using a foldMutable here, accumulating a batch of Buckets and then doing an addAll, which might be a lot faster, but we can punt on this if you like.

@sritchie (Collaborator, Author):

added a test and this impl:

def fold: Fold[Bucket, ExpHist] =
    Fold.foldMutable[Builder[Bucket, Vector[Bucket]], Bucket, ExpHist](
      { case (b, bucket) => b += bucket },
      { _ => Vector.newBuilder[Bucket] },
      { x => addAll(x.result) })

way better imo.

@sritchie (Collaborator, Author):
Okay @johnynek, addressed all comments. I think we should add the Monoid in the next round, so we can experiment with controlling the relative error. We'll probably need to track the error separately from the calculation I have here...

We might need to create an ApproximateHistogram trait that we can use for the ExpHist, the Windows implementation, the ZeroExpHist and the ExpHist that keeps a queue of items around before adding.

@sritchie (Collaborator, Author):
Amazing, the docs build failed after my change!! Fixing now.

@oscar-stripe:
So good! Love tut docs! Thanks for an excellent example for us to follow, @sritchie!

👍

Any concerns, @isnotinvain?

@sritchie (Collaborator, Author):
Thanks for the review, @johnynek! @isnotinvain, would love a final sign off.

@johnynek (Collaborator):
I think the rules say it is fine to merge now.

@johnynek merged commit 9dd8f68 into develop on Nov 14, 2016
@johnynek deleted the sritchie/exponential_histogram branch on November 14, 2016 at 18:31