
Stack safety and constant factor improvements of Catenable #784

Merged

Conversation

mpilquist (Member):

This PR drastically improves the constant-factor performance of Catenable in certain usage patterns. The API has also changed a bit (a usage sketch follows the list):

  • push was renamed to cons
  • :: was renamed to +:
  • :+ / snoc was added
  • toStream was replaced with toList
  • foldLeft and foreach were added
  • fixed a stack safety bug in uncons and fromSeq -- reduceRight is not stack safe, so the input is now reversed
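To make the renames concrete, here is a hypothetical usage sketch (the import path is assumed, not taken from this PR):

    import fs2.util.Catenable                    // assumed package for Catenable

    val c: Catenable[Int] = Catenable(1, 2, 3)   // varargs construction (fromSeq)
    val c2 = 0 +: c                              // cons, formerly ::
    val c3 = c2 :+ 4                             // snoc, new in this PR
    c3.toList                                    // List(0, 1, 2, 3, 4); replaces toStream
    c3.foldLeft(0)(_ + _)                        // 10, via the new foldLeft
    c3.foreach(println)                          // the new foreach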

Of particular note: before these changes, singleton catenable construction was only about 1.5x as fast as creating a singleton vector, and small catenable construction was a bit slower than small vector creation.

Please excuse the while loops, but this is performance sensitive code so 🤷.

8e20c6f (before the changes in this PR)

[info] Benchmark                                 Mode  Cnt          Score         Error  Units
[info] CatenableBenchmark.consLargeCatenable    thrpt   10  123886314.876 ± 1230223.043  ops/s
[info] CatenableBenchmark.consLargeVector       thrpt   10   10566409.027 ±  710173.165  ops/s
[info] CatenableBenchmark.consSmallCatenable    thrpt   10  120474774.758 ± 3335136.723  ops/s
[info] CatenableBenchmark.consSmallVector       thrpt   10   16325204.308 ±  299289.520  ops/s
[info] CatenableBenchmark.createSmallCatenable  thrpt   10    8947210.997 ±  325165.771  ops/s
[info] CatenableBenchmark.createSmallVector     thrpt   10   11255936.788 ±  203349.167  ops/s
[info] CatenableBenchmark.createTinyCatenable   thrpt   10   22871648.117 ±  613434.137  ops/s
[info] CatenableBenchmark.createTinyVector      thrpt   10   15123925.504 ±  484690.580  ops/s
[info] CatenableBenchmark.mapLargeCatenable     thrpt   10          2.580 ±       0.740  ops/s
[info] CatenableBenchmark.mapLargeVector        thrpt   10         87.447 ±       6.032  ops/s
[info] CatenableBenchmark.mapSmallCatenable     thrpt   10    1500240.055 ±   32720.321  ops/s
[info] CatenableBenchmark.mapSmallVector        thrpt   10   13814716.675 ±  319184.790  ops/s

796700b (this PR)

[info] Benchmark                                   Mode  Cnt          Score         Error  Units
[info] CatenableBenchmark.consLargeCatenable      thrpt   10  150443328.810 ± 4262697.027  ops/s
[info] CatenableBenchmark.consLargeVector         thrpt   10   12215271.536 ±  780910.325  ops/s
[info] CatenableBenchmark.consSmallCatenable      thrpt   10  146698308.995 ± 8798103.672  ops/s
[info] CatenableBenchmark.consSmallVector         thrpt   10   19154715.531 ±  854095.070  ops/s
[info] CatenableBenchmark.createSmallCatenable    thrpt   10   10458084.536 ±  111327.117  ops/s
[info] CatenableBenchmark.createSmallVector       thrpt   10   12795767.588 ±   50284.190  ops/s
[info] CatenableBenchmark.createTinyCatenable     thrpt   10  128573263.762 ± 1670802.767  ops/s
[info] CatenableBenchmark.createTinyVector        thrpt   10   17131861.034 ±  315356.904  ops/s
[info] CatenableBenchmark.foldLeftLargeCatenable  thrpt   10         80.424 ±       0.877  ops/s
[info] CatenableBenchmark.foldLeftLargeVector     thrpt   10         77.336 ±       2.238  ops/s
[info] CatenableBenchmark.foldLeftSmallCatenable  thrpt   10   10220962.669 ±   96901.748  ops/s
[info] CatenableBenchmark.foldLeftSmallVector     thrpt   10   18855179.003 ±  175591.352  ops/s
[info] CatenableBenchmark.mapLargeCatenable       thrpt   10         16.990 ±       4.297  ops/s
[info] CatenableBenchmark.mapLargeVector          thrpt   10         67.306 ±       2.419  ops/s
[info] CatenableBenchmark.mapSmallCatenable       thrpt   10    8988928.329 ±  252565.606  ops/s
[info] CatenableBenchmark.mapSmallVector          thrpt   10   15259942.354 ± 1126953.820  ops/s

    case c :: rights => go(c, rights)

  final def uncons: Option[(A, Catenable[A])] = {
    var c: Catenable[A] = this
    var rights: List[Catenable[A]] = Nil
Contributor:

Did you look into using an ArrayList or something similar here? Advantage would be less allocation, and no need to reverse it to do the reduce right, since you have random access.

If you're going to go imperative, go big. :)

mpilquist (Member Author):

I didn't benchmark it but I figured it would be significantly slower since whatever we use here needs O(1) cons/uncons too.

Contributor:

Here's what I was thinking:

  • On each step through the loop, you snoc onto the end of the ArrayList. No consing.
  • Now you get to the end and do a foldLeft over your ArrayList. Notice that the first element of the ArrayList is the first element you snoc'd (whereas the first element of the List you are building here is the last element you pushed); that is, the ArrayList is already in exactly the right order to implement the reduceRight operation via a foldLeft or reduceLeft! (Assuming I haven't mixed this up...) A sketch of this follows below.
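A minimal sketch of that idea, using scala.collection.mutable.ArrayBuffer in place of a Java ArrayList (Append is Catenable's internal append node, as in the snippets in this review):

    import scala.collection.mutable.ArrayBuffer

    // `parts` is filled by appending (+=) while walking the left spine, so it
    // already sits in the order that `rights.reverse` produces with a List;
    // the same reduceLeft then rebuilds the tail with no reverse at all.
    def recombine[A](parts: ArrayBuffer[Catenable[A]]): Catenable[A] =
      parts.reduceLeft((x, y) => Append(y, x))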

        case h :: t => c = h; rights = t
      }
    case Single(a) =>
      val next = if (rights.isEmpty) empty else rights.reverse.reduceLeft((x, y) => Append(y, x))
Contributor:

Basically, you could avoid this reverse call.

mpilquist (Member Author):

BTW, @edmundnoble recommended taking a look at http://www.math.tau.ac.il/~haimk/adv-ds-2000/jacm-final.pdf for some other implementations


  @Benchmark def createTinyCatenable = Catenable(1)
  @Benchmark def createTinyVector = Vector(1)
  @Benchmark def createSmallCatenable = Catenable(1, 2, 3, 4, 5)
Contributor:

Not saying do this, but if you added a Many(seq: Seq[A]) constructor to Catenable, it would obviously speed up fromSeq since it would just be a no-op. However it would probably slow down other operations and would make the implementation more complicated.
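A rough sketch of that hypothetical constructor (not part of this PR; empty and single as in the fromSeq snippet later in this review):

    // Hypothetical: wrap the Seq directly instead of reassociating it into Appends.
    final case class Many[A](seq: Seq[A]) extends Catenable[A]

    // fromSeq then becomes (almost) a no-op:
    def fromSeq[A](s: Seq[A]): Catenable[A] =
      if (s.isEmpty) empty
      else if (s.lengthCompare(1) == 0) single(s.head)
      else Many(s)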

mpilquist (Member Author):

I thought the same thing actually. Before I went down that path, I wanted to look into the paper I linked above.

Contributor:

IMO that paper is overkill for this use case (amortized is totally fine) and I'm sure the constant factors will be worse. If you think about it, the implementation of append for Catenable is literally the fastest it could possibly be since it does absolutely nothing. And the reassociating is also going to be very hard to beat in terms of speed since it can be done with a mutable data structure as I suggested above, rather than some fancy deamortized functional version of a similar algorithm.

Paper might still be worth reading just for fun though, don't want to discourage that. :) Also maybe I am totally wrong in my intuition here. :)

Another commenter:

@mpilquist I want to look into adding this data structure to Scala in 2.13, if that's alright with you. @pchiusano I agree that append is as fast as it could be, but I wonder about adding Many, or looking at the balanced-tree creation approach in scalaz/scalaz#1022, which may make reassociating more performant.

I am not an expert on the subject, but Ed Kmett said that amortized constant time is not enough for reflection without remorse, though I am not sure why. Either way it's much more complicated, but I would be very interested in a Scala implementation of this paper regardless, for real-time applications.

Member:

@pchiusano @edmundnoble I've done a lot of work with amortized factors in functional data structures (a few years ago). The problem with amortization arguments in all contexts, but especially in functional data structures, is that they often require enormously high values of n to come into play. Additionally, amortization arguments often ignore constant factors (in fact, that's literally what the argument is doing, in an inductive fashion), which is not a fair argument to make when your constant factors are extremely high.

There are two excellent and classic examples of this problem: BankersQueue and FingerTree. BankersQueue is basically just a lazy version of the eager "front/back list queue" thing that everyone's tried at least once. And there is a proof that it is more efficient than the eager version… for some arbitrarily large value of n. It turns out that, if you benchmark it, the eager version is almost twice as fast (in Scala) for any queue size that fits into 32-bit memory, which is sort of insane. FingerTree is a similar, even more dramatic example. FingerTree is an amortized constant-time double-ended queue, which is something the eager banker's queue can't provide, and so in a very real sense it is offering (and proving) better asymptotics, not just better amortization costs. But on the JVM, and for queue sizes less than the billions, it is massively, hilariously slower than the alternatives.

So we have to be careful about this stuff. Amortization arguments discard, by definition, performance factors that are relevant and even dominant in practical workloads. And this is especially true on the JVM and in a functional style, where amortization often relies on thunking and other pipeline- and PIC-defeating constructs that deoptimize code (by a constant factor) in ways below the level of algorithmic analysis.

  while (result eq null) {
    c match {
      case Empty =>
        rights match {
Contributor:

With the mutable approach, this is going to become an isEmpty check, then a call to .last, and then a popping off of the last element of the ArrayList (just by mutably decrementing the internal max index).
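A sketch of that pop, with ArrayBuffer standing in for the suggested ArrayList:

    import scala.collection.mutable.ArrayBuffer

    // Removal at the end is O(1): the buffer just shrinks its internal size.
    def pop[A](stack: ArrayBuffer[A]): Option[A] =
      if (stack.isEmpty) None
      else Some(stack.remove(stack.size - 1))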

  as.size match {
    case 0 => empty
    case 1 => single(as.head)
    case n if n <= 1024 =>
Contributor:

Isn't an A* in practice passed as something with random access (like an ArraySeq)? Maybe exploit this...

The 1024 limit might blow some stacks. Makes me a little nervous.

mpilquist (Member Author):

Yeah, scala.collection.mutable.WrappedArray. Maybe 1024 is too high -- we could limit to 16 or 32 and cover 99.9% of cases.

pchiusano (Contributor), Dec 2, 2016:

I'd imagine that exploiting the WrappedArray is going to be faster even for small n, since it's a loop vs a bunch of function calls that build and then tear down the call stack...
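A sketch of what exploiting random access could look like (fromIndexedSeq is an assumed name, not the PR's code; empty, single, and Append as in the snippets above):

    // Build the right-nested Append tree with an index loop instead of
    // reduceRight, exploiting O(1) random access and avoiding deep call stacks.
    def fromIndexedSeq[A](as: IndexedSeq[A]): Catenable[A] =
      if (as.isEmpty) empty
      else {
        var result: Catenable[A] = single(as(as.length - 1))
        var i = as.length - 2
        while (i >= 0) { result = Append(single(as(i)), result); i -= 1 }
        result
      }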

mpilquist (Member Author) commented Dec 2, 2016:

@pchiusano Pushed an update that addresses your comments. Got another boost in performance. I may still tinker with adding a Many constructor but that can be a future PR if ever.

[info] Benchmark                                   Mode  Cnt          Score          Error  Units
[info] CatenableBenchmark.consLargeCatenable      thrpt   10  152883829.037 ± 12223849.424  ops/s
[info] CatenableBenchmark.consLargeVector         thrpt   10   13489577.965 ±   715723.593  ops/s
[info] CatenableBenchmark.consSmallCatenable      thrpt   10  156993433.223 ±  7799540.641  ops/s
[info] CatenableBenchmark.consSmallVector         thrpt   10   19930621.209 ±   518427.072  ops/s
[info] CatenableBenchmark.createSmallCatenable    thrpt   10   27335860.431 ±  1827976.461  ops/s
[info] CatenableBenchmark.createSmallVector       thrpt   10   13250602.786 ±   382153.385  ops/s
[info] CatenableBenchmark.createTinyCatenable     thrpt   10  163136376.378 ±  8516888.123  ops/s
[info] CatenableBenchmark.createTinyVector        thrpt   10   17768578.590 ±   858127.126  ops/s
[info] CatenableBenchmark.foldLeftLargeCatenable  thrpt   10         69.367 ±        1.523  ops/s
[info] CatenableBenchmark.foldLeftLargeVector     thrpt   10         84.076 ±        5.170  ops/s
[info] CatenableBenchmark.foldLeftSmallCatenable  thrpt   10   13244837.749 ±  1049926.997  ops/s
[info] CatenableBenchmark.foldLeftSmallVector     thrpt   10   19390404.670 ±   412811.551  ops/s
[info] CatenableBenchmark.mapLargeCatenable       thrpt   10         20.109 ±        6.632  ops/s
[info] CatenableBenchmark.mapLargeVector          thrpt   10         69.397 ±        3.935  ops/s
[info] CatenableBenchmark.mapSmallCatenable       thrpt   10    9430900.661 ±   915738.623  ops/s
[info] CatenableBenchmark.mapSmallVector          thrpt   10   16609637.224 ±   440437.876  ops/s

@mpilquist force-pushed the topic/catenable-improvements branch from 1b51d9e to 4d2cdf3 on December 2, 2016 at 21:53.
pchiusano (Contributor):

Nice... not sure where I should be looking to see the boost in perf, but looks good to me. Merge when ready.

@mpilquist merged commit 01ec76b into typelevel:series/1.0 on Dec 2, 2016.
djspiewak (Member):

Super-cool!

pchiusano (Contributor) commented Dec 4, 2016 via email.

edmundnoble:

Essentially, the relevance of the Scalaz thing is that it's built up from (1) a reassociating loop and (2) a tree-building loop, used to build balanced ap trees. That seems to imply it's possible to do the exact same thing here to build a balanced binary tree, giving us an efficient double-ended queue implementation. That would, however, be a new datatype separate from Catenable, and it might require using an Array instead of a mutable stack to preserve the asymptotics, because it uses random access; but it might be a good candidate to replace Vector in the stdlib.

Also, it appears that the use of Catenable in StreamCore is very similar to that in reflection without remorse (repeated appends), which might run into the same issues. The biggest performance cost should come from alternately appending single stream segments and stepping through the stream: switching between continuing (append) and observing (uncons). If that's a common use case, the amortization argument falls apart.

pchiusano (Contributor):

Switching between append and uncons does not seem to be a problem for Catenable, unless I am missing something. uncons does the minimum reassociating work necessary to produce the head and tail of the sequence:

For example, if I have:

  • Append(One(hd), tl) — uncons takes O(1).
  • Append(Append(One(hd), tl), z) — uncons pushes [tl, z] onto the stack and returns Some(hd, Append(tl, z)). Still fast.

In general your uncons takes as many steps as there are left-associated appends, but that work isn't repeated since the tail then has the correct associativity.

I think the only pathological case for the structure is alternating the direction of your traversal: if you uncons and then unsnoc repeatedly, that's going to be quadratic. Otherwise you are doing the reassociating in a consistent direction and not repeating any work.

I suspect there's a nice proof that any sequence of m appends and n uncons operations takes O(n + m).
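A tiny standalone model of that argument (One/Append as in the example above, not fs2's actual code), showing that uncons walks each left-associated Append once and hands back a right-associated tail:

    sealed trait Cat[+A]
    case class One[A](a: A) extends Cat[A]
    case class Append[A](l: Cat[A], r: Cat[A]) extends Cat[A]

    def uncons[A](c0: Cat[A]): (A, Option[Cat[A]]) = {
      var c = c0
      var rights: List[Cat[A]] = Nil   // right subtrees along the left spine
      while (true) {
        c match {
          case Append(l, r) => c = l; rights = r :: rights
          case One(a) =>
            val tail =
              if (rights.isEmpty) None
              else Some(rights.reverse.reduceLeft((x, y) => Append(y, x)))
            return (a, tail)
        }
      }
      sys.error("unreachable")
    }

    // uncons(Append(Append(One("a"), One("b")), One("c")))
    //   == ("a", Some(Append(One("b"), One("c"))))
    // The tail is now right-associated, so the next uncons is O(1).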

Here's what I had to say last time I discussed this with Ed:

An interesting observation (to me anyway) is that in reflection without remorse, they observe a performance problem with DLists / CPS when alternating between 'building' and 'observing' the structure, then reach immediately for various fancier functional data structures, but it's possible to get away with something simpler... if you just need to eliminate that one performance problem.

Another observation is that while you may be building up the data structure in various ways, for most applications (like streams), you only care about efficient traversal left to right.

@mpilquist deleted the topic/catenable-improvements branch on November 27, 2017 at 12:42.