
Stack safety and constant factor improvements of Catenable #784

Merged

Conversation

mpilquist (Member):

This PR drastically improves the constant-factor performance of Catenable in certain usage patterns. The API has also changed a bit (a usage sketch follows the list):

  • push was renamed to cons
  • :: was renamed to +:
  • :+ / snoc was added
  • toStream was replaced with toList
  • foldLeft and foreach were added
  • fixed a stack safety bug in uncons and fromSeq -- reduceRight is not stack safe, so the input is now reversed
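To make the renames concrete, here is a hypothetical usage sketch (the import path is assumed, not taken from this PR):

    import fs2.util.Catenable                    // assumed package for Catenable

    val c: Catenable[Int] = Catenable(1, 2, 3)   // varargs construction (fromSeq)
    val c2 = 0 +: c                              // cons, formerly ::
    val c3 = c2 :+ 4                             // snoc, new in this PR
    c3.toList                                    // List(0, 1, 2, 3, 4); replaces toStream
    c3.foldLeft(0)(_ + _)                        // 10, via the new foldLeft
    c3.foreach(println)                          // the new foreach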

Of particular note: before these changes, singleton catenable construction was only about 1.5x as fast as creating a singleton vector, and small catenable construction was a bit slower than small vector creation.

Please excuse the while loops, but this is performance sensitive code so 🤷.

8e20c6f (before the changes in this PR)

[info] Benchmark                                 Mode  Cnt          Score         Error  Units
[info] CatenableBenchmark.consLargeCatenable    thrpt   10  123886314.876 ± 1230223.043  ops/s
[info] CatenableBenchmark.consLargeVector       thrpt   10   10566409.027 ±  710173.165  ops/s
[info] CatenableBenchmark.consSmallCatenable    thrpt   10  120474774.758 ± 3335136.723  ops/s
[info] CatenableBenchmark.consSmallVector       thrpt   10   16325204.308 ±  299289.520  ops/s
[info] CatenableBenchmark.createSmallCatenable  thrpt   10    8947210.997 ±  325165.771  ops/s
[info] CatenableBenchmark.createSmallVector     thrpt   10   11255936.788 ±  203349.167  ops/s
[info] CatenableBenchmark.createTinyCatenable   thrpt   10   22871648.117 ±  613434.137  ops/s
[info] CatenableBenchmark.createTinyVector      thrpt   10   15123925.504 ±  484690.580  ops/s
[info] CatenableBenchmark.mapLargeCatenable     thrpt   10          2.580 ±       0.740  ops/s
[info] CatenableBenchmark.mapLargeVector        thrpt   10         87.447 ±       6.032  ops/s
[info] CatenableBenchmark.mapSmallCatenable     thrpt   10    1500240.055 ±   32720.321  ops/s
[info] CatenableBenchmark.mapSmallVector        thrpt   10   13814716.675 ±  319184.790  ops/s

796700b (this PR)

[info] Benchmark                                   Mode  Cnt          Score         Error  Units
[info] CatenableBenchmark.consLargeCatenable      thrpt   10  150443328.810 ± 4262697.027  ops/s
[info] CatenableBenchmark.consLargeVector         thrpt   10   12215271.536 ±  780910.325  ops/s
[info] CatenableBenchmark.consSmallCatenable      thrpt   10  146698308.995 ± 8798103.672  ops/s
[info] CatenableBenchmark.consSmallVector         thrpt   10   19154715.531 ±  854095.070  ops/s
[info] CatenableBenchmark.createSmallCatenable    thrpt   10   10458084.536 ±  111327.117  ops/s
[info] CatenableBenchmark.createSmallVector       thrpt   10   12795767.588 ±   50284.190  ops/s
[info] CatenableBenchmark.createTinyCatenable     thrpt   10  128573263.762 ± 1670802.767  ops/s
[info] CatenableBenchmark.createTinyVector        thrpt   10   17131861.034 ±  315356.904  ops/s
[info] CatenableBenchmark.foldLeftLargeCatenable  thrpt   10         80.424 ±       0.877  ops/s
[info] CatenableBenchmark.foldLeftLargeVector     thrpt   10         77.336 ±       2.238  ops/s
[info] CatenableBenchmark.foldLeftSmallCatenable  thrpt   10   10220962.669 ±   96901.748  ops/s
[info] CatenableBenchmark.foldLeftSmallVector     thrpt   10   18855179.003 ±  175591.352  ops/s
[info] CatenableBenchmark.mapLargeCatenable       thrpt   10         16.990 ±       4.297  ops/s
[info] CatenableBenchmark.mapLargeVector          thrpt   10         67.306 ±       2.419  ops/s
[info] CatenableBenchmark.mapSmallCatenable       thrpt   10    8988928.329 ±  252565.606  ops/s
[info] CatenableBenchmark.mapSmallVector          thrpt   10   15259942.354 ± 1126953.820  ops/s

    case c :: rights => go(c, rights)

  final def uncons: Option[(A, Catenable[A])] = {
    var c: Catenable[A] = this
    var rights: List[Catenable[A]] = Nil
Contributor:

Did you look into using an ArrayList or something similar here? Advantage would be less allocation, and no need to reverse it to do the reduce right, since you have random access.

If you're going to go imperative, go big. :)

mpilquist (Member Author):

I didn't benchmark it but I figured it would be significantly slower since whatever we use here needs O(1) cons/uncons too.

Contributor:

Here's what I was thinking:

  • On each step through the loop, you snoc onto the end of the ArrayList. No consing.
  • Now you get to the end and do a foldLeft over your ArrayList. Notice that the first element of the ArrayList is the first element you snoc'd (whereas the first element of the List you are building here is the last element you pushed); that is, the ArrayList is already in exactly the right order to implement the reduceRight operation via a foldLeft or reduceLeft! (Assuming I haven't mixed this up...) A sketch of this follows below.
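A minimal sketch of that idea, using scala.collection.mutable.ArrayBuffer in place of a Java ArrayList (Append is Catenable's internal append node, as in the snippets in this review):

    import scala.collection.mutable.ArrayBuffer

    // `parts` is filled by appending (+=) while walking the left spine, so it
    // already sits in the order that `rights.reverse` produces with a List;
    // the same reduceLeft then rebuilds the tail with no reverse at all.
    def recombine[A](parts: ArrayBuffer[Catenable[A]]): Catenable[A] =
      parts.reduceLeft((x, y) => Append(y, x))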

        case h :: t => c = h; rights = t
      }
    case Single(a) =>
      val next = if (rights.isEmpty) empty else rights.reverse.reduceLeft((x, y) => Append(y, x))
Contributor:

Basically, you could avoid this reverse call.

mpilquist (Member Author):

BTW, @edmundnoble recommended taking a look at http://www.math.tau.ac.il/~haimk/adv-ds-2000/jacm-final.pdf for some other implementations


  @Benchmark def createTinyCatenable = Catenable(1)
  @Benchmark def createTinyVector = Vector(1)
  @Benchmark def createSmallCatenable = Catenable(1, 2, 3, 4, 5)
Contributor:

Not saying do this, but if you added a Many(seq: Seq[A]) constructor to Catenable, it would obviously speed up fromSeq since it would just be a no-op. However it would probably slow down other operations and would make the implementation more complicated.
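A rough sketch of that hypothetical constructor (not part of this PR; empty and single as in the fromSeq snippet later in this review):

    // Hypothetical: wrap the Seq directly instead of reassociating it into Appends.
    final case class Many[A](seq: Seq[A]) extends Catenable[A]

    // fromSeq then becomes (almost) a no-op:
    def fromSeq[A](s: Seq[A]): Catenable[A] =
      if (s.isEmpty) empty
      else if (s.lengthCompare(1) == 0) single(s.head)
      else Many(s)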

mpilquist (Member Author):

I thought the same thing actually. Before I went down that path, I wanted to look into the paper I linked above.

Contributor:

IMO that paper is overkill for this use case (amortized is totally fine) and I'm sure the constant factors will be worse. If you think about it, the implementation of append for Catenable is literally the fastest it could possibly be since it does absolutely nothing. And the reassociating is also going to be very hard to beat in terms of speed since it can be done with a mutable data structure as I suggested above, rather than some fancy deamortized functional version of a similar algorithm.

Paper might still be worth reading just for fun though, don't want to discourage that. :) Also maybe I am totally wrong in my intuition here. :)

Another commenter:

@mpilquist I want to look into adding this data structure to Scala in 2.13, if that's alright with you. @pchiusano I agree that append is as fast as it could be, but I wonder about adding Many, or looking at the balanced-tree creation approach in scalaz/scalaz#1022, which may make reassociating more performant.

I am not an expert on the subject, but Ed Kmett said that amortized constant time is not enough for reflection without remorse, though I am not sure why. Either way it's much more complicated, but I would be very interested in a Scala implementation of this paper regardless, for real-time applications.

Member:

@pchiusano @edmundnoble I've done a lot of work with amortized factors in functional data structures (a few years ago). The problem with amortization arguments in all contexts, but especially in functional data structures, is that they often require enormously high values of n to come into play. Additionally, amortization arguments often ignore constant factors (in fact, that's literally what the argument is doing, in an inductive fashion), which is not a fair argument to make when your constant factors are extremely high.

There are two excellent and classic examples of this problem: BankersQueue and FingerTree. BankersQueue is basically just a lazy version of the eager "front/back list queue" thing that everyone's tried at least once. And there is a proof that it is more efficient than the eager version… for some arbitrarily large value of n. It turns out that, if you benchmark it, the eager version is almost twice as fast (in Scala) for any queue size that fits into 32-bit memory, which is sort of insane. FingerTree is a similar, even more dramatic example. FingerTree is an amortized constant-time double-ended queue, which is something the eager banker's queue can't provide, and so in a very real sense it is offering (and proving) better asymptotics, not just better amortization costs. But on the JVM, and for queue sizes less than the billions, it is massively, hilariously slower than the alternatives.

So we have to be careful about this stuff. Amortization arguments discard, by definition, performance factors that are relevant and even dominant in practical workloads. And this is especially true on the JVM and in a functional style, where amortization often relies on thunking and other pipeline- and PIC-defeating constructs that deoptimize code (by a constant factor) in ways below the level of algorithmic analysis.

  while (result eq null) {
    c match {
      case Empty =>
        rights match {
Contributor:

With the mutable approach, this is going to become an isEmpty check, then a call to .last, and then a popping off of the last element of the ArrayList (just by mutably decrementing the internal max index).
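A sketch of that pop, with ArrayBuffer standing in for the suggested ArrayList:

    import scala.collection.mutable.ArrayBuffer

    // Removal at the end is O(1): the buffer just shrinks its internal size.
    def pop[A](stack: ArrayBuffer[A]): Option[A] =
      if (stack.isEmpty) None
      else Some(stack.remove(stack.size - 1))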

  as.size match {
    case 0 => empty
    case 1 => single(as.head)
    case n if n <= 1024 =>
Contributor:

Isn't an A* in practice passed as something with random access (like an ArraySeq)? Maybe exploit this...

The 1024 limit might blow some stacks. Makes me a little nervous.

mpilquist (Member Author):

Yeah, scala.collection.mutable.WrappedArray. Maybe 1024 is too high -- we could limit to 16 or 32 and cover 99.9% of cases.

pchiusano (Contributor), Dec 2, 2016:

I'd imagine that exploiting the WrappedArray is going to be faster even for small n, since it's a loop vs a bunch of function calls that build and then tear down the call stack...
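A sketch of what exploiting random access could look like (fromIndexedSeq is an assumed name, not the PR's code; empty, single, and Append as in the snippets above):

    // Build the right-nested Append tree with an index loop instead of
    // reduceRight, exploiting O(1) random access and avoiding deep call stacks.
    def fromIndexedSeq[A](as: IndexedSeq[A]): Catenable[A] =
      if (as.isEmpty) empty
      else {
        var result: Catenable[A] = single(as(as.length - 1))
        var i = as.length - 2
        while (i >= 0) { result = Append(single(as(i)), result); i -= 1 }
        result
      }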

mpilquist (Member Author) commented Dec 2, 2016:

@pchiusano Pushed an update that addresses your comments. Got another boost in performance. I may still tinker with adding a Many constructor but that can be a future PR if ever.

[info] Benchmark                                   Mode  Cnt          Score          Error  Units
[info] CatenableBenchmark.consLargeCatenable      thrpt   10  152883829.037 ± 12223849.424  ops/s
[info] CatenableBenchmark.consLargeVector         thrpt   10   13489577.965 ±   715723.593  ops/s
[info] CatenableBenchmark.consSmallCatenable      thrpt   10  156993433.223 ±  7799540.641  ops/s
[info] CatenableBenchmark.consSmallVector         thrpt   10   19930621.209 ±   518427.072  ops/s
[info] CatenableBenchmark.createSmallCatenable    thrpt   10   27335860.431 ±  1827976.461  ops/s
[info] CatenableBenchmark.createSmallVector       thrpt   10   13250602.786 ±   382153.385  ops/s
[info] CatenableBenchmark.createTinyCatenable     thrpt   10  163136376.378 ±  8516888.123  ops/s
[info] CatenableBenchmark.createTinyVector        thrpt   10   17768578.590 ±   858127.126  ops/s
[info] CatenableBenchmark.foldLeftLargeCatenable  thrpt   10         69.367 ±        1.523  ops/s
[info] CatenableBenchmark.foldLeftLargeVector     thrpt   10         84.076 ±        5.170  ops/s
[info] CatenableBenchmark.foldLeftSmallCatenable  thrpt   10   13244837.749 ±  1049926.997  ops/s
[info] CatenableBenchmark.foldLeftSmallVector     thrpt   10   19390404.670 ±   412811.551  ops/s
[info] CatenableBenchmark.mapLargeCatenable       thrpt   10         20.109 ±        6.632  ops/s
[info] CatenableBenchmark.mapLargeVector          thrpt   10         69.397 ±        3.935  ops/s
[info] CatenableBenchmark.mapSmallCatenable       thrpt   10    9430900.661 ±   915738.623  ops/s
[info] CatenableBenchmark.mapSmallVector          thrpt   10   16609637.224 ±   440437.876  ops/s

@mpilquist force-pushed the topic/catenable-improvements branch from 1b51d9e to 4d2cdf3 on December 2, 2016 at 21:53.
pchiusano (Contributor):

Nice... not sure where I should be looking to see the boost in perf, but looks good to me. Merge when ready.

@mpilquist merged commit 01ec76b into typelevel:series/1.0 on Dec 2, 2016.
djspiewak (Member):

Super-cool!

pchiusano (Contributor) commented Dec 4, 2016 via email.

edmundnoble:

Essentially, the relevance of the Scalaz thing is that it's built up from (1) a reassociating loop and (2) a tree-building loop, used to build balanced ap trees. That seems to imply it's possible to do the exact same thing here to build a balanced binary tree, giving us an efficient double-ended queue implementation. That would, however, be a new datatype separate from Catenable, and it might require using an Array instead of a mutable stack to preserve the asymptotics, because it uses random access; but it might be a good candidate to replace Vector in the stdlib.

Also, it appears that the use of Catenable in StreamCore is very similar to that in reflection without remorse (repeated appends), which might run into the same issues. The biggest performance cost should come from alternately appending single stream segments and stepping through the stream: switching between continuing (append) and observing (uncons). If that's a common use case, the amortization argument falls apart.

pchiusano (Contributor):

Switching between append and uncons does not seem to be a problem for Catenable, unless I am missing something. uncons does the minimum reassociating work necessary to produce the head and tail of the sequence:

For example, if I have:

  • Append(One(hd), tl) — uncons takes O(1).
  • Append(Append(One(hd), tl), z) — uncons pushes [tl, z] onto the stack and returns Some(hd, Append(tl, z)). Still fast.

In general your uncons takes as many steps as there are left-associated appends, but that work isn't repeated since the tail then has the correct associativity.

I think the only pathological case for the structure is alternating the direction of your traversal: if you uncons and then unsnoc repeatedly, that's going to be quadratic. Otherwise you are doing the reassociating in a consistent direction and not repeating any work.

I suspect there's a nice proof that any sequence of m appends and n uncons operations takes O(n + m).
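A tiny standalone model of that argument (One/Append as in the example above, not fs2's actual code), showing that uncons walks each left-associated Append once and hands back a right-associated tail:

    sealed trait Cat[+A]
    case class One[A](a: A) extends Cat[A]
    case class Append[A](l: Cat[A], r: Cat[A]) extends Cat[A]

    def uncons[A](c0: Cat[A]): (A, Option[Cat[A]]) = {
      var c = c0
      var rights: List[Cat[A]] = Nil   // right subtrees along the left spine
      while (true) {
        c match {
          case Append(l, r) => c = l; rights = r :: rights
          case One(a) =>
            val tail =
              if (rights.isEmpty) None
              else Some(rights.reverse.reduceLeft((x, y) => Append(y, x)))
            return (a, tail)
        }
      }
      sys.error("unreachable")
    }

    // uncons(Append(Append(One("a"), One("b")), One("c")))
    //   == ("a", Some(Append(One("b"), One("c"))))
    // The tail is now right-associated, so the next uncons is O(1).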

Here's what I had to say last time I discussed this with Ed:

An interesting observation (to me anyway) is that in reflection without remorse, they observe a performance problem with DLists / CPS when alternating between 'building' and 'observing' the structure, then reach immediately for various fancier functional data structures, but it's possible to get away with something simpler... if you just need to eliminate that one performance problem.

Another observation is that while you may be building up the data structure in various ways, for most applications (like streams), you only care about efficient traversal left to right.

@mpilquist deleted the topic/catenable-improvements branch on November 27, 2017 at 12:42.