Stack safety and constant factor improvements of Catenable #784
Conversation
case c :: rights => go(c, rights)
final def uncons: Option[(A, Catenable[A])] = {
  var c: Catenable[A] = this
  var rights: List[Catenable[A]] = Nil
Did you look into using an `ArrayList` or something similar here? Advantage would be less allocation, and no need to reverse it to do the reduce right, since you have random access.
If you're going to go imperative, go big. :)
I didn't benchmark it but I figured it would be significantly slower since whatever we use here needs O(1) cons/uncons too.
Here's what I was thinking:

- On each step through the loop, you snoc onto the end of the ArrayList. No consing.
- Now you get to the end, and do a foldLeft over your ArrayList. Notice that the first element of the ArrayList is the first element you snoc'd (whereas the first element of the `List` you are building here is the last element you pushed); that is, the ArrayList is already in exactly the right order to implement the `reduceRight` operation via a foldLeft or reduceLeft! (Assuming I haven't mixed this up...)
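For concreteness, here's a minimal sketch of that idea, using a simplified stand-in ADT (`Cat`/`Single`/`Append` are illustrative names, not the actual fs2 code) and `scala.collection.mutable.ArrayBuffer` in place of a Java `ArrayList`:

```scala
import scala.collection.mutable.ArrayBuffer

sealed trait Cat[+A]
case object Empty extends Cat[Nothing]
final case class Single[A](a: A) extends Cat[A]
final case class Append[A](left: Cat[A], right: Cat[A]) extends Cat[A]

def uncons[A](self: Cat[A]): Option[(A, Cat[A])] = {
  var c: Cat[A] = self
  val rights = new ArrayBuffer[Cat[A]] // snoc-only, so no reverse at the end
  var result: Option[(A, Cat[A])] = null
  while (result eq null) {
    c match {
      case Empty =>
        if (rights.isEmpty) result = None
        else c = rights.remove(rights.size - 1) // pop the last (innermost) right subtree
      case Single(a) =>
        // The buffer holds the right subtrees outermost-first, which is exactly
        // the order needed to build the right-nested tail with a left fold.
        val next: Cat[A] =
          if (rights.isEmpty) Empty
          else rights.reduceLeft((acc, r) => Append(r, acc))
        result = Some(a -> next)
      case Append(l, r) =>
        c = l
        rights += r // amortized O(1) snoc onto the end of the buffer
    }
  }
  result
}
```

Compared to the `List` version, the only change in the hot loop is an array write, and the final reassociation reads the buffer front-to-back instead of reversing it first.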
case h :: t => c = h; rights = t
}
case Single(a) =>
  val next = if (rights.isEmpty) empty else rights.reverse.reduceLeft((x, y) => Append(y, x))
Basically, you could avoid this `reverse` call.

BTW, @edmundnoble recommended taking a look at http://www.math.tau.ac.il/~haimk/adv-ds-2000/jacm-final.pdf for some other implementations.
@Benchmark def createTinyCatenable = Catenable(1)
@Benchmark def createTinyVector = Vector(1)
@Benchmark def createSmallCatenable = Catenable(1, 2, 3, 4, 5)
Not saying do this, but if you added a `Many(seq: Seq[A])` constructor to `Catenable`, it would obviously speed up `fromSeq` since it would just be a no-op. However, it would probably slow down other operations and would make the implementation more complicated.
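As a sketch, reusing the illustrative ADT from above (purely hypothetical -- this constructor is not part of the PR):

```scala
// Hypothetical: a case that wraps an arbitrary Seq directly.
final case class Many[A](seq: Seq[A]) extends Cat[A]

// fromSeq then becomes (almost) a no-op...
def fromSeq[A](as: Seq[A]): Cat[A] =
  if (as.isEmpty) Empty
  else if (as.lengthCompare(1) == 0) Single(as.head)
  else Many(as)

// ...but every traversal (uncons, foldLeft, map, etc.) now needs an extra
// case, which is the added complexity mentioned above.
```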
I thought the same thing actually. Before I went down that path, I wanted to look into the paper I linked above.
IMO that paper is overkill for this use case (amortized is totally fine) and I'm sure the constant factors will be worse. If you think about it, the implementation of `append` for `Catenable` is literally the fastest it could possibly be, since it does absolutely nothing. And the reassociating is also going to be very hard to beat in terms of speed, since it can be done with a mutable data structure as I suggested above, rather than some fancy deamortized functional version of a similar algorithm.

Paper might still be worth reading just for fun though, don't want to discourage that. :) Also maybe I am totally wrong in my intuition here. :)
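The point about `append` is easy to see in the illustrative ADT -- a sketch, not the literal fs2 source:

```scala
// Append allocates a single node and defers all work to later traversal;
// the empty checks keep Empty nodes from accumulating in the tree.
def append[A](l: Cat[A], r: Cat[A]): Cat[A] =
  (l, r) match {
    case (Empty, _) => r
    case (_, Empty) => l
    case _          => Append(l, r)
  }
```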
@mpilquist I want to look into adding this data structure to Scala in 2.13, if that's alright with you. @pchiusano I agree that append is as fast as it could be, but I wonder about adding `Many` or looking at the balanced-tree creation approach in scalaz/scalaz#1022, which may make reassociating more performant.

I am not an expert on the subject, but Ed Kmett said that amortized constant is not enough for reflection without remorse; I am not sure why. Either way it's much more complicated, but I would be very interested in an implementation of this paper in Scala regardless, for real-time applications.
@pchiusano @edmundnoble I've done a lot of work with amortized factors in functional data structures (a few years ago). The problem with amortization arguments in all contexts, but especially functional data structures, is that they often require enormously high values of n to come into play. Additionally, amortization arguments often ignore constant factors (in fact, that's literally what the argument is doing, in an inductive fashion), which is not a fair argument to make when your constant factors are extremely high.

There are two excellent and classic examples of this problem: BankersQueue and FingerTree. BankersQueue is basically just a lazy version of the eager "front/back list queue" thing that everyone's tried at least once. And there is a proof that it is more efficient than the eager version… for some arbitrarily large value of n. It turns out that, if you benchmark it, the eager version is almost twice as fast (in Scala) for any queue size that fits into 32-bit memory, which is sort of insane. FingerTree is a similar, even more dramatic example. FingerTree is an amortized constant double-ended queue, which is something that the eager banker's queue can't provide, so in a very real sense it is offering (and proving) better asymptotics, not just better amortization costs. But on the JVM, and for queue sizes less than the billions, it is massively, hilariously slower than the alternatives.
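For reference, the eager "front/back list queue" being compared against is just this (a sketch; BankersQueue is essentially a lazy variant with an invariant on the sizes of the two lists):

```scala
final case class EagerQueue[A](front: List[A], back: List[A]) {
  def enqueue(a: A): EagerQueue[A] = EagerQueue(front, a :: back) // O(1)
  def dequeue: Option[(A, EagerQueue[A])] = front match {
    case h :: t => Some((h, EagerQueue(t, back))) // O(1)
    case Nil =>
      // Reverse the back list at most once per element: amortized O(1).
      back.reverse match {
        case h :: t => Some((h, EagerQueue(t, Nil)))
        case Nil    => None
      }
  }
}
```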
So we have to be careful about this stuff. Amortization arguments discard, by definition, performance factors which are relevant and even dominant in practical workloads. And this is especially true on the JVM and with a functional style, where amortization often relies on thunking and other pipeline- and PIC-defeating constructs that deoptimize code (by a constant factor) in ways below the level of algorithmic analysis.
while (result eq null) {
  c match {
    case Empty =>
      rights match {
With the mutable approach, this is going to become an `isEmpty` check, then a call to `.last`, and then popping off the last element of the ArrayList (just by mutably decrementing the internal max index).
as.size match {
  case 0 => empty
  case 1 => single(as.head)
  case n if n <= 1024 =>
Isn't an `A*` in practice passed as something with random access (like an `ArraySeq`)? Maybe exploit this...

The 1024 limit might blow some stacks. Makes me a little nervous.
Yeah, `scala.collection.mutable.WrappedArray`. Maybe 1024 is too high -- we could limit to 16 or 32 and cover 99.9% of cases.
I'd imagine that exploiting the WrappedArray is going to be faster even for small n, since it's a loop vs a bunch of function calls that build and then tear down the call stack...
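A sketch of what exploiting random access could look like, again on the illustrative ADT from earlier (`fromIndexed` is a made-up name):

```scala
// Build right-to-left with an index loop: no recursion (so no stack concerns)
// and no intermediate reversed list.
def fromIndexed[A](as: scala.collection.IndexedSeq[A]): Cat[A] = {
  var result: Cat[A] = Empty
  var i = as.length - 1
  while (i >= 0) {
    val s = Single(as(i))
    result = if (result eq Empty) s else Append(s, result)
    i -= 1
  }
  result
}
```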
@pchiusano Pushed an update that addresses your comments. Got another boost in performance. I may still tinker with adding a `Many` constructor.
Nice... not sure where I should be looking to see the boost in perf, but looks good to me. Merge when ready.
Super-cool!
Sure, go for it (looking into adding to 2.13). Also, if you or anyone else would like to investigate adding `Many`, that would be welcome also. If it ends up paying off, you can open a PR.

However, the way this data structure is used in FS2, amortized is totally fine, so it would not make sense to use a fancier deamortized structure with worse constant factors. And for the reasons I gave earlier, I suspect this data structure will be very difficult to beat for amortized performance.

Didn't read the scalaz stuff on balanced traversals too closely, but don't see its relevance here. We are using a mutable stack to perform the reassociating; it will be very difficult to do better than that.
Essentially, the relevance the Scalaz thing has is that it's built up from 1. a reassociating loop and 2. a tree-building loop, to build balanced trees.
Switching between append and uncons does not seem to be a problem for Catenable, unless I am missing something. uncons does the minimum reassociating work necessary to produce the head and tail of the sequence. For example, if I have Append(Single(hd), tl), uncons takes O(1). In general, uncons takes as many steps as there are left-associated appends, but that work isn't repeated, since the tail then has the correct associativity.

I think the only pathological case for the structure is alternating the direction of your traversal: if you uncons then unsnoc repeatedly, that's going to be quadratic. Otherwise you are just doing the reassociating in a consistent direction and not repeating any work. I suspect there's a nice proof that any sequence of m appends and n uncons operations takes O(n + m).

Here's what I had to say last time I discussed this with Ed:
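To make the claim above concrete, here's a small trace using the illustrative ADT and the `uncons` sketch from earlier (an editorial illustration, not from the thread):

```scala
// Four left-nested appends: the worst case for a single uncons.
val c: Cat[Int] =
  Append(Append(Append(Single(1), Single(2)), Single(3)), Single(4))

val step1 = uncons(c)
// Some((1, Append(Single(2), Append(Single(3), Single(4)))))
// Three steps here, but the tail comes back right-nested...

val step2 = step1.flatMap { case (_, t) => uncons(t) }
// Some((2, Append(Single(3), Single(4)))) -- O(1), no reassociation repeated
```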
This PR drastically improves constant factor performance of `Catenable` in certain usage patterns. The API is changed a little bit:

- `push` was renamed `cons`
- `::` was renamed to `+:`
- `:+`/`snoc` was added
- `toStream` replaced with `toList`
- `foldLeft` and `foreach` added
- `uncons` and `fromSeq` added -- `reduceRight` is not stack safe, so the input is now reversed

Of particular note is the fact that singleton catenable construction was only about 1.5x as fast as creating a singleton vector, and small catenable construction was a bit slower than small vector creation.
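A quick tour of the revised API, assuming the operations behave as listed above (illustrative usage, not taken from the PR itself):

```scala
val c  = Catenable(1, 2, 3)  // goes through fromSeq
val c2 = 0 +: c              // cons (formerly :: / push)
val c3 = c2 :+ 4             // snoc (new)

c3.uncons                    // Some((0, <rest>)); tail stays cheap to uncons
c3.foldLeft(0)(_ + _)        // 10
c3.toList                    // List(0, 1, 2, 3, 4) -- toStream was replaced
```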
Please excuse the while loops, but this is performance sensitive code so 🤷.
8e20c6f (before the changes in this PR)
796700b (this PR)