-
Notifications
You must be signed in to change notification settings - Fork 347
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Adding MutableSumAll and Semigroup#bufferedSumOption,statefulSummer #294
base: develop
Are you sure you want to change the base?
Conversation
…Refactoring sumOption optimizations
I like this. The only thing I wonder about is whether it's worth having both sumOption and bufferedSumOption, or whether it should just be assumed that sumOption might buffer, and it perhaps takes an extra parameter for the buffer size which you can tune if necessary (which it would pass through to statefulSummer in the default implementation). |
I like having BufferedSumAll#operate leverage Semigroup summing. But without a separate target for BufferedSumAll#operate, it becomes easy to shoot one's self in the foot via: override def statefulSummer() = new BufferedSumAll(100)(this) Having said that, this is half a real candidate and half an effort to articulate the approach as described in #291. |
Sorry, I'm not following that last. Are you just worried that people will set up an infinite recursion from sumOption -> statefulSummer -> BufferedSumAll -> sumOption? |
Yes, that's correct. |
def updateInto(mutable: T, item: V): Unit | ||
def fromMutable(mutable: T): V | ||
|
||
var mutable: Option[T] = None |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
protected?
@jnievelt I'm super excited that you are working on optimizations, but so is Ian (as a big part of his quarterly goals). Can you guys sit together and map out a plan so that we can minimize duplication of effort? |
if (flushed.isDefined) Some(plus(partial.get, flushed.get)) else partial | ||
} else { | ||
flushed | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
something like Monoid.plus[OptionMonoid[T]](partial, flushed)
?
So increasing the API surface area i'm a little concerned with, given we won't be removing sumOption, maybe we could deprecate it if we want to go with this approach? Do we have a solid notion when this will improve perf? I'm not sure it doesn't make our API more complex, with possibly limited gains from this in our most common use cases (scalding, summingbird) at least. |
I don't think there's a huge perf gain -- in summingbird online (and similar applications) it would enable arbitrarily long accumulations without buffering (and without GC?). The thing I like more about it is the cleaner separation of logic. Rather than writing all of SketchMap's current #sumOption, we can simply designate that it will do a BufferedSumAll and the logic for reducing down a buffer. For HLL/Map, we have clearly labeled methods. I believe this sort of simplification (in client code) was at the heart of #291, and for this much I like the idea of adding #statefulSummer. I certainly understand the concerns around #bufferedSumOption, and I'm fairly neutral on whether it's added or not. The only reason I used it in this PR instead of putting that logic into an overloaded BufferedSumAll#operate is that it prevents a very simple infinite recursion. |
I think this proposal has great merits. Here is some more thinking around this idea (thanks to @ianoc for helping me grokking this):
Based on that definition, they are exactly the same thing except one pushes and the other pulls. Here are some example on how to deal with the hypothetical new API: for simple cases where we define only plus:
For cases where we implement a summer:
For cases where sumOption is implemented:
Here is what the eventually monoid could look like in that case:
Now we can choose the push model vs the pull model depending on what stile is preferred depending on the monoid. As a side point I think we should decouple the new API discussion from the proposed changes to existing HLL monoid as I think they can progress separately. |
Wanted to comment here:
trait Summer[T] {
def increment(that: T): Summer[T]
def result: Option[T]
} Do we have reason to think this formulation would cost performance? If we push onto a list with increment, and if we keep an atomic ref to Either[T, List[T]], we could swap the list out for a computed sum when result is called, and the whole thing is still referentially transparent. I agree we should break up the pull requests here:
Lastly: I think it is great to have design discussions. This is a great PR for that reason: it sparks a lot of thought. |
A lot of this sounds really good. Adding a separate trait could be done relatively cleanly, I think. I would also encourage folks to comment on the original issue inasmuch as they are thinking generally about it (and not just this particular approach). The push/pull comparison is quite valid, but I don't think it tells the entire story. In particular, I think the #summer should be something that, e.g., map-side reduce could use (though I'm not entirely sure why it, of all things, wouldn't use an explicit sum-all), especially if we add a maxBufferedCount parameter. I'm also fine with a PR breakdown, and I'll proceed with this one as "Oberver/Summer pattern" unless there is an objection. If/when it gets merged we can file issues for the other two. Regarding perf, I'll work on writing up caliper tests for various uses of the new code in comparison to existing approaches. Is there a place in the tree for such tests (even if they aren't part of 'sbt test' etc.)? Also should we update CONTRIBUTING.md with guidelines on this? Finally, I would mention that this change isn't being driven by an acute need of mine (it just seemed like an interesting problem), and as a result I work on it only occasionally. If there is a need elsewhere, feel free to let me know and/or pick it up and run with it. |
@jnievelt Quickly just on the benchmark stuff, algebird-caliper has some of the first additions for the benchmarks. Caliper is proving to be a bit of a nightmare to work with however so I'll probably be looking to see if there's a good alternative. Once we have it set we should definitely update those guidelines. But continue/make entries in the caliper one, it has a runner there you can use sbt run against to run some tests with. |
@johnynek |
Sorry my bad, git foo on cmd line broke stuff and closed all of these |
Joe Nievelt seems not to be a GitHub user. You need a GitHub account to be able to sign the CLA. If you have already a GitHub account, please add the email address used for this commit to your account. You have signed the CLA already but the status is still pending? Let us recheck it. |
Refactoring some Semigroup#sumOption optimizations to use these.
Closes #291
These should be getting test coverage indirectly, but I'm open to suggestions for additional tests to add.