Clear candidates (suspects) in parallel: entanglement management perf improvement (and other fixes) #168
Conversation
…gement); significantly less overhead on entangled read barriers
Why are we using a list of chunks? Why not an array of chunks (possibly with doubling to grow)?
Originally we chose list-of-chunks for simplicity. O(1) for both insertion and concatenate is really nice. We use both operations extensively... I don't think it would be easy to rework things to avoid concatenate. Also, the amortization for doubling wouldn't play nice with span. The ideal data structure would be non-amortized with:
IIRC there is a "bag" data structure that has these guarantees; I can't remember it off the top of my head at the moment. If we're willing to pay log-cost for concatenate, then there are lots of good options.
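For reference, a minimal C sketch of a list-of-chunks structure (the names `cl_push`/`cl_concat` and the layout are illustrative, not the runtime's actual representation), showing why both insertion and concatenate are O(1): pushing touches only the tail chunk, and concatenating is a pointer splice.

```c
#include <stdlib.h>

#define CHUNK_CAP 4

struct chunk {
  int elems[CHUNK_CAP];
  int len;
  struct chunk *next;
};

struct chunklist {
  struct chunk *head, *tail; /* tail pointer is what makes concat O(1) */
  size_t size;
};

/* O(1) amortization-free push: allocate a new chunk only when the
   tail chunk is full. */
static void cl_push(struct chunklist *l, int x) {
  if (l->tail == NULL || l->tail->len == CHUNK_CAP) {
    struct chunk *c = calloc(1, sizeof *c);
    if (l->tail) l->tail->next = c; else l->head = c;
    l->tail = c;
  }
  l->tail->elems[l->tail->len++] = x;
  l->size++;
}

/* O(1) concatenate: splice src onto the end of dst; src becomes empty. */
static void cl_concat(struct chunklist *dst, struct chunklist *src) {
  if (src->head == NULL) return;
  if (dst->tail) dst->tail->next = src->head; else dst->head = src->head;
  dst->tail = src->tail;
  dst->size += src->size;
  src->head = src->tail = NULL;
  src->size = 0;
}
```

The tradeoff is that concatenation can leave underfull chunks in the middle of the list, which is fine for a filter-and-merge workload but rules out O(1) random access.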
Every concat corresponds to a join, right?
Essentially yes. Both logical joins and true joins. |
Cool. If so, would the following work: the length of such a thing should be bounded by span...
Hmm, I'll have to think about this one. Seems like it could increase the overall span to ... A bound like ...
ok let's discuss in person |
Be warned! `#ifdef ASSERT` is true in all builds. This was causing the debug version of `traverseAndCheck` to run in all builds, with significant performance degradation in entangled benchmarks. I cleaned up the header and definition a little here, too.
After discussion today: Altogether, this patch implements three fixes / performance improvements:
The results are Very Very Good™️ Merging now 🎉
Integrate with the scheduler to clear suspects in parallel. This patch subsumes #167 but I will keep the other open for now, for potential discussion.
Performance improvements. Big! On 72 procs, `ms-queue` is now at ~30x speedup over sequential (old: 19x), and `linden-pq` is now at ~36x speedup over sequential (old: ~30x).

Algorithm description. The algorithm in this patch is straightforward. Clearing suspects is a bucketed filter: some elements are eliminated, and some survive; each of the survivors is moved into one of $D$ buckets (where $D$ is the depth of the heap whose suspects are being cleared). To parallelize this, we (1) break up the input, (2) run multiple sequential filters in parallel, and then finally (3) merge the results.
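The three phases might be sketched as follows, with pthreads and toy `survives`/`depth_of` predicates standing in for the real suspect-clearing logic (a hypothetical illustration of the shape of the algorithm, not the runtime's actual code):

```c
#include <pthread.h>
#include <stdlib.h>

#define D 4       /* number of buckets (depth of the heap) */
#define NTASKS 4  /* degree of parallelism for phase 2 */

struct task {
  const int *in; int n;        /* this task's slice of the input */
  int *bucket[D]; int cnt[D];  /* private per-depth output buckets */
};

static int survives(int x) { return x % 2 == 0; }  /* toy filter predicate */
static int depth_of(int x) { return x % D; }       /* toy bucketing rule */

/* Phase 2 worker: sequentially filter one slice into private buckets. */
static void *filter_task(void *arg) {
  struct task *t = arg;
  for (int d = 0; d < D; d++) {
    t->bucket[d] = malloc((t->n ? t->n : 1) * sizeof(int));
    t->cnt[d] = 0;
  }
  for (int i = 0; i < t->n; i++)
    if (survives(t->in[i])) {
      int d = depth_of(t->in[i]);
      t->bucket[d][t->cnt[d]++] = t->in[i];
    }
  return NULL;
}

/* out[d] receives all survivors of depth d; counts go in outcnt[d]. */
void parallel_bucket_filter(const int *in, int n, int *out[D], int outcnt[D]) {
  struct task tasks[NTASKS];
  pthread_t tids[NTASKS];
  int per = (n + NTASKS - 1) / NTASKS;
  for (int k = 0; k < NTASKS; k++) {       /* phase 1: break up the input */
    int rem = n - k * per;
    tasks[k].in = in + k * per;
    tasks[k].n = rem < 0 ? 0 : (rem < per ? rem : per);
    pthread_create(&tids[k], NULL, filter_task, &tasks[k]); /* phase 2 */
  }
  for (int k = 0; k < NTASKS; k++)
    pthread_join(tids[k], NULL);
  for (int d = 0; d < D; d++) {            /* phase 3: sequential merge */
    outcnt[d] = 0;
    for (int k = 0; k < NTASKS; k++) outcnt[d] += tasks[k].cnt[d];
    out[d] = malloc((outcnt[d] ? outcnt[d] : 1) * sizeof(int));
    int p = 0;
    for (int k = 0; k < NTASKS; k++) {
      for (int i = 0; i < tasks[k].cnt[d]; i++)
        out[d][p++] = tasks[k].bucket[d][i];
      free(tasks[k].bucket[d]);
    }
  }
}
```

Phase 2 is embarrassingly parallel because each task writes only to its own private buckets; all cross-task work is confined to the sequential split and merge, which is the structure the TODO below proposes to parallelize further.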
TODO: algorithmic improvements? This algorithm has three phases. The middle phase is highly parallel, but both the first and last phases are sequential. In the first phase, the algorithm converts the input list (of chunks) into an array (of chunks). In the last phase, it sequentially merges bucketed outputs. Probably neither of these is a performance bottleneck at the moment. But in the future, there are opportunities for more parallelism. To parallelize the first phase, we could maintain all suspect sets as balanced trees, to enable fast balanced splitting. To parallelize the last phase, we could merge results with a D&C reduction.
TODO: implementation optimizations? This particular implementation could be further optimized in a few ways. For example, currently the input chunks are not freed until `ES_deleteClearSet`. But we could instead free them in `ES_processClearSetGrain`, by eagerly freeing each processed chunk we see along the way.