
Clear candidates (suspects) in parallel: entanglement management perf improvement (and other fixes) #168

Merged
shwestrick merged 5 commits into performance-eval from par-clear-candidates on Oct 19, 2022

Conversation

shwestrick
Collaborator

Integrate with the scheduler to clear suspects in parallel. This patch subsumes #167, but I'll keep that one open for now for potential discussion.

Performance improvements. Big! On 72 procs, ms-queue is now at ~30x speedup over sequential (old: 19x), and linden-pq is now at ~36x speedup over sequential (old: ~30x).

Algorithm description. The algorithm in this patch is straightforward. Clearing suspects is a bucketed filter: some elements are eliminated, and some survive; each of the survivors is moved into one of $D$ buckets (where $D$ is the depth of the heap whose suspects are being cleared). To parallelize this, we (1) break up the input, (2) run multiple sequential filters in parallel, and then finally (3) merge the results.
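To make the shape concrete, here is a minimal sketch of the three phases in plain C with pthreads. Everything here (`Elem`, `survives`, the fixed worker and depth counts) is an illustrative assumption, not the runtime's actual code, which integrates with the scheduler rather than spawning its own threads:

```c
#include <pthread.h>
#include <stdlib.h>
#include <string.h>

enum { P = 8, D = 32 };                 /* workers; heap depth (bucket count) */

typedef struct { int key; int depth; } Elem;   /* depth assumed < D */

typedef struct {
  const Elem *in; size_t n;             /* this worker's input slice */
  Elem *bucket[D]; size_t count[D];     /* its private output buckets */
} Grain;

static int survives(const Elem *e) { return e->key % 3 != 0; }  /* placeholder */

static void *run_grain(void *arg) {     /* phase 2: one sequential filter */
  Grain *g = arg;
  for (size_t i = 0; i < g->n; i++) {
    const Elem *e = &g->in[i];
    if (survives(e))
      g->bucket[e->depth][g->count[e->depth]++] = *e;
  }
  return NULL;
}

/* Caller passes the input already flattened into an array (phase 1) and
 * D output buffers, each with capacity n, to receive the merged buckets. */
void clear_suspects(const Elem *in, size_t n, Elem *out[D], size_t out_count[D]) {
  pthread_t tid[P];
  Grain g[P];
  size_t chunk = (n + P - 1) / P;
  for (int p = 0; p < P; p++) {         /* break up the input */
    size_t lo = p * chunk;
    size_t hi = lo + chunk < n ? lo + chunk : n;
    g[p].in = in + (lo < n ? lo : n);
    g[p].n = hi > lo ? hi - lo : 0;
    memset(g[p].count, 0, sizeof g[p].count);
    for (int d = 0; d < D; d++)
      g[p].bucket[d] = malloc((g[p].n ? g[p].n : 1) * sizeof(Elem));
    pthread_create(&tid[p], NULL, run_grain, &g[p]);
  }
  for (int p = 0; p < P; p++)           /* wait for all grains */
    pthread_join(tid[p], NULL);
  for (int d = 0; d < D; d++) {         /* phase 3: sequential merge */
    out_count[d] = 0;
    for (int p = 0; p < P; p++) {
      memcpy(out[d] + out_count[d], g[p].bucket[d], g[p].count[d] * sizeof(Elem));
      out_count[d] += g[p].count[d];
      free(g[p].bucket[d]);
    }
  }
}
```

Note the `malloc`/`free` of per-grain intermediate buckets here; that is exactly the traffic the first implementation optimization below would route through the block allocator instead.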

TODO: algorithmic improvements? This algorithm has three phases. The middle phase is highly parallel, but both the first and last phases are sequential. In the first phase, the algorithm converts the input list (of chunks) into an array (of chunks). In the last phase, it sequentially merges bucketed outputs. Probably neither of these is a performance bottleneck at the moment. But in the future, there are opportunities for more parallelism. To parallelize the first phase, we could maintain all suspect sets as balanced trees, to enable fast balanced splitting. To parallelize the last phase, we could merge results with a D&C reduction.
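For the last phase, the D&C reduction could look something like this sketch, where the made-up `Result` type stands in for a per-worker bucket table and `combine` for an associative merge; the two recursive halves are independent, so a parallel runtime could fork them for $O(\log P)$ span:

```c
/* Hypothetical sketch: merge P partial results by pairwise reduction
 * instead of a left-to-right fold.  Result and combine are stand-ins. */
typedef struct { long survivors; } Result;   /* stand-in for a bucket table */

static Result combine(Result a, Result b) {  /* assumed associative */
  return (Result){ a.survivors + b.survivors };
}

Result reduce(const Result *rs, int lo, int hi) {
  if (hi - lo == 1) return rs[lo];
  int mid = lo + (hi - lo) / 2;
  Result l = reduce(rs, lo, mid);   /* independent of the next call,     */
  Result r = reduce(rs, mid, hi);   /* so the two halves could be        */
  return combine(l, r);             /* forked in parallel                */
}
```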

TODO: implementation optimizations? This particular implementation could be further optimized in a few ways.

  1. Don't rely on malloc/free for intermediate data. Use the block allocator directly instead, to avoid the malloc/free bottleneck.
  2. Free old chunks in parallel. Currently we free these chunks sequentially in ES_deleteClearSet. But we could instead free them in ES_processClearSetGrain, eagerly freeing each processed chunk along the way (sketched below).
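A hypothetical illustration of item 2 (`Chunk` and `processChunk` are made-up stand-ins, not the runtime's API): rather than a second sequential pass that frees every chunk at deletion time, each grain frees its own chunks eagerly, right after processing them.

```c
#include <stdlib.h>

typedef struct Chunk { struct Chunk *next; /* ...suspect entries... */ } Chunk;

static void processChunk(Chunk *c) { (void)c; /* filter this chunk's suspects */ }

void processGrainEager(Chunk *c) {
  while (c != NULL) {
    Chunk *next = c->next;   /* save the link before freeing */
    processChunk(c);
    free(c);                 /* eager: freed here, in parallel, per grain */
    c = next;
  }
}
```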

@umutacar
Collaborator

why are we using list of chunks? why not array of chunks (possibly with doubling to grow)?

@shwestrick
Collaborator Author

Originally we chose list-of-chunks for simplicity. O(1) for both insertion and concatenate is really nice. We use both operations extensively... I don't think it would be easy to rework things to avoid concatenate.

Also, the amortization for doubling wouldn't play nice with span.

The ideal data structure would be non-amortized with:

  • O(1) insertion
  • O(1) concatenate
  • O(log n) split (or O(1), of course 😄)

IIRC there is a "bag" data structure with exactly these guarantees, but I can't remember the details off the top of my head. If we're willing to pay log-cost for concatenate, then there are lots of good options.
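For reference, the list-of-chunks shape we have now looks roughly like this (hypothetical names; the real suspect sets live in the runtime). Head and tail pointers make both insert and concatenate O(1) with no amortized doubling; what this representation lacks is the fast split.

```c
#include <stdlib.h>

#define CHUNK_CAP 64

typedef struct Chunk {
  struct Chunk *next;
  size_t len;
  void *elems[CHUNK_CAP];
} Chunk;

typedef struct { Chunk *head, *tail; } ChunkList;

void insert(ChunkList *l, void *x) {
  if (l->tail == NULL || l->tail->len == CHUNK_CAP) {
    Chunk *c = calloc(1, sizeof(Chunk)); /* fixed-size: O(1), no amortization */
    if (l->tail) l->tail->next = c; else l->head = c;
    l->tail = c;
  }
  l->tail->elems[l->tail->len++] = x;
}

void concat(ChunkList *a, ChunkList *b) { /* O(1): splice b's chain onto a */
  if (b->head == NULL) return;
  if (a->tail) a->tail->next = b->head; else a->head = b->head;
  a->tail = b->tail;
  b->head = b->tail = NULL;
}
```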

@umutacar
Collaborator

every concat corresponds to a join right?

@shwestrick
Collaborator Author

Essentially yes. Both logical joins and true joins.

@umutacar
Collaborator

cool, if so, would the following work:

  • base case: create array
  • at joins/concats: simply link up the arrays with a pointer, or possibly with a "node" that has them as its children

the length of such a thing should be bounded by span...
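Roughly, the proposed structure (with made-up names) would be arrays at the leaves, and each join just allocating a node pointing at its two children, so concatenation stays O(1):

```c
#include <stdlib.h>

typedef struct Node {
  int is_leaf;
  union {
    struct { void **elems; size_t len; } leaf;      /* base case: an array */
    struct { struct Node *left, *right; } children; /* a join node */
  } u;
} Node;

Node *concat_nodes(Node *l, Node *r) {  /* O(1) per join/concat */
  Node *n = malloc(sizeof *n);
  n->is_leaf = 0;
  n->u.children.left = l;
  n->u.children.right = r;
  return n;
}
```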

@shwestrick
Collaborator Author

Hmm, I'll have to think about this one. Seems like it could increase the overall span to $O(S^2)$ by paying $O(S)$ span per join.

A bound like $O(S \log C)$, where $C$ is the maximum number of candidates, might be better, and we can get this bound with a balanced tree.

@umutacar
Collaborator

ok let's discuss in person

Be warned! `#ifdef ASSERT` is true in all builds.

This was causing the debug version of `traverseAndCheck` to run
in all builds, with significant performance degradation in entangled
benchmarks.

I cleaned up the header and definition a little here, too.
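A minimal illustration of the pitfall (assuming the build always defines `ASSERT`, e.g. `-DASSERT=0` for release and `-DASSERT=1` for debug):

```c
/* #ifdef tests only that the macro is defined, so the guarded code runs
 * in every build; #if tests its value, which is what was intended. */
#include <stdio.h>

#define ASSERT 0   /* release configuration */

int main(void) {
#ifdef ASSERT
  printf("#ifdef ASSERT: runs even though ASSERT == 0\n");
#endif
#if ASSERT
  printf("#if ASSERT: correctly skipped when ASSERT == 0\n");
#endif
  return 0;
}
```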
@shwestrick shwestrick changed the title Clear candidates (suspects) in parallel: entanglement management performance improvement Clear candidates (suspects) in parallel: entanglement management perf improvement (and other fixes) Oct 19, 2022
@shwestrick
Collaborator Author

After discussion today:

Altogether, this patch implements three fixes / performance improvements:

  • Parallelize suspect clearing
  • Eliminate EBR overhead for read barriers
  • Fix an `#if` vs `#ifdef` typo

The results are Very Very Good™️

Merging now 🎉

@shwestrick shwestrick merged commit bb4eb70 into performance-eval Oct 19, 2022
@shwestrick shwestrick deleted the par-clear-candidates branch September 11, 2024 14:41