Skip to content
Merged
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
316 changes: 316 additions & 0 deletions text/2018-01-14-skiplist.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,316 @@
# Summary

Introduce crate `crossbeam-skiplist` containing a lock-free skip list.

The crate provides an ordered map and an ordered set akin to `BTreeMap` and `BTreeSet`.
In the future, we might want to build other data structures on top of the
skip list as well, for example a priority queue.

# Motivation

These are the first concurrent map and set data structures to be added to Crossbeam.

Skip list is often touted as a relatively easy data structure to make concurrent, or at least
easy compared to other maps/sets. However, supporting the remove operation in a language without
GC and coming up with an API that isn't overly restrictive is very difficult.

This crate aims to provide a concurrent map/set that aims to be as powerful and
ergonomic as `BTreeMap`/`BTreeSet`. The API must be reasonably easy to use and provide all
operations one would expect from any other map/set without jumping through hoops.

A good example of that is Java - one can simply replace any use of [`TreeMap`] with
[`ConcurrentSkipListMap`] and expect the code to *just work*. The concurrent map in Java is an
almost perfect drop-in replacement for the non-concurrent one. Unfortunately, I don't think
we can achieve quite the same thing in Rust - and the reason is that Java and Rust have much
different memory models (e.g. Java doesn't have move semantics and pervasively allocates objects
on the heap). With that said, I believe we can still model a lock-free map that mimics
`BTreeMap` fairly closely.

Regarding performance, a skip list is fundamentally disadvantaged compared to a B-tree.
Every node in a skip list is separately allocated on the heap, while a B-tree
allocates nodes in large blocks, thus greatly improving cache utilization. The problem
of scattered skip list nodes in memory can be somewhat mitigated using custom allocators
(by trying to allocate adjacent nodes in a skip list as close as possible in memory), but
typically with great difficulty and underwhelming results.

One can think of a B-tree as a kind of a compacting garbage collector. Consider what
happens when a B-tree block becomes full: a new block may be allocated
and elements are redistributed among blocks as needed. This is reminiscent of compacting
garbage collectors. Note that moving elements in memory makes concurrency more difficult in Rust:
a thread cannot hold a reference to an element if another thread may potentially move it
to a different location in memory at the same time.

Skip lists, however, allocate each node separately on the heap. A node contains a key, a value,
and a tower of next-pointers. The node is never moved to a different location in memory - once
allocated, it stays there until it is is destroyed. This makes cache utilization worse,
but also makes borrowing elements in presence of parallel modify operations easier.

Long story short, a lock-free skip list will scale much better than a mutex-protected `BTreeMap`,
but in single-threaded scenarios it will have no chance competing with `BTreeMap` due to poorer
cache utilization.

[`ConcurrentSkipListMap`]: https://docs.oracle.com/javase/8/docs/api/java/util/concurrent/ConcurrentSkipListMap.html
[`TreeMap`]: https://docs.oracle.com/javase/8/docs/api/java/util/TreeMap.html

### Previous work

Notable implementations of concurrent skip lists in other languages:

1. [java.util.concurrent](http://grepcode.com/file/repository.grepcode.com/java/root/jdk/openjdk/8u40-b25/java/util/concurrent/ConcurrentSkipListMap.java#ConcurrentSkipListMap) (Java):
Has the most extensive API - and really feels like a drop-in replacement for any other map.
However, the implementation is not particularly efficient and has a few interesting quirks.
For example, each pointer in a tower is separately allocated. Another one is - instead of
tagging pointers to mark a node as deleted, a dummy successor node is allocated.

2. [libcds](https://github.com/khizmax/libcds/blob/19af81b7c61480ed705b91b4d01ee5d717a97cd2/cds/intrusive/skip_list_rcu.h) (C++):
Fairly complete and general API (one can even choose between no GC, EBR, and HP).

3. [RocksDB](https://github.com/facebook/rocksdb/blob/68829ed89cec64186557dc0860fc693c118ff1c6/memtable/skiplist.h) (C++):
The skip list does not support removal nor multiple concurrent inserts. However, an ongoing
insert operation does not block other threads from reading the skip list. Once the skip list
becomes full, it is flushed to disk-based storage and a new skip list is constructed to replace
the old one.

4. [Folly](https://github.com/facebook/folly/blob/98d1077ce0603b0713353d638cb1436a28827af6/folly/ConcurrentSkipList.h) (C++):
A concurrent skip list, but not lock-free: it uses per-node locking.
Also, removed nodes are not freed until the skip list is destroyed.

5. [libgee](https://github.com/GNOME/libgee/blob/da95e830524ffa309eb57925320666e5085b9d66/gee/concurrentset.vala) (Vala):
A hazard pointer-based skip list. Looks very interesting.

There are also several concurrent skip lists in Rust, but none of them have been published to crates.io so far
and look like works in progress:

1. [danburkert/pawn](https://github.com/danburkert/pawn/blob/8b6806d944d830f552d496cd3ee605d1707fdc51/src/util/skip_list.rs) (Rust):
A rather old insert-only lock-free skip list. Looks like an abandoned project.

2. [Vtec234/lists-rs](https://github.com/Vtec234/lists-rs/blob/f83e516039dc4a421172af1cdbdcec85b0e73d74/src/epoch_skiplist.rs) (Rust):
A lock-free skip list that supports remove and uses Crossbeam for memory reclamation.
Interestingly, keys are always hashed so it's technically a hash map.

3. [boats/skiplist](https://gitlab.com/boats/skiplist/tree/master/src/skiplist) (Rust):
Insert-only lock-free skip list by withoutboats. Published very recently.

# Detailed design

The proposed implementation is currently residing in [stjepang/skiplist](https://github.com/stjepang/skiplist),
but will be moved into a new repository `crossbeam-rs/crossbeam-skiplist`. It is a
lock-free skip list using epoch-based memory reclamation from `crossbeam-epoch`.

The implementation is based on the following work:

1. [Practical lock-freedom](https://www.cl.cam.ac.uk/techreports/UCAM-CL-TR-579.pdf)
(see *4.3.3 CAS-based design*)

2. [Linked Lists: Locking, Lock-Free and Beyond...](http://janvitek.org/events/TiC06/B-SLIDES/mh2.pdf)

The codebase consists of three main source files:

* [`base.rs`](https://github.com/stjepang/skiplist/blob/master/src/base.rs) -
Contains the base skip list implementation details. This file does not attempt to
expose something ergonomic, but instead aims to provide a skip list 'engine' that
is intended to be wrapped into a nicer interface.

* [`map.rs`](https://github.com/stjepang/skiplist/blob/master/src/map.rs) -
Wraps the base implementation into a map interface similar to `BTreeMap`.

* [`set.rs`](https://github.com/stjepang/skiplist/blob/master/src/set.rs) -
Wraps the base implementation into a set interface similar to `BTreeSet`.

**Note:** These map and set wrappers are just tentative interfaces - they're
finished, and there's a possibility we'll want to completely change them.
For now, consider them just a proof of concept.

## Tentative map API

Here's a quick demo. The following code is the [first example](https://doc.rust-lang.org/std/collections/struct.BTreeMap.html#examples)
taken from `BTreeMap`'s documentation, except it uses `SkipMap` instead of `BTreeMap`.
A few other minor changes were required to make it compile, but other than that it doesn't depart
too far from the original:

```rust
// type inference lets us omit an explicit type signature (which
// would be `SkipMap<&str, &str>` in this example).
let movie_reviews = SkipMap::new();

// review some movies.
movie_reviews.insert("Office Space", "Deals with real issues in the workplace.");
movie_reviews.insert("Pulp Fiction", "Masterpiece.");
movie_reviews.insert("The Godfather", "Very enjoyable.");
movie_reviews.insert("The Blues Brothers", "Eye lyked it alot.");

// check for a specific one.
if !movie_reviews.contains_key("Les Misérables") {
println!("We've got {} reviews, but Les Misérables ain't one.",
movie_reviews.len());
}

// oops, this review has a lot of spelling mistakes, let's delete it.
movie_reviews.remove("The Blues Brothers");

// look up the values associated with some keys.
let to_find = ["Up!", "Office Space"];
for book in &to_find {
match movie_reviews.get(book) {
Some(entry) => println!("{}: {}", book, entry.value()),
None => println!("{} is unreviewed.", book)
}
}

// iterate over everything.
for entry in &movie_reviews {
let movie = entry.key();
let review = entry.value();
println!("{}: \"{}\"", movie, review);
}
```

Take a look at [map.rs](https://github.com/stjepang/skiplist/blob/master/src/map.rs) to
see the full interface of `SkipMap`.

An interesting difference from `BTreeMap` is that methods like `insert` and `get` return
an `Entry<'a, K, V>`, which is essentially just a reference-counted pointer to an entry in
the skip list. Note that it is possible to hold an entry and remove it at the same time (you
can even call `entry.remove()`), but the actual contents of the entry will not be destroyed
before the last reference is dropped.

### Performance

It has already been mentioned that `SkipMap` will have a hard time competing with `BTreeMap` in
single-threaded scenarios.
Let's see that through a simple benchmark that just inserts a million pseudorandom
numbers into a map. This is a very unscientific benchmark, but it should at least give
us a feeling for how different map implementations fare against each other.

Machine: Intel Core i7-5600U (2 physical cores, 4 logical cores)

First, here's `BTreeMap` in three different scenarios:

* [`BTreeMap<u64, u64>` (1 thread)](https://gist.github.com/stjepang/9b1bf73c2fdb0309cefda66b91f633dd): 315 ms
* [`Mutex<BTreeMap<u64, u64>>` (1 thread)](https://gist.github.com/stjepang/437b82134b401d3fa2c9c439a003c1ea): 321 ms
* [`Mutex<BTreeMap<u64, u64>>` (2 threads)](https://gist.github.com/stjepang/66000dfae15c8046b91ff3612c7d881f): 752 ms

Notice how there is very little overhead of locking if only one thread is used. However, as soon as
we add more threads, contended locking brings a huge penalty on performance.

But `SkipMap` doesn't suffer from the same problem. In fact, adding more threads improves performance:

* [`SkipMap<u64, u64>` (1 thread)](https://gist.github.com/stjepang/1980ab811009e94f2adfe8b230c20047): 1028 ms
* [`SkipMap<u64, u64>` (2 threads)](https://gist.github.com/stjepang/a3f8f6dddac56d43e7dbfb2928cd3bfe): 561 ms

Let's also see some numbers for a mutex-protected `std::map` in C++:

* [`std::map<uint64_t, uint64_t>` (1 thread)](https://gist.github.com/stjepang/6aa80020b6edac1f6ea9af518e4ad989): 881 ms
* [`std::map<uint64_t, uint64_t>` (2 threads)](https://gist.github.com/stjepang/b172a4259c0439d2855bc68fd47b3ab7): 1127 ms

And here are mutex-protected `TreeMap` and `ConcurrentSkipListMap` in Java:

* [`TreeMap<long, long> (1 thread)`](https://gist.github.com/stjepang/3bc21528f5cf82ecd564778f8a861b11): 1211 ms
* [`TreeMap<long, long> (2 threads)`](https://gist.github.com/stjepang/da69ad273ea2cf2e13b4322c0ea6bd74): 1409 ms
* [`ConcurrentSkipListMap<long, long> (1 thread)`](https://gist.github.com/stjepang/f6f289c07759f47a72b0565fd6b992c7): 2181 ms
* [`ConcurrentSkipListMap<long, long> (2 threads)`](https://gist.github.com/stjepang/74d1abc7230ad6e6dd0c4aec1f4cab4b): 1353 ms

The bottom line is: in single-threaded scenarios `SkipMap` should be comparable in performance
to any typical binary search tree (although not to a B-tree). As we add more threads, it seems to
scale quite well. I don't have a machine with a high number of cores to test scalability more thoroughly,
but these numbers seem promising so far nonetheless.

### Iteration

The skip list supports easy iteration. Note that when iterating over a `SkipMap` we hand out an `Entry`
for each entry in it. Creating an entry involves incrementing its reference count, and when moving
from one entry to another we also have to pin the current thread. This is a lot of reference count
updating and a lot of pinning.

Here are some benchmark numbers for iterating over a million randomly inserted entries:

* `BTreeMap` (Rust): 18 ms
* `SkipMap` (Rust): 113 ms
* `std::map` (C++): 93 ms
* `TreeMap` (Java): 41 ms
* `ConcurrentSkipListMap` (Java): 32 ms

Interesting observations:

* Iteration over a `BTreeMap` is very fast - this shouldn't be surprising since adjacent
elements are grouped into blocks.

* `SkipMap` is the slowest map here. I've tried measuring how long iteration takes withut
reference count updating and without pinning, and it turns out to be around 95 ms. That is
very similar to `std::map` in C++. Also, reference counting and pinning definitely incurs
a measurable cost, but it's not a *terrible* one.

* Java is fast - even `TreeMap` is faster than `std::map` in C++. How is that possible?
Well, the answer lies in the fact that Java's GC kicks from time to time, moves
allocated nodes in memory (it's a compacting GC), and tries to lay out linked nodes
as close as possible, thus optimizing for cache efficiency.

* Let's try tuning Java's GC by using option `-XX:NewSize=1024m`. This option sets the
size of the new generation to 1024 MB (a huge number), which means compaction should
never kick in. Indeed, iteration timings are much different now - with `TreeMap` it takes
124 ms and with `ConcurrentSkipListMap` it takes 110 ms. Now that's much closer to `SkipMap`
and `std::map`.

### The cost of reference counting in `Entry`

When iterating over a skip list we use `Entry`s, which are essentially reference-counted
pointers to skip list nodes. That means iterating over 100 elements incurs the
cost of 200 atomic increments and 200 atomic decrements.

Methods that insert, remove, or search for an element return `Entry`, which means
they too incur some cost spent on incrementing and decrementing a node's reference count.

The current skip list implementation doesn't provide alternative methods that avoid
reference counting (i.e. avoid using `Entry`), but in the future we should discuss how
to add them. Broadly speaking, there are three general alternatives to entries:
clones, guards, and closures. Here's an illustration on the `SkipMap::get` method:

```rust
// Reference counting: return an `Entry`.
//
// This is the method signature we currently have.
fn get(&self, k: &K) -> Option<Entry<K, V>>;

// Alternative #1: return a clone of the element.
//
// This means we're paying the price of cloning, but that's
// not a problem if cloning is cheap.
fn get_clone(&self, k: &K) -> Option<V> where V: Clone;

// Alternative #2: return a guarded reference to the element
// keeping the thread pinned.
//
// The main drawback here is that the user must be careful
// not to keep the guard live for too long, or else garbage
// collection will get stuck.
fn get_guard(&self, k: &K) -> Option<Guard<K, V>>;

// Alternative #3: take a closure that does something with
// the found element while the thread is still pinned.
//
// Again, the drawback here is that the user must be careful
// not to keep the closure running for too long, or else
// garbage collection will get stuck.
fn get_with<F: FnOnce(&V)>(&self, k: &K, f: F);
```

# Drawbacks

Skip lists are not very exciting when it comes to performance. Hash tables, B-trees (Bw-Tree is
a lock-free B-tree variant), and radix trees (ART - adaptive radix tree can be made concurrent)
are usually more performant. However, these faster data structures are not as general as skip lists
and have to make sacrifices by restricting the set of supported operations or by making the API less ergonomic.

# Alternatives

A few possible similar but alternative data structures might be:

1. Adaptive radix tree (keys can only be byte arrays).
2. Skip tree (moves elements in memory, thus constraining the API).
3. Bw-Tree (moves elements in memory, thus constraining the API).

# Unresolved questions

* Should `Entry` be renamed to `Cursor`?
* How do we make iteration faster by avoiding reference counting?
* What alternatives to the `Entry` API do we need and how to incorporate them?