diff --git a/text/2018-01-14-skiplist.md b/text/2018-01-14-skiplist.md new file mode 100644 index 0000000..bebd7a2 --- /dev/null +++ b/text/2018-01-14-skiplist.md @@ -0,0 +1,316 @@ +# Summary + +Introduce crate `crossbeam-skiplist` containing a lock-free skip list. + +The crate provides an ordered map and an ordered set akin to `BTreeMap` and `BTreeSet`. +In the future, we might want to build other data structures on top of the +skip list as well, for example a priority queue. + +# Motivation + +These are the first concurrent map and set data structures to be added to Crossbeam. + +Skip list is often touted as a relatively easy data structure to make concurrent, or at least +easy compared to other maps/sets. However, supporting the remove operation in a language without +GC and coming up with an API that isn't overly restrictive is very difficult. + +This crate aims to provide a concurrent map/set that aims to be as powerful and +ergonomic as `BTreeMap`/`BTreeSet`. The API must be reasonably easy to use and provide all +operations one would expect from any other map/set without jumping through hoops. + +A good example of that is Java - one can simply replace any use of [`TreeMap`] with +[`ConcurrentSkipListMap`] and expect the code to *just work*. The concurrent map in Java is an +almost perfect drop-in replacement for the non-concurrent one. Unfortunately, I don't think +we can achieve quite the same thing in Rust - and the reason is that Java and Rust have much +different memory models (e.g. Java doesn't have move semantics and pervasively allocates objects +on the heap). With that said, I believe we can still model a lock-free map that mimics +`BTreeMap` fairly closely. + +Regarding performance, a skip list is fundamentally disadvantaged compared to a B-tree. +Every node in a skip list is separately allocated on the heap, while a B-tree +allocates nodes in large blocks, thus greatly improving cache utilization. The problem +of scattered skip list nodes in memory can be somewhat mitigated using custom allocators +(by trying to allocate adjacent nodes in a skip list as close as possible in memory), but +typically with great difficulty and underwhelming results. + +One can think of a B-tree as a kind of a compacting garbage collector. Consider what +happens when a B-tree block becomes full: a new block may be allocated +and elements are redistributed among blocks as needed. This is reminiscent of compacting +garbage collectors. Note that moving elements in memory makes concurrency more difficult in Rust: +a thread cannot hold a reference to an element if another thread may potentially move it +to a different location in memory at the same time. + +Skip lists, however, allocate each node separately on the heap. A node contains a key, a value, +and a tower of next-pointers. The node is never moved to a different location in memory - once +allocated, it stays there until it is is destroyed. This makes cache utilization worse, +but also makes borrowing elements in presence of parallel modify operations easier. + +Long story short, a lock-free skip list will scale much better than a mutex-protected `BTreeMap`, +but in single-threaded scenarios it will have no chance competing with `BTreeMap` due to poorer +cache utilization. + +[`ConcurrentSkipListMap`]: https://docs.oracle.com/javase/8/docs/api/java/util/concurrent/ConcurrentSkipListMap.html +[`TreeMap`]: https://docs.oracle.com/javase/8/docs/api/java/util/TreeMap.html + +### Previous work + +Notable implementations of concurrent skip lists in other languages: + +1. [java.util.concurrent](http://grepcode.com/file/repository.grepcode.com/java/root/jdk/openjdk/8u40-b25/java/util/concurrent/ConcurrentSkipListMap.java#ConcurrentSkipListMap) (Java): + Has the most extensive API - and really feels like a drop-in replacement for any other map. + However, the implementation is not particularly efficient and has a few interesting quirks. + For example, each pointer in a tower is separately allocated. Another one is - instead of + tagging pointers to mark a node as deleted, a dummy successor node is allocated. + +2. [libcds](https://github.com/khizmax/libcds/blob/19af81b7c61480ed705b91b4d01ee5d717a97cd2/cds/intrusive/skip_list_rcu.h) (C++): + Fairly complete and general API (one can even choose between no GC, EBR, and HP). + +3. [RocksDB](https://github.com/facebook/rocksdb/blob/68829ed89cec64186557dc0860fc693c118ff1c6/memtable/skiplist.h) (C++): + The skip list does not support removal nor multiple concurrent inserts. However, an ongoing + insert operation does not block other threads from reading the skip list. Once the skip list + becomes full, it is flushed to disk-based storage and a new skip list is constructed to replace + the old one. + +4. [Folly](https://github.com/facebook/folly/blob/98d1077ce0603b0713353d638cb1436a28827af6/folly/ConcurrentSkipList.h) (C++): + A concurrent skip list, but not lock-free: it uses per-node locking. + Also, removed nodes are not freed until the skip list is destroyed. + +5. [libgee](https://github.com/GNOME/libgee/blob/da95e830524ffa309eb57925320666e5085b9d66/gee/concurrentset.vala) (Vala): + A hazard pointer-based skip list. Looks very interesting. + +There are also several concurrent skip lists in Rust, but none of them have been published to crates.io so far +and look like works in progress: + +1. [danburkert/pawn](https://github.com/danburkert/pawn/blob/8b6806d944d830f552d496cd3ee605d1707fdc51/src/util/skip_list.rs) (Rust): + A rather old insert-only lock-free skip list. Looks like an abandoned project. + +2. [Vtec234/lists-rs](https://github.com/Vtec234/lists-rs/blob/f83e516039dc4a421172af1cdbdcec85b0e73d74/src/epoch_skiplist.rs) (Rust): + A lock-free skip list that supports remove and uses Crossbeam for memory reclamation. + Interestingly, keys are always hashed so it's technically a hash map. + +3. [boats/skiplist](https://gitlab.com/boats/skiplist/tree/master/src/skiplist) (Rust): + Insert-only lock-free skip list by withoutboats. Published very recently. + +# Detailed design + +The proposed implementation is currently residing in [stjepang/skiplist](https://github.com/stjepang/skiplist), +but will be moved into a new repository `crossbeam-rs/crossbeam-skiplist`. It is a +lock-free skip list using epoch-based memory reclamation from `crossbeam-epoch`. + +The implementation is based on the following work: + +1. [Practical lock-freedom](https://www.cl.cam.ac.uk/techreports/UCAM-CL-TR-579.pdf) + (see *4.3.3 CAS-based design*) + +2. [Linked Lists: Locking, Lock-Free and Beyond...](http://janvitek.org/events/TiC06/B-SLIDES/mh2.pdf) + +The codebase consists of three main source files: + +* [`base.rs`](https://github.com/stjepang/skiplist/blob/master/src/base.rs) - + Contains the base skip list implementation details. This file does not attempt to + expose something ergonomic, but instead aims to provide a skip list 'engine' that + is intended to be wrapped into a nicer interface. + +* [`map.rs`](https://github.com/stjepang/skiplist/blob/master/src/map.rs) - + Wraps the base implementation into a map interface similar to `BTreeMap`. + +* [`set.rs`](https://github.com/stjepang/skiplist/blob/master/src/set.rs) - + Wraps the base implementation into a set interface similar to `BTreeSet`. + +**Note:** These map and set wrappers are just tentative interfaces - they're +finished, and there's a possibility we'll want to completely change them. +For now, consider them just a proof of concept. + +## Tentative map API + +Here's a quick demo. The following code is the [first example](https://doc.rust-lang.org/std/collections/struct.BTreeMap.html#examples) +taken from `BTreeMap`'s documentation, except it uses `SkipMap` instead of `BTreeMap`. +A few other minor changes were required to make it compile, but other than that it doesn't depart +too far from the original: + +```rust +// type inference lets us omit an explicit type signature (which +// would be `SkipMap<&str, &str>` in this example). +let movie_reviews = SkipMap::new(); + +// review some movies. +movie_reviews.insert("Office Space", "Deals with real issues in the workplace."); +movie_reviews.insert("Pulp Fiction", "Masterpiece."); +movie_reviews.insert("The Godfather", "Very enjoyable."); +movie_reviews.insert("The Blues Brothers", "Eye lyked it alot."); + +// check for a specific one. +if !movie_reviews.contains_key("Les Misérables") { + println!("We've got {} reviews, but Les Misérables ain't one.", + movie_reviews.len()); +} + +// oops, this review has a lot of spelling mistakes, let's delete it. +movie_reviews.remove("The Blues Brothers"); + +// look up the values associated with some keys. +let to_find = ["Up!", "Office Space"]; +for book in &to_find { + match movie_reviews.get(book) { + Some(entry) => println!("{}: {}", book, entry.value()), + None => println!("{} is unreviewed.", book) + } +} + +// iterate over everything. +for entry in &movie_reviews { + let movie = entry.key(); + let review = entry.value(); + println!("{}: \"{}\"", movie, review); +} +``` + +Take a look at [map.rs](https://github.com/stjepang/skiplist/blob/master/src/map.rs) to +see the full interface of `SkipMap`. + +An interesting difference from `BTreeMap` is that methods like `insert` and `get` return +an `Entry<'a, K, V>`, which is essentially just a reference-counted pointer to an entry in +the skip list. Note that it is possible to hold an entry and remove it at the same time (you +can even call `entry.remove()`), but the actual contents of the entry will not be destroyed +before the last reference is dropped. + +### Performance + +It has already been mentioned that `SkipMap` will have a hard time competing with `BTreeMap` in +single-threaded scenarios. +Let's see that through a simple benchmark that just inserts a million pseudorandom +numbers into a map. This is a very unscientific benchmark, but it should at least give +us a feeling for how different map implementations fare against each other. + +Machine: Intel Core i7-5600U (2 physical cores, 4 logical cores) + +First, here's `BTreeMap` in three different scenarios: + +* [`BTreeMap` (1 thread)](https://gist.github.com/stjepang/9b1bf73c2fdb0309cefda66b91f633dd): 315 ms +* [`Mutex>` (1 thread)](https://gist.github.com/stjepang/437b82134b401d3fa2c9c439a003c1ea): 321 ms +* [`Mutex>` (2 threads)](https://gist.github.com/stjepang/66000dfae15c8046b91ff3612c7d881f): 752 ms + +Notice how there is very little overhead of locking if only one thread is used. However, as soon as +we add more threads, contended locking brings a huge penalty on performance. + +But `SkipMap` doesn't suffer from the same problem. In fact, adding more threads improves performance: + +* [`SkipMap` (1 thread)](https://gist.github.com/stjepang/1980ab811009e94f2adfe8b230c20047): 1028 ms +* [`SkipMap` (2 threads)](https://gist.github.com/stjepang/a3f8f6dddac56d43e7dbfb2928cd3bfe): 561 ms + +Let's also see some numbers for a mutex-protected `std::map` in C++: + +* [`std::map` (1 thread)](https://gist.github.com/stjepang/6aa80020b6edac1f6ea9af518e4ad989): 881 ms +* [`std::map` (2 threads)](https://gist.github.com/stjepang/b172a4259c0439d2855bc68fd47b3ab7): 1127 ms + +And here are mutex-protected `TreeMap` and `ConcurrentSkipListMap` in Java: + +* [`TreeMap (1 thread)`](https://gist.github.com/stjepang/3bc21528f5cf82ecd564778f8a861b11): 1211 ms +* [`TreeMap (2 threads)`](https://gist.github.com/stjepang/da69ad273ea2cf2e13b4322c0ea6bd74): 1409 ms +* [`ConcurrentSkipListMap (1 thread)`](https://gist.github.com/stjepang/f6f289c07759f47a72b0565fd6b992c7): 2181 ms +* [`ConcurrentSkipListMap (2 threads)`](https://gist.github.com/stjepang/74d1abc7230ad6e6dd0c4aec1f4cab4b): 1353 ms + +The bottom line is: in single-threaded scenarios `SkipMap` should be comparable in performance +to any typical binary search tree (although not to a B-tree). As we add more threads, it seems to +scale quite well. I don't have a machine with a high number of cores to test scalability more thoroughly, +but these numbers seem promising so far nonetheless. + +### Iteration + +The skip list supports easy iteration. Note that when iterating over a `SkipMap` we hand out an `Entry` +for each entry in it. Creating an entry involves incrementing its reference count, and when moving +from one entry to another we also have to pin the current thread. This is a lot of reference count +updating and a lot of pinning. + +Here are some benchmark numbers for iterating over a million randomly inserted entries: + +* `BTreeMap` (Rust): 18 ms +* `SkipMap` (Rust): 113 ms +* `std::map` (C++): 93 ms +* `TreeMap` (Java): 41 ms +* `ConcurrentSkipListMap` (Java): 32 ms + +Interesting observations: + +* Iteration over a `BTreeMap` is very fast - this shouldn't be surprising since adjacent + elements are grouped into blocks. + +* `SkipMap` is the slowest map here. I've tried measuring how long iteration takes withut + reference count updating and without pinning, and it turns out to be around 95 ms. That is + very similar to `std::map` in C++. Also, reference counting and pinning definitely incurs + a measurable cost, but it's not a *terrible* one. + +* Java is fast - even `TreeMap` is faster than `std::map` in C++. How is that possible? + Well, the answer lies in the fact that Java's GC kicks from time to time, moves + allocated nodes in memory (it's a compacting GC), and tries to lay out linked nodes + as close as possible, thus optimizing for cache efficiency. + +* Let's try tuning Java's GC by using option `-XX:NewSize=1024m`. This option sets the + size of the new generation to 1024 MB (a huge number), which means compaction should + never kick in. Indeed, iteration timings are much different now - with `TreeMap` it takes + 124 ms and with `ConcurrentSkipListMap` it takes 110 ms. Now that's much closer to `SkipMap` + and `std::map`. + +### The cost of reference counting in `Entry` + +When iterating over a skip list we use `Entry`s, which are essentially reference-counted +pointers to skip list nodes. That means iterating over 100 elements incurs the +cost of 200 atomic increments and 200 atomic decrements. + +Methods that insert, remove, or search for an element return `Entry`, which means +they too incur some cost spent on incrementing and decrementing a node's reference count. + +The current skip list implementation doesn't provide alternative methods that avoid +reference counting (i.e. avoid using `Entry`), but in the future we should discuss how +to add them. Broadly speaking, there are three general alternatives to entries: +clones, guards, and closures. Here's an illustration on the `SkipMap::get` method: + +```rust +// Reference counting: return an `Entry`. +// +// This is the method signature we currently have. +fn get(&self, k: &K) -> Option>; + +// Alternative #1: return a clone of the element. +// +// This means we're paying the price of cloning, but that's +// not a problem if cloning is cheap. +fn get_clone(&self, k: &K) -> Option where V: Clone; + +// Alternative #2: return a guarded reference to the element +// keeping the thread pinned. +// +// The main drawback here is that the user must be careful +// not to keep the guard live for too long, or else garbage +// collection will get stuck. +fn get_guard(&self, k: &K) -> Option>; + +// Alternative #3: take a closure that does something with +// the found element while the thread is still pinned. +// +// Again, the drawback here is that the user must be careful +// not to keep the closure running for too long, or else +// garbage collection will get stuck. +fn get_with(&self, k: &K, f: F); +``` + +# Drawbacks + +Skip lists are not very exciting when it comes to performance. Hash tables, B-trees (Bw-Tree is +a lock-free B-tree variant), and radix trees (ART - adaptive radix tree can be made concurrent) +are usually more performant. However, these faster data structures are not as general as skip lists +and have to make sacrifices by restricting the set of supported operations or by making the API less ergonomic. + +# Alternatives + +A few possible similar but alternative data structures might be: + +1. Adaptive radix tree (keys can only be byte arrays). +2. Skip tree (moves elements in memory, thus constraining the API). +3. Bw-Tree (moves elements in memory, thus constraining the API). + +# Unresolved questions + +* Should `Entry` be renamed to `Cursor`? +* How do we make iteration faster by avoiding reference counting? +* What alternatives to the `Entry` API do we need and how to incorporate them?