A B-tree based persistent sorted set for Clojure/Script.
PersistentSortedSet supports:
- transients,
- custom comparators,
- fast iteration,
- efficient slices (iterator over a part of the set)
- efficient
rseq
on slices.
Almost a drop-in replacement for clojure.core/sorted-set
, the only difference being this one can’t store nil
.
Implementations are provided for Clojure and ClojureScript.
export JAVA8_HOME="/Library/Java/JavaVirtualMachines/jdk1.8.0_202.jdk/Contents/Home"
lein jar
Dependency:
[persistent-sorted-set "0.3.0"]
Code:
(require '[me.tonsky.persistent-sorted-set :as set])
(set/sorted-set 3 2 1)
;=> #{1 2 3}
(-> (set/sorted-set 1 2 3 4)
(conj 2.5))
;=> #{1 2 2.5 3 4}
(-> (set/sorted-set 1 2 3 4)
(disj 3))
;=> #{1 2 4}
(-> (set/sorted-set 1 2 3 4)
(contains? 3))
;=> true
(-> (apply set/sorted-set (range 10000))
(set/slice 5000 5010))
;=> (5000 5001 5002 5003 5004 5005 5006 5007 5008 5009 5010)
(-> (apply set/sorted-set (range 10000))
(set/rslice 5010 5000))
;=> (5010 5009 5008 5007 5006 5005 5004 5003 5002 5001 5000)
(set/sorted-set-by > 1 2 3)
;=> #{3 2 1}
One can also efficiently seek on the iterators.
(-> (seq (into (set/sorted-set) (range 10)))
(set/seek 5))
;; => (5 6 7 8 9)
(-> (into (set/sorted-set) (range 100))
(set/rslice 75 25)
(set/seek 60)
(set/seek 30))
;; => (30 29 28 27 26 25)
Clojure version allows efficient storage of Persistent Sorted Set on disk/DB/anywhere.
To do that, implement IStorage
interface:
(defrecord Storage [*storage]
IStorage
(store [_ node]
(let [address (random-uuid)]
(swap! *storage assoc address
(pr-str
(if (instance? Branch node)
{:level (.level ^Branch node)
:keys (.keys ^Branch node)
:addresses (.addresses ^Branch node)}
(.keys ^Leaf node))))
address))
(restore [_ address]
(let [value (-> (get @*storage address)
(edn/read-string))]
(if (map? value)
(Branch. (int (:level value)) ^java.util.List (:keys value) ^java.util.List (:addresses value))
(Leaf. ^java.util.List value)))))
Storing Persistent Sorted Set works per node. This will save each node once:
(def set
(into (set/sorted-set) (range 1000000)))
(def storage
(Storage. (atom {})))
(def root
(set/store set storage))
If you try to store once again, no store operations will be issued:
(assert
(= root
(set/store set storage)))
If you modify set and store new one, only nodes that were changed will be stored. For a tree of depth 3, it’s usually just ~3 nodes. The root will be new, though:
(def set2
(into set [-1 -2 -3]))
(assert
(not= root
(set/store set2 storage)))
Finally, one can construct a new set from its stored snapshot. You’ll need address for that:
(def set-lazy
(set/restore root storage))
Restore operation is lazy. By default it won’t do anything, but when you start accessing returned set, IStorage::restore
operations will be issued and part of the set will be reconstructed in memory. Only nodes needed for a particular operation will be loaded.
E.g. this will load ~3 nodes for a set of depth 3:
(first set-lazy)
This will load ~50 nodes on default settings:
(take 5000 set-lazy)
Internally Persistent Sorted Set does not caches returned nodes, so don’t be surprised if subsequent first
loads the same nodes again. One must implement cache inside IStorage implementation for efficient retrieval of already loaded nodes. Also see IStorage::accessed
for access stats, e.g. for LRU.
Any operation that can be done on in-memory PSS can be done on a lazy one, too. It will fetch required nodes when needed, completely transparently for the user. Lazy PSS can exist arbitrary long without ever being fully realized in memory:
(def set3
(conj set-lazy [-1 -2 -3]))
(def set4
(disj set-lazy [4 5 6 7 8]))
(contains? set-lazy 5000)
Last piece of the puzzle: set/walk-addresses
. Use it to check which nodes are actually in use by current PSS and optionally clean up garbage in your storage that is not referenced by it anymore:
(let [*alive-addresses (volatile! [])]
(set/walk-addresses set #(vswap! *alive-addresses conj %))
@*alive-addresses)
See test_storage.clj for more examples.
Durability for ClojureScript is not yet supported.
To reproduce:
- Install
[com.datomic/datomic-free "0.9.5703"]
locally. - Run
lein bench
.
PersistentTreeSet
is Clojure’s Red-black tree based sorted-set.
BTSet
is Datomic’s B-tree based sorted set (no transients, no disjoins).
PersistentSortedSet
is this implementation.
Numbers I get on my 3.2 GHz i7-8700B:
PersistentTreeSet 143..165ms
BTSet 125..141ms
PersistentSortedSet 105..121ms
PersistentSortedSet (transient) 50..54ms
PersistentTreeSet 51..54ms
BTSet 45..47ms
PersistentSortedSet 46..47ms
PersistentTreeSet 70..77ms
PersistentSortedSet 10..11ms
PersistentTreeSet 116..124ms
BTSet 92..105ms
PersistentSortedSet 56..68ms
For PersistentTreeSet
we use ISeq produced by (take-while #(<= % 999999) (.seqFrom set 1 true))
.
For PersistentSortedSet
we use (.slice set 1 999999)
.
PersistentTreeSet 238..256ms
PersistentSortedSet 70..91ms
PersistentTreeSet 151..155ms
PersistentSortedSet 91..98ms
PersistentSortedSet (transient) 47..50ms
- Datascript, persistent in-memory database
Copyright © 2019 Nikita Prokopov
Licensed under MIT (see LICENSE).