Conversation
|
CC @xokdvium |
|
Random tidbit: during eval of a NixOS system, we allocate a lot of common small strings: |
|
@tomberek How are these strings allocated? Is |
43c3477 to
1fbe558
Compare
|
Great work! For reference, some ad-hoc measurements really show an amazing improvement in terms of speed as well as memory usage (compared against current master):
|
|
Can you expand a bit on how well this would be amenable to multi-threaded evaluation? In #10938 I reduced lock contention for symbol table insertions by sharding the mapping from strings to symbol offsets, and made symbol reads lock-free entirely by making sure that symbols never move in memory. |
Looking at the API https://live.boost.org/doc/libs/develop/libs/unordered/doc/html/unordered/reference/unordered_flat_set.html this does look a drop-in replacement. Boost's implementation doesn't use sharding, but does something more clever it seems: https://bannalia.blogspot.com/2023/07/inside-boostconcurrentflatmap.html. Judging by the benchmarks it seems like its performance is very agreeable and I doubt we could hand-roll something more performant. |
I'm not sure whether this is thread safe. There is no guarantee that the thread, which calls Read locks are also quite expensive. Having a max. size of the symbol table seems to be questionable. Also I'm not sure whether reserving so much virtual memory space on a 32 bit system is a good idea. Sharding could be applied to
|
|
Builds and tests pass on {x86_64,aarch64}-{darwin,linux} locally as well. Seeing a ~8% memory and CPU usage decrease. |
roberth
left a comment
There was a problem hiding this comment.
Very nice!
What's more is that this removes a significant number of Values from the GC-managed heap, so this saves not just allocations, but also a bit of time spent in GC.
|
This pull request has been mentioned on NixOS Discourse. There might be relevant details there: https://discourse.nixos.org/t/a-look-at-nixos-nixpkgs-evaluation-times-over-the-years/65114/9 |
Motivation
The symbol table is an append only data structure and it allocates a lot of small strings. Thus the first optimization is to use a
std::pmr::monotonic_buffer_resourcefor string allocations.Then there is no need to store a
std::string_viewin the set because a pointer into theChunkedVectorcan be used instead, which frees asize_t. Furthermore the size ofstd::stringis 32 bytes onx86_64-linuxwith libstdc++, but the strings aren't dynamic, i.e. storing a pointer and a size is sufficient.Some
builtinstransform symbols stored in the symbol table, i.e. a value is allocated, which points to a string in the symbol table. The allocation can be avoided, if the value is stored in theChunkedVector. Onx86_64-linuxthe size ofValueis 24 bytes for storing a pointer to a string, i.e. it adds some overhead.Combining everything into
SymbolValue32 bytes are stored, while the index into theChunkedVectoris no longer stored in the set, freeing up anotheruint32_tor 8 bytes due to alignment.The
Valueoptimization reduces the max. memory usage fornix searchnoticeable, i.e. I measured a 6% reduction (~300 MiB) with this PR and current nixpkgs. A 2%-3% speedup was observed, too.Finally, the symbol table does not require that pointers to set entries are stable, i.e. the cache friendly
boost::unordered_flat_setcan be used, which stores the set entries in-place, so that it has less memory overhead, too.This implementation requires a good hash function, where different bits of the output are distributed uniformly. I don't know whether the hash functions of libstdc++ and libc++ (especially older versions) are good enough, so that I decided to use the hash function provided by boost.
Context
Making the symbol table thread-safe isn't too much effort anymore by using
boost::concurrent_flat_setinstead. The question is, which operations shall be lock free. Makingoperator[]thread-safe requires some changes ofChunkedVector.Add 👍 to pull requests you find important.
The Nix maintainer team uses a GitHub project board to schedule and track reviews.