
Reduce padding in the MPSC Channel + introduce a no count flag #93

Closed
2 tasks done
mratsim opened this issue Jan 5, 2020 · 2 comments · Fixed by #138
mratsim commented Jan 5, 2020

Padding

The MPSC channel padding is very memory hungry:

  ChannelMpscUnboundedBatch*[T: Enqueueable] = object
    ## Lockless multi-producer single-consumer channel
    ##
    ## Properties:
    ## - Lockless
    ## - Wait-free for producers
    ## - Consumer can be blocked by producers when they swap
    ##   the tail, the tail can grow but the consumer sees it as nil
    ##   until it's published all at once.
    ## - Unbounded
    ## - Intrusive List based
    ## - Keeps an approximate count of enqueued items
    ## - Supports batching on both the producer and consumer side

    # TODO: pass this through Relacy and Valgrind/Helgrind
    #       to make sure there are no bugs
    #       on archs with relaxed memory models

    # Accessed by all
    count{.align: WV_CacheLinePadding.}: Atomic[int]
    # Producers and consumer slow-path
    back{.align: WV_CacheLinePadding.}: Atomic[pointer] # Workaround generic atomics bug: https://github.com/nim-lang/Nim/issues/12695
    # Consumer only - front is a dummy node
    front{.align: WV_CacheLinePadding.}: typeof(default(T)[])

WV_CacheLinePadding is 2x the cache-line size = 128 bytes, which means the three fields take 384 bytes. The value of 128 was chosen because Intel CPUs prefetch cache lines in pairs; Facebook's Folly also ran in-depth experiments that arrived at this value.

This was OK when MPSC channels were used in a fixed manner for incoming steal requests and incoming memory freed by remote threads. However, the dataflow parallelism protocol described in #92 (comment) requires allocating ephemeral MPSC channels.
If an application relies exclusively on dataflow graph parallelism, it will incur a huge memory overhead, as the memory pool only allocates 256-byte blocks.

As a (hopeful) compromise between cache-invalidation prevention and memory usage, the fields could be reorganized as follows:

  ChannelMpscUnboundedBatch*[T: Enqueueable] = object
    # Producers and consumer slow-path
    back{.align: WV_CacheLinePadding div 2.}: Atomic[pointer]
    # Accessed by all
    count{.align: WV_CacheLinePadding div 2.}: Atomic[int]
    # Consumer only - front is a dummy node
    front{.align: WV_CacheLinePadding div 2.}: typeof(default(T)[])

Padding is now 64 bytes, for a total of 128 + sizeof(T) bytes, well within the memory pool block size of 256 bytes. The channel can also be made intrusive to another data structure, with 256 - 128 - sizeof(T) bytes left for metadata, to save on allocations.

In terms of cache conflicts, front/back are still 2 cache-lines apart, and there was cache invalidation on count anyway.

Ordering the producer fields before the consumer fields assumes that:

  • if the MPSC channel is used intrusively, it is the first field of the enclosing datatype
  • the consumer is the likely owner and updater of that type
    This may or may not be true

Count

The count is needed for steal requests to approximate (as a lower bound) the number of thieves in adaptive steal mode.
It is also needed for remotely freed memory, so the memory pool has a lower bound on the number of memory blocks that can be collected back into the memory arena.

However, the dataflow parallelism protocol does not need to track the number of enqueued tasks, and similarly non-adaptive stealing does not need to track the number of steal requests.
Atomic increments/decrements are very taxing, as they force cache-line transfers between cores. The MPSC channel should therefore have an optional count.
Note that with an optional count, the padding between back and front should be 2x cache-lines.

Status

  • [x] Reduce Padding
  • [x] Introduce a nocount flag
@mratsim added the enhancement :shipit: and memory 💫 labels on Jan 5, 2020

mratsim commented Jan 12, 2020

Upstream bug nim-lang/Nim#13122: the {.align.} pragma is not applied if there is a generic field.


mratsim commented Jan 12, 2020

It seems the second part, avoiding counting for pledges, will require duplicating the channel.

Using a static bool parameter does not work: a static early-resolution bug (nim-lang/Nim#8677) or "static sandwich" issue (nim-lang/Nim#11225) prevents the following:

macro derefMPSC*(T: typedesc): typedesc =
  # This somehow isn't bound properly when used in a type section (it needs to be exported in this module)
  # and it's even worse if that type section involves a static (it needs to be re-exported without hops to all modules!)
  let instantiated = T.getTypeInst
  instantiated.expectKind(nnkBracketExpr)
  doAssert instantiated[0].eqIdent"typeDesc"

  let ptrT = instantiated[1]
  if ptrT.kind == nnkPtrTy:
    return ptrT[0]

  let ptrTImpl = instantiated[1].getImpl
  ptrTImpl.expectKind(nnkTypeDef)
  ptrTImpl[2].expectKind(nnkPtrTy)
  ptrTImpl[2][0].expectKind({nnkObjectTy, nnkSym})
  return ptrTImpl[2][0]

# MPSC channel
# ------------------------------------------------

const MpscPadding = WV_CacheLinePadding div 2
  ## The padding is chosen so that:
  ## - The data structure fits in WV_MemBlockSize (4x cache-line by default)
  ## - The front and back fields, owned respectively by the consumer and
  ##   the producers (the first accessed only by the consumer, the second
  ##   mostly by producers), are kept 2 cache-lines apart.

type
  Enqueueable = concept x, type T
    x is ptr
    x.next is Atomic[pointer]
    # Workaround generic atomics bug: https://github.com/nim-lang/Nim/issues/12695

  ChannelMpscUnboundedBatch*[keepCount: static bool, T: Enqueueable] = object
    ## Lockless multi-producer single-consumer channel
    ##
    ## Properties:
    ## - Lockless
    ## - Wait-free for producers
    ## - Consumer can be blocked by producers when they swap
    ##   the tail, the tail can grow but the consumer sees it as nil
    ##   until it's published all at once.
    ## - Unbounded
    ## - Intrusive List based
    ## - Keeps an approximate count of enqueued items
    ## - Supports batching on both the producer and consumer side

    # TODO: pass this through Relacy and Valgrind/Helgrind
    #       to make sure there are no bugs
    #       on archs with relaxed memory models

    # Producers and consumer slow-path
    back{.align: MpscPadding.}: Atomic[pointer] # Workaround generic atomics bug: https://github.com/nim-lang/Nim/issues/12695
    # Accessed by all - don't remove it even when not keeping count: it keeps back/front 2 cache-lines apart.
    count{.align: MpscPadding.}: Atomic[int]
    # Consumer only - front is a dummy node
    front{.align: MpscPadding.}: derefMPSC(T)

This is the error while compiling the depth-first-search benchmark, even after adding a mixin and checking that the symbol is declared:

true
.../weave/benchmarks/dfs/weave_dfs.nim(28, 19) template/generic instantiation of `sync` from here
.../weave/weave/await_fsm.nim(204, 5) template/generic instantiation of `forceComplete` from here
.../weave/weave/await_fsm.nim(184, 19) template/generic instantiation of `recycleChannel` from here
.../weave/weave/datatypes/flowvars.nim(77, 12) template/generic instantiation of `recycle` from here
.../weave/weave/memory/memory_pools.nim(547, 47) template/generic instantiation of `trySend` from here
.../weave/weave/channels/channels_mpsc_unbounded_batch.nim(71, 34) Error: undeclared identifier: 'derefMPSC'

CI: https://dev.azure.com/numforge/Weave/_build/results?buildId=291&view=logs&j=0795a493-e5ca-56c8-45b7-d83e4a06a826&t=bf8af78a-c46e-5ed8-fb21-c91632a59524&l=849

proc recycle*[T](p: ptr T) {.gcsafe.} =
  ## Returns a memory block to its memory pool.
  ##
  ## This is thread-safe, any thread can call it.
  ## A fast path is used if it's the ID of the borrowing thread,
  ## otherwise a slow path will be used.
  ##
  ## If the thread owning the pool was exited before this
  ## block was returned, the main thread should now
  ## have ownership of the related arenas and can deallocate them.
  static: echo declared(derefMPSC)
  mixin derefMPSC # that doesn't work either :/

  # TODO: sink ptr T - parsing bug https://github.com/nim-lang/Nim/issues/12757
  preCondition: not p.isNil

  let p = cast[ptr MemBlock](p)

  # Find the owning arena
  let arena = p.getArena()

  if getMemThreadID() == arena.meta.threadID.load(moRelaxed):
    # thread-local free
    if arena.meta.localFree.isNil:
      p.next.store(nil, moRelaxed)
      arena.meta.localFree = p
    else:
      arena.meta.localFree.prepend(p)
    poisonMemRegion(p, WV_MemBlockSize)
    arena.meta.used -= 1
    if unlikely(arena.isUnused()):
      # If an arena is unused, we can try releasing it immediately
      arena.allocator[].considerRelease(arena)
  else:
    # remote arena - TODO: Poisoning except from the MPSC Queue?
    let remoteRecycled = arena.meta.remoteFree.trySend(p)
    postCondition: remoteRecycled

Alternatively, having a dereference operator for pointer types, or having typeof(default(T)[]) work in nested contexts (nim-lang/Nim#13048), would remove the need for a macro with early symbol-resolution issues.

mratsim added a commit that referenced this issue on May 16, 2020:
* Change the mpsc counting to fix #49

* Make MPSC count optional, finishes #93

* semantics: ascertain -> postCondition

* rebase