
Reduce padding in the MPSC Channel + introduce a no count flag #93

Closed
2 tasks done
mratsim opened this issue Jan 5, 2020 · 2 comments · Fixed by #138
mratsim commented Jan 5, 2020

Padding

The MPSC channel padding is very memory hungry:

  ChannelMpscUnboundedBatch*[T: Enqueueable] = object
    ## Lockless multi-producer single-consumer channel
    ##
    ## Properties:
    ## - Lockless
    ## - Wait-free for producers
    ## - Consumer can be blocked by producers when they swap
    ##   the tail, the tail can grow but the consumer sees it as nil
    ##   until it's published all at once.
    ## - Unbounded
    ## - Intrusive List based
    ## - Keeps an approximate count of enqueued items
    ## - Supports batching on both the producer and consumer side

    # TODO: pass this through Relacy and Valgrind/Helgrind
    #       to make sure there are no bugs
    #       on archs with relaxed memory models

    # Accessed by all
    count{.align: WV_CacheLinePadding.}: Atomic[int]
    # Producers and consumer slow-path
    back{.align: WV_CacheLinePadding.}: Atomic[pointer] # Workaround generic atomics bug: https://github.com/nim-lang/Nim/issues/12695
    # Consumer only - front is a dummy node
    front{.align: WV_CacheLinePadding.}: typeof(default(T)[])

WV_CacheLinePadding is 2x the cache-line size = 128 bytes, which means the three fields take 384 bytes. The value of 128 was chosen because Intel CPUs prefetch cache lines in pairs; Facebook's Folly also ran in-depth experiments that arrived at this value.

This was OK when MPSC channels were used in a fixed manner for incoming steal requests and incoming memory freed by remote threads. However, the dataflow parallelism protocol described in #92 (comment) requires allocating ephemeral MPSC channels.
If an application relies exclusively on dataflow graph parallelism, it will incur a huge memory overhead, as the memory pool only allocates 256-byte blocks.

As a (hopeful) compromise between cache-invalidation prevention and memory usage, the fields could be reorganized as follows:

  ChannelMpscUnboundedBatch*[T: Enqueueable] = object
    # Producers and consumer slow-path
    back{.align: WV_CacheLinePadding div 2.}: Atomic[pointer]
    # Accessed by all
    count{.align: WV_CacheLinePadding div 2.}: Atomic[int]
    # Consumer only - front is a dummy node
    front{.align: WV_CacheLinePadding div 2.}: typeof(default(T)[])

Padding is now 64 bytes, for a total of 128 + sizeof(T) bytes, well within the memory pool block size of 256 bytes. The channel can also be made intrusive to another data structure, with 256 - 128 - sizeof(T) bytes left for metadata, to save on allocations.

In terms of cache conflicts, front/back are still 2 cache-lines apart, and there was cache invalidation on count anyway.

Ordering the producer fields before the consumer fields assumes that:

  • if the MPSC channel is used intrusively, it is the first field of the enclosing datatype
  • the consumer is the likely owner and updater of that type
    This may or may not be true

Count

The count is needed for steal requests to approximate (as a lower bound) the number of thieves in adaptive steal mode.
It is also needed for remotely freed memory, so the memory pool has a lower bound on the number of memory blocks that can be collected back into the memory arena.

However, the dataflow parallelism protocol does not need to track the number of enqueued tasks, and similarly non-adaptive stealing does not need to track the number of steal requests.
Atomic increments/decrements are very taxing, as they force cache-line transfers between cores. The MPSC channel should therefore have an optional count.
Note that with an optional count, the padding between back and front should be 2x cache-lines.

Status

  • [x] Reduce Padding
  • [x] Introduce a nocount flag
@mratsim added the enhancement :shipit: and memory 💫 labels on Jan 5, 2020

mratsim commented Jan 12, 2020

Upstream bug nim-lang/Nim#13122: the {.align.} pragma is not applied if there is a generic field.


mratsim commented Jan 12, 2020

It seems the second part, avoiding counting for pledges, will require duplicating the channel.

Using a static bool parameter does not work: a static early-resolution bug (nim-lang/Nim#8677) or "static sandwich" issue (nim-lang/Nim#11225) prevents the following:

macro derefMPSC*(T: typedesc): typedesc =
  # This somehow isn't bound properly when used in a type section (it needs to be exported in this module)
  # and it's even worse if that type section involves a static (it needs to be re-exported without hops to all modules!)
  let instantiated = T.getTypeInst
  instantiated.expectKind(nnkBracketExpr)
  doAssert instantiated[0].eqIdent"typeDesc"

  let ptrT = instantiated[1]
  if ptrT.kind == nnkPtrTy:
    return ptrT[0]

  let ptrTImpl = instantiated[1].getImpl
  ptrTImpl.expectKind(nnkTypeDef)
  ptrTImpl[2].expectKind(nnkPtrTy)
  ptrTImpl[2][0].expectKind({nnkObjectTy, nnkSym})
  return ptrTImpl[2][0]

# MPSC channel
# ------------------------------------------------

const MpscPadding = WV_CacheLinePadding div 2
  ## The padding is chosen so that:
  ## - The data structure fits in WV_MemBlockSize (4x cache-line by default)
  ## - The front and back fields, owned respectively by the consumer and
  ##   the producers (the first accessed only by the consumer, the second
  ##   mostly by producers), are kept 2 cache-lines apart.

type
  Enqueueable = concept x, type T
    x is ptr
    x.next is Atomic[pointer]
    # Workaround generic atomics bug: https://github.com/nim-lang/Nim/issues/12695

  ChannelMpscUnboundedBatch*[keepCount: static bool, T: Enqueueable] = object
    ## Lockless multi-producer single-consumer channel
    ##
    ## Properties:
    ## - Lockless
    ## - Wait-free for producers
    ## - Consumer can be blocked by producers when they swap
    ##   the tail, the tail can grow but the consumer sees it as nil
    ##   until it's published all at once.
    ## - Unbounded
    ## - Intrusive List based
    ## - Keeps an approximate count of enqueued items
    ## - Supports batching on both the producer and consumer side

    # TODO: pass this through Relacy and Valgrind/Helgrind
    #       to make sure there are no bugs
    #       on archs with relaxed memory models

    # Producers and consumer slow-path
    back{.align: MpscPadding.}: Atomic[pointer] # Workaround generic atomics bug: https://github.com/nim-lang/Nim/issues/12695
    # Accessed by all - don't remove it even when not keeping count: it keeps back/front 2 cache-lines apart.
    count{.align: MpscPadding.}: Atomic[int]
    # Consumer only - front is a dummy node
    front{.align: MpscPadding.}: derefMPSC(T)

This is the error while compiling the depth-first-search benchmark, even after adding a mixin and checking that the symbol is declared:

true
.../weave/benchmarks/dfs/weave_dfs.nim(28, 19) template/generic instantiation of `sync` from here
.../weave/weave/await_fsm.nim(204, 5) template/generic instantiation of `forceComplete` from here
.../weave/weave/await_fsm.nim(184, 19) template/generic instantiation of `recycleChannel` from here
.../weave/weave/datatypes/flowvars.nim(77, 12) template/generic instantiation of `recycle` from here
.../weave/weave/memory/memory_pools.nim(547, 47) template/generic instantiation of `trySend` from here
.../weave/weave/channels/channels_mpsc_unbounded_batch.nim(71, 34) Error: undeclared identifier: 'derefMPSC'

CI: https://dev.azure.com/numforge/Weave/_build/results?buildId=291&view=logs&j=0795a493-e5ca-56c8-45b7-d83e4a06a826&t=bf8af78a-c46e-5ed8-fb21-c91632a59524&l=849

proc recycle*[T](p: ptr T) {.gcsafe.} =
  ## Returns a memory block to its memory pool.
  ##
  ## This is thread-safe, any thread can call it.
  ## A fast path is used if it's the ID of the borrowing thread,
  ## otherwise a slow path will be used.
  ##
  ## If the thread owning the pool was exited before this
  ## block was returned, the main thread should now
  ## have ownership of the related arenas and can deallocate them.
  static: echo declared(derefMPSC)
  mixin derefMPSC # that doesn't work either :/

  # TODO: sink ptr T - parsing bug https://github.com/nim-lang/Nim/issues/12757
  preCondition: not p.isNil

  let p = cast[ptr MemBlock](p)

  # Find the owning arena
  let arena = p.getArena()

  if getMemThreadID() == arena.meta.threadID.load(moRelaxed):
    # thread-local free
    if arena.meta.localFree.isNil:
      p.next.store(nil, moRelaxed)
      arena.meta.localFree = p
    else:
      arena.meta.localFree.prepend(p)
    poisonMemRegion(p, WV_MemBlockSize)
    arena.meta.used -= 1
    if unlikely(arena.isUnused()):
      # If an arena is unused, we can try releasing it immediately
      arena.allocator[].considerRelease(arena)
  else:
    # remote arena - TODO: Poisoning except from the MPSC Queue?
    let remoteRecycled = arena.meta.remoteFree.trySend(p)
    postCondition: remoteRecycled

Alternatively, having a dereference operator for pointer types, or having typeof(default(T)[]) work in nested contexts (nim-lang/Nim#13048), would remove the need for a macro with early symbol-resolution issues.

mratsim added a commit that referenced this issue on May 16, 2020:
* Change the mpsc counting to fix #49

* Make MPSC count optional, finishes #93

* semantics: ascertain -> postCondition

* rebase