OpenMP: Memory optimization of intermediate result #261

Open · mratsim opened this issue Jul 31, 2018 · 1 comment
mratsim commented Jul 31, 2018

To avoid false sharing / bank conflicts / cache thrashing when multiple threads read and write data in the same cache line, an intermediate array is used, with the intermediate values padded so that each lands on its own cache line. (See #139)

Tensor x: 1_000_000 floats
             |
             |
             v
intermediate reduction: 4 floats (for a quad-core CPU)
             |
             |
             v
Final result: 1 float

Given that cache lines are 64 bytes on all mainstream CPUs, to reduce memory consumption elements of type T should be spaced by n = max(64 div sizeof(T), 1) ==> 8 for float64/int64, 16 for float32/int32, 2 for a 32-byte uint256, and 1 for any type of 64 bytes or more.
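
A minimal sketch of that compile-time spacing computation, assuming sizeof(T) is usable at compile time (true for primitive types; the names are illustrative, not Arraymancer's):

const CacheLineBytes = 64

func elemSpacing(T: typedesc): int =
  ## How many elements of T to skip so that consecutive
  ## partial results land on different cache lines.
  max(CacheLineBytes div sizeof(T), 1)

static:
  doAssert elemSpacing(float64) == 8   # 64 / 8
  doAssert elemSpacing(float32) == 16  # 64 / 4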

Unfortunately, sizeof(T) only works for primitive types and, since nim-lang/Nim#8445, for arrays of primitive types. For Tensors of custom objects it is still pending nim-lang/Nim#5664.

In the meantime, the intermediate array elements are arbitrarily spaced by 16 (maxItemsPerCacheLine), which wastes a lot of space when T is a tensor (a couple hundred bytes in size): a 200-byte element padded by 16 reserves 3,200 bytes per slot instead of the 200 actually needed.

template omp_parallel_reduce_blocks*[T](reduced: T, block_offset, block_size: untyped, size, weight: Natural, op_final, op_init, op_middle: untyped): untyped =
  # To prevent false sharing, results will be stored in an array but
  # padded to be at least a cache line apart.
  # All CPU cache lines are 64B: 16 float32/int32 or 8 float64/int64 fit.
  # TODO: compile-time evaluation depending on sizeof(T)
  # Pending https://github.com/nim-lang/Nim/pull/5664
  const maxItemsPerCacheLine = 16
  let ompsize = size # ensure that a computed `size` is only evaluated once
  if likely(ompsize > 0):
    block ompblocks:
      when defined(openmp):
        if ompsize * weight >= OMP_FOR_THRESHOLD:
          let num_blocks = min(min(ompsize, omp_get_max_threads()), OMP_MAX_REDUCE_BLOCKS)
          if num_blocks > 1:
            withMemoryOptimHints()
            var results{.align64, noInit.}: array[OMP_MAX_REDUCE_BLOCKS * maxItemsPerCacheLine, type(reduced)]
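
To illustrate the layout the template sets up, here is a self-contained, sequential stand-in (illustrative names, not Arraymancer's actual code): each block accumulates into a slot spaced one cache line apart, exactly where each OpenMP thread would write.

const ItemsPerLine = 64 div sizeof(float32)  # 16 float32 per 64 B cache line

proc paddedSum(x: openArray[float32], numBlocks: int): float32 =
  ## Each block writes its partial sum to partials[b * ItemsPerLine],
  ## so no two partials share a cache line (no false sharing once
  ## the blocks run on separate threads).
  var partials = newSeq[float32](numBlocks * ItemsPerLine)
  let blockSize = x.len div numBlocks
  for b in 0 ..< numBlocks:            # each iteration = one OpenMP thread
    let start = b * blockSize
    let stop = if b == numBlocks - 1: x.len else: start + blockSize
    for i in start ..< stop:
      partials[b * ItemsPerLine] += x[i]
  for b in 0 ..< numBlocks:            # final merge of the padded partials
    result += partials[b * ItemsPerLine]

when isMainModule:
  var x = newSeq[float32](1_000_000)
  for i in 0 ..< x.len: x[i] = 1.0
  echo paddedSum(x, 4)  # 1000000.0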

mratsim commented Oct 25, 2018

Implemented in nim-lang/Nim#9356. Note that nim-lang/Nim#9493 is an alternative: it lets each thread keep a local variable and merge the partial reductions in a #pragma omp critical section.

This would allow much better scaling on processors with more than 8 cores (the current reduction-block limit).
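
A self-contained sketch of that alternative, using Nim threads and a lock where OpenMP would use #pragma omp critical (illustrative code, not the actual nim-lang/Nim#9493 implementation; compile with --threads:on):

import std/locks

type Chunk = tuple[data: ptr UncheckedArray[float32], a, b: int]

var
  total: float32        # shared accumulator, protected by totalLock
  totalLock: Lock

proc worker(c: Chunk) {.thread.} =
  var localSum: float32 = 0          # thread-local partial: no false sharing
  for i in c.a ..< c.b:
    localSum += c.data[i]
  withLock totalLock:                # the "critical section" merge
    total += localSum

proc lockedSum(x: seq[float32]): float32 =
  const NumThreads = 4
  var threads: array[NumThreads, Thread[Chunk]]
  initLock totalLock
  total = 0
  let p = cast[ptr UncheckedArray[float32]](unsafeAddr x[0])
  let chunk = x.len div NumThreads
  for t in 0 ..< NumThreads:
    let a = t * chunk
    let b = if t == NumThreads - 1: x.len else: a + chunk
    createThread(threads[t], worker, (data: p, a: a, b: b))
  joinThreads(threads)
  deinitLock totalLock
  result = total

when isMainModule:
  var x = newSeq[float32](1_000_000)
  for i in 0 ..< x.len: x[i] = 1.0
  echo lockedSum(x)  # 1000000.0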

This is being implemented in Laser, Arraymancer's future backend: see the parallel reduction example.

Note that runtime allocation is probably better anyway: it supports a high number of cores while keeping the binary small, similar to the sum kernel: https://github.com/numforge/laser/blob/e9bbb1055517c6e76649b67737d7b8807af22eb2/laser/hpc_kernels/sum.nim#L56-L60. The only concern is embedded targets.
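
For instance, a minimal sketch of that runtime allocation (omp_get_max_threads is OpenMP's standard API; the helper name is made up):

proc omp_get_max_threads(): cint {.importc, header: "omp.h".}

proc allocPartials[T](itemsPerLine: int): seq[T] =
  ## Heap-allocate one padded slot per potential thread at runtime,
  ## instead of a static array sized for a compile-time maximum
  ## (smaller binary, scales to any core count).
  result = newSeq[T](omp_get_max_threads().int * itemsPerLine)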
