OpenMP: Memory optimization of intermediate result #261

Open · mratsim opened this issue Jul 31, 2018 · 1 comment
mratsim commented Jul 31, 2018

To avoid false sharing / bank conflicts / cache thrashing when multiple threads read and write data in the same cache line, an intermediate array is used, with the intermediate values padded so that each lands on its own cache line. (See #139)

Tensor x: 1_000_000 floats
             |
             |
             v
intermediate reduction: 4 floats (for a quad-core CPU)
             |
             |
             v
Final result: 1 float

Given that cache lines are 64 bytes on all mainstream CPUs, to reduce memory consumption elements of type T should be spaced by n = max(64 div sizeof(T), 1) ==> 8 for float64/int64, 16 for float32/int32, 2 for a 32-byte uint256, and 1 for any type of 64 bytes or more.
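
A minimal sketch of that compile-time spacing computation, assuming sizeof(T) is usable at compile time (true for primitive types; the names are illustrative, not Arraymancer's):

const CacheLineBytes = 64

func elemSpacing(T: typedesc): int =
  ## How many elements of T to skip so that consecutive
  ## partial results land on different cache lines.
  max(CacheLineBytes div sizeof(T), 1)

static:
  doAssert elemSpacing(float64) == 8   # 64 / 8
  doAssert elemSpacing(float32) == 16  # 64 / 4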

Unfortunately, sizeof(T) only works for primitive types and, since nim-lang/Nim#8445, for arrays of primitive types. For Tensors of custom objects it is still pending nim-lang/Nim#5664.

In the meantime, the intermediate array elements are arbitrarily spaced by 16 (maxItemsPerCacheLine), which wastes a lot of space when T is a tensor (a couple hundred bytes in size): a 200-byte element padded by 16 reserves 3,200 bytes per slot instead of the 200 actually needed.

template omp_parallel_reduce_blocks*[T](reduced: T, block_offset, block_size: untyped, size, weight: Natural, op_final, op_init, op_middle: untyped): untyped =
  # To prevent false sharing, results will be stored in an array but
  # padded to be at least a cache line apart.
  # All CPU cache lines are 64B: 16 float32/int32 or 8 float64/int64 fit.
  # TODO: compile-time evaluation depending on sizeof(T)
  # Pending https://github.com/nim-lang/Nim/pull/5664
  const maxItemsPerCacheLine = 16
  let ompsize = size # ensure that a computed `size` is only evaluated once
  if likely(ompsize > 0):
    block ompblocks:
      when defined(openmp):
        if ompsize * weight >= OMP_FOR_THRESHOLD:
          let num_blocks = min(min(ompsize, omp_get_max_threads()), OMP_MAX_REDUCE_BLOCKS)
          if num_blocks > 1:
            withMemoryOptimHints()
            var results{.align64, noInit.}: array[OMP_MAX_REDUCE_BLOCKS * maxItemsPerCacheLine, type(reduced)]
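
To illustrate the layout the template sets up, here is a self-contained, sequential stand-in (illustrative names, not Arraymancer's actual code): each block accumulates into a slot spaced one cache line apart, exactly where each OpenMP thread would write.

const ItemsPerLine = 64 div sizeof(float32)  # 16 float32 per 64 B cache line

proc paddedSum(x: openArray[float32], numBlocks: int): float32 =
  ## Each block writes its partial sum to partials[b * ItemsPerLine],
  ## so no two partials share a cache line (no false sharing once
  ## the blocks run on separate threads).
  var partials = newSeq[float32](numBlocks * ItemsPerLine)
  let blockSize = x.len div numBlocks
  for b in 0 ..< numBlocks:            # each iteration = one OpenMP thread
    let start = b * blockSize
    let stop = if b == numBlocks - 1: x.len else: start + blockSize
    for i in start ..< stop:
      partials[b * ItemsPerLine] += x[i]
  for b in 0 ..< numBlocks:            # final merge of the padded partials
    result += partials[b * ItemsPerLine]

when isMainModule:
  var x = newSeq[float32](1_000_000)
  for i in 0 ..< x.len: x[i] = 1.0
  echo paddedSum(x, 4)  # 1000000.0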

mratsim commented Oct 25, 2018

Implemented in nim-lang/Nim#9356. Note that nim-lang/Nim#9493 is an alternative: it lets each thread keep a local variable and merge the partial reductions in a #pragma omp critical section.

This would allow much better scaling on processors with more than 8 cores (the current reduction-block limit).
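
A self-contained sketch of that alternative, using Nim threads and a lock where OpenMP would use #pragma omp critical (illustrative code, not the actual nim-lang/Nim#9493 implementation; compile with --threads:on):

import std/locks

type Chunk = tuple[data: ptr UncheckedArray[float32], a, b: int]

var
  total: float32        # shared accumulator, protected by totalLock
  totalLock: Lock

proc worker(c: Chunk) {.thread.} =
  var localSum: float32 = 0          # thread-local partial: no false sharing
  for i in c.a ..< c.b:
    localSum += c.data[i]
  withLock totalLock:                # the "critical section" merge
    total += localSum

proc lockedSum(x: seq[float32]): float32 =
  const NumThreads = 4
  var threads: array[NumThreads, Thread[Chunk]]
  initLock totalLock
  total = 0
  let p = cast[ptr UncheckedArray[float32]](unsafeAddr x[0])
  let chunk = x.len div NumThreads
  for t in 0 ..< NumThreads:
    let a = t * chunk
    let b = if t == NumThreads - 1: x.len else: a + chunk
    createThread(threads[t], worker, (data: p, a: a, b: b))
  joinThreads(threads)
  deinitLock totalLock
  result = total

when isMainModule:
  var x = newSeq[float32](1_000_000)
  for i in 0 ..< x.len: x[i] = 1.0
  echo lockedSum(x)  # 1000000.0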

This is being implemented in Laser, Arraymancer's future backend: see the parallel reduction example.

Note that runtime allocation is probably better anyway: it supports a high number of cores while keeping the binary small, similar to the sum kernel: https://github.com/numforge/laser/blob/e9bbb1055517c6e76649b67737d7b8807af22eb2/laser/hpc_kernels/sum.nim#L56-L60. The only concern is embedded targets.
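
For instance, a minimal sketch of that runtime allocation (omp_get_max_threads is OpenMP's standard API; the helper name is made up):

proc omp_get_max_threads(): cint {.importc, header: "omp.h".}

proc allocPartials[T](itemsPerLine: int): seq[T] =
  ## Heap-allocate one padded slot per potential thread at runtime,
  ## instead of a static array sized for a compile-time maximum
  ## (smaller binary, scales to any core count).
  result = newSeq[T](omp_get_max_threads().int * itemsPerLine)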
