OpenMP: allow "#pragma omp" without "parallel for" #9490

Closed
mratsim opened this issue Oct 24, 2018 · 0 comments · Fixed by #9493

mratsim commented Oct 24, 2018

I am not sure if there are future plans to expand OpenMP support in Nim, but here is something I'd like to have:

Context

The Nim code generator compiles all for and while loops to "while (true)" loops in the generated C.

The only exception is Nim's || iterator (doc), which produces #pragma omp parallel for followed by a normal C for loop.

However, there are more OpenMP constructs that require an actual for loop besides #pragma omp parallel for, for example #pragma omp for, #pragma omp simd, #pragma omp taskloop and #pragma omp distribute; see the OpenMP quicksheet.
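
For reference, here is a minimal sketch (illustrative only, not from the compiler sources) of how the current || behaves: the annotation string is appended after "parallel for" in the emitted pragma.

# Sketch of current behaviour (assumes compilation with -d:openmp / -fopenmp):
for i in 0 || 9:                 # emits: #pragma omp parallel for
  discard i
for i in `||`(0, 9, "simd"):     # emits: #pragma omp parallel for simd
  discard i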

Proposed solution

The || iterator should be changed from

iterator `||`*[S, T](a: S, b: T, annotation=""): T {.inline, magic: "OmpParFor", sideEffect.} =

to

iterator `||`*[S, T](a: S, b: T, annotation="parallel for"): T {.inline, magic: "OmpParFor", sideEffect.} =

so that it produces #pragma omp followed by the annotation and a true C for loop.
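
Hypothetically, usage could then look like this (sketch only, assuming the proposal is implemented as described; the annotation becomes the whole pragma suffix):

for i in 0 || 9:                            # default: #pragma omp parallel for
  discard i
for i in `||`(0, 9, "parallel for simd"):   # #pragma omp parallel for simd
  discard i
for i in `||`(0, 9, "simd"):                # #pragma omp simd
  discard i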

Applications

One example application is a sum reduction, since all threads have to share the result variable.
Here are the main concerns with reductions:

  1. We have to update a common variable from all threads.
  2. To avoid locking, each thread should keep track of its own partial sum.
  3. Afterwards, we merge all partial sums into the final result.

Overview of OpenMP solutions

Solution 1: Reference serial code

func reduction_serial(s: seq[int]): int =
  for val in s:
    result += val

Solution 2: using OpenMP reduction clause

OpenMP offers a reduction clause, but it is very restrictive in terms of supported operations:
+, *, min, max. It therefore requires the use of emit to avoid Nim calling addInt.
Furthermore, due to #9365 we need our own name-mangling proc.

The big advantage is that it does all steps 1, 2, 3 for us.

proc reduction_reduction_clause(s: seq[int]): int =
  var sum{.exportc: "sum_" & omp_suffix(genNew = true).} = 0
  const omp_annotation = "reduction(+:sum_" & omp_suffix(genNew = false) & ')'
  # very restrictive in terms of op supported (add, mul, min, max)

  for i in `||`(0, s.len - 1, omp_annotation):
    let si = s[i]
    {.emit: "`sum` += `si`;".}
  return sum

Solution 3: using padding

To be less limited, we can use a shared array or seq to store the partial sums. The main issue is determining its size: we can base it on the maximum number of hardware threads, but within the OpenMP parallel section not all of those threads may actually be used.

For example, assume we allocate newSeq[int](4) on a machine with 4 threads but OpenMP only uses 2 at runtime: for a sum it is fine to add the extra zeroes, but for a product/min/max it is not. Also, omp_get_num_threads() is always 1 outside of a parallel section, and we cannot use it inside an omp parallel for section without calling it for each thread at each iteration.

Additionally, to avoid false sharing, we must pad the partial_sums array or seq so that each partial sum is separated by at least one cache line (64 bytes).

proc reduction_padding(s: seq[int]): int =
  const
    CacheLineSize = 64 # To avoid false sharing, partial_sums must be padded by a CPU cache line size
    Padding = max(1, CacheLineSize div sizeof(int)) # number of ints per cache line

  var partial_sums = newSeq[int](omp_get_max_threads() * Padding)
  # For sum, 0 is a neutral element but if we wanted to do "product"
  # the seq must be initialized with ones in case OpenMP scheduled less
  # threads than omp_get_max_threads.

  doAssert omp_get_num_threads() == 1 # Outside of a parallel section
  for i in 0||(s.len-1):
    # There is no way to get the number of threads used
    # except in the inner loop that will be repeated `s.len` times
    partial_sums[omp_get_thread_num() * Padding] += s[i]

  for i in 0 ..< omp_get_max_threads(): # We assume all threads were used but it might not be true
    result += partial_sums[i * Padding]

Solution 4: most flexible but not working at the moment

By using a local variable we can avoid the padding, omp_get_num_threads and extra allocation issues, at the cost of using a mutex (a critical section) to merge the partial sums.

The main benefit is that we can now split parallel work into:

omp_parallel:
  initialization_code
  actual_for_loop
  finalization_code / merging

proc reduction_localvar(s: seq[int]): int =

  # var num_threads_used: int

  omp_parallel:
    ### initialization
    # # We have a proper way to get the number of threads used
    # # without setting them repeatedly in the for loop
    # omp_master:
    #   num_threads_used = omp_get_num_threads()

    var local_sum = 0 # Variables declared in a parallel region are automatically private to each thread
                      # http://pages.tacc.utexas.edu/~eijkhout/pcse/html/omp-data.html#Default

    ### for loop
    for i in 0||(s.len-1): # Unfortunately this does "omp parallel for" instead of just "omp for"
      local_sum += s[i]

    ### Finalization
    omp_critical: # This will use a mutex and can accept any C code (omp atomic would not work with Nim)
      result += local_sum

But here the nested #pragma omp parallel for, instead of just #pragma omp for, creates another omp parallel scope, so local_sum ends up shared across the inner threads. It would work with just #pragma omp for.
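
For illustration, with the proposed || default the body of reduction_localvar could emit a plain omp for and keep local_sum thread-private (hypothetical sketch, assuming the change is implemented):

  omp_parallel:
    var local_sum = 0                    # private to each thread of the enclosing parallel region
    for i in `||`(0, s.len - 1, "for"):  # would emit: #pragma omp for (no nested parallel region)
      local_sum += s[i]
    omp_critical:
      result += local_sum                # merge the partial sums under a mutex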

Full demo

# Compile with
# `nim c --stackTraces:off -r -d:openmp omp_reduction.nim`
#
# On macOS, the default Clang doesn't support OpenMP; use
# `nim c --stackTraces:off --cc:gcc --gcc.exe:"/usr/local/bin/gcc-7" --gcc.linkerexe:"/usr/local/bin/gcc-7" -r -d:openmp omp_reduction.nim`

import sequtils

when defined(openmp):
  {.passC: "-fopenmp".}
  {.passL: "-fopenmp".}

  {.pragma: omp, header:"omp.h".}

  proc omp_set_num_threads*(x: cint) {.omp.}
  proc omp_get_num_threads*(): cint {.omp.}
  proc omp_get_max_threads*(): cint {.omp.} # This takes hyperthreading into account
  proc omp_get_thread_num*(): cint {.omp.}

else:
  template omp_set_num_threads*(x: cint) = discard
  template omp_get_num_threads*(): cint = 1
  template omp_get_max_threads*(): cint = 1
  template omp_get_thread_num*(): cint = 0

template omp_parallel*(body: untyped): untyped =
  {.emit: "#pragma omp parallel".}
  block:
    body

template omp_critical*(body: untyped): untyped =
  {.emit: "#pragma omp critical".}
  block:
    body

template omp_master*(body: untyped): untyped =
  {.emit: "#pragma omp master".}
  block:
    body

# #########################################
# OMP mangling, workaround for https://github.com/nim-lang/Nim/issues/9365

import random
from strutils import toHex

var mangling_rng {.compileTime.} = initRand(0x1337DEADBEEF)
var current_suffix {.compileTime.} = ""

proc omp_suffix(genNew: static bool = false): static string =
  if genNew:
    current_suffix = mangling_rng.rand(high(uint32)).toHex
  result = current_suffix

# #########################################

func reduction_serial(s: seq[int]): int =
  for val in s:
    result += val

proc reduction_reduction_clause(s: seq[int]): int =
  var sum{.exportc: "sum_" & omp_suffix(genNew = true).} = 0
  const omp_annotation = "reduction(+:sum_" & omp_suffix(genNew = false) & ')'
  # very restrictive in terms of op supported (add, mul, min, max)

  for i in `||`(0, s.len - 1, omp_annotation):
    let si = s[i]
    {.emit: "`sum` += `si`;".}
  return sum


proc reduction_padding(s: seq[int]): int =
  const
    CacheLineSize = 64 # To avoid false sharing, partial_sums must be padded by a CPU cache line size
    Padding = max(1, CacheLineSize div sizeof(int)) # number of ints per cache line

  var partial_sums = newSeq[int](omp_get_max_threads() * Padding)
  # For sum, 0 is a neutral element but if we wanted to do "product"
  # the seq must be initialized with ones in case OpenMP scheduled less
  # threads than omp_get_max_threads

  doAssert omp_get_num_threads() == 1 # Outside of a parallel section
  for i in 0||(s.len-1):
    # There is no way to get the number of threads used
    # except in the inner loop that will be repeated `s.len` times
    partial_sums[omp_get_thread_num() * Padding] += s[i]

  for i in 0 ..< omp_get_max_threads(): # We assume all threads were used but it might not be true
    result += partial_sums[i * Padding]

proc reduction_localvar(s: seq[int]): int =

  # var num_threads_used: int

  omp_parallel:
    ### initialization
    # # We have a proper way to get the number of threads used
    # # without setting them repeatedly in the for loop
    # omp_master:
    #   num_threads_used = omp_get_num_threads()

    var local_sum = 0 # Variables declared in a parallel region are automatically private to each thread
                      # http://pages.tacc.utexas.edu/~eijkhout/pcse/html/omp-data.html#Default

    ### for loop
    for i in 0||(s.len-1): # Unfortunately this does "omp parallel for" instead of just "omp for"
      local_sum += s[i]

    ### Finalization
    omp_critical: # This will use a mutex and can accept any C code (omp atomic would not work with Nim)
      result += local_sum

let s = toSeq(1..100)
let expected = 100 * (100+1) div 2

echo "Expected: ", expected # 5050
echo "Serial: ", s.reduction_serial # 5050
echo "OMP reduction clause: ", s.reduction_reduction_clause # 5050
echo "OMP padding: ", s.reduction_padding # 5050
echo "OMP localvar: ", s.reduction_localvar # 20200 on a dual-core with hyper-threading