nvc issue with Inc #2224

deckerla · 2023-10-05T19:41:37Z

Consider the following MFE:

import devito

shape = (51,41)
nfreq = 5
space_order=4
grid = devito.Grid(shape=shape)
freq_dim = devito.DefaultDimension(name="freq", default_value=nfreq)
u = devito.Function(name="u", dimensions=(*grid.dimensions, freq_dim), 
            grid=grid, shape=(*grid.shape, nfreq), space_order=space_order)
v = devito.Function(name="v", dimensions=(*grid.dimensions, freq_dim), 
            grid=grid, shape=(*grid.shape, nfreq), space_order=space_order)

w = devito.Function(name="w", grid=grid, space_order=space_order)

summation = devito.Inc(w, u*v)

op = devito.Operator([summation])

op.apply()
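
For reference, here is a plain-Python sketch of what the `Inc` above computes, judging from the generated loop nest posted later in this thread: `u*v` is accumulated over the extra `freq` dimension into `w`. The function name `inc_reference` and the small sizes are illustrative assumptions, and the halo and Devito's data layout are ignored.

```python
def inc_reference(u, v, nx, ny, nfreq):
    """Plain-Python stand-in for Inc(w, u*v): reduce u*v over freq into w."""
    w = [[0.0] * ny for _ in range(nx)]
    for x in range(nx):
        for y in range(ny):
            for freq in range(nfreq):
                # w has no freq index, so every freq iteration hits the same w[x][y]
                w[x][y] += u[x][y][freq] * v[x][y][freq]
    return w


# Small stand-ins for shape=(51, 41) and nfreq=5 from the MFE.
nx, ny, nfreq = 3, 2, 5
u = [[[1.0] * nfreq for _ in range(ny)] for _ in range(nx)]
v = [[[2.0] * nfreq for _ in range(ny)] for _ in range(nx)]
w = inc_reference(u, v, nx, ny, nfreq)
print(w[0][0])  # 10.0, i.e. 1.0 * 2.0 summed over 5 frequencies
```

In the MFE itself `u` and `v` are never filled, so `w` would stay at zero; the values here are only chosen to make the accumulation visible.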

Using gcc this runs without an issue:

cvx@cbox-lukedecker-baredevitocuda:~/.julia/dev/JetPackDevitoPSD/test$ DEVITO_ARCH=gcc python mfe.py 
Operator `Kernel` ran in 0.01 s

But with nvc, which is required for many devitopro features, I get the following internal compiler error:

cvx@cbox-lukedecker-baredevitocuda:~/.julia/dev/JetPackDevitoPSD/test$ DEVITO_ARCH=nvc python mfe.py 
NVC++-F-0000-Internal compiler error. unhandled size for preparing constant 0     255  (/tmp/devito-jitcache-uid1000/a444fd3b436666de6e2079acc59b28052a8a030c.cpp: 49)
NVC++/x86-64 Linux 23.7-0: compilation aborted
FAILED compiler invocation: nvc++ -g -fPIC -std=c++11 -mp -fast -shared /tmp/devito-jitcache-uid1000/a444fd3b436666de6e2079acc59b28052a8a030c.cpp -lm -o /tmp/devito-jitcache-uid1000/a444fd3b436666de6e2079acc59b28052a8a030c.so
Traceback (most recent call last):
  File "/home/cvx/.julia/dev/JetPackDevitoPSD/test/mfe.py", line 19, in <module>
    op.apply()
  File "/home/cvx/.conda/envs/conda_jl/lib/python3.10/site-packages/devito/operator/operator.py", line 832, in apply
    cfunction = self.cfunction
  File "/home/cvx/.conda/envs/conda_jl/lib/python3.10/site-packages/devito/operator/operator.py", line 714, in cfunction
    self._jit_compile()
  File "/home/cvx/.conda/envs/conda_jl/lib/python3.10/site-packages/devito/operator/operator.py", line 699, in _jit_compile
    recompiled, src_file = self._compiler.jit_compile(self._soname,
  File "/home/cvx/.conda/envs/conda_jl/lib/python3.10/site-packages/devito/arch/compiler.py", line 360, in jit_compile
    _, _, _, recompiled = compile_from_string(self, target, code, src_file,
  File "/home/cvx/.conda/envs/conda_jl/lib/python3.10/site-packages/codepy/jit.py", line 439, in compile_from_string
    toolchain.build_extension(ext_file, source_paths, debug=debug)
  File "/home/cvx/.conda/envs/conda_jl/lib/python3.10/site-packages/codepy/toolchain.py", line 211, in build_extension
    raise CompileError("module compilation failed")
codepy.CompileError: module compilation failed

This is where things get even more interesting.
If I change the script to use devitopro, I have the following MFE:

import devito
import devitopro
shape = (51,41)
nfreq = 5
space_order=4
grid = devito.Grid(shape=shape)
freq_dim = devito.DefaultDimension(name="freq", default_value=nfreq)
u = devitopro.Function(name="u", dimensions=(*grid.dimensions, freq_dim), 
            grid=grid, shape=(*grid.shape, nfreq), space_order=space_order)
v = devitopro.Function(name="v", dimensions=(*grid.dimensions, freq_dim), 
            grid=grid, shape=(*grid.shape, nfreq), space_order=space_order)

w = devitopro.Function(name="w", grid=grid, space_order=space_order)

summation = devito.Inc(w, u*v)

op = devito.Operator([summation])

op.apply()

This segfaults when run with nvc!

cvx@cbox-lukedecker-baredevitocuda:~/.julia/dev/JetPackDevitoPSD/test$ DEVITO_ARCH=nvc python mfe.py 
Segmentation fault (core dumped)

The operator compiles and runs if I use gcc.


deckerla · 2023-10-05T19:43:27Z

Here's the generated C code that hits the segfault. Using the JIT backdoor, it looks like it's tied to the reduction.
This impacts some stuff that we run in production. @FabioLuporini @mloubout

#define _POSIX_C_SOURCE 200809L
#define uL0(x, y, freq) u[(freq) + (x)*y_stride0 + (y)*freq_stride0]
#define vL0(x, y, freq) v[(freq) + (x)*y_stride0 + (y)*freq_stride0]
#define wL0(x, y) w[(x)*y_stride1 + (y)]
#define START_TIMER(S) struct timeval start_ ## S , end_ ## S ; gettimeofday(&start_ ## S , NULL);
#define STOP_TIMER(S,T) gettimeofday(&end_ ## S, NULL); T->S += (double)(end_ ## S .tv_sec-start_ ## S.tv_sec)+(double)(end_ ## S .tv_usec-start_ ## S .tv_usec)/1000000;

#include "stdlib.h"
#include "math.h"
#include "sys/time.h"
#include "omp.h"

struct dataobj
{
  void *restrict data;
  unsigned long * size;
  unsigned long * npsize;
  unsigned long * dsize;
  int * hsize;
  int * hofs;
  int * oofs;
  void * dmap;
} ;

struct profiler
{
  double section0;
} ;

extern "C" int Kernel(struct dataobj *restrict u_vec, struct dataobj *restrict v_vec, struct dataobj *restrict w_vec, const int freq_M, const int freq_m, const int x_M, const int x_m, const int y_M, const int y_m, const int nthreads, struct profiler * timers);


int Kernel(struct dataobj *restrict u_vec, struct dataobj *restrict v_vec, struct dataobj *restrict w_vec, const int freq_M, const int freq_m, const int x_M, const int x_m, const int y_M, const int y_m, const int nthreads, struct profiler * timers)
{
  float *u __attribute__ ((aligned (64))) = (float *) u_vec->data;
  float *v __attribute__ ((aligned (64))) = (float *) v_vec->data;
  float *w __attribute__ ((aligned (64))) = (float *) w_vec->data;

  const long y_fsz0 = w_vec->size[1];
  const long freq_fsz0 = v_vec->size[2];

  const long y_stride0 = freq_fsz0*y_fsz0;
  const long freq_stride0 = freq_fsz0;
  const long y_stride1 = y_fsz0;

  /* Begin section0 */
  START_TIMER(section0)
  for (int x = x_m; x <= x_M; x += 1)
  {
    #pragma omp parallel num_threads(nthreads)
    {
      #pragma omp for collapse(2) schedule(static,1) reduction(+:w[0:w_vec->size[0]*w_vec->size[1]])
      for (int y = y_m; y <= y_M; y += 1)
      {
        for (int freq = freq_m; freq <= freq_M; freq += 1)
        {
          wL0(x + 4, y + 4) += uL0(x + 4, y + 4, freq)*vL0(x + 4, y + 4, freq);
        }
      }
    }
  }
  STOP_TIMER(section0,timers)
  /* End section0 */

  return 0;
}
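
To repeat this kind of inspection, the generated source first has to be located. A minimal sketch, assuming the JIT cache is a flat directory of `.cpp` files like the `/tmp/devito-jitcache-uid1000/` path in the failing compiler invocation above: pick the most recently modified one. The helper name `latest_jit_source` is hypothetical, and the demo uses a temporary directory with made-up file names rather than a real cache. As far as I know, the backdoor itself is enabled by setting `DEVITO_JIT_BACKDOOR=1`, after which Devito recompiles the edited file instead of regenerating it.

```python
import os
import tempfile
from pathlib import Path


def latest_jit_source(cache_dir):
    """Return the most recently modified .cpp file in cache_dir, or None."""
    cache = Path(cache_dir)
    if not cache.is_dir():
        return None
    sources = sorted(cache.glob("*.cpp"), key=lambda p: p.stat().st_mtime)
    return sources[-1] if sources else None


# Demo on a throwaway directory standing in for the JIT cache.
with tempfile.TemporaryDirectory() as tmp:
    for name, mtime in (("a.cpp", 1000), ("b.cpp", 2000)):
        path = Path(tmp) / name
        path.write_text("// generated kernel placeholder\n")
        os.utime(path, (mtime, mtime))  # force a known modification order
    newest = latest_jit_source(tmp)
    missing = latest_jit_source(Path(tmp) / "does-not-exist")

print(newest.name)
```

On a real system the argument would be the cache directory printed in the compiler invocation, which differs per user.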

deckerla · 2023-10-05T19:44:22Z

Also, it is strange that the OMP parallel block isn't placed outside the for loop over x.

mloubout · 2023-10-05T20:47:27Z

Thanks for the MFE, will work on it.

Regarding the last point about the parallel region over y: this depends on your system and on how many collapsible loops are considered OK (number of cores and such). Since it's a reduction and can't be SIMD-vectorized, the collapse counts loops starting from the innermost one and stops at the maximum collapse depth.
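
To illustrate why collapsing `y` and `freq` forces a reduction clause, here is a toy Python model of `collapse(2) schedule(static,1)` for a single `x`. It is a sketch under simplifying assumptions, not a description of what any OpenMP runtime actually does: threads are simulated sequentially, iterations are dealt out round-robin, and each simulated thread accumulates into its own private copy of the `w` row before the copies are summed. The function name `collapsed_reduction` is hypothetical.

```python
def collapsed_reduction(u_row, v_row, ny, nfreq, nthreads):
    """Toy model of `omp for collapse(2) schedule(static,1) reduction(+:w[...])`."""
    # One private, zero-initialised copy of the w row per simulated thread.
    private = [[0.0] * ny for _ in range(nthreads)]
    for k in range(ny * nfreq):      # the collapsed (y, freq) iteration space
        y, freq = divmod(k, nfreq)
        tid = k % nthreads           # static,1: iterations dealt out round-robin
        private[tid][y] += u_row[y][freq] * v_row[y][freq]
    # Reduction step: combine the private copies into the shared row.
    return [sum(private[t][y] for t in range(nthreads)) for y in range(ny)]


ny, nfreq, nthreads = 4, 5, 3
u_row = [[float(y + 1)] * nfreq for y in range(ny)]
v_row = [[1.0] * nfreq for _ in range(ny)]

row = collapsed_reduction(u_row, v_row, ny, nfreq, nthreads)
print(row)  # [5.0, 10.0, 15.0, 20.0]

# Simulated threads that touch w[y=0]: more than one, so unprotected writes would race.
writers_y0 = {k % nthreads for k in range(nfreq)}
print(sorted(writers_y0))  # [0, 1, 2]
```

If only `y` were parallelized and `freq` ran sequentially inside each thread, every `w` entry would have a single writer and no reduction clause would be needed, at the cost of less exposed parallelism.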

mloubout mentioned this issue Oct 6, 2023

compiler: prevent reduction clause for perfect-enough outer loops #2226

Merged

mloubout closed this as completed in #2226 Oct 10, 2023
