nvc issue with Inc #2224

Closed
deckerla opened this issue Oct 5, 2023 · 3 comments · Fixed by #2226

deckerla commented Oct 5, 2023

Consider the following MFE:

```python
import devito

shape = (51, 41)
nfreq = 5
space_order = 4
grid = devito.Grid(shape=shape)
freq_dim = devito.DefaultDimension(name="freq", default_value=nfreq)
u = devito.Function(name="u", dimensions=(*grid.dimensions, freq_dim),
                    grid=grid, shape=(*grid.shape, nfreq), space_order=space_order)
v = devito.Function(name="v", dimensions=(*grid.dimensions, freq_dim),
                    grid=grid, shape=(*grid.shape, nfreq), space_order=space_order)

w = devito.Function(name="w", grid=grid, space_order=space_order)

summation = devito.Inc(w, u*v)

op = devito.Operator([summation])

op.apply()
```
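For reference, the operator is expected to compute, at every grid point, `w[x, y] += sum over freq of u[x, y, freq]*v[x, y, freq]`: since `freq` is not one of `w`'s dimensions, the `Inc` accumulates over it. The pure-Python sketch below spells out that semantics on small hypothetical inputs, with nested lists standing in for the Function data and halos ignored. `inc_reduce` is a hypothetical helper for illustration, not Devito code.

```python
def inc_reduce(w, u, v):
    """Hypothetical emulation of devito.Inc(w, u*v) with an extra freq dimension.

    w is a 2D list indexed [x][y]; u and v are 3D lists indexed [x][y][freq].
    freq does not appear in w, so it is summed over (a reduction).
    """
    for x in range(len(w)):
        for y in range(len(w[0])):
            for f in range(len(u[x][y])):
                w[x][y] += u[x][y][f] * v[x][y][f]
    return w


# Tiny hypothetical inputs: shape (2, 3), nfreq = 5, u = 1, v = freq index
nx, ny, nfreq = 2, 3, 5
u = [[[1.0] * nfreq for _ in range(ny)] for _ in range(nx)]
v = [[[float(f) for f in range(nfreq)] for _ in range(ny)] for _ in range(nx)]
w = [[0.0] * ny for _ in range(nx)]
inc_reduce(w, u, v)
print(w[0][0])  # 0 + 1 + 2 + 3 + 4 = 10.0
```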

With gcc this runs without issue:

```console
cvx@cbox-lukedecker-baredevitocuda:~/.julia/dev/JetPackDevitoPSD/test$ DEVITO_ARCH=gcc python mfe.py
Operator `Kernel` ran in 0.01 s
```

But with nvc, which is required for many devitopro features, compilation fails:

```console
cvx@cbox-lukedecker-baredevitocuda:~/.julia/dev/JetPackDevitoPSD/test$ DEVITO_ARCH=nvc python mfe.py
NVC++-F-0000-Internal compiler error. unhandled size for preparing constant 0     255  (/tmp/devito-jitcache-uid1000/a444fd3b436666de6e2079acc59b28052a8a030c.cpp: 49)
NVC++/x86-64 Linux 23.7-0: compilation aborted
FAILED compiler invocation: nvc++ -g -fPIC -std=c++11 -mp -fast -shared /tmp/devito-jitcache-uid1000/a444fd3b436666de6e2079acc59b28052a8a030c.cpp -lm -o /tmp/devito-jitcache-uid1000/a444fd3b436666de6e2079acc59b28052a8a030c.so
Traceback (most recent call last):
  File "/home/cvx/.julia/dev/JetPackDevitoPSD/test/mfe.py", line 19, in <module>
    op.apply()
  File "/home/cvx/.conda/envs/conda_jl/lib/python3.10/site-packages/devito/operator/operator.py", line 832, in apply
    cfunction = self.cfunction
  File "/home/cvx/.conda/envs/conda_jl/lib/python3.10/site-packages/devito/operator/operator.py", line 714, in cfunction
    self._jit_compile()
  File "/home/cvx/.conda/envs/conda_jl/lib/python3.10/site-packages/devito/operator/operator.py", line 699, in _jit_compile
    recompiled, src_file = self._compiler.jit_compile(self._soname,
  File "/home/cvx/.conda/envs/conda_jl/lib/python3.10/site-packages/devito/arch/compiler.py", line 360, in jit_compile
    _, _, _, recompiled = compile_from_string(self, target, code, src_file,
  File "/home/cvx/.conda/envs/conda_jl/lib/python3.10/site-packages/codepy/jit.py", line 439, in compile_from_string
    toolchain.build_extension(ext_file, source_paths, debug=debug)
  File "/home/cvx/.conda/envs/conda_jl/lib/python3.10/site-packages/codepy/toolchain.py", line 211, in build_extension
    raise CompileError("module compilation failed")
codepy.CompileError: module compilation failed
```

Now things get even more interesting.
If I change the script to use devitopro Functions, I get the following MFE:

```python
import devito
import devitopro
shape = (51, 41)
nfreq = 5
space_order = 4
grid = devito.Grid(shape=shape)
freq_dim = devito.DefaultDimension(name="freq", default_value=nfreq)
u = devitopro.Function(name="u", dimensions=(*grid.dimensions, freq_dim),
                       grid=grid, shape=(*grid.shape, nfreq), space_order=space_order)
v = devitopro.Function(name="v", dimensions=(*grid.dimensions, freq_dim),
                       grid=grid, shape=(*grid.shape, nfreq), space_order=space_order)

w = devitopro.Function(name="w", grid=grid, space_order=space_order)

summation = devito.Inc(w, u*v)

op = devito.Operator([summation])

op.apply()
```

Running it with nvc now segfaults!

```console
cvx@cbox-lukedecker-baredevitocuda:~/.julia/dev/JetPackDevitoPSD/test$ DEVITO_ARCH=nvc python mfe.py
Segmentation fault (core dumped)
```

With gcc, the operator compiles and runs.

deckerla commented Oct 5, 2023

Here's the generated C code that segfaults. Using the JIT backdoor, it looks like the crash is tied to the reduction.
This impacts some workloads that we run in production. @FabioLuporini @mloubout

```cpp
#define _POSIX_C_SOURCE 200809L
#define uL0(x, y, freq) u[(freq) + (x)*y_stride0 + (y)*freq_stride0]
#define vL0(x, y, freq) v[(freq) + (x)*y_stride0 + (y)*freq_stride0]
#define wL0(x, y) w[(x)*y_stride1 + (y)]
#define START_TIMER(S) struct timeval start_ ## S , end_ ## S ; gettimeofday(&start_ ## S , NULL);
#define STOP_TIMER(S,T) gettimeofday(&end_ ## S, NULL); T->S += (double)(end_ ## S .tv_sec-start_ ## S.tv_sec)+(double)(end_ ## S .tv_usec-start_ ## S .tv_usec)/1000000;

#include "stdlib.h"
#include "math.h"
#include "sys/time.h"
#include "omp.h"

struct dataobj
{
  void *restrict data;
  unsigned long * size;
  unsigned long * npsize;
  unsigned long * dsize;
  int * hsize;
  int * hofs;
  int * oofs;
  void * dmap;
} ;

struct profiler
{
  double section0;
} ;

extern "C" int Kernel(struct dataobj *restrict u_vec, struct dataobj *restrict v_vec, struct dataobj *restrict w_vec, const int freq_M, const int freq_m, const int x_M, const int x_m, const int y_M, const int y_m, const int nthreads, struct profiler * timers);


int Kernel(struct dataobj *restrict u_vec, struct dataobj *restrict v_vec, struct dataobj *restrict w_vec, const int freq_M, const int freq_m, const int x_M, const int x_m, const int y_M, const int y_m, const int nthreads, struct profiler * timers)
{
  float *u __attribute__ ((aligned (64))) = (float *) u_vec->data;
  float *v __attribute__ ((aligned (64))) = (float *) v_vec->data;
  float *w __attribute__ ((aligned (64))) = (float *) w_vec->data;

  const long y_fsz0 = w_vec->size[1];
  const long freq_fsz0 = v_vec->size[2];

  const long y_stride0 = freq_fsz0*y_fsz0;
  const long freq_stride0 = freq_fsz0;
  const long y_stride1 = y_fsz0;

  /* Begin section0 */
  START_TIMER(section0)
  for (int x = x_m; x <= x_M; x += 1)
  {
    #pragma omp parallel num_threads(nthreads)
    {
      #pragma omp for collapse(2) schedule(static,1) reduction(+:w[0:w_vec->size[0]*w_vec->size[1]])
      for (int y = y_m; y <= y_M; y += 1)
      {
        for (int freq = freq_m; freq <= freq_M; freq += 1)
        {
          wL0(x + 4, y + 4) += uL0(x + 4, y + 4, freq)*vL0(x + 4, y + 4, freq);
        }
      }
    }
  }
  STOP_TIMER(section0,timers)
  /* End section0 */

  return 0;
}
```
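To make the generated code easier to follow, the pure-Python sketch below emulates `Kernel` on flat row-major buffers. The `uL0`/`vL0`/`wL0` macros become offset computations (note that the C name `y_stride0` is actually the stride of `x`, and `freq_stride0` the stride of `y`). The `reduction(+:w[...])` clause is modelled the way OpenMP defines array-section reductions: each thread gets a private, zero-initialised copy of the whole section, and those copies are summed back into `w` when the region ends. `collapse(2) schedule(static,1)` is modelled as a round-robin over the flattened `(y, freq)` space. A halo of 4 in `x` and `y` (inferred from the `+ 4` offsets), no halo in `freq`, loop lower bounds of 0, and all names here are assumptions; this illustrates the semantics, not how Devito or nvc implement them.

```python
def kernel_emulation(u, v, w, nx, ny, nf, halo, nthreads):
    """Hypothetical emulation of the generated Kernel on flat row-major lists.

    u, v: (nx + 2*halo) * (ny + 2*halo) * nf floats, laid out [x][y][freq]
    w:    (nx + 2*halo) * (ny + 2*halo) floats, laid out [x][y], updated in place
    """
    x_fsz, y_fsz = nx + 2 * halo, ny + 2 * halo
    x_stride_uv = y_fsz * nf  # named y_stride0 in the C code
    y_stride_uv = nf          # named freq_stride0 in the C code
    x_stride_w = y_fsz        # named y_stride1 in the C code

    def uL0(x, y, f):
        return u[f + x * x_stride_uv + y * y_stride_uv]

    def vL0(x, y, f):
        return v[f + x * x_stride_uv + y * y_stride_uv]

    for x in range(nx):  # sequential x loop, outside the parallel region
        # reduction(+:w[0:size0*size1]): each thread works on a private,
        # zero-initialised copy of the WHOLE section, not just row x + halo.
        private = [[0.0] * (x_fsz * y_fsz) for _ in range(nthreads)]
        # collapse(2) flattens (y, freq) into ny*nf iterations;
        # schedule(static,1) hands iteration k to thread k % nthreads.
        for k in range(ny * nf):
            t = k % nthreads
            y, f = divmod(k, nf)
            idx = (x + halo) * x_stride_w + (y + halo)  # wL0(x + 4, y + 4)
            private[t][idx] += (uL0(x + halo, y + halo, f)
                                * vL0(x + halo, y + halo, f))
        # End of the region: the private copies are summed back into w.
        for t in range(nthreads):
            for i, val in enumerate(private[t]):
                w[i] += val
    return w


# Small hypothetical sizes; u = 1 and v = 2 everywhere
nx, ny, nf, halo, nthreads = 3, 4, 5, 4, 2
npad = (nx + 2 * halo) * (ny + 2 * halo)
u = [1.0] * (npad * nf)
v = [2.0] * (npad * nf)
w = [0.0] * npad
kernel_emulation(u, v, w, nx, ny, nf, halo, nthreads)
print(w[(0 + halo) * (ny + 2 * halo) + halo])  # 5 * (1.0 * 2.0) = 10.0
print(sum(w))  # 12 interior points * 10.0 = 120.0
```

Halo entries of `w` stay at zero; only the `nx * ny` interior points receive the reduced value.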

deckerla commented Oct 5, 2023

Also, it is strange that the OpenMP parallel region is opened inside the loop over x rather than outside it.
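That placement also has a cost worth quantifying. Under OpenMP's array-section reduction semantics, every entry into the region gives each thread a private, zeroed copy of the whole `w` section, even though one pass of the x loop only updates a single row. The arithmetic below assumes a halo equal to `space_order` in x and y, no padding, and a hypothetical count of 8 threads. In a straightforward implementation the combine step at the end of each region costs about as much again. This says nothing about the nvc crash itself.

```python
# Back-of-the-envelope cost of entering the parallel region once per x.
# Assumes OpenMP array-section reduction semantics, a halo equal to
# space_order in x and y, no padding, and a hypothetical thread count.
shape = (51, 41)
nfreq = 5
space_order = 4
nthreads = 8  # hypothetical

padded = tuple(n + 2 * space_order for n in shape)  # size of w incl. halo
section_len = padded[0] * padded[1]                  # floats per private copy
regions = shape[0]                                   # one parallel region per x
private_inits = regions * nthreads * section_len     # zeroed on every entry
useful_updates = shape[0] * shape[1] * nfreq         # multiply-adds actually needed

print(padded, section_len, private_inits, useful_updates)
# (59, 49) 2891 1179528 10455
```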

mloubout commented Oct 5, 2023

Thanks for the MFE, will work on it.

Regarding your last point about the parallel region being over y: this depends on your system and on how many collapsible loops are considered worthwhile (number of cores and such). Since it's a reduction and can't be SIMD-vectorized, it counts collapsible loops starting from the innermost one and stops at the maximum collapse depth.
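The rule described above can be sketched as follows. The sketch assumes the loop nest is `x, y, freq` from outermost to innermost and treats `max_collapse` as a given input; how Devito derives it from the core count is not shown here. With a limit of 2 it reproduces the layout of the generated code: `x` stays sequential and the parallel region wraps a `collapse(2)` over `y` and `freq`. With 3, the whole nest would be collapsed. This illustrates the description only, not Devito's implementation, and `place_parallel_region` is a hypothetical name.

```python
def place_parallel_region(loops, max_collapse):
    """Split a loop nest (ordered outermost -> innermost) into the loops that
    stay sequential outside the OpenMP region and the loops collapsed inside it.

    Hypothetical sketch of the rule described above: count collapsible loops
    from the innermost outwards and stop at max_collapse (assumed >= 1).
    """
    k = min(max_collapse, len(loops))
    return loops[:len(loops) - k], loops[len(loops) - k:]


for max_collapse in (2, 3):
    sequential, collapsed = place_parallel_region(["x", "y", "freq"], max_collapse)
    print(f"max_collapse={max_collapse}: sequential={sequential}, "
          f"collapse({len(collapsed)}) over {collapsed}")
# max_collapse=2: sequential=['x'], collapse(2) over ['y', 'freq']
# max_collapse=3: sequential=[], collapse(3) over ['x', 'y', 'freq']
```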
