
PGI Compiler Issues, Bugs, etc.

pgimwolfe edited this page Jul 14, 2015 · 33 revisions

Sadly, I find myself making a list of new bugs found at the hackathon. Thank you all for your help.

Private arrays in adjacent inner acc loop

Example:

real :: t(10)
!$acc parallel loop gang
do j = 1, n
    !$acc loop vector private(t)
    do i1 = 1, m
        do k = 1, 10
           t(k) = ...
        enddo
    enddo
    !$acc loop vector private(t)
    do i2 = 1, m
       .... t ....
    enddo
enddo

The compiler decides it needs to allocate space for the private version of 't' for the i1 loop and save its address. Then it decides it needs to allocate space for a private version of 't' for the i2 loop and save its address. That's bad enough, but then it saves the second address in the same place as the first address. After the loop, it frees up that memory, and since it allocated two blocks of memory, there are two deallocate calls. But since it saved both addresses in the same temp pointer, it tries to free that second block twice. Havoc follows.

This actually exposed three more bugs in the compiler. One: if those i1 and i2 loops had constant loop limits (say 32), the compiler would still generate a launch schedule of vector(128), essentially wasting 96 of those vector lanes. This is not a correctness bug, but it's annoying and stupid. Two: each vector lane has to find the portion of the private array that it should use. It takes the starting point of the block and computes an offset by multiplying the size of each copy (10 elements above, or 40 bytes) by the index of the thread executing this vector lane, computed as blockIdx.x*blockDim.x + threadIdx.x. This is usually right, but what if that loop has fewer iterations than the launch configuration? For instance, suppose the compiler generates a vector(128) launch configuration, but 'm' is only 90. Then the compiler allocates space for n*90 copies of 't', but the generated code computes its offset as blockIdx.x*128 + threadIdx.x, so it generates out-of-bounds accesses. Three: if the launch is asynchronous, the private array allocate/deallocate must also be asynchronous. We had thought about this before, but didn't realize how important it was, so now we'll address it.

FS 21779

Fortran initialized data in 'acc routine' routines fails

Example:

subroutine sub( a, n )
   !$acc routine seq
   real :: a(*)
   integer :: n
   integer :: x = 0
   ....
end subroutine

The problem is that the initializer in 'integer :: x = 0' implicitly gives 'x' the Fortran save attribute, meaning it needs to be statically allocated. The PGI compiler currently does not support saved variables in Fortran 'acc routine' subprograms. That's bad enough, but it fails in a bad way. We'll work on adding a useful message to this effect. In this case, the fix is to initialize x at runtime:

    integer :: x
    x = 0

FS21780

Formatted IO not allowed in Fortran acc routine subprograms

Example:

subroutine sub( a, n )
    real :: a(*)
    integer :: n
    . . .
    write(*,"(i0,f12.5)") i, a(i)
    . . .
end subroutine

The compiler gives a message that says something like 'call to routine without acc routine: pgf90_encode_format'. This isn't very useful. We'll work on adding a better message for this. Full support for Fortran IO will take a while.

FS21781

Calls to routines in acc routine subprograms can give Internal Compiler Error

I'd thought we had caught all cases of this, but apparently not. There are cases where, when compiling a subroutine or function with 'acc routine' for the device, a call to another subroutine or function that does not have 'acc routine' makes the compiler issue an internal compiler error. It should instead give a useful error message, such as telling the user to add 'acc routine' to the called subroutine or function. It's not a compiler error at all, and a useful message would let the programmer make better progress.

FS 21782

Fortran array assignments in acc kernels constructs don't always work

We are aware of this. It has to do with the compiler temps that need to be created to allow for overlap between the array assignment right hand side expressions and left hand side targets. In many (most) cases, the compiler can get rid of these temps, but not always. We are looking at whether we can add a flag to tell the compiler that the temps are not needed for this file, which should help this issue. Fixing the real problem, handling those temps properly, is under design now.

FS 21783

Fortran where blocks in acc kernels constructs don't always work

This is the same issue as FS 21783 above.

acc update device(...) if(any(A(:,:).eq.3)) fails badly at compile time.

We'll create a reproducer example to debug this.

FS 21773

Fortran inlining will fail when passing scalar actual argument to array dummy argument

The compiler gives an error message about the datatype not matching, but the real problem is that it won't inline when a scalar actual argument is passed to an array dummy argument. It's unlikely that we will fix the inliner in the near future, but we will try to give a better message.

subroutine sub( a, b, c )
    real :: a(*), b(*), c(*)
    ...
end subroutine
...
real x, y(2), z(3)
call sub( x, y, z )
! here, scalar x is passed to array dummy argument a, preventing the inlining

BL to open

pow math function is not always recognized as a builtin

In C and C++, if you include <accelmath.h>, most of the common math.h functions (sin, sinf, exp, expf, ...) will be recognized as intrinsic functions for both the CPU and GPU, allowing the compiler to generate efficient inline code where appropriate. The pow function in C++ has several other versions, such as pow(float,float), pow(double,double), pow(double,int), pow(double,long), and so on. The powf(float,float) and pow(double,double) versions will be recognized, the others likely will not, and that may cause problems at compile time.

can't recreate at this time

compiler sometimes gives no reason for not vectorizing a loop

This is not necessarily a bug, but it's annoying; the compiler should report why vector code was not generated.

MW to find reproducing example

If the only use of a pointer is in a data clause, the pointer assignment is apparently not preserved

This particularly affects pgc++ (or pgcpp). It doesn't affect pgcc so much. Use case:

float* p = inx.p;
#pragma acc data copyin(p[0:n])
{
   ... // the p = inx.p assignment is getting deleted, and the data clause
       // is using garbage pointer values
}

MW to create example, fixed during the week.

C++ class member in host_data clause isn't working

class C {
    float* x;
    . . .
    void foo(){
        #pragma acc host_data use_device(x)
        {
             do_something_with_cuda( x ); // x is not getting replaced
        }
    }
    . . .
};

BL to create example

Fortran module arrays without acc declare create() don't always get an error

module mm
    real, allocatable, dimension(:) :: foo
end module
subroutine sub
    use mm
    !$acc routine vector
    !$acc declare present(foo)
    !$acc loop vector
    do i = 1, ubound(foo,1)
       foo(i) = foo(i) + 1
    enddo
end subroutine

The 'declare present(foo)' should affect the subroutine sub when compiling for the host, but not for the device. When compiling for the device, the compiler should give a message that a 'declare create' is still necessary for the module array foo.

Assign to PM

C switch and Fortran select statements don't work in compute region

The PGI compiler doesn't generate code for C switch statements or Fortran select statements in device compute regions. We've known about this problem for a long time, and hope we can resolve this in the near future.

Assign to MW

Compiler should give message for 'update' in acc compute region

Inside an 'acc routine' or 'acc parallel' or 'acc kernels', acc update is going to be ignored, so perhaps the compiler should give a message. Similarly for data constructs in an 'acc routine'.

Assign to PM

no PGI_ACC_NOTIFY notification for reduction variable upload/download

#pragma acc parallel loop copyin(x) reduction(+:y)

This prints out information for the data copy for 'x' but not for 'y'. The runtime has no profile calls in the reduction upload/download operations. It seems the data uploads for firstprivate are not printed either.

Assign to MW

simple structs should be implicitly copy(); scalars should be implicitly firstprivate

Cray seems to do the right thing; PGI complains about scalars being live-out, and structs aren't getting an implicit copy().

Assign to MW

arrays or structs in firstprivate do not get initialized.

Scalars seem to be handled by using the kernel argument as the firstprivate copy.

Assign to MW

PGI compiler will not inline a routine with 'acc routine' and 'acc loop'

It's the acc loop that's the problem. This should get fixed in the next release. We had some restrictions on inlining with OpenACC directives that predate OpenACC 2.0, 'acc routine', and orphaned loop directives; those restrictions no longer apply.

MW fixed during the week. Assign to MW. Create test case.

Interfacing PGI OpenACC with CUDA libraries (cufft, cublas) streams

The PGI OpenACC does not (by default) use CUDA stream zero for kernel launches and data movement. The CUDA libraries do, by default, use stream zero. Stream zero has (or used to have) some unexpected and unfortunate behavior, such that operations on stream zero caused synchronization with all other streams, so PGI decided not to use stream zero by default. You can override this by linking the application with -Mcuda. This tells the compiler that you are using CUDA features that will use stream zero, so OpenACC should as well. Another method is to set the environment variable PGI_ACC_NOSYNCQUEUE to 1. We will revisit the stream zero behavior and determine whether we should change our default behavior.

Alternatively, you can get the CUDA stream used by PGI with:

#include <openacc.h>
...
void* stream = acc_get_cuda_stream( acc_async_sync );
cufftSetStream( plan, (cudaStream_t)stream ); // plan is an existing cufftHandle

or whatever the appropriate cuda library routine is to set the stream.

Create the documentation. Write PGInsider article.

Fortran interfaces to OpenACC API routines use long instead of int

Specifically, acc_get_cuda_stream() takes an argument that is the async queue, and it's defined in the spec as an int but in the PGI interface as a long. It wouldn't be a problem for a value argument, but Fortran uses ref arguments.

Also, we include the source for omp_lib.f90 in the src/ directory, we should include the source for openacc.f90 as well.

Fix for 16.0

strange error message, Fortran module declare create in data copyin

module e
 real, save, allocatable :: at(:)
 !$acc declare create(at)
end module

subroutine sub
 use e
 !$acc data copyin(at)
  call foo
 !$acc end data
end subroutine

Compiling this with pgfortran -acc gives a strange error message:

PGF90-S-0155--Mcuda should be used with CUDA DEVICE module variable: at (e.f90: 8)

In fact, the 'declare create' data should not appear in data clauses, but that message is wrong.

Assign to PM

passing struct by value causes bad code generation

typedef struct{
    int a;
    float b;
    float c;
} s1;
#pragma acc routine seq
extern void test2( s1 arg, s1* arg2 );

. . .
s1 x, y[10];

#pragma acc data copy( x, y )
{
    #pragma acc parallel
    {
        test2( x, y );
    }
}

The compiler generates bad code and gets a build error when compiling the call to test2. The problem is 'x' is passed by value, but the compiler tries to pass it by reference.

Assign to MW

C++ compiler dies or generates bad code with parallel loop wait async(as)

#pragma acc parallel loop present(a[0:n],b[0:n]) async(as) wait
for( i=0;i<n;++i) a[i] += b[i];

The above loop generates a waitall, but not an async wait. It works in pgcc.

#pragma acc parallel loop present(a[0:n],b[0:n]) wait async(as)
for( i=0;i<n;++i) a[i] += b[i];

The above loop dies in pggpp1

#pragma acc parallel loop present(a[0:n],b[0:n]) async(as) wait(as+1)
for( i=0;i<n;++i) a[i] += b[i];

The above loop generates a waitall_async on queue as, not a wait for queue as+1, regardless of how many arguments there are in the wait clause. This works in pgcc.

#pragma acc wait(as+1) async(as)
#pragma acc parallel loop present(a[0:n],b[0:n]) async(as)
for( i=0;i<n;++i) a[i] += b[i];

This case generates a wait on queue as+1, but not async on queue as. It works in pgcc.

Create examples, assign to DC

Fortran arrays passed as arguments and not accessed locally seem to be bad

!$acc parallel loop
do i = 1, n
    call foo( a )
    ....
enddo

If a is an array, the address passed to foo seems to be bad.

BL could not reproduce. Create example or give up.

data construct containing OpenMP parallel region seems to fail

Getting no output at all. I'm going to try to create a reproducer. MW: was able to reproduce this; the problem is the C++ interface losing the end-of-data-region tag.

Create example, assign to DC.

Getting Fortran parameter arrays to the device

does declare create work? Can we just put them there always?

Also came from ORNL. Easy to create example. Look at how parameter arrays are handled in CUDA Fortran.

C and C++ const and static const variables should be forward substituted

const int num = 15;
...
for(i=0;i<num;++i)

Compiler should be able to tell that the loop only has 15 iterations. Sometimes the variable is a C++ class member

class foo{
    static const int nnum = 15;
};
...

Same thing here.

Create examples. Some cases may be easy, others may be intractable.

Compiler uses vector_length(128) even if the loop only has 15 iterations

I was working on this very issue last week.

Assign to MW.

forward substitute static const int values past assignments.

const int num = 15;
. . . 
int nn = num;
. . .
#pragma acc data copy(a[0:nn])

Here, it should know that nn is 15.

May be possible.

reduction should be able to leave the result on the device

real :: sum
sum = 0
!$acc declare copy(sum)
...
 !$acc parallel loop reduction(+:sum) async
 do ...
   sum = sum + ...
 enddo
 ! HERE

Right now, PGI brings the result of the sum back to the host, even though the reduction variable lives on the device. This synchronizes the host with the device, which we don't want since the loop has the async clause here. Why not just compute the sum into the device sum variable?!

What does the spec allow? Only for async? This changes behavior on the host. Discuss.

Suggestion for -ta=tesla:pin

Right now, -ta=tesla:pin works at runtime by trying to dynamically pin user data at the data clauses. Perhaps a better solution is to work more like -ta=tesla:managed, to change the malloc/free, new/delete, allocate/deallocate to use pinned memory for dynamic memory and remember that, then to use async transfers for pinned buffers.

Discuss. Assign to MW. May not be a long-term issue, anyway.

-noacc flag would be convenient

Depending on the Cray modules loaded, the Cray compiler wrapper will add -acc to the PGI compile line. It would be convenient to have a -noacc flag to negate this.

Assign to MW.

opening left curly bracket on pragma line, would be nice to give an error message

#pragma acc data copyin(x){

It would help to give an error on the opening brace. This was found in a C++ program, but it would be good for C as well.

Assign to PM?

DEEP COPY

I got a lot of pressure to implement deep copy, in some form, for Fortran in particular, but also C++ and C, in that order.

The runtime work alone will take a couple of months.
