PGI Compiler Issues, Bugs, etc.
Sadly, I find myself making a list of new bugs found at the hackathon. Thank you all for your help.
Example:
```fortran
real :: t(10)
!$acc parallel loop gang
do j = 1, n
  !$acc loop vector private(t)
  do i1 = 1, m
    do k = 1, 10
      t(k) = ...
    enddo
  enddo
  !$acc loop vector private(t)
  do i2 = 1, m
    .... t ....
  enddo
enddo
```
The compiler decides it needs to allocate space for the private version of 't' for the i1 loop and save its address. Then it decides it needs to allocate space for a private version of 't' for the i2 loop and save its address. That's bad enough, but then it saves the second address in the same place as the first address. After the loop, it frees up that memory, and since it allocated two blocks of memory, there are two deallocate calls. But since it saved both addresses in the same temp pointer, it tries to free that second block twice. Havoc follows.
This actually exposed three more bugs in the compiler. One: if those i1 and i2 loops had constant loop limits (say 32), the compiler would still generate a launch schedule of vector(128), essentially wasting 96 of those vector lanes. This is not a correctness bug, but it's annoying and stupid. Two: each vector lane has to find the portion of the private array that it should use. It takes the starting point of the block and computes an offset by multiplying the size of each block (10 elements above, or 40 bytes) by the index of the thread computing this vector lane, `blockIdx.x*blockDim.x + threadIdx.x`. This is usually right, but what if that loop has fewer iterations than the launch configuration? For instance, suppose the compiler generates a vector(128) launch configuration, but 'm' is only 90. Then the compiler allocates space for `n*90` copies of 't', but the generated code computes its offset as `blockIdx.x*128 + threadIdx.x`, so it generates out-of-bounds accesses. Three: if the launch is asynchronous, the private array allocate/deallocate must also be asynchronous. We have thought about this before, but didn't realize how important it was, so now we'll address it.
FS 21779
Example:
```fortran
subroutine sub( a, n )
!$acc routine seq
real :: a(*)
integer :: n
integer :: x = 0
....
end subroutine
```
The problem is that 'integer :: x = 0' implicitly gives 'x' the Fortran save attribute, meaning it needs to be statically allocated. The PGI compiler currently does not support saved variables in Fortran 'acc routine' subprograms. That's bad enough, but it fails in a bad way. We'll work on adding a useful message to this effect. In this case, the fix is initializing x at runtime:
```fortran
integer :: x
x = 0
```
FS 21780
Example:
```fortran
subroutine sub( a, n )
real :: a(*)
integer :: n
. . .
write(*,"(i0,f12.5)") i, a(i)
. . .
end subroutine
```
The compiler gives a message that says something like 'call to routine without acc routine: pgf90_encode_format'. This isn't very useful. We'll work on adding a better message for this. Full support for Fortran IO will take a while.
FS 21781
I'd thought we caught all cases of this, but apparently not. There are cases when compiling a subroutine or function with 'acc routine' for the device where, if there is a call to another subroutine or function that does not have 'acc routine', the compiler issues an internal compiler error message. It should instead give a useful error message telling the programmer to add 'acc routine' for that subroutine or function. It's not a compiler error at all, and a useful message would allow the programmer to make better progress.
FS 21782
We are aware of this. It has to do with the compiler temps that need to be created to allow for overlap between the array assignment right hand side expressions and left hand side targets. In many (most) cases, the compiler can get rid of these temps, but not always. We are looking at whether we can add a flag to tell the compiler that the temps are not needed for this file, which should help this issue. Fixing the real problem, handling those temps properly, is under design now.
FS 21783
This is the same issue as FS 21783 above.
We'll create a reproducer example to debug this.
FS 21773
The compiler gives an error message about the datatype not matching, but in fact the problem is that it won't inline when a scalar actual argument is passed to an array dummy argument. It's unlikely that we will fix the inliner in the near future, but we will try to give a better message.
```fortran
subroutine sub( a, b, c )
real :: a(*), b(*), c(*)
...
end subroutine
...
real x, y(2), z(3)
call sub( x, y, z )
! here, scalar x is passed to array dummy argument a, preventing the inlining
```
BL to open
In C and C++, if you include <accelmath.h>, most of the common math.h functions (sin, sinf, exp, expf, ...) will be recognized as intrinsic functions for both the CPU and GPU, allowing the compiler to generate efficient inline code where appropriate. The pow function in C++ has several other versions, such as pow(float,float), pow(double,double), pow(double,int), pow(double,long), and so on. The powf(float,float) and pow(double,double) versions will be recognized, the others likely will not, and that may cause problems at compile time.
can't recreate at this time
This is not necessarily a bug, but it's annoying, and the compiler needs to say why vector code was not generated.
MW to find reproducing example
If the only use of a pointer is in a data clause, the pointer assignment is apparently not preserved.
This particularly affects pgc++ (or pgcpp). It doesn't affect pgcc so much. Use case:
```cpp
float* p = inx.p;
#pragma acc data copyin(p[0:n])
{
    ... // the p = inx.p assignment is getting deleted, and the data clause
        // is using garbage pointer values
}
```
MW to create example, fixed during the week.
```cpp
class {
    float* x;
    . . .
    void foo() {
        #pragma acc host_data use_device(x)
        {
            do_something_with_cuda( x ); // not getting replaced
        }
    }
    . . .
};
```
BL to create example
```fortran
module mm
  real, allocatable, dimension(:) :: foo
end module

subroutine sub
  use mm
  !$acc routine vector
  !$acc declare present(foo)
  !$acc loop vector
  do i = 1, ubound(foo,1)
    foo(i) = foo(i) + 1
  enddo
end subroutine
```
The 'declare present(foo)' should affect the subroutine sub when compiling for the host, but not for the device. When compiling for the device, the compiler should give a message that a 'declare create' is still necessary for the module array foo.
Assign to PM
The PGI compiler doesn't generate code for C switch statements or Fortran select statements in device compute regions. We've known about this problem for a long time, and hope we can resolve this in the near future.
Assign to MW
Inside an 'acc routine' or 'acc parallel' or 'acc kernels', acc update is going to be ignored, so perhaps the compiler should give a message. Similarly for data constructs in an 'acc routine'.
Assign to PM
```c
#pragma acc parallel loop copyin(x) reduction(+:y)
```
With profiling enabled, this prints out information for the data copy for 'x' but not for 'y'. The runtime has no profile calls in the reduction upload/download operations. It seems the data uploads for firstprivate variables are not printed either.
Assign to MW
Cray seems to do the right thing; PGI complains about scalars being live-out, and structs aren't getting an implicit copy().
Assign to MW
Scalars seem to be handled by using the kernel argument as the firstprivate copy.
Assign to MW
It's the 'acc loop' directive that's the problem. This should get fixed in the next release. We had some restrictions on inlining with OpenACC directives that predated OpenACC 2.0, 'acc routine', and orphaned loop directives; those restrictions no longer apply.
MW fixed during the week. Assign to MW. Create test case.
The PGI OpenACC runtime does not (by default) use CUDA stream zero for kernel launches and data movement. The CUDA libraries do, by default, use stream zero. Stream zero has (or used to have) some unexpected and unfortunate behavior, such that operations on stream zero caused synchronization with all other streams, so PGI decided not to use stream zero by default. You can override this by linking the application with -Mcuda. This tells the compiler that you are using CUDA features that will use stream zero, so OpenACC should as well. Another method is to set the environment variable PGI_ACC_NOSYNCQUEUE to 1. We will revisit the stream zero behavior and determine whether we should change our default behavior.
Alternatively, you can get the CUDA stream used by PGI with:
```c
#include <openacc.h>
...
void* stream = acc_get_cuda_stream( acc_async_sync );
cufftSetStream( stream );
```
or whatever the appropriate CUDA library routine is to set the stream.
Create the documentation. Write PGInsider article.
Specifically, acc_get_cuda_stream() takes an argument that is the async queue; it's defined in the spec as an int, but in the PGI interface as a long. That wouldn't be a problem for a by-value argument, but Fortran passes arguments by reference.
Also, we include the source for omp_lib.f90 in the src/ directory; we should include the source for openacc.f90 as well.
Fix for 16.0
```fortran
module e
  real, save, allocatable :: at(:)
  !$acc declare create(at)
end module

subroutine sub
  use e
  !$acc data copyin(at)
  call foo
  !$acc end data
end subroutine
```
Compiling this with pgfortran -acc gives a strange error message:
```
PGF90-S-0155--Mcuda should be used with CUDA DEVICE module variable: at (e.f90: 8)
```
In fact, data in a 'declare create' directive should not appear in data clauses, but that message is wrong.
Assign to PM
```c
typedef struct {
    int a;
    float b;
    float c;
} s1;

#pragma acc routine seq
extern void test2( s1 arg, s1* arg2 );
. . .
s1 x, y[10];
#pragma acc data copy( x, y )
{
    #pragma acc parallel
    {
        test2( x, y );
    }
}
```
The compiler generates bad code and gets a build error when compiling the call to test2. The problem is 'x' is passed by value, but the compiler tries to pass it by reference.
Assign to MW
```c
#pragma acc parallel loop present(a[0:n],b[0:n]) async(as) wait
for( i = 0; i < n; ++i ) a[i] += b[i];
```
The above loop generates a waitall, but not an async wait. It works in pgcc.
```c
#pragma acc parallel loop present(a[0:n],b[0:n]) wait async(as)
for( i = 0; i < n; ++i ) a[i] += b[i];
```
The above loop dies in pggpp1.
```c
#pragma acc parallel loop present(a[0:n],b[0:n]) async(as) wait(as+1)
for( i = 0; i < n; ++i ) a[i] += b[i];
```
The above loop generates a waitall_async on queue as, not a wait for queue as+1, regardless of how many arguments there are in the wait clause. This works in pgcc.
```c
#pragma acc wait(as+1) async(as)
#pragma acc parallel loop present(a[0:n],b[0:n]) async(as)
for( i = 0; i < n; ++i ) a[i] += b[i];
```
This case generates a wait on queue as+1, but not async on queue as. It works in pgcc.
Create examples, assign to DC
```fortran
!$acc parallel loop
do i = 1, n
  call foo( a )
  ....
enddo
```
If a is an array, the address passed to foo seems to be bad.
BL could not reproduce. Create example or give up.
Getting no output at all. I'm going to try to create a reproducer. MW: was able to reproduce this; the problem is the C++ interface losing the end-of-data-region tag.
Create example, assign to DC.
Does 'declare create' work? Can we just put them there always?
Also came from ORNL. Easy to create example. Look at how parameter arrays are handled in CUDA Fortran.
```c
const int num = 15;
...
for( i = 0; i < num; ++i )
```
The compiler should be able to tell that the loop only has 15 iterations. Sometimes the variable is a C++ class member:
```cpp
class foo {
    static const int nnum = 15;
};
...
```
Same thing here.
Create examples. Some cases may be easy, others may be intractable.
I was working on this very issue last week.
Assign to MW.
```c
const int num = 15;
. . .
int nn = num;
. . .
#pragma acc data copy(a[0:nn])
```
Here, it should know that nn is 15.
May be possible.
```fortran
real :: sum
sum = 0
!$acc declare copy(sum)
...
!$acc parallel loop reduction(+:sum) async
do ...
  sum = sum + ...
enddo
! HERE
```
Right now, PGI brings the result of the sum back to the host, even though the reduction variable lives on the device. This synchronizes the host with the device, which we don't want since the loop has the async clause here. Why not just compute the sum into the device sum variable?!
What does the spec allow? Only for async? This changes behavior on the host. Discuss.
Right now, -ta=tesla:pin works at runtime by trying to dynamically pin user data at the data clauses. Perhaps a better solution is to work more like -ta=tesla:managed, to change the malloc/free, new/delete, allocate/deallocate to use pinned memory for dynamic memory and remember that, then to use async transfers for pinned buffers.
Discuss. Assign to MW. May not be a long-term issue, anyway.
Depending on the Cray modules loaded, the Cray compiler wrapper will add -acc to the PGI compiler build line. It would be convenient to add a -noacc option to negate this.
Assign to MW.
```c
#pragma acc data copyin(x){
```
It would help to give an error on the misplaced opening brace. This was found in a C++ program, but it would be good for C as well.
Assign to PM?
I got a lot of pressure to implement deep copy, in some form, for Fortran in particular, but also C++ and C, in that order.
This takes a couple of months to do the runtime work.