OpenMP teams should be used to distribute work offloaded to the GPU.
Use the target teams distribute parallel for combined construct to offload work to the GPU using two levels of parallelism.
GPUs differ notably from CPUs: they are composed of a large number of processing units rather than a small number of cores. Moreover, these processing units are organized hierarchically within the GPU, so some specific setup is required to exploit its capabilities fully.
The OpenMP parallel construct specifies a parallel region of the code that
will be executed by a team of threads. It is normally accompanied by a
worksharing construct so that each thread of the team takes care of part of the
work (e.g., the for construct assigns a subset of the loop iterations to each
thread). This attains a single level of parallelism, since all the work is
distributed across one team of threads. This works well for multi-core CPUs, but
GPUs are composed of a large number of processing units organized into groups
that can share memory and synchronize. This must be taken into account in order
to get the best performance out of GPUs.
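As a point of reference, the single level of parallelism described above can be sketched on the host as follows. This is a minimal illustration, not code from the original example: the function name vector_add and its parameters are hypothetical, and the loop is a simple element-wise addition chosen only to show one team of threads sharing the iterations of one loop.

```c
#include <stddef.h>

/* Hypothetical helper: z[i] = x[i] + y[i] using one team of threads. */
void vector_add(const double *x, const double *y, double *z, size_t n) {
    /* One parallel region, executed by a single team of threads. */
    #pragma omp parallel default(none) shared(x, y, z, n)
    {
        /* The worksharing construct splits the iterations among
           the threads of that team. */
        #pragma omp for
        for (size_t i = 0; i < n; i++) {
            z[i] = x[i] + y[i];
        }
    }
}
```

If the code is compiled without OpenMP support, the pragmas are ignored and the loop simply runs sequentially, which makes the sketch easy to validate.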
The OpenMP teams distribute construct can be used to introduce an additional
level of parallelism by creating multiple teams of threads and distributing loop
iterations across them. Each team forms a contention group, meaning that threads
can only synchronize with other threads in their own team. This lets the
distribution of work match the hierarchical organization of the processing units
of GPUs. Additionally, using teams tends to improve performance portability,
giving more predictable performance across different compiler and hardware
combinations.
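To make the two levels visible, the sketch below writes them as separate directives instead of one combined construct. It is an illustrative assumption rather than part of the original example: the function name row_sums, the flat row-major layout, and the choice of a row-sum computation are all hypothetical. The outer loop is distributed across teams, and inside each team a parallel for with a reduction shares the inner loop among that team's threads.

```c
#include <stddef.h>

/* Hypothetical helper: out[i] = sum of row i of a (rows x cols, row-major). */
void row_sums(const double *a, double *out, size_t rows, size_t cols) {
    /* Level 1: rows are distributed across the teams of the league. */
    #pragma omp target teams distribute \
        map(to: a[0:rows * cols]) map(from: out[0:rows])
    for (size_t i = 0; i < rows; i++) {
        double sum = 0.0;
        /* Level 2: the threads of one team share the columns of a row;
           the reduction only involves threads of the same team. */
        #pragma omp parallel for reduction(+: sum)
        for (size_t j = 0; j < cols; j++) {
            sum += a[i * cols + j];
        }
        out[i] = sum;
    }
}
```

Note that the reduction stays within a single team, which is consistent with threads synchronizing only with other threads of their own team. When no offload device is available, an OpenMP implementation typically falls back to running the region on the host.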
The following C code offloads a matrix multiplication through the target
construct, then creates a parallel region and distributes the work through the
for construct (note that the matrices are statically sized arrays):
#pragma omp target map(to: A[0:m][0:p], B[0:p][0:n], m, n, p) \
                   map(tofrom: C[0:m][0:n])
{
  #pragma omp parallel default(none) shared(A, B, C, m, n, p)
  {
    #pragma omp for schedule(auto)
    for (size_t i = 0; i < m; i++) {
      for (size_t j = 0; j < n; j++) {
        for (size_t k = 0; k < p; k++) {
          C[i][j] += A[i][k] * B[k][j];
        }
      }
    }
  } // end parallel
} // end target
When offloading to the GPU, it is recommended to use an additional level of
parallelism. This can be achieved by using the teams and distribute constructs;
in this case, in combination with parallel for:
#pragma omp target teams distribute parallel for \
    map(to: A[0:m][0:p], B[0:p][0:n], m, n, p) shared(A, B, m, n, p) \
    map(tofrom: C[0:m][0:n]) schedule(auto)
for (size_t i = 0; i < m; i++) {
  for (size_t j = 0; j < n; j++) {
    for (size_t k = 0; k < p; k++) {
      C[i][j] += A[i][k] * B[k][j];
    }
  }
}
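The example above relies on statically sized arrays. As a hedged variation, the sketch below applies the same combined construct to matrices stored in flat, row-major buffers, which is a common layout for heap-allocated data. The function name matmul_offload, the pointer-based signature, and the local accumulator acc are assumptions introduced for illustration; they are not part of the original code.

```c
#include <stddef.h>

/* Hypothetical helper: C (m x n) += A (m x p) * B (p x n), all row-major. */
void matmul_offload(const double *A, const double *B, double *C,
                    size_t m, size_t n, size_t p) {
    /* Rows of C are spread across teams and across the threads of each team. */
    #pragma omp target teams distribute parallel for \
        map(to: A[0:m * p], B[0:p * n]) map(tofrom: C[0:m * n])
    for (size_t i = 0; i < m; i++) {
        for (size_t j = 0; j < n; j++) {
            /* Accumulate locally, then write the element of C once. */
            double acc = C[i * n + j];
            for (size_t k = 0; k < p; k++) {
                acc += A[i * p + k] * B[k * n + j];
            }
            C[i * n + j] = acc;
        }
    }
}
```

The array sections in the map clauses give the number of elements of each buffer, since the compiler cannot infer the size of data behind a pointer. The scalars m, n, and p are left out of the map clauses here because scalars referenced in a target region are firstprivate by default.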
The following Fortran code offloads a matrix multiplication through the target
construct, then creates a parallel region and distributes the work through the
do construct:
!$omp target map(to: A, B) map(tofrom: C)
!$omp parallel default(none) private(i, j, k) shared(A, B, C)
!$omp do
do j = 1, size(C, 2)
  ! k runs over the inner dimension shared by A and B
  do k = 1, size(A, 2)
    do i = 1, size(C, 1)
      C(i, j) = C(i, j) + A(i, k) * B(k, j)
    end do
  end do
end do
!$omp end do
!$omp end parallel
!$omp end target
When offloading to the GPU, it is recommended to use an additional level of
parallelism. This can be achieved by using the teams and distribute constructs;
in this case, in combination with parallel do:
!$omp target teams distribute parallel do default(none) private(i, j, k) &
!$omp& shared(A, B, C) map(to: A, B) map(tofrom: C)
do j = 1, size(C, 2)
  ! k runs over the inner dimension shared by A and B
  do k = 1, size(A, 2)
    do i = 1, size(C, 1)
      C(i, j) = C(i, j) + A(i, k) * B(k, j)
    end do
  end do
end do
!$omp end target teams distribute parallel do
- OpenMP 4.5 Complete Specifications, November 2015 [last checked June 2020]
- Portability of OpenMP Offload Directives - Jeff Larkin, OpenMP Booth Talk SC17, November 2017 [last checked June 2020]
- OpenMP and NVIDIA - Jeff Larkin, NVIDIA Developer Technologies [last checked June 2020]