Merged
274 changes: 274 additions & 0 deletions sycl/include/sycl/ext/oneapi/matrix/matrix-tensorcore.hpp
@@ -0,0 +1,274 @@
#pragma once

#include <CL/sycl/detail/defines_elementary.hpp>

__SYCL_INLINE_NAMESPACE(cl) {
namespace sycl {
namespace ext {
namespace intel {
namespace experimental::matrix {

enum class matrix_type { a, b, accumulator };
Contributor

While we don't have this right now, it will be needed for future support, so this is a good addition to the interface. I would suggest changing the name, though, to matrix_use rather than matrix_type.
Also, please take a look at the query interface in static-query.hpp. It provides a nice way to bypass all these extra arguments, including the sizes (see the example in https://github.com/intel/llvm/blob/sycl/sycl/test/matrix/query.cpp)
You just need to say:
using myparams = tpu_params<tpu::nvidia, int8_t, int8_t, int>;
The matrices can be created as follows:
myparams::joint_matrix_a<sub_group> sub_a(sg);
myparams::joint_matrix_b<sub_group> sub_b(sg);
myparams::joint_matrix_c<sub_group> sub_c(sg);

As you can see, the sizes are constructed underneath; the matrix_use is specified in the type alias.

Contributor Author @JackAKirk (Oct 5, 2021)

Thanks. I think the query that omits the matrix size parameters will be useful for the tensorcore case in two ways: as a user query reporting which matrix sizes are available for a given matrix_type (using the definition of matrix_type in static-query.hpp, which essentially corresponds to the matrix::precision described in the tensorcore matrix proposal), and potentially as a way to reduce the number of parameters needed in the group functions where a single matrix_type corresponds to a single matrix size. There are only a few cases where the latter holds for CUDA; in the majority of cases all template parameters are needed to uniquely specify the correct joint_matrix (see https://docs.nvidia.com/cuda/parallel-thread-execution/index.html#warp-level-matrix-shape). Since the tensorcore case does not support a continuous range of integer matrix sizes, variables such as max_msize, max_nsize, max_ksize won't be appropriate for CUDA, but we could, for example, make an alternative CUDA implementation that reports to the user the set of available matrix sizes (most commonly two or three per matrix_type) for each matrix_type.

Contributor

Since the tensorcore case does not support a continuous range of integers for the matrix sizes, the variables such as max_msize, max_nsize, max_ksize won't be appropriate for the cuda case

The max_msize/nsize/ksize variables are only appropriate for AMX.
The DPAS GPU implementation also supports a discrete range of values; those are the msize/nsize/ksize members of the 'combination' type. Please refer to https://github.com/intel/llvm/blob/sycl/sycl/include/sycl/ext/oneapi/matrix/static-query.hpp#L297 for how we filled out the combinations for DPAS.

Contributor Author

I see, thanks.


enum class matrix_layout { row_major, col_major, packed };

template <typename Group, typename T, matrix_type MT,
size_t Rows = sycl::dynamic_extent,
size_t Cols = sycl::dynamic_extent,
matrix_layout Layout = matrix_layout::row_major, typename Cond = void>
struct joint_matrix {
joint_matrix(Group g) {}
};

// The enable_if_t usage in this file disables the matrix_layout::packed
// case, which is not supported by the NVIDIA CUDA backend.
template <matrix_layout Layout>
struct joint_matrix<
sycl::sub_group, double, matrix_type::a, 8, 4, Layout,
typename std::enable_if_t<Layout == matrix_layout::row_major ||
Layout == matrix_layout::col_major>> {
double data[1];
};

template <matrix_layout Layout>
struct joint_matrix<
sycl::sub_group, double, matrix_type::b, 4, 8, Layout,
typename std::enable_if_t<(Layout == matrix_layout::row_major ||
Layout == matrix_layout::col_major)>> {
double data[1];
};

template <matrix_layout Layout>
struct joint_matrix<
sycl::sub_group, double, matrix_type::accumulator, 8, 8, Layout,
typename std::enable_if_t<Layout == matrix_layout::row_major ||
Layout == matrix_layout::col_major>> {
double data[2];
};

} // namespace experimental::matrix

namespace detail {
using namespace experimental;

template <typename Group, typename T, matrix::matrix_type MT, size_t NumRows,
size_t NumCols, matrix::matrix_layout Layout,
access::address_space Space, typename Cond = void>
struct joint_matrix_load_impl {
void load(matrix::joint_matrix<Group, T, MT, NumRows, NumCols, Layout> &res,
multi_ptr<T, Space> src, size_t stride);
};

template <matrix::matrix_layout Layout> constexpr int get_layout_id();

template <> constexpr int get_layout_id<matrix::matrix_layout::row_major>() {
return 0;
}

template <> constexpr int get_layout_id<matrix::matrix_layout::col_major>() {
return 1;
}

template <matrix::matrix_layout Layout, access::address_space Space>
struct joint_matrix_load_impl<
sycl::sub_group, double, matrix::matrix_type::a, 8, 4, Layout, Space,
typename std::enable_if_t<Layout == matrix::matrix_layout::row_major ||
Layout == matrix::matrix_layout::col_major>> {
void load(matrix::joint_matrix<sycl::sub_group, double,
matrix::matrix_type::a, 8, 4, Layout> &res,
multi_ptr<double, Space> src, size_t stride) {

#ifdef __NVPTX__
#ifdef __SYCL_DEVICE_ONLY__
__dmma_m8n8k4_ld_a(res.data, src.get(), stride, get_layout_id<Layout>());
#endif
#endif
}
};

template <matrix::matrix_layout Layout, access::address_space Space>
struct joint_matrix_load_impl<
sycl::sub_group, double, matrix::matrix_type::b, 4, 8, Layout, Space,
typename std::enable_if_t<Layout == matrix::matrix_layout::row_major ||
Layout == matrix::matrix_layout::col_major>> {
void load(matrix::joint_matrix<sycl::sub_group, double,
matrix::matrix_type::b, 4, 8, Layout> &res,
multi_ptr<double, Space> src, size_t stride) {
#ifdef __NVPTX__
#ifdef __SYCL_DEVICE_ONLY__
__dmma_m8n8k4_ld_b(res.data, src.get(), stride, get_layout_id<Layout>());
#endif
#endif
}
};

template <matrix::matrix_layout Layout, access::address_space Space>
struct joint_matrix_load_impl<
sycl::sub_group, double, matrix::matrix_type::accumulator, 8, 8, Layout,
Space,
typename std::enable_if_t<Layout == matrix::matrix_layout::row_major ||
Layout == matrix::matrix_layout::col_major>> {
void load(
matrix::joint_matrix<sycl::sub_group, double,
matrix::matrix_type::accumulator, 8, 8, Layout> &res,
multi_ptr<double, Space> src, size_t stride) {

#ifdef __NVPTX__
#ifdef __SYCL_DEVICE_ONLY__
__dmma_m8n8k4_ld_c(res.data, src.get(), stride, get_layout_id<Layout>());
#endif
#endif
}
};

template <typename Group, typename T, size_t NumRows, size_t NumCols,
matrix::matrix_layout Layout, access::address_space Space,
typename Cond = void>
struct joint_matrix_store_impl {
void store(matrix::joint_matrix<Group, T, matrix::matrix_type::accumulator,
NumRows, NumCols, Layout> &src,
multi_ptr<T, Space> dst, size_t stride);
};

template <matrix::matrix_layout Layout, access::address_space Space>
struct joint_matrix_store_impl<
sycl::sub_group, double, 8, 8, Layout, Space,
typename std::enable_if_t<Layout == matrix::matrix_layout::row_major ||
Layout == matrix::matrix_layout::col_major>> {
void store(
matrix::joint_matrix<sycl::sub_group, double,
matrix::matrix_type::accumulator, 8, 8, Layout> &src,
multi_ptr<double, Space> dst, size_t stride) {

#ifdef __NVPTX__
#ifdef __SYCL_DEVICE_ONLY__
__dmma_m8n8k4_st_c_f64(dst.get(), src.data, stride,
get_layout_id<Layout>());
#endif
#endif
}
};

template <typename Group, typename T1, typename T2, std::size_t M,
std::size_t K, std::size_t N, matrix::matrix_layout LayoutA,
matrix::matrix_layout LayoutB, matrix::matrix_layout LayoutC,
typename Cond = void>
struct joint_matrix_mad_impl {
matrix::joint_matrix<Group, T2, matrix::matrix_type::accumulator, M, N,
LayoutC>
mad(Group sg,
matrix::joint_matrix<Group, T1, matrix::matrix_type::a, M, K, LayoutA> A,
matrix::joint_matrix<Group, T1, matrix::matrix_type::b, K, N, LayoutB> B,
matrix::joint_matrix<Group, T2, matrix::matrix_type::accumulator, M, N,
LayoutC>
C);
};

template <matrix::matrix_layout LayoutA, matrix::matrix_layout LayoutB>
constexpr int get_layout_pair_id();

template <>
constexpr int get_layout_pair_id<matrix::matrix_layout::row_major,
matrix::matrix_layout::row_major>() {
return 0;
}

template <>
constexpr int get_layout_pair_id<matrix::matrix_layout::row_major,
matrix::matrix_layout::col_major>() {
return 1;
}

template <>
constexpr int get_layout_pair_id<matrix::matrix_layout::col_major,
matrix::matrix_layout::row_major>() {
return 2;
}

template <>
constexpr int get_layout_pair_id<matrix::matrix_layout::col_major,
matrix::matrix_layout::col_major>() {
return 3;
}

template <matrix::matrix_layout LayoutA, matrix::matrix_layout LayoutB,
matrix::matrix_layout LayoutC>
struct joint_matrix_mad_impl<
sycl::sub_group, double, double, 8, 4, 8, LayoutA, LayoutB, LayoutC,
typename std::enable_if_t<(LayoutA == matrix::matrix_layout::row_major ||
LayoutA == matrix::matrix_layout::col_major) &&
(LayoutB == matrix::matrix_layout::row_major ||
LayoutB == matrix::matrix_layout::col_major) &&
(LayoutC == matrix::matrix_layout::row_major ||
LayoutC == matrix::matrix_layout::col_major)>> {
matrix::joint_matrix<sycl::sub_group, double,
matrix::matrix_type::accumulator, 8, 8, LayoutC>
mad(sycl::sub_group sg,
matrix::joint_matrix<sycl::sub_group, double, matrix::matrix_type::a, 8,
4, LayoutA>
A,
matrix::joint_matrix<sycl::sub_group, double, matrix::matrix_type::b, 4,
8, LayoutB>
B,
matrix::joint_matrix<sycl::sub_group, double,
matrix::matrix_type::accumulator, 8, 8, LayoutC>
C) {
matrix::joint_matrix<sycl::sub_group, double,
matrix::matrix_type::accumulator, 8, 8, LayoutC>
D;

#ifdef __NVPTX__
#ifdef __SYCL_DEVICE_ONLY__
__dmma_m8n8k4_mma_f64(D.data, A.data, B.data, C.data,
get_layout_pair_id<LayoutA, LayoutB>(), 0);
#endif
#endif

return D;
}
};

} // namespace detail

namespace experimental::matrix {

template <typename Group, typename T, matrix_type MT, size_t NumRows,
size_t NumCols, matrix_layout Layout, access::address_space Space>
void joint_matrix_load(
Group sg, joint_matrix<Group, T, MT, NumRows, NumCols, Layout> &res,
multi_ptr<T, Space> src, size_t stride) {
detail::joint_matrix_load_impl<Group, T, MT, NumRows, NumCols, Layout,
Space>{}
.load(res, src, stride);
}

template <typename Group, typename T, size_t NumRows, size_t NumCols,
matrix_layout Layout, access::address_space Space>
void joint_matrix_store(Group sg,
joint_matrix<Group, T, matrix_type::accumulator,
NumRows, NumCols, Layout> &src,
multi_ptr<T, Space> dst, size_t stride) {
detail::joint_matrix_store_impl<Group, T, NumRows, NumCols, Layout, Space>{}
.store(src, dst, stride);
}

template <typename Group, typename T1, typename T2, std::size_t M,
std::size_t K, std::size_t N, matrix_layout LayoutA,
matrix_layout LayoutB, matrix_layout LayoutC>
joint_matrix<Group, T2, matrix_type::accumulator, M, N, LayoutC>
joint_matrix_mad(
Group sg, joint_matrix<Group, T1, matrix_type::a, M, K, LayoutA> A,
joint_matrix<Group, T1, matrix_type::b, K, N, LayoutB> B,
joint_matrix<Group, T2, matrix_type::accumulator, M, N, LayoutC> C) {
return detail::joint_matrix_mad_impl<Group, T1, T2, M, K, N, LayoutA, LayoutB,
LayoutC>{}
.mad(sg, A, B, C);
}

} // namespace experimental::matrix
} // namespace intel
} // namespace ext
} // namespace sycl
} // __SYCL_INLINE_NAMESPACE(cl)
3 changes: 3 additions & 0 deletions sycl/include/sycl/ext/oneapi/matrix/matrix.hpp
@@ -25,3 +25,6 @@
#include <sycl/ext/oneapi/matrix/matrix-jit.hpp>
#include <sycl/ext/oneapi/matrix/static-query.hpp>
#endif
#if (SYCL_EXT_ONEAPI_MATRIX == 3)
Contributor

This implementation can also benefit from the static query we have. Besides giving the user information about what the implementation supports, the query can also construct the matrices and make the sizes optional for the user.

We should probably add this to matrix-jit.hpp and fork to the AOT tensorcore implementation based on some option (AOT for tensorcore).
I am asking because we should have one place that holds the interface, to make maintaining the code easy; also, since this interface is experimental, we expect it to change (like the use argument you introduce). Keeping the interface in one place means we only have to modify it in one place.

Contributor Author

Do you think that there should be a single header for all of the definitions of joint_matrix, joint_matrix_load, joint_matrix_store, joint_matrix_mad, and then backend dependent specializations of these functions can be in separate files?

Contributor

Yes, if you can use the same things as in matrix/matrix-jit.hpp, like matrix_layout, and not redefine them, that would be better.
For the things that differ because of the "use" argument, like the definition of the joint_matrix type and joint_matrix_load/store/mad, can you add the use-definitions in matrix-jit.hpp (under the new test macro = 3)?

As you know, we are planning on adding the new "use" argument for AMX and DPAS as well. Once we do that, there will be one definition of joint_matrix type/joint_matrix_load/store/mad.

If you make this change now, later, there will be one place for us to change (remove the old joint_matrix,load,store,mad that do not have "use" argument). And we won't need to touch the tensorcores specific specifications that will be in a different file.

Also, when this convergence happens, there will be no need for the feature test macro. Since this is an experimental interface, we don't need to keep track of "old" versions of the interface. We will remove AOT AMX (SYCL_EXT_ONEAPI_MATRIX=1) and only keep matrix-jit.hpp, which enables DPAS, AMX, and tensorcores.

Contributor Author @JackAKirk (Oct 19, 2021)

matrix_layout has a definition identical to the one in matrix-jit.hpp.

For the things that are different like the definition of joint_matrix type, joint_matrix_load/store/mad because of "use" argument, can you add the use-definitions in matrix-jit.hpp (under the new test macro = 3)

As you know, we are planning on adding the new "use" argument for AMX and DPAS as well. Once we do that, there will be one definition of joint_matrix type/joint_matrix_load/store/mad.

I'm not sure what you are asking me to do here: if I add the definitions of the joint_matrix_* functions used in matrix-tensorcore.hpp into matrix-jit.hpp, they will be redeclarations of the Intel-specific functions already defined in matrix-jit.hpp that do not use the matrix_use template parameter.

Contributor Author

Hi @dkhaldi , We would like to get this merged. Could you clarify what you would like me to change? Thanks.

Contributor

Sorry for the late reply. I was thinking you could have these defined under the new test macro = 3 in the same file so they don't get redefined.
However, I think it will be best if we merge these as separate files. Once we add the use argument, we can reiterate on this and merge both files. What do you think?

Contributor Author

OK sure, I think that keeping them separate is a good idea for now.

#include <sycl/ext/oneapi/matrix/matrix-tensorcore.hpp>
#endif
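The guard above means a user opts into the tensorcore path by defining the feature-test macro before the umbrella header is included. A minimal sketch of such a translation unit follows; the compiler flag shown in the comment is an assumption about the CUDA backend invocation, not something specified in this PR.

```cpp
// Select the tensorcore implementation (SYCL_EXT_ONEAPI_MATRIX == 3)
// before including the umbrella header, typically via the command line
// when targeting the CUDA backend, e.g.:
//   -DSYCL_EXT_ONEAPI_MATRIX=3 -fsycl-targets=nvptx64-nvidia-cuda
#define SYCL_EXT_ONEAPI_MATRIX 3
#include <sycl/ext/oneapi/matrix/matrix.hpp>
```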