# Intel-specific matrix features

:source-highlighter: coderay
:coderay-linenums-mode: table
The VNNI blocking factor is 2 in the case of 16-bit types, and it is 4 in the case of 8-bit types.
// ---------------------------------
// a1, a2, a3, a4, b1, b2, b3, b4, c1, c2, c3, c4, d1, d2, d3, d4
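To make the packed layout concrete, the following host-side sketch repacks a row-major matrix into VNNI layout. It is only an illustration under stated assumptions — the helper `vnni_pack` and its signature are hypothetical, not part of the extension:

```c++
#include <cstdint>
#include <vector>

// Hypothetical helper (not part of the extension): repack a row-major
// K x N matrix into VNNI layout with blocking factor VF
// (VF = 4 for 8-bit types, VF = 2 for 16-bit types).
std::vector<int8_t> vnni_pack(const std::vector<int8_t> &src,
                              int K, int N, int VF) {
  std::vector<int8_t> dst(src.size());
  for (int k = 0; k < K; ++k)
    for (int n = 0; n < N; ++n)
      // Element (k, n) lands in block-row k / VF; within a block-row,
      // VF consecutive rows of each column are stored contiguously.
      dst[(k / VF) * N * VF + n * VF + (k % VF)] = src[k * N + n];
  return dst;
}
```

With K = N = VF = 4, the four elements of each column become contiguous, matching the `a1, a2, a3, a4, b1, ...` ordering shown above.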

## Open Questions
- Should the same class, `joint_matrix`, handle both the case where sizes are constant (GPU case) and the case where sizes are variable (CPU case)? Note that an Intel AMX 2d tile register permits variable sizes of up to 1024 bytes (16 rows x 64 columns). A single interface for both would let the user exploit the flexibility introduced by the CPU while saving resources on the GPU. In a previous version of the design, we used `sycl::dynamic_extent` to differentiate between static and dynamic sizes, but since it was never implemented, we removed it. We can revisit this design choice if it comes up as part of a customer request or if the SPIR-V matrix extension adds support for dynamic sizes.

## Revision History

[frame="none",options="header"]
|======================
|Rev |Date |Author |Changes
|1 |2022-11-07 |Dounia Khaldi |Add Intel-specific store API and layout information.
|======================



#### Group Memory Scope
In this API, we use the terminology of `joint_matrix` instead of plain `matrix` to emphasize that the matrix is shared among a group of work items and is not private to each work item. The group scope is added as an additional template parameter.

IMPORTANT: In the current implementation, only the `sub_group` scope is supported.

The matrix multiply and add function performs the multiply operation on the matrices `A` and `B`, accumulates the result with the `C` matrix, and returns the result.


#### Matrix Initialization: `joint_matrix_fill`
Unlike `joint_matrix_load`, which assumes that the matrices are loaded directly from memory, `joint_matrix_fill` makes it possible to multiply a matrix that is not loaded from memory but rather initialized directly in registers. On Intel AMX, if the initialization constant is zero, this maps to the `_tile_zero` intrinsic:

```c++
namespace sycl::ext::oneapi::experimental::matrix {
```

We introduce a new function `get_wi_data` that provides a view of the portion of the matrix that is owned by the current work-item.

Using `get_wi_data`, it is not possible to know which portions of data are owned by each thread in the group, as this is implementation defined and varies from one backend to another. For general piece-wise operations, such as summing the rows of a matrix, the mapping from WI data to joint matrix coordinates must be known in order to reason about the matrix view and extract the relevant piece. However, for element-wise operations, where the same operation is performed on all the elements of the matrix, having all the WIs in the group apply the operation inside a loop iterating over the `length` of `wi_data` guarantees that the whole matrix is covered.

Note that `get_wi_data` cannot return a fixed-size array, because the length of the WI portion is a runtime variable, for the following reasons:

1- The main compilation mode of SYCL is JIT compilation, and the partitioning among WIs is implementation defined.

```c++
namespace sycl::ext::oneapi::experimental::matrix {
template <typename T, size_t Rows, size_t Cols,
          use Use, layout Layout, typename Group>
class wi_element {
  operator T();
  wi_element &operator=(const T &rhs);
  // other operators overloading (+, -, etc)
  std::tuple<size_t, size_t> get_coord();
};
}
```
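The point made above — that element-wise operations are safe even though the distribution of elements to WIs is implementation defined — can be modeled in plain C++. This sketch is illustrative only; the round-robin partitioning below is one arbitrary stand-in for an implementation-defined distribution:

```c++
#include <cstddef>
#include <vector>

// Model: partition the matrix elements among `num_wi` work-items in an
// arbitrary (implementation-defined) round-robin fashion, then have each
// work-item scale its own portion. The distribution does not matter:
// every element is visited exactly once.
void scale_distributed(std::vector<float> &matrix, std::size_t num_wi,
                       float alpha) {
  // Build each work-item's local view (indices into the matrix).
  std::vector<std::vector<std::size_t>> wi_data(num_wi);
  for (std::size_t i = 0; i < matrix.size(); ++i)
    wi_data[i % num_wi].push_back(i);
  // Each work-item loops over the length of its own portion.
  for (std::size_t wi = 0; wi < num_wi; ++wi)
    for (std::size_t j = 0; j < wi_data[wi].size(); ++j)
      matrix[wi_data[wi][j]] *= alpha;
}
```

Whatever partitioning is chosen, each element belongs to exactly one local portion, so looping over `length` in every WI applies the operation to the whole matrix exactly once.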

IMPORTANT: In the current implementation, only the `sub_group` scope is supported.

##### Work-item data to joint matrix mapping coordinates
The `wi_data` and `wi_element` classes provide access to the matrix elements that are local to the calling work-item. However, the distribution of matrix elements to each work-item is implementation-defined, so application code cannot assume any fixed distribution. Instead, application code can use the `get_coord` method to query the matrix coordinates of an individual `wi_element`.

`get_coord` returns the [row, col] coordinates of the current `wi_element` object within the joint matrix. For example, each work-item can compute partial sums of the matrix rows it owns as follows:

```c++
auto data = get_wi_data(sg, tA);
// each WI calculates local sum of rows
for (int i = 0; i < data.length(); ++i) {
  auto [row, col] = data[i].get_coord();
  sum_of_local_rows[row] += data[i];
}
```

IMPORTANT: `get_coord` is not implemented yet.

## Example using int8_t type
```c++
q.parallel_for(nd_range<2>(G, L), [=](nd_item<2> item)
  joint_matrix<sub_group, int32_t, use::accumulator, tM, tN> tC;
  joint_matrix_fill(sg, tC, 0);
  for (int k = 0; k < K; k += tK) {
    joint_matrix_load(sg, tA,
        multi_ptr<int8_t, sycl::access::address_space::global_space>(memA) +
        sg_startx * tM * K + k, K);
    joint_matrix_load(sg, tB,
        multi_ptr<int8_t, sycl::access::address_space::global_space>(memB) +
        k * N + sg_starty/SG_SIZE*tN, N);
    tC = joint_matrix_mad(sg, tA, tB, tC);
  }
  auto wi_data_c = get_wi_data(sg, tC);
  for (int i = 0; i < wi_data_c.length(); i++)
    wi_data_c[i] *= alpha;
  joint_matrix_store(sg, tC,
      multi_ptr<int32_t, sycl::access::address_space::global_space>(memC) +
      sg_startx * tM * N + sg_starty/SG_SIZE*tN, N, layout::row_major);
}).wait();
```
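For validation, the kernel above is equivalent to the following plain C++ reference (a hedged host-side sketch, not part of the extension; it assumes the elided setup allocates `memA`, `memB`, and `memC` as row-major buffers):

```c++
#include <cstdint>

// Reference C = alpha * (A x B), with A: M x K int8, B: K x N int8,
// C: M x N int32, all row-major. Host-side validation sketch only.
void ref_gemm_int8(const int8_t *A, const int8_t *B, int32_t *C,
                   int M, int N, int K, int32_t alpha) {
  for (int m = 0; m < M; ++m)
    for (int n = 0; n < N; ++n) {
      int32_t acc = 0;
      for (int k = 0; k < K; ++k)
        acc += int32_t(A[m * K + k]) * int32_t(B[k * N + n]);
      C[m * N + n] = acc * alpha;
    }
}
```

Comparing `memC` against the output of `ref_gemm_int8` is a simple way to check a `joint_matrix` implementation.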

== Query Interface
Intel AMX, Intel XMX and Nvidia TPUs support different sizes and types (see Appendix: Supported Combinations Per Hardware). The query interface is used to validate user code and to inform users about the types, sizes, scopes, and layouts that the implementation supports.
It also improves development and tuning productivity for both scientists and library developers. The query interface proposed here is a compile-time query, so there will be no runtime errors.

The query interface proposed here consists of three functionalities:

- Validation: at compile time, the validation functionality informs the user whether a specific combination is valid or not. This takes place when the user specifies all template parameters.
The table below provides a description for each of the member variables and types in the `tpu_params` class.
|`num_combinations`| validation, default values, general query|indicates number of combinations supported by the TPU implementation which corresponds to the size of the `combinations` array
|======================



```c++
namespace sycl::ext::oneapi::experimental::matrix {
template<tpu u, typename Ta=void, typename Tb=void, typename Tc=void, int sM=0, int sN=0, int sK=0>
struct tpu_params<tpu::amx, void, void, void, sM, sN, sK> {
sizeof(combinations) / sizeof(combination);
};


enum class tpu {
xmx8,
xmx16,
enum class scope_t {
};
}
```


=== Validation Example:
```c++
// User can provide sizes besides the types and tpu_params can assert if they are supported or not
joint_matrix<sub_group, int, use::accumulator, msize, nsize> sub_c;
//Remainder handling
```
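Beyond compile-time validation, the `combinations` array (together with `num_combinations`) can also drive tile-shape selection at run time. The sketch below mocks a minimal `combination` record to show the pattern; the field names `max_msize`, `max_nsize`, `max_ksize` and the helper `pick_combination` are assumptions for illustration, not the exact spec types:

```c++
#include <cstddef>

// Mock of a spec-like combination record (field names are assumptions).
struct combination {
  std::size_t max_msize, max_nsize, max_ksize;
};

// Illustrative helper: pick the first combination whose maximum sizes
// evenly tile an M x N x K problem; returns nullptr if none fits.
const combination *pick_combination(const combination *combos, std::size_t n,
                                    std::size_t M, std::size_t N,
                                    std::size_t K) {
  for (std::size_t i = 0; i < n; ++i)
    if (M % combos[i].max_msize == 0 && N % combos[i].max_nsize == 0 &&
        K % combos[i].max_ksize == 0)
      return &combos[i];
  return nullptr;
}
```

For example, with Intel AMX-like int8 entries, a 32 x 32 x 64 problem selects the K = 64 entry, while a 32 x 32 x 32 problem falls through to the K = 32 entry.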

## Future-looking API

### Memory scope
The current experimental API uses `joint_` semantics to define the memory scope of the matrix. The long-term solution is to use the proposed link:../supported/sycl_ext_oneapi_local_memory.asciidoc[`group_local_memory` extension] to allocate the matrix in local memory associated with a SYCL group, as shown in the example below.


```c++
multi_ptr<matrix<T>, address_space::local_space> tA_ptr = group_local_memory<matrix<sub_group, int8_t, tM, tN, use::a>>(sg);
```
We did not utilize this extension for this matrix API version because sub-group local memory is not yet well defined in {dpcpp}. Moreover, the representation of this notion in LLVM IR and SPIR-V is not clear yet.
## Appendix: Supported Combinations Per Hardware

The table below provides a list of the combinations that `joint_matrix` implementations support on each of Intel AMX and Intel XMX hardware. Note that these can be returned in a parametrized way using the `tpu_params` query class.

### Intel AMX Supported Combinations

[frame="none",options="header"]
|======================
| A type | B type | Accumulator type | M | N | K
| (u)int8_t | (u)int8_t | int32_t | +<=+ 16 | +<=+ 16 | +<=+ 64
| bf16 | bf16 | fp32 | +<=+ 16 | +<=+ 16 | +<=+ 32
|======================

### Intel XMX Supported Combinations

[frame="none",options="header"]
|======================
| A type | B type | Accumulator type | M | N | K
| (u)int8_t | (u)int8_t | int32_t | +<=+ 8 | 16 | 32
| fp16 | fp16 | fp32 | +<=+ 8 | 16 | 16
| bf16 | bf16 | fp32 | +<=+ 8 | 16 | 16
|======================


## Revision History
[frame="none",options="header"]
|======================
|Rev |Date |Author |Changes
|2 |2021-10-05 |Dounia Khaldi |JIT implementation on both Intel AMX and DPAS
|3 |2022-05-16 |Dounia Khaldi |Add matrix fill and piece-wise operations support
|4 |2022-08-25 |Dounia Khaldi |Update the matrix spec by adding the new matrix use parameter and remove reference to the AOT AMX initial implementation
|5 |2022-11-07 |Dounia Khaldi |Update the matrix spec by making it portable across Intel AMX, Intel XMX and Nvidia tensor Cores, and move the Intel-specifics to a separate extension document.
|6 |2023-01-09 |Dounia Khaldi |Add `get_coord` API and supported combinations appendix.
|======================