diff --git a/extensions/clext.php b/extensions/clext.php
index caae768a..0e005db1 100644
--- a/extensions/clext.php
+++ b/extensions/clext.php
@@ -65,6 +65,8 @@
  • cl_intel_advanced_motion_estimation
  • cl_intel_bfloat16_conversions
  • cl_intel_command_queue_families
  • cl_intel_create_buffer_with_properties
@@ -111,6 +113,12 @@
  • cl_intel_spirv_subgroups
  • cl_intel_split_work_group_barrier
  • cl_intel_subgroup_matrix_multiply_accumulate
  • cl_intel_subgroup_split_matrix_multiply_accumulate
  • cl_intel_subgroups
  • cl_intel_subgroups_char

diff --git a/extensions/intel/cl_intel_bfloat16_conversions.html b/extensions/intel/cl_intel_bfloat16_conversions.html
new file mode 100644
index 00000000..f2e5ad1d
--- /dev/null
+++ b/extensions/intel/cl_intel_bfloat16_conversions.html
@@ -0,0 +1,1249 @@

cl_intel_bfloat16_conversions

Name Strings

cl_intel_bfloat16_conversions

Contact

Ben Ashbaugh, Intel (ben 'dot' ashbaugh 'at' intel 'dot' com)

Contributors

Ben Ashbaugh, Intel
Alexey Sotkin, Intel
Lukasz Towarek, Intel

Notice

Copyright (c) 2022-2023 Intel Corporation. All rights reserved.

Status

Shipping

Version

Built On: 2023-06-12
Revision: 1.0.0

    Dependencies

This extension is written against the OpenCL 3.0 C Language specification and the OpenCL SPIR-V Environment specification, V3.0.8.

This extension requires OpenCL 1.0.

Overview

This extension adds built-in functions to convert between single-precision 32-bit floating-point values and 16-bit bfloat16 values. The 16-bit bfloat16 format has a dynamic range similar to the 32-bit float format, albeit with lower precision than the 16-bit half format.

Please note that this extension does not currently introduce a bfloat16 type to OpenCL C; instead, the built-in functions convert to or from a ushort 16-bit unsigned integer type whose bit pattern represents a bfloat16 value.
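
For illustration, a minimal usage sketch follows; the kernel name and buffer arguments are hypothetical, but the conversion built-ins are those defined by this extension:

// Hypothetical example: convert a buffer of float data to bfloat16
// bit patterns (stored as ushort) and back, using the built-in
// functions added by this extension.
kernel void convert_to_and_from_bfloat16( global const float* src,
                                          global ushort* bf16,
                                          global float* roundtrip )
{
    size_t i = get_global_id(0);

    // Convert float -> bfloat16, encoded in a ushort:
    ushort b = intel_convert_bfloat16_as_ushort( src[i] );
    bf16[i] = b;

    // Convert bfloat16 (as ushort) -> float; this direction is lossless:
    roundtrip[i] = intel_convert_as_bfloat16_float( b );
}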


New API Functions

None.

New API Enums

None.

New API Types

None.

New OpenCL C Functions

    ushort intel_convert_bfloat16_as_ushort(float source);
ushort2 intel_convert_bfloat162_as_ushort2(float2 source);
ushort3 intel_convert_bfloat163_as_ushort3(float3 source);
ushort4 intel_convert_bfloat164_as_ushort4(float4 source);
ushort8 intel_convert_bfloat168_as_ushort8(float8 source);
ushort16 intel_convert_bfloat1616_as_ushort16(float16 source);

float intel_convert_as_bfloat16_float(ushort source);
float2 intel_convert_as_bfloat162_float2(ushort2 source);
float3 intel_convert_as_bfloat163_float3(ushort3 source);
float4 intel_convert_as_bfloat164_float4(ushort4 source);
float8 intel_convert_as_bfloat168_float8(ushort8 source);
float16 intel_convert_as_bfloat1616_float16(ushort16 source);

    Modifications to the OpenCL C Specification

Add a new Section 6.3.1.X - The bfloat16 Format

The bfloat16 format is a floating-point format occupying 16 bits. It is a truncated version of the 32-bit IEEE 754 single-precision floating-point format. The bfloat16 format includes one sign bit, eight exponent bits (the same as the 32-bit single-precision floating-point format), and seven mantissa bits (fewer than the 16-bit IEEE 754-2008 half-precision floating-point format). This means that a bfloat16 number can represent numeric values with a dynamic range similar to a 32-bit float number, albeit with lower precision than a 16-bit half number.

The cl_intel_bfloat16_conversions extension does not add bfloat16 as a supported data type for OpenCL kernels; however, the built-in functions added by the extension are able to use and return bfloat16 data. For these built-in functions, the bfloat16 data is passed to or returned from the function by encoding it into a ushort 16-bit unsigned integer data type. If a future extension adds bfloat16 as a supported data type for OpenCL kernels, the bfloat16 data may be reinterpreted and passed to the built-in functions added by cl_intel_bfloat16_conversions using the as_type() operator.

    Add a new Section 6.4.X - bfloat16 Conversions

The bfloat16 format can be used in explicit conversions using the following suite of functions:

    // conversions to bfloat16:
destType intel_convert_bfloat16_as_destType(sourceType)
destTypen intel_convert_bfloat16n_as_destTypen(sourceTypen)

// conversions from bfloat16:
destType intel_convert_as_bfloat16_destType(sourceType)
destTypen intel_convert_as_bfloat16n_destTypen(sourceTypen)

    The number of elements in the source and destination vectors must match.


The only supported rounding mode is the implicit round-to-nearest-even mode. No explicit rounding modes are supported.
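
For readers who want to see the rounding behavior spelled out, the sketch below shows one well-known way to express a round-to-nearest-even float-to-bfloat16 truncation in plain OpenCL C. It is illustrative only, the helper name is hypothetical, and the built-in function may of course map directly to hardware:

// Illustrative round-to-nearest-even float -> bfloat16 truncation.
// This is a reference sketch, not the required implementation.
static ushort float_to_bfloat16_rte_sketch( float f )
{
    uint u = as_uint( f );

    // Handle NaN separately: truncate, then set a mantissa bit so the
    // result stays a (quiet) NaN even if the payload bits are lost.
    if( isnan( f ) ) {
        return (ushort)( ( u >> 16 ) | 0x0040 );
    }

    // Round to nearest even: add 0x7FFF plus the least significant
    // bit of the truncated result, then drop the low 16 mantissa bits.
    uint rounding_bias = 0x7FFF + ( ( u >> 16 ) & 1 );
    return (ushort)( ( u + rounding_bias ) >> 16 );
}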


    Supported scalar and vector data types:

destType                                                | sourceType
--------------------------------------------------------+--------------------------------------------------------
bfloat16 (as ushort)                                    | float
bfloat162, bfloat163, bfloat164, bfloat168, bfloat1616  | float2, float3, float4, float8, float16
  (as ushort2, ushort3, ushort4, ushort8, ushort16)     |
float                                                   | bfloat16 (as ushort)
float2, float3, float4, float8, float16                 | bfloat162, bfloat163, bfloat164, bfloat168, bfloat1616
                                                        |   (as ushort2, ushort3, ushort4, ushort8, ushort16)

Modifications to the OpenCL SPIR-V Environment Specification

Add a new section 5.2.X - cl_intel_bfloat16_conversions

    If the OpenCL environment supports the extension cl_intel_bfloat16_conversions then the environment must accept modules that declare use of the extension SPV_INTEL_bfloat16_conversion and that declare the SPIR-V capability Bfloat16ConversionINTEL.


    For the instructions OpConvertFToBF16INTEL and OpConvertBF16ToFINTEL added by the extension:

  • Valid types for Result Type, Float Value, and Bfloat16 Value are scalars and OpTypeVectors with 2, 3, 4, 8, or 16 Component Count components.

    Issues

1. Should these functions have a special prefix (such as __) or suffix (such as _as_ushort) since they do not truly operate on a bfloat16 type?

   RESOLVED: Yes, we will use the _as_ushort nomenclature.

   The function name to convert to a ushort representing a bfloat16 value is intel_convert_bfloat16_as_ushort.

   The function name to convert from a ushort representing a bfloat16 value is intel_convert_as_bfloat16_float.

2. Should we define a type alias for our bfloat16 type or use ushort (or short) directly?

   RESOLVED: No, we will not define a type alias.

3. Should the integer bfloat16 representation be signed or unsigned?

   RESOLVED: We will use an unsigned type.

4. Should we support vector conversion built-in functions?

   RESOLVED: Yes, we will support the vector conversion built-in functions for consistency.

5. Should we support built-in functions with explicit rounding modes?

   RESOLVED: No, we will not support the built-in functions with explicit rounding modes for the initial version of this extension.

   The only supported rounding mode for the conversion from float to bfloat16 will be the implicit round-to-nearest-even rounding mode.

   The conversions from bfloat16 to float are lossless.

6. Do we need to support packed conversions?

   RESOLVED: No, we will not support packed conversions for the initial version of this extension. If we decide to add packed conversions we will also need to add them to the SPIR-V extension.

7. Do we need to say anything about out-of-range conversions?

   RESOLVED: No, out-of-range behavior is covered by existing rounding rules.

8. How should we name the vector conversion functions?

   RESOLVED: The names of the vector conversion functions will be intel_convert_bfloat16n_as_ushortn and intel_convert_as_bfloat16n_floatn. This is consistent with the naming of the existing conversion functions.

   Because bfloat16 ends with a number this does lead to awkward function names like intel_convert_bfloat1616_as_ushort16, but the awkwardness is preferable to the ambiguity without the vector size suffix.

   If we decide to add a true bfloat16 type we should consider other names that do not end in a number (bfloat16_t?).

    Revision History

Version | Date       | Author       | Changes
--------+------------+--------------+-----------------------------------------------------
0.9.0   | 2021-09-03 | Ben Ashbaugh | Initial revision
0.9.0   | 2021-10-01 | Ben Ashbaugh | Reduced scope, resolved all open issues.
0.9.0   | 2021-10-19 | Ben Ashbaugh | Fixed the names of the vector conversion functions.
1.0.0   | 2022-08-26 | Ben Ashbaugh | Updated version.

\ No newline at end of file

diff --git a/extensions/intel/cl_intel_split_work_group_barrier.html b/extensions/intel/cl_intel_split_work_group_barrier.html
new file mode 100644
index 00000000..7a95dc69
--- /dev/null
+++ b/extensions/intel/cl_intel_split_work_group_barrier.html
@@ -0,0 +1,1190 @@

cl_intel_split_work_group_barrier

Name Strings

cl_intel_split_work_group_barrier

Contact

Ben Ashbaugh, Intel (ben 'dot' ashbaugh 'at' intel 'dot' com)

Contributors

Ben Ashbaugh, Intel
Eugene Chereshnev, Intel
John Pennycook, Intel

Notice

Copyright (c) 2022-2023 Intel Corporation. All rights reserved.

Status

Shipping

Version

Built On: 2023-06-12
Version: 1.0.0

    Dependencies

This extension is written against the OpenCL 3.0 C Language specification and the OpenCL SPIR-V Environment specification, V3.0.10.

This extension requires OpenCL 1.0.

Some OpenCL C function overloads added by this extension require OpenCL C 2.0 or newer.

Overview

This extension adds built-in functions to split a barrier or work_group_barrier function in OpenCL C into two separate operations: the first indicates that a work-item has "arrived" at a barrier but should continue executing, and the second indicates that a work-item should "wait" for all of the work-items to arrive at the barrier before executing further.


Splitting a barrier operation may improve performance and may provide a closer match to "latch" or "barrier" operations in other parallel languages such as C++20.
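
A minimal sketch of the intended usage pattern follows, assuming a hypothetical kernel: produce data, arrive, do independent work, then wait before consuming other work-items' results:

// Hypothetical usage sketch: overlap independent work with barrier
// synchronization by splitting the barrier into arrive and wait.
kernel void split_barrier_example( global const int* in,
                                   global int* out,
                                   local int* scratch )
{
    size_t lid = get_local_id(0);
    size_t gid = get_global_id(0);

    // Produce data into local memory, then signal arrival:
    scratch[lid] = in[gid];
    intel_work_group_barrier_arrive( CLK_LOCAL_MEM_FENCE );

    // Independent work that does not touch scratch can proceed
    // without waiting for the rest of the work-group:
    int independent = in[gid] * 2;

    // Wait until every work-item has arrived before consuming
    // the other work-items' local memory writes:
    intel_work_group_barrier_wait( CLK_LOCAL_MEM_FENCE );

    size_t next = ( lid + 1 ) % get_local_size(0);
    out[gid] = scratch[next] + independent;
}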

New API Functions

None.

New API Enums

None.

New API Types

None.

New OpenCL C Functions

    void intel_work_group_barrier_arrive(cl_mem_fence_flags flags);
void intel_work_group_barrier_wait(cl_mem_fence_flags flags);

// For OpenCL C 2.0 or newer:
void intel_work_group_barrier_arrive(cl_mem_fence_flags flags, memory_scope scope);
void intel_work_group_barrier_wait(cl_mem_fence_flags flags, memory_scope scope);

    Modifications to the OpenCL C Specification


    Add to Table 19 - Built-in Work-group Synchronization Functions

Table 19. Built-in Work-group Synchronization Functions

Function:

void intel_work_group_barrier_arrive(
    cl_mem_fence_flags flags);
void intel_work_group_barrier_wait(
    cl_mem_fence_flags flags);

void intel_work_group_barrier_arrive(
    cl_mem_fence_flags flags,
    memory_scope scope);
void intel_work_group_barrier_wait(
    cl_mem_fence_flags flags,
    memory_scope scope);

Description:

For these functions, if any work-item in a work-group arrives at a barrier, behavior is undefined unless all work-items in the work-group arrive at the barrier. If any work-item in a work-group waits on a barrier, behavior is undefined unless all work-items in the work-group wait on the barrier.

If a barrier arrive function is inside of a conditional statement and any work-item in the work-group enters the conditional statement and arrives at the barrier, behavior is undefined unless all work-items enter the conditional and arrive at the barrier. If a barrier wait function is inside of a conditional statement and any work-item in the work-group enters the conditional statement and waits on the barrier, behavior is undefined unless all work-items enter the conditional and wait on the barrier.

If a barrier arrive function is inside of a loop and any work-item arrives at the barrier for an iteration of the loop, behavior is undefined unless all work-items arrive at the barrier for the same iteration of the loop. If a barrier wait function is inside of a loop and any work-item waits on the barrier for an iteration of the loop, behavior is undefined unless all work-items wait on the barrier for the same iteration of the loop.

Behavior is undefined if a work-item waits on a barrier before arriving at a barrier. After a work-item arrives at a barrier, behavior is undefined if the work-item arrives at another barrier before waiting on a barrier. After a work-item waits on a barrier, behavior is undefined if the work-item waits on another barrier before arriving at a barrier.

The intel_work_group_barrier_arrive and intel_work_group_barrier_wait functions specify which memory operations from before arriving at the barrier must be visible to work-items after waiting on the barrier by using the flags and scope arguments.

The flags argument specifies the memory address spaces to apply the memory ordering constraints. This is a bitfield that can be zero or a combination of the following values:

CLK_LOCAL_MEM_FENCE: for local memory accesses.
CLK_GLOBAL_MEM_FENCE: for global memory accesses.
CLK_IMAGE_MEM_FENCE: for image memory accesses; for this flag, the value of scope must be memory_scope_work_group or behavior is undefined.

The scope argument describes the work-items to apply the memory ordering constraints. If no scope argument is provided, the scope is memory_scope_work_group.

If the flags argument differs between the barrier arrive function and the barrier wait function then only memory operations for the address spaces specified by the intersection of the two flags arguments must be visible.

If the scope argument differs between the barrier arrive function and the barrier wait function then the memory ordering constraints only apply to work-items described by the narrower of the two scope arguments.

For each call to these functions, the values of flags and scope must be the same for all work-items in the work-group.
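
As a contrived sketch of the intersection rule described above (not a recommended pattern; the kernel name is hypothetical):

// Contrived sketch: the arrive specifies local and global fences, but
// the wait specifies only local, so after the wait only local memory
// operations are guaranteed visible (the intersection of the flags).
kernel void mismatched_flags_sketch( global int* g, local int* l )
{
    l[get_local_id(0)] = g[get_global_id(0)];
    intel_work_group_barrier_arrive( CLK_LOCAL_MEM_FENCE | CLK_GLOBAL_MEM_FENCE );
    // ... independent work ...
    intel_work_group_barrier_wait( CLK_LOCAL_MEM_FENCE );
    // Only local memory visibility is guaranteed here.
    g[get_global_id(0)] = l[( get_local_id(0) + 1 ) % get_local_size(0)];
}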


    Modifications to the OpenCL SPIR-V Environment Specification

Add a new section 5.2.X - cl_intel_split_work_group_barrier

If the OpenCL environment supports the extension cl_intel_split_work_group_barrier then the environment must accept modules that declare use of the extension SPV_INTEL_split_barrier and that declare the SPIR-V capability SplitBarrierINTEL.

    For the instructions OpControlBarrierArriveINTEL and OpControlBarrierWaitINTEL added by the extension:

  • Scope for Execution must be WorkGroup.
  • Valid values for Scope for Memory are the same as for OpControlBarrier.

    For the instruction OpControlBarrierArriveINTEL, the memory-order constraint in Memory Semantics must be Release.


    For the instruction OpControlBarrierWaitINTEL, the memory-order constraint in Memory Semantics must be Acquire.


    Issues

1. Do we need to support all of the features of C++20 barriers (completion functions, arrival tokens, etc.)?

   RESOLVED: Not in this extension.

2. Do we need to support sub-group split barriers?

   RESOLVED: Not in this extension.

3. Do we need to document formal changes to the memory model?

   RESOLVED: Not initially. Informally, the barrier wait for one work-item synchronizes-with the barrier arrive operations for the other work-items in the work-group.

4. What are the memory order constraints for a split barrier?

   RESOLVED: Arriving at a split barrier will effectively be a release memory fence and waiting on a barrier will effectively be an acquire memory fence.

   Alternatively, both arriving and waiting could be sequentially consistent memory fences, but acquire and release are sufficient for most use-cases and may perform better. If a sequentially consistent fence is required instead, applications can use an ordinary non-split barrier, or insert explicit memory fences before arriving at the split barrier and after waiting on a split barrier.

5. What should the behavior be if the flags arguments differ between the barrier arrive and the barrier wait?

   RESOLVED: The address spaces will be the intersection of the flags, and the memory scope will be the narrowest of the two scopes. This is the same behavior that would be observed with a release fence before arriving at the barrier and an acquire fence after waiting on the barrier.

   Alternatively, this scenario could be undefined behavior, but this appears to be unnecessary.

    Revision History

Version | Date       | Author       | Changes
--------+------------+--------------+--------------------------------------------------
0.9.0   | 2022-01-11 | Ben Ashbaugh | Initial revision
0.9.1   | 2022-02-07 | Ben Ashbaugh | Added "intel" prefix to split barrier functions.
1.0.0   | 2022-09-06 | Ben Ashbaugh | Updated version.

\ No newline at end of file

diff --git a/extensions/intel/cl_intel_subgroup_matrix_multiply_accumulate.html b/extensions/intel/cl_intel_subgroup_matrix_multiply_accumulate.html
new file mode 100644
index 00000000..b0dffe07
--- /dev/null
+++ b/extensions/intel/cl_intel_subgroup_matrix_multiply_accumulate.html
@@ -0,0 +1,1331 @@

cl_intel_subgroup_matrix_multiply_accumulate

Name Strings

cl_intel_subgroup_matrix_multiply_accumulate

Contact

Ben Ashbaugh, Intel (ben 'dot' ashbaugh 'at' intel 'dot' com)

Contributors

Ben Ashbaugh, Intel
Eugene Chereshnev, Intel
Junjie Gu, Intel
Bartosz Koscielak, Intel
Mike MacPherson, Intel
Ritesh Patel, Intel
Lukasz Towarek, Intel

Notice

Copyright (c) 2022-2023 Intel Corporation. All rights reserved.

Status

Complete

Version

Built On: 2023-06-12
Revision: 1.0.0

    Dependencies

This extension is written against the OpenCL 3.0 C Language specification, V3.0.10.

This extension requires support for subgroups.

This extension depends on cl_intel_required_subgroup_size to query the subgroup sizes supported by a device or to require a subgroup size for a kernel.

Overview

The goal of this extension is to allow programmers to access specialized hardware to compute the product of an M x K matrix with a K x N matrix and then add an M x N matrix accumulation value. This is a commonly used building block to compute the product of two large matrices. When used in an OpenCL kernel, all work items in the subgroup cooperate to perform this operation.


This is a low-level extension for expert programmers seeking to access this functionality directly in custom kernels. Most users will access this functionality via high-level libraries or frameworks.
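
As a hedged sketch of how these built-ins might be called, the hypothetical kernel below computes one 2 x 8 int tile on a device whose minimum subgroup size is 8; the intel_reqd_sub_group_size attribute comes from the cl_intel_required_subgroup_size extension listed in the dependencies, while the kernel name and data layout are invented for illustration:

// Hypothetical sketch: each subgroup computes a 2 x 8 int tile,
// multiplying a 2 x 32 tile of signed 8-bit data (packed as int2)
// by a 32 x 8 tile of signed 8-bit data (packed as int8).
__attribute__((intel_reqd_sub_group_size(8)))
kernel void tile_mad_sketch( global const int2* tileA,  // packed rows of a
                             global const int8* tileB,  // packed columns of b
                             global int2* tileC )
{
    size_t i = get_global_id(0);

    int2 a   = tileA[i];
    int8 b   = tileB[i];
    int2 acc = (int2)(0);

    // All work items in the subgroup cooperate in this operation:
    tileC[i] = intel_sub_group_i8_i8_matrix_mad_k32( a, b, acc );
}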

New API Functions

None.

New API Enums

None.

New OpenCL C Functions

    // These functions are available to devices where the minimum subgroup
// size is 8.  For these devices, the subgroup size must be 8 (the
// minimum supported subgroup size).  Calling these functions on other
// devices or from kernels with a different subgroup size is undefined
// behavior:

// 8-bit matrices:
int  intel_sub_group_i8_i8_matrix_mad_k32(int   a, int8  b, int  acc);  // M = 1
int2 intel_sub_group_i8_i8_matrix_mad_k32(int2  a, int8  b, int2 acc);  // M = 2
int4 intel_sub_group_i8_i8_matrix_mad_k32(int4  a, int8  b, int4 acc);  // M = 4
int8 intel_sub_group_i8_i8_matrix_mad_k32(int8  a, int8  b, int8 acc);  // M = 8

int  intel_sub_group_i8_u8_matrix_mad_k32(int   a, uint8 b, int  acc);  // ...
int2 intel_sub_group_i8_u8_matrix_mad_k32(int2  a, uint8 b, int2 acc);
int4 intel_sub_group_i8_u8_matrix_mad_k32(int4  a, uint8 b, int4 acc);
int8 intel_sub_group_i8_u8_matrix_mad_k32(int8  a, uint8 b, int8 acc);

int  intel_sub_group_u8_i8_matrix_mad_k32(uint  a, int8  b, int  acc);
int2 intel_sub_group_u8_i8_matrix_mad_k32(uint2 a, int8  b, int2 acc);
int4 intel_sub_group_u8_i8_matrix_mad_k32(uint4 a, int8  b, int4 acc);
int8 intel_sub_group_u8_i8_matrix_mad_k32(uint8 a, int8  b, int8 acc);

int  intel_sub_group_u8_u8_matrix_mad_k32(uint  a, uint8 b, int  acc);
int2 intel_sub_group_u8_u8_matrix_mad_k32(uint2 a, uint8 b, int2 acc);
int4 intel_sub_group_u8_u8_matrix_mad_k32(uint4 a, uint8 b, int4 acc);
int8 intel_sub_group_u8_u8_matrix_mad_k32(uint8 a, uint8 b, int8 acc);

// bfloat16 matrices:
float  intel_sub_group_bf16_bf16_matrix_mad_k16(int  a, int8 b, float  acc);
float2 intel_sub_group_bf16_bf16_matrix_mad_k16(int2 a, int8 b, float2 acc);
float4 intel_sub_group_bf16_bf16_matrix_mad_k16(int4 a, int8 b, float4 acc);
float8 intel_sub_group_bf16_bf16_matrix_mad_k16(int8 a, int8 b, float8 acc);

// fp16 matrices:
float  intel_sub_group_f16_f16_matrix_mad_k16(int  a, int8 b, float  acc);
float2 intel_sub_group_f16_f16_matrix_mad_k16(int2 a, int8 b, float2 acc);
float4 intel_sub_group_f16_f16_matrix_mad_k16(int4 a, int8 b, float4 acc);
float8 intel_sub_group_f16_f16_matrix_mad_k16(int8 a, int8 b, float8 acc);

// These functions are available to devices where the minimum subgroup
// size is 16.  For these devices, the subgroup size must be 16 (the
// minimum supported subgroup size).  Calling these functions on other
// devices or from kernels with a different subgroup size is undefined
// behavior:

// 8-bit matrices:
int  intel_sub_group_i8_i8_matrix_mad_k32(short   a, int8  b, int  acc);  // M = 1
int2 intel_sub_group_i8_i8_matrix_mad_k32(short2  a, int8  b, int2 acc);  // M = 2
int4 intel_sub_group_i8_i8_matrix_mad_k32(short4  a, int8  b, int4 acc);  // M = 4
int8 intel_sub_group_i8_i8_matrix_mad_k32(short8  a, int8  b, int8 acc);  // M = 8

int  intel_sub_group_i8_u8_matrix_mad_k32(short   a, uint8 b, int  acc);  // ...
int2 intel_sub_group_i8_u8_matrix_mad_k32(short2  a, uint8 b, int2 acc);
int4 intel_sub_group_i8_u8_matrix_mad_k32(short4  a, uint8 b, int4 acc);
int8 intel_sub_group_i8_u8_matrix_mad_k32(short8  a, uint8 b, int8 acc);

int  intel_sub_group_u8_i8_matrix_mad_k32(ushort  a, int8  b, int  acc);
int2 intel_sub_group_u8_i8_matrix_mad_k32(ushort2 a, int8  b, int2 acc);
int4 intel_sub_group_u8_i8_matrix_mad_k32(ushort4 a, int8  b, int4 acc);
int8 intel_sub_group_u8_i8_matrix_mad_k32(ushort8 a, int8  b, int8 acc);

int  intel_sub_group_u8_u8_matrix_mad_k32(ushort  a, uint8 b, int  acc);
int2 intel_sub_group_u8_u8_matrix_mad_k32(ushort2 a, uint8 b, int2 acc);
int4 intel_sub_group_u8_u8_matrix_mad_k32(ushort4 a, uint8 b, int4 acc);
int8 intel_sub_group_u8_u8_matrix_mad_k32(ushort8 a, uint8 b, int8 acc);

// bfloat16 matrices:
float  intel_sub_group_bf16_bf16_matrix_mad_k16(short  a, int8 b, float  acc);
float2 intel_sub_group_bf16_bf16_matrix_mad_k16(short2 a, int8 b, float2 acc);
float4 intel_sub_group_bf16_bf16_matrix_mad_k16(short4 a, int8 b, float4 acc);
float8 intel_sub_group_bf16_bf16_matrix_mad_k16(short8 a, int8 b, float8 acc);

// fp16 matrices:
float  intel_sub_group_f16_f16_matrix_mad_k16(short  a, int8 b, float  acc);
float2 intel_sub_group_f16_f16_matrix_mad_k16(short2 a, int8 b, float2 acc);
float4 intel_sub_group_f16_f16_matrix_mad_k16(short4 a, int8 b, float4 acc);
float8 intel_sub_group_f16_f16_matrix_mad_k16(short8 a, int8 b, float8 acc);

    Modifications to the OpenCL C Specification

Add a new Section 6.13.X - Subgroup Matrix Multiply Accumulate Instructions

This section describes a family of built-in functions that multiply two matrix sources a and b and then add a matrix accumulation value to produce a matrix result value. a is the first matrix operand and has M rows and K columns. b is the second matrix operand and has K rows and N columns. acc is the matrix accumulation value and has M rows and N columns. The result value also has M rows and N columns. All work items in the subgroup cooperate to perform this operation. These functions must be encountered by all work items in the subgroup executing the kernel.

The dimensions of the two source matrices and the elements of each source matrix are described by the built-in function name and its arguments.

As an example, given the function:

    int2 intel_sub_group_u8_i8_matrix_mad_k32(uint2 a, int8  b, int2 acc);
  • a is the first source matrix operand and has M rows and K columns.
      • The value for M is determined by the number of vector components in the source operand a. In the example above, a is a uint2 argument, therefore the matrix a operand has M equal to 2 rows.
      • The value of K is described by the function name. In this case, the value of K is 32, therefore the matrix a operand has K equal to 32 columns.
      • The matrix component data type is also described by the function name. In this case, the matrix a component data type is u8, indicating that the elements of the matrix a operand are unsigned 8-bit integers.
      • Each work item contributes part of this matrix. In this case, since the elements of the matrix a are 8-bit integers, and since each work item is contributing 32 bits (the size of a uint) of data per row of this matrix, each work item is contributing four 8-bit integer values per row.
      • Since K is 32, and each work item is contributing four 8-bit values per row, the number of work items in the subgroup must be equal to 8.
  • b is the second source matrix operand and has K rows and N columns.
      • Each work item contributes one column of this matrix. Therefore, the number of columns N is equivalent to the subgroup size.
      • As above, the value of K is described by the function name. In this case, the value of K is 32, therefore the matrix b operand has K equal to 32 rows.
      • As above, the matrix component data type is described by the function name. In this case, the matrix b component data type is i8, indicating that the elements of the matrix b operand are signed 8-bit integers.
      • Since K is 32 and the elements of the matrix b are 8-bit integers, each work item must contribute 256 bits of source data to contribute K values. The 256 bits of source data are packed and passed as the int8 argument b.
  • acc specifies the accumulation value and has M rows and N columns.
      • As above, the value of M is determined by the number of components in the source operand acc. In the example above, acc is an int2 argument, therefore the accumulation value operand has M equal to 2 rows.
      • Since both a and acc specify operands with M rows, and since the value of M is determined by the number of components in the source operand, both the a and acc operands will be vector operands with the same number of components.
      • As above, each work item contributes one column of accumulation values. Therefore, the number of columns N is equivalent to the subgroup size.
      • The acc operand is a "full precision" accumulation value. In the example above, the matrices contain integer data, therefore the acc operand is a vector of int data.
  • The result value returned by the function also has M rows and N columns.
      • As above, the value of M is determined by the number of components in the return type. In the example above, the return type is int2, therefore the result value has M equal to 2 rows.
      • Since the result value, a, and acc all specify values with M rows, and since the value of M is determined by the number of components in the source operand or return type, the return type, a, and acc will all be vectors with the same number of components.
      • As above, each work item will receive one column of result values. Therefore, the number of columns N is equivalent to the subgroup size.
      • Similar to the acc operand, the return value is a "full precision" result value. In the example above, the matrices contain integer data, therefore the return type is a vector of int data.

The full list of supported functions is described in the overview, above. For this list of functions:

  • M may be equal to 1, 2, 4, or 8.
  • N must be equal to 8 for some devices or 16 for other devices. In other words, the only supported subgroup sizes are 8 and 16.
  • Supported integer matrix types for a and b are any combination of signed or unsigned 8-bit integers. For these integer matrix types, the accumulation value acc and result value are signed 32-bit integers, and K must be equal to 32.
  • The supported floating-point matrix types for a and b are fp16 (half) or bfloat16. For these floating-point matrix types, the accumulation value acc and result value are 32-bit floating-point values, and K must be equal to 16.

    Coding Sample

    // The code below shows a functional implementation of one of the
// built-in functions added by this extension.  For this built-in
// function:
//  * M = 2, since the result value, a operand, and acc operand
//    are all vectors with two components.
//  * N = 8, and is equal to the subgroup size.
//  * K = 32, as described by the function name.
//  * The elements of both matrix a and matrix b are signed 8-bit
//    integers.

// This is a helper function that performs the dot product of
// two vectors of four components of 8-bit integer data, and then
// adds a 32-bit integer accumulation value.
static int __intel_dot_product_accumulate( char4 a, char4 b, int acc )
{
    return a.x * b.x + a.y * b.y + a.z * b.z + a.w * b.w + acc;
}

// This is a helper function that computes the product of a
// 1 x 32 row vector value shared across the subgroup and a 32 x 1
// column vector, that is added to a full precision accumulation
// value.
static int __intel_vector_matrix_multiply_accumulate_k32( int v, int8 b, int acc )
{
    // Note: 8 is the size of the subgroup.
    // As K is 32, and the size of the subgroup is 8, each
    // work item contributes 4 elements of the 1 x K vector.
    // as_char4() is used to reinterpret 32-bits of data
    // as four components of 8-bit data.

    int result = acc;

    result = __intel_dot_product_accumulate(
        as_char4( sub_group_broadcast( v, 0 ) ), as_char4( b.s0 ), result );
    result = __intel_dot_product_accumulate(
        as_char4( sub_group_broadcast( v, 1 ) ), as_char4( b.s1 ), result );
    result = __intel_dot_product_accumulate(
        as_char4( sub_group_broadcast( v, 2 ) ), as_char4( b.s2 ), result );
    result = __intel_dot_product_accumulate(
        as_char4( sub_group_broadcast( v, 3 ) ), as_char4( b.s3 ), result );

    result = __intel_dot_product_accumulate(
        as_char4( sub_group_broadcast( v, 4 ) ), as_char4( b.s4 ), result );
    result = __intel_dot_product_accumulate(
        as_char4( sub_group_broadcast( v, 5 ) ), as_char4( b.s5 ), result );
    result = __intel_dot_product_accumulate(
        as_char4( sub_group_broadcast( v, 6 ) ), as_char4( b.s6 ), result );
    result = __intel_dot_product_accumulate(
        as_char4( sub_group_broadcast( v, 7 ) ), as_char4( b.s7 ), result );

    return result;
}

int2 intel_sub_group_i8_i8_matrix_mad_k32(int2  a, int8  b, int2 acc)
{
    int2 result;

    result.x = __intel_vector_matrix_multiply_accumulate_k32( a.x, b, acc.x );
    result.y = __intel_vector_matrix_multiply_accumulate_k32( a.y, b, acc.y );

    return result;
}

    Issues

1. Should this extension use signed or unsigned types to represent fp16 and bf16 data?

   RESOLVED: This extension will use signed types to represent fp16 and bf16 data even though this is inconsistent with other extensions such as cl_intel_bfloat16_conversions. This inconsistency may be addressed in a future extension or in a future version of this extension. Applications are encouraged to use as_type to reinterpret unsigned data as signed data as needed to use the functions added by this extension.

    Revision History

Rev   | Date       | Author       | Changes
------+------------+--------------+--------------------------
1.0.0 | 2022-05-18 | Ben Ashbaugh | Initial public revision

\ No newline at end of file

diff --git a/extensions/intel/cl_intel_subgroup_split_matrix_multiply_accumulate.html b/extensions/intel/cl_intel_subgroup_split_matrix_multiply_accumulate.html
new file mode 100644
index 00000000..94e0a8c8
--- /dev/null
+++ b/extensions/intel/cl_intel_subgroup_split_matrix_multiply_accumulate.html
@@ -0,0 +1,1232 @@

cl_intel_subgroup_split_matrix_multiply_accumulate

Name Strings

cl_intel_subgroup_split_matrix_multiply_accumulate

Contact

Ben Ashbaugh, Intel (ben 'dot' ashbaugh 'at' intel 'dot' com)

Contributors

Ben Ashbaugh, Intel
Junjie Gu, Intel
Mike MacPherson, Intel
Lukasz Towarek, Intel

Notice

Copyright (c) 2022-2023 Intel Corporation. All rights reserved.

Status

Complete

Version

Built On: 2023-06-12
Revision: 1.0.0

    Dependencies

This extension is written against the OpenCL 3.0 C Language specification, V3.0.10.

This extension requires support for subgroups.

This extension uses many of the terms and concepts from the cl_intel_subgroup_matrix_multiply_accumulate extension.

Overview

The goal of this extension is to allow programmers to access specialized hardware to compute the product of an M x K matrix with a K x N matrix and then add an M x N matrix accumulation value. This is a commonly used building block to compute the product of two large matrices.


The functionality described in this extension is very similar to the functionality described in the cl_intel_subgroup_matrix_multiply_accumulate extension, with one key difference: in this extension, work items across two subgroups cooperate to perform the operation. This is done by splitting the M x K matrix source across the two participating subgroups: the first M-divided-by-2 rows of the matrix source are provided by the first subgroup, and the remaining M-divided-by-2 rows of the matrix source are provided by the second subgroup.


    Splitting the matrix source improves performance by halving the amount of data each subgroup must load for the first matrix source.
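
As a hedged sketch of the difference from the non-split form, for M = 2 each work-item in each of the two cooperating subgroups supplies a single packed uint for its half of the a rows, while acc and the result still cover all M rows; the kernel name and data layout below are invented for illustration, and the intel_reqd_sub_group_size attribute comes from cl_intel_required_subgroup_size:

// Hypothetical sketch for M = 2: each subgroup supplies one of the
// two rows of a (one packed uint per work-item), while acc and the
// result cover both rows (int2 per work-item).
__attribute__((intel_reqd_sub_group_size(8)))
kernel void split_tile_sketch( global const uint* halfA,  // this subgroup's half of a
                               global const int8* tileB,  // packed 32 x 8 i8 columns
                               global int2* tileC )
{
    size_t i = get_global_id(0);

    uint a   = halfA[i];          // half of the M x K source
    int8 b   = tileB[i];          // full K x N source
    int2 acc = (int2)(0);         // full M x N accumulator

    tileC[i] = intel_sub_group_u8_i8_split_matrix_mad_k32( a, b, acc );
}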

New API Functions

None.

New API Enums

None.

New OpenCL C Functions

    // 8-bit matrices:
int2 intel_sub_group_i8_i8_split_matrix_mad_k32(int   a, int8  b, int2 acc);  // M = 2
int4 intel_sub_group_i8_i8_split_matrix_mad_k32(int2  a, int8  b, int4 acc);  // M = 4
int8 intel_sub_group_i8_i8_split_matrix_mad_k32(int4  a, int8  b, int8 acc);  // M = 8

int2 intel_sub_group_i8_u8_split_matrix_mad_k32(int   a, uint8 b, int2 acc);  // ...
int4 intel_sub_group_i8_u8_split_matrix_mad_k32(int2  a, uint8 b, int4 acc);
int8 intel_sub_group_i8_u8_split_matrix_mad_k32(int4  a, uint8 b, int8 acc);

int2 intel_sub_group_u8_i8_split_matrix_mad_k32(uint  a, int8  b, int2 acc);
int4 intel_sub_group_u8_i8_split_matrix_mad_k32(uint2 a, int8  b, int4 acc);
int8 intel_sub_group_u8_i8_split_matrix_mad_k32(uint4 a, int8  b, int8 acc);

int2 intel_sub_group_u8_u8_split_matrix_mad_k32(uint  a, uint8 b, int2 acc);
int4 intel_sub_group_u8_u8_split_matrix_mad_k32(uint2 a, uint8 b, int4 acc);
int8 intel_sub_group_u8_u8_split_matrix_mad_k32(uint4 a, uint8 b, int8 acc);

// bfloat16 matrices:
float2 intel_sub_group_bf16_bf16_split_matrix_mad_k16(int  a, int8 b, float2 acc);
float4 intel_sub_group_bf16_bf16_split_matrix_mad_k16(int2 a, int8 b, float4 acc);
float8 intel_sub_group_bf16_bf16_split_matrix_mad_k16(int4 a, int8 b, float8 acc);

// fp16 matrices:
float2 intel_sub_group_f16_f16_split_matrix_mad_k16(int  a, int8 b, float2 acc);
float4 intel_sub_group_f16_f16_split_matrix_mad_k16(int2 a, int8 b, float4 acc);
float8 intel_sub_group_f16_f16_split_matrix_mad_k16(int4 a, int8 b, float8 acc);

    Modifications to the OpenCL C Specification

Add a new Section 6.13.X - Subgroup Split Matrix Multiply Accumulate Instructions

This section describes a family of built-in functions that multiply two matrix sources a and b and then add a matrix accumulation value to produce a matrix result value. Work items from two subgroups cooperate to perform this operation. a is the first matrix operand and has M rows and K columns. Each subgroup provides half of the rows of the a matrix. b is the second matrix operand and has K rows and N columns. acc is the matrix accumulation value and has M rows and N columns. The result value also has M rows and N columns. All work items in both subgroups cooperate to perform this operation. These functions must be encountered by all work items in both subgroups executing the kernel.

The dimensions of the two source matrices and the elements of each source matrix are described by the built-in function name and its arguments.

As an example, given the function:

    int2 intel_sub_group_u8_i8_split_matrix_mad_k32(uint  a, int8  b, int2 acc);
  • a is the first source matrix operand and has M rows and K columns. This matrix operand is split across two participating subgroups. Work items from each participating subgroup provide half of the row data for this matrix.
      • The value for M is determined by the number of vector components in the source operand a. Since each subgroup provides half of the row data for this matrix, multiply the number of components in a by two to compute the number of rows M. In the example above, a is a scalar uint argument, therefore the matrix a operand has M equal to 2 rows.
      • The value of K is described by the function name. In this case, the value of K is 32, therefore the matrix a operand has K equal to 32 columns.
      • The matrix component data type is also described by the function name. In this case, the matrix a component data type is u8, indicating that the elements of the matrix a operand are unsigned 8-bit integers.
      • Each work item contributes part of this matrix. In this case, since the elements of the matrix a are 8-bit integers, and since each work item is contributing 32 bits (the size of a uint) of data per row of this matrix, each work item is contributing four 8-bit integer values per row.
      • Since K is 32, and each work item is contributing four 8-bit values per row, the number of work items in the subgroup must be equal to 8.
  • b is the second source matrix operand and has K rows and N columns.
      • Each work item contributes one column of this matrix. Therefore, the number of columns N is equivalent to the subgroup size.
      • As above, the value of K is described by the function name. In this case, the value of K is 32, therefore the matrix b operand has K equal to 32 rows.
      • As above, the matrix component data type is described by the function name. In this case, the matrix b component data type is i8, indicating that the elements of the matrix b operand are signed 8-bit integers.
      • Since K is 32 and the elements of the matrix b are 8-bit integers, each work item must contribute 256 bits of source data to contribute K values. The 256 bits of source data are packed and passed as the int8 argument b.
  • acc specifies the accumulation value and has M rows and N columns.
      • As above, the value of M is determined by the number of components in the source operand acc. In the example above, acc is an int2 argument, therefore the accumulation value operand has M equal to 2 rows.
      • Both a and acc specify operands with M rows, and the value of M is determined by the number of components in each source operand. Since each subgroup provides half of the a matrix data, the a operand will have half the number of components of the acc source operand.
      • As above, each work item contributes one column of accumulation values. Therefore, the number of columns N is equivalent to the subgroup size.
      • The acc operand is a "full precision" accumulation value. In the example above, the matrices contain integer data, therefore the acc operand is a vector of int data.
  • The result value returned by the function also has M rows and N columns.
      • As above, the value of M is determined by the number of components in the return type. In the example above, the return type is int2, therefore the result value has M equal to 2 rows.
      • The result value, a, and acc all specify values with M rows, and the value of M is determined by the number of components in each source operand or return type. Since each subgroup provides half of the a matrix data, the a operand will have half the number of components of the return type and acc operand.
      • As above, each work item will receive one column of result values. Therefore, the number of columns N is equivalent to the subgroup size.
      • Similar to the acc operand, the return value is a "full precision" result value. In the example above, the matrices contain integer data, therefore the return type is a vector of int data.

The full list of supported functions is described in the overview, above. For this list of functions:

  • M may be equal to 2, 4, or 8.
  • N must be equal to 8. In other words, the only supported subgroup size is 8.
  • Supported integer matrix types for a and b are any combination of signed or unsigned 8-bit integers. For these integer matrix types, the accumulation value acc and result value are signed 32-bit integers, and K must be equal to 32.
  • The supported floating-point matrix types for a and b are fp16 (half) or bfloat16. For these floating-point matrix types, the accumulation value acc and result value are 32-bit floating-point values, and K must be equal to 16.

    Issues

1. Do we need to talk about which two subgroups cooperate to perform the split matrix multiplication?

   UNRESOLVED: For now, this is left as an implementation detail, outside of the scope of this extension.

2. Should the built-in functions in this extension overload the built-ins from cl_intel_subgroup_matrix_multiply_accumulate, or define new functions?

   RESOLVED: Switched to a non-overloaded syntax: intel_sub_group_i8_i8_split_matrix_mad_k32.

3. Should this extension use signed or unsigned types to represent fp16 and bf16 data?

   RESOLVED: This extension will use signed types to represent fp16 and bf16 data even though this is inconsistent with other extensions such as cl_intel_bfloat16_conversions. See the discussion in cl_intel_subgroup_matrix_multiply_accumulate.

    Revision History

Rev   | Date       | Author       | Changes
------+------------+--------------+--------------------------
1.0.0 | 2022-05-18 | Ben Ashbaugh | Initial public revision

\ No newline at end of file

diff --git a/extensions/registry.py b/extensions/registry.py
index e93cf0b8..f9de2b1d 100644
--- a/extensions/registry.py
+++ b/extensions/registry.py
@@ -161,6 +161,11 @@
         'flags' : { 'public' },
         'url' : 'extensions/intel/cl_intel_advanced_motion_estimation.txt',
     },
+    'cl_intel_bfloat16_conversions' : {
+        'number' : 80,
+        'flags' : { 'public' },
+        'url' : 'extensions/intel/cl_intel_bfloat16_conversions.html',
+    },
     'cl_intel_command_queue_families' : {
         'number' : 68,
         'flags' : { 'public' },
@@ -276,6 +281,21 @@
         'flags' : { 'public' },
         'url' : 'extensions/intel/cl_intel_spirv_subgroups.html',
     },
+    'cl_intel_split_work_group_barrier' : {
+        'number' : 81,
+        'flags' : { 'public' },
+        'url' : 'extensions/intel/cl_intel_split_work_group_barrier.html',
+    },
+    'cl_intel_subgroup_matrix_multiply_accumulate' : {
+        'number' : 78,
+        'flags' : { 'public' },
+        'url' : 'extensions/intel/cl_intel_subgroup_matrix_multiply_accumulate.html',
+    },
+    'cl_intel_subgroup_split_matrix_multiply_accumulate' : {
+        'number' : 79,
+        'flags' : { 'public' },
+        'url' : 'extensions/intel/cl_intel_subgroup_split_matrix_multiply_accumulate.html',
+    },
     'cl_intel_subgroups' : {
         'number' : 35,
         'flags' : { 'public' },