API Proposal: AVX-VNNI intrinsics #43780

hanblee · 2020-10-23T18:40:30Z

Background and Motivation

The upcoming Intel® Alder Lake and Sapphire Rapids processors will introduce AVX-VNNI instruction set architecture which provides VEX-encoded versions of the Vector Neural Network Instructions (reference: https://software.intel.com/content/dam/develop/external/us/en/documents/architecture-instruction-set-extensions-programming-reference.pdf). This proposal aims to expose AVX-VNNI instructions via intrinsics.

Proposed API

namespace System.Runtime.Intrinsics.X86
{
    public abstract class AvxVnni : Avx2
    {
        internal AvxVnni() { }

        public static new bool IsSupported { [Intrinsic] get { return false; } }

        public new abstract class X64 : Avx2.X64
        {
            internal X64() { }

            public static new bool IsSupported { [Intrinsic] get { return false; } }
        }

        /// <summary>
        /// __m128i _mm_dpbusd_epi32 (__m128i src, __m128i a, __m128i b)
        /// VPDPBUSD xmm, xmm, xmm
        /// </summary>
        public static Vector128<int> MultiplyWideningAndAdd(Vector128<int> addend, Vector128<byte> left, Vector128<sbyte> right) { throw new PlatformNotSupportedException(); }

        /// <summary>
        /// __m128i _mm_dpwssd_epi32 (__m128i src, __m128i a, __m128i b)
        /// VPDPWSSD xmm, xmm, xmm
        /// </summary>
        public static Vector128<int> MultiplyWideningAndAdd(Vector128<int> addend, Vector128<short> left, Vector128<short> right) { throw new PlatformNotSupportedException(); }

        /// <summary>
        /// __m256i _mm256_dpbusd_epi32 (__m256i src, __m256i a, __m256i b)
        /// VPDPBUSD ymm, ymm, ymm
        /// </summary>
        public static Vector256<int> MultiplyWideningAndAdd(Vector256<int> addend, Vector256<byte> left, Vector256<sbyte> right) { throw new PlatformNotSupportedException(); }

        /// <summary>
        /// __m256i _mm256_dpwssd_epi32 (__m256i src, __m256i a, __m256i b)
        /// VPDPWSSD ymm, ymm, ymm
        /// </summary>
        public static Vector256<int> MultiplyWideningAndAdd(Vector256<int> addend, Vector256<short> left, Vector256<short> right) { throw new PlatformNotSupportedException(); }

        /// <summary>
        /// __m128i _mm_dpbusds_epi32 (__m128i src, __m128i a, __m128i b)
        /// VPDPBUSDS xmm, xmm, xmm
        /// </summary>
        public static Vector128<int> MultiplyWideningAndAddSaturate(Vector128<int> addend, Vector128<byte> left, Vector128<sbyte> right) { throw new PlatformNotSupportedException(); }

        /// <summary>
        /// __m128i _mm_dpwssds_epi32 (__m128i src, __m128i a, __m128i b)
        /// VPDPWSSDS xmm, xmm, xmm
        /// </summary>
        public static Vector128<int> MultiplyWideningAndAddSaturate(Vector128<int> addend, Vector128<short> left, Vector128<short> right) { throw new PlatformNotSupportedException(); }

        /// <summary>
        /// __m256i _mm256_dpbusds_epi32 (__m256i src, __m256i a, __m256i b)
        /// VPDPBUSDS ymm, ymm, ymm
        /// </summary>
        public static Vector256<int> MultiplyWideningAndAddSaturate(Vector256<int> addend, Vector256<byte> left, Vector256<sbyte> right) { throw new PlatformNotSupportedException(); }

        /// <summary>
        /// __m256i _mm256_dpwssds_epi32 (__m256i src, __m256i a, __m256i b)
        /// VPDPWSSDS ymm, ymm, ymm
        /// </summary>
        public static Vector256<int> MultiplyWideningAndAddSaturate(Vector256<int> addend, Vector256<short> left, Vector256<short> right) { throw new PlatformNotSupportedException(); }
    }
}

/cc @tannergooding @CarolEidt

ghost · 2020-10-23T18:40:38Z

Tagging subscribers to this area: @tannergooding, @jeffhandley
See info in area-owners.md if you want to be subscribed.

tannergooding · 2020-10-23T18:55:42Z

Thanks for the proposal, these LGTM.

It might be beneficial to name the operands left, right, and addend to help disambiguate which each is used for.
It would also be good to identify it these operations are effectively "fused" and if so include that in the name. That is, is a * 2 - a == a or might it overflow and produce a different result?

GrabYourPitchforks · 2020-10-23T19:01:08Z

@tannergooding What does the order of operations normally look like re: API review and implementation? Presumably we won't implement until we have test hardware, even if the API is approved?

tannergooding · 2020-10-23T19:03:31Z

That is likely something that needs more discussion.

One could imagine implementing it as an "experimental API" even without hardware. But I don't believe we would ship without being able to validate things more end to end.

Of course Carol, Bruce, or others may have different opinions here 😄

tannergooding · 2020-10-23T19:04:28Z

-- I'd also like to double check what we did here for ARM, as I feel we included Widening in the names for similar instructions.

CC. @echesakovMSFT

echesakov · 2020-10-23T19:42:50Z

-- I'd also like to double check what we did here for ARM, as I feel we included Widening in the names for similar instructions.

CC. @echesakovMSFT

Yes, we did include Widening in the names of similar intrinsics on Arm64. For example,

runtime/src/libraries/System.Private.CoreLib/src/System/Runtime/Intrinsics/Arm/AdvSimd.cs

Lines 10153 to 10158 in 54906ea

    
           /// <summary> 
        
           /// uint16x8_t vmlal_u8 (uint16x8_t a, uint8x8_t b, uint8x8_t c) 
        
           ///   A32: VMLAL.U8 Qd, Dn, Dm 
        
           ///   A64: UMLAL Vd.8H, Vn.8B, Vm.8B 
        
           /// </summary> 
        
           public static Vector128<ushort> MultiplyWideningLowerAndAdd(Vector128<ushort> addend, Vector64<byte> left, Vector64<byte> right) => MultiplyWideningLowerAndAdd(addend, left, right);

and

runtime/src/libraries/System.Private.CoreLib/src/System/Runtime/Intrinsics/Arm/AdvSimd.cs

Lines 2056 to 2061 in 54906ea

    
           /// <summary> 
        
           /// int32_t vqdmlalh_lane_s16 (int32_t a, int16_t b, int16x4_t v, const int lane) 
        
           ///   A64: SQDMLAL Sd, Hn, Vm.H[lane] 
        
           /// </summary> 
        
           public static Vector64<int> MultiplyDoublingWideningScalarBySelectedScalarAndAddSaturate(Vector64<int> addend, Vector64<short> left, Vector64<short> right, byte rightIndex) => MultiplyDoublingWideningScalarBySelectedScalarAndAddSaturate(addend, left, right, rightIndex);

Presumably, we can name the intrinsics the same way and, as an example, the one that corresponds to __m128i _mm_dpbusd_epi32 (__m128i src, __m128i a, __m128i b) would become MultiplyWideningAndAdd.

Although on Arm64 "widening" always means doubling the size of the result while on Intel it would also include doubling and quadrupling (byte -> int)

hanblee · 2020-10-23T20:15:31Z

Thanks for the feedback.

public static Vector64 MultiplyDoublingWideningScalarBySelectedScalarAndAddSaturate(...)

It looks like Saturate is used as a shortcut for WithSignedSaturation as well. Would the preference be for Saturate then and use Widening? E.g.,
MultiplyWideningAndAddSaturate instead of MultiplyAddWithSignedSaturation?

echesakov · 2020-10-23T20:33:16Z

Thanks for the feedback.

public static Vector64 MultiplyDoublingWideningScalarBySelectedScalarAndAddSaturate(...)

It looks like Saturate is used as a shortcut for WithSignedSaturation as well. Would the preference be for Saturate then and use Widening? E.g.,
MultiplyWideningAndAddSaturate instead of MultiplyAddWithSignedSaturation?

If I understand what this instructions does - multiplies two values (bytes or shorts) with widening the result to larger type (short or int) and sum them up; if the value exceeds int positive or negative boundaries, it will saturate the value then the name seems right. I don't think we need to specify SignedSaturation since it's clear from the resulting value type.

I am curious why in

 MultiplyAdd(Vector128<int> source, Vector128<byte> a, Vector128<sbyte> b)

a corresponds to unsigned bytes while b to signed bytes?

hanblee · 2020-10-23T20:42:44Z

The operand types are from the programming reference for VPDPBUSD:

Description
Multiplies the individual unsigned bytes of the first source operand by the corresponding signed bytes of the second
source operand, producing intermediate signed word results. The word results are then summed and accumulated
in the destination dword element size operand.

and that is also how the C intrinsic is designed:

__m128i _mm_dpbusd_epi32 (__m128i src, __m128i a, __m128i b)
Description
Multiply groups of 4 adjacent pairs of unsigned 8-bit integers in a with corresponding signed 8-bit integers in b, producing 4 intermediate signed 16-bit results. Sum these 4 results with the corresponding 32-bit integer in src, and store the packed 32-bit results in dst.

hanblee · 2020-11-19T00:54:07Z

I updated the original post based on suggestions.

hanblee · 2021-03-22T18:54:12Z

One could imagine implementing it as an "experimental API" even without hardware. But I don't believe we would ship without being able to validate things more end to end.

@tannergooding Since the HW spec is out in the public, I think the APIs are ready to be reviewed. I agree that it may not make sense to ship them until the hardware becomes available. Is there an experimental repo/branch where we can make a PR to get the implementation reviewed prior to hardware availability?

tannergooding · 2021-03-22T19:05:51Z

We can discuss exposing them in System.Runtime.Intrinsics.Experimental which is where we've exposed other "preview intrinsics". We'll also need to discuss how these will be tested even once hardware does become available since they won't be on our current CI machines.

I'm fine with marking this api-ready-for-review if there are no other concerns.

CC. @echesakovMSFT

bartonjs · 2021-04-27T17:13:24Z

Video

Looks good as proposed. One thing that was observed is one set of methods is short/short, but the other is byte/sbyte. Double check the sign bits.

namespace System.Runtime.Intrinsics.X86
{
    public abstract class AvxVnni : Avx2
    {
        internal AvxVnni() { }

        public static new bool IsSupported { [Intrinsic] get { return false; } }

        public new abstract class X64 : Avx2.X64
        {
            internal X64() { }

            public static new bool IsSupported { [Intrinsic] get { return false; } }
        }

        /// <summary>
        /// __m128i _mm_dpbusd_epi32 (__m128i src, __m128i a, __m128i b)
        /// VPDPBUSD xmm, xmm, xmm
        /// </summary>
        public static Vector128<int> MultiplyWideningAndAdd(Vector128<int> addend, Vector128<byte> left, Vector128<sbyte> right) { throw new PlatformNotSupportedException(); }

        /// <summary>
        /// __m128i _mm_dpwssd_epi32 (__m128i src, __m128i a, __m128i b)
        /// VPDPWSSD xmm, xmm, xmm
        /// </summary>
        public static Vector128<int> MultiplyWideningAndAdd(Vector128<int> addend, Vector128<short> left, Vector128<short> right) { throw new PlatformNotSupportedException(); }

        /// <summary>
        /// __m256i _mm256_dpbusd_epi32 (__m256i src, __m256i a, __m256i b)
        /// VPDPBUSD ymm, ymm, ymm
        /// </summary>
        public static Vector256<int> MultiplyWideningAndAdd(Vector256<int> addend, Vector256<byte> left, Vector256<sbyte> right) { throw new PlatformNotSupportedException(); }

        /// <summary>
        /// __m256i _mm256_dpwssd_epi32 (__m256i src, __m256i a, __m256i b)
        /// VPDPWSSD ymm, ymm, ymm
        /// </summary>
        public static Vector256<int> MultiplyWideningAndAdd(Vector256<int> addend, Vector256<short> left, Vector256<short> right) { throw new PlatformNotSupportedException(); }

        /// <summary>
        /// __m128i _mm_dpbusds_epi32 (__m128i src, __m128i a, __m128i b)
        /// VPDPBUSDS xmm, xmm, xmm
        /// </summary>
        public static Vector128<int> MultiplyWideningAndAddSaturate(Vector128<int> addend, Vector128<byte> left, Vector128<sbyte> right) { throw new PlatformNotSupportedException(); }

        /// <summary>
        /// __m128i _mm_dpwssds_epi32 (__m128i src, __m128i a, __m128i b)
        /// VPDPWSSDS xmm, xmm, xmm
        /// </summary>
        public static Vector128<int> MultiplyWideningAndAddSaturate(Vector128<int> addend, Vector128<short> left, Vector128<short> right) { throw new PlatformNotSupportedException(); }

        /// <summary>
        /// __m256i _mm256_dpbusds_epi32 (__m256i src, __m256i a, __m256i b)
        /// VPDPBUSDS ymm, ymm, ymm
        /// </summary>
        public static Vector256<int> MultiplyWideningAndAddSaturate(Vector256<int> addend, Vector256<byte> left, Vector256<sbyte> right) { throw new PlatformNotSupportedException(); }

        /// <summary>
        /// __m256i _mm256_dpwssds_epi32 (__m256i src, __m256i a, __m256i b)
        /// VPDPWSSDS ymm, ymm, ymm
        /// </summary>
        public static Vector256<int> MultiplyWideningAndAddSaturate(Vector256<int> addend, Vector256<short> left, Vector256<short> right) { throw new PlatformNotSupportedException(); }
    }
}

hanblee · 2021-04-27T17:24:51Z

One thing that was observed is one set of methods is short/short, but the other is byte/sbyte. Double check the sign bits.

That's the way the instructions were designed.

The operand types are from the programming reference for VPDPBUSD:

Description
Multiplies the individual unsigned bytes of the first source operand by the corresponding signed bytes of the second
source operand, producing intermediate signed word results. The word results are then summed and accumulated
in the destination dword element size operand.

And for VPDPWSSD:

Description
Multiplies the individual signed words of the first source operand by the corresponding signed words of the second
source operand, producing intermediate signed, doubleword results. The adjacent doubleword results are then
summed and accumulated in the destination operand.

tannergooding · 2021-06-17T23:22:45Z

This was implemented and merged.

hanblee added the api-suggestion Early API idea and discussion, it is NOT ready for implementation label Oct 23, 2020

Dotnet-GitSync-Bot added area-System.Runtime.Intrinsics untriaged New issue has not been triaged by the area owner labels Oct 23, 2020

tannergooding removed the untriaged New issue has not been triaged by the area owner label Oct 23, 2020

tannergooding added api-ready-for-review API is ready for review, it is NOT ready for implementation and removed api-suggestion Early API idea and discussion, it is NOT ready for implementation labels Apr 5, 2021

tannergooding added this to the 6.0.0 milestone Apr 13, 2021

tannergooding added the blocking Marks issues that we want to fast track in order to unblock other important work label Apr 23, 2021

bartonjs added api-approved API was approved in API review, it can be implemented and removed api-ready-for-review API is ready for review, it is NOT ready for implementation blocking Marks issues that we want to fast track in order to unblock other important work labels Apr 27, 2021

This was referenced Apr 28, 2021

Add support for AvxVnni instructions under Experimental. #51998

Merged

Add correct instruction execution characteristics for AVX-VNNI instrinsics #52121

Closed

tannergooding closed this as completed Jun 17, 2021

ghost locked as resolved and limited conversation to collaborators Jul 18, 2021

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

API Proposal: AVX-VNNI intrinsics #43780

API Proposal: AVX-VNNI intrinsics #43780

hanblee commented Oct 23, 2020 •

edited

Loading

ghost commented Oct 23, 2020

tannergooding commented Oct 23, 2020

GrabYourPitchforks commented Oct 23, 2020

tannergooding commented Oct 23, 2020

tannergooding commented Oct 23, 2020

echesakov commented Oct 23, 2020

hanblee commented Oct 23, 2020

echesakov commented Oct 23, 2020

hanblee commented Oct 23, 2020 •

edited

Loading

hanblee commented Nov 19, 2020

hanblee commented Mar 22, 2021 •

edited

Loading

tannergooding commented Mar 22, 2021 •

edited

Loading

bartonjs commented Apr 27, 2021 •

edited by dotnet-api-review bot

Loading

hanblee commented Apr 27, 2021

tannergooding commented Jun 17, 2021

API Proposal: AVX-VNNI intrinsics #43780

API Proposal: AVX-VNNI intrinsics #43780

Comments

hanblee commented Oct 23, 2020 • edited Loading

Background and Motivation

Proposed API

ghost commented Oct 23, 2020

tannergooding commented Oct 23, 2020

GrabYourPitchforks commented Oct 23, 2020

tannergooding commented Oct 23, 2020

tannergooding commented Oct 23, 2020

echesakov commented Oct 23, 2020

hanblee commented Oct 23, 2020

echesakov commented Oct 23, 2020

hanblee commented Oct 23, 2020 • edited Loading

hanblee commented Nov 19, 2020

hanblee commented Mar 22, 2021 • edited Loading

tannergooding commented Mar 22, 2021 • edited Loading

bartonjs commented Apr 27, 2021 • edited by dotnet-api-review bot Loading

hanblee commented Apr 27, 2021

tannergooding commented Jun 17, 2021

hanblee commented Oct 23, 2020 •

edited

Loading

hanblee commented Oct 23, 2020 •

edited

Loading

hanblee commented Mar 22, 2021 •

edited

Loading

tannergooding commented Mar 22, 2021 •

edited

Loading

bartonjs commented Apr 27, 2021 •

edited by dotnet-api-review bot

Loading