"native" instruction set alias for AOT compilers #73246

Closed
jkotas opened this issue Aug 2, 2022 · 6 comments · Fixed by #87865
Labels: area-NativeAOT-coreclr, help wanted (up-for-grabs: good issue for external contributors)
Milestone: 8.0.0

Comments

@jkotas (Member) commented Aug 2, 2022

It would match the native architecture of the processor on which publishing happens.
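Presumably the alias would be requested the same way other instruction sets are specified today, e.g. <IlcInstructionSet>native</IlcInstructionSet> in the project file; the exact surface syntax is an assumption and isn't pinned down in this issue.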

Context:

@dotnet-issue-labeler

I couldn't figure out the best area label to add to this issue. If you have write-permissions please help me learn by adding exactly one area label.

ghost added the untriaged label on Aug 2, 2022
jkotas added this to the Future milestone on Aug 2, 2022
ghost removed the untriaged label on Aug 2, 2022
@MichalStrehovsky (Member) commented:

I think the most maintainable way might be to extract the CPU flag detection from the runtime:

```cpp
bool DetectCPUFeatures()
{
#if defined(HOST_X86) || defined(HOST_AMD64) || defined(HOST_ARM64)

#if defined(HOST_X86) || defined(HOST_AMD64)
    int cpuidInfo[4];

    const int EAX = 0;
    const int EBX = 1;
    const int ECX = 2;
    const int EDX = 3;

    __cpuid(cpuidInfo, 0x00000000);
    uint32_t maxCpuId = static_cast<uint32_t>(cpuidInfo[EAX]);

    if (maxCpuId >= 1)
    {
        __cpuid(cpuidInfo, 0x00000001);

        if (((cpuidInfo[EDX] & (1 << 25)) != 0) && ((cpuidInfo[EDX] & (1 << 26)) != 0)) // SSE & SSE2
        {
            if ((cpuidInfo[ECX] & (1 << 25)) != 0) // AESNI
            {
                g_cpuFeatures |= XArchIntrinsicConstants_Aes;
            }

            if ((cpuidInfo[ECX] & (1 << 1)) != 0) // PCLMULQDQ
            {
                g_cpuFeatures |= XArchIntrinsicConstants_Pclmulqdq;
            }

            if ((cpuidInfo[ECX] & (1 << 0)) != 0) // SSE3
            {
                g_cpuFeatures |= XArchIntrinsicConstants_Sse3;

                if ((cpuidInfo[ECX] & (1 << 9)) != 0) // SSSE3
                {
                    g_cpuFeatures |= XArchIntrinsicConstants_Ssse3;

                    if ((cpuidInfo[ECX] & (1 << 19)) != 0) // SSE4.1
                    {
                        g_cpuFeatures |= XArchIntrinsicConstants_Sse41;

                        if ((cpuidInfo[ECX] & (1 << 20)) != 0) // SSE4.2
                        {
                            g_cpuFeatures |= XArchIntrinsicConstants_Sse42;

                            if ((cpuidInfo[ECX] & (1 << 22)) != 0) // MOVBE
                            {
                                g_cpuFeatures |= XArchIntrinsicConstants_Movbe;
                            }

                            if ((cpuidInfo[ECX] & (1 << 23)) != 0) // POPCNT
                            {
                                g_cpuFeatures |= XArchIntrinsicConstants_Popcnt;
                            }

                            if (((cpuidInfo[ECX] & (1 << 27)) != 0) && ((cpuidInfo[ECX] & (1 << 28)) != 0)) // OSXSAVE & AVX
                            {
                                if (PalIsAvxEnabled() && (xmmYmmStateSupport() == 1))
                                {
                                    g_cpuFeatures |= XArchIntrinsicConstants_Avx;

                                    if ((cpuidInfo[ECX] & (1 << 12)) != 0) // FMA
                                    {
                                        g_cpuFeatures |= XArchIntrinsicConstants_Fma;
                                    }

                                    if (maxCpuId >= 0x07)
                                    {
                                        __cpuidex(cpuidInfo, 0x00000007, 0x00000000);

                                        if ((cpuidInfo[EBX] & (1 << 5)) != 0) // AVX2
                                        {
                                            g_cpuFeatures |= XArchIntrinsicConstants_Avx2;

                                            __cpuidex(cpuidInfo, 0x00000007, 0x00000001);

                                            if ((cpuidInfo[EAX] & (1 << 4)) != 0) // AVX-VNNI
                                            {
                                                g_cpuFeatures |= XArchIntrinsicConstants_AvxVnni;
                                            }
                                        }
                                    }
                                }
                            }
                        }
                    }
                }
            }
        }

        if (maxCpuId >= 0x07)
        {
            __cpuidex(cpuidInfo, 0x00000007, 0x00000000);

            if ((cpuidInfo[EBX] & (1 << 3)) != 0) // BMI1
            {
                g_cpuFeatures |= XArchIntrinsicConstants_Bmi1;
            }

            if ((cpuidInfo[EBX] & (1 << 8)) != 0) // BMI2
            {
                g_cpuFeatures |= XArchIntrinsicConstants_Bmi2;
            }
        }
    }

    __cpuid(cpuidInfo, 0x80000000);
    uint32_t maxCpuIdEx = static_cast<uint32_t>(cpuidInfo[EAX]);

    if (maxCpuIdEx >= 0x80000001)
    {
        __cpuid(cpuidInfo, 0x80000001);

        if ((cpuidInfo[ECX] & (1 << 5)) != 0) // LZCNT
        {
            g_cpuFeatures |= XArchIntrinsicConstants_Lzcnt;
        }

#ifdef HOST_AMD64
        // AMD has a "fast" mode for fxsave/fxrstor, which omits the saving of xmm registers. The OS will enable this mode
        // if it is supported. So if we continue to use fxsave/fxrstor, we must manually save/restore the xmm registers.
        // fxsr_opt is bit 25 of EDX
        if ((cpuidInfo[EDX] & (1 << 25)) != 0)
            g_fHasFastFxsave = true;
#endif
    }
#endif // HOST_X86 || HOST_AMD64

#if defined(HOST_ARM64)
    PAL_GetCpuCapabilityFlags(&g_cpuFeatures);
#endif

    if ((g_cpuFeatures & g_requiredCpuFeatures) != g_requiredCpuFeatures)
    {
        PalPrintFatalError("\nThe required instruction sets are not supported by the current CPU.\n");
        RhFailFast();
    }
#endif // HOST_X86 || HOST_AMD64 || HOST_ARM64

    return true;
}
#endif // !USE_PORTABLE_HELPERS
```

Into a place that can be shared with the JitInterface native library:

https://github.com/dotnet/runtime/tree/cdf21f143735b8d104c8e636a37eb068904cdd8b/src/coreclr/tools/aot/jitinterface

Then compile that into jitinterface.dll (which ships with ILC) and p/invoke into it (sketched below).

We already have managed definitions of the flags this returns, because the computed values are masked against compile-time expectations burned into the produced executable to ensure we don't run on machines that lack the expected CPU features.

As a stretch goal, we might try to unify this detection with what's in the CoreCLR VM, but that might be too much extra scope. Extracting something eligible to be placed under src/native/minipal in the repo would be a very good first step towards that.
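A minimal sketch of the managed side, assuming a hypothetical GetHostCpuFeatures export from jitinterface; the entry point name, signature, and return shape are assumptions for illustration, not an existing API:

```csharp
using System.Runtime.InteropServices;

internal static class HostCpuFeatures
{
    // Hypothetical export from jitinterface.dll: returns the same bit flags that
    // the runtime's DetectCPUFeatures computes, so ILC could map them onto its
    // managed instruction-set definitions when "native" is requested.
    [DllImport("jitinterface")]
    internal static extern int GetHostCpuFeatures();
}
```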

jkotas added the help wanted label on Aug 3, 2022
MichalStrehovsky changed the milestone from Future to 8.0.0 on Feb 21, 2023
@JamesNK (Member) commented Feb 21, 2023

A performance hit was noticed when testing a Native AOT gRPC app on Linux ARM.

AOT vs CoreCLR:
[benchmark screenshot]

Compared to a minor perf hit of AOT on Linux Intel:
[benchmark screenshot]

The probable culprit is EventSource methods that use Interlocked to increment longs:
https://github.com/grpc/grpc-dotnet/blob/0b365bf4633c9f05d0af374ed8607c046e8e74dd/src/Grpc.AspNetCore.Server/Internal/GrpcEventSource.cs#L67-L75
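The pattern in question looks roughly like this (an illustrative sketch, not the exact gRPC source; the type and field names are hypothetical):

```csharp
using System.Threading;

internal sealed class GrpcServerMetrics
{
    private long _callsStarted;

    // Each hot-path event bumps a 64-bit counter with an interlocked operation;
    // this is where the arm64 baseline chosen at publish time matters.
    public void CallStart() => Interlocked.Increment(ref _callsStarted);
}
```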

@EgorBo (Member) commented Feb 21, 2023

@JamesNK that makes sense. Native AOT uses ARMv8.0 as the arm64 baseline, while the single-instruction (LSE) atomics require ARMv8.1, so you need to opt into the lse capability for Native AOT, e.g. <IlcInstructionSet>lse</IlcInstructionSet>, or

--application.buildArguments \"/p:IlcInstructionSet=lse\"

for crank.

I think a while ago we discussed a named instruction set for Azure (to include its baseline instructions).

@JamesNK (Member) commented Feb 21, 2023

Yes, that fixed it.

Before: 239,492 RPS
After: 849,835 RPS

Also, using Interlocked only when required, via gRPC PR grpc/grpc-dotnet#2052, will improve performance in the benchmark.

@omariom (Contributor) commented Mar 8, 2023

@JamesNK If the effect is this large, then maybe the hottest counters should be placed on their own cache lines?
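A minimal sketch of that idea, assuming 64-byte cache lines (the type is hypothetical, not from the gRPC code):

```csharp
using System.Runtime.InteropServices;
using System.Threading;

// Hypothetical padded counter: sizing the struct to a full 64-byte cache line gives
// each hot counter its own line, so concurrent increments don't false-share.
[StructLayout(LayoutKind.Explicit, Size = 64)]
internal struct PaddedCounter
{
    [FieldOffset(0)]
    private long _value;

    public void Increment() => Interlocked.Increment(ref _value);
    public long Read() => Interlocked.Read(ref _value);
}
```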

MichalStrehovsky added a commit to MichalStrehovsky/runtime that referenced this issue on Jun 21, 2023:

This allows compiling for the ISA extensions that the currently running CPU supports.

Fixes dotnet#73246.
ghost added the in-pr label on Jun 21, 2023
ghost removed the in-pr label on Jul 20, 2023
ghost locked as resolved and limited conversation to collaborators on Aug 19, 2023