Improve Guid v7 performance on Unix #106525

yaakov-h · 2024-08-16T05:38:12Z

Draft PR for discussion.

Seemingly just skipping 6/16 bytes of random generation is enough to speed up performance by almost 3x, and I really have no idea why.
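For readers who want to see what "skipping 6 of 16 bytes" means in practice, here is a minimal sketch, not the code from this PR. It assumes the RFC 9562 UUIDv7 layout (48-bit Unix millisecond timestamp, then version, variant, and random bits), and `PartialRandomV7Sketch` is a hypothetical name. Only the last 10 bytes are requested from the CSPRNG; the first 6 are overwritten by the timestamp anyway.

```csharp
using System;
using System.Buffers.Binary;
using System.Security.Cryptography;

// Hypothetical helper for illustration only; not the PR's implementation.
static class PartialRandomV7Sketch
{
    public static Guid Create(DateTimeOffset timestamp)
    {
        Span<byte> bytes = stackalloc byte[16];
        long unixMs = timestamp.ToUnixTimeMilliseconds();

        // Bytes 0..5: 48-bit big-endian Unix timestamp in milliseconds.
        BinaryPrimitives.WriteUInt16BigEndian(bytes, (ushort)(unixMs >> 32));
        BinaryPrimitives.WriteUInt32BigEndian(bytes.Slice(2), (uint)unixMs);

        // Bytes 6..15: only these 10 bytes come from the CSPRNG.
        RandomNumberGenerator.Fill(bytes.Slice(6));

        // Stamp the version (7) and the RFC variant (binary 10xx) bits.
        bytes[6] = (byte)((bytes[6] & 0x0F) | 0x70);
        bytes[8] = (byte)((bytes[8] & 0x3F) | 0x80);

        // Guid(ReadOnlySpan<byte>, bool bigEndian) is available from .NET 8.
        return new Guid(bytes, bigEndian: true);
    }
}
```

Calling `PartialRandomV7Sketch.Create(DateTimeOffset.UtcNow)` yields a value whose string form starts with the timestamp and has `7` as the version digit.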

Benchmarks summary:

// * Summary *

BenchmarkDotNet v0.13.12, macOS Sonoma 14.6 (23G80) [Darwin 23.6.0]
Apple M2 Pro, 1 CPU, 10 logical and 10 physical cores
.NET SDK 9.0.100-preview.7.24407.12
  [Host]     : .NET 9.0.0 (9.0.24.40507), Arm64 RyuJIT AdvSIMD
  Job-MJUZQJ : .NET 9.0.0 (42.42.42.42424), Arm64 RyuJIT AdvSIMD
  .NET 9.0   : .NET 9.0.0 (9.0.24.40507), Arm64 RyuJIT AdvSIMD

Runtime=.NET 9.0  

| Method  | Job        | Toolchain | Mean      | Error    | StdDev   | Ratio | Allocated | Alloc Ratio |
|-------- |----------- |---------- |----------:|---------:|---------:|------:|----------:|------------:|
| runtime | Job-MJUZQJ | CoreRun   |  95.65 ns | 0.773 ns | 0.603 ns |  1.00 |         - |          NA |
|         |            |           |           |          |          |       |           |             |
| runtime | .NET 9.0   | Default   | 261.56 ns | 1.714 ns | 1.519 ns |  1.00 |         - |          NA |

Where Job-MJUZQJ is this PR and .NET 9.0 is the public Preview 7 bits.

Fixes #106377.

src/libraries/System.Private.CoreLib/src/System/Guid.Unix.cs

yaakov-h · 2024-09-22T06:24:04Z

@jeffhandley what would it take to reopen this and get perf improved for this new API on non-Windows targets?

jkotas · 2024-09-22T15:40:29Z

@EgorBot -intel -arm64 -perf

using System;
using System.Runtime.CompilerServices;
using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;

BenchmarkRunner.Run<Bench>(args: args);

public class Bench
{
    [Benchmark]
    public Guid Foo() => Guid.CreateVersion7();
}

jkotas · 2024-09-22T16:15:09Z

The performance improvement on Linux is 1-2%: EgorBot/runtime-utils#93 (comment). That is the kind of improvement I would expect from this change. (Unfortunately, we do not have a quick automated way to run a micro-benchmark on macOS, which is where you ran it.)

Could you please share the exact sources for the micro-benchmark that you have used?

yaakov-h · 2024-09-23T00:46:21Z

It was this; with the comments removed it looks identical to yours:

using System;
using BenchmarkDotNet;
using BenchmarkDotNet.Attributes;

namespace uuidv7.net
{
    // [SimpleJob(BenchmarkDotNet.Jobs.RuntimeMoniker.Net80)]
    [SimpleJob(BenchmarkDotNet.Jobs.RuntimeMoniker.Net90)]
    [MemoryDiagnoser]
    public class Benchmarks
    {
        // [Benchmark(Baseline = true)]
        // public Guid Original() => new Guid(UUIDv7_v1.Generate());

        // [Benchmark]
        // public Guid vcsjones() => new UUIDv7_v2().AsGuid();

        // [Benchmark]
        // public Guid yaakov() => new UUIDv7_v3().AsGuid();

        // [Benchmark]
        // public Guid yaakov_with_vcsjones_improved_fill() => new UUIDv7_v4().AsGuid();

        [Benchmark(Baseline = true)]
        public Guid runtime() => Guid.CreateVersion7();

        // [Benchmark]
        // public Guid faster() => new UUIDv7_v5().AsGuid();

        // [Benchmark]
        // public Guid faster_localsinit() => new UUIDv7_v6().AsGuid();
    }
}

Program.Main is just:

var config = DefaultConfig.Instance;
var summary = BenchmarkRunner.Run<Benchmarks>(config, args);

Command used to run the benchmark:

dotnet run -c release -- --coreRun "/Users/yaakov/Developer/GitHub/dotnet/runtime/artifacts/bin/testhost/net9.0-osx-Release-arm64/shared/Microsoft.NETCore.App/9.0.0/corerun"

Output as of today with RC1:

// * Summary *

BenchmarkDotNet v0.13.12, macOS 15.0 (24A335) [Darwin 24.0.0]
Apple M2 Pro, 1 CPU, 10 logical and 10 physical cores
.NET SDK 9.0.100-rc.1.24452.12
  [Host]     : .NET 9.0.0 (9.0.24.43107), Arm64 RyuJIT AdvSIMD
  Job-GWPJNN : .NET 9.0.0 (42.42.42.42424), Arm64 RyuJIT AdvSIMD
  .NET 9.0   : .NET 9.0.0 (9.0.24.43107), Arm64 RyuJIT AdvSIMD

Runtime=.NET 9.0  

| Method  | Job        | Toolchain | Mean      | Error    | StdDev   | Ratio | Allocated | Alloc Ratio |
|-------- |----------- |---------- |----------:|---------:|---------:|------:|----------:|------------:|
| runtime | Job-GWPJNN | CoreRun   |  93.98 ns | 0.766 ns | 0.717 ns |  1.00 |         - |          NA |
|         |            |           |           |          |          |       |           |             |
| runtime | .NET 9.0   | Default   | 252.00 ns | 1.396 ns | 1.238 ns |  1.00 |         - |          NA |

EgorBo · 2024-09-23T00:52:39Z

> It was this, with the comments removed it looks identical to yours:

My understanding is that your PR does two things:

- Inlines the NewGuid method by hand (hence the 1-2% improvement reported by the bot on Linux), which can be seen in the flamegraph
- Reads 10 instead of 16 bytes from Interop.GetRandomBytes

It is really hard to imagine it being 3x faster unless macOS has some special handling of smaller sizes for urandom.

yaakov-h · 2024-09-23T01:03:11Z

huh yeah, that is interesting.

I just retested on Asahi Fedora (arm64) and it's only about a 3.5% improvement there. Let me check macOS x64...

EgorBo · 2024-09-23T01:27:12Z

> Command used to run the benchmark:

That's not how we test changes. Typically, we build corerun for the baseline (main) and then corerun for the changes, and the benchmark is then run with --corerun /path/to/base/corerun /path/to/diff/corerun

EgorBo · 2024-09-23T02:14:58Z

Interesting, I can actually reproduce the same 3-5x difference on my MacBook Pro M2 Max 😕

Self-contained benchmark:

using System.Reflection;
using System.Runtime.CompilerServices;
using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;

BenchmarkRunner.Run<Bench>(args: args);

public unsafe class Bench
{
    delegate void GetRandomBytesDelegate(byte* pbBuffer, int count);
    static GetRandomBytesDelegate GetRandomBytes;
    static Bench()
    {
        MethodInfo getRandomBytesMethod = typeof(object).Assembly.GetType("Interop")!.GetMethod("GetRandomBytes",
            BindingFlags.NonPublic | BindingFlags.Static);
        GetRandomBytes = (GetRandomBytesDelegate)getRandomBytesMethod!.CreateDelegate(typeof(GetRandomBytesDelegate));
    }

    [Benchmark]
    public void GetRandom16()
    {
        byte* data = stackalloc byte[16];
        GetRandomBytes(data, 16);
        Consume(data);
    }

    [Benchmark]
    public void GetRandom10()
    {
        byte* data = stackalloc byte[10];
        GetRandomBytes(data, 10);
        Consume(data);
    }

    [MethodImpl(MethodImplOptions.NoInlining)]
    static void Consume(byte* _){}
}

| Method      | Mean      | Error    | StdDev   |
|------------ |----------:|---------:|---------:|
| GetRandom16 | 257.92 ns | 0.831 ns | 0.737 ns |
| GetRandom10 |  51.95 ns | 0.429 ns | 0.380 ns |

yaakov-h · 2024-09-23T02:40:27Z

I get a 5x difference on macOS arm64 M2 Pro with that particular microbenchmark. What the... 🫢

EgorBo · 2024-09-23T02:51:56Z

I've taken a quick look at it via Xcode Instruments:

Length=10: (call-tree screenshot not preserved)

Length=16: (call-tree screenshot not preserved)

So apparently, length 10 goes through ccrng_schedule_read while length 16 always goes through ccdrbg_generate.
Perhaps, for small buffers Apple just returns time and larger buffers go through more secure routine?

Or perhaps it's some sort of a queue (cache) of random values and with length=16 we drain it too fast?

PS: Length >= 12 is where it becomes slow.

PS2: Ah, perhaps Length>=12 has to be a strong random because it's used as key/iv in various AES etc while something smaller is not
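The size at which the slowdown starts can be probed from managed code with a rough sweep like the one below. This is a sketch under assumptions: `RngSizeSweepSketch` is a hypothetical name, it goes through the public `RandomNumberGenerator.Fill` rather than the internal `Interop.GetRandomBytes`, and a plain `Stopwatch` loop is far less rigorous than BenchmarkDotNet, so treat the output as indicative only.

```csharp
using System;
using System.Diagnostics;
using System.Security.Cryptography;

// Hypothetical helper for illustration; not part of the PR or the runtime.
static class RngSizeSweepSketch
{
    // Average cost of one Fill call for the given buffer size, in nanoseconds.
    public static double NanosecondsPerCall(int bufferSize, int iterations)
    {
        byte[] buffer = new byte[bufferSize];

        // Short warm-up so JIT compilation is not part of the measurement.
        for (int i = 0; i < 1_000; i++)
            RandomNumberGenerator.Fill(buffer);

        long start = Stopwatch.GetTimestamp();
        for (int i = 0; i < iterations; i++)
            RandomNumberGenerator.Fill(buffer);
        long ticks = Stopwatch.GetTimestamp() - start;

        return ticks * (1e9 / Stopwatch.Frequency) / iterations;
    }

    // Prints one line per buffer size from 8 to 16 bytes.
    public static void PrintSweep()
    {
        for (int size = 8; size <= 16; size++)
            Console.WriteLine($"{size,2} bytes: {NanosecondsPerCall(size, 1_000_000):F1} ns/call");
    }
}
```

On a platform with a size-dependent fast path, the per-call time should step up at the threshold size; on platforms without one, the sweep should look roughly flat.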

yaakov-h · 2024-09-23T13:18:46Z

I had a quick look through Apple Open Source and the not-so-open crypto library source but can't find a branch like that.

It seems to be fairly recent: on my Intel Mac with macOS 12 there is also only a small perf gain, like on Linux.

EgorBo · 2024-09-23T13:37:03Z

> can't find a branch like that.

I can clearly see the branch (>=12) from the Xcode Instruments.

@stephentoub @jkotas @vcsjones @bartonjs so do we want to accept an apple-specific improvement like this? it's based on an internal implementation detail (may be changed in future) + it looks like the api gives us, potentially, less cryptographically secure random for <12 bytes - it's probably still important for guid generation?

My personal opinion is that if someone hits a bottleneck in guid generation, they should consider some ad hoc solution like incremental guids. Since macOS is typically not used on back-ends, I presume it's unlikely anyone will notice this improvement.

EgorBo · 2024-09-23T13:38:40Z

src/libraries/System.Private.CoreLib/src/System/Guid.Unix.cs

+
+        private static unsafe Guid CreateRandomizedPartialVersion7()
+        {
+            Guid g;


is it ok for UUIDv7 that you use a non-initialized variable here so the upper 6 bytes will be zero or stack garbage? (since corelib is compiled with SkipLocalsInit)
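To make the concern concrete, here is a minimal sketch (hypothetical names, not the PR's method) of one way to avoid depending on stack contents: zero the local explicitly and then fill only the trailing 10 bytes. With `Guid g = default;` the first 6 bytes are zero regardless of whether the assembly is compiled with SkipLocalsInit.

```csharp
using System;
using System.Runtime.InteropServices;
using System.Security.Cryptography;

// Hypothetical helper for illustration; not the method from the PR.
static class ZeroedPrefixSketch
{
    public static Guid CreateWithZeroedPrefix()
    {
        // Explicit zero-initialization, independent of SkipLocalsInit.
        Guid g = default;

        // View the Guid's 16 bytes in place and randomize only bytes 6..15.
        Span<byte> raw = MemoryMarshal.AsBytes(MemoryMarshal.CreateSpan(ref g, 1));
        RandomNumberGenerator.Fill(raw.Slice(6));

        return g;
    }
}
```

If the remaining 6 bytes are always overwritten afterwards (for example by the timestamp), the explicit zeroing is redundant, which is presumably why the question is phrased as "is it ok".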

stephentoub · 2024-09-23T13:41:56Z

> it looks like the api gives us, potentially, less cryptographically secure random for <12 bytes

For better or worse, folks rely on the cryptographic nature of this data with guid. Do we know that this implementation actually impacts the quality of the csprng? That'd be concerning separate from this change.

EgorBo · 2024-09-23T13:50:10Z

> That'd be concerning separate from this change.

That is just my blind guess given that the implementation is fully private. It's just that under 12 bytes the bottle-neck is mach_absolute_time.

Although, it's likely just cached random (generated by the slow path) for quick access.

vcsjones · 2024-09-23T15:02:36Z

Apple's CoreCrypto is "open" source. The reason for the difference at 12 is a line like this:

bool bypass_cache = rand_nbytes >= CCRNG_FIPS_REQUEST_SIZE_THRESHOLD;

Where CCRNG_FIPS_REQUEST_SIZE_THRESHOLD is 12. We are still using a cryptographically secure random number generator on Apple (CCRandomGenerateBytes). Apple is just required to turn off some of the caching it is allowed to do once the requested amount of data reaches 96 bits (12 bytes). Since a full v4 GUID request is larger than 96 bits, it does not use the cache.

I don't think the < 12 is any "less" random. It is just required to be fresh for NIST compliance reasons.
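The effect of such a threshold can be illustrated with a toy model. Everything here is an assumption made for illustration: the class name, the 256-byte cache size, and the refill policy (refill whenever the cache cannot satisfy the request) are not taken from CoreCrypto. The only thing mirrored from the quoted line is that requests of 12 bytes or more bypass the cache.

```csharp
using System;
using System.Security.Cryptography;

// Toy model only; this is not how Apple's CoreCrypto is implemented.
sealed class ToyCachedRng
{
    // Mirrors the quoted CCRNG_FIPS_REQUEST_SIZE_THRESHOLD value of 12.
    private const int BypassThreshold = 12;

    // Assumed cache size, chosen for illustration.
    private readonly byte[] _cache = new byte[256];
    private int _remaining;

    // Number of times the (expensive) underlying generator was invoked.
    public int GeneratorCalls { get; private set; }

    public void Fill(Span<byte> destination)
    {
        if (destination.Length >= BypassThreshold)
        {
            // Large request: always go to the generator directly.
            GeneratorCalls++;
            RandomNumberGenerator.Fill(destination);
            return;
        }

        if (_remaining < destination.Length)
        {
            // Cache cannot satisfy the request: refill it in one generator call.
            GeneratorCalls++;
            RandomNumberGenerator.Fill(_cache);
            _remaining = _cache.Length;
        }

        // Small request: serve the next unused bytes from the cache.
        _cache.AsSpan(_cache.Length - _remaining, destination.Length).CopyTo(destination);
        _remaining -= destination.Length;
    }
}
```

In this model, 1000 requests of 10 bytes trigger 40 generator calls (one per 25 requests), while 1000 requests of 16 bytes trigger 1000. The bytes returned are CSPRNG output in both cases; only how often the generator runs differs.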

vcsjones · 2024-09-23T15:13:19Z

You can see the same performance difference in RandomNumberGenerator.Fill, our public API for generating cryptographically secure random bytes, which also uses CCRandomGenerateBytes. Adapting @EgorBo's benchmark:

using System; // Span<T>; System.Reflection is not needed in this variant
using System.Runtime.CompilerServices;
using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;
using System.Security.Cryptography;

BenchmarkRunner.Run<Bench>(args: args);

public class Bench
{
    [Benchmark]
    public void GetRandom16()
    {
        Span<byte> data = stackalloc byte[16];
        RandomNumberGenerator.Fill(data);
        Consume(data);
    }

    [Benchmark]
    public void GetRandom10()
    {
        Span<byte> data = stackalloc byte[10];
        RandomNumberGenerator.Fill(data);
        Consume(data);
    }

    [MethodImpl(MethodImplOptions.NoInlining)]
    static void Consume(Span<byte> _){}
}

Gives

| Method      | Mean      | Error    | StdDev   |
|------------ |----------:|---------:|---------:|
| GetRandom16 | 240.83 ns | 0.179 ns | 0.158 ns |
| GetRandom10 |  74.72 ns | 0.501 ns | 0.469 ns |

So I do not believe the change here makes anything meaningfully worse. You just get a fast path now.

tannergooding · 2024-09-23T15:19:34Z

I personally don't think this is worth the extra complexity.

The perf "gain" here is up to 200ns on the latest Apple devices, which is highly unlikely to be a bottleneck in your application. Even if you're generating billions of these UUIDs (in which case there are better ways to minimize the overhead anyway), the overhead of the network requests to serialize to the database, write these GUIDs to disk, or anything similar will cause this minor overhead to be lost as noise in comparison.

Additionally, it will be Apple-specific and will not solve such an "issue" on Linux, where it's taking 350-470ns for the same operation to complete on various hardware (nearly 2x slower than the GetRandom16 implementation currently is on M2).

jkotas · 2024-09-23T15:28:53Z

- The random bytes cache is 256 bytes. I would expect the Guid generation perf to be bi-modal with this change (every 25th Guid taking significantly longer to generate, since 256 / 10 ≈ 25). The benchmark results do not seem to capture this. Is BenchmarkDotNet rejecting the slow calls as outliers?
- The cache is process-global and protected by a process-global lock. I would expect this change to make the Guid generation less scalable. It should be possible to validate this by running the Guid generation in parallel on multiple cores. (Unlikely to be a problem in real apps.)
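The parallel check suggested above could be sketched roughly as follows. The names are hypothetical, the sketch uses the public `RandomNumberGenerator.Fill` instead of `Guid.CreateVersion7()`, and wall-clock throughput from a hand-rolled loop is only a coarse signal compared with a proper benchmark.

```csharp
using System;
using System.Diagnostics;
using System.Security.Cryptography;
using System.Threading;

// Hypothetical helper for illustration; not part of the PR.
static class RngScalingSketch
{
    // Total Fill calls per second across all threads for the given buffer size.
    public static double CallsPerSecond(int threadCount, int bufferSize, int callsPerThread)
    {
        var threads = new Thread[threadCount];
        for (int i = 0; i < threadCount; i++)
        {
            threads[i] = new Thread(() =>
            {
                // Each thread uses its own buffer to avoid sharing memory.
                byte[] buffer = new byte[bufferSize];
                for (int n = 0; n < callsPerThread; n++)
                    RandomNumberGenerator.Fill(buffer);
            });
        }

        var stopwatch = Stopwatch.StartNew();
        foreach (Thread t in threads) t.Start();
        foreach (Thread t in threads) t.Join();
        stopwatch.Stop();

        return (double)threadCount * callsPerThread / stopwatch.Elapsed.TotalSeconds;
    }
}
```

Comparing `CallsPerSecond(1, 10, n)` with `CallsPerSecond(Environment.ProcessorCount, 10, n)`, and repeating with a 16-byte buffer, would show whether the small-request path scales worse across cores than the large-request path.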

vcsjones · 2024-09-23T16:00:31Z

> The random bytes cache is 256 bytes. I would expect the Guid generation perf to be bi-modal with this change (every 25th Guid to take significantly longer to generate).

Yes, I can observe that.

> The cache is process global and protected by process global lock.

I see a lock is taken regardless of the cache being used or not (CCRNG_CRYPTO_LOCK_LOCK). I do not think the caching behavior affects locks.

vcsjones · 2024-09-23T16:02:37Z

On top of locks, caches, and reseeding, a CSPRNG's behavior is rarely "predictable" when you are talking about nanoseconds. For better or for worse, since Guid uses a CSPRNG for the random bits, the behavior will always have outliers.

bartonjs · 2024-09-23T23:11:57Z

Based on @vcsjones's notes, I don't have a principled disagreement with the change... but I agree with @tannergooding's "I personally don't think this is worth the extra complexity."

Sure, this change, and 5000 others like it, could add up to saving 1ms on some operation somewhere; but as someone who is always slogging through OS-specific partials and the like, I'd generally happily sacrifice 200ns in a non-bottleneck function to avoid OS-specific code and/or a #if.

yaakov-h · 2024-09-26T09:35:24Z

Very interesting outcome there. I did not expect this to come down to an internal switch in Apple's clopen-source RNG.

I completely understand the complexity issue. Thanks all!

Improve Guid v7 performance on Unix

51e1841

dotnet-issue-labeler bot added the area-System.Runtime label Aug 16, 2024

dotnet-policy-service bot added the community-contribution label Aug 16, 2024

karakasa reviewed Aug 16, 2024

src/libraries/System.Private.CoreLib/src/System/Guid.Unix.cs

Use constant for unix_ts_ms size

f6ee201

yaakov-h mentioned this pull request Aug 17, 2024

Guid.CreateVersion7() could be faster #106377

Open

Merge branch 'main' into faster-uuidv7

f4e0adc

yaakov-h marked this pull request as ready for review August 21, 2024 00:35

jeffhandley marked this pull request as draft August 23, 2024 02:06

dotnet-policy-service bot closed this Sep 22, 2024

jkotas reopened this Sep 22, 2024

EgorBot mentioned this pull request Sep 22, 2024

EgorBot for jkotas in #106525 EgorBot/runtime-utils#93

Open

build-analysis bot mentioned this pull request Sep 22, 2024

GetAsync_ServerNeedsAuthAndNoCredential_StatusCodeUnauthorized got cancelled #108019

Open

EgorBo reviewed Sep 23, 2024

yaakov-h closed this Sep 26, 2024
