Improve Guid v7 performance on Unix #106525

Closed
wants to merge 3 commits into from


yaakov-h
Member

Draft PR for discussion.

Seemingly, just skipping 6 of the 16 bytes of random generation is enough to speed this up by almost 3x, and I really have no idea why.

Benchmarks summary:

// * Summary *

BenchmarkDotNet v0.13.12, macOS Sonoma 14.6 (23G80) [Darwin 23.6.0]
Apple M2 Pro, 1 CPU, 10 logical and 10 physical cores
.NET SDK 9.0.100-preview.7.24407.12
  [Host]     : .NET 9.0.0 (9.0.24.40507), Arm64 RyuJIT AdvSIMD
  Job-MJUZQJ : .NET 9.0.0 (42.42.42.42424), Arm64 RyuJIT AdvSIMD
  .NET 9.0   : .NET 9.0.0 (9.0.24.40507), Arm64 RyuJIT AdvSIMD

Runtime=.NET 9.0  

| Method  | Job        | Toolchain | Mean      | Error    | StdDev   | Ratio | Allocated | Alloc Ratio |
|-------- |----------- |---------- |----------:|---------:|---------:|------:|----------:|------------:|
| runtime | Job-MJUZQJ | CoreRun   |  95.65 ns | 0.773 ns | 0.603 ns |  1.00 |         - |          NA |
|         |            |           |           |          |          |       |           |             |
| runtime | .NET 9.0   | Default   | 261.56 ns | 1.714 ns | 1.519 ns |  1.00 |         - |          NA |

Where Job-MJUZQJ is this PR and .NET 9.0 is the public Preview 7 bits.

Fixes #106377.

@yaakov-h yaakov-h marked this pull request as ready for review August 21, 2024 00:35
@jeffhandley jeffhandley marked this pull request as draft August 23, 2024 02:06
@yaakov-h
Member Author

@jeffhandley what would it take to reopen this and get perf improved for this new API on non-Windows targets?

@jkotas jkotas reopened this Sep 22, 2024
@jkotas
Member

jkotas commented Sep 22, 2024

@EgorBot -intel -arm64 -perf

using System;
using System.Runtime.CompilerServices;
using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;

BenchmarkRunner.Run<Bench>(args: args);

public class Bench
{
    [Benchmark]
    public Guid Foo() => Guid.CreateVersion7();
}

@jkotas
Member

jkotas commented Sep 22, 2024

The performance improvement on Linux is 1-2%: EgorBot/runtime-utils#93 (comment). It is the kind of improvement that I would expect from this change. (Unfortunately, we do not have a quick automated way to run a micro-benchmark on macOS, which is what you ran it on.)

Could you please share the exact sources for the micro-benchmark that you have used?

@yaakov-h
Member Author

It was this, with the comments removed it looks identical to yours:

using System;
using BenchmarkDotNet;
using BenchmarkDotNet.Attributes;

namespace uuidv7.net
{
    // [SimpleJob(BenchmarkDotNet.Jobs.RuntimeMoniker.Net80)]
    [SimpleJob(BenchmarkDotNet.Jobs.RuntimeMoniker.Net90)]
    [MemoryDiagnoser]
    public class Benchmarks
    {
        // [Benchmark(Baseline = true)]
        // public Guid Original() => new Guid(UUIDv7_v1.Generate());

        // [Benchmark]
        // public Guid vcsjones() => new UUIDv7_v2().AsGuid();

        // [Benchmark]
        // public Guid yaakov() => new UUIDv7_v3().AsGuid();

        // [Benchmark]
        // public Guid yaakov_with_vcsjones_improved_fill() => new UUIDv7_v4().AsGuid();

        [Benchmark(Baseline = true)]
        public Guid runtime() => Guid.CreateVersion7();

        // [Benchmark]
        // public Guid faster() => new UUIDv7_v5().AsGuid();

        // [Benchmark]
        // public Guid faster_localsinit() => new UUIDv7_v6().AsGuid();
    }
}

Program.Main is just:

var config = DefaultConfig.Instance;
var summary = BenchmarkRunner.Run<Benchmarks>(config, args);

Command used to run the benchmark:

dotnet run -c release -- --coreRun "/Users/yaakov/Developer/GitHub/dotnet/runtime/artifacts/bin/testhost/net9.0-osx-Release-arm64/shared/Microsoft.NETCore.App/9.0.0/corerun"

Output as of today with RC1:

// * Summary *

BenchmarkDotNet v0.13.12, macOS 15.0 (24A335) [Darwin 24.0.0]
Apple M2 Pro, 1 CPU, 10 logical and 10 physical cores
.NET SDK 9.0.100-rc.1.24452.12
  [Host]     : .NET 9.0.0 (9.0.24.43107), Arm64 RyuJIT AdvSIMD
  Job-GWPJNN : .NET 9.0.0 (42.42.42.42424), Arm64 RyuJIT AdvSIMD
  .NET 9.0   : .NET 9.0.0 (9.0.24.43107), Arm64 RyuJIT AdvSIMD

Runtime=.NET 9.0  

| Method  | Job        | Toolchain | Mean      | Error    | StdDev   | Ratio | Allocated | Alloc Ratio |
|-------- |----------- |---------- |----------:|---------:|---------:|------:|----------:|------------:|
| runtime | Job-GWPJNN | CoreRun   |  93.98 ns | 0.766 ns | 0.717 ns |  1.00 |         - |          NA |
|         |            |           |           |          |          |       |           |             |
| runtime | .NET 9.0   | Default   | 252.00 ns | 1.396 ns | 1.238 ns |  1.00 |         - |          NA |

@EgorBo
Member

EgorBo commented Sep 23, 2024

> It was this, with the comments removed it looks identical to yours:

My understanding is that your PR does two things:

  1. Manually inlines the NewGuid method (hence the 1-2% improvement reported by the bot on Linux); this can be seen in the flamegraph
  2. Reads 10 instead of 16 bytes from Interop.GetRandomBytes

It is really hard to imagine it being 3x faster unless macOS has some special handling of smaller request sizes for urandom.

@yaakov-h
Member Author

huh yeah, that is interesting.

I just retested on Asahi Fedora (arm64) and it's only about a 3.5% improvement there. Let me check macOS x64...

@EgorBo
Member

EgorBo commented Sep 23, 2024

> Command used to run the benchmark:

That's not how we test changes. Typically, we build one corerun for the baseline (main) and another for the changes, and then the benchmark is run as --corerun /path/to/base/corerun /path/to/diff/corerun

@EgorBo
Member

EgorBo commented Sep 23, 2024

Interesting, I can actually reproduce the same 3-5x difference on my MacBook Pro M2 Max 😕

Self-contained benchmark:

using System.Reflection;
using System.Runtime.CompilerServices;
using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;

BenchmarkRunner.Run<Bench>(args: args);

public unsafe class Bench
{
    delegate void GetRandomBytesDelegate(byte* pbBuffer, int count);
    static GetRandomBytesDelegate GetRandomBytes;
    static Bench()
    {
        MethodInfo getRandomBytesMethod = typeof(object).Assembly.GetType("Interop")!.GetMethod("GetRandomBytes",
            BindingFlags.NonPublic | BindingFlags.Static);
        GetRandomBytes = (GetRandomBytesDelegate)getRandomBytesMethod!.CreateDelegate(typeof(GetRandomBytesDelegate));
    }

    [Benchmark]
    public void GetRandom16()
    {
        byte* data = stackalloc byte[16];
        GetRandomBytes(data, 16);
        Consume(data);
    }

    [Benchmark]
    public void GetRandom10()
    {
        byte* data = stackalloc byte[10];
        GetRandomBytes(data, 10);
        Consume(data);
    }

    [MethodImpl(MethodImplOptions.NoInlining)]
    static void Consume(byte* _){}
}
| Method      | Mean      | Error    | StdDev   |
|------------ |----------:|---------:|---------:|
| GetRandom16 | 257.92 ns | 0.831 ns | 0.737 ns |
| GetRandom10 |  51.95 ns | 0.429 ns | 0.380 ns |

@yaakov-h
Member Author

I get a 5x difference on macOS arm64 M2 Pro with that particular microbenchmark. What the... 🫢

@EgorBo
Member

EgorBo commented Sep 23, 2024

I've taken a quick look at it via Xcode Instruments:

Length=10:
[image: Instruments trace]

Length=16:
[image: Instruments trace]

So apparently, length 10 goes through ccrng_schedule_read while length 16 always goes through ccdrbg_generate.
Perhaps for small buffers Apple just returns the time, and larger buffers go through a more secure routine?

Or perhaps it's some sort of queue (cache) of random values, and with length=16 we drain it too fast?

PS: Length >= 12 is where it becomes slow.

PS2: Ah, perhaps Length >= 12 has to be strong random because it's used as a key/IV in AES etc., while anything smaller is not.
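
A rough way to locate the size at which the slowdown starts is to time the public RandomNumberGenerator.Fill API across a range of buffer sizes. The sketch below is only an illustration: RngSizeSweep and its parameters are hypothetical names, it uses a plain Stopwatch loop rather than BenchmarkDotNet, and its numbers will be far noisier than the tables in this thread.

```csharp
using System;
using System.Diagnostics;
using System.Security.Cryptography;

public static class RngSizeSweep
{
    // Hypothetical helper: returns the average nanoseconds per Fill call
    // for every buffer size from minSize to maxSize (inclusive).
    public static double[] Measure(int minSize = 8, int maxSize = 16, int iterations = 200_000)
    {
        var nsPerCall = new double[maxSize - minSize + 1];
        byte[] buffer = new byte[maxSize];

        for (int size = minSize; size <= maxSize; size++)
        {
            Span<byte> slice = buffer.AsSpan(0, size);

            // Short warm-up so JIT and first-call costs are not measured.
            for (int i = 0; i < 1_000; i++)
            {
                RandomNumberGenerator.Fill(slice);
            }

            long start = Stopwatch.GetTimestamp();
            for (int i = 0; i < iterations; i++)
            {
                RandomNumberGenerator.Fill(slice);
            }
            long ticks = Stopwatch.GetTimestamp() - start;

            // Convert Stopwatch ticks to nanoseconds per call.
            nsPerCall[size - minSize] = ticks * 1e9 / Stopwatch.Frequency / iterations;
        }

        return nsPerCall;
    }
}
```

Printing the result of RngSizeSweep.Measure() next to each size should show whether a jump appears between 11 and 12 bytes on a given machine; on platforms without such a threshold the values would be expected to stay roughly flat.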

@yaakov-h
Copy link
Member Author

I had a quick look through Apple Open Source and the not-so-open crypto library source but can't find a branch like that.

It seems to be fairly recent - my Intel Mac with macOS 12 also shows only a small perf gain, like on Linux.

@EgorBo
Member

EgorBo commented Sep 23, 2024

> can't find a branch like that.

I can clearly see the branch (>=12) from the Xcode Instruments.

@stephentoub @jkotas @vcsjones @bartonjs so do we want to accept an Apple-specific improvement like this? It's based on an internal implementation detail (which may change in the future), and it looks like the api gives us, potentially, less cryptographically secure random for <12 bytes, which is probably still important for guid generation?

My personal opinion is that if someone hits a bottleneck in guid generation, they should consider ad hoc solutions such as incremental guids. Since macOS is typically not used on back ends, I presume it's unlikely anyone will notice this improvement.


private static unsafe Guid CreateRandomizedPartialVersion7()
{
Guid g;
Member

Is it OK for UUIDv7 that you use an uninitialized variable here, so the upper 6 bytes will be zero or stack garbage (since CoreLib is compiled with SkipLocalsInit)?
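
For illustration, here is one way to request only 10 random bytes while still writing every byte of the value explicitly, so nothing depends on how the local was initialized. This is a minimal sketch using public APIs, not the PR's CreateRandomizedPartialVersion7: Uuid7Sketch is a hypothetical name, the layout follows the UUIDv7 field order (48-bit millisecond timestamp, version, random bits, variant), and it assumes .NET 8 or later for the Guid(ReadOnlySpan<byte>, bool bigEndian) constructor.

```csharp
using System;
using System.Security.Cryptography;

public static class Uuid7Sketch
{
    // Hypothetical helper: builds a version 7 UUID from a timestamp and
    // 10 freshly generated random bytes. All 16 bytes are written below.
    public static Guid Create(DateTimeOffset timestamp)
    {
        Span<byte> bytes = stackalloc byte[16];
        long ms = timestamp.ToUnixTimeMilliseconds();

        // Bytes 0-5: 48-bit Unix timestamp in milliseconds, big-endian.
        bytes[0] = (byte)(ms >> 40);
        bytes[1] = (byte)(ms >> 32);
        bytes[2] = (byte)(ms >> 24);
        bytes[3] = (byte)(ms >> 16);
        bytes[4] = (byte)(ms >> 8);
        bytes[5] = (byte)ms;

        // Bytes 6-15: ask the CSPRNG for 10 bytes rather than 16.
        RandomNumberGenerator.Fill(bytes.Slice(6));

        // Overwrite the version nibble (7) and the variant bits (10xx xxxx).
        bytes[6] = (byte)((bytes[6] & 0x0F) | 0x70);
        bytes[8] = (byte)((bytes[8] & 0x3F) | 0x80);

        return new Guid(bytes, bigEndian: true);
    }
}
```

Calling Uuid7Sketch.Create(DateTimeOffset.UtcNow) yields a Guid whose first 6 bytes (in big-endian order) are the timestamp; whether this shape is worth an OS-specific code path is exactly what the rest of the thread debates.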

@stephentoub
Member

> it looks like the api gives us, potentially, less cryptographically secure random for <12 bytes

For better or worse, folks rely on the cryptographic nature of this data with guid. Do we know that this implementation actually impacts the quality of the csprng? That'd be concerning separate from this change.

@EgorBo
Member

EgorBo commented Sep 23, 2024

> That'd be concerning separate from this change.

That is just my blind guess, given that the implementation is fully private. It's just that under 12 bytes the bottleneck is mach_absolute_time.

Although it's likely just cached random data (generated by the slow path) kept for quick access.

@vcsjones
Member

vcsjones commented Sep 23, 2024

Apple's CoreCrypto is "open" source. The reason for the difference at 12 is a line like this:

bool bypass_cache = rand_nbytes >= CCRNG_FIPS_REQUEST_SIZE_THRESHOLD;

Where CCRNG_FIPS_REQUEST_SIZE_THRESHOLD is 12. We are still using a cryptographically secure random number generator on Apple (CCRandomGenerateBytes). Apple is just required to turn off some of the caching it is allowed to do once the requested amount of data is at least 96 bits (12 bytes). Since a full v4 GUID is larger than 96 bits, it does not use the cache.

I don't think the output for requests under 12 bytes is any "less" random. Requests of 12 bytes or more are just required to be fresh for NIST compliance reasons.

@vcsjones
Member

You can see the same performance difference in RandomNumberGenerator.Fill, our public API for generating CSPRNG output, which also uses CCRandomGenerateBytes. Adapting @EgorBo's benchmark:

using System.Reflection;
using System.Runtime.CompilerServices;
using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;
using System.Security.Cryptography;

BenchmarkRunner.Run<Bench>(args: args);

public class Bench
{
    [Benchmark]
    public void GetRandom16()
    {
        Span<byte> data = stackalloc byte[16];
        RandomNumberGenerator.Fill(data);
        Consume(data);
    }

    [Benchmark]
    public void GetRandom10()
    {
        Span<byte> data = stackalloc byte[10];
        RandomNumberGenerator.Fill(data);
        Consume(data);
    }

    [MethodImpl(MethodImplOptions.NoInlining)]
    static void Consume(Span<byte> _){}
}

Gives

| Method      | Mean      | Error    | StdDev   |
|------------ |----------:|---------:|---------:|
| GetRandom16 | 240.83 ns | 0.179 ns | 0.158 ns |
| GetRandom10 |  74.72 ns | 0.501 ns | 0.469 ns |

So I do not believe the change here makes anything meaningfully worse. You just get a fast path now.

@tannergooding
Member

I personally don't think this is worth the extra complexity.

The perf "gain" here is up to 200ns on the latest Apple devices, which is highly unlikely to be a bottleneck for your application. Even if you're generating billions of these UUIDs (in which case there are better ways to minimize the overhead anyway), the overhead of the network requests to serialize the database, write these GUIDs to disk, or anything similar will cause this minor overhead to be lost as noise in comparison.

Additionally, it will be Apple-specific and will not solve such an "issue" on Linux, where it's taking 350-470ns for the same operation to complete on various hardware (nearly 2x slower than the GetRandom16 implementation currently is on M2).

@jkotas
Member

jkotas commented Sep 23, 2024

  • The random bytes cache is 256 bytes. I would expect the Guid generation perf to be bi-modal with this change (every 25th Guid to take significantly longer to generate). The benchmark results do not seem to be capturing this. Is BenchmarkDotNet rejecting the slow calls as outliers?

  • The cache is process global and protected by process global lock. I would expect this change to make the Guid generation less scalable. It should be possible to validate this by running the Guid generation in parallel on multiple cores. (Unlikely to be a problem in real apps.)
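
The "every 25th Guid" estimate can be checked with a toy counting model. The sketch below is not Apple's implementation: ToyCachedRng is a hypothetical class that generates no random data and only counts how often a request would reach the slow path, using the 256-byte cache size and the 12-byte bypass threshold mentioned in this thread, and assuming (purely for illustration) that a refill discards any leftover bytes.

```csharp
using System;

public sealed class ToyCachedRng
{
    public const int CacheSize = 256;       // cache size quoted in this thread
    public const int BypassThreshold = 12;  // CCRNG_FIPS_REQUEST_SIZE_THRESHOLD

    private int _available;                 // unread bytes left in the toy cache

    // Number of requests that had to take the slow (generate) path.
    public int SlowPathCalls { get; private set; }

    public void Request(int nbytes)
    {
        if (nbytes >= BypassThreshold)
        {
            // Large requests bypass the cache and always generate fresh data.
            SlowPathCalls++;
            return;
        }

        if (_available < nbytes)
        {
            // Assumption: refill the whole cache, discarding any leftover bytes.
            SlowPathCalls++;
            _available = CacheSize;
        }

        _available -= nbytes;
    }
}
```

Under these assumptions, 1000 requests of 10 bytes reach the slow path 40 times (once per 25 requests), while 1000 requests of 16 bytes reach it 1000 times, which matches the expected bi-modal cost of the 10-byte variant.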

@vcsjones
Member

> The random bytes cache is 256 bytes. I would expect the Guid generation perf to be bi-modal with this change (every 25th Guid to take significantly longer to generate).

Yes, I can observe that.

> The cache is process global and protected by process global lock.

I see a lock is taken regardless of the cache being used or not (CCRNG_CRYPTO_LOCK_LOCK). I do not think the caching behavior affects locks.
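
One way to probe the scalability question is to compare aggregate throughput with one worker against throughput with several workers, for both request sizes. The following is a rough sketch only: RngScalingProbe and its parameters are hypothetical names, it goes through the public RandomNumberGenerator.Fill rather than Guid generation itself, and Parallel.For does not guarantee that each worker runs on its own core.

```csharp
using System;
using System.Diagnostics;
using System.Security.Cryptography;
using System.Threading.Tasks;

public static class RngScalingProbe
{
    // Hypothetical helper: returns aggregate Fill calls per second when
    // `threads` workers each fill a buffer of `size` bytes repeatedly.
    public static double Throughput(int size, int threads, int callsPerThread = 200_000)
    {
        var options = new ParallelOptions { MaxDegreeOfParallelism = threads };
        var stopwatch = Stopwatch.StartNew();

        Parallel.For(0, threads, options, _ =>
        {
            // Each worker uses its own buffer so only the RNG itself is shared.
            byte[] buffer = new byte[size];
            for (int i = 0; i < callsPerThread; i++)
            {
                RandomNumberGenerator.Fill(buffer);
            }
        });

        stopwatch.Stop();
        return (double)threads * callsPerThread / stopwatch.Elapsed.TotalSeconds;
    }
}
```

Comparing Throughput(10, 1) with Throughput(10, Environment.ProcessorCount), and doing the same for size 16, would indicate whether either path stops scaling as workers are added; a lock taken on both paths should show up as similar flattening in both cases.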

@vcsjones
Member

vcsjones commented Sep 23, 2024

On top of locks, caches, and reseeding, a CSPRNG's behavior is rarely "predictable" when you are talking about nanoseconds. For better or for worse, since Guid uses a CSPRNG for the random bits, the behavior will always have outliers.

@bartonjs
Member

Based on @vcsjones's notes, I don't have a principled disagreement with the change... but I agree with @tannergooding's "I personally don't think this is worth the extra complexity."

Sure, this change, and 5000 others like it, could add up to saving 1ms on some operation somewhere; but as someone who is always slogging through OS-specific partials and the like, I'd generally happily sacrifice 200ns in a non-bottleneck function to avoid OS-specific code and/or a #if.

@yaakov-h
Member Author

Very interesting outcome there. I did not expect this to come down to an internal switch in Apple's clopen-source RNG.

I completely understand the complexity issue. Thanks all!

@yaakov-h yaakov-h closed this Sep 26, 2024
Labels
area-System.Runtime community-contribution Indicates that the PR has been added by a community member
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Guid.CreateVersion7() could be faster
8 participants