[API Proposal]: Simple one-shot hashing methods that produce the relevant type #76279

stephentoub · 2022-09-28T01:00:29Z

Background and motivation

System.IO.Hashing contains types for Crc32, Crc64, XxHash32, and XxHash64, and XxHash3 is being added. These types all have one-shot static methods for performing hashes over ReadOnlySpan<byte>, but all of these methods either allocate a resulting byte[] or write the result into a Span<byte>. However, the natural data types expecting in many use cases is just an int/uint for 32-bit or long/ulong for 64-bit hashes. We should add helpers for this.

API Proposal

namespace System.IO.Hashing;

public sealed class Crc32 : NonCryptographicHashAlgorithm
{
+   public static int HashToInt32(ReadOnlySpan<byte> source);
+   public int GetCurrentHashAsInt32();
    ...
}

public sealed class XxHash32 : NonCryptographicHashAlgorithm
{
+   public static int HashToInt32(ReadOnlySpan<byte> source, int seed = 0);
+   public int GetCurrentHashAsInt32();
    ...
}

public sealed class Crc64 : NonCryptographicHashAlgorithm
{
+   public static long HashToInt64(ReadOnlySpan<byte> source);
+   public long GetCurrentHashAsInt64();
    ...
}

public sealed class XxHash64 : NonCryptographicHashAlgorithm
{
+   public static long HashToInt64(ReadOnlySpan<byte> source, long seed = 0);
+   public long GetCurrentHashAsInt64();
    ...
}

public sealed class XxHash3 : NonCryptographicHashAlgorithm
{
+   public static long HashToInt64(ReadOnlySpan<byte> source, long seed = 0);
+   public long GetCurrentHashAsInt64();
    ...
}

public sealed class XxHash128 : NonCryptographicHashAlgorithm // assuming https://github.com/dotnet/runtime/issues/77885
{
+   public static Int128 HashToInt128(ReadOnlySpan<byte> source, long seed = 0);
+   public Int128 GetCurrentHashAsInt128();
    ...
}

All of these implementations already compute the relevant int/long and then write it to either a newly-allocated array or the destination span. We'd just return it instead from the new method.
The existing methods write the value to the destination as big-endian. The int/long-returning methods would simply return the value in machine-endianness.
Should we use unsigned types instead? All of the algorithms internally operate in unsigned land.

API Usage

ReadOnlySpan<byte> data = ...;
int hash = XxHash32.HashToInt32(data);

Alternative Designs

No response

Risks

No response

The text was updated successfully, but these errors were encountered:

ghost · 2022-09-28T01:00:40Z

Tagging subscribers to this area: @dotnet/area-system-io
See info in area-owners.md if you want to be subscribed.

Issue Details

Background and motivation

System.IO.Hashing contains types for Crc32, Crc64, XxHash32, and XxHash64, and XxHash3 is being added. These types all have one-shot static methods for performing hashes over ReadOnlySpan<byte>, but all of these methods either allocate a resulting byte[] or write the result into a Span<byte>. However, the natural data types expecting in many use cases is just an int/uint for 32-bit or long/ulong for 64-bit hashes. We should add helpers for this.

API Proposal

namespace System.IO.Hashing;

public sealed class Crc32 : NonCryptographicHashAlgorithm
{
    public static byte[] Hash(ReadOnlySpan<byte> source);
    public static int Hash(ReadOnlySpan<byte> source, Span<byte> destination);
+   public static uint HashToUInt32(ReadOnlySpan<byte> source);
    ...
}

public sealed class XxHash32 : NonCryptographicHashAlgorithm
{
    public static byte[] Hash(ReadOnlySpan<byte> source);
    public static int Hash(ReadOnlySpan<byte> source, Span<byte> destination);
+   public static uint HashToUInt32(ReadOnlySpan<byte> source);
    ...
}

public sealed class Crc64 : NonCryptographicHashAlgorithm
{
    public static byte[] Hash(ReadOnlySpan<byte> source);
    public static int Hash(ReadOnlySpan<byte> source, Span<byte> destination);
+   public static ulong HashToUInt64(ReadOnlySpan<byte> source);
    ...
}

public sealed class XxHash64 : NonCryptographicHashAlgorithm
{
    public static byte[] Hash(ReadOnlySpan<byte> source);
    public static int Hash(ReadOnlySpan<byte> source, Span<byte> destination);
+   public static ulong HashToUInt64(ReadOnlySpan<byte> source);
    ...
}

public sealed class XxHash3 : NonCryptographicHashAlgorithm
{
    public static byte[] Hash(ReadOnlySpan<byte> source);
    public static int Hash(ReadOnlySpan<byte> source, Span<byte> destination);
+   public static ulong HashToUInt64(ReadOnlySpan<byte> source);
    ...
}

All of these implementations already compute the relevant uint/ulong and then write it to either a newly-allocated array or the destination span. We'd just return it instead from the new method.
The existing methods write the value to the destination as big-endian. The uint/ulong-returning methods would simply return the value in machine-endianness.
If we ever added a 128-bit hash, like XxHash128, we'd return UInt128.
We could choose to return int/long instead of uint/ulong, but internally we're operating in terms of unsigned, our BitOperations APIs that hash operate on unsigned (e.g. Expose general purpose Crc32 APIs #2036 (comment)), and looking around the web, unsigned is generally what others appear to use.

API Usage

ReadOnlySpan<byte> data = ...;
uint hash = XxHash32.HashToUInt32(data);

Alternative Designs

No response

Risks

No response

Author:	stephentoub
Assignees:	-
Labels:	`api-suggestion`, `area-System.IO`
Milestone:	8.0.0

LeaFrock · 2022-09-28T05:41:33Z

Hah, that's what I want!

But so sad that I have to wait for .NET 8 😭

ghost · 2022-09-28T16:13:41Z

Tagging subscribers to this area: @dotnet/area-system-security, @vcsjones
See info in area-owners.md if you want to be subscribed.

Issue Details

Background and motivation

System.IO.Hashing contains types for Crc32, Crc64, XxHash32, and XxHash64, and XxHash3 is being added. These types all have one-shot static methods for performing hashes over ReadOnlySpan<byte>, but all of these methods either allocate a resulting byte[] or write the result into a Span<byte>. However, the natural data types expecting in many use cases is just an int/uint for 32-bit or long/ulong for 64-bit hashes. We should add helpers for this.

API Proposal

namespace System.IO.Hashing;

public sealed class Crc32 : NonCryptographicHashAlgorithm
{
    public static byte[] Hash(ReadOnlySpan<byte> source);
    public static int Hash(ReadOnlySpan<byte> source, Span<byte> destination);
+   public static uint HashToUInt32(ReadOnlySpan<byte> source);
    ...
}

public sealed class XxHash32 : NonCryptographicHashAlgorithm
{
    public static byte[] Hash(ReadOnlySpan<byte> source);
    public static int Hash(ReadOnlySpan<byte> source, Span<byte> destination);
+   public static uint HashToUInt32(ReadOnlySpan<byte> source);
    ...
}

public sealed class Crc64 : NonCryptographicHashAlgorithm
{
    public static byte[] Hash(ReadOnlySpan<byte> source);
    public static int Hash(ReadOnlySpan<byte> source, Span<byte> destination);
+   public static ulong HashToUInt64(ReadOnlySpan<byte> source);
    ...
}

public sealed class XxHash64 : NonCryptographicHashAlgorithm
{
    public static byte[] Hash(ReadOnlySpan<byte> source);
    public static int Hash(ReadOnlySpan<byte> source, Span<byte> destination);
+   public static ulong HashToUInt64(ReadOnlySpan<byte> source);
    ...
}

public sealed class XxHash3 : NonCryptographicHashAlgorithm
{
    public static byte[] Hash(ReadOnlySpan<byte> source);
    public static int Hash(ReadOnlySpan<byte> source, Span<byte> destination);
+   public static ulong HashToUInt64(ReadOnlySpan<byte> source);
    ...
}

All of these implementations already compute the relevant uint/ulong and then write it to either a newly-allocated array or the destination span. We'd just return it instead from the new method.
The existing methods write the value to the destination as big-endian. The uint/ulong-returning methods would simply return the value in machine-endianness.
If we ever added a 128-bit hash, like XxHash128, we'd return UInt128.
We could choose to return int/long instead of uint/ulong, but internally we're operating in terms of unsigned, our BitOperations APIs that hash operate on unsigned (e.g. Expose general purpose Crc32 APIs #2036 (comment)), and looking around the web, unsigned is generally what others appear to use.

API Usage

ReadOnlySpan<byte> data = ...;
uint hash = XxHash32.HashToUInt32(data);

Alternative Designs

No response

Risks

No response

Author:	stephentoub
Assignees:	-
Labels:	`api-suggestion`, `area-System.Security`
Milestone:	8.0.0

LeaFrock · 2022-10-05T16:49:58Z

I have a very small question. Let's take XxHash64 for example: why is the API long HashToInt64 rather than ulong HashToUInt64?

The existing methods write the value to the destination as big-endian. The int/long-returning methods would simply return the value in machine-endianness.

Is it just a convention? I have this question because I see many underlying calculations use unsigned number in fact, such as String.GetHashCode(),

Marvin.ComputeHash32(ref Unsafe.As<char, byte>(ref _firstChar), (uint)_stringLength * 2 /* in bytes, not chars */, (uint)seed, (uint)(seed >> 32));

bartonjs · 2022-10-05T17:47:22Z

@stephentoub I've removed ready-for-review because I want @GrabYourPitchforks to acknowledge the existence of the proposal (and perhaps even express opinions) before it comes up in a review meeting.

stephentoub · 2022-10-05T17:51:39Z

Ok, but I think it's important we add this. Literally every hashing API I've found that others expose just surface the uint/int/ulong/long. That our APIs force you to either allocate an array or stackalloc some space into which to store the result, figure out that it's been written as a big-endian value, and then convert it back to being a number is a pit of failure. I've seen zero use cases that actually want a byte[].

LeaFrock · 2022-10-06T08:08:12Z

Ok, but I think it's important we add this. Literally every hashing API I've found that others expose just surface the uint/int/ulong/long. That our APIs force you to either allocate an array or stackalloc some space into which to store the result, figure out that it's been written as a big-endian value, and then convert it back to being a number is a pit of failure. I've seen zero use cases that actually want a byte[].

Absolutely agree.

Should we use unsigned types instead? All of the algorithms internally operate in unsigned land.

My question above is just related to this.

GrabYourPitchforks · 2022-10-10T17:52:25Z

The only strong opinion I have is that the endianness should be explicit: include it as part of the API name, or force the caller to include it as an enum parameter. All of these digest algorithms are intended to generate stable output across machines. It's an API pit of failure if we make the "simple" API lead developers unwittingly down a path which violates this fundamental property.

stephentoub · 2022-10-10T18:00:03Z

The only strong opinion I have is that the endianness should be explicit: include it as part of the API name, or force the caller to include it as an enum parameter.

I can include that if we really believe it's important, but we've already failed at that in this library. None of the static {Try}Hash methods state the order of the bytes that were written out, and the derived types differ as to whether they use little or big endian to do so, both for the statics and for the instance members.

stephentoub · 2022-10-10T18:05:18Z

All of these digest algorithms are intended to generate stable output across machines.

From my perspective, using machine endianness for the hash code in these Int32/Int64-returning methods does that implicitly. The proposed API returns an Int32 and you get back e.g. 12345 as that Int32 regardless of whether the machine is little or big endian. Forcing the method into a specific endianness is what violates this, as it means you get 12345 on some machines and 959447040 on others.

GrabYourPitchforks · 2022-10-10T18:10:26Z

None of the static Hash methods state the order of the bytes that were written out, and the derived types differ as to whether they use little or big endian to do so, both for the statics and for the instance members.

The output of any digest algorithm is raw bytes. There's no concept of "endianness" unless you attempt to interpret those bytes as an integral value. For instance, we don't say that the digest output of SHA256 is big-endian, little-endian, or something else. It's just raw bytes. You're proposing here an API which performs its own opinionated interpretation of the digest output.

Looking through the implementations of all the [NonCryptographic]HashAlgorithm-derived classes, I don't immediately see anything which is ambiguous with regard to endianness. All the implementations I'm seeing follow the "everything is raw bytes" pattern. What am I missing?

stephentoub · 2022-10-10T18:22:27Z

The output of any digest algorithm is raw bytes

Every other example I can find of such hash functions, both in native libraries we consume (e.g. zlib), in exemplar implementations on which ours are based (e.g. xxhash), in other frameworks, in other .NET implementations of these same hash functions, and even elsewhere in .NET (e.g. BitOperations.Crc32C) operate in terms of int/long, not in terms of bytes. Even the internal workhorse functions of all of these algorithms are operating on numbers, and then we take a final step of serializing those numbers into bytes.

What am I missing?

Almost all of the examples I've found are folks who want the values as integers, and thus are having to reconstruct that from the bytes we spent energy producing, e.g.
https://github.com/dotnet/orleans/blob/cf4423ea4d75b4ab8ecf071968a2a4bfd463646d/src/Orleans.Serialization/TypeSystem/TypeCodec.cs#L210
https://github.com/LeaFrock/BloomFilterSharp/blob/662b0475b3acab4d659a1b298dba4423b53399c3/src/BloomFilterSharp.Hashing/XxHash64Function.cs#L11-L12
https://github.com/avafloww/Thaliak/blob/01eb1e0668b399a70c580f2be1c1480a9699ccda/Thaliak.Common.Database/Models/XivRepository.cs#L26-L28
and in some cases doing so in a way that doesn't pay attention to endianness and thus will violate the "same output" concern because we focused on outputting bytes.

For instance, we don't say that the digest output of SHA256 is big-endian, little-endian, or something else

We have these docs for System.IO.Hashing.Crc32, for example:
https://learn.microsoft.com/en-us/dotnet/api/system.io.hashing.crc32?view=dotnet-plat-ext-7.0
"This implementation emits the answer in the Little Endian byte order so that the CRC residue relationship (CRC(message concat CRC(message)) is a fixed value) holds. For CRC-32, this stable output is the byte sequence { 0x1C, 0xDF, 0x44, 0x21 }, the Little Endian representation of 0x2144DF1C."

GrabYourPitchforks · 2022-10-10T18:34:34Z

Even the internal workhorse functions of all of these algorithms are operating on numbers, and then we take a final step of serializing those numbers into bytes.

The public API surface shouldn't be concerned with internal implementation details. The API surface should only ever be concerned with the caller-observable output.

We have these docs for System.IO.Hashing.Crc32, for example

Because the CRC32 spec is deeply weird and isn't written as a typical digest algorithm spec, so callers need to know "this is the version of CRC32 you're calling." The paragraph immediately under the one you quote in the docs even says as such. It's unfortunately not a good model to base a usable API off of. :(

To the samples that you gave, I think they actually provide a strong argument for why any opinionated APIs we add should explicitly state the endianness.

Orleans: They want the output in little endian.
BloomFilterSharp: Their call site appears to be buggy, since they're serializing / persisting a machine-endian value.
Thaliak: They want the output in big endian.

stephentoub · 2022-10-10T18:37:19Z

Orleans: They want the output in little endian.

No, they want a number that's going to be the same regardless of architecture. Which is exactly what the proposed API will give them.

BloomFilterSharp: Their call site appears to be buggy, since they're serializing / persisting a machine-endian value.

That's my point: because we've forced them to go through bytes, we've forced them to consider endianness, and they're not... we've created a pit of failure. An API that returns machine endianness for the number will always produce the same hash value for the same input bytes, regardless of current machine endianness.

stephentoub · 2022-10-10T18:40:34Z

The API surface should only ever be concerned with the caller-observable output.

XxHash3.HashToInt32(new[] { 1, 2, 3, 4, 5, 6 }) will always return the same numeric ulong value 0xACCC1C819D65620C, regardless of architecture. I don't understand why endianness is even an issue in the conversation here. Seems to me it'd be like saying Int32.Parse should have been Int32.ParseLittleEndian such that ParseLittleEndian("12345") would return the Int32 value 959447040 on a big-endian machine.

LeaFrock · 2022-10-11T03:16:24Z

BloomFilterSharp: Their call site appears to be buggy, since they're serializing / persisting a machine-endian value

I never consider about it because I'm just developing a filter using local memory, so the machine endianness isn't a problem. But you're right as the hash function could be called on different machines in a distributed scenario.

Now the key point of divergence is whether the top-level developers want a number that's going to be the same regardless of architecture. Both of your opinions sound reasonable depending on the concrete scenarios.

An API that returns machine endianness for the number will always produce the same hash value for the same input bytes, regardless of current machine endianness.

I support it for most requirements. But considering for a situation that people want to appoint the concrete endian for compatible with interactions between existing systems (which may be based on other languages) or machines, it'd be better to retain some flexibility. (You may argue that people should use byte[] API but you can't guarantee that)

I think this is more appropriate for the analogy of System.StringComparison. Could it be possible that we add a similar enum for the proposed API?

public static long HashToInt64(ReadOnlySpan<byte> source, long seed = 0, EndianOption endianness = default);

public enum EndianOption // don’t care about the name
{
       Default = 0, // As stephentoub says, always return the same value regardless of machine
       Current, // Obey the current machine endian
       Big,
       Little,
}

Or even keep a HashToOptions for potential extensibility,

public static long HashToInt64(ReadOnlySpan<byte> source, long seed = 0, HashToOptions? option = null);

public struct HashToOptions
{
      public EndianOption EndianOption { get;set }
}

stephentoub · 2022-10-11T03:19:25Z

Default = 0, // As stephentoub says, always return the same value regardless of machine
Current, // Obey the current machine endian

What is the difference between these two?

LeaFrock · 2022-10-11T03:27:32Z

Well, I'm not sure if my understanding is right. Suppose that machine A uses BE as default, when machine B uses LE.

Default will obey the rule of API itself which ignores all the differences between machines. A and B always produce the same outputs.
Current obey the rule of machine system so that the outputs are different from A and B.

stephentoub · 2022-10-11T03:44:34Z

Current obey the rule of machine system so that the outputs are different from A and B.

As long as an algorithm isn't reinterpreting the raw bytes that make up an integer, endianness has no effect on the math involved, e.g. 2*3 is 6 regardless of whether you're on a little-endian or big-endian machine, 42 >> 3 is 5 regardless of whether you're on a little-endian or big-endian machine, (byte)0x12345678 is 0x78 regardless of whether you're on a little-endian or big-endian machine, etc., but int i = 0x12345678; *(byte*)&i will differ, as it's bypassing the normal language rules for how to interpret the bytes and instead accessing a specific memory location directly, and thus will be 0x78 on a little-endian machine and 0x12 on a big-endian machine.

All of these algorithms just operate on the numbers, not the raw bytes, or if they do operate on the raw bytes, the algorithms themselves account for endianness by dictating the ordering in which the bytes need to be interpreted. For example, the Crc32 algorithm's main loop is just:

runtime/src/libraries/System.IO.Hashing/src/System/IO/Hashing/Crc32.cs

Lines 160 to 165 in 6893524

    
           for (int i = 0; i < source.Length; i++) 
        
           { 
        
               byte idx = (byte)crc; 
        
               idx ^= source[i]; 
        
               crc = s_crcLookup[idx] ^ (crc >> 8); 
        
           }

There's nothing here endianness-specific: on both little-endian and big-endian machines, it'll produce the exact same uint value (the underlying memory will store the bytes that make up that uint differently, but the uint will have the same value). Or in XXH64, for example, if bytes need to be read and re-interpreted as an integer, it's done so with a fixed endianness regardless of the machine's architecture:

runtime/src/libraries/System.IO.Hashing/src/System/IO/Hashing/XxHash64.State.cs

Line 81 in 6893524

return ApplyRound(acc, BinaryPrimitives.ReadUInt64LittleEndian(lane));

and that's dictated by the algorithm, e.g. https://github.com/Cyan4973/xxHash/blob/f9155bd4c57e2270a4ffbb176485e5d713de1c9b/doc/xxhash_spec.md states "However, a given variant shall produce exactly the same output, irrespective of the cpu / os used. In particular, the result remains identical whatever the endianness and width of the cpu is."

In other words, regardless of whatever additional APIs .NET wants to layer on top, these algorithms produce a number and that number is computed to have a value that's the same regardless of endianness, e.g.
https://github.com/Cyan4973/xxHash/blob/43ea6fdf838782688716f6ad85792a39e7e38e9a/xxhash.h#L161
producing a uint64_t. The only reason endianness comes into play in the discussions about these NonCryptographicHashAlgorithm APIs is because we exposed APIs that take the resulting number and export that number as the bytes that make it up, as the {Try}Hash methods all only return the data as byte[] or written into a Span<byte>, in which case we have to choose an ordering for those bytes. But when talking about the actual numerical result of the algorithm, endianness shouldn't be a factor, as the results are just numbers. Which is why I don't think anything to do with endianness should be a part of the APIs being proposed here that are just returning int/long. Doing so introduces complexity and problems rather than solving them.

LeaFrock · 2022-10-11T04:15:37Z

As long as an algorithm isn't reinterpreting the raw bytes that make up an integer, endianness has no effect on the math involved
...
All of these algorithms just operate on the numbers, not the raw bytes, or if they do operate on the raw bytes, the algorithms themselves account for endianness by dictating the ordering in which the bytes need to be interpreted

Got it. Thanks for your explanation and patience 😄!

The last worry I have is that if a library has already used (Try)Hash and BitConverter with byte[] and appointed endianness to get a int/long value, it would meet potential break changes after switching to the new API. But it's not a big problem as more detailed docs will help that.

Almost all of the examples I've found are folks who want the values as integers, and thus are having to reconstruct that from the bytes we spent energy producing

What I regret is that the proposed APIs and related discussions did not occur before .NET 6 (adding System.IO.Hashing).

GrabYourPitchforks · 2022-10-12T01:10:19Z

I think we're all in agreement on the basic premise: we should provide a nice API for getting integral hash codes. This is immediately useful for lots of people.

The things we disagree on: how should we translate the output into an integer, and should the mechanism that we use for this dictate the chosen name? But after talking with Steve and Jeremy more offline I think we're actually close to agreement here as well. :)

A quick note on terminology. These are all digest algorithms (a.k.a. hash algorithms). Digest algorithms are generally defined as "a variable number of bytes in" and "a fixed-length sequence of bytes out." This means that for [NonCryptographic]HashAlgorithm-derived classes we want the resulting digest bytes to be the source of truth. Always. Regardless of the underlying implementation.

A digest algorithm might choose to use integral types for its intermediate or terminal state, before that state is converted to a fixed-length byte sequence. But from the start we have made a very deliberate decision across all [NonCryptographic]HashAlgorithm-derived types to treat them as digest algorithms and to expose digest-style APIs as the primary interface for these types. It's right there in the name of the base class!

So if we expose GetHashCodeAsXxx APIs on these types, it really should follow two constraints.

The output should be stable across architectures. The digest itself is by-design stable across architectures, so anything derived from it should be similarly stable. All three examples posted at [API Proposal]: Simple one-shot hashing methods that produce the relevant type #76279 (comment) persist the hash code to storage, so we have evidence that people already have this expectation.
The GetHashCodeAsXxx method should be documented in terms of the source-of-truth digest bytes. For example, the doc page for XxHash32.GetHashCodeAsInt32 should say something like this:
The return value of XxHash32.GetHashCodeAsInt32 is equivalent to having written the following:
```
byte[] bytes = new byte[sizeof(int)];
XxHash32.HashData(/* ... */, bytes);
return BinaryPrimitives.ReadInt32BigEndian(bytes);
```

That second part is where the "endianness" concern comes into play. Since the source of truth is the digest bytes (it's a digest algorithm, after all!), we should document the exact transform we're using to go from one representation to the other.

Of course, nothing requires us to have the GetHashCodeAsXxx method actually do the conversion from bytes -> integer. If we already have the integer as an intermediate representation from the internal digest state, great! Return it. But if we change the algorithm implementation in the future to use intrinsics, or to use novel mathematical constructs, then we might no longer have that intermediate representation readily available to us and may need to fall back to a standard bytes -> integer approach.

When Jeremy and I spoke on Monday, he mentioned that there was no need for different algorithms to use identical conversions to go from source of truth -> friendly representation. Some could use little, some could use big, some could do their own weird byte shuffle. The only important aspect is that it's documented. This makes me a bit uneasy because it means that consumers need to consult the docs for each subtype they might be working with, but I'm not willing to die on this hill as long as we have docs.

jods4 · 2022-11-02T21:48:59Z

For digest algorithms (a.k.a. hash algorithms) [...] we want the resulting digest bytes to be the source of truth. Always.
So if we expose GetHashCodeAsXxx APIs on these types, it really should follow two constraints.

Amusingly, one of the most fundamental method of .net is Object.GetHashCode() and it returns an int ;)

It has the same name as GetHashCodeAsXxx and it fits the terminology of "a variable number of bytes in" and "a fixed-length sequence of bytes out.", especially in cases like string or even byte[]!

Maybe it's a huge mistake in framework design, but I have a hunch that if we designed .net from scratch today we wouldn't make it byte[] Object.GetHashCode().

GrabYourPitchforks · 2022-11-05T17:07:36Z

@jods4 - object.GetHashCode is very specifically intended for usage within a dictionary and other keyed hash collections in-proc. It is not meant to be persisted. It does not need to be stable across architectures, or even across back-to-back executions of the app on the same machine! Its underlying algorithm is strictly an implementation detail and callers cannot rely on it to be a member of any specific algorithm family.

@stephentoub - since Int128 is net7+ only, do we need to condition the friendly one-shot helper on Xxhash128 to be exposed only in net7+?

stephentoub · 2022-11-06T21:37:00Z

since Int128 is net7+ only, do we need to condition the friendly one-shot helper on Xxhash128 to be exposed only in net7+?

Yes, it'll be .NET 7 only.

terrajobst · 2022-11-08T00:49:37Z

That our APIs force you to either allocate an array or stackalloc some space into which to store the result, figure out that it's been written as a big-endian value, and then convert it back to being a number is a pit of failure.

LOL. I literally just wrote that code yesterday.

bartonjs · 2022-11-08T20:00:31Z

Video

Note: .NET Standard won't have the UInt128 methods as the type doesn't exist for .NET Standard today

namespace System.IO.Hashing;

public partial sealed class Crc32 : NonCryptographicHashAlgorithm
{
    public static uint HashToUInt32(ReadOnlySpan<byte> source);
    public uint GetCurrentHashAsUInt32();
}

public partial sealed class XxHash32 : NonCryptographicHashAlgorithm
{
    public static uint HashToUInt32(ReadOnlySpan<byte> source, int seed = 0);
    public uint GetCurrentHashAsUInt32();
}

public partial sealed class Crc64 : NonCryptographicHashAlgorithm
{
    public static ulong HashToUInt64(ReadOnlySpan<byte> source);
    public ulong GetCurrentHashAsInt64();
}

public partial sealed class XxHash64 : NonCryptographicHashAlgorithm
{
    public static ulong HashToUInt64(ReadOnlySpan<byte> source, long seed = 0);
    public ulong GetCurrentHashAsUInt64();
}

public partial sealed class XxHash3 : NonCryptographicHashAlgorithm
{
    public static ulong HashToUInt64(ReadOnlySpan<byte> source, long seed = 0);
    public ulong GetCurrentHashAsUInt64();
}

public partial sealed class XxHash128 : NonCryptographicHashAlgorithm
{
    public static UInt128 HashToUInt128(ReadOnlySpan<byte> source, long seed = 0);
    public UInt128 GetCurrentHashAsUInt128();
}

teo-tsirpanis · 2022-11-08T21:32:31Z

The entry in dotnet/apireviews got a bit messed up with #69590.

xoofx · 2022-11-12T10:11:03Z

Just to make sure, for XxHash128, it is actually a UInt128 that we should return, right?

public partial sealed class XxHash128 : NonCryptographicHashAlgorithm
{
-    public static Int128 HashToUInt128(ReadOnlySpan<byte> source, long seed = 0);
+    public static UInt128 HashToUInt128(ReadOnlySpan<byte> source, long seed = 0);
-    public Int128 GetCurrentHashAsUInt128();
+    public UInt128 GetCurrentHashAsUInt128();
}

stephentoub · 2022-11-12T13:20:33Z

yup, UInt128, thanks

stephentoub added api-suggestion Early API idea and discussion, it is NOT ready for implementation area-System.IO labels Sep 28, 2022

stephentoub added this to the 8.0.0 milestone Sep 28, 2022

adamsitnik added area-System.Security and removed area-System.IO labels Sep 28, 2022

stephentoub added api-ready-for-review API is ready for review, it is NOT ready for implementation and removed api-suggestion Early API idea and discussion, it is NOT ready for implementation labels Oct 4, 2022

stephentoub self-assigned this Oct 4, 2022

bartonjs added api-suggestion Early API idea and discussion, it is NOT ready for implementation and removed api-ready-for-review API is ready for review, it is NOT ready for implementation labels Oct 5, 2022

stephentoub added api-ready-for-review API is ready for review, it is NOT ready for implementation and removed api-suggestion Early API idea and discussion, it is NOT ready for implementation labels Oct 12, 2022

stephentoub mentioned this issue Nov 2, 2022

Optimize XxHash3 #77756

Merged

bartonjs added api-approved API was approved in API review, it can be implemented and removed api-ready-for-review API is ready for review, it is NOT ready for implementation labels Nov 8, 2022

This was referenced Nov 8, 2022

Add HashTo/GetCurrentHashAs{UInt32/64} methods #78075

Merged

Add XxHash128 #77944

Merged

stephentoub added the in-pr There is an active PR which will close this issue when it is merged label Nov 9, 2022

stephentoub closed this as completed in #78075 Nov 16, 2022

ghost removed the in-pr There is an active PR which will close this issue when it is merged label Nov 16, 2022

jeffhandley added area-System.IO.Hashing and removed area-System.Security labels Nov 22, 2022

tarekgh mentioned this issue Nov 22, 2022

Libraries work to support Cloud Native effort for .NET 8 #78518

Closed

61 tasks

ghost locked as resolved and limited conversation to collaborators Dec 22, 2022

[API Proposal]: Simple one-shot hashing methods that produce the relevant type #76279

[API Proposal]: Simple one-shot hashing methods that produce the relevant type #76279

Comments

stephentoub commented Sep 28, 2022 • edited Loading

Background and motivation

API Proposal

API Usage

Alternative Designs

Risks

ghost commented Sep 28, 2022

Background and motivation

API Proposal

API Usage

Alternative Designs

Risks

LeaFrock commented Sep 28, 2022

ghost commented Sep 28, 2022

Background and motivation

API Proposal

API Usage

Alternative Designs

Risks

LeaFrock commented Oct 5, 2022 • edited Loading

bartonjs commented Oct 5, 2022

stephentoub commented Oct 5, 2022 • edited Loading

LeaFrock commented Oct 6, 2022

GrabYourPitchforks commented Oct 10, 2022

stephentoub commented Oct 10, 2022 • edited Loading

stephentoub commented Oct 10, 2022 • edited Loading

GrabYourPitchforks commented Oct 10, 2022

stephentoub commented Oct 10, 2022 • edited Loading

GrabYourPitchforks commented Oct 10, 2022

stephentoub commented Oct 10, 2022 • edited Loading

stephentoub commented Oct 10, 2022 • edited Loading

LeaFrock commented Oct 11, 2022 • edited Loading

stephentoub commented Oct 11, 2022

LeaFrock commented Oct 11, 2022

stephentoub commented Oct 11, 2022 • edited Loading

LeaFrock commented Oct 11, 2022

GrabYourPitchforks commented Oct 12, 2022 • edited Loading

jods4 commented Nov 2, 2022

GrabYourPitchforks commented Nov 5, 2022

stephentoub commented Nov 6, 2022

terrajobst commented Nov 8, 2022

bartonjs commented Nov 8, 2022 • edited by stephentoub Loading

teo-tsirpanis commented Nov 8, 2022

xoofx commented Nov 12, 2022

stephentoub commented Nov 12, 2022

stephentoub commented Sep 28, 2022 •

edited

Loading

LeaFrock commented Oct 5, 2022 •

edited

Loading

stephentoub commented Oct 5, 2022 •

edited

Loading

stephentoub commented Oct 10, 2022 •

edited

Loading

stephentoub commented Oct 10, 2022 •

edited

Loading

stephentoub commented Oct 10, 2022 •

edited

Loading

stephentoub commented Oct 10, 2022 •

edited

Loading

stephentoub commented Oct 10, 2022 •

edited

Loading

LeaFrock commented Oct 11, 2022 •

edited

Loading

stephentoub commented Oct 11, 2022 •

edited

Loading

GrabYourPitchforks commented Oct 12, 2022 •

edited

Loading

bartonjs commented Nov 8, 2022 •

edited by stephentoub

Loading