Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[API Proposal]: Simple one-shot hashing methods that produce the relevant type #76279

Closed
stephentoub opened this issue Sep 28, 2022 · 29 comments · Fixed by #78075
Closed

[API Proposal]: Simple one-shot hashing methods that produce the relevant type #76279

stephentoub opened this issue Sep 28, 2022 · 29 comments · Fixed by #78075
Assignees
Labels
api-approved API was approved in API review, it can be implemented area-System.IO.Hashing
Milestone

Comments

@stephentoub
Copy link
Member

stephentoub commented Sep 28, 2022

Background and motivation

System.IO.Hashing contains types for Crc32, Crc64, XxHash32, and XxHash64, and XxHash3 is being added. These types all have one-shot static methods for performing hashes over ReadOnlySpan<byte>, but all of these methods either allocate a resulting byte[] or write the result into a Span<byte>. However, the natural data types expecting in many use cases is just an int/uint for 32-bit or long/ulong for 64-bit hashes. We should add helpers for this.

API Proposal

namespace System.IO.Hashing;

public sealed class Crc32 : NonCryptographicHashAlgorithm
{
+   public static int HashToInt32(ReadOnlySpan<byte> source);
+   public int GetCurrentHashAsInt32();
    ...
}

public sealed class XxHash32 : NonCryptographicHashAlgorithm
{
+   public static int HashToInt32(ReadOnlySpan<byte> source, int seed = 0);
+   public int GetCurrentHashAsInt32();
    ...
}

public sealed class Crc64 : NonCryptographicHashAlgorithm
{
+   public static long HashToInt64(ReadOnlySpan<byte> source);
+   public long GetCurrentHashAsInt64();
    ...
}

public sealed class XxHash64 : NonCryptographicHashAlgorithm
{
+   public static long HashToInt64(ReadOnlySpan<byte> source, long seed = 0);
+   public long GetCurrentHashAsInt64();
    ...
}

public sealed class XxHash3 : NonCryptographicHashAlgorithm
{
+   public static long HashToInt64(ReadOnlySpan<byte> source, long seed = 0);
+   public long GetCurrentHashAsInt64();
    ...
}

public sealed class XxHash128 : NonCryptographicHashAlgorithm // assuming https://github.com/dotnet/runtime/issues/77885
{
+   public static Int128 HashToInt128(ReadOnlySpan<byte> source, long seed = 0);
+   public Int128 GetCurrentHashAsInt128();
    ...
}
  • All of these implementations already compute the relevant int/long and then write it to either a newly-allocated array or the destination span. We'd just return it instead from the new method.
  • The existing methods write the value to the destination as big-endian. The int/long-returning methods would simply return the value in machine-endianness.
  • Should we use unsigned types instead? All of the algorithms internally operate in unsigned land.

API Usage

ReadOnlySpan<byte> data = ...;
int hash = XxHash32.HashToInt32(data);

Alternative Designs

No response

Risks

No response

@stephentoub stephentoub added api-suggestion Early API idea and discussion, it is NOT ready for implementation area-System.IO labels Sep 28, 2022
@stephentoub stephentoub added this to the 8.0.0 milestone Sep 28, 2022
@ghost
Copy link

ghost commented Sep 28, 2022

Tagging subscribers to this area: @dotnet/area-system-io
See info in area-owners.md if you want to be subscribed.

Issue Details

Background and motivation

System.IO.Hashing contains types for Crc32, Crc64, XxHash32, and XxHash64, and XxHash3 is being added. These types all have one-shot static methods for performing hashes over ReadOnlySpan<byte>, but all of these methods either allocate a resulting byte[] or write the result into a Span<byte>. However, the natural data types expecting in many use cases is just an int/uint for 32-bit or long/ulong for 64-bit hashes. We should add helpers for this.

API Proposal

namespace System.IO.Hashing;

public sealed class Crc32 : NonCryptographicHashAlgorithm
{
    public static byte[] Hash(ReadOnlySpan<byte> source);
    public static int Hash(ReadOnlySpan<byte> source, Span<byte> destination);
+   public static uint HashToUInt32(ReadOnlySpan<byte> source);
    ...
}

public sealed class XxHash32 : NonCryptographicHashAlgorithm
{
    public static byte[] Hash(ReadOnlySpan<byte> source);
    public static int Hash(ReadOnlySpan<byte> source, Span<byte> destination);
+   public static uint HashToUInt32(ReadOnlySpan<byte> source);
    ...
}

public sealed class Crc64 : NonCryptographicHashAlgorithm
{
    public static byte[] Hash(ReadOnlySpan<byte> source);
    public static int Hash(ReadOnlySpan<byte> source, Span<byte> destination);
+   public static ulong HashToUInt64(ReadOnlySpan<byte> source);
    ...
}

public sealed class XxHash64 : NonCryptographicHashAlgorithm
{
    public static byte[] Hash(ReadOnlySpan<byte> source);
    public static int Hash(ReadOnlySpan<byte> source, Span<byte> destination);
+   public static ulong HashToUInt64(ReadOnlySpan<byte> source);
    ...
}

public sealed class XxHash3 : NonCryptographicHashAlgorithm
{
    public static byte[] Hash(ReadOnlySpan<byte> source);
    public static int Hash(ReadOnlySpan<byte> source, Span<byte> destination);
+   public static ulong HashToUInt64(ReadOnlySpan<byte> source);
    ...
}
  • All of these implementations already compute the relevant uint/ulong and then write it to either a newly-allocated array or the destination span. We'd just return it instead from the new method.
  • The existing methods write the value to the destination as big-endian. The uint/ulong-returning methods would simply return the value in machine-endianness.
  • If we ever added a 128-bit hash, like XxHash128, we'd return UInt128.
  • We could choose to return int/long instead of uint/ulong, but internally we're operating in terms of unsigned, our BitOperations APIs that hash operate on unsigned (e.g. Expose general purpose Crc32 APIs #2036 (comment)), and looking around the web, unsigned is generally what others appear to use.

API Usage

ReadOnlySpan<byte> data = ...;
uint hash = XxHash32.HashToUInt32(data);

Alternative Designs

No response

Risks

No response

Author: stephentoub
Assignees: -
Labels:

api-suggestion, area-System.IO

Milestone: 8.0.0

@LeaFrock
Copy link
Contributor

Hah, that's what I want!

But so sad that I have to wait for .NET 8 😭

@ghost
Copy link

ghost commented Sep 28, 2022

Tagging subscribers to this area: @dotnet/area-system-security, @vcsjones
See info in area-owners.md if you want to be subscribed.

Issue Details

Background and motivation

System.IO.Hashing contains types for Crc32, Crc64, XxHash32, and XxHash64, and XxHash3 is being added. These types all have one-shot static methods for performing hashes over ReadOnlySpan<byte>, but all of these methods either allocate a resulting byte[] or write the result into a Span<byte>. However, the natural data types expecting in many use cases is just an int/uint for 32-bit or long/ulong for 64-bit hashes. We should add helpers for this.

API Proposal

namespace System.IO.Hashing;

public sealed class Crc32 : NonCryptographicHashAlgorithm
{
    public static byte[] Hash(ReadOnlySpan<byte> source);
    public static int Hash(ReadOnlySpan<byte> source, Span<byte> destination);
+   public static uint HashToUInt32(ReadOnlySpan<byte> source);
    ...
}

public sealed class XxHash32 : NonCryptographicHashAlgorithm
{
    public static byte[] Hash(ReadOnlySpan<byte> source);
    public static int Hash(ReadOnlySpan<byte> source, Span<byte> destination);
+   public static uint HashToUInt32(ReadOnlySpan<byte> source);
    ...
}

public sealed class Crc64 : NonCryptographicHashAlgorithm
{
    public static byte[] Hash(ReadOnlySpan<byte> source);
    public static int Hash(ReadOnlySpan<byte> source, Span<byte> destination);
+   public static ulong HashToUInt64(ReadOnlySpan<byte> source);
    ...
}

public sealed class XxHash64 : NonCryptographicHashAlgorithm
{
    public static byte[] Hash(ReadOnlySpan<byte> source);
    public static int Hash(ReadOnlySpan<byte> source, Span<byte> destination);
+   public static ulong HashToUInt64(ReadOnlySpan<byte> source);
    ...
}

public sealed class XxHash3 : NonCryptographicHashAlgorithm
{
    public static byte[] Hash(ReadOnlySpan<byte> source);
    public static int Hash(ReadOnlySpan<byte> source, Span<byte> destination);
+   public static ulong HashToUInt64(ReadOnlySpan<byte> source);
    ...
}
  • All of these implementations already compute the relevant uint/ulong and then write it to either a newly-allocated array or the destination span. We'd just return it instead from the new method.
  • The existing methods write the value to the destination as big-endian. The uint/ulong-returning methods would simply return the value in machine-endianness.
  • If we ever added a 128-bit hash, like XxHash128, we'd return UInt128.
  • We could choose to return int/long instead of uint/ulong, but internally we're operating in terms of unsigned, our BitOperations APIs that hash operate on unsigned (e.g. Expose general purpose Crc32 APIs #2036 (comment)), and looking around the web, unsigned is generally what others appear to use.

API Usage

ReadOnlySpan<byte> data = ...;
uint hash = XxHash32.HashToUInt32(data);

Alternative Designs

No response

Risks

No response

Author: stephentoub
Assignees: -
Labels:

api-suggestion, area-System.Security

Milestone: 8.0.0

@stephentoub stephentoub added api-ready-for-review API is ready for review, it is NOT ready for implementation and removed api-suggestion Early API idea and discussion, it is NOT ready for implementation labels Oct 4, 2022
@stephentoub stephentoub self-assigned this Oct 4, 2022
@LeaFrock
Copy link
Contributor

LeaFrock commented Oct 5, 2022

I have a very small question. Let's take XxHash64 for example: why is the API long HashToInt64 rather than ulong HashToUInt64?

The existing methods write the value to the destination as big-endian. The int/long-returning methods would simply return the value in machine-endianness.

Is it just a convention? I have this question because I see many underlying calculations use unsigned number in fact, such as String.GetHashCode(),

Marvin.ComputeHash32(ref Unsafe.As<char, byte>(ref _firstChar), (uint)_stringLength * 2 /* in bytes, not chars */, (uint)seed, (uint)(seed >> 32));

@bartonjs bartonjs added api-suggestion Early API idea and discussion, it is NOT ready for implementation and removed api-ready-for-review API is ready for review, it is NOT ready for implementation labels Oct 5, 2022
@bartonjs
Copy link
Member

bartonjs commented Oct 5, 2022

@stephentoub I've removed ready-for-review because I want @GrabYourPitchforks to acknowledge the existence of the proposal (and perhaps even express opinions) before it comes up in a review meeting.

@stephentoub
Copy link
Member Author

stephentoub commented Oct 5, 2022

Ok, but I think it's important we add this. Literally every hashing API I've found that others expose just surface the uint/int/ulong/long. That our APIs force you to either allocate an array or stackalloc some space into which to store the result, figure out that it's been written as a big-endian value, and then convert it back to being a number is a pit of failure. I've seen zero use cases that actually want a byte[].

@LeaFrock
Copy link
Contributor

LeaFrock commented Oct 6, 2022

Ok, but I think it's important we add this. Literally every hashing API I've found that others expose just surface the uint/int/ulong/long. That our APIs force you to either allocate an array or stackalloc some space into which to store the result, figure out that it's been written as a big-endian value, and then convert it back to being a number is a pit of failure. I've seen zero use cases that actually want a byte[].

Absolutely agree.

Should we use unsigned types instead? All of the algorithms internally operate in unsigned land.

My question above is just related to this.

@GrabYourPitchforks
Copy link
Member

The only strong opinion I have is that the endianness should be explicit: include it as part of the API name, or force the caller to include it as an enum parameter. All of these digest algorithms are intended to generate stable output across machines. It's an API pit of failure if we make the "simple" API lead developers unwittingly down a path which violates this fundamental property.

@stephentoub
Copy link
Member Author

stephentoub commented Oct 10, 2022

The only strong opinion I have is that the endianness should be explicit: include it as part of the API name, or force the caller to include it as an enum parameter.

I can include that if we really believe it's important, but we've already failed at that in this library. None of the static {Try}Hash methods state the order of the bytes that were written out, and the derived types differ as to whether they use little or big endian to do so, both for the statics and for the instance members.

@stephentoub
Copy link
Member Author

stephentoub commented Oct 10, 2022

All of these digest algorithms are intended to generate stable output across machines.

From my perspective, using machine endianness for the hash code in these Int32/Int64-returning methods does that implicitly. The proposed API returns an Int32 and you get back e.g. 12345 as that Int32 regardless of whether the machine is little or big endian. Forcing the method into a specific endianness is what violates this, as it means you get 12345 on some machines and 959447040 on others.

@GrabYourPitchforks
Copy link
Member

None of the static Hash methods state the order of the bytes that were written out, and the derived types differ as to whether they use little or big endian to do so, both for the statics and for the instance members.

The output of any digest algorithm is raw bytes. There's no concept of "endianness" unless you attempt to interpret those bytes as an integral value. For instance, we don't say that the digest output of SHA256 is big-endian, little-endian, or something else. It's just raw bytes. You're proposing here an API which performs its own opinionated interpretation of the digest output.

Looking through the implementations of all the [NonCryptographic]HashAlgorithm-derived classes, I don't immediately see anything which is ambiguous with regard to endianness. All the implementations I'm seeing follow the "everything is raw bytes" pattern. What am I missing?

@stephentoub
Copy link
Member Author

stephentoub commented Oct 10, 2022

The output of any digest algorithm is raw bytes

Every other example I can find of such hash functions, both in native libraries we consume (e.g. zlib), in exemplar implementations on which ours are based (e.g. xxhash), in other frameworks, in other .NET implementations of these same hash functions, and even elsewhere in .NET (e.g. BitOperations.Crc32C) operate in terms of int/long, not in terms of bytes. Even the internal workhorse functions of all of these algorithms are operating on numbers, and then we take a final step of serializing those numbers into bytes.

What am I missing?

Almost all of the examples I've found are folks who want the values as integers, and thus are having to reconstruct that from the bytes we spent energy producing, e.g.
https://github.com/dotnet/orleans/blob/cf4423ea4d75b4ab8ecf071968a2a4bfd463646d/src/Orleans.Serialization/TypeSystem/TypeCodec.cs#L210
https://github.com/LeaFrock/BloomFilterSharp/blob/662b0475b3acab4d659a1b298dba4423b53399c3/src/BloomFilterSharp.Hashing/XxHash64Function.cs#L11-L12
https://github.com/avafloww/Thaliak/blob/01eb1e0668b399a70c580f2be1c1480a9699ccda/Thaliak.Common.Database/Models/XivRepository.cs#L26-L28
and in some cases doing so in a way that doesn't pay attention to endianness and thus will violate the "same output" concern because we focused on outputting bytes.

For instance, we don't say that the digest output of SHA256 is big-endian, little-endian, or something else

We have these docs for System.IO.Hashing.Crc32, for example:
https://learn.microsoft.com/en-us/dotnet/api/system.io.hashing.crc32?view=dotnet-plat-ext-7.0
"This implementation emits the answer in the Little Endian byte order so that the CRC residue relationship (CRC(message concat CRC(message)) is a fixed value) holds. For CRC-32, this stable output is the byte sequence { 0x1C, 0xDF, 0x44, 0x21 }, the Little Endian representation of 0x2144DF1C."

@GrabYourPitchforks
Copy link
Member

Even the internal workhorse functions of all of these algorithms are operating on numbers, and then we take a final step of serializing those numbers into bytes.

The public API surface shouldn't be concerned with internal implementation details. The API surface should only ever be concerned with the caller-observable output.

We have these docs for System.IO.Hashing.Crc32, for example

Because the CRC32 spec is deeply weird and isn't written as a typical digest algorithm spec, so callers need to know "this is the version of CRC32 you're calling." The paragraph immediately under the one you quote in the docs even says as such. It's unfortunately not a good model to base a usable API off of. :(

To the samples that you gave, I think they actually provide a strong argument for why any opinionated APIs we add should explicitly state the endianness.

  • Orleans: They want the output in little endian.
  • BloomFilterSharp: Their call site appears to be buggy, since they're serializing / persisting a machine-endian value.
  • Thaliak: They want the output in big endian.

@stephentoub
Copy link
Member Author

stephentoub commented Oct 10, 2022

  • Orleans: They want the output in little endian.

No, they want a number that's going to be the same regardless of architecture. Which is exactly what the proposed API will give them.

BloomFilterSharp: Their call site appears to be buggy, since they're serializing / persisting a machine-endian value.

That's my point: because we've forced them to go through bytes, we've forced them to consider endianness, and they're not... we've created a pit of failure. An API that returns machine endianness for the number will always produce the same hash value for the same input bytes, regardless of current machine endianness.

@stephentoub
Copy link
Member Author

stephentoub commented Oct 10, 2022

The API surface should only ever be concerned with the caller-observable output.

XxHash3.HashToInt32(new[] { 1, 2, 3, 4, 5, 6 }) will always return the same numeric ulong value 0xACCC1C819D65620C, regardless of architecture. I don't understand why endianness is even an issue in the conversation here. Seems to me it'd be like saying Int32.Parse should have been Int32.ParseLittleEndian such that ParseLittleEndian("12345") would return the Int32 value 959447040 on a big-endian machine.

@LeaFrock
Copy link
Contributor

LeaFrock commented Oct 11, 2022

BloomFilterSharp: Their call site appears to be buggy, since they're serializing / persisting a machine-endian value

I never consider about it because I'm just developing a filter using local memory, so the machine endianness isn't a problem. But you're right as the hash function could be called on different machines in a distributed scenario.

Now the key point of divergence is whether the top-level developers want a number that's going to be the same regardless of architecture. Both of your opinions sound reasonable depending on the concrete scenarios.

An API that returns machine endianness for the number will always produce the same hash value for the same input bytes, regardless of current machine endianness.

I support it for most requirements. But considering for a situation that people want to appoint the concrete endian for compatible with interactions between existing systems (which may be based on other languages) or machines, it'd be better to retain some flexibility. (You may argue that people should use byte[] API but you can't guarantee that)

I think this is more appropriate for the analogy of System.StringComparison. Could it be possible that we add a similar enum for the proposed API?

public static long HashToInt64(ReadOnlySpan<byte> source, long seed = 0, EndianOption endianness = default);

public enum EndianOption // don’t care about the name
{
       Default = 0, // As stephentoub says, always return the same value regardless of machine
       Current, // Obey the current machine endian
       Big,
       Little,
}

Or even keep a HashToOptions for potential extensibility,

public static long HashToInt64(ReadOnlySpan<byte> source, long seed = 0, HashToOptions? option = null);

public struct HashToOptions
{
      public EndianOption EndianOption { get;set }
}

@stephentoub
Copy link
Member Author

Default = 0, // As stephentoub says, always return the same value regardless of machine
Current, // Obey the current machine endian

What is the difference between these two?

@LeaFrock
Copy link
Contributor

Well, I'm not sure if my understanding is right. Suppose that machine A uses BE as default, when machine B uses LE.

Default will obey the rule of API itself which ignores all the differences between machines. A and B always produce the same outputs.
Current obey the rule of machine system so that the outputs are different from A and B.

@stephentoub
Copy link
Member Author

stephentoub commented Oct 11, 2022

Current obey the rule of machine system so that the outputs are different from A and B.

As long as an algorithm isn't reinterpreting the raw bytes that make up an integer, endianness has no effect on the math involved, e.g. 2*3 is 6 regardless of whether you're on a little-endian or big-endian machine, 42 >> 3 is 5 regardless of whether you're on a little-endian or big-endian machine, (byte)0x12345678 is 0x78 regardless of whether you're on a little-endian or big-endian machine, etc., but int i = 0x12345678; *(byte*)&i will differ, as it's bypassing the normal language rules for how to interpret the bytes and instead accessing a specific memory location directly, and thus will be 0x78 on a little-endian machine and 0x12 on a big-endian machine.

All of these algorithms just operate on the numbers, not the raw bytes, or if they do operate on the raw bytes, the algorithms themselves account for endianness by dictating the ordering in which the bytes need to be interpreted. For example, the Crc32 algorithm's main loop is just:

for (int i = 0; i < source.Length; i++)
{
byte idx = (byte)crc;
idx ^= source[i];
crc = s_crcLookup[idx] ^ (crc >> 8);
}

There's nothing here endianness-specific: on both little-endian and big-endian machines, it'll produce the exact same uint value (the underlying memory will store the bytes that make up that uint differently, but the uint will have the same value). Or in XXH64, for example, if bytes need to be read and re-interpreted as an integer, it's done so with a fixed endianness regardless of the machine's architecture:
return ApplyRound(acc, BinaryPrimitives.ReadUInt64LittleEndian(lane));

and that's dictated by the algorithm, e.g. https://github.com/Cyan4973/xxHash/blob/f9155bd4c57e2270a4ffbb176485e5d713de1c9b/doc/xxhash_spec.md states "However, a given variant shall produce exactly the same output, irrespective of the cpu / os used. In particular, the result remains identical whatever the endianness and width of the cpu is."

In other words, regardless of whatever additional APIs .NET wants to layer on top, these algorithms produce a number and that number is computed to have a value that's the same regardless of endianness, e.g.
https://github.com/Cyan4973/xxHash/blob/43ea6fdf838782688716f6ad85792a39e7e38e9a/xxhash.h#L161
producing a uint64_t. The only reason endianness comes into play in the discussions about these NonCryptographicHashAlgorithm APIs is because we exposed APIs that take the resulting number and export that number as the bytes that make it up, as the {Try}Hash methods all only return the data as byte[] or written into a Span<byte>, in which case we have to choose an ordering for those bytes. But when talking about the actual numerical result of the algorithm, endianness shouldn't be a factor, as the results are just numbers. Which is why I don't think anything to do with endianness should be a part of the APIs being proposed here that are just returning int/long. Doing so introduces complexity and problems rather than solving them.

@LeaFrock
Copy link
Contributor

As long as an algorithm isn't reinterpreting the raw bytes that make up an integer, endianness has no effect on the math involved
...
All of these algorithms just operate on the numbers, not the raw bytes, or if they do operate on the raw bytes, the algorithms themselves account for endianness by dictating the ordering in which the bytes need to be interpreted

Got it. Thanks for your explanation and patience 😄!

The last worry I have is that if a library has already used (Try)Hash and BitConverter with byte[] and appointed endianness to get a int/long value, it would meet potential break changes after switching to the new API. But it's not a big problem as more detailed docs will help that.

Almost all of the examples I've found are folks who want the values as integers, and thus are having to reconstruct that from the bytes we spent energy producing

What I regret is that the proposed APIs and related discussions did not occur before .NET 6 (adding System.IO.Hashing).

@GrabYourPitchforks
Copy link
Member

GrabYourPitchforks commented Oct 12, 2022

I think we're all in agreement on the basic premise: we should provide a nice API for getting integral hash codes. This is immediately useful for lots of people.

The things we disagree on: how should we translate the output into an integer, and should the mechanism that we use for this dictate the chosen name? But after talking with Steve and Jeremy more offline I think we're actually close to agreement here as well. :)

A quick note on terminology. These are all digest algorithms (a.k.a. hash algorithms). Digest algorithms are generally defined as "a variable number of bytes in" and "a fixed-length sequence of bytes out." This means that for [NonCryptographic]HashAlgorithm-derived classes we want the resulting digest bytes to be the source of truth. Always. Regardless of the underlying implementation.

A digest algorithm might choose to use integral types for its intermediate or terminal state, before that state is converted to a fixed-length byte sequence. But from the start we have made a very deliberate decision across all [NonCryptographic]HashAlgorithm-derived types to treat them as digest algorithms and to expose digest-style APIs as the primary interface for these types. It's right there in the name of the base class!

So if we expose GetHashCodeAsXxx APIs on these types, it really should follow two constraints.

  1. The output should be stable across architectures. The digest itself is by-design stable across architectures, so anything derived from it should be similarly stable. All three examples posted at [API Proposal]: Simple one-shot hashing methods that produce the relevant type #76279 (comment) persist the hash code to storage, so we have evidence that people already have this expectation.

  2. The GetHashCodeAsXxx method should be documented in terms of the source-of-truth digest bytes. For example, the doc page for XxHash32.GetHashCodeAsInt32 should say something like this:

    The return value of XxHash32.GetHashCodeAsInt32 is equivalent to having written the following:

    byte[] bytes = new byte[sizeof(int)];
    XxHash32.HashData(/* ... */, bytes);
    return BinaryPrimitives.ReadInt32BigEndian(bytes);

That second part is where the "endianness" concern comes into play. Since the source of truth is the digest bytes (it's a digest algorithm, after all!), we should document the exact transform we're using to go from one representation to the other.

Of course, nothing requires us to have the GetHashCodeAsXxx method actually do the conversion from bytes -> integer. If we already have the integer as an intermediate representation from the internal digest state, great! Return it. But if we change the algorithm implementation in the future to use intrinsics, or to use novel mathematical constructs, then we might no longer have that intermediate representation readily available to us and may need to fall back to a standard bytes -> integer approach.

When Jeremy and I spoke on Monday, he mentioned that there was no need for different algorithms to use identical conversions to go from source of truth -> friendly representation. Some could use little, some could use big, some could do their own weird byte shuffle. The only important aspect is that it's documented. This makes me a bit uneasy because it means that consumers need to consult the docs for each subtype they might be working with, but I'm not willing to die on this hill as long as we have docs.

@stephentoub stephentoub added api-ready-for-review API is ready for review, it is NOT ready for implementation and removed api-suggestion Early API idea and discussion, it is NOT ready for implementation labels Oct 12, 2022
@jods4
Copy link

jods4 commented Nov 2, 2022

For digest algorithms (a.k.a. hash algorithms) [...] we want the resulting digest bytes to be the source of truth. Always.
So if we expose GetHashCodeAsXxx APIs on these types, it really should follow two constraints.

Amusingly, one of the most fundamental method of .net is Object.GetHashCode() and it returns an int ;)

It has the same name as GetHashCodeAsXxx and it fits the terminology of "a variable number of bytes in" and "a fixed-length sequence of bytes out.", especially in cases like string or even byte[]!

Maybe it's a huge mistake in framework design, but I have a hunch that if we designed .net from scratch today we wouldn't make it byte[] Object.GetHashCode().

@GrabYourPitchforks
Copy link
Member

@jods4 - object.GetHashCode is very specifically intended for usage within a dictionary and other keyed hash collections in-proc. It is not meant to be persisted. It does not need to be stable across architectures, or even across back-to-back executions of the app on the same machine! Its underlying algorithm is strictly an implementation detail and callers cannot rely on it to be a member of any specific algorithm family.

@stephentoub - since Int128 is net7+ only, do we need to condition the friendly one-shot helper on Xxhash128 to be exposed only in net7+?

@stephentoub
Copy link
Member Author

since Int128 is net7+ only, do we need to condition the friendly one-shot helper on Xxhash128 to be exposed only in net7+?

Yes, it'll be .NET 7 only.

@terrajobst
Copy link
Member

That our APIs force you to either allocate an array or stackalloc some space into which to store the result, figure out that it's been written as a big-endian value, and then convert it back to being a number is a pit of failure.

LOL. I literally just wrote that code yesterday.

@bartonjs
Copy link
Member

bartonjs commented Nov 8, 2022

Video

  • Note: .NET Standard won't have the UInt128 methods as the type doesn't exist for .NET Standard today
namespace System.IO.Hashing;

public partial sealed class Crc32 : NonCryptographicHashAlgorithm
{
    public static uint HashToUInt32(ReadOnlySpan<byte> source);
    public uint GetCurrentHashAsUInt32();
}

public partial sealed class XxHash32 : NonCryptographicHashAlgorithm
{
    public static uint HashToUInt32(ReadOnlySpan<byte> source, int seed = 0);
    public uint GetCurrentHashAsUInt32();
}

public partial sealed class Crc64 : NonCryptographicHashAlgorithm
{
    public static ulong HashToUInt64(ReadOnlySpan<byte> source);
    public ulong GetCurrentHashAsInt64();
}

public partial sealed class XxHash64 : NonCryptographicHashAlgorithm
{
    public static ulong HashToUInt64(ReadOnlySpan<byte> source, long seed = 0);
    public ulong GetCurrentHashAsUInt64();
}

public partial sealed class XxHash3 : NonCryptographicHashAlgorithm
{
    public static ulong HashToUInt64(ReadOnlySpan<byte> source, long seed = 0);
    public ulong GetCurrentHashAsUInt64();
}

public partial sealed class XxHash128 : NonCryptographicHashAlgorithm
{
    public static UInt128 HashToUInt128(ReadOnlySpan<byte> source, long seed = 0);
    public UInt128 GetCurrentHashAsUInt128();
}

@bartonjs bartonjs added api-approved API was approved in API review, it can be implemented and removed api-ready-for-review API is ready for review, it is NOT ready for implementation labels Nov 8, 2022
@teo-tsirpanis
Copy link
Contributor

The entry in dotnet/apireviews got a bit messed up with #69590.

@stephentoub stephentoub added the in-pr There is an active PR which will close this issue when it is merged label Nov 9, 2022
@xoofx
Copy link
Member

xoofx commented Nov 12, 2022

Just to make sure, for XxHash128, it is actually a UInt128 that we should return, right?

public partial sealed class XxHash128 : NonCryptographicHashAlgorithm
{
-    public static Int128 HashToUInt128(ReadOnlySpan<byte> source, long seed = 0);
+    public static UInt128 HashToUInt128(ReadOnlySpan<byte> source, long seed = 0);
-    public Int128 GetCurrentHashAsUInt128();
+    public UInt128 GetCurrentHashAsUInt128();
}

@stephentoub
Copy link
Member Author

yup, UInt128, thanks

@ghost ghost removed the in-pr There is an active PR which will close this issue when it is merged label Nov 16, 2022
@ghost ghost locked as resolved and limited conversation to collaborators Dec 22, 2022
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
api-approved API was approved in API review, it can be implemented area-System.IO.Hashing
Projects
None yet
Development

Successfully merging a pull request may close this issue.

10 participants