[API Proposal]: Simple one-shot hashing methods that produce the relevant type #76279
Comments
Tagging subscribers to this area: @dotnet/area-system-io
Issue Details
Background and motivation
System.IO.Hashing contains types for Crc32, Crc64, XxHash32, and XxHash64, and XxHash3 is being added. These types all have one-shot static methods for performing hashes over ReadOnlySpan<byte>, but all of these methods either allocate a resulting byte[] or write the result into a Span<byte>. However, the natural data type expected in many use cases is just an int/uint for 32-bit hashes or a long/ulong for 64-bit hashes. We should add helpers for this.
API Proposal
namespace System.IO.Hashing;
public sealed class Crc32 : NonCryptographicHashAlgorithm
{
public static byte[] Hash(ReadOnlySpan<byte> source);
public static int Hash(ReadOnlySpan<byte> source, Span<byte> destination);
+ public static uint HashToUInt32(ReadOnlySpan<byte> source);
...
}
public sealed class XxHash32 : NonCryptographicHashAlgorithm
{
public static byte[] Hash(ReadOnlySpan<byte> source);
public static int Hash(ReadOnlySpan<byte> source, Span<byte> destination);
+ public static uint HashToUInt32(ReadOnlySpan<byte> source);
...
}
public sealed class Crc64 : NonCryptographicHashAlgorithm
{
public static byte[] Hash(ReadOnlySpan<byte> source);
public static int Hash(ReadOnlySpan<byte> source, Span<byte> destination);
+ public static ulong HashToUInt64(ReadOnlySpan<byte> source);
...
}
public sealed class XxHash64 : NonCryptographicHashAlgorithm
{
public static byte[] Hash(ReadOnlySpan<byte> source);
public static int Hash(ReadOnlySpan<byte> source, Span<byte> destination);
+ public static ulong HashToUInt64(ReadOnlySpan<byte> source);
...
}
public sealed class XxHash3 : NonCryptographicHashAlgorithm
{
public static byte[] Hash(ReadOnlySpan<byte> source);
public static int Hash(ReadOnlySpan<byte> source, Span<byte> destination);
+ public static ulong HashToUInt64(ReadOnlySpan<byte> source);
...
}
API Usage
ReadOnlySpan<byte> data = ...;
uint hash = XxHash32.HashToUInt32(data);
Alternative Designs
No response
Risks
No response
|
Hah, that's what I want! But so sad that I have to wait for .NET 8 😭 |
Tagging subscribers to this area: @dotnet/area-system-security, @vcsjones
|
I have a very small question. Let's take
Is it just a convention? I have this question because I see that many of the underlying calculations in fact use unsigned numbers, such as Marvin.ComputeHash32(ref Unsafe.As<char, byte>(ref _firstChar), (uint)_stringLength * 2 /* in bytes, not chars */, (uint)seed, (uint)(seed >> 32)); |
@stephentoub I've removed ready-for-review because I want @GrabYourPitchforks to acknowledge the existence of the proposal (and perhaps even express opinions) before it comes up in a review meeting. |
Ok, but I think it's important we add this. Literally every hashing API I've found that others expose just surfaces the uint/int/ulong/long. That our APIs force you to either allocate an array or stackalloc some space into which to store the result, figure out that it's been written as a big-endian value, and then convert it back to being a number is a pit of failure. I've seen zero use cases that actually want a byte[]. |
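For illustration, a rough sketch of the round trip callers are forced into today versus the proposed one-shot call; it assumes XxHash32's existing Hash overload writes its 4 output bytes big-endian (as described above), and HashToUInt32 is the proposed addition:

using System;
using System.Buffers.Binary;
using System.IO.Hashing;

ReadOnlySpan<byte> data = new byte[] { 1, 2, 3, 4 };

// Today: hash into a temporary buffer, then reinterpret the bytes as a number.
Span<byte> buffer = stackalloc byte[4];
XxHash32.Hash(data, buffer);
uint viaBytes = BinaryPrimitives.ReadUInt32BigEndian(buffer);

// With the proposal: one call, no intermediate byte buffer.
uint viaProposal = XxHash32.HashToUInt32(data);

Console.WriteLine(viaBytes == viaProposal); // expected: True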
Absolutely agree.
My question above is just related to this. |
The only strong opinion I have is that the endianness should be explicit: include it as part of the API name, or force the caller to include it as an enum parameter. All of these digest algorithms are intended to generate stable output across machines. It's an API pit of failure if we make the "simple" API lead developers unwittingly down a path which violates this fundamental property. |
I can include that if we really believe it's important, but we've already failed at that in this library. None of the static {Try}Hash methods state the order of the bytes that were written out, and the derived types differ as to whether they use little or big endian to do so, both for the statics and for the instance members. |
From my perspective, using machine endianness for the hash code in these Int32/Int64-returning methods does that implicitly. The proposed API returns an Int32 and you get back e.g. 12345 as that Int32 regardless of whether the machine is little or big endian. Forcing the method into a specific endianness is what violates this, as it means you get 12345 on some machines and 959447040 on others. |
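A concrete look at the two numbers in that example: 959447040 is simply 12345 with its four bytes reversed, which is what a caller would observe if the API baked in a byte order opposite to the one they read the result back with.

using System;
using System.Buffers.Binary;

uint value = 12345;                                        // 0x00003039
uint reversed = BinaryPrimitives.ReverseEndianness(value); // 0x39300000

Console.WriteLine(reversed); // 959447040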
The output of any digest algorithm is raw bytes. There's no concept of "endianness" unless you attempt to interpret those bytes as an integral value. For instance, we don't say that the digest output of SHA256 is big-endian, little-endian, or something else. It's just raw bytes. You're proposing here an API which performs its own opinionated interpretation of the digest output. Looking through the implementations of all the |
Every other example I can find of such hash functions, whether in native libraries we consume (e.g. zlib), in exemplar implementations on which ours are based (e.g. xxhash), in other frameworks, in other .NET implementations of these same hash functions, or even elsewhere in .NET (e.g. BitOperations.Crc32C), operates in terms of int/long, not in terms of bytes. Even the internal workhorse functions of all of these algorithms operate on numbers, and then we take a final step of serializing those numbers into bytes.
Almost all of the examples I've found are folks who want the values as integers, and thus are having to reconstruct that from the bytes we spent energy producing, e.g.
We have these docs for System.IO.Hashing.Crc32, for example: |
The public API surface shouldn't be concerned with internal implementation details. The API surface should only ever be concerned with the caller-observable output.
Because the CRC32 spec is deeply weird and isn't written as a typical digest algorithm spec, callers need to know "this is the version of CRC32 you're calling." The paragraph immediately under the one you quote in the docs even says as much. It's unfortunately not a good model to base a usable API off of. :( As for the samples you gave, I think they actually provide a strong argument for why any opinionated APIs we add should explicitly state the endianness.
|
No, they want a number that's going to be the same regardless of architecture. Which is exactly what the proposed API will give them.
That's my point: because we've forced them to go through bytes, we've forced them to consider endianness, and they're not... we've created a pit of failure. An API that returns machine endianness for the number will always produce the same hash value for the same input bytes, regardless of current machine endianness. |
|
I never considered it because I'm just developing a filter using local memory, so the machine endianness isn't a problem. But you're right that the hash function could be called on different machines in a distributed scenario. Now the key point of divergence is whether the top-level developers
I support it for most requirements. But considering a situation where people want to specify a concrete endianness for compatibility with existing systems (which may be based on other languages) or machines, it'd be better to retain some flexibility. (You may argue that people should use ....) I think something along these lines is more appropriate:

public static long HashToInt64(ReadOnlySpan<byte> source, long seed = 0, EndianOption endianness = default);

public enum EndianOption // don't care about the name
{
    Default = 0, // As stephentoub says, always return the same value regardless of machine
    Current,     // Obey the current machine endianness
    Big,
    Little,
}

Or even keep an options type:

public static long HashToInt64(ReadOnlySpan<byte> source, long seed = 0, HashToOptions? option = null);

public struct HashToOptions
{
    public EndianOption EndianOption { get; set; }
} |
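To make the enum idea above concrete, here is a small self-contained sketch (illustrative only; EndianAwareHash and HashXxHash64 are made-up names, and it assumes XxHash64's existing Hash overload writes its digest bytes big-endian) of how such an option could be layered over the current byte-producing API:

using System;
using System.Buffers.Binary;
using System.IO.Hashing;

enum EndianOption { Default = 0, Current, Big, Little }

static class EndianAwareHash
{
    public static ulong HashXxHash64(ReadOnlySpan<byte> source, EndianOption option = EndianOption.Default)
    {
        Span<byte> digest = stackalloc byte[8];
        XxHash64.Hash(source, digest); // existing API: writes the 8 digest bytes

        return option switch
        {
            // Default/Big: interpret the canonical big-endian digest as a number
            // (same value on every machine).
            EndianOption.Default or EndianOption.Big => BinaryPrimitives.ReadUInt64BigEndian(digest),
            EndianOption.Little => BinaryPrimitives.ReadUInt64LittleEndian(digest),
            // Current: follow whatever the running machine does natively.
            EndianOption.Current => BitConverter.IsLittleEndian
                ? BinaryPrimitives.ReadUInt64LittleEndian(digest)
                : BinaryPrimitives.ReadUInt64BigEndian(digest),
            _ => throw new ArgumentOutOfRangeException(nameof(option)),
        };
    }
}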
What is the difference between these two? |
Well, I'm not sure if my understanding is right. Suppose that machine A uses BE by default, while machine B uses LE.
|
As long as an algorithm isn't reinterpreting the raw bytes that make up an integer, endianness has no effect on the math involved: 2*3 is 6 regardless of whether you're on a little-endian or big-endian machine, 42 >> 3 is 5 regardless of whether you're on a little-endian or big-endian machine, (byte)0x12345678 is 0x78 regardless of whether you're on a little-endian or big-endian machine, etc. (it's only reinterpreting raw bytes as an integer that depends on byte order). All of these algorithms just operate on the numbers, not the raw bytes, or if they do operate on the raw bytes, the algorithms themselves account for endianness by dictating the ordering in which the bytes need to be interpreted. For example, the Crc32 algorithm's main loop is just (runtime/src/libraries/System.IO.Hashing/src/System/IO/Hashing/Crc32.cs, lines 160 to 165 at commit 6893524):
There's nothing here endianness-specific: on both little-endian and big-endian machines, it'll produce the exact same uint value (the underlying memory will store the bytes that make up that uint differently, but the uint will have the same value). Or in XXH64, for example, if bytes need to be read and re-interpreted as an integer, it's done so with a fixed endianness regardless of the machine's architecture:
and that's dictated by the algorithm, e.g. https://github.com/Cyan4973/xxHash/blob/f9155bd4c57e2270a4ffbb176485e5d713de1c9b/doc/xxhash_spec.md states "However, a given variant shall produce exactly the same output, irrespective of the cpu / os used. In particular, the result remains identical whatever the endianness and width of the cpu is." In other words, regardless of whatever additional APIs .NET wants to layer on top, these algorithms produce a number and that number is computed to have a value that's the same regardless of endianness, e.g. |
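To illustrate the distinction being drawn in this exchange (a sketch, not the example the comment was referring to): the arithmetic itself is byte-order-agnostic, and where bytes do have to be reinterpreted as an integer, the read is pinned to an explicit endianness so the result is still machine-independent.

using System;
using System.Buffers.Binary;

ReadOnlySpan<byte> bytes = stackalloc byte[] { 0x12, 0x34, 0x56, 0x78, 0x9A, 0xBC, 0xDE, 0xF0 };

// Pure arithmetic: identical on every architecture.
Console.WriteLine(2 * 3);                        // 6 everywhere
Console.WriteLine(42 >> 3);                      // 5 everywhere
Console.WriteLine(unchecked((byte)0x12345678));  // 120 (0x78) everywhere

// Reading bytes as an integer is where byte order matters, so it is pinned explicitly:
Console.WriteLine(BinaryPrimitives.ReadUInt64BigEndian(bytes));    // same value on every machine
Console.WriteLine(BinaryPrimitives.ReadUInt64LittleEndian(bytes)); // also same on every machine (a different value)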
Got it. Thanks for your explanation and patience 😄! The last worry I have is that if a library has already used
What I regret is that the proposed APIs and related discussions did not occur before .NET 6 (adding |
I think we're all in agreement on the basic premise: we should provide a nice API for getting integral hash codes. This is immediately useful for lots of people. The things we disagree on: how should we translate the output into an integer, and should the mechanism that we use for this dictate the chosen name? But after talking with Steve and Jeremy more offline I think we're actually close to agreement here as well. :) A quick note on terminology. These are all digest algorithms (a.k.a. hash algorithms). Digest algorithms are generally defined as "a variable number of bytes in" and "a fixed-length sequence of bytes out." This means that for
So if we expose
That second part is where the "endianness" concern comes into play. Since the source of truth is the digest bytes (it's a digest algorithm, after all!), we should document the exact transform we're using to go from one representation to the other. Of course, nothing requires us to have the ...
When Jeremy and I spoke on Monday, he mentioned that there was no need for different algorithms to use identical conversions to go from source of truth -> friendly representation. Some could use little, some could use big, some could do their own weird byte shuffle. The only important aspect is that it's documented. This makes me a bit uneasy because it means that consumers need to consult the docs for each subtype they might be working with, but I'm not willing to die on this hill as long as we have docs. |
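As a sketch of the "source of truth vs. friendly representation" split described above (the conversions shown are illustrative, not the library's documented choice): the same digest bytes yield different integers depending on which conversion a type picks, which is exactly why the conversion has to be spelled out per type.

using System;
using System.Buffers.Binary;
using System.IO.Hashing;

Span<byte> digest = stackalloc byte[8];
XxHash64.Hash("hello"u8, digest); // source of truth: the digest bytes

// Two equally valid "friendly" transforms; the chosen one must be documented:
ulong asBigEndian = BinaryPrimitives.ReadUInt64BigEndian(digest);
ulong asLittleEndian = BinaryPrimitives.ReadUInt64LittleEndian(digest);

Console.WriteLine(asBigEndian == asLittleEndian); // False in general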
Amusingly, one of the most fundamental methods of .NET is ... It has the same name as ... Maybe it's a huge mistake in framework design, but I have a hunch that if we designed .NET from scratch today we wouldn't make it ... |
@jods4 - @stephentoub - since |
Yes, it'll be .NET 7 only. |
LOL. I literally just wrote that code yesterday. |
namespace System.IO.Hashing;

public sealed partial class Crc32 : NonCryptographicHashAlgorithm
{
    public static uint HashToUInt32(ReadOnlySpan<byte> source);
    public uint GetCurrentHashAsUInt32();
}
public sealed partial class XxHash32 : NonCryptographicHashAlgorithm
{
    public static uint HashToUInt32(ReadOnlySpan<byte> source, int seed = 0);
    public uint GetCurrentHashAsUInt32();
}
public sealed partial class Crc64 : NonCryptographicHashAlgorithm
{
    public static ulong HashToUInt64(ReadOnlySpan<byte> source);
    public ulong GetCurrentHashAsUInt64();
}
public sealed partial class XxHash64 : NonCryptographicHashAlgorithm
{
    public static ulong HashToUInt64(ReadOnlySpan<byte> source, long seed = 0);
    public ulong GetCurrentHashAsUInt64();
}
public sealed partial class XxHash3 : NonCryptographicHashAlgorithm
{
    public static ulong HashToUInt64(ReadOnlySpan<byte> source, long seed = 0);
    public ulong GetCurrentHashAsUInt64();
}
public sealed partial class XxHash128 : NonCryptographicHashAlgorithm
{
    public static UInt128 HashToUInt128(ReadOnlySpan<byte> source, long seed = 0);
    public UInt128 GetCurrentHashAsUInt128();
} |
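A usage sketch for the approved shape (assuming a runtime and System.IO.Hashing package version where these members are available, e.g. .NET 7+):

using System;
using System.IO.Hashing;
using System.Text;

byte[] payload = Encoding.UTF8.GetBytes("hello world");

// One-shot: straight to the integral hash, no intermediate byte[] or Span<byte>.
uint crc = Crc32.HashToUInt32(payload);
ulong xx = XxHash64.HashToUInt64(payload, seed: 42);

// Streaming: append in chunks, then read the current hash as a number.
var hasher = new XxHash32();
hasher.Append(payload.AsSpan(0, 5));
hasher.Append(payload.AsSpan(5));
uint streamed = hasher.GetCurrentHashAsUInt32();

Console.WriteLine((crc, xx, streamed));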
The entry in |
Just to make sure, for XxHash128 the return type should be UInt128, not Int128:

public sealed partial class XxHash128 : NonCryptographicHashAlgorithm
{
-    public static Int128 HashToUInt128(ReadOnlySpan<byte> source, long seed = 0);
+    public static UInt128 HashToUInt128(ReadOnlySpan<byte> source, long seed = 0);
-    public Int128 GetCurrentHashAsUInt128();
+    public UInt128 GetCurrentHashAsUInt128();
} |
yup, UInt128, thanks |