Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Supporting comparisons using byte[] #35

Closed
RenderMichael opened this issue Aug 21, 2023 · 11 comments · Fixed by #36
Closed

Supporting comparisons using byte[] #35

RenderMichael opened this issue Aug 21, 2023 · 11 comments · Fixed by #36

Comments

@RenderMichael
Copy link

I want to use this library but I have data as byte[].

Would it be possible to add support for ReadOnlySpan<byte> as well as string? In fact, there's probably room for tons of optimizations if the algorithms take ReadOnlySpan<T> which would be ROS for string inputs

@paulirwin
Copy link
Member

Thanks for the suggestion! One problem off the top of my head is that byte does not equal char in .NET, as .NET strings are UTF-16 (two bytes/char) by default. So I don't think it's as simple as just making those methods generic. I'm assuming you've got a UTF-8 string as a byte[]? Are UTF-8 codepoints (where one "character" can be multiple bytes in that array) not an issue for you in terms of string similarity?

I'd say if you care about codepoints, the best option today is probably to use Encoding.UTF8.GetString(...) to convert the UTF-8 codepoints into a UTF-16 string first. If you don't care about codepoints, and i.e. your byte[] is straight ASCII or something, I could see a use case for that with the codepoint caveat (i.e. international language strings would break down quickly and could be incorrectly indicated as very dissimilar).

Adding support for comparing UTF-8 codepoints in byte arrays is not something that is in scope for this project at the moment unless the upstream java-string-similarity adds support for it, although we could keep it on the roadmap for the future in a modular way that allows for maintaining full upstream compatibility while adding additional methods of our own.

@RenderMichael
Copy link
Author

Thanks for the reply @paulirwin! Actually we have a system comparing two Streams for equality, within a given tolerance. so we pull out the byte[] from the Stream, and it would be optimal for us to compare those bytes directly. Currently we map to a char[] with Encoding.Latin1.GetString() because Latin1 has a 1-to-1 mapping between char <-> byte.

Your concern about UTF8-encoding is completely correct, and any ByteSimilarity implementation would have to make those caveats clear.

Another think I realized is the library targets .NET standard 2.0, which would make a lot of this challenging. .NET 7 introduced lots of support for generic math, which would make a byte implementation share almost all (maybe all?) code with a char version. I.E. both byte and char implement the interface IBinaryInteger<TSelf> where TSelf : IBinaryInteger<TSelf>.

That doesn't lend itself well to multi-targeting, since you would basically have two versions of the same code, at which point it might as well be 2 libraries.

@paulirwin
Copy link
Member

That makes sense, thank you. Perhaps we might start with just a single algorithm rather than doing it for all of them upfront. Do you have a specific similarity algorithm that you'd use that could be a good first test case for this?

@RenderMichael
Copy link
Author

We default to the Damerau algorithm, but switch to Levenshtein for the occasional huge input (if you think there's a better one to choose for our purposes please let me know! I'm no expert in the details here).

Thanks for considering this!

@paulirwin
Copy link
Member

@RenderMichael Can you please pull the issue/35 branch and see how that works for you? It's a rough draft of how I think this could work. As a side bonus, it adds support for i.e. ReadOnlySpan<char>. Certain algorithms like ShingleBased ones and those that need characters explicitly aren't supported by this change, but your use cases of Damerau and Levenshtein are included.

With this, I'm giving up a little bit of my strict adherence to upstream code, but it remains close enough I think to be able to keep in step with it when it changes.

paulirwin added a commit that referenced this issue Aug 29, 2023
@RenderMichael
Copy link
Author

@paulirwin I like it!

One thing though, is there any way the ReadOnlySpan<T> methods could constrain T : IEquatable<T>? I see there's s1 == s2 between two ROS's, which tests identity rather than the same contents.

The usual way to test ROS content equality is s1.SequenceEqual(s2) but that requires the above constraint. It would also optimize the T.Equals(object) method by automatically rerouting to T.Equals(T), which prevents boxing.

The previous implementation using string required an explicit conversion to a char[] but ROS<T> is automatically T[]-like, so in LongestCommonSubsequence the s1.ToCharArray() which was converted to s1.ToArray() are no longer necessary, and s1 can be used as is.

Thanks for the effort!

paulirwin added a commit that referenced this issue Aug 31, 2023
paulirwin added a commit that referenced this issue Aug 31, 2023
@paulirwin
Copy link
Member

Great catches all around, thank you. I think I've fixed all of them. Take a look at the branch or the PR for it and let me know what you think. #36

@RenderMichael
Copy link
Author

RenderMichael commented Oct 3, 2023

@paulirwin Sorry for the long delay, kids are kids.

The contents of the PR look really nice! I tested Damerau and Levenshtein, the two algorithms I've been using. And no performance regressions compared to the main branch. Goodness all around.

@RenderMichael
Copy link
Author

@paulirwin The work in the associated PR looks great, any chance it could be merged?

paulirwin added a commit that referenced this issue Nov 8, 2023
Resolves #35 for comparison using byte[], or any scenarios where i.e. ReadOnlySpan might be preferred.
@paulirwin
Copy link
Member

@RenderMichael This has been merged. Let me know if you have any issues with that. We'll publish a NuGet release soon. I might make this a major version bump to v6.0 due to the changed interfaces.

@paulirwin
Copy link
Member

@RenderMichael v6.0.0 has been released to NuGet. Please let me know if you have any issues. Thanks!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants