Hashing Irregularities #278
Comments
The current system encoding must be taken into account when generating a hash on a text string. To make the hash more consistent with what you would expect from other programs or online generators, we are converting the text from the current system encoding to a UTF-8 byte array before passing it to the crypto APIs. Fixes #278
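As a rough illustration of the approach described above (not the project's actual VBA code), the following Python sketch encodes the text as UTF-8 first and only then hashes the resulting bytes. The function name `sha256_of_text` and the choice of SHA-256 are assumptions made for the example.

```python
import hashlib


def sha256_of_text(text: str) -> str:
    """Hash a text string by first converting it to a UTF-8 byte array.

    Hypothetical helper for illustration; the name and the use of
    SHA-256 are assumptions, not taken from the project.
    """
    # The hash function operates on bytes, so the encoding step
    # decides what actually gets hashed.
    utf8_bytes = text.encode("utf-8")
    return hashlib.sha256(utf8_bytes).hexdigest()


if __name__ == "__main__":
    print(sha256_of_text("abc"))
```

Because most online generators also hash the UTF-8 bytes of the input, a helper like this is a convenient reference when comparing results.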
Thanks for pointing this out! I appreciate you taking the time to put together the pull request as well. After some review, I went ahead and used the …
Testing this out on my end; from initial tests it looks like it works.
This change seems to produce the most consistent results when comparing to online generators, especially when 4-byte Unicode characters are involved. #278
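To see why 4-byte characters are the sensitive case, here is a small Python sketch (an illustration only, with a hypothetical helper name `byte_views`) showing how one character outside the Basic Multilingual Plane becomes four UTF-8 bytes but a surrogate pair in UTF-16.

```python
def byte_views(text: str) -> dict:
    """Return the UTF-8 and UTF-16LE byte arrays for the same text.

    Hypothetical helper for illustration only.
    """
    return {
        "utf-8": text.encode("utf-8"),
        "utf-16-le": text.encode("utf-16-le"),
    }


if __name__ == "__main__":
    # U+1F600 lies outside the Basic Multilingual Plane.
    views = byte_views("\U0001F600")
    for name, raw in views.items():
        print(name, raw.hex(" "))
```

Any hash computed over the UTF-16 bytes will differ from one computed over the UTF-8 bytes, even though both represent the same character.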
Ok, looks good here! I did some checking, and weirdly, it looks like it's faster than the un-corrected string. I think it may be because you have twice as many bits otherwise. The … I think we should put this on the …
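The size difference behind that guess can be checked with a short Python sketch. It only compares byte counts and makes no claim about timing; the helper name `encoded_sizes` is hypothetical.

```python
def encoded_sizes(text: str) -> tuple:
    """Return (UTF-8 byte count, UTF-16LE byte count) for the text.

    Hypothetical helper for illustration only.
    """
    return len(text.encode("utf-8")), len(text.encode("utf-16-le"))


if __name__ == "__main__":
    utf8_len, utf16_len = encoded_sizes("hello")
    print(utf8_len, utf16_len)
```

For plain ASCII text the UTF-16 form is twice as long, so the hash function has twice as much input to process.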
Well, this turned out to be more complicated than I thought. Not all online generators take the same approach to hashing Unicode strings, so they sometimes produce different results. As one person pointed out, we are not hashing a string, we are hashing a byte array. The way that we create the byte array is what makes the difference. After unsuccessfully testing with trial and error using the … For the sake of testing, you can use … With the updated function, we get the same results as online. Some online generators, such as xorbin, do not hash the Unicode correctly. (Or more precisely, do not create the same UTF-8 byte array before hashing.)
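The point that the byte array, not the string, is what gets hashed can be sketched in Python as follows. This is an illustration under assumptions: the helper name `digests_by_encoding`, the use of SHA-256, and the list of encodings are all chosen for the example and are not taken from the project.

```python
import hashlib


def digests_by_encoding(text: str, encodings=("utf-8", "utf-16-le", "cp1252")) -> dict:
    """Hash the same text under several encodings.

    Hypothetical helper for illustration; each digest is computed over
    the byte array produced by that encoding.
    """
    return {
        enc: hashlib.sha256(text.encode(enc)).hexdigest()
        for enc in encodings
    }


if __name__ == "__main__":
    # One non-ASCII character is enough to make the byte arrays diverge.
    for enc, digest in digests_by_encoding("\u00e9").items():
        print(f"{enc:10s} {digest}")
```

For pure ASCII input, UTF-8 and a single-byte code page such as cp1252 produce identical bytes and therefore identical hashes, which may explain why the discrepancy only shows up with non-ASCII text.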
I literally just discovered this as I was going along! UGH!
Thanks for taking a look at this and getting back to me so quick!
I found some cases where what was hashed via GetStringHash did not have the same output as numerous online comparison tools. For indexing, I don't think it really matters, as the hash was consistently generated (within the use case).
I believe it has something to do with the way that strings were converted to Byte() and system encoding. I also ended up needing the actual number outputs for a legacy operation, so I've included that as well.
I put this in three files (in my own case, I just added all this to modHash, as I don't have encoding changes needed for me).
Let me know what you think.
Note that in this environment, the change won't make a big difference (we're using hashes for difference comparisons here, rather than comparing data integrity with other systems). But if someone finds our use of modHash and implements it, they will likely have issues when comparing hashes generated by this system with those from others.