
Hashing Irregularities #278

Closed

hecon5 opened this issue Nov 2, 2021 · 6 comments

Comments

@hecon5
Contributor

hecon5 commented Nov 2, 2021

I found some cases where the output of GetStringHash did not match the output of numerous online comparison tools.

For indexing, I don't think it really matters, as the hash was consistently generated (within the use case).

I believe it has something to do with the way strings were converted to Byte() arrays and with the system encoding. I also ended up needing the actual numeric outputs for a legacy operation, so I've included that as well.
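For context, here is a minimal sketch of the kind of conversion difference I mean (illustrative only, not the project's actual code): in VBA, assigning a String directly to a Byte() copies its internal UTF-16LE bytes, and StrConv with vbFromUnicode converts to the current system code page; neither matches the UTF-8 byte array that most online generators hash.

```vb
Public Sub DemoByteConversions()
    Dim bteUtf16() As Byte
    Dim bteAnsi() As Byte

    ' Direct assignment copies the string's internal UTF-16LE bytes:
    bteUtf16 = "abc"                        ' 6 bytes: 61 00 62 00 63 00

    ' StrConv converts to the current system code page (e.g. Windows-1252):
    bteAnsi = StrConv("abc", vbFromUnicode) ' 3 bytes: 61 62 63

    Debug.Print UBound(bteUtf16) - LBound(bteUtf16) + 1  ' -> 6
    Debug.Print UBound(bteAnsi) - LBound(bteAnsi) + 1    ' -> 3
End Sub
```

Hashing either of those arrays will disagree with a generator that hashes the UTF-8 bytes of non-ASCII text.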

I put this in three files (in my own case, I just added it all to modHash, since I don't need the encoding changes).

Let me know what you think.

Note that in this environment, the change won't make a big difference (we're using hashes for difference comparisons here, rather than comparing data integrity with other systems). But if someone finds our use of modHash and implements it, they will likely see mismatches between hashes generated by this system and by others.

joyfullservice added a commit that referenced this issue Nov 2, 2021
The current system encoding must be taken into account when generating a hash on a text string. To make the hash more consistent with what you would expect from other programs or online generators, we are converting the text from the current system encoding to a UTF-8 byte array before passing it to the crypto APIs. Fixes #278
joyfullservice added a commit that referenced this issue Nov 2, 2021
@joyfullservice
Owner

Thanks for pointing this out! I appreciate you taking the time to put together the pull request as well. After some review, I went ahead and used the ADODB.Stream object instead of the API calls to make the text conversion. The function now produces the expected hash output that matches the online generators. Looking at the byte arrays produced by the different string conversion functions makes the difference pretty obvious.
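For readers following along, the ADODB.Stream conversion looks roughly like this (a sketch with a hypothetical helper name, not the exact code from the commit):

```vb
' Sketch: convert a VBA string to a UTF-8 byte array via ADODB.Stream.
' (Utf8Bytes is an illustrative name, not the project's function.)
Public Function Utf8Bytes(ByVal strText As String) As Byte()
    Dim stm As Object
    Set stm = CreateObject("ADODB.Stream")
    With stm
        .Open
        .Type = 2               ' adTypeText
        .Charset = "utf-8"
        .WriteText strText
        .Position = 0           ' Type can only change at position 0
        .Type = 1               ' adTypeBinary
        Utf8Bytes = .Read       ' note: includes a 3-byte UTF-8 BOM
        .Close
    End With
End Function
```

(As it turns out later in this thread, that leading BOM matters.)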

Before

[Image: byte array produced by the original string conversion]

After

[Image: UTF-8 byte array produced by the updated conversion]

@hecon5
Contributor Author

hecon5 commented Nov 2, 2021

Testing this out on my end; initial tests indicate it works.

joyfullservice added a commit that referenced this issue Nov 2, 2021
This change seems to produce the most consistent results when comparing to online generators, especially when 4-byte Unicode characters are involved. #278
@hecon5
Contributor Author

hecon5 commented Nov 2, 2021

Ok, looks good here!

I did some checking, and oddly enough, it looks like it's faster than the uncorrected string; I think that may be because there are twice as many bytes to hash otherwise. The ADODB.Stream method is also ~0.2 seconds faster (0.98 s vs. 1.17 s) than the one I had, after 500 iterations of GetDictionaryHash. Each hash iteration is ~0.05 s vs. ~0.25 s (give or take) according to the performance tool you built.

I think we should put this on the master branch too, to avoid people finding an improperly implemented hashing tool.

@joyfullservice
Owner

Well, this turned out to be more complicated than I thought. Not all online generators take the same approach to hashing Unicode strings, so they sometimes produce different results. As one person pointed out, we are not hashing a string; we are hashing a byte array. The way we create the byte array is what makes the difference.

After some unsuccessful trial and error with the ADODB.Stream object, I temporarily plugged in the WideCharToMultiByte API call to see the actual byte array, so I could create a matching one using the Stream object. As it turned out, the main issue was that I needed to skip past the UTF-8 BOM in the byte array that the Stream object creates automatically. After doing that, the correct hash was returned for both Unicode and non-Unicode strings.
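In stream terms, the fix is just a position offset before reading. Revising the hypothetical Utf8Bytes sketch from earlier:

```vb
' Revised sketch: skip the 3-byte UTF-8 BOM (EF BB BF) that
' ADODB.Stream writes at the start of the buffer.
' (Utf8Bytes is still an illustrative name, not the project's function.)
Public Function Utf8Bytes(ByVal strText As String) As Byte()
    Dim stm As Object
    Set stm = CreateObject("ADODB.Stream")
    With stm
        .Open
        .Type = 2               ' adTypeText
        .Charset = "utf-8"
        .WriteText strText
        .Position = 0           ' Type can only change at position 0
        .Type = 1               ' adTypeBinary
        .Position = 3           ' skip past the UTF-8 BOM
        Utf8Bytes = .Read       ' BOM-free UTF-8 bytes for the crypto APIs
        .Close
    End With
End Function
```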

For the sake of testing, you can use ChrW(55356) & ChrW(57102), which creates the Unicode character 🌎 (U+1F30E). The byte array that we need to hash is pictured here:

[Image: the UTF-8 byte array to hash for the test string]

With the updated function, we get the same results as online.
c4b4332a65af850511cb1c8599a141b1f487bb1c9e3d021c148ea6a42652897b

Some online generators, such as xorbin, do not hash the Unicode correctly (or, more precisely, do not create the same UTF-8 byte array before hashing).
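For a quick sanity check of the byte array itself in the Immediate window (using the hypothetical Utf8Bytes sketch above; the UTF-8 encoding of U+1F30E is the four bytes F0 9F 8C 8E):

```vb
Public Sub CheckEmojiBytes()
    Dim bte() As Byte
    Dim i As Long
    ' Build U+1F30E from its surrogate pair and convert to UTF-8:
    bte = Utf8Bytes(ChrW(55356) & ChrW(57102))
    For i = LBound(bte) To UBound(bte)
        Debug.Print Hex$(bte(i)); " ";
    Next i
    Debug.Print
    ' Expected output: F0 9F 8C 8E
End Sub
```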

@hecon5
Contributor Author

hecon5 commented Nov 2, 2021

I literally just discovered this as I was going along! UGH!

@hecon5
Contributor Author

hecon5 commented Nov 2, 2021

Thanks for taking a look at this and getting back to me so quickly!

joyfullservice added a commit that referenced this issue Nov 9, 2021
The current system encoding must be taken into account when generating a hash on a text string. To make the hash more consistent with what you would expect from other programs or online generators, we are converting the text from the current system encoding to a UTF-8 byte array before passing it to the crypto APIs. Fixes #278
joyfullservice added a commit that referenced this issue Nov 9, 2021
joyfullservice added a commit that referenced this issue Nov 9, 2021
This change seems to produce the most consistent results when comparing to online generators, especially when 4-byte Unicode characters are involved. #278