
Hashing Irregularities #278

Closed

hecon5 opened this issue Nov 2, 2021 · 6 comments

Comments

@hecon5
Contributor

hecon5 commented Nov 2, 2021

I found some cases where the output of GetStringHash did not match the output of numerous online comparison tools.

For indexing, I don't think it really matters, as the hash was consistently generated (within the use case).

I believe it has something to do with the way strings were converted to Byte() arrays and with the system encoding. I also ended up needing the actual numeric outputs for a legacy operation, so I've included that as well.
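For context, here is a minimal sketch of the kind of conversion difference I mean (illustrative only, not the project's actual code): in VBA, assigning a String directly to a Byte() copies its internal UTF-16LE bytes, and StrConv with vbFromUnicode converts to the current system code page; neither matches the UTF-8 byte array that most online generators hash.

```vb
Public Sub DemoByteConversions()
    Dim bteUtf16() As Byte
    Dim bteAnsi() As Byte

    ' Direct assignment copies the string's internal UTF-16LE bytes:
    bteUtf16 = "abc"                        ' 6 bytes: 61 00 62 00 63 00

    ' StrConv converts to the current system code page (e.g. Windows-1252):
    bteAnsi = StrConv("abc", vbFromUnicode) ' 3 bytes: 61 62 63

    Debug.Print UBound(bteUtf16) - LBound(bteUtf16) + 1  ' -> 6
    Debug.Print UBound(bteAnsi) - LBound(bteAnsi) + 1    ' -> 3
End Sub
```

Hashing either of those arrays will disagree with a generator that hashes the UTF-8 bytes of non-ASCII text.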

I put this in three files (in my own case, I just added it all to modHash, since I don't need the encoding changes).

Let me know what you think.

Note that in this environment, the change won't make a big difference (we're using hashes for difference comparisons here, rather than comparing data integrity with other systems). But if someone finds our use of modHash and implements it, they will likely see mismatches between hashes generated by this system and by others.

joyfullservice added a commit that referenced this issue Nov 2, 2021
The current system encoding must be taken into account when generating a hash on a text string. To make the hash more consistent with what you would expect from other programs or online generators, we are converting the text from the current system encoding to a UTF-8 byte array before passing it to the crypto APIs. Fixes #278
joyfullservice added a commit that referenced this issue Nov 2, 2021
@joyfullservice
Owner

Thanks for pointing this out! I appreciate you taking the time to put together the pull request as well. After some review, I went ahead and used the ADODB.Stream object instead of the API calls to make the text conversion. The function now produces the expected hash output that matches the online generators. Looking at the byte arrays produced by the different string conversion functions makes the difference pretty obvious.
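For readers following along, the ADODB.Stream conversion looks roughly like this (a sketch with a hypothetical helper name, not the exact code from the commit):

```vb
' Sketch: convert a VBA string to a UTF-8 byte array via ADODB.Stream.
' (Utf8Bytes is an illustrative name, not the project's function.)
Public Function Utf8Bytes(ByVal strText As String) As Byte()
    Dim stm As Object
    Set stm = CreateObject("ADODB.Stream")
    With stm
        .Open
        .Type = 2               ' adTypeText
        .Charset = "utf-8"
        .WriteText strText
        .Position = 0           ' Type can only change at position 0
        .Type = 1               ' adTypeBinary
        Utf8Bytes = .Read       ' note: includes a 3-byte UTF-8 BOM
        .Close
    End With
End Function
```

(As it turns out later in this thread, that leading BOM matters.)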

Before

[Image: byte array produced by the original string conversion]

After

[Image: UTF-8 byte array produced by the updated conversion]

@hecon5
Contributor Author

hecon5 commented Nov 2, 2021

Testing this out on my end; initial tests indicate it works.

joyfullservice added a commit that referenced this issue Nov 2, 2021
This change seems to produce the most consistent results when comparing to online generators, especially when 4-byte Unicode characters are involved. #278
@hecon5
Contributor Author

hecon5 commented Nov 2, 2021

Ok, looks good here!

I did some checking, and oddly enough, it looks like it's faster than the uncorrected string; I think that may be because there are twice as many bytes to hash otherwise. The ADODB.Stream method is also ~0.2 seconds faster (0.98 s vs. 1.17 s) than the one I had, after 500 iterations of GetDictionaryHash. Each hash iteration is ~0.05 s vs. ~0.25 s (give or take) according to the performance tool you built.

I think we should put this on the master branch too, to avoid people finding an improperly implemented hashing tool.

@joyfullservice
Owner

Well, this turned out to be more complicated than I thought. Not all online generators take the same approach to hashing Unicode strings, so they sometimes produce different results. As one person pointed out, we are not hashing a string; we are hashing a byte array. The way we create the byte array is what makes the difference.

After some unsuccessful trial and error with the ADODB.Stream object, I temporarily plugged in the WideCharToMultiByte API call to see the actual byte array, so I could create a matching one using the Stream object. As it turned out, the main issue was that I needed to skip past the UTF-8 BOM in the byte array that the Stream object creates automatically. After doing that, the correct hash was returned for both Unicode and non-Unicode strings.
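In stream terms, the fix is just a position offset before reading. Revising the hypothetical Utf8Bytes sketch from earlier:

```vb
' Revised sketch: skip the 3-byte UTF-8 BOM (EF BB BF) that
' ADODB.Stream writes at the start of the buffer.
' (Utf8Bytes is still an illustrative name, not the project's function.)
Public Function Utf8Bytes(ByVal strText As String) As Byte()
    Dim stm As Object
    Set stm = CreateObject("ADODB.Stream")
    With stm
        .Open
        .Type = 2               ' adTypeText
        .Charset = "utf-8"
        .WriteText strText
        .Position = 0           ' Type can only change at position 0
        .Type = 1               ' adTypeBinary
        .Position = 3           ' skip past the UTF-8 BOM
        Utf8Bytes = .Read       ' BOM-free UTF-8 bytes for the crypto APIs
        .Close
    End With
End Function
```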

For the sake of testing, you can use ChrW(55356) & ChrW(57102), which creates the Unicode character 🌎 (U+1F30E). The byte array that we need to hash is pictured here:

[Image: the UTF-8 byte array to hash for the test string]

With the updated function, we get the same results as online.
c4b4332a65af850511cb1c8599a141b1f487bb1c9e3d021c148ea6a42652897b

Some online generators, such as xorbin, do not hash the Unicode correctly (or, more precisely, do not create the same UTF-8 byte array before hashing).
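For a quick sanity check of the byte array itself in the Immediate window (using the hypothetical Utf8Bytes sketch above; the UTF-8 encoding of U+1F30E is the four bytes F0 9F 8C 8E):

```vb
Public Sub CheckEmojiBytes()
    Dim bte() As Byte
    Dim i As Long
    ' Build U+1F30E from its surrogate pair and convert to UTF-8:
    bte = Utf8Bytes(ChrW(55356) & ChrW(57102))
    For i = LBound(bte) To UBound(bte)
        Debug.Print Hex$(bte(i)); " ";
    Next i
    Debug.Print
    ' Expected output: F0 9F 8C 8E
End Sub
```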

@hecon5
Contributor Author

hecon5 commented Nov 2, 2021

I literally just discovered this as I was going along! UGH!

@hecon5
Contributor Author

hecon5 commented Nov 2, 2021

Thanks for taking a look at this and getting back to me so quickly!

joyfullservice added a commit that referenced this issue Nov 9, 2021
The current system encoding must be taken into account when generating a hash on a text string. To make the hash more consistent with what you would expect from other programs or online generators, we are converting the text from the current system encoding to a UTF-8 byte array before passing it to the crypto APIs. Fixes #278
joyfullservice added a commit that referenced this issue Nov 9, 2021
joyfullservice added a commit that referenced this issue Nov 9, 2021
This change seems to produce the most consistent results when comparing to online generators, especially when 4-byte Unicode characters are involved. #278