[browser] HybridGlobalization correct HashCode ranges of skipped unicodes #97351
Tagging subscribers to 'arch-wasm': @lewing
Issue Details
Background: Reason for this PR: Skipped codes by UnicodeCategory:
We could skip full categories, producing more collisions. However:
Performance changes:
src/libraries/System.Private.CoreLib/src/System/Globalization/CompareInfo.WebAssembly.cs
for(int codePoint = 0; codePoint < 0x10FFFF; codePoint++)
{
    char character = (char)codePoint;
    string str2 = $"a{character}b";
Just a silly idea: is it possible that code points are skipped only when they come before or after another specific code point?
I would expect only surrogates to work this way. This cast might not work for surrogates, though (they are 2 chars, not one). I need to check it, thanks
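To illustrate the concern about the cast, here is a minimal sketch, not code from the PR. The helper name TryCodePointToString is hypothetical. It shows that casting an int code point above U+FFFF to char keeps only the low 16 bits, whereas char.ConvertFromUtf32 produces the surrogate pair. Lone surrogates are skipped because they are not valid scalar values on their own.

```csharp
using System;

// Hypothetical helper, not taken from the PR: builds the string for one code point.
// Returns false for values outside the Unicode range and for lone surrogates
// (U+D800..U+DFFF), which char.ConvertFromUtf32 rejects with an exception.
static bool TryCodePointToString(int codePoint, out string text)
{
    if (codePoint < 0 || codePoint > 0x10FFFF || (codePoint >= 0xD800 && codePoint <= 0xDFFF))
    {
        text = string.Empty;
        return false;
    }
    text = char.ConvertFromUtf32(codePoint); // one char in the BMP, a surrogate pair above it
    return true;
}

int supplementary = 0x1F600;                     // a code point above U+FFFF
char truncated = unchecked((char)supplementary); // keeps only the low 16 bits: U+F600
Console.WriteLine(((int)truncated).ToString("X4")); // prints F600

if (TryCodePointToString(supplementary, out string pair))
    Console.WriteLine(pair.Length);              // prints 2
```

A loop that is meant to cover every code point would also need to run up to and including 0x10FFFF.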
// Hybrid has Equal function from JS and hashing from managed invariant algorithm, they might start diverging at some point
[ConditionalTheory(typeof(PlatformDetection), nameof(PlatformDetection.IsHybridGlobalizationOnBrowser))]
[MemberData(nameof(CharsIgnoredByEqualFunction))]
public void CheckHashingOfSkippedChars(int hashCode1, string str2, CompareInfo cmpInfo, CompareOptions options)
I think it would be good to also have another test like
foreach locale
    foreach codepoint
        var s1 = $"A{codepoint}B"
        var s2 = $"AB"
        var h1 = locale.getHash(s1)
        var h2 = locale.getHash(s2)
        if (locale.equals(s1, s2))
            assert(h1 == h2)
        else
            // We know that the hash collisions are OK, when they are rare. So this should fail in very small % of cases.
            assert(h1 != h2)
After we learned how bad the hash collisions are, we could comment out assert(h1 != h2)
or add a few known collisions as exceptions to the rule.
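The pseudocode above could be sketched roughly as follows. This is only an illustration under assumptions: CheckRange is a hypothetical name, it uses the public CompareInfo.Compare and CompareInfo.GetHashCode(string, CompareOptions) APIs, and when run on desktop .NET it exercises ICU or NLS rather than the hybrid JS path. Instead of asserting on collisions it counts them, so the collision rate can be inspected first.

```csharp
using System;
using System.Globalization;

// Hypothetical harness, not part of the PR. For every code point in [first, last]
// it compares "A{cp}B" with "AB" and counts:
//   Violations: strings compare equal but hash differently (must never happen)
//   Collisions: strings differ but hash the same (tolerated when rare)
static (int Violations, int Collisions, int Checked) CheckRange(
    CompareInfo cmp, CompareOptions options, int first, int last)
{
    int violations = 0, collisions = 0, checkedCount = 0;
    const string s2 = "AB";
    int h2 = cmp.GetHashCode(s2, options);

    for (int cp = first; cp <= last; cp++)
    {
        if (cp >= 0xD800 && cp <= 0xDFFF)
            continue; // lone surrogates are not valid scalar values

        string s1 = "A" + char.ConvertFromUtf32(cp) + "B";
        int h1 = cmp.GetHashCode(s1, options);
        bool equal = cmp.Compare(s1, s2, options) == 0;

        if (equal && h1 != h2) violations++;
        else if (!equal && h1 == h2) collisions++;
        checkedCount++;
    }
    return (violations, collisions, checkedCount);
}

// Small demo range (printable ASCII); a real test would cover 0..0x10FFFF per locale.
var result = CheckRange(CultureInfo.InvariantCulture.CompareInfo, CompareOptions.None, 0x20, 0x7E);
Console.WriteLine($"checked={result.Checked} violations={result.Violations} collisions={result.Collisions}");
```

Counting first and asserting later matches the suggestion to learn how bad the collisions are before deciding whether assert(h1 != h2) stays in the test.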
Draft Pull Request was automatically closed for 30 days of inactivity. Please let us know if you'd like to reopen it.
@ilonatommy we know that the code in the main branch is wrong. Could you please finish this or open an issue describing the problem?
Background:
In #96354 we introduced a mechanism for calculating HashCodes for invariant culture and non-invariant culture with CompareOptions.None and CompareOptions.IgnoreCase. To keep the invariant HashCode function in line with the JS equality function, localeCompare, we skip some Unicode code points. The ranges used in the original PR were collected using a ConsoleApp on Windows, which turned out not to be a correct approach: they were NLS-based ranges. For the browser (the list for V8-based browsers is the same as for Firefox) the list of skipped code points is shorter (1826 instead of ~16k).
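The idea of skipping code points before hashing can be sketched as below. This is a simplified illustration and not the PR's code: the two ranges are placeholders picked for the example (not the 1826-entry list), the function names are hypothetical, and an FNV-1a-style mix stands in for the managed invariant hashing algorithm.

```csharp
using System;

// Binary search over sorted, non-overlapping inclusive ranges of skipped code points.
static bool IsSkipped((int Start, int End)[] ranges, int codePoint)
{
    int lo = 0, hi = ranges.Length - 1;
    while (lo <= hi)
    {
        int mid = (lo + hi) / 2;
        if (codePoint < ranges[mid].Start) hi = mid - 1;
        else if (codePoint > ranges[mid].End) lo = mid + 1;
        else return true;
    }
    return false;
}

// Hashes a string while ignoring skipped code points, so that strings that differ
// only by skipped code points end up with the same hash.
// The FNV-1a-style mixing is a stand-in, not the runtime's algorithm.
static int HashIgnoringSkipped((int Start, int End)[] ranges, string s)
{
    unchecked
    {
        uint hash = 2166136261;
        foreach (var rune in s.EnumerateRunes())
        {
            if (IsSkipped(ranges, rune.Value))
                continue;
            hash = (hash ^ (uint)rune.Value) * 16777619;
        }
        return (int)hash;
    }
}

// Placeholder ranges for illustration only: soft hyphen and a few format characters.
var ranges = new (int Start, int End)[] { (0x00AD, 0x00AD), (0x200B, 0x200F) };
Console.WriteLine(HashIgnoringSkipped(ranges, "a\u00ADb") == HashIgnoringSkipped(ranges, "ab")); // prints True
```

If the table skips more code points than the comparison really ignores, hashes stay consistent for equal strings but collide more often; if it skips fewer, equal strings can get different hashes, which is the correctness problem.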
Reason for this PR:
The bigger range does not include the whole corrected range.
Skipped codes by UnicodeCategory:
Performance changes:
ToDo:
hashCodes for two equal strings, one of which has the char appended. If they are not "skippable" they might, but do not have to, produce different hashCodes (this PR over-skips).
Answers to possible questions:
Q: Why don't we do one loop only, for skipping and changing case, so that IgnoreCase is not slower than None?
A: ToUpper is localized, so we make a call to JS. If we called it on each char of the string, we would make n round trips to JS and back. This way we send the full string only once.
Q: Then why don't we move the hashing logic to JS?
A: For that we would need to re-implement the algorithm that is already available in C#. It would be code duplication. What's more, all the hashing with the None option would need to call into JS, causing an even bigger delay.
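The first answer (one call to JS per string rather than one per char) can be illustrated with a small sketch. Everything here is an assumption made for the example: the delegate jsToUpper merely stands in for the real interop call and uses ToUpperInvariant, and the counter only makes the number of boundary crossings visible.

```csharp
using System;
using System.Text;

int interopCalls = 0;

// Hypothetical stand-in for the localized ToUpper call into JS; each invocation
// represents one managed-to-JS round trip.
Func<string, string> jsToUpper = s => { interopCalls++; return s.ToUpperInvariant(); };

// One round trip per char: n calls for a string of length n.
static string UpperPerChar(string s, Func<string, string> toUpper)
{
    var sb = new StringBuilder(s.Length);
    foreach (char c in s)
        sb.Append(toUpper(c.ToString()));
    return sb.ToString();
}

// One round trip for the whole string, regardless of its length.
static string UpperBatched(string s, Func<string, string> toUpper) => toUpper(s);

UpperPerChar("hello", jsToUpper);
Console.WriteLine(interopCalls); // prints 5

interopCalls = 0;
UpperBatched("hello", jsToUpper);
Console.WriteLine(interopCalls); // prints 1
```

Batching means the case change happens in a separate pass from the skipping loop, which is why IgnoreCase cannot simply share the single loop used for None.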