[browser][non-icu] HybridGlobalization compare #84249

Merged
merged 10 commits into from
Apr 7, 2023
154 changes: 154 additions & 0 deletions docs/design/features/hybrid-globalization.md
@@ -18,3 +18,157 @@ Affected public APIs:
- TextInfo.ToTitleCase.

Case change with invariant culture uses `toUpperCase` / `toLowerCase` functions that do not guarantee a full match with the original invariant culture.
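
One reason a full match cannot be guaranteed is that JavaScript's `toUpperCase` applies full Unicode case mappings, so a single character may expand into several. The snippet below is only an illustration of the JavaScript side; it makes no claim about what a particular .NET invariant-culture call returns for the same input.

```javascript
// Full case mapping in JavaScript: the result can be longer than the input.
const sharpS = "\u00DF";             // ß
const upper = sharpS.toUpperCase();  // "SS"
console.log(upper, upper.length);    // SS 2

// The round trip does not restore the original string.
console.log(upper.toLowerCase() === sharpS); // false
```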

**String comparison**

Affected public APIs:
- CompareInfo.Compare,
- String.Compare,
- String.Equals.

The number of `CompareOptions` and `StringComparison` combinations is limited. Originally supported combinations can be found [here for CompareOptions](https://learn.microsoft.com/dotnet/api/system.globalization.compareoptions) and [here for StringComparison](https://learn.microsoft.com/dotnet/api/system.stringcomparison).

- `IgnoreWidth` is not supported because there is no equivalent in Web API. Throws `PlatformNotSupportedException`.
``` JS
let high = String.fromCharCode(65281) // %uff83 = ﾃ
let low = String.fromCharCode(12486) // %u30c6 = テ
high.localeCompare(low, "ja-JP", { sensitivity: "case" }) // -1 ; case: a ≠ b, a = á, a ≠ A; expected: 0

let wide = String.fromCharCode(65345) // %uFF41 = ａ
let narrow = "a"
wide.localeCompare(narrow, "en-US", { sensitivity: "accent" }) // 0; accent: a ≠ b, a ≠ á, a = A; expected: -1
```

When `sensitivity: "accent"` is used for a comparison, some character-width differences are always ignored and this cannot be switched off (see the point about `IgnoreCase` below).
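
To make the `sensitivity` values used throughout this section concrete, here is a minimal sketch that compares a few pairs under each level. `equalUnder` is a hypothetical helper introduced only for this illustration, and the width row is printed without an expected value because its result depends on the engine's collation data.

```javascript
// Returns true when the collator treats the two strings as equal.
// `equalUnder` is a hypothetical helper, not part of any API.
function equalUnder(sensitivity, a, b, locale = "en-US") {
  return new Intl.Collator(locale, { sensitivity }).compare(a, b) === 0;
}

for (const s of ["base", "accent", "case", "variant"]) {
  console.log(s, {
    caseOnly: equalUnder(s, "a", "A"),        // differs only by case
    accentOnly: equalUnder(s, "a", "á"),      // differs only by accent
    letters: equalUnder(s, "a", "b"),         // different base letters
    widthOnly: equalUnder(s, "a", "\uFF41"),  // differs only by width
  });
}
```

Case-only differences compare equal under "base" and "accent", accent-only differences compare equal under "base" and "case", and different base letters never compare equal.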

- `IgnoreKanaType`:

It is always switched on for comparison with locale "ja-JP", even if this comparison option was not set explicitly.

``` JS
let hiragana = String.fromCharCode(12353) // %u3041 = ぁ
let katakana = String.fromCharCode(12449) // %u30A1 = ァ
let enCmp = hiragana.localeCompare(katakana, "en-US") // -1
let jaCmp = hiragana.localeCompare(katakana, "ja-JP") // 0
```

For locales other than "ja-JP", it cannot be used on its own (no equivalent in Web API) and throws `PlatformNotSupportedException`.

- `None`:

No equivalent in Web API for "ja-JP" locale. See previous point about `IgnoreKanaType`. For "ja-JP" it throws `PlatformNotSupportedException`.

- `IgnoreCase`, `CurrentCultureIgnoreCase`, `InvariantCultureIgnoreCase`

For `IgnoreCase | IgnoreKanaType`, argument `sensitivity: "accent"` is used.

``` JS
let hiraganaBig = `${String.fromCharCode(12353)} A` // %u3041 = ぁ
let katakanaSmall = `${String.fromCharCode(12449)} a` // %u30A1 = ァ
hiraganaBig.localeCompare(katakanaSmall, "en-US", { sensitivity: "accent" }) // 0; accent: a ≠ b, a ≠ á, a = A
```

Known exceptions:

| **character 1** | **character 2** | **CompareOptions** | **hybrid globalization** | **icu** | **comments** |
|:---------------:|:---------------:|--------------------|:------------------------:|:-------:|:-------------------------------------------------------:|
| a | `\uFF41` ａ | IgnoreKanaType | 0 | -1 | applies to all wide-narrow chars |
| `\u30DC` ボ | `\uFF8E` ﾎ | IgnoreCase | 1 | -1 | 1 is returned in icu when we additionally ignore width |
| `\u30BF` タ | `\uFF80` ﾀ | IgnoreCase | 0 | -1 | |


For `IgnoreCase` alone, both strings are first converted to the same case and then compared with the default option, `sensitivity: "variant"`.

``` JS
let hiraganaBig = `${String.fromCharCode(12353)} A` // %u3041 = ぁ
let katakanaSmall = `${String.fromCharCode(12449)} a` // %u30A1 = ァ
let unchangedLocale = "en-US"
let unchangedStr1 = hiraganaBig.toLocaleLowerCase(unchangedLocale);
let unchangedStr2 = katakanaSmall.toLocaleLowerCase(unchangedLocale);
unchangedStr1.localeCompare(unchangedStr2, unchangedLocale) // -1;
let changedLocale = "ja-JP"
let changedStr1 = hiraganaBig.toLocaleLowerCase(changedLocale);
let changedStr2 = katakanaSmall.toLocaleLowerCase(changedLocale);
changedStr1.localeCompare(changedStr2, changedLocale) // 0;
```

For this reason, with locale `ja-JP` the `CompareOptions` value `IgnoreCase` and the `StringComparison` values `CurrentCultureIgnoreCase` and `InvariantCultureIgnoreCase` behave like the combination `IgnoreCase | IgnoreKanaType` (see the previous point about `IgnoreKanaType`). For other locales the behavior is unchanged, with the following known exceptions:

| **character 1** | **character 2** | **CompareOptions** | **hybrid globalization** | **icu** |
|:------------------------------------------------:|:----------------------------------------------------------:|-----------------------------------|:------------------------:|:-------:|
| `\uFF9E` (HALFWIDTH KATAKANA VOICED SOUND MARK) | `\u3099` (COMBINING KATAKANA-HIRAGANA VOICED SOUND MARK) | None / IgnoreCase / IgnoreSymbols | 1 | 0 |
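
The "unify case, then compare" approach described for `IgnoreCase` alone can be sketched as follows. `compareIgnoreCase` is a hypothetical name used for illustration; this is a sketch of the technique, not the runtime's actual implementation.

```javascript
// Hypothetical helper: emulate IgnoreCase by lower-casing both strings
// with the locale and then comparing with the default "variant" sensitivity.
function compareIgnoreCase(a, b, locale) {
  const left = a.toLocaleLowerCase(locale);
  const right = b.toLocaleLowerCase(locale);
  return left.localeCompare(right, locale, { sensitivity: "variant" });
}

console.log(compareIgnoreCase("Hello", "hELLO", "en-US"));   // 0
console.log(compareIgnoreCase("á", "A", "en-US") === 0);     // false, accents still count
```

Because the comparison itself stays at "variant" sensitivity, only case differences are removed; accent differences are still reported.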

- `IgnoreNonSpace`

`IgnoreNonSpace` cannot be used without `IgnoreKanaType`. The argument `sensitivity: "case"` is used for the comparison, and it ignores both kinds of difference (nonspacing marks and kana type). The option `IgnoreNonSpace` alone throws `PlatformNotSupportedException`.

``` JS
let hiraganaAccent = `${String.fromCharCode(12353)} á` // %u3041 = ぁ
let katakanaNoAccent = `${String.fromCharCode(12449)} a` // %u30A1 = ァ
hiraganaAccent.localeCompare(katakanaNoAccent, "en-US", { sensitivity: "case" }) // 0; case: a ≠ b, a = á, a ≠ A
```

- `IgnoreNonSpace | IgnoreCase`

The combination of `IgnoreNonSpace` and `IgnoreCase` cannot be used without `IgnoreKanaType`. The argument `sensitivity: "base"` is used for the comparison, and it ignores all three kinds of difference (nonspacing marks, case and kana type). The combination `IgnoreNonSpace | IgnoreCase` alone throws `PlatformNotSupportedException`.

``` JS
let hiraganaBigAccent = `${String.fromCharCode(12353)} A á` // %u3041 = ぁ
let katakanaSmallNoAccent = `${String.fromCharCode(12449)} a a` // %u30A1 = ァ
hiraganaBigAccent.localeCompare(katakanaSmallNoAccent, "en-US", { sensitivity: "base" }) // 0; base: a ≠ b, a = á, a = A
```
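
The three `sensitivity` mappings described in the points above can be collected in one small lookup. This is a sketch under stated assumptions: `sensitivityFor` and `compareWith` are hypothetical names, the numeric flag values mirror the .NET `CompareOptions` enum, and any combination not explicitly described above is simply rejected by the sketch.

```javascript
// Numeric values mirror the .NET CompareOptions flags.
const CompareOptions = Object.freeze({
  IgnoreCase: 1, IgnoreNonSpace: 2, IgnoreKanaType: 8, IgnoreWidth: 16,
});

// Hypothetical helper covering only the three combinations described above.
function sensitivityFor(options) {
  const { IgnoreCase, IgnoreNonSpace, IgnoreKanaType, IgnoreWidth } = CompareOptions;
  if (options & IgnoreWidth) {
    throw new Error("PlatformNotSupported: IgnoreWidth has no Web API equivalent");
  }
  switch (options) {
    case IgnoreCase | IgnoreKanaType: return "accent";
    case IgnoreNonSpace | IgnoreKanaType: return "case";
    case IgnoreNonSpace | IgnoreCase | IgnoreKanaType: return "base";
    default:
      throw new Error(`combination ${options} is outside this sketch`);
  }
}

// Hypothetical wrapper that runs the comparison with the mapped sensitivity.
function compareWith(a, b, locale, options) {
  return a.localeCompare(b, locale, { sensitivity: sensitivityFor(options) });
}

console.log(sensitivityFor(CompareOptions.IgnoreCase | CompareOptions.IgnoreKanaType)); // accent
```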

- `IgnoreSymbols`

The subset of ignored symbols is limited to the symbols ignored by `string1.localeCompare(string2, locale, { ignorePunctuation: true })`. For example, currency symbols and `&` are not ignored.

``` JS
let hiraganaAccent = `${String.fromCharCode(12353)} á` // %u3041 = ぁ
let katakanaNoAccent = `${String.fromCharCode(12449)} a` // %u30A1 = ァ
hiraganaAccent.localeCompare(katakanaNoAccent, "en-US", { sensitivity: "base" }) // 0; base: a ≠ b, a = á, a = A
```
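
A minimal sketch of the `ignorePunctuation` option mentioned above. `compareIgnoringPunctuation` is a hypothetical helper name; the last two calls are printed without an expected value, since which symbols are skipped is decided by the engine's collation data.

```javascript
// Hypothetical helper: emulate IgnoreSymbols with the collator's
// ignorePunctuation option.
function compareIgnoringPunctuation(a, b, locale) {
  return a.localeCompare(b, locale, { ignorePunctuation: true });
}

console.log(compareIgnoringPunctuation("co-op", "coop", "en-US")); // 0

// Per the text above, currency symbols and "&" are not expected to be skipped.
console.log(compareIgnoringPunctuation("$5", "5", "en-US"));
console.log(compareIgnoringPunctuation("a&b", "ab", "en-US"));
```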

- List of all `CompareOptions` combinations always throwing `PlatformNotSupportedException`:
Member

In the future, could we use attributes or a compile time analyzer to make builds fail when the user's code attempts to use these combinations? (Obviously not in this PR)

Member Author

@ilonatommy ilonatommy Apr 5, 2023

I think we could but I am not sure if I know how to do it. Do we have a similar mechanism somewhere already?

Member

I'm not sure, maybe ask in the group chat or channel. I haven't touched any of our generators/analyzers but I know we have a couple like the JSImport/Export generator.


`IgnoreCase`,

`IgnoreNonSpace`,

`IgnoreNonSpace | IgnoreCase`,

`IgnoreSymbols | IgnoreCase`,

`IgnoreSymbols | IgnoreNonSpace`,

`IgnoreSymbols | IgnoreNonSpace | IgnoreCase`,

`IgnoreWidth`,

`IgnoreWidth | IgnoreCase`,

`IgnoreWidth | IgnoreNonSpace`,

`IgnoreWidth | IgnoreNonSpace | IgnoreCase`,

`IgnoreWidth | IgnoreSymbols`,

`IgnoreWidth | IgnoreSymbols | IgnoreCase`,

`IgnoreWidth | IgnoreSymbols | IgnoreNonSpace`,

`IgnoreWidth | IgnoreSymbols | IgnoreNonSpace | IgnoreCase`,

`IgnoreKanaType | IgnoreWidth`,

`IgnoreKanaType | IgnoreWidth | IgnoreCase`,

`IgnoreKanaType | IgnoreWidth | IgnoreNonSpace`,

`IgnoreKanaType | IgnoreWidth | IgnoreNonSpace | IgnoreCase`,

`IgnoreKanaType | IgnoreWidth | IgnoreSymbols`,

`IgnoreKanaType | IgnoreWidth | IgnoreSymbols | IgnoreCase`,

`IgnoreKanaType | IgnoreWidth | IgnoreSymbols | IgnoreNonSpace`,

`IgnoreKanaType | IgnoreWidth | IgnoreSymbols | IgnoreNonSpace | IgnoreCase`
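
The list above follows a compact pattern: every combination containing `IgnoreWidth` is listed, and among the rest only those that contain `IgnoreCase` or `IgnoreNonSpace` without `IgnoreKanaType`. The sketch below encodes that observation; `alwaysThrows` is a hypothetical predicate, the flag values mirror the .NET `CompareOptions` enum, and it covers only the "always throwing" list, not the locale-dependent cases described earlier.

```javascript
// Numeric values mirror the .NET CompareOptions flags.
const Flags = Object.freeze({
  IgnoreCase: 1, IgnoreNonSpace: 2, IgnoreSymbols: 4, IgnoreKanaType: 8, IgnoreWidth: 16,
});

// Hypothetical predicate reproducing the list above.
function alwaysThrows(options) {
  if (options & Flags.IgnoreWidth) return true;      // every IgnoreWidth combination is listed
  if (options & Flags.IgnoreKanaType) return false;  // kana combinations without width are not listed
  return (options & (Flags.IgnoreCase | Flags.IgnoreNonSpace)) !== 0;
}

// Enumerate all 32 flag combinations and collect those that always throw.
const throwing = [];
for (let options = 0; options < 32; options++) {
  if (alwaysThrows(options)) throwing.push(options);
}
console.log(throwing.length); // 22, the number of entries in the list
```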
@@ -348,10 +348,14 @@ public static string GetDistroVersionString()
private static readonly Lazy<bool> m_isInvariant = new Lazy<bool>(()
=> (bool?)Type.GetType("System.Globalization.GlobalizationMode")?.GetProperty("Invariant", BindingFlags.NonPublic | BindingFlags.Static)?.GetValue(null) == true);

private static readonly Lazy<bool> m_isHybrid = new Lazy<bool>(()
=> (bool?)Type.GetType("System.Globalization.GlobalizationMode")?.GetProperty("Hybrid", BindingFlags.NonPublic | BindingFlags.Static)?.GetValue(null) == true);

private static readonly Lazy<Version> m_icuVersion = new Lazy<Version>(GetICUVersion);
public static Version ICUVersion => m_icuVersion.Value;

public static bool IsInvariantGlobalization => m_isInvariant.Value;
public static bool IsHybridGlobalization => m_isHybrid.Value;
public static bool IsNotInvariantGlobalization => !IsInvariantGlobalization;
public static bool IsIcuGlobalization => ICUVersion > new Version(0, 0, 0, 0);
public static bool IsNlsGlobalization => IsNotInvariantGlobalization && !IsIcuGlobalization;