Implement Unicode properties required by UAX 29 #1214

aethanyc · 2021-10-26T17:56:57Z

We need the following Unicode properties:

Grapheme_Cluster_Break
Sentence_Break
Word_Break
Extended_Pictographic for GB11
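
For context, GB11 in UAX #29 is the rule that keeps emoji ZWJ sequences in one grapheme cluster: `Extended_Pictographic Extend* ZWJ × Extended_Pictographic`. The following is a minimal Rust sketch of that check, not ICU4X code. The `Gcb` enum, `gcb()`, `is_extended_pictographic()`, and `gb11_forbids_break()` are hypothetical names, and the two lookup functions hard-code a handful of code points as stand-ins for the real property data this issue asks for.

```rust
/// Toy subset of Grapheme_Cluster_Break values; only what GB11 needs.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
enum Gcb {
    Extend,
    Zwj,
    Other,
}

/// Hypothetical stand-in for a Grapheme_Cluster_Break lookup.
/// Only a few code points are covered; real data would come from a trie.
fn gcb(c: char) -> Gcb {
    match c {
        '\u{200D}' => Gcb::Zwj,
        // VARIATION SELECTOR-16 and the emoji skin tone modifiers.
        '\u{FE0F}' | '\u{1F3FB}'..='\u{1F3FF}' => Gcb::Extend,
        _ => Gcb::Other,
    }
}

/// Hypothetical stand-in for an Extended_Pictographic lookup
/// (boy, girl, man, woman, and heavy black heart only).
fn is_extended_pictographic(c: char) -> bool {
    matches!(c, '\u{1F466}'..='\u{1F469}' | '\u{2764}')
}

/// Returns true if GB11 forbids a break immediately before `text[i]`.
fn gb11_forbids_break(text: &[char], i: usize) -> bool {
    // GB11 needs at least a pictograph and a ZWJ to the left of the boundary.
    if i < 2 || i >= text.len() {
        return false;
    }
    if !is_extended_pictographic(text[i]) || gcb(text[i - 1]) != Gcb::Zwj {
        return false;
    }
    // Walk left over the optional run of Extend characters before the ZWJ.
    let mut j = i - 1;
    while j > 0 && gcb(text[j - 1]) == Gcb::Extend {
        j -= 1;
    }
    // The run must be preceded by an Extended_Pictographic character.
    j > 0 && is_extended_pictographic(text[j - 1])
}
```

Calling `gb11_forbids_break` on the characters of "man, ZWJ, woman" at index 2 reports that no break is allowed there, while replacing the first character with a letter makes the rule stop applying.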

aethanyc · 2021-10-26T23:53:45Z

@makotokato Could you check whether these properties are sufficient for UAX29?

makotokato · 2021-10-27T08:12:23Z

> We need the following Unicode properties:
>
> * Grapheme_Cluster_Break
> * Sentence_Break
> * Word_Break
> * Extended_Pictographic for [GB11](https://www.unicode.org/reports/tr29/#GB11)

As far as the UAX #29 spec goes, these are covered. But to implement the word segmenter, we need more categories for Chinese, Japanese, and other East Asian languages.

For East Asian languages, we can recognize SA through the Line_Break property. CJ will be ID plus some other properties, from Line_Break or elsewhere.

aethanyc · 2021-10-27T18:02:51Z

> As far as the UAX #29 spec goes, these are covered. But to implement the word segmenter, we need more categories for Chinese, Japanese, and other East Asian languages.
>
> For East Asian languages, we can recognize SA through the Line_Break property. CJ will be ID plus some other properties, from Line_Break or elsewhere.

I assume we need to query a code point's Script to switch to a different language break engine, like `ICULanguageBreakFactory::loadEngineFor()` does.

Luckily, ICU4X has already implemented the Script property. Here is a test of the script code point trie, but I think @echeran is documenting a nicer API, `icu_properties::maps::get_script()`, in #1204.
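
To make the idea concrete, here is a rough Rust sketch of picking a break engine from a code point's script. Everything in it is an assumption for illustration: `Script`, `Engine`, `script_of()`, and `engine_for()` are hypothetical names, not ICU4X or ICU APIs, and `script_of()` approximates the Script property with a few Unicode block ranges instead of real property data.

```rust
/// Toy subset of script values (hypothetical, not the ICU4X `Script` type).
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
enum Script {
    Thai,
    Hiragana,
    Katakana,
    Han,
    Other,
}

/// Hypothetical engine kinds a segmenter could dispatch to.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
enum Engine {
    ThaiDictionary,
    CjDictionary,
    RuleBased,
}

/// Approximates the Script property with block ranges.
/// This is only a stand-in: block membership and Script differ for some
/// code points, so real code should query the Script property data.
fn script_of(c: char) -> Script {
    match c {
        '\u{0E00}'..='\u{0E7F}' => Script::Thai,
        '\u{3040}'..='\u{309F}' => Script::Hiragana,
        '\u{30A0}'..='\u{30FF}' => Script::Katakana,
        '\u{4E00}'..='\u{9FFF}' => Script::Han,
        _ => Script::Other,
    }
}

/// Chooses an engine for a code point based on its (approximated) script.
fn engine_for(c: char) -> Engine {
    match script_of(c) {
        Script::Thai => Engine::ThaiDictionary,
        Script::Hiragana | Script::Katakana | Script::Han => Engine::CjDictionary,
        // Everything else falls back to the rule-based UAX #29 segmenter.
        Script::Other => Engine::RuleBased,
    }
}
```

With this sketch, `engine_for('ก')` selects the Thai dictionary engine, `engine_for('漢')` selects the CJ one, and Latin letters fall through to the rule-based path.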

aethanyc · 2021-11-02T17:37:14Z

@makotokato Here is the example in the documentation for using the Word_Break property. The Extended_Pictographic and Script properties are also available. Note that the getters' return value may change, pending the discussion in #1239.

makotokato · 2021-11-05T00:33:49Z

Thanks a lot.

aethanyc added the T-core (Type: Required functionality), C-segmentation (Component: Segmentation), and S-medium (Size: Less than a week (larger bug fix or enhancement)) labels Oct 26, 2021

aethanyc added this to the 2021 Q4 0.5 Sprint A milestone Oct 26, 2021

aethanyc self-assigned this Oct 26, 2021

aethanyc mentioned this issue Oct 26, 2021

[meta] Implement UAX29 spec in segmenter #943

Closed

3 tasks

aethanyc mentioned this issue Oct 28, 2021

Implement Grapheme_Cluster_Break, Word_Break, and Sentence_Break Unicode properties #1233

Merged

aethanyc closed this as completed in #1233 Nov 2, 2021
