Implement Unicode properties required by UAX 29 #1214

Closed
Tracked by #943
aethanyc opened this issue Oct 26, 2021 · 5 comments · Fixed by #1233
Assignees
Labels
C-segmentation Component: Segmentation S-medium Size: Less than a week (larger bug fix or enhancement) T-core Type: Required functionality

Comments

@aethanyc
Contributor

aethanyc commented Oct 26, 2021

We need the following Unicode properties:

  • Grapheme_Cluster_Break
  • Sentence_Break
  • Word_Break
  • Extended_Pictographic for GB11
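As a rough illustration of why Extended_Pictographic is on the list: rule GB11 in UAX #29 forbids a grapheme cluster break in the sequence `Extended_Pictographic Extend* ZWJ × Extended_Pictographic`. The sketch below is not ICU4X code. The `Class` enum, `classify`, and `gb11_forbids_break` are hypothetical names, and the lookup covers only a handful of hard-coded code points instead of real property data.

```rust
/// Toy subset of the property values that rule GB11 consults.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
enum Class {
    ExtendedPictographic,
    Extend,
    Zwj,
    Other,
}

/// Hypothetical lookup covering only a few code points.
/// A real implementation would read Grapheme_Cluster_Break and
/// Extended_Pictographic from Unicode property data.
fn classify(c: char) -> Class {
    match c {
        // ZERO WIDTH JOINER
        '\u{200D}' => Class::Zwj,
        // VARIATION SELECTOR-16 and the emoji skin-tone modifiers
        '\u{FE0F}' | '\u{1F3FB}'..='\u{1F3FF}' => Class::Extend,
        // HEAVY BLACK HEART, and BOY, GIRL, MAN, WOMAN
        '\u{2764}' | '\u{1F466}'..='\u{1F469}' => Class::ExtendedPictographic,
        _ => Class::Other,
    }
}

/// Returns true when GB11 forbids a break between `before` and `next`,
/// that is, when `before` ends with ExtendedPictographic Extend* ZWJ
/// and `next` is ExtendedPictographic.
fn gb11_forbids_break(before: &str, next: char) -> bool {
    if classify(next) != Class::ExtendedPictographic {
        return false;
    }
    let mut rev = before.chars().rev();
    // The character immediately before the boundary must be ZWJ.
    if rev.next().map(classify) != Some(Class::Zwj) {
        return false;
    }
    // Skip any Extend characters, then require an ExtendedPictographic.
    for c in rev {
        match classify(c) {
            Class::Extend => continue,
            Class::ExtendedPictographic => return true,
            _ => return false,
        }
    }
    false
}
```

For example, `gb11_forbids_break("\u{1F468}\u{200D}", '\u{1F469}')` checks the boundary inside a MAN, ZWJ, WOMAN sequence, while the same call with `"a\u{200D}"` as the prefix falls outside the rule.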
aethanyc added the T-core (Type: Required functionality), C-segmentation (Component: Segmentation), and S-medium (Size: Less than a week (larger bug fix or enhancement)) labels Oct 26, 2021
aethanyc added this to the 2021 Q4 0.5 Sprint A milestone Oct 26, 2021
aethanyc self-assigned this Oct 26, 2021
@aethanyc
Contributor Author

@makotokato Could you check whether these properties are sufficient for UAX29?

@makotokato
Member

> We need the following Unicode properties:
>
> * Grapheme_Cluster_Break
> * Sentence_Break
> * Word_Break
> * Extended_Pictographic for [GB11](https://www.unicode.org/reports/tr29/#GB11)

According to the UAX #29 spec, these are covered. But to implement a word segmenter, we need more categories for Chinese, Japanese, and other East Asian languages.

For East Asian languages, we can recognize SA from the Line_Break property. CJ will be ID plus some properties from Line_Break or elsewhere.

@aethanyc
Contributor Author

> As UAX#29 spec docs, these are converted. But, implementing word segmenter, we need more categories for Chinese, Japanese and East Asian languages.
>
> For East Asia languages, we can recognize SA in line break property. CJ will be ID and some properties from line break or others.

I assume we need to query a code point's Script to switch to a different language break engine, like ICULanguageBreakFactory::loadEngineFor().

Luckily, ICU4X already implements the Script property. Here is a test of the script CodePointTrie, but I think @echeran is documenting a nicer API, icu_properties::maps::get_script(), in #1204.
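To make the idea concrete, here is a minimal sketch of choosing a break engine from a code point's script. It does not use the ICU4X API. `Script`, `Engine`, `toy_script`, and `select_engine` are hypothetical names, and `toy_script` uses Unicode block ranges, which only approximate the real Script property.

```rust
/// Hypothetical script values; a real implementation would take these
/// from the Script property data.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
enum Script {
    Thai,
    Lao,
    Myanmar,
    Khmer,
    Hiragana,
    Katakana,
    Han,
    Other,
}

/// Hypothetical engine kinds, named only for illustration.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
enum Engine {
    RuleBased,
    SoutheastAsian,
    ChineseJapanese,
}

/// Assumption: block ranges stand in for the Script property here.
/// Some code points in these blocks actually have Script=Common.
fn toy_script(c: char) -> Script {
    match c {
        '\u{0E00}'..='\u{0E7F}' => Script::Thai,
        '\u{0E80}'..='\u{0EFF}' => Script::Lao,
        '\u{1000}'..='\u{109F}' => Script::Myanmar,
        '\u{1780}'..='\u{17FF}' => Script::Khmer,
        '\u{3040}'..='\u{309F}' => Script::Hiragana,
        '\u{30A0}'..='\u{30FF}' => Script::Katakana,
        '\u{4E00}'..='\u{9FFF}' => Script::Han,
        _ => Script::Other,
    }
}

/// Picks an engine for the text run that starts with `c`.
fn select_engine(c: char) -> Engine {
    match toy_script(c) {
        Script::Thai | Script::Lao | Script::Myanmar | Script::Khmer => {
            Engine::SoutheastAsian
        }
        Script::Hiragana | Script::Katakana | Script::Han => {
            Engine::ChineseJapanese
        }
        Script::Other => Engine::RuleBased,
    }
}
```

With this toy table, `select_engine('\u{0E01}')` (a Thai letter) and `select_engine('\u{4E2D}')` (a Han ideograph) pick the two language-specific engines, and everything else falls back to the rule-based one.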

@aethanyc
Contributor Author

aethanyc commented Nov 2, 2021

@makotokato Here is the example in the documentation for using the Word_Break property. The Extended_Pictographic and Script properties are also available. Note that the getter's return value may change, pending the discussion in #1239.

@makotokato
Member

Thanks a lot.
