Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Make sets for enumerated properties consistent with Script_Extensions #1577

Closed
echeran opened this issue Feb 3, 2022 · 2 comments · Fixed by #1608
Closed

Make sets for enumerated properties consistent with Script_Extensions #1577

echeran opened this issue Feb 3, 2022 · 2 comments · Fixed by #1608
Assignees
Labels
C-unicode Component: Props, sets, tries S-medium Size: Less than a week (larger bug fix or enhancement) T-techdebt Type: ICU4X code health and tech debt

Comments

@echeran
Copy link
Contributor

echeran commented Feb 3, 2022

For Script_Extensions work, we have exported a CodePointTrie with a companion array from ICU that contains the data for both Script and Script_Extensions properties.

In order to return a UnicodeSet for code points whose Script_Extensions contains a particular Script code, we use the newly-added CodePointTrie::get_range(). This approach allows us to realize the space savings that motivated the design of the data from ICU for Script / Script_Extensions.

As always, there are space-time tradeoffs / alternatives available here, as discussed previously (1, 2, 3).

For now, it might make sense to update the API implementations for sets-for-enumerated-properties (ex: pub fn get_for_script<D>(provider: &D, enum_val: Script) -> UnisetResult) to depend on data in a consistent manner with the set-for-Script_Extensions API. In other words, we would only need one data key for the entire enumerated property and only depend on the data for the code point trie. (Currently, we serialize each prop=val UnicodeSet and have a corresponding data key for each.)

In the future, we can think about optimizations. For example, some algorithms that depend on General_Category only are interested in a certain subset of gc values (ex: Nd, Nl, and No for numbers), so keeping the option for General_Category for the current style of key=val serialized sets data might be useful for data slicing.

@echeran echeran added C-unicode Component: Props, sets, tries discuss Discuss at a future ICU4X-SC meeting S-medium Size: Less than a week (larger bug fix or enhancement) T-techdebt Type: ICU4X code health and tech debt labels Feb 3, 2022
@sffc
Copy link
Member

sffc commented Feb 3, 2022

For binary enumerated properties, there are (and have always been) two general mechanisms:

  1. Pre-compiled UnicodeSet stored directly in the data provider
    • Best if you need a set for a specific value of an enumerated property known at compile time, such as "all letters" or "all code points in the Cyrillic script"
  2. Code point tries with the UnicodeSet computed at runtime
    • Best if you need sets for most or all values of an enumerated property

Up until Script_Extensions, we did mechanism 1 for all of the binary enumerated properties we support (which is just General_Category and Script right now). Elango's PR for Script_Extensions implements mechanism 2 without support for mechanism 1.

Moving forward, I see a few options:

  1. Support all with mechanism 1 and support a subset with mechanism 2
  2. Support all with mechanism 2 and support a subset with mechanism 1
  3. Support all with both mechanisms 1 and 2
  4. Support some with mechanism 1 and others with mechanism 2

@sffc
Copy link
Member

sffc commented Feb 4, 2022

Decision 2022-02-04:

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
C-unicode Component: Props, sets, tries S-medium Size: Less than a week (larger bug fix or enhancement) T-techdebt Type: ICU4X code health and tech debt
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants