Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Return sets for enumerated property value data using CPT data #1608

Merged
merged 16 commits into from
Mar 8, 2022

Conversation

echeran
Copy link
Contributor

@echeran echeran commented Feb 15, 2022

Closes #1577

@echeran echeran force-pushed the enum-props-sets-api-get-range branch from ebec909 to 32fdc50 Compare February 16, 2022 02:49
@jira-pull-request-webhook
Copy link

Notice: the branch changed across the force-push!

  • Cargo.lock is different
  • components/properties/src/provider.rs is now changed in the branch
  • components/properties/src/sets.rs is different
  • provider/cldr/src/transform/list/mod.rs is now changed in the branch
  • provider/uprops/src/enum_uniset.rs is now changed in the branch
  • utils/codepointtrie/Cargo.toml is different
  • utils/codepointtrie/src/codepointtrie.rs is different

View Diff Across Force-Push

~ Your Friendly Jira-GitHub PR Checker Bot

@echeran
Copy link
Contributor Author

echeran commented Feb 16, 2022

@sffc Can you take an initial look to make sure that I'm on the right track here? A few specifics:

  • Is it okay to leave the data provider in enum_uniset.rs as it is -- it's there if someone needs that type of data?
  • Is it okay to use the ..._clone_... version of the map_project methods that transform DataPayload<U> to DataPayload<V>?
  • Also, I need to add API docs & examples.
  • Is there anything I'm forgetting or missing?

cc @iainireland

utils/codepointtrie/src/codepointtrie.rs Show resolved Hide resolved
utils/codepointtrie/src/codepointtrie.rs Outdated Show resolved Hide resolved
utils/codepointtrie/src/codepointtrie.rs Show resolved Hide resolved
components/properties/src/sets.rs Outdated Show resolved Hide resolved
@echeran echeran marked this pull request as ready for review March 3, 2022 04:37
@echeran echeran requested review from a team and nordzilla as code owners March 3, 2022 04:37
@echeran echeran requested review from iainireland and removed request for nordzilla March 3, 2022 04:37
@echeran
Copy link
Contributor Author

echeran commented Mar 3, 2022

moving @nordzilla from reviewer list to cc

@sffc sffc self-requested a review March 3, 2022 08:44
@echeran echeran requested a review from sffc March 8, 2022 19:53
Copy link
Contributor

@iainireland iainireland left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks generally good to me. I think we will need some extensions to handle GeneralCategory, but it's fine to land this and do those in a separate PR.

@@ -110,196 +110,3 @@ impl IterableDynProvider<UnicodePropertyV1Marker> for EnumeratedPropertyUnicodeS
Ok(Box::new(core::iter::once(ResourceOptions::default())))
}
}

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unless I'm missing something, it looks like these tests are going away entirely. Is there any way to keep them around?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I moved these tests over to the icu_properties component, which has the "provider-aware" APIs that give you the data payload.

These tests will test the data loading side of things, both by getting the data from the provider and by testing the logical correctness of that data. So I suppose that you could make a case either way to put them here (provider_uprops) or there (icu_properties).

But when using the APIs, and when using the data provider in testdata, the same tests can behave identically but can be written with less code. So that's why I moved the tests. These tests ended up at icu_properties::sets.

//! let data_struct = payload.get();
//! let line_sep = &data_struct.inv_list;
//! let gc = &data_struct.code_point_trie;
//! let line_sep = gc.get_set_for_value(GeneralCategory::LineSeparator);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure if this belongs in this patch or somewhere else, but the API we expose for this should take GeneralCategoryGroup, not GeneralCategory, to make it possible for people to ask for gc=Letter and so on.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This method get_set_for_value() is implemented on a CodePointTrie, so it exists for all enumerated properties.

The special method that exists for the GeneralCategoryGroup case is over in icu_properties::sets called get_for_general_category_group().

/// assert_eq!(start..=end, sip_range);
///
/// assert!(sip_range_iter.next().is_none());
pub fn get_ranges_for_value(&self, value: T) -> impl Iterator<Item = RangeInclusive<u32>> + '_ {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It might be easier to implement gc=Letter if this takes a closure (to use in the .filter() call) instead of a value. The simple cases would use |range| range.value == value like here, but you could also, for example, check to see if range.value is included in a GeneralCategoryGroup.

I'm okay landing this and adding the extended version in a separate PR.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, that's #1532. GeneralCategoryGroup here is the second place we've seen where it would come in handy. The first was for Script_Extensions.

I believe the deep dive discussion answered my question originally in the issue on the impacts to FFI, code size, etc., but the discussion also surfaced a desire to add benchmarks before determining whether we feel it worth beginning the work for #1532.

@sffc
Copy link
Member

sffc commented Mar 8, 2022

+1 on making a better API for general category now that we are migrating to the new data model. See #1239 (comment)

@echeran echeran merged commit 29a3028 into unicode-org:main Mar 8, 2022
@echeran echeran deleted the enum-props-sets-api-get-range branch June 8, 2023 01:04
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Make sets for enumerated properties consistent with Script_Extensions
3 participants