Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Revisit return values of unicode property functions #1239

Closed
sffc opened this issue Oct 28, 2021 · 16 comments · Fixed by #2140
Closed

Revisit return values of unicode property functions #1239

sffc opened this issue Oct 28, 2021 · 16 comments · Fixed by #2140
Assignees
Labels
A-design Area: Architecture or design C-unicode Component: Props, sets, tries question Unresolved questions; type unclear T-core Type: Required functionality

Comments

@sffc
Copy link
Member

sffc commented Oct 28, 2021

(there's probably an issue for this already, but I'm filing it to make sure)

Currently we have

pub fn get_ascii_hex_digit<'data, D>(provider: &D) -> UnisetResult<'data>

pub fn get_script<'data, D>(provider: &D) -> CodePointMapResult<'data, Script>

which results in call sites that look like

// Sets
let provider = icu_testdata::get_provider();
let payload =
    sets::get_ascii_hex_digit(&provider)
        .expect("The data should be valid");
let data_struct = payload.get();
let ascii_hex_digit = &data_struct.inv_list;
assert!(ascii_hex_digit.contains('3'));

// Maps
let provider = icu_testdata::get_provider();
let payload =
    maps::get_script(&provider)
        .expect("The data should be valid");
let data_struct = payload.get();
let script = &data_struct.codepoint_trie;
assert_eq!(script.get('木' as u32), Script::Han);  // U+6728

We should think about this further.

CC @echeran @markusicu

@sffc sffc added question Unresolved questions; type unclear T-core Type: Required functionality A-design Area: Architecture or design C-unicode Component: Props, sets, tries v1 labels Oct 28, 2021
@markusicu
Copy link
Member

Happy to chat.

  • I am not totally up on what all the options and decisions have been.
  • It would be nice if static_property_name.get() directly gave you a thing that you could ask the relevant questions like contains(c) and getValue(c).
  • I voiced a preference before for exposing properties as interfaces whose implementations could change, rather than fixing the property API on things like inv_list and codepoint_trie. (And why is that not code_point_trie?)
  • Is it necessary or desirable that the namespaces are separate, such as sets and maps, as opposed to something like properties?

@sffc
Copy link
Member Author

sffc commented Oct 28, 2021

Yeah, the current API is largely a result of us having not yet hammered down the details. It's just what happened to be implementable. We should consider most parts of it to be fair game for debate.

@sffc
Copy link
Member Author

sffc commented Nov 9, 2021

In #1269 I'm adding an FFI with the following syntax:

auto result = ICU4XUnicodeSetProperty::try_get_ascii_hex_digit(dp);
if (result.success) {
    bool contains = result.data.value().contains(cp);
}

with the following Rust definitions:

pub struct ICU4XUnicodeSetProperty(DataPayload<'static, UnicodePropertyV1Marker>);
impl ICU4XUnicodeSetProperty {
    pub fn contains(&self, cp: char) -> bool {
        self.0.get().inv_list.contains(cp)
    }
}

We should unify the FFI and the Rust API once we finalize the design.

@sffc
Copy link
Member Author

sffc commented Mar 8, 2022

After reviewing #1608, I think we should have a class like GeneralCategoryMap which is similar to ScriptWithExtensions and has a bunch of General_Category helper function on it.

@sffc
Copy link
Member Author

sffc commented Apr 1, 2022

@sffc to reproduce the conclusion whiteboard.

@echeran
Copy link
Contributor

echeran commented Apr 14, 2022

@Manishearth here are the notes to our previous meeting to create an initial design of what this might look like, and after I opened PR #1333 to attempt implementing it, I ran into tricky Rust issues when implementing it, and the discussion you and I had about those issues exist at the end of the same section of the notes doc:
https://docs.google.com/document/d/1weiKTPjbmfJNctmVVzbtXULABW9l7HWGi7L7DbnpiWE/edit#bookmark=id.j00m6clhxi2y

@Manishearth
Copy link
Member

Thanks so much for digging that up!

@Manishearth
Copy link
Member

A thing I noticed is that this plan would require renaming the icu_uniset crate to have CodePointSet. For now I think I'm going to consider this out of scope for my work here and let @echeran decide how best to do this (we may want to, for example, consolidate the CPT and CPS types, and also the UnicodeSet/UnicodeMap types when those exist too).

@Manishearth
Copy link
Member

Posting a WIP for the new design, now working on integrating

// This file is part of ICU4X. For terms of use, please see the file
// called LICENSE at the top level of the ICU4X source tree
// (online at: https://github.com/unicode-org/icu4x/blob/main/LICENSE ).

//! Temporary module for new return value design

use core::marker::PhantomData;
use icu_codepointtrie::{CodePointTrie, TrieValue};
use icu_provider::prelude::*;
use icu_provider::{yoke, zerofrom};
use icu_uniset::UnicodeSet;
use icu_uniset::UnicodeSetBuilder;

/// A set of characters with a particular property.
#[derive(Debug, Eq, PartialEq, Clone, yoke::Yokeable, zerofrom::ZeroFrom)]
#[cfg_attr(feature = "datagen", derive(serde::Serialize))]
#[cfg_attr(feature = "serde", derive(serde::Deserialize))]
pub struct CodePointSet<'data> {
    /// The set of characters, represented as an inversion list
    #[cfg_attr(feature = "serde", serde(borrow))]
    pub inv_list: UnicodeSet<'data>,
}

impl Default for CodePointSet<'static> {
    /// Default empty property
    fn default() -> CodePointSet<'static> {
        CodePointSet {
            inv_list: UnicodeSetBuilder::new().build(),
        }
    }
}

impl<'data> CodePointSet<'data> {
    /// Creates a [`UnicodeProperty`] for the given [`UnicodeSet`].
    pub fn from_owned_uniset(set: UnicodeSet<'data>) -> CodePointSet<'data> {
        CodePointSet { inv_list: set }
    }
}

impl<'data> From<CodePointSet<'data>> for UnicodeSet<'data> {
    fn from(prop: CodePointSet<'data>) -> UnicodeSet<'data> {
        prop.inv_list
    }
}

/// A set of characters with a particular property.
#[derive(Clone)]
pub struct CodePointSetData<T: CodePointSetLikeProperty> {
    data: DataPayload<CodePointSetMarker<T>>,
}

impl<T: CodePointSetLikeProperty> CodePointSetData<T> {
    pub fn contains(&self, ch: char) -> bool {
        self.data.get().inv_list.contains(ch)
    }
}

/// A map efficiently storing data about individual characters.
#[derive(Debug, Eq, PartialEq, yoke::Yokeable, zerofrom::ZeroFrom)]
#[cfg_attr(feature = "datagen", derive(serde::Serialize))]
#[cfg_attr(feature = "serde", derive(serde::Deserialize))]
pub struct CodePointMap<'data, T: TrieValue> {
    /// A codepoint trie storing the data
    #[cfg_attr(feature = "serde", serde(borrow))]
    pub code_point_trie: CodePointTrie<'data, T>,
}

impl<'data, T: TrieValue> Clone for CodePointMap<'data, T>
where
    <T as zerovec::ule::AsULE>::ULE: Clone,
{
    fn clone(&self) -> Self {
        CodePointMap {
            code_point_trie: self.code_point_trie.clone(),
        }
    }
}

/// A set of characters with a particular property.
#[derive(Clone)]
pub struct CodePointMapData<T: CodePointMapLikeProperty> {
    data: DataPayload<CodePointMapMarker<T>>,
}

impl<T: CodePointMapLikeProperty> CodePointMapData<T> {
    pub fn get(&self, ch: char) -> T::Value {
        self.data.get().code_point_trie.get(ch as u32)
    }
}

/// Properties that are stored in CodePointSets
pub trait CodePointSetLikeProperty {
    const KEY: ResourceKey;
}

#[derive(Copy, Clone, Debug, PartialEq, Eq, Ord, PartialOrd, Hash)]
pub struct CodePointSetMarker<T>(PhantomData<T>);

impl<T> DataMarker for CodePointSetMarker<T> {
    type Yokeable = CodePointSet<'static>;
}

impl<T: CodePointSetLikeProperty> ResourceMarker for CodePointSetMarker<T> {
    const KEY: ResourceKey = T::KEY;
}

/// Properties that are stored in CodePointSets
pub trait CodePointMapLikeProperty {
    const KEY: ResourceKey;
    type Value: TrieValue;
}

#[derive(Copy, Clone, Debug, PartialEq, Eq, Ord, PartialOrd, Hash)]
pub struct CodePointMapMarker<T>(PhantomData<T>);

impl<T: CodePointMapLikeProperty> DataMarker for CodePointMapMarker<T> {
    type Yokeable = CodePointMap<'static, T::Value>;
}

impl<T: CodePointMapLikeProperty> ResourceMarker for CodePointMapMarker<T> {
    const KEY: ResourceKey = T::KEY;
}

fn get_cp_map<D, T>(provider: &D, key: ResourceKey) -> CodePointMapData<T>
where
    D: DynProvider<CodePointMapMarker<T>> + ?Sized,
    T: CodePointMapLikeProperty,
{
    let resp: DataResponse<CodePointMapMarker<T>> =
        provider.load_payload(key, &Default::default())?;

    let property_payload: DataPayload<CodePointMapMarker<T>> = resp.take_payload()?;
    Ok(CodePointMapData { data: property_payload })
}

macro_rules! make_map_property {
    (
        property: $prop_name:expr;
        marker: $marker_name:ident;
        value: $value_ty:path;
        key: $key_name:expr;
        func:
        $(#[$attr:meta])*
        pub fn $funcname:ident();
    ) => {
        #[doc = concat!("Marker type for `", $prop_name, "` Unicode property, ")]
        #[doc = concat!("with value [`", stringify!($value_ty), "`]")]
        pub struct $marker_name;

        impl CodePointMapLikeProperty for $marker_name {
            const KEY: ResourceKey = $key_name;
            type Value = $value_ty;
        }

        $(#[$attr])*
        pub fn $funcname<D>(provider: &D) -> CodePointMapData<$marker_name>
        where
            D: DynProvider<CodePointMapMarker<$marker_name>> + ?Sized,
        {
            get_cp_map(provider, $key_name)
        }
    }
}

// use ;
make_map_property!{
    property: "Word_Break";
    marker: WordBreakProperty;
    value: crate::props::WordBreak;
    key: crate::provider::key::WORD_BREAK_V1;
    func: 
    /// testing
    pub fn get_word_break();
}

@Manishearth
Copy link
Member

With my current plan I'm also not sure what to do about versioning and I might make a PR without V1s on it, which we can add separately if desired.

@Manishearth
Copy link
Member

Hmm, I'm hitting rust-lang/rust#96223

@Manishearth
Copy link
Member

What I might do is get rid of the prop marker types for now and perhaps introduce them in the future, which will be easy once we have a macro doing everything.

@Manishearth
Copy link
Member

Hm, that doesn't actually fix things 😅

@Manishearth
Copy link
Member

WIP is at https://github.com/unicode-org/icu4x/pull/1808/files, split out macro stuff into #1810

@Manishearth
Copy link
Member

I've written https://docs.google.com/document/d/1gVEV_cSrdCSk1WxKVthdcIB1XenrcB-wDyanRAwMbUI/edit to resolve some open questions here.

@Manishearth
Copy link
Member

Hmm, I'm hitting rust-lang/rust#96223

We're not blocked on this anymore, the issue crops up when there's a missing Deserialize impl; we just need to test with --all-features. It's still not great that the compiler crashes instead of reporting this issue

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A-design Area: Architecture or design C-unicode Component: Props, sets, tries question Unresolved questions; type unclear T-core Type: Required functionality
Projects
None yet
5 participants