Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Figure out endianness story #45

Closed
Manishearth opened this issue May 11, 2021 · 9 comments
Closed

Figure out endianness story #45

Manishearth opened this issue May 11, 2021 · 9 comments

Comments

@Manishearth
Copy link
Collaborator

Right now TinyStr contains an integer that can be interpreted as a string. There are also ways of looking at TinyStr as an integer, both by looking at the integer in the native endianness, and by converting it to a little-endian representation that can be sent across machines.

The conversion methods are kinda haphazard, #43 completes the set but really we should figure out what utilities we want to expose for viewing the TinyStr as an integer and stick to those.

Furthermore, it's unclear to me if the tinystr!() macro is sound when cross compiling: if you build your code on a little endian system targeting big endian, will the codegen use the little endian representation? It's kinda tricky to investigate, and our constructors from numbers should probably be exceedingly clear about this.

Ideally we can add dedicated constructors for the ULE use case (#44)

@sffc
Copy link
Collaborator

sffc commented May 11, 2021

The key insight is that the bit pattern is constant whether you are on LE or BE systems. The thing that differs is the numerical representation of those bytes. This is different from most LE/BE problems, where you want to represent the same numerical value but need to do so using different bit patterns.

Therefore, when interfacing with external systems (incl. serialization, ULE, etc.), the type we should be using is [u8; N].

The constructors could look like, e.g.,

impl TinyStr4 {
    const unsafe fn new_unchecked(bytes: &[u8; 4]) {
        Self(NonZeroU32::new_unchecked(u32::from_ne_bytes(bytes)))
    }
    const fn as_raw(&self) -> &[u8; 4] {
        self.0.get().as_ne_bytes()
    }
}

If we care about the alignment of the bytes, we could use something like https://docs.rs/aligned/0.3.4/aligned/ instead of returning a raw byte array.

To be clear: I don't think there's a reason to change the internal representation of TinyStr. I just think that using byte arrays instead of integers on interchange will make things a lot easier to reason about.

@sffc
Copy link
Collaborator

sffc commented May 11, 2021

Related: rust-lang/rust#76976

@Manishearth
Copy link
Collaborator Author

Yeah, I like this idea!

@sffc
Copy link
Collaborator

sffc commented May 18, 2021

@zbraniecki, thoughts? Would you approve a PR that does what #45 (comment) proposes?

@zbraniecki
Copy link
Owner

I'm ok with that change, but I'd be curious what the performance impact may be.

@Manishearth
Copy link
Collaborator Author

I don't think there will be one, we're just changing the types used, not runtime behavior

@sffc
Copy link
Collaborator

sffc commented Jan 7, 2022

Discussion with me and Manish: we can have AsciiULE<N> and TinyStrN where N is 1 to 16, and TinyStrN is a wrapper around AsciiULE<N> but repr(align(...)).

@Manishearth
Copy link
Collaborator Author

Fixed by unicode-org/icu4x#1508

@Manishearth
Copy link
Collaborator Author

@zbraniecki should we mark this repo as archived and point to icu4x?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants