Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Differences in parsing UUIDs from uint128 in Python and Rust #406

Closed
thedrow opened this issue May 12, 2019 · 20 comments
Closed

Differences in parsing UUIDs from uint128 in Python and Rust #406

thedrow opened this issue May 12, 2019 · 20 comments

Comments

@thedrow
Copy link

thedrow commented May 12, 2019

Describe the bug
I'm creating a binding of this library to Python for learning purposes and possible production usage.
I've encountered a difference in the values Python provides and the values Rust provides when parsing a UUID from an unsigned 128-bit integer.

To Reproduce

When parsing a UUID from an unsigned integer in Python I get the following UUID:

>>> import uuid
>>> uuid.UUID(int=286695482685957555672632549999181678124)
UUID('d7af8a9f-e924-4f96-b36f-9ecea445c62c')

However, when I use the following rust code:

extern crate uuid;
use uuid::Uuid;
let u: uint128 = 286695482685957555672632549999181678124;
let uuid: Uuid = u.into();
println!(uuid.to_hyphenated()
                .encode_lower(&mut Uuid::encode_buffer())
                .to_string());

The output is the following UUID: 2cc645a4-ce9e-6fb3-964f-24e99f8aafd7

Expected behavior
I think that the value should be the same in both implementations.
If not, we need to figure out the difference and decide which implementation is correct.

Screenshots
If applicable, add screenshots to help explain your problem.

Specifications (please complete the following information):

  • Target: Debug
  • Version [e.g. 1.0]: 0.7.4
  • Features Enabled: v1, v3, v4, v5, u128

Additional context
See the Python implementation here.

Other
N/A

@kinggoesgaming
Copy link
Member

cc @KodrAus

@KodrAus
Copy link
Member

KodrAus commented May 13, 2019

Looks like endianness? Will dig in properly in a bit, but maybe we’re reversing the bits in that 128bit number so they’re big endian whereas Python isn’t?

@kinggoesgaming
Copy link
Member

Now I feel silly... thats the first thing I ruled out :P

@thedrow
Copy link
Author

thedrow commented May 13, 2019

That's exactly the case.

I did the following:

            if let Some(i) = int {
                handle = i.swap_bytes().into();
            }

and the tests pass. I'm not sure which implementation is correct (or if both are correct).
In any case, this should be documented.

@KodrAus
Copy link
Member

KodrAus commented May 13, 2019

Yeh, I think because a uuid doesn’t have a representation as a 128bit number any semantics we pick will be fairly arbitrary.

I think we should consider deprecating the 128bit conversions.

@thedrow
Copy link
Author

thedrow commented May 13, 2019

I disagree.
It's useful at least to me.

@KodrAus
Copy link
Member

KodrAus commented May 13, 2019

@thedrow In that case you could define your own function in your bindings to Python with the semantics you need, though.

The problem is we’ve chosen one arbitrary representation where the 128bit number has a platform-specific endianness but the representation you want is for the 128bit number to act like a [u8; 16] where there is no endianness.

Since there are multiple equally valid approaches here I think we should consider punting supporting it.

@kinggoesgaming
Copy link
Member

I think we should consider deprecating the 128bit conversions.

I think a better way will be to provide another adapter Integer that handles the u128 explicitly. I agree that conversion functionality is important. However as seen here, this introduces the issue of the endianness of the u128 being ambiguous.

@kinggoesgaming
Copy link
Member

The other thing that I was thinking last night was that we do not have any types that guarantee about the endianness of the bytes within. We could introduce a adapter::endian module that has these. I am not a 100% sure if this would be the best fit for uuid crate itself though

bors bot added a commit that referenced this issue May 26, 2019
407: remove support for u128 r=Dylan-DPC a=kinggoesgaming

**I'm submitting a(n)** other

# Description
Remove the `u128_support` module and the current implementations for converting from `u128` to `Uuid`s. New alternative will be provided later.

# Motivation
#406 shows that `u128`, number type that is affected by the target's endianness.

# Tests
All tests related to `u128` removed. The rest of the tests should work as expected, as before.

# Related Issue(s)
#406 

Co-authored-by: Hunar Roop Kahlon <[email protected]>
bors bot added a commit that referenced this issue Oct 4, 2019
427: Breaking API features and refactoring for `0.8` r=kinggoesgaming a=KodrAus

**I'm submitting a(n)** (|feature|refactor|)

# Description

## Modules and Errors

This PR is a broad refactoring of our error types and module layout. The general pattern I've gone for is:

- Define specific error types in submodules. These are currently private.
- Collect these specific error types in an opaque root `Error`. All methods return this single `Error` type so it's easier for consumers to carry `uuid::Error`s in their own code.

It'll also include some implementations for open PRs as we've been discussing. I imagine we'll want to spend time working through these changes :)

I've hidden our `prelude` module for now, because I'm not sure it's something we'll want to stabilize with (it's only got a few bits in it afterall).

## 128bit support

Re-enables support for 128-bit numbers in the form of constructors on `Uuid`, following the pattern that method names without an endian suffix are BE.

## No-std

Refactors our `no-std` support so we always mark the crate as `no-std` so our std imports are always the same. This simplifies our std/core support so we need fewer modules.

## Timestamps

Includes the design proposed in #405 for timestamp support originally implemented by @jonathanstrong.

# Related Issue(s)

- #406

# Related PR(s)

- #405 
- #416

Co-authored-by: Ashley Mannix <[email protected]>
Co-authored-by: Jonathan Strong <[email protected]>
@Fleshgrinder
Copy link

The behavior here is not surprising, well, not to me at least. Literals are big endian and get converted to the target endianess. Given that most (all?) x86/x64 CPUs are little endian we end up with exactly that. Python – like the JVM – is probably normalizing exactly that so that users do not get confused.

There is target_endian with which it would be possible to select the right default implementation for the u128 to Uuid conversion at compile time. However, I cannot tell if targets exist that may have a different Endianess depending on the used CPU. I do not believe so but believing means knowing nothing.

The – by far – safest choice to to simply call to_be and to_le that are no-op if the target is already using the expected Endianess and otherwise converts them. This way the Endianess is always correct and all operations placed on top are safe.

@chpio
Copy link

chpio commented Feb 13, 2021

Looking at the code I was wondering why there even is a from_u128_le fn. Isn't rusts u128 defined as "big endian" from the pov of the developer (or maybe im just stupid...)? So the endianness conversion should happen at the "data reading" stage with for example from_le_bytes?

@kinggoesgaming
Copy link
Member

Looking at the code I was wondering why there even is a from_u128_le fn. Isn't rusts u128 defined as "big endian" from the pov of the developer (or maybe im just stupid...)? So the endianness conversion should happen at the "data reading" stage with for example from_le_bytes?

All primitive numbers follow the target platform’s endianness... the from_le_bytes and from_be_bytes are noop if endian already matches, IIRC

@chpio
Copy link

chpio commented Mar 8, 2021

All primitive numbers follow the target platform’s endianness

yeah, im not talking about the native endianness the cpu is using, that is abstracted away by the compiler anyway. But the "conceptual"/"logical" endianness perceived by the developer.

Im just saying that the from_le_bytes should be used on a lower level (eg. while interpreting/parsing read data), and the higher level (Uuid::from_u128) should not bother with endianness. from_u128_le is mixing both, the lower level data endianness and the conceptual endianness.

edit: "should" as "The existence of from_u128_le is clashing with my understanding of the software/language design", not saying it should be removed.

@kinggoesgaming
Copy link
Member

kinggoesgaming commented Apr 30, 2021

All primitive numbers follow the target platform’s endianness

yeah, im not talking about the native endianness the cpu is using, that is abstracted away by the compiler anyway. But the "conceptual"/"logical" endianness perceived by the developer.

Im just saying that the from_le_bytes should be used on a lower level (eg. while interpreting/parsing read data), and the higher level (Uuid::from_u128) should not bother with endianness. from_u128_le is mixing both, the lower level data endianness and the conceptual endianness.

edit: "should" as "The existence of from_u128_le is clashing with my understanding of the software/language design", not saying it should be removed.

le explicitly indicate little endian, since uuids are big endian by default as required by the specification. As such you don’t see be variants. U128 specific one is used as u128 endianess is machine dependant. uuid itself does not implicitly handle endianess so the default behaviour is to assume the endianess of the inputted bytes(128 bits = 16 bytes) is correctly big endian.

@KodrAus
Copy link
Member

KodrAus commented Aug 13, 2021

I think I'm still in favour of deprecating the _le methods entirely. They're a constant source of confusion. I think we should instead recommend calling swap_bytes yourself if you want to reverse the input fields. We can't really insert our own calls to to_be or to_le within uuid because that would break the existing semantics of from_fields.

@KodrAus
Copy link
Member

KodrAus commented Aug 13, 2021

If we did want to keep this functionality maybe it would be better called from_fields_swap etc rather than from_fields_le because I think it’s the endianness that’s making it confusing.

@chpio
Copy link

chpio commented Aug 13, 2021

le explicitly indicate little endian, since uuids are big endian by default as required by the specification. As such you don’t see be variants. U128 specific one is used as u128 endianess is machine dependant. uuid itself does not implicitly handle endianess so the default behaviour is to assume the endianess of the inputted bytes(128 bits = 16 bytes) is correctly big endian.

Why do you say, that it's dependent on the native endianess?

const UUID: Uuid = Uuid::from_u128(0xd8d2cf50_a5c6_433b_98e6_8c268fd84fa0);

is just the same on a little endianess machine and on a big endianess machine. There's no "endianess", because it gets abstracted away by the compiler. What matters here is the way how you're reading the data. And that's my argument, endianess should be handled on the reading side not in uuid.

If you're reading an u128 from a file and need to convert it in uuid then that u128 you have read was invalid, because your reading code is broken (returning data in the wrong endianess).

@thedrow
Copy link
Author

thedrow commented Aug 15, 2021

I don't find the API confusing and it is very useful for my fastuuid library.
I'd rather keep the API as it is, if you don't mind.

@KodrAus
Copy link
Member

KodrAus commented Aug 16, 2021

Why do you say, that it's dependent on the native endianess?

This is where the confusion is I think. I've added some docs to our main branch that call out the endianness we're talking about is within the UUID (where the fields are always big endian), rather than the machine, which may be either big or little endian. So when we're talking about little endian integers in the _le methods, we're talking about integer fields encoded as little endian coming from some other source (like a Microsoft GUID) that need to be flipped for an equivalent UUID. I'm feeling less worried about deprecating them after that.

I totally agree that handling any endianness at the read side probably makes the most sense if you're encoding them, but am not sure that means we'd need a from_bytes_le method. If you're encoding a UUID as its raw bytes then either:

  • Those 16 bytes are an opaque value and can be encoded exactly as the UUID gives them to you (which is big endian).
  • You want to read those fields individually as integers so will need to unpack the UUID to either big or little endian fields depending on your encoding using as_fields, in which case you'd probably need to read those fields back out individually to re-interpret them together as a UUID using from_fields.

The u128 methods specifically are a little odd, because there's no specified interpretation of a UUID as a 128bit number, but to keep the behavior consistent with other methods that accept or produce integers, we have a default big-endian interpretation and an unconditionally flipped little-endian one. Since we flip the bytes in whole integers, that means we flip the whole number and not just the fields encoded in it.

@KodrAus
Copy link
Member

KodrAus commented Oct 28, 2021

I'm just coming through some triage. I think the specific issue here is "resolved" as use Uuid::from_u128_le if you want to construct a Uuid from a 128bit number and flip it. It can be a little surprising, but the docs now try to call out why it's done this way. The _le methods aren't going anywhere.

I'll go ahead and close this one for now, but please feel free to re-open or create new issues for anything else that comes up!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

5 participants