Differences in parsing UUIDs from uint128 in Python and Rust #406

thedrow · 2019-05-12T12:38:52Z

Describe the bug
I'm creating a binding of this library to Python for learning purposes and possible production usage.
I've encountered a difference in the values Python provides and the values Rust provides when parsing a UUID from an unsigned 128-bit integer.

To Reproduce

When parsing a UUID from an unsigned integer in Python I get the following UUID:

>>> import uuid
>>> uuid.UUID(int=286695482685957555672632549999181678124)
UUID('d7af8a9f-e924-4f96-b36f-9ecea445c62c')

However, when I use the following rust code:

extern crate uuid;
use uuid::Uuid;
let u: uint128 = 286695482685957555672632549999181678124;
let uuid: Uuid = u.into();
println!(uuid.to_hyphenated()
                .encode_lower(&mut Uuid::encode_buffer())
                .to_string());

The output is the following UUID: 2cc645a4-ce9e-6fb3-964f-24e99f8aafd7

Expected behavior
I think that the value should be the same in both implementations.
If not, we need to figure out the difference and decide which implementation is correct.

Screenshots
If applicable, add screenshots to help explain your problem.

Specifications (please complete the following information):

Target: Debug
Version [e.g. 1.0]: 0.7.4
Features Enabled: v1, v3, v4, v5, u128

Additional context
See the Python implementation here.

Other
N/A

The text was updated successfully, but these errors were encountered:

kinggoesgaming · 2019-05-12T23:33:03Z

cc @KodrAus

KodrAus · 2019-05-13T01:21:39Z

Looks like endianness? Will dig in properly in a bit, but maybe we’re reversing the bits in that 128bit number so they’re big endian whereas Python isn’t?

kinggoesgaming · 2019-05-13T05:46:42Z

Now I feel silly... thats the first thing I ruled out :P

thedrow · 2019-05-13T08:27:58Z

That's exactly the case.

I did the following:

            if let Some(i) = int {
                handle = i.swap_bytes().into();
            }

and the tests pass. I'm not sure which implementation is correct (or if both are correct).
In any case, this should be documented.

KodrAus · 2019-05-13T11:05:02Z

Yeh, I think because a uuid doesn’t have a representation as a 128bit number any semantics we pick will be fairly arbitrary.

I think we should consider deprecating the 128bit conversions.

thedrow · 2019-05-13T11:10:51Z

I disagree.
It's useful at least to me.

KodrAus · 2019-05-13T11:18:32Z

@thedrow In that case you could define your own function in your bindings to Python with the semantics you need, though.

The problem is we’ve chosen one arbitrary representation where the 128bit number has a platform-specific endianness but the representation you want is for the 128bit number to act like a [u8; 16] where there is no endianness.

Since there are multiple equally valid approaches here I think we should consider punting supporting it.

kinggoesgaming · 2019-05-13T16:45:49Z

I think we should consider deprecating the 128bit conversions.

I think a better way will be to provide another adapter Integer that handles the u128 explicitly. I agree that conversion functionality is important. However as seen here, this introduces the issue of the endianness of the u128 being ambiguous.

kinggoesgaming · 2019-05-13T16:49:25Z

The other thing that I was thinking last night was that we do not have any types that guarantee about the endianness of the bytes within. We could introduce a adapter::endian module that has these. I am not a 100% sure if this would be the best fit for uuid crate itself though

407: remove support for u128 r=Dylan-DPC a=kinggoesgaming **I'm submitting a(n)** other # Description Remove the `u128_support` module and the current implementations for converting from `u128` to `Uuid`s. New alternative will be provided later. # Motivation #406 shows that `u128`, number type that is affected by the target's endianness. # Tests All tests related to `u128` removed. The rest of the tests should work as expected, as before. # Related Issue(s) #406 Co-authored-by: Hunar Roop Kahlon <[email protected]>

@jonathanstrong

427: Breaking API features and refactoring for `0.8` r=kinggoesgaming a=KodrAus **I'm submitting a(n)** (|feature|refactor|) # Description ## Modules and Errors This PR is a broad refactoring of our error types and module layout. The general pattern I've gone for is: - Define specific error types in submodules. These are currently private. - Collect these specific error types in an opaque root `Error`. All methods return this single `Error` type so it's easier for consumers to carry `uuid::Error`s in their own code. It'll also include some implementations for open PRs as we've been discussing. I imagine we'll want to spend time working through these changes :) I've hidden our `prelude` module for now, because I'm not sure it's something we'll want to stabilize with (it's only got a few bits in it afterall). ## 128bit support Re-enables support for 128-bit numbers in the form of constructors on `Uuid`, following the pattern that method names without an endian suffix are BE. ## No-std Refactors our `no-std` support so we always mark the crate as `no-std` so our std imports are always the same. This simplifies our std/core support so we need fewer modules. ## Timestamps Includes the design proposed in #405 for timestamp support originally implemented by @jonathanstrong. # Related Issue(s) - #406 # Related PR(s) - #405 - #416 Co-authored-by: Ashley Mannix <[email protected]> Co-authored-by: Jonathan Strong <[email protected]>

Fleshgrinder · 2020-02-24T19:48:58Z

The behavior here is not surprising, well, not to me at least. Literals are big endian and get converted to the target endianess. Given that most (all?) x86/x64 CPUs are little endian we end up with exactly that. Python – like the JVM – is probably normalizing exactly that so that users do not get confused.

There is target_endian with which it would be possible to select the right default implementation for the u128 to Uuid conversion at compile time. However, I cannot tell if targets exist that may have a different Endianess depending on the used CPU. I do not believe so but believing means knowing nothing.

The – by far – safest choice to to simply call to_be and to_le that are no-op if the target is already using the expected Endianess and otherwise converts them. This way the Endianess is always correct and all operations placed on top are safe.

chpio · 2021-02-13T11:36:50Z

Looking at the code I was wondering why there even is a from_u128_le fn. Isn't rusts u128 defined as "big endian" from the pov of the developer (or maybe im just stupid...)? So the endianness conversion should happen at the "data reading" stage with for example from_le_bytes?

kinggoesgaming · 2021-02-26T05:33:02Z

Looking at the code I was wondering why there even is a from_u128_le fn. Isn't rusts u128 defined as "big endian" from the pov of the developer (or maybe im just stupid...)? So the endianness conversion should happen at the "data reading" stage with for example from_le_bytes?

All primitive numbers follow the target platform’s endianness... the from_le_bytes and from_be_bytes are noop if endian already matches, IIRC

chpio · 2021-03-08T11:00:30Z

All primitive numbers follow the target platform’s endianness

yeah, im not talking about the native endianness the cpu is using, that is abstracted away by the compiler anyway. But the "conceptual"/"logical" endianness perceived by the developer.

Im just saying that the from_le_bytes should be used on a lower level (eg. while interpreting/parsing read data), and the higher level (Uuid::from_u128) should not bother with endianness. from_u128_le is mixing both, the lower level data endianness and the conceptual endianness.

edit: "should" as "The existence of from_u128_le is clashing with my understanding of the software/language design", not saying it should be removed.

kinggoesgaming · 2021-04-30T14:12:00Z

All primitive numbers follow the target platform’s endianness

yeah, im not talking about the native endianness the cpu is using, that is abstracted away by the compiler anyway. But the "conceptual"/"logical" endianness perceived by the developer.

Im just saying that the from_le_bytes should be used on a lower level (eg. while interpreting/parsing read data), and the higher level (Uuid::from_u128) should not bother with endianness. from_u128_le is mixing both, the lower level data endianness and the conceptual endianness.

edit: "should" as "The existence of from_u128_le is clashing with my understanding of the software/language design", not saying it should be removed.

le explicitly indicate little endian, since uuids are big endian by default as required by the specification. As such you don’t see be variants. U128 specific one is used as u128 endianess is machine dependant. uuid itself does not implicitly handle endianess so the default behaviour is to assume the endianess of the inputted bytes(128 bits = 16 bytes) is correctly big endian.

KodrAus · 2021-08-13T01:08:27Z

I think I'm still in favour of deprecating the _le methods entirely. They're a constant source of confusion. I think we should instead recommend calling swap_bytes yourself if you want to reverse the input fields. We can't really insert our own calls to to_be or to_le within uuid because that would break the existing semantics of from_fields.

KodrAus · 2021-08-13T01:22:11Z

If we did want to keep this functionality maybe it would be better called from_fields_swap etc rather than from_fields_le because I think it’s the endianness that’s making it confusing.

chpio · 2021-08-13T12:27:45Z

le explicitly indicate little endian, since uuids are big endian by default as required by the specification. As such you don’t see be variants. U128 specific one is used as u128 endianess is machine dependant. uuid itself does not implicitly handle endianess so the default behaviour is to assume the endianess of the inputted bytes(128 bits = 16 bytes) is correctly big endian.

Why do you say, that it's dependent on the native endianess?

const UUID: Uuid = Uuid::from_u128(0xd8d2cf50_a5c6_433b_98e6_8c268fd84fa0);

is just the same on a little endianess machine and on a big endianess machine. There's no "endianess", because it gets abstracted away by the compiler. What matters here is the way how you're reading the data. And that's my argument, endianess should be handled on the reading side not in uuid.

If you're reading an u128 from a file and need to convert it in uuid then that u128 you have read was invalid, because your reading code is broken (returning data in the wrong endianess).

thedrow · 2021-08-15T10:30:45Z

I don't find the API confusing and it is very useful for my fastuuid library.
I'd rather keep the API as it is, if you don't mind.

KodrAus · 2021-08-16T03:21:14Z

Why do you say, that it's dependent on the native endianess?

This is where the confusion is I think. I've added some docs to our main branch that call out the endianness we're talking about is within the UUID (where the fields are always big endian), rather than the machine, which may be either big or little endian. So when we're talking about little endian integers in the _le methods, we're talking about integer fields encoded as little endian coming from some other source (like a Microsoft GUID) that need to be flipped for an equivalent UUID. I'm feeling less worried about deprecating them after that.

I totally agree that handling any endianness at the read side probably makes the most sense if you're encoding them, but am not sure that means we'd need a from_bytes_le method. If you're encoding a UUID as its raw bytes then either:

Those 16 bytes are an opaque value and can be encoded exactly as the UUID gives them to you (which is big endian).
You want to read those fields individually as integers so will need to unpack the UUID to either big or little endian fields depending on your encoding using as_fields, in which case you'd probably need to read those fields back out individually to re-interpret them together as a UUID using from_fields.

The u128 methods specifically are a little odd, because there's no specified interpretation of a UUID as a 128bit number, but to keep the behavior consistent with other methods that accept or produce integers, we have a default big-endian interpretation and an unconditionally flipped little-endian one. Since we flip the bytes in whole integers, that means we flip the whole number and not just the fields encoded in it.

KodrAus · 2021-10-28T05:35:12Z

I'm just coming through some triage. I think the specific issue here is "resolved" as use Uuid::from_u128_le if you want to construct a Uuid from a 128bit number and flip it. It can be a little surprising, but the docs now try to call out why it's done this way. The _le methods aren't going anywhere.

I'll go ahead and close this one for now, but please feel free to re-open or create new issues for anything else that comes up!

triage-new-issues bot added the triage label May 12, 2019

kinggoesgaming added the Pending:Discussion label May 12, 2019

triage-new-issues bot removed the triage label May 12, 2019

kinggoesgaming mentioned this issue May 26, 2019

remove support for u128 #407

Merged

weekly-digest bot mentioned this issue Jun 2, 2019

Weekly Digest (26 May, 2019 - 2 June, 2019) #409

Closed

kinggoesgaming mentioned this issue Jul 4, 2019

add endian adapters #416

Closed

KodrAus mentioned this issue Oct 2, 2019

Breaking API features and refactoring for 0.8 #427

Merged

weekly-digest bot mentioned this issue Oct 6, 2019

Weekly Digest (29 September, 2019 - 6 October, 2019) #429

Closed

KodrAus mentioned this issue Aug 13, 2021

Add more docs on endianness #527

Merged

KodrAus closed this as completed Oct 28, 2021

davidhewitt mentioned this issue Jul 17, 2023

Add uuid validator pydantic/pydantic-core#772

Merged

4 tasks

This was referenced Jul 20, 2024

use little endian int fft001/pydantic-core#1

Open

use little endian int pydantic/pydantic-core#1372

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Differences in parsing UUIDs from uint128 in Python and Rust #406

Differences in parsing UUIDs from uint128 in Python and Rust #406

thedrow commented May 12, 2019

kinggoesgaming commented May 12, 2019

KodrAus commented May 13, 2019 •

edited

Loading

kinggoesgaming commented May 13, 2019

thedrow commented May 13, 2019

KodrAus commented May 13, 2019 •

edited

Loading

thedrow commented May 13, 2019

KodrAus commented May 13, 2019

kinggoesgaming commented May 13, 2019

kinggoesgaming commented May 13, 2019

Fleshgrinder commented Feb 24, 2020

chpio commented Feb 13, 2021

kinggoesgaming commented Feb 26, 2021

chpio commented Mar 8, 2021 •

edited

Loading

kinggoesgaming commented Apr 30, 2021 •

edited

Loading

KodrAus commented Aug 13, 2021

KodrAus commented Aug 13, 2021

chpio commented Aug 13, 2021 •

edited

Loading

thedrow commented Aug 15, 2021

KodrAus commented Aug 16, 2021

KodrAus commented Oct 28, 2021

Differences in parsing UUIDs from uint128 in Python and Rust #406

Differences in parsing UUIDs from uint128 in Python and Rust #406

Comments

thedrow commented May 12, 2019

kinggoesgaming commented May 12, 2019

KodrAus commented May 13, 2019 • edited Loading

kinggoesgaming commented May 13, 2019

thedrow commented May 13, 2019

KodrAus commented May 13, 2019 • edited Loading

thedrow commented May 13, 2019

KodrAus commented May 13, 2019

kinggoesgaming commented May 13, 2019

kinggoesgaming commented May 13, 2019

Fleshgrinder commented Feb 24, 2020

chpio commented Feb 13, 2021

kinggoesgaming commented Feb 26, 2021

chpio commented Mar 8, 2021 • edited Loading

kinggoesgaming commented Apr 30, 2021 • edited Loading

KodrAus commented Aug 13, 2021

KodrAus commented Aug 13, 2021

chpio commented Aug 13, 2021 • edited Loading

thedrow commented Aug 15, 2021

KodrAus commented Aug 16, 2021

KodrAus commented Oct 28, 2021

KodrAus commented May 13, 2019 •

edited

Loading

KodrAus commented May 13, 2019 •

edited

Loading

chpio commented Mar 8, 2021 •

edited

Loading

kinggoesgaming commented Apr 30, 2021 •

edited

Loading

chpio commented Aug 13, 2021 •

edited

Loading