fix: guarantee the deserialize order of struct is same as the struct type #795
base: main
crates/iceberg/src/spec/values.rs
Outdated
])),
&Type::Struct(StructType::new(vec![
NestedField::required(2, "id", Type::Primitive(PrimitiveType::Int)).into(),
NestedField::optional(3, "name", Type::Primitive(PrimitiveType::String)).into(),
NestedField::optional(4, "address", Type::Primitive(PrimitiveType::String)).into(),
NestedField::required(5, "extra", Type::Primitive(PrimitiveType::Int)).into(),
Hi, would you like to add a new test that covers the misordered cases?
Yes, but I found that such a test also passes on the original code. I'm trying to find a test case that fails on the original code.
Thank you, that will be really meaningful.
I found why the test passes on the original code: the Avro writer orders the record fields according to the schema: https://github.com/apache/avro-rs/blob/390a150bfc5999eb852c9c0ef40335612f1407b5/avro/src/encode.rs#L247.
However, if we serialize into other formats, e.g. JSON, the order is not guaranteed.
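To illustrate the difference described above, here is a minimal std-only sketch. It is not the avro-rs or iceberg-rust code: `Record`, `write_in_schema_order`, and `write_in_record_order` are hypothetical names, and values are simplified to `i64`. It only shows why a schema-driven writer hides the ordering problem while an order-preserving format exposes it.

```rust
/// Hypothetical stand-in for a record: (field name, value) pairs in arbitrary order.
type Record = Vec<(String, i64)>;

/// Emit values in schema order by looking each field up by name,
/// roughly what a schema-driven writer does. Returns `None` if a field is missing.
fn write_in_schema_order(schema: &[&str], record: &Record) -> Option<Vec<i64>> {
    schema
        .iter()
        .map(|name| {
            record
                .iter()
                .find(|(n, _)| n.as_str() == *name)
                .map(|(_, v)| *v)
        })
        .collect()
}

/// Emit values in whatever order the record holds them,
/// which is what a self-describing format may end up doing.
fn write_in_record_order(record: &Record) -> Vec<i64> {
    record.iter().map(|(_, v)| *v).collect()
}

fn main() {
    let schema = ["id", "extra"];
    // The record lists "extra" before "id", unlike the schema.
    let record: Record = vec![("extra".to_string(), 1000), ("id".to_string(), 1)];

    // The schema-driven writer restores the schema order.
    assert_eq!(write_in_schema_order(&schema, &record), Some(vec![1, 1000]));
    // The order-preserving writer keeps the record's own order.
    assert_eq!(write_in_record_order(&record), vec![1000, 1]);
}
```

With the schema-driven writer, a reader that assumes schema order never sees the mismatch, which would explain why the original test passed.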
744024d to 813c2b5
let deserialized: RawLiteral = serde_json::from_str(&serialized).unwrap();
let deserialized_literal = deserialized.try_into(&fields).unwrap().unwrap();

assert_eq!(expected_literal, deserialized_literal);
It looks like the whole serialization is off; this should be done by ID instead of by name.
I checked out the branch, and it is currently by name:
{
"id": 1,
"extra": 1000,
"name": "bar",
"address": null
}
While it should be:
{
"2": 1,
"3": "bar",
"4": null,
"5": 1000
}
This is not the same as JSON single-value serialization; JSON single-value serialization has its own implementation:
iceberg-rust/crates/iceberg/src/spec/values.rs
Line 1993 in 813c2b5
(Literal::Struct(s), Type::Struct(struct_type)) => {
This test case only exercises the normal Serialize implementation, which is mainly used internally for the Avro format. See https://docs.rs/avro-rs/latest/avro_rs/types/struct.Record.html; that is why the record here stores names and values.
I serialize it to JSON here only to test the reordering case.
Hi, I'm a bit confused about why we need to care about this. [De]serialization is a very format-specific task, and it's really challenging to ensure our implementations meet all format requirements. I'm a bit concerned about the additional cost we incur to achieve this. Doesn't it seem fine as long as it works well with Avro?
Ah, missed that. Let me unblock this for now 👍
Marking this as unresolved for my newly added comment: #795 (comment)
Sorry @Fokko 😄
Ah, around the same time! I share your concern, and I would like to check later on if we do the field-ID projection properly, but I didn't want to block the release 👍
For Avro it is very simple, it will always be decoded in the same order as the schema (otherwise it will just break). That said, we can rely on the order for V1, but use field-ID-based projection for V2 tables.
})?;
let value = value.try_into(&field.field_type)?;
Ok(value)
let mut value_map: HashMap<String, RawLiteralEnum> =
I don't quite understand why we ignore optional here?
Because during deserialization, None is converted to RawLiteralEnum::Null, so there is no need for optional here. Also, during deserialization we don't know the schema, so we don't know which values are optional and which are required; we only know which values are present and which are none.
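A rough sketch of the two-step idea described above, under simplifying assumptions: `Raw`, `Field`, `from_decoded`, and `convert` are hypothetical stand-ins (not the real `RawLiteralEnum` or `NestedField`), and the rule that a null is rejected for a required field in the second step is an assumption made for illustration.

```rust
/// Hypothetical stand-in for the schema-less raw value (not the real RawLiteralEnum).
#[derive(Debug, PartialEq)]
enum Raw {
    Null,
    Int(i32),
}

/// Hypothetical stand-in for a schema field.
struct Field {
    name: &'static str,
    required: bool,
}

/// Step 1: schema-less decoding. An absent value simply becomes `Null`;
/// nothing is known yet about optional vs. required.
fn from_decoded(value: Option<i32>) -> Raw {
    match value {
        Some(v) => Raw::Int(v),
        None => Raw::Null,
    }
}

/// Step 2: schema-aware conversion. Only here can a null be rejected
/// for a required field or accepted as `None` for an optional one.
fn convert(raw: Raw, field: &Field) -> Result<Option<i32>, String> {
    match raw {
        Raw::Int(v) => Ok(Some(v)),
        Raw::Null if field.required => {
            Err(format!("required field `{}` is null", field.name))
        }
        Raw::Null => Ok(None),
    }
}

fn main() {
    let id = Field { name: "id", required: true };
    let extra = Field { name: "extra", required: false };

    assert_eq!(convert(from_decoded(Some(1)), &id), Ok(Some(1)));
    assert_eq!(convert(from_decoded(None), &extra), Ok(None));
    assert!(convert(from_decoded(None), &id).is_err());
}
```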
813c2b5 to 424c429
We should deserialize according to the field order of the struct type rather than the order of the deserialized value.
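As a minimal sketch of that idea (not the actual change in this PR): collect the decoded name/value pairs into a `HashMap`, then walk the struct type's fields in order and pull each value out by name. `Raw` and `reorder_by_struct_type` are hypothetical names, and treating a missing field as `Null` is an assumption made for illustration.

```rust
use std::collections::HashMap;

/// Hypothetical stand-in for the raw decoded value (not the real RawLiteralEnum).
#[derive(Debug, Clone, PartialEq)]
enum Raw {
    Null,
    Int(i32),
    Str(String),
}

/// Return the values in the order given by the struct type's field names,
/// regardless of the order in which the pairs were decoded.
/// Assumption: a field absent from the decoded input is treated as `Null`.
fn reorder_by_struct_type(field_names: &[&str], decoded: Vec<(String, Raw)>) -> Vec<Raw> {
    let mut by_name: HashMap<String, Raw> = decoded.into_iter().collect();
    field_names
        .iter()
        .map(|name| by_name.remove(*name).unwrap_or(Raw::Null))
        .collect()
}

fn main() {
    // Field order of the struct type.
    let field_order = ["id", "name", "address", "extra"];
    // Decoded pairs arrive in a different order, as in the JSON output shown in the review.
    let decoded = vec![
        ("id".to_string(), Raw::Int(1)),
        ("extra".to_string(), Raw::Int(1000)),
        ("name".to_string(), Raw::Str("bar".to_string())),
        ("address".to_string(), Raw::Null),
    ];

    let ordered = reorder_by_struct_type(&field_order, decoded);
    assert_eq!(
        ordered,
        vec![
            Raw::Int(1),
            Raw::Str("bar".to_string()),
            Raw::Null,
            Raw::Int(1000),
        ]
    );
}
```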