-
Notifications
You must be signed in to change notification settings - Fork 1.1k
Support both 0x01 and 0x02 as type for list of booleans in thrift metadata #7052
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Changes from all commits
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -170,9 +170,13 @@ impl TInputProtocol for TCompactSliceInputProtocol<'_> { | |
| Some(b) => Ok(b), | ||
| None => { | ||
| let b = self.read_byte()?; | ||
| // Previous versions of the thrift specification said to use 0 and 1 inside collections, | ||
| // but that differed from existing implementations. | ||
| // The specification was updated in https://github.com/apache/thrift/commit/2c29c5665bc442e703480bb0ee60fe925ffe02e8. | ||
| // At least the go implementation seems to have followed the previously documented values. | ||
| match b { | ||
| 0x01 => Ok(true), | ||
| 0x02 => Ok(false), | ||
| 0x00 | 0x02 => Ok(false), | ||
| unkn => Err(thrift::Error::Protocol(thrift::ProtocolError { | ||
| kind: thrift::ProtocolErrorKind::InvalidData, | ||
| message: format!("cannot convert {} into bool", unkn), | ||
|
|
@@ -251,7 +255,12 @@ impl TInputProtocol for TCompactSliceInputProtocol<'_> { | |
|
|
||
| fn collection_u8_to_type(b: u8) -> thrift::Result<TType> { | ||
| match b { | ||
| 0x01 => Ok(TType::Bool), | ||
| // For historical and compatibility reasons, a reader should be capable to deal with both cases. | ||
| // The only valid value in the original spec was 2, but due to an widespread implementation bug | ||
| // the defacto standard across large parts of the library became 1 instead. | ||
| // As a result, both values are now allowed. | ||
| // https://github.com/apache/thrift/blob/master/doc/specs/thrift-compact-protocol.md#list-and-set | ||
| 0x01 | 0x02 => Ok(TType::Bool), | ||
|
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Upstream thrift doesn't seem to support 0x02 for this 🤔
Contributor
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I opened a PR just now: apache/thrift#3094 This might also affect parquet2 (currently unmaintained) and polars, and probably also other languages than rust.
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. FYI @ritchie46 or @orlp -- we found a bug in reading arguably malformed parquet files created by go that @jhorstmann fixed in parquet-rs. I did a quick scan in polars and didn't find the equivalent code (though I don't understand how There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I think I should forward that to our Parquet guy, @coastalwhite. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Thanks for the heads-up. Also ported the fix to Polars 😄 |
||
| o => u8_to_type(o), | ||
| } | ||
| } | ||
|
|
@@ -282,3 +291,52 @@ fn eof_error() -> thrift::Error { | |
| message: "Unexpected EOF".to_string(), | ||
| }) | ||
| } | ||
|
|
||
| #[cfg(test)] | ||
| mod tests { | ||
| use crate::format::{BoundaryOrder, ColumnIndex}; | ||
| use crate::thrift::{TCompactSliceInputProtocol, TSerializable}; | ||
|
|
||
| #[test] | ||
| pub fn read_boolean_list_field_type() { | ||
| // Boolean collection type encoded as 0x01, as used by this crate when writing. | ||
| // Values encoded as 1 (true) or 2 (false) as in the current version of the thrift | ||
| // documentation. | ||
| let bytes = vec![0x19, 0x21, 2, 1, 0x19, 8, 0x19, 8, 0x15, 0, 0]; | ||
|
|
||
| let mut protocol = TCompactSliceInputProtocol::new(bytes.as_slice()); | ||
| let index = ColumnIndex::read_from_in_protocol(&mut protocol).unwrap(); | ||
| let expected = ColumnIndex { | ||
| null_pages: vec![false, true], | ||
| min_values: vec![], | ||
| max_values: vec![], | ||
| boundary_order: BoundaryOrder::UNORDERED, | ||
| null_counts: None, | ||
| repetition_level_histograms: None, | ||
| definition_level_histograms: None, | ||
| }; | ||
|
|
||
| assert_eq!(&index, &expected); | ||
| } | ||
|
|
||
| #[test] | ||
| pub fn read_boolean_list_alternative_encoding() { | ||
| // Boolean collection type encoded as 0x02, as allowed by the spec. | ||
| // Values encoded as 1 (true) or 0 (false) as before the thrift documentation change on 2024-12-13. | ||
| let bytes = vec![0x19, 0x22, 0, 1, 0x19, 8, 0x19, 8, 0x15, 0, 0]; | ||
|
|
||
| let mut protocol = TCompactSliceInputProtocol::new(bytes.as_slice()); | ||
| let index = ColumnIndex::read_from_in_protocol(&mut protocol).unwrap(); | ||
| let expected = ColumnIndex { | ||
| null_pages: vec![false, true], | ||
| min_values: vec![], | ||
| max_values: vec![], | ||
| boundary_order: BoundaryOrder::UNORDERED, | ||
| null_counts: None, | ||
| repetition_level_histograms: None, | ||
| definition_level_histograms: None, | ||
| }; | ||
|
|
||
| assert_eq!(&index, &expected); | ||
| } | ||
| } | ||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It seems the upstream thrift implementation may simply treat anything that is not 0x1 as false:
https://github.com/apache/thrift/blob/a45618e05bbb2d29737514541b6d61f6850d9b16/lib/rs/src/protocol/binary.rs#L167-L169
This doesn't look like it has changed for 9 years 🤔
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That seems to be the (non-compact) binary protocol, which uses different encodings and also different tags for the types. I was thinking of doing the same here though, the error handling does not add much value.