Flatbuffers Utf8 error when reading PyArrow written feather file #5547

Jefffrey · 2024-03-24T03:30:59Z

Describe the bug

Given a Feather file written by PyArrow, reading it with the arrow-ipc reader fails with a flatbuffers ParseError caused by invalid UTF-8.

To Reproduce

Given this ORC file:

https://github.com/apache/orc/blob/fa627ec6d7c72289c8a83632e6a43ae48603fc4b/examples/TestOrcFile.metaData.orc

Use PyArrow 15.0.0 to read it and write it back out as a Feather file:

>>> from pyarrow import feather, orc
>>> table = orc.read_table("TestOrcFile.metaData.orc")
>>> feather.write_feather(table, "/tmp/test.feather")
>>> feather.read_table("/tmp/test.feather")
pyarrow.Table
boolean1: bool
byte1: int8
short1: int16
int1: int32
long1: int64
float1: float
double1: double
bytes1: binary
string1: string
middle: struct<list: list<item: struct<int1: int32, string1: string>>>
  child 0, list: list<item: struct<int1: int32, string1: string>>
      child 0, item: struct<int1: int32, string1: string>
          child 0, int1: int32
          child 1, string1: string
list: list<item: struct<int1: int32, string1: string>>
  child 0, item: struct<int1: int32, string1: string>
      child 0, int1: int32
      child 1, string1: string
map: map<string, struct<int1: int32, string1: string>>
  child 0, entries: struct<key: string not null, value: struct<int1: int32, string1: string>> not null
      child 0, key: string not null
      child 1, value: struct<int1: int32, string1: string>
          child 0, int1: int32
          child 1, string1: string
----
boolean1: [[true]]
byte1: [[127]]
short1: [[1024]]
int1: [[42]]
long1: [[45097156608]]
float1: [[3.1415]]
double1: [[-2.713]]
bytes1: [[null]]
string1: [[null]]
middle: [
  -- is_valid:  [false]
  -- child 0 type: list<item: struct<int1: int32, string1: string>>
[null]]
...
>>>

Then trying to read this file with arrow-ipc:

    #[test]
    fn test_123() {
        let _ = FileReaderBuilder::new()
            .build(std::fs::File::open("/tmp/test.feather").unwrap())
            .unwrap();
    }

It fails with this error:

arrow-rs$ cargo test -p arrow-ipc --lib reader::tests::test_123
    Finished test [unoptimized + debuginfo] target(s) in 0.05s
     Running unittests src/lib.rs (target/debug/deps/arrow_ipc-b6339780ea47b538)

running 1 test
test reader::tests::test_123 ... FAILED

failures:

---- reader::tests::test_123 stdout ----
thread 'reader::tests::test_123' panicked at arrow-ipc/src/reader.rs:1862:14:
called `Result::unwrap()` on an `Err` value: ParseError("Unable to get root as footer: Utf8Error { error: Utf8Error { valid_up_to: 1, error_len: Some(1) }, range: 208..40208, error_trace: ErrorTrace([TableField { field_name: \"value\", position: 200 }, VectorElement { index: 0, position: 96 }, TableField { field_name: \"custom_metadata\", position: 88 }, TableField { field_name: \"schema\", position: 24 }]) }")
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace


failures:
    reader::tests::test_123

test result: FAILED. 0 passed; 1 failed; 0 ignored; 0 measured; 36 filtered out; finished in 0.00s

error: test failed, to rerun pass `-p arrow-ipc --lib`
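
For readers unfamiliar with the `Utf8Error` fields in the panic message: `valid_up_to: 1, error_len: Some(1)` says that the first byte of the offending `custom_metadata` value decoded fine and the byte right after it is not valid UTF-8. The sketch below is a rough Python analogue of that report. The sample bytes are made up for illustration (they are not taken from the file), and `utf8_failure` is a hypothetical helper name.

```python
def utf8_failure(data: bytes):
    """Return (valid_up_to, error_len) for the first invalid UTF-8
    sequence in `data`, or None if `data` decodes cleanly."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError as exc:
        # exc.start is the offset of the first bad byte,
        # exc.end is one past the last byte of the bad sequence.
        return exc.start, exc.end - exc.start
    return None


# Illustrative bytes only: one valid ASCII byte, then 0xFF,
# which can never appear in well-formed UTF-8.
sample = b"a\xff rest"
print(utf8_failure(sample))
```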

Expected behavior

The file should be read successfully.

Additional context

Although the error likely lies upstream with flatbuffers, maybe we could let the IPC reader ignore invalid custom_metadata via user configuration?
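
Until the reader has such an option, one possible workaround on the writing side is to drop metadata entries that are not valid UTF-8 before the file is written. The sketch below assumes the offending bytes live in the schema-level metadata, which PyArrow exposes as a mapping of `bytes` keys to `bytes` values. `split_utf8_metadata` is a hypothetical helper, not part of PyArrow or arrow-rs, and it has not been run against the file in this report.

```python
def split_utf8_metadata(metadata):
    """Split a bytes -> bytes mapping into the entries that are valid
    UTF-8 and the keys of the entries that are not.

    `metadata` may be None, which is treated as an empty mapping.
    """
    safe, rejected = {}, []
    for key, value in (metadata or {}).items():
        try:
            key.decode("utf-8")
            value.decode("utf-8")
        except UnicodeDecodeError:
            rejected.append(key)
        else:
            safe[key] = value
    return safe, rejected
```

With PyArrow, the first element of the result could then be applied with `table.replace_schema_metadata(safe)` before calling `feather.write_feather`, at the cost of losing the rejected entries.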

tustvold · 2024-03-24T17:28:22Z

This sounds like the old issue of Python extension types storing pickle data in a UTF-8 field without escaping it. This is an upstream bug IMO, but I would not be averse to finding some way to skip such data.

See #2444
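
To illustrate what escaping would mean here: arbitrary binary data can be made safe for a UTF-8 string field by encoding it first, for example with base64. This is only a sketch of the general idea, not what PyArrow does or is known to plan. Both function names are hypothetical.

```python
import base64


def escape_binary(payload: bytes) -> str:
    """Encode arbitrary bytes as an ASCII (and therefore UTF-8) string."""
    return base64.b64encode(payload).decode("ascii")


def unescape_binary(text: str) -> bytes:
    """Recover the original bytes from a string made by escape_binary."""
    return base64.b64decode(text.encode("ascii"))


# Arbitrary non-UTF-8 bytes used purely as an example payload.
raw = b"\x80\x04\x95"
stored = escape_binary(raw)
assert unescape_binary(stored) == raw
```

The trade-off is that a reader must know the field is base64-encoded in order to recover the original bytes.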
