Flatbuffers Utf8 error when reading PyArrow written feather file #5547

Open
Jefffrey opened this issue Mar 24, 2024 · 1 comment
@Jefffrey
Contributor

Describe the bug

Given a Feather file written by PyArrow, reading it with the arrow-ipc reader fails with a flatbuffers ParseError caused by invalid UTF-8 in a custom_metadata value of the footer schema.

To Reproduce

Given this ORC file:

https://github.com/apache/orc/blob/fa627ec6d7c72289c8a83632e6a43ae48603fc4b/examples/TestOrcFile.metaData.orc

Use PyArrow 15.0.0 to read it and write it out as a Feather file:

>>> from pyarrow import feather, orc
>>> table = orc.read_table("TestOrcFile.metaData.orc")
>>> feather.write_feather(table, "/tmp/test.feather")
>>> feather.read_table("/tmp/test.feather")
pyarrow.Table
boolean1: bool
byte1: int8
short1: int16
int1: int32
long1: int64
float1: float
double1: double
bytes1: binary
string1: string
middle: struct<list: list<item: struct<int1: int32, string1: string>>>
  child 0, list: list<item: struct<int1: int32, string1: string>>
      child 0, item: struct<int1: int32, string1: string>
          child 0, int1: int32
          child 1, string1: string
list: list<item: struct<int1: int32, string1: string>>
  child 0, item: struct<int1: int32, string1: string>
      child 0, int1: int32
      child 1, string1: string
map: map<string, struct<int1: int32, string1: string>>
  child 0, entries: struct<key: string not null, value: struct<int1: int32, string1: string>> not null
      child 0, key: string not null
      child 1, value: struct<int1: int32, string1: string>
          child 0, int1: int32
          child 1, string1: string
----
boolean1: [[true]]
byte1: [[127]]
short1: [[1024]]
int1: [[42]]
long1: [[45097156608]]
float1: [[3.1415]]
double1: [[-2.713]]
bytes1: [[null]]
string1: [[null]]
middle: [
  -- is_valid:  [false]
  -- child 0 type: list<item: struct<int1: int32, string1: string>>
[null]]
...
>>>

Then try to read this file with arrow-ipc:

    #[test]
    fn test_123() {
        let _ = FileReaderBuilder::new()
            .build(std::fs::File::open("/tmp/test.feather").unwrap())
            .unwrap();
    }

It fails with this error:

arrow-rs$ cargo test -p arrow-ipc --lib reader::tests::test_123
    Finished test [unoptimized + debuginfo] target(s) in 0.05s
     Running unittests src/lib.rs (target/debug/deps/arrow_ipc-b6339780ea47b538)

running 1 test
test reader::tests::test_123 ... FAILED

failures:

---- reader::tests::test_123 stdout ----
thread 'reader::tests::test_123' panicked at arrow-ipc/src/reader.rs:1862:14:
called `Result::unwrap()` on an `Err` value: ParseError("Unable to get root as footer: Utf8Error { error: Utf8Error { valid_up_to: 1, error_len: Some(1) }, range: 208..40208, error_trace: ErrorTrace([TableField { field_name: \"value\", position: 200 }, VectorElement { index: 0, position: 96 }, TableField { field_name: \"custom_metadata\", position: 88 }, TableField { field_name: \"schema\", position: 24 }]) }")
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace


failures:
    reader::tests::test_123

test result: FAILED. 0 passed; 1 failed; 0 ignored; 0 measured; 36 filtered out; finished in 0.00s

error: test failed, to rerun pass `-p arrow-ipc --lib`
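
For readers unfamiliar with the `valid_up_to: 1, error_len: Some(1)` part of the panic message: it says the first byte of the value decoded as UTF-8 and the second byte did not. A minimal Python sketch of the same kind of failure, using hypothetical bytes (the actual metadata value in the file is not shown in this issue) and a hypothetical helper `utf8_error_position`:

```python
def utf8_error_position(data: bytes):
    """Return (valid_up_to, error_len) for the first UTF-8 error, or None if valid."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError as exc:
        # exc.start is the offset of the first bad byte; exc.end is one past it.
        return exc.start, exc.end - exc.start
    return None


# Assumption: made-up bytes, one valid ASCII byte followed by 0xFF,
# which can never appear in well-formed UTF-8.
raw = b"a\xff" + b"rest"
print(utf8_error_position(raw))
```

Any binary blob stored in a flatbuffers `string` field can trip the verifier this way, regardless of what the bytes were meant to represent.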

Expected behavior

The file should be read successfully.

Additional context

Though the error likely lies upstream with flatbuffers, maybe we could let the IPC reader ignore invalid custom_metadata via user configuration?
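
Until something like that exists on the reader side, one possible workaround is on the writer side: make every metadata value valid UTF-8 before writing the Feather file. The sketch below is an assumption, not an existing PyArrow or arrow-rs feature; `sanitize_metadata` is a hypothetical helper that works on a plain `bytes -> bytes` mapping, and it only checks values, not keys.

```python
import base64


def sanitize_metadata(metadata, encode=True):
    """Return a copy of a bytes->bytes mapping whose values are all valid UTF-8.

    Values that are not valid UTF-8 are base64-encoded when encode=True,
    otherwise they are dropped. Keys are passed through unchanged.
    """
    clean = {}
    for key, value in (metadata or {}).items():
        try:
            value.decode("utf-8")
            clean[key] = value
        except UnicodeDecodeError:
            if encode:
                clean[key] = base64.b64encode(value)
    return clean


# Hypothetical metadata: one textual value and one binary value.
example = {b"note": b"hello", b"blob": b"\x01\xff\xfe"}
print(sanitize_metadata(example))
print(sanitize_metadata(example, encode=False))
```

With PyArrow this could presumably be applied as `table.replace_schema_metadata(sanitize_metadata(table.schema.metadata))` before calling `feather.write_feather`. That assumes the offending value lives in the schema-level metadata, as the error trace suggests, and it has not been tested against this file.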

@Jefffrey Jefffrey added the bug label Mar 24, 2024
@tustvold
Contributor

tustvold commented Mar 24, 2024

This sounds like the old issue of Python extension types storing pickle data in a UTF-8 field without escaping it. This is an upstream bug IMO, but I would not be averse to finding some way to skip such data.

See #2444

See also apache/arrow#20107
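
To illustrate why raw pickle data cannot sit in a UTF-8 string field, here is a small standard-library sketch. The pickled object is made up, and `is_valid_utf8` is a hypothetical helper; whether the value in this particular file is pickle data is not established in this thread.

```python
import pickle


def is_valid_utf8(data: bytes) -> bool:
    """Report whether the bytes decode cleanly as UTF-8."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


# Pickle protocol 2 and later start with the PROTO opcode 0x80, which is a
# UTF-8 continuation byte and therefore invalid at the start of a string.
payload = pickle.dumps({"ext": "example"}, protocol=2)
print(payload[:2])
print(is_valid_utf8(payload))
```

Escaping such a payload (for example with base64) before storing it would keep the field valid UTF-8, at the cost of readers needing to know about the encoding.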
