Extension type support #2444

Closed
Renkai opened this issue Aug 14, 2022 · 5 comments
Labels
enhancement Any new improvement worthy of an entry in the changelog
parquet Changes to the parquet crate

Comments

Renkai commented Aug 14, 2022

Is your feature request related to a problem or challenge? Please describe what you are trying to do.

My friends wrote a data format, lance, that is interconvertible with parquet, and I want to write another implementation of it in Rust.
However, they used the Arrow EXTENSION type, which does not seem to be implemented in arrow-rs yet.

Describe the solution you'd like

Let the reader convert parquet files that use the EXTENSION type to Arrow.
Describe alternatives you've considered

Additional context

Python code I used to generate such a file.

# with pyarrow-9.0.0
import pyarrow as pa


class UuidType(pa.PyExtensionType):

    def __init__(self):
        pa.PyExtensionType.__init__(self, pa.binary(16))

    def __reduce__(self):
        return UuidType, ()


if __name__ == '__main__':
    uuid_type = UuidType()
    print(uuid_type.extension_name)
    print(uuid_type.storage_type)
    import uuid

    storage_array = pa.array([uuid.uuid4().bytes for _ in range(4)], pa.binary(16))
    arr = pa.ExtensionArray.from_storage(uuid_type, storage_array)
    print(arr)
    table = pa.Table.from_arrays([arr], names=["uuid"])
    import pyarrow.parquet as pq

    pq.write_table(table, "extension_example.parquet")
   
    # successfully read and print
    parquet_table = pq.read_table('extension_example.parquet')
    print("schema", parquet_table.schema)
    print("table", parquet_table)

Rust code that fails to read it:

let input_file_name = "extension_example.parquet";
//https://docs.rs/parquet/19.0.0/parquet/arrow/index.html
use arrow::record_batch::RecordBatchReader;
use parquet::arrow::{ParquetFileArrowReader, ArrowReader, ProjectionMask};
use std::fs::File;

let file = File::open(input_file_name).unwrap();

let mut arrow_reader = ParquetFileArrowReader::try_new(file).unwrap();
let mask = ProjectionMask::leaves(arrow_reader.parquet_schema(), [0]);
println!("parquet schema is: {:?}", arrow_reader.parquet_schema());
println!("Converted arrow schema is: {}", arrow_reader.get_schema().unwrap());

Error log:

thread 'tests::test_convert' panicked at 'called `Result::unwrap()` on an `Err` value: ArrowError("Unable to get root as message stored in ARROW:schema: Utf8Error { error: Utf8Error { valid_up_to: 0, error_len: Some(1) }, range: 216..255, error_trace: ErrorTrace([TableField { field_name: \"value\", position: 208 }, VectorElement { index: 0, position: 116 }, TableField { field_name: \"custom_metadata\", position: 92 }, VectorElement { index: 0, position: 48 }, TableField { field_name: \"fields\", position: 40 }, UnionVariant { variant: \"MessageHeader::Schema\", position: 24 }, TableField { field_name: \"header\", position: 24 }]) }")', src/lib.rs:39:77
Renkai added the enhancement label Aug 14, 2022

tustvold commented Aug 14, 2022

Thank you for the example; I can reproduce this. The parquet file is produced with the following base64-encoded arrow schema:

"/////zABAAAQAAAAAAAKAAwABgAFAAgACgAAAAABBAAMAAAACAAIAAAABAAIAAAABAAAAAEAAAAYAAAAAAASABgACAAGAAcADAAAABAAFAASAAAAAAABDxQAAADUAAAACAAAABQAAAAAAAAABAAAAHV1aWQAAAAAAgAAAFQAAAAEAAAAvP///wgAAAAgAAAAFAAAAEFSUk9XOmV4dGVuc2lvbjpuYW1lAAAAABcAAABhcnJvdy5weV9leHRlbnNpb25fdHlwZQAIAAwABAAIAAgAAAAIAAAAJAAAABgAAABBUlJPVzpleHRlbnNpb246bWV0YWRhdGEAAAAAJwAAAIAElRwAAAAAAAAAjAhfX21haW5fX5SMCFV1aWRUeXBllJOUKVKULgAAAAYACAAEAAYAAAAQAAAA"

I took this and converted it to its raw data with

echo "..." | base64 --decode > /tmp/output.bin

I trimmed the first 8 bytes (as they're arrow-specific delimiter information)

tail -c +9 /tmp/output.bin > /tmp/output-trimmed.bin

I then decoded the flatbuffer

flatc --json --raw-binary format/Message.fbs -- /tmp/output-trimmed.bin

And got the error

  Unable to generate text for output-trimmed

This fits with the error that the Rust implementation is returning: the schema has non-UTF-8 data encoded in a string field, which is technically illegal.

If I tell flatc to ignore this

flatc  --allow-non-utf8 --json --raw-binary format/Message.fbs -- /tmp/output-trimmed.bin

I get the decoded data

$ cat output-trimmed.json 
{
  version: "V5",
  header_type: "Schema",
  header: {
    fields: [
      {
        name: "uuid",
        nullable: true,
        type_type: "FixedSizeBinary",
        type: {
          byteWidth: 16
        },
        children: [

        ],
        custom_metadata: [
          {
            key: "ARROW:extension:metadata",
            value: "\x80\u0004\x95\u001C\u0000\u0000\u0000\u0000\u0000\u0000\u0000\x8C\b__main__\x94\x8C\bUuidType\x94\x93\x94)R\x94."
          },
          {
            key: "ARROW:extension:name",
            value: "arrow.py_extension_type"
          }
        ]
      }
    ]
  }
}

I'm not really sure what to make of this; KeyValue is defined as

table KeyValue {
  key: string;
  value: string;
}

Which, as per the flatbuffer spec, must be valid UTF-8. I will try to get some clarity on what is going on here; my understanding of the specification is that the Rust implementation is correct to refuse this schema...
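
As a minimal standalone illustration of that validation (not arrow-rs code): the pickled payload begins with byte 0x80, which Rust's UTF-8 check rejects immediately, matching the valid_up_to: 0, error_len: Some(1) in the error above.

fn main() {
    // The ARROW:extension:metadata value starts with the pickle protocol
    // marker bytes 0x80 0x04 0x95, which are not valid UTF-8.
    let pickled_prefix: &[u8] = &[0x80, 0x04, 0x95, 0x1C, 0x00];

    // A Rust &str must be valid UTF-8, so validation fails on the first byte,
    // just like the flatbuffer verifier does for the string field.
    match std::str::from_utf8(pickled_prefix) {
        Ok(s) => println!("valid UTF-8: {:?}", s),
        Err(e) => println!("invalid UTF-8: {:?}", e), // valid_up_to: 0, error_len: Some(1)
    }
}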


tustvold commented Aug 15, 2022

OK, so it would appear that this is a known issue where pyarrow writes ill-formed flatbuffers (here) for extension types. There isn't really much we can do about it: a flatbuffer string field should not contain non-UTF-8 data, and in Rust permitting this would not be sound (it could lead to UB). Having spoken with @jorgecarleitao, I'm led to believe arrow2 also takes the approach of rejecting this.

The proper solution to the problem is for pyarrow to either base64-encode the payloads, or for the arrow specification to change KeyValue.value to be bytes rather than string. Both are probably going to be difficult to sell...
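
For illustration only, a small sketch of the first option, assuming the base64 crate's 0.21-style Engine API: base64 output is plain ASCII, so the wrapped payload would always be a valid flatbuffer string.

use base64::{engine::general_purpose::STANDARD, Engine as _};

fn main() {
    // Raw, non-UTF-8 extension metadata (the pickle protocol marker bytes).
    let payload: &[u8] = &[0x80, 0x04, 0x95, 0x1C, 0x00];

    // Base64 text is pure ASCII and therefore always valid UTF-8,
    // so it is safe to store in a flatbuffer string field.
    let encoded = STANDARD.encode(payload);
    println!("encoded: {}", encoded);

    // A reader aware of the convention can recover the original bytes.
    let decoded = STANDARD.decode(&encoded).unwrap();
    assert_eq!(decoded, payload);
}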

That being said, the embedded metadata is a pickled Python class, which likely isn't hugely useful to a Rust client anyway. Perhaps you could use skip_arrow_metadata to tell the parquet reader to ignore the malformed embedded arrow schema and just infer the types from the underlying parquet schema?
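
A minimal sketch of that approach, assuming a recent parquet release where reading goes through ParquetRecordBatchReaderBuilder (the successor to the ParquetFileArrowReader API shown above) and ArrowReaderOptions:

use parquet::arrow::arrow_reader::{ArrowReaderOptions, ParquetRecordBatchReaderBuilder};
use std::fs::File;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let file = File::open("extension_example.parquet")?;

    // Skip the embedded (here: malformed) ARROW:schema metadata and let the
    // reader infer the Arrow schema from the parquet schema instead.
    let options = ArrowReaderOptions::new().with_skip_arrow_metadata(true);
    let builder = ParquetRecordBatchReaderBuilder::try_new_with_options(file, options)?;
    println!("Inferred arrow schema: {:?}", builder.schema());

    for batch in builder.build()? {
        println!("read {} rows", batch?.num_rows());
    }
    Ok(())
}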


Renkai commented Aug 16, 2022

@tustvold Thanks a lot!

I replaced the generator with the one below; it basically changes pa.PyExtensionType to pa.ExtensionType. The Rust parquet parser works well now, except that it reads the data type as FixedSizeBinary(16) (see the reader sketch after the script below). I think it's a slight difference in behavior from the C++ parser. In practice I can continue my adventure, but would you consider making the implementations less divergent?

import pyarrow as pa


class UuidType(pa.ExtensionType):

    def __init__(self):
        pa.ExtensionType.__init__(self, pa.binary(16), "lance.uuid")

    def __arrow_ext_serialize__(self):
        # since we don't have a parameterized type, we don't need extra
        # metadata to be deserialized
        return b''

    @classmethod
    def __arrow_ext_deserialize__(cls, storage_type, serialized):
        # return an instance of this subclass given the serialized
        # metadata.
        return UuidType()


if __name__ == '__main__':
    uuid_type = UuidType()
    print(uuid_type.extension_name)
    print(uuid_type.storage_type)
    import uuid

    storage_array = pa.array([uuid.uuid4().bytes for _ in range(4)], pa.binary(16))
    arr = pa.ExtensionArray.from_storage(uuid_type, storage_array)
    print(arr)
    table = pa.Table.from_arrays([arr], names=["uuid"])
    import pyarrow.parquet as pq

    pq.write_table(table, "extension_example.parquet")
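
For reference, a minimal sketch (assuming a recent parquet release with the ParquetRecordBatchReaderBuilder API) of inspecting what the Rust reader infers for this file; the uuid column should come back as FixedSizeBinary(16), with the extension name visible only through the field's metadata map:

use parquet::arrow::arrow_reader::ParquetRecordBatchReaderBuilder;
use std::fs::File;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let file = File::open("extension_example.parquet")?;
    let builder = ParquetRecordBatchReaderBuilder::try_new(file)?;

    for field in builder.schema().fields().iter() {
        // Expected for the "uuid" column: data type FixedSizeBinary(16) and
        // metadata containing "ARROW:extension:name" -> "lance.uuid".
        println!(
            "field {}: type {:?}, metadata {:?}",
            field.name(),
            field.data_type(),
            field.metadata()
        );
    }
    Ok(())
}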

tustvold commented

> I think it's a slight difference in behavior from the C++ parser

What is the behavior of the C++ parser?

> Would you consider making the implementations less divergent?

Happy to be corrected on this, but I don't believe there is a standard for extension types; one could conceivably make the case that such a concept would be an oxymoron... If there is a general-purpose data type missing from the standard set, I'm sure the community would be willing to consider additions to the arrow specification, and that would be the path to portability for that data type.

There is always going to be a trade-off between expressiveness and portability, with extension types sacrificing the latter in favour of the former. I'm not sure there is a way around this... FWIW, just supporting the standard arrow types is rife with excitement #1666


tustvold commented Jun 1, 2023

Closing this as I don't believe it is tracking any missing functionality; feel free to reopen if I am mistaken.
