[Python] Improving Deserialization Speed for PyArrow to Python Objects #45113

Open
lllangWV opened this issue Dec 27, 2024 · 0 comments

Describe the usage question you have. Please include as many useful details as possible.

Title: Improving Deserialization Speed for PyArrow to Python Objects

Hello,

I am working with materials data stored in Parquet files, where the structure column contains serialized dictionaries representing Structure objects from the pymatgen package. The Structure class stores site and lattice information and provides a .to_dict() method for serialization.

I have a dataset of ~80,000 structures. To deserialize these into Structure objects, I use the following process:

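import pyarrow.dataset as ds
from pymatgen.core import Structure

dataset = ds.dataset(dataset_dir, format="parquet")  # dataset_dir: path to the Parquet files
table = dataset.to_table(columns=["structure"])
df = table.to_pandas()  # ~8.20 seconds
df["structure_py"] = df["structure"].map(Structure.from_dict)  # ~116 seconds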

The majority of the time is spent mapping the dictionaries to Structure objects via Structure.from_dict. I attempted using pa.ExtensionArray and pa.ExtensionType to optimize this process but achieved similar performance, as the bottleneck appears to be in the Structure.from_dict calls.
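
One way to confirm where the time goes, independently of PyArrow, is to time each stage on its own. The sketch below is a minimal, standard-library-only illustration: `FakeStructure`, `make_record`, and `timed` are hypothetical stand-ins (not pymatgen or PyArrow APIs), so the absolute numbers mean nothing; only the pattern of timing the load step and the per-row object construction separately carries over.

```python
import time
from dataclasses import dataclass


@dataclass
class FakeStructure:
    """Hypothetical stand-in for pymatgen's Structure (assumption, not the real class)."""
    lattice: dict
    sites: list

    @classmethod
    def from_dict(cls, d: dict) -> "FakeStructure":
        # Per-row Python work, analogous in spirit to Structure.from_dict.
        return cls(lattice=dict(d["lattice"]), sites=[dict(s) for s in d["sites"]])


def make_record(i: int) -> dict:
    """Build one dict shaped loosely like a row of the 'structure' column."""
    return {
        "lattice": {"a": 1.0 + i, "b": 2.0, "c": 3.0},
        "sites": [{"label": "Si", "xyz": [0.0, 0.0, 0.0]} for _ in range(4)],
    }


def timed(fn):
    """Call fn() once and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = fn()
    return result, time.perf_counter() - start


# Stage 1: produce the raw dicts (stands in for to_table().to_pandas()).
records, t_load = timed(lambda: [make_record(i) for i in range(1000)])
# Stage 2: build one Python object per row (stands in for .map(Structure.from_dict)).
objects, t_build = timed(lambda: [FakeStructure.from_dict(r) for r in records])
print(f"load: {t_load:.4f}s  build: {t_build:.4f}s")
```

If the second stage dominates with the real `Structure.from_dict`, as the timings above suggest, the cost is in Python-level object construction rather than in reading the Parquet data.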

Here's an example of my ExtensionType implementation:

import pyarrow as pa
from pymatgen.core import Structure

class StructureType(pa.ExtensionType):
    def __init__(self, data_type: pa.DataType):
        if not pa.types.is_struct(data_type):
            raise TypeError(f"data_type must be a struct type, not {data_type}")
        super().__init__(data_type, "matgraphdb.structure")

    def __arrow_ext_serialize__(self) -> bytes:
        # No extra metadata: the storage type fully describes the column.
        return b""

    @classmethod
    def __arrow_ext_deserialize__(cls, storage_type, serialized):
        assert pa.types.is_struct(storage_type)
        return cls(storage_type)

    def __arrow_ext_class__(self):
        return StructureArray

class StructureArray(pa.ExtensionArray):
    def to_structure(self):
        # Still one Structure.from_dict call per row, so no faster than the pandas version.
        return self.storage.to_pandas().map(Structure.from_dict)
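
The `to_structure` method above converts every row eagerly. If only some rows are ever accessed, one possible mitigation is to defer the conversion until a row is requested. The following is a minimal sketch under that assumption: `LazyObjects` and `fake_from_dict` are hypothetical names introduced for illustration, not part of PyArrow or pymatgen, and the sketch does not make any individual `from_dict` call faster.

```python
from collections.abc import Sequence


class LazyObjects(Sequence):
    """Build objects from raw dicts on first access and cache them.

    Hypothetical helper for illustration (not a PyArrow or pymatgen API).
    """

    def __init__(self, raw_rows, factory):
        self._raw = list(raw_rows)
        self._factory = factory
        self._cache = {}

    def __len__(self):
        return len(self._raw)

    def __getitem__(self, index):
        if isinstance(index, slice):
            return [self[i] for i in range(*index.indices(len(self)))]
        if index < 0:
            index += len(self)
        if not 0 <= index < len(self):
            raise IndexError(index)
        if index not in self._cache:
            # The expensive conversion happens here, once per accessed row.
            self._cache[index] = self._factory(self._raw[index])
        return self._cache[index]


calls = []


def fake_from_dict(d):
    # Stand-in for Structure.from_dict; records each call so the laziness is visible.
    calls.append(d["id"])
    return ("structure", d["id"])


rows = [{"id": i} for i in range(5)]
lazy = LazyObjects(rows, fake_from_dict)
first = lazy[0]   # converts row 0
again = lazy[0]   # served from the cache, no second conversion
last = lazy[-1]   # converts row 4 only
```

In the setting above, `raw_rows` could plausibly be the list of dicts obtained from the storage array and `factory` could be `Structure.from_dict`. Iterating over every row still pays the full conversion cost, so this helps only when access is sparse.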

Despite these efforts, the deserialization time remains substantial. Below is the type of the structure column:

struct<@class: string, @module: string, charge: double, lattice: struct<a: double, alpha: double, b: double, beta: double, c: double, gamma: double, ...>, sites: list<element: struct<...>>>

Is there a recommended approach within PyArrow to speed up deserialization of such complex structured data into Python objects?

Best regards,
Logan Lang

Component(s)

Parquet, Python

@lllangWV lllangWV added the Type: usage Issue is a user question label Dec 27, 2024
@kou kou changed the title Improving Deserialization Speed for PyArrow to Python Objects [Python] Improving Deserialization Speed for PyArrow to Python Objects Dec 29, 2024