Describe the usage question you have. Please include as many useful details as possible.
Title: Improving Deserialization Speed for PyArrow to Python Objects
Hello,
I am working with materials data stored in Parquet files, where a column named structure contains serialized dictionaries representing instances of the Structure class from the pymatgen package. This class stores site and lattice information and provides a .to_dict() method for serialization.
I have a dataset of ~80,000 structures. To deserialize these into Structure objects, I use the following process:
The majority of the time is spent mapping the dictionaries to Structure objects via Structure.from_dict. I attempted using pa.ExtensionArray and pa.ExtensionType to optimize this process but achieved similar performance, as the bottleneck appears to be in the Structure.from_dict calls.
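One way to confirm where the time goes is to time the two stages separately: materializing the dictionaries, then building objects from them. The following is only a minimal sketch using the standard library. FakeStructure, make_fake_dicts, timed, and profile_stages are hypothetical names introduced for illustration; FakeStructure is a stand-in and does not reproduce what pymatgen's Structure.from_dict actually does.

```python
import time
from dataclasses import dataclass


@dataclass
class FakeStructure:
    """Hypothetical stand-in for pymatgen's Structure (not the real class)."""
    lattice: list
    sites: list

    @classmethod
    def from_dict(cls, d: dict) -> "FakeStructure":
        # The real Structure.from_dict is assumed to do much more work per call.
        return cls(lattice=d["lattice"], sites=d["sites"])


def make_fake_dicts(n: int) -> list:
    """Build n placeholder dictionaries shaped loosely like serialized structures."""
    lattice = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
    site = {"species": "Si", "xyz": [0.0, 0.0, 0.0]}
    return [{"lattice": lattice, "sites": [site]} for _ in range(n)]


def timed(fn):
    """Run fn() and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = fn()
    return result, time.perf_counter() - start


def profile_stages(load_dicts, from_dict):
    """Time dictionary materialization and object construction separately."""
    dicts, t_load = timed(load_dicts)
    objects, t_build = timed(lambda: [from_dict(d) for d in dicts])
    return objects, {"load_dicts": t_load, "build_objects": t_build}


# With real data, load_dicts might instead be something like
# lambda: table["structure"].to_pylist()  (an assumption about the caller's table).
objects, timings = profile_stages(lambda: make_fake_dicts(1000), FakeStructure.from_dict)
print(timings)
```

If build_objects dominates when the real Structure.from_dict is passed in, the cost lies in Python object construction rather than in reading the Arrow data, which would match the observation above.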
Here's an example of my ExtensionType implementation:
import pyarrow as pa
from pymatgen.core import Structure


class StructureType(pa.ExtensionType):
    def __init__(self, data_type: pa.DataType):
        if not pa.types.is_struct(data_type):
            raise TypeError(f"data_type must be a struct type, not {data_type}")
        super().__init__(data_type, "matgraphdb.structure")

    def __arrow_ext_serialize__(self) -> bytes:
        return b""

    @classmethod
    def __arrow_ext_deserialize__(cls, storage_type, serialized):
        assert pa.types.is_struct(storage_type)
        return StructureType(storage_type)

    def __arrow_ext_class__(self):
        return StructureArray


class StructureArray(pa.ExtensionArray):
    def to_structure(self):
        # Convert the struct storage to a pandas Series of dicts, then build objects.
        return self.storage.to_pandas().map(Structure.from_dict)
Despite these efforts, the deserialization time remains substantial. Below is the type of the structure column:
Is there a recommended approach within PyArrow to speed up deserialization of such complex structured data into Python objects?
Best regards,
Logan Lang
Component(s)
Parquet, Python
kou changed the title from "Improving Deserialization Speed for PyArrow to Python Objects" to "[Python] Improving Deserialization Speed for PyArrow to Python Objects" on Dec 29, 2024