
Registered serializer for common classes of additional array-like objects #762

Open

wants to merge 2 commits into master
Conversation

wilke0818

Types of changes

Summary

  • Registers serializers for three common array-like packages: torch, tensorflow, and pandas
  • Copies existing methodology for serializing numpy in bytes_repr_numpy

Checklist

  • I have added tests to cover my changes (if necessary)
    • No tests exist for bytes_repr_numpy, on which all of these changes are based
  • I have updated documentation (if necessary)
    • No; let me know if there is a specific place that references this hash code and needs updating


codecov bot commented Jul 18, 2024

Codecov Report

Attention: Patch coverage is 40.00000% with 18 lines in your changes missing coverage. Please review.

Project coverage is 15.22%. Comparing base (4c00389) to head (83c721b).
Report is 12 commits behind head on master.

Files Patch % Lines
pydra/utils/hash.py 40.00% 15 Missing and 3 partials ⚠️
Additional details and impacted files
@@             Coverage Diff             @@
##           master     #762       +/-   ##
===========================================
- Coverage   88.01%   15.22%   -72.80%     
===========================================
  Files          53       53               
  Lines       15817    15801       -16     
  Branches     1610     2817     +1207     
===========================================
- Hits        13922     2406    -11516     
- Misses       1893    13062    +11169     
- Partials        2      333      +331     
Flag Coverage Δ
unittests 15.18% <40.00%> (-72.84%) ⬇️


@djarecka
Collaborator

djarecka commented Jul 18, 2024

Thank you for your contribution! Do you think you can create a simple test that shows the issue you are addressing?

We could perhaps add the libraries to the optional dependencies list and run the tests in one of the environments we use in GA (if this is not clear, I can help you with it).

@effigies
Contributor

I don't think we should add these as-is, and would strongly suggest avoiding adding anything but numpy directly. However, we could use protocols to try to identify these and similar types that can be coerced to numpy arrays. Perhaps looking for an __array__() method, or one of these .to_numpy() methods?

@djarecka
Collaborator

@effigies - you mean not adding bytes_repr_tensorflow at all, and just trying to re-use bytes_repr_numpy for tensorflow objects?

@wilke0818
Author

@effigies I'm not aware of a common method like __array__ that many of these libraries share. I considered the to_numpy() and numpy() route as you mention, but that would require changing bytes_repr_numpy to accept more than just numpy types, right? Doing so might make the optional dependencies awkward. I'll defer to your judgement, though. @djarecka I will try to look at creating a test in the meantime.

@effigies
Contributor

I mean something like:

class ArrayLike(ty.Protocol):
    def __array__(self, *args, **kwargs): ...

@register_serializer
def bytes_repr_arraylike(obj: ArrayLike, cache: Cache) -> Iterator[bytes]:
    yield f"{obj.__class__.__module__}{obj.__class__.__name__}:{obj.size}:".encode()
    array = np.asanyarray(obj)
    if array.dtype == "object":
        yield from bytes_repr_sequence_contents(iter(array.ravel()), cache)
    else:
        yield array.tobytes(order="C")

As a quick proof of concept:

from functools import singledispatch
import typing as ty

import numpy as np
import pandas as pd
import tensorflow as tf
import torch

class ArrayLike(ty.Protocol):
    def __array__(self, *args, **kwargs): ...

@singledispatch
def identify(obj: object) -> str:
    return obj.__class__.__name__

@identify.register
def _(obj: ArrayLike) -> str:
    return "ArrayLike"

print(f"{identify([0, 0])=}")  # list
print(f"{identify(np.array([0, 0]))=}")  # ArrayLike
print(f"{identify(pd.DataFrame({'a': [0, 0]}))=}")  # ArrayLike
print(f"{identify(tf.constant([0, 0]))=}")  # ArrayLike
print(f"{identify(torch.tensor([0, 0]))=}")  # ArrayLike

@wilke0818
Author

@effigies I tried something like what you mention (or at least how I understood it) and copied it below. I haven't written official test cases, but I have tried it in a Colab notebook where I was originally getting the error this PR is meant to fix. The issue is that although your proof of concept works, this doesn't work for DataFrames (which throw an unserializable error) or torch Tensors, which continue to fail silently. For some reason it did seem to work for tensorflow.

class ArrayLike(ty.Protocol):
    def __array__(self, *args, **kwargs): ...

    def __array_interface__(self, *args, **kwargs): ...


if HAVE_NUMPY:
    @register_serializer
    def bytes_repr_arraylike(obj: ArrayLike, cache: Cache) -> Iterator[bytes]:
        yield f"{obj.__class__.__module__}{obj.__class__.__name__}:".encode()
        array = numpy.asanyarray(obj)
        yield f"{array.size}:".encode()
        if array.dtype == "object":
            yield from bytes_repr_sequence_contents(iter(array.ravel()), cache)
        else:
            yield array.tobytes(order="C")

@effigies
Contributor

This works for me in a quick IPython session:

class ArrayLike(ty.Protocol):
    def __array__(self, *args, **kwargs): ...

@register_serializer
def bytes_repr_arraylike(obj: ArrayLike, cache: Cache) -> Iterator[bytes]:
    yield f"{obj.__class__.__module__}.{obj.__class__.__name__}:".encode()
    array = np.asanyarray(obj)
    yield f"{array.dtype.str}[{array.shape}]:".encode()
    if array.dtype == "object":
        yield from bytes_repr_sequence_contents(iter(array.ravel()), cache)
    else:
        yield array.tobytes(order="C")

Testing with the test objects above:

print(f"{b''.join(bytes_repr([0, 0], Cache()))}")
print(f"{b''.join(bytes_repr(np.array([0, 0]), Cache()))}")
print(f"{b''.join(bytes_repr(pd.DataFrame({'a': [0, 0]}), Cache()))}")
print(f"{b''.join(bytes_repr(tf.constant([0, 0]), Cache()))}")
print(f"{b''.join(bytes_repr(torch.tensor([0, 0]), Cache()))}")
b'list:(\xb8\xa7\xf7\x08q;\x12\x80rw^\xc4\x12L:-\xb8\xa7\xf7\x08q;\x12\x80rw^\xc4\x12L:-)'
b'numpyndarray:2:\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00'
b'pandas.core.frame.DataFrame:<i8[(2, 1)]:\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00'
b'tensorflow.python.framework.ops.EagerTensor:<i4[(2,)]:\x00\x00\x00\x00\x00\x00\x00\x00'
b'torch.Tensor:<i8[(2,)]:\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00'

Can you share a pandas dataframe that fails? Mixing and matching types, I see:

>>> b''.join(bytes_repr(pd.DataFrame({'a': [0, 0], 'b': [1.5, 2.5], 'c': ['one', 'two']}), Cache()))
b'pandas.core.frame.DataFrame:|O[(2, 3)]:\xb8\xa7\xf7\x08q;\x12\x80rw^\xc4\x12L:-\xd1\x94\x8a?2\x95a\x85=\x8c\xb8|06\xb2\xfe\xa7\xce\xcb+\xd1\xbbS\xd2\xd1\xe5\xa9\x8c|O\xed\xf9\xb8\xa7\xf7\x08q;\x12\x80rw^\xc4\x12L:-\xe3\xcd\x89$_K\x88\x98\xf8\x19l\x95fk\xf1\xee;\xf5\xfdW|\xf5>\xf8\xf7\x94a\x8c\xfa\xff\x15\x04'

Another approach could be to use yield pickle.dumps(obj) (via @satra), which will work for pickle-able objects, but if a dataframe is not serializable, that may not work either.
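The pickle fallback mentioned here could look roughly like the following; a minimal sketch assuming the same generator-of-bytes shape as the serializers above (the name bytes_repr_pickle is hypothetical, and the class-name prefix is an assumption added so that distinct types with identical pickle payloads still hash differently):

```python
import pickle
from typing import Iterator

def bytes_repr_pickle(obj: object) -> Iterator[bytes]:
    # Hypothetical fallback: prefix with the qualified class name, then
    # yield the pickled payload.
    yield f"{obj.__class__.__module__}.{obj.__class__.__name__}:".encode()
    yield pickle.dumps(obj)

blob = b"".join(bytes_repr_pickle({"a": [0, 0]}))
print(blob[:14])  # b'builtins.dict:'
```

One caveat with this approach: pickle output is not guaranteed to be stable across Python versions (or for objects containing sets), so hashes derived this way may not be reproducible across environments.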

@wilke0818
Author

wilke0818 commented Jul 19, 2024

Running your above code in a Colab notebook produces the following for me:
[screenshot: Colab output]
and when including the DataFrame I get
[screenshot: Colab output]

I don't see why there should be a difference between how IPython runs and Colab, but perhaps there is.
https://colab.research.google.com/drive/1f-XYbg2Gr4m2YyzFj8zmoYzoNCLA0_jN?usp=sharing

This behavior of torch and tensorflow is consistent with the issue I have otherwise been running into, where the hash is the same for all tensors.
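For what it's worth, this "same hash for all tensors" failure mode can be reproduced without torch at all, assuming the generic fallback hashes an object's __dict__: extension types that keep their data outside the instance __dict__ (as torch tensors do by default) all look identical to such a fallback. A minimal sketch, where naive_hash and CTensor are hypothetical stand-ins rather than pydra code:

```python
import hashlib

_STORAGE = {}  # stand-in for data held in C++, invisible to __dict__

class CTensor:
    """Toy extension-like type whose payload never enters __dict__."""
    def __init__(self, values):
        _STORAGE[id(self)] = list(values)

def naive_hash(obj) -> str:
    # Mirrors a generic fallback that serializes the instance __dict__.
    state = repr(sorted(vars(obj).items()))
    return hashlib.md5(state.encode()).hexdigest()

# Different data, identical (empty) __dict__ -> identical hash.
print(naive_hash(CTensor([1, 2])) == naive_hash(CTensor([3, 4])))  # True
```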

@effigies
Contributor

That's really weird. Can you activate pdb and see what object is actually causing the problem?

@wilke0818
Author

Apologies, I am not sure how to use pdb in this context. I have updated the Colab link to allow editing. I will continue trying to figure it out if you don't currently have time to look into it.

@effigies
Contributor

I've never used Colab and don't have time to figure it out. I don't understand why you're getting weak references in your environment, which is why I suggested using pdb. Colab appears to be a Jupyter notebook, so you should be able to run the %pdb magic before calling the erroring function and walk through the call stack to find where you got a weak reference. pd.DataFrame([[0,0],[1,1]]) gives me a normal Pandas dataframe and serializes to numpy without problems.
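If the %pdb magic is awkward in Colab, the same walk can also be done programmatically. A sketch (failing() is a toy stand-in for the erroring bytes_repr call, not pydra code) that descends a traceback to the innermost frame and pulls out its locals, which is roughly what stepping down with pdb does:

```python
import sys

def failing(obj):
    return vars(obj)  # raises TypeError for objects without a __dict__

try:
    failing(object())
except TypeError:
    tb = sys.exc_info()[2]
    while tb.tb_next:  # descend to the frame that actually raised
        tb = tb.tb_next
    # Inspect the locals of that frame, like `p obj` at the pdb prompt.
    culprit = tb.tb_frame.f_locals["obj"]
    print(type(culprit).__name__)  # object
```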

@wilke0818
Author

wilke0818 commented Jul 19, 2024

In hash.py, when bytes_repr is called with the DataFrame, we fall through to the generic implementation, which breaks the dataframe into its __dict__:

{'_is_copy': None, '_mgr': BlockManager Items: Index(['b'], dtype='object') Axis 1: RangeIndex(start=0, stop=2, step=1) ObjectBlock: slice(0, 1, 1), 1 x 2, dtype: object, '_item_cache': {}, '_attrs': {}, '_flags': <Flags(allows_duplicate_labels=True)>}

That mapping is passed to bytes_repr_mapping_contents, and the process recurses until we try to hash the key/value pair '_flags': <Flags(allows_duplicate_labels=True)>. The issue is that Flags.__dict__ returns

{'_allows_duplicate_labels': True, '_obj': <weakref at 0x7eac8b3931f0; to 'DataFrame' at 0x7eac888e9a20>}

and when the value <weakref at 0x7eac8b3931f0; to 'DataFrame' at 0x7eac888e9a20> is hashed in turn, we try to access __dict__ on the weakref, which fails.

I think the issue is that your environment is somehow registering the serializer so that DataFrames are converted to NumPy, whereas mine is not; what I see is what you would expect if no new serializer were registered and the code were left as is. If I register with @register_serializer(pd.DataFrame) it works (for pandas, though not for torch or tensorflow), but with @register_serializer or @register_serializer(ArrayLike) it does not.
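One possible culprit for the @register_serializer(ArrayLike) path not dispatching (an assumption on my part, not verified against pydra internals): functools.singledispatch, which this kind of registration typically builds on, can only match a Protocol that is decorated with @typing.runtime_checkable, because dispatch relies on issubclass() checks. A minimal sketch without any of the array libraries, where FakeTensor is a hypothetical stand-in for a tensor type:

```python
import typing as ty
from functools import singledispatch

@ty.runtime_checkable  # without this, dispatching on the protocol fails
class ArrayLike(ty.Protocol):
    def __array__(self, *args, **kwargs): ...

@singledispatch
def identify(obj: object) -> str:
    return obj.__class__.__name__

@identify.register
def _(obj: ArrayLike) -> str:
    return "ArrayLike"

class FakeTensor:
    """Hypothetical stand-in for a torch/tensorflow tensor."""
    def __array__(self, *args, **kwargs):
        return [0, 0]

print(identify([0, 0]))        # list
print(identify(FakeTensor()))  # ArrayLike
```

If the Protocol definitions in the snippets above lack @ty.runtime_checkable, that could plausibly explain environment-dependent behavior, though it would not explain why the same code works in one notebook and not another.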
