Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

pydantic model serialization disallows binary data (experimental API) #310

Open
kylebarron opened this issue Sep 29, 2023 · 7 comments
Open
Labels
bug Something isn't working experimental

Comments

@kylebarron
Copy link
Contributor

The current implementation of _get_pydantic_state_v2 is:

def _get_pydantic_state_v2(obj: pydantic.BaseModel) -> Serializable:
"""Get the state of a pydantic (v2) BaseModel instance."""
return obj.model_dump(mode="json")

When you have a model that serializes to binary data, mode="json" is ok only when the binary data is actually a utf8 string. I.e.:

from pydantic import BaseModel

class Model(BaseModel):
    a: bytes

Model(a=b'abc').model_dump(mode='json') 
# {'a': 'abc'}

Model(a=b'\xa1').model_dump(mode='json')
# UnicodeDecodeError: 'utf-8' codec can't decode byte 0xa1 in position 0: invalid utf-8

But because anywidget uses remove_buffers, setting mode="json" isn't necessary.

state = Model(a=b'\xa1').model_dump()

from anywidget._util import remove_buffers
state2, buffer_paths, buffers = remove_buffers(state)
# {}, [['a']], [b'\xa1']

Though removing mode="json" might have regressions for other data types, like dates?

@kylebarron
Copy link
Contributor Author

Though removing mode="json" might have regressions for other data types, like dates?

I'm assuming this will end up being a problem...

It looks like I got this to work by subclassing model_dump to serialize the entire model except a specific key to a json-compatible version, and then serializing only that key, and then joining. e.g.:

from pydantic import BaseModel

class Model(BaseModel):
    a: bytes

    def model_dump(self, *args, **kwargs) -> dict[str, Any]:
        if 'exclude' in kwargs.keys():
            if 'a' not in kwargs['exclude']:
                kwargs['exclude'].append('a')
        else:
            kwargs['exclude'] = ['a']

        kwargs['mode'] = 'json'

        json_model_dump = super().model_dump(*args, **kwargs)
        python_model_dump = super().model_dump(include={"a"})
        return {**json_model_dump, **python_model_dump}

Model(a=b'abc').model_dump(mode='json')
# {'a': b'abc'}

Model(a=b'\xa1').model_dump(mode='json')
# {'a': b'\xa1'}

So now it's always binary!

@manzt
Copy link
Owner

manzt commented Sep 29, 2023

Hmm, yah this is due to the fact binary data isn't serializable to JSON (without base64 encoding it). The _get_pydanic_state_v2 is just a helper function for pydantic. Maybe in this case, the work around is to just implement your own state getter function:

class Model(BaseModel):
  a: bytes

  def _get_anywidget_state(self):
      state = self.model_dump(exclude=["a"], mode="json")
      state["a"] = self.a
      return state

This will take precedent over the inferred _get_pydantic_state_v2 helper (and won't force you to override the model_dump behavior. Logic for how we determine the state getter:

def determine_state_getter(obj: object) -> Callable[[Any], Serializable]:
"""Autodetect how `obj` can be serialized to a dict.
This looks for various special methods and patterns on the object (e.g. dataclass,
pydantic, etc...), and returns a callable that can be used to get the state of the
object as a dict.
As an escape hatch it first looks for a special method on the object called
`_get_anywidget_state`.
Returns
-------
state_getter : Callable[[object], dict]
A callable that takes an object and returns a dict of its state.
"""
# check on the class for our special state getter method
if hasattr(type(obj), _STATE_GETTER_NAME):
# note that we return the *unbound* method on the class here, so that it can be
# called with the object as the first argument
return getattr(type(obj), _STATE_GETTER_NAME) # type: ignore [no-any-return]
if is_dataclass(obj):
# caveat: if the dict is not JSON serializeable... you still need to
# provide an API for the user to customize serialization
return asdict
if _is_traitlets_object(obj):
return _get_traitlets_state
if _is_pydantic_model(obj):
if hasattr(obj, "model_dump"):
return _get_pydantic_state_v2
return _get_pydantic_state_v1
if _is_msgspec_struct(obj):
return _get_msgspec_state
# pickle protocol ... probably not type-safe enough for our purposes
# https://docs.python.org/3/library/pickle.html#object.__getstate__
# if hasattr(type(obj), "__getstate__"):
# return type(obj).__getstate__
raise TypeError( # pragma: no cover
f"Cannot determine a state-getting method for {obj!r}. "
"Please implement a `_get_anywidget_state()` method that returns a dict."
)

@manzt
Copy link
Owner

manzt commented Sep 30, 2023

Though removing mode="json" might have regressions for other data types, like dates?

Hmm, I'm not too worried about the regression of other date types. All of this stuff is behind experimental, and I'm starting to come around to the idea of just using model.model_dump(). Currently to serialize bytes fields (an important use case), you'd need to either:

  • Override model_dump()
  • Implement your own _get_anywidget_state method
  • implement a @model_serializer

with mode="json"

Whereas if we switched to model_dump(), supporting a datetime could look like:

class Model(BaseModel):
  a: bytes
  dt: datetime
  
  @field_serializer
  def serialize_dt(self, dt: datetime, _info):
      return dt.isoformat()

If someone really wants to keep the behavior of anywidget and pydantic separate, then they can always implement _get_anywidget_state. I just want _get_pydantic_state_v2 to provide the most sensible default.

@kylebarron
Copy link
Contributor Author

It seems like you either need to manually handle bytes if setting model_dump(mode="json") or you'd need to manually handle dates if setting model_dump(), so not sure which is easiest 🤷‍♂️

@manzt
Copy link
Owner

manzt commented Sep 30, 2023

Since bytes is a python builtin, I think my preference would be use model_dump() and make someone manually handle higher level objects (just like how one would need to make a serializer for their own class).

@kylebarron
Copy link
Contributor Author

kylebarron commented Sep 30, 2023

To play devil's advocate, datetime.datetime is a python builtin too? It's hard because it seems the Jupyter comm requires almost JSON, in that everything but bytes needs to be json-serializable?

@manzt
Copy link
Owner

manzt commented Sep 30, 2023

Sorry, I was referring to builtins, not built-in modules in the standard library. In particular, inspired by msgspec.to_builtins.

import builtins

builtins.datetime
# AttributeError: module 'builtins' has no attribute 'datetime'

But I see your point, probably need to think more about it. Less magic is better IMO with regard to serialization, and I could imagine that they might be several different ways to serialize a datetime compared to bytes. If we default to model_dump (without mode="json"), users are likely to get a Python error when ipywidgets tries to serialize to JSON (rather than some implicit conversion to a JSON-encoded value they aren't aware of). Either way, I think better documentation about these edge cases serves us the best strategy.

@manzt manzt added the bug Something isn't working label Jan 26, 2024
@manzt manzt changed the title pydantic model serialization disallows binary data pydantic model serialization disallows binary data (experimental API) Jan 26, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working experimental
Projects
None yet
Development

No branches or pull requests

2 participants