Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Non-deterministic (and sometimes incorrect) finding of BaseModel classes #117

Open
slafs opened this issue Aug 12, 2023 · 4 comments
Open

Comments

@slafs
Copy link
Contributor

slafs commented Aug 12, 2023

When trying to run BP001 (add default None) on our project, I've noticed something interesting.
The part when the tool is "Looking for Pydantic Models" with ClassDefVisitor yields different results when running more than once. I think I've managed to boil it down to a minimal example.

Setup

Lets consider 4 simple files: a.py, common.py, utils.py and z.py like so:
==> common.py <==

from pydantic import BaseModel

class NewBaseModel(BaseModel):
    foo: str | None

==> utils.py <==

from common import NewBaseModel

class CustomBaseModel(NewBaseModel):
    bar: str | None

==> a.py <==

from utils import CustomBaseModel

class AModel(CustomBaseModel):
    a: str | None

==> z.py <==

from utils import CustomBaseModel

class ZModel(CustomBaseModel):
    z: str | None

(apologies for terrible and potentially unintuitive naming 😅)

Requirements

There are 2 important things to note here:

  • There are at least 2 levels of inheritance from pydantic's BaseModel. That is: z.ZModel and a.AModel -> utils.CustomBaseModel -> common.NewBaseModel -> pydantic.BaseModel. Apparently this is important as the problem doesn't seem to occur when we get rid of one level (common.NewBaseModel for example).
  • Two "normal" modules, a and z, are listed respectively before and after the utility modules (common and utils) when processing.

Expected outcome

Running bump-pydantic . on such "project" should add None as a default value for fields in all 4 modules.

Observed outcome

But it turns out it does that in (more or less) half of the runs. I've run the tool 100 times and only 47 of them ended up in modifying all four files, 15 times it modified 3 of them (common.py, utils.py and z.py) and 38 times only 2 of them (common.py and utils.py).

Problem research

I see two issues here:

  1. non-deterministic behaviour and
  2. incorrect finding of BaseModel classes.

Looks like the problem occurs because of the way the queue is being constructed and the way files are being added to it later in

missing_files = set(files) - visited
if not queue and missing_files:
queue.append(next(iter(missing_files)))

since set by nature is unsorted, files will be added to the queue in a "random" order.
Although a.py will be processed first every time.
Fixing that non-determinism should be relatively simple. For example, we could make the queue initially the same as files, appendleft files coming from visitor.next_file and get rid of the missing_files bit.

Now, once we fix that, AModel will never get marked as a pydantic model, at least with the way I've fixed the first issue. Which I haven't figured out why, yet. I mean, it does get marked correctly ~50% of the times currently, so... 🤷. I haven't fully grokked the ClassDefVisitor code yet. Seems like there is some consideration for those cases there already.

For solutions I was considering either going through all of the files twice or allowing callers to specify fqns of custom base models.

I can open a PR to better demonstrate what I mean here.
This is what I mean about that non-determinism fix: 13bfe61 (from #118).

@slafs
Copy link
Contributor Author

slafs commented Aug 14, 2023

I haven't fully grokked the ClassDefVisitor code yet. Seems like there is some consideration for those cases there already.

OK. I think I figured it out. It looks like the code responsible for disambiguation in

# In case we have the following scenario:
# class A(B): ...
# class B(BaseModel): ...
# class D(C): ...
# class C: ...
# We want to disambiguate `A` as soon as we see `B` is a `BaseModel`.
if (
fqn.name in self.context.scratch[self.BASE_MODEL_CONTEXT_KEY]
and fqn.name in self.context.scratch[self.CLS_CONTEXT_KEY]
):
for parent_class in self.context.scratch[self.CLS_CONTEXT_KEY].pop(fqn.name):
self.context.scratch[self.BASE_MODEL_CONTEXT_KEY].add(parent_class)
# In case we have the following scenario:
# class A(B): ...
# class B(BaseModel): ...
# class D(C): ...
# class C: ...
# We want to disambiguate `D` as soon as we see `C` is NOT a `BaseModel`.
if (
fqn.name in self.context.scratch[self.NO_BASE_MODEL_CONTEXT_KEY]
and fqn.name in self.context.scratch[self.CLS_CONTEXT_KEY]
):
for parent_class in self.context.scratch[self.CLS_CONTEXT_KEY].pop(fqn.name):
self.context.scratch[self.NO_BASE_MODEL_CONTEXT_KEY].add(parent_class)
only works one level "up". So it manages to resolve something like:

# -- bar.py
from package.foo import Foo

class Bar(Foo):
    b: str

# -- foo.py
from pydantic import BaseModel

class Foo(BaseModel):
    a: str

but once you add one (or more!) levels of inheritance e.g.:

# -- baz.py
from package.bar import Bar

class Baz(Bar):
    c: str

then at the end CLS_CONTEXT_KEY entry will have one lingering entry with:
{'package.bar.Bar': {'package.baz.Baz'}}.

It seems like some recursive disambiguation is needed at the end of the whole process (or at the end each visit).

@Kludex
Copy link
Member

Kludex commented Aug 14, 2023

Closed by #118 ?

@slafs
Copy link
Contributor Author

slafs commented Aug 14, 2023

Closed by #118 ?

Nope. #118 is only for 1. from:

I see two issues here:

  1. non-deterministic behaviour and
  2. incorrect finding of BaseModel classes.

Will try to propose something for number 2. later.

It might get tricky with testing as I see there's no unit tests for ClassDefVisitor :/.

@ovidiu-travelperk
Copy link

this still reproduces

any updates?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants