Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

FrameLocalsProxy is stricter than dict about what constitutes a match #120906

Closed
da-woods opened this issue Jun 23, 2024 · 15 comments
Closed

FrameLocalsProxy is stricter than dict about what constitutes a match #120906

da-woods opened this issue Jun 23, 2024 · 15 comments
Assignees
Labels
3.13 bugs and security fixes 3.14 new features, bugs and security fixes release-blocker type-bug An unexpected behavior, bug, or error

Comments

@da-woods
Copy link
Contributor

da-woods commented Jun 23, 2024

Bug report

Bug description:

import inspect

class MyString(str):
    pass

def f():
    a = 1
    locals = inspect.currentframe().f_locals
    print(MyString("a") in locals)

f()

In Python 3.12 and below this prints True. In Python 3.13 this print False. I think it comes down to the check for exact unicode:

if (PyUnicode_CheckExact(key)) {

The change in behaviour isn't a huge problem so if it's intended then I won't spend waste any time complaining about it, but I do think it's worth confirming that it is intended/desired.

CPython versions tested on:

3.13

Operating systems tested on:

Linux

Linked PRs

@da-woods da-woods added the type-bug An unexpected behavior, bug, or error label Jun 23, 2024
@AlexWaygood AlexWaygood added 3.13 bugs and security fixes 3.14 new features, bugs and security fixes labels Jun 23, 2024
@gaogaotiantian
Copy link
Member

Interesting issue. I don't have a definitive answer here but this is something we need to deal with, because we are getting some weird issues here:

import sys
class MyString(str):
    pass

def f():
    x = 1
    local = sys._getframe().f_locals
    local[MyString('x')] = 2
    print(local.keys())
    # ['x', 'local', 'x']
    print(local)
    # {'x': 2, 'local': {...}}

f()

Internally we check if the input key is an exact unicode, but we do utilize dict for certain features (repr for example). This causes an inconsistent behavior.

My preferred solution is to enforce any key to be an exact unicode string. The reason for that is, unlike a generic dict, FrameLocalsProxy should be a direct proxy for local variables and keys other than unicode string does not quite make sense. However, I know I'm the more aggressive type when it comes to changes so I'll list some concerns as well.

frame.f_locals for an unoptimized frame (module/class) is a real dict and can take any type of keys. There might also be a usage to use a subclass as keys? If the original usage is shown to be helpful, that might tip the scale for me.

I'm a bit worried about the can of worms when we allow subclasses of unicode strings - what if it overwrite the __hash__ and __eq__ methods? Will it cause more troubles than the benefits?

@da-woods
Copy link
Contributor Author

da-woods commented Jun 23, 2024

Just to explain the context I ran into it in:

The Cython compiler wraps most of it's strings into an EncodedString class (for reasons that are possibly historic and which I don't quite understand). To clarify - that's a class used internally by the compiler, rather than a class used in Cython extension modules.

We have an inline method of compilation, and here missing variables are taken from the surrounding frame. The frame comes from Python so the variable names are just a normal str. However the names we're using to look them up are our EncodedString class. These used to work fine when used to look up keys (since they just matched the equivalent str) but now don't.

It's easy enough to work around so I'm relaxed about the solution whatever you decide to do. I think what we were doing was mostly accidental and there wasn't a hidden special use-case behind it.

But for this use-case I'm just reading existing values from the FrameLocalsProxy and not worried about storing subclassed strings. Obviously you need to consider both sides of it though.

@gaogaotiantian
Copy link
Member

We need to discuss this further with @markshannon, but I think the desired behavior would be either

  • Report an error when the key is not an exact unicode string
  • or allow all string-like (subclasses of unicode) keys and access the actual fast locals even with those strings.

@gaogaotiantian
Copy link
Member

I just remembered that we already had that discussion when I was implementing PEP 667. The PEP suggests that we should allow arbitrary types.

Then the question would be what if the user gives a unicode-like key, do we try to access the fast variables with that key?

@ncoghlan
Copy link
Contributor

To avoid breaking mapping invariants, if a given key compares equal to the name of a local variable, it needs to be handled as a lookup of that variable name. Having a PyUnicode_CheckExact guard on a string comparison fast path still makes sense, but the fallback needs to be a linear search using the abstract object comparison API rather than assuming the key is absent.

@gaogaotiantian
Copy link
Member

But that's not the mapping protocol right? What if a key has a different hash, but the equal value? What if the key is

class Evil:
    def __eq__(self, other):
        return True

It's unhashable but only comparing equality will match it to an arbitrary local (the first one).

Are we going to implement the full mapping interface? Checking everything as dict/mapping does? The dark corner we left here might eat us in the future.

@encukou
Copy link
Member

encukou commented Jul 17, 2024

What if a key has a different hash, but the equal value

That's a violation of the hash protocol: objects that compare equal must have the same hash (or at least one must be unhashable).

The Evil example is incomplete, since __hash__ is not inherited if __eq__ is overridden.
But if Evil had __hash__ explicitly set to str.__hash__, it would be OK to rely on it and skip the call to __eq__.


IMO, it would also be OK to convert the argument to PyUnicode, and do the lookup with the resulting temporary object.

@ncoghlan
Copy link
Contributor

ncoghlan commented Jul 18, 2024

I don't think dict.__getitem__ is the right point of comparison here. Instead, the relevant comparison is with list.index() or list.find(), which just do a linear search for the item. No hashing involved (unless the objects do their own hash check internally as an optimisation). That's already the way that the key index lookup in https://github.com/python/cpython/blob/main/Objects/frameobject.c#L53 works.

Looking at the way PyFrame_GetVar works, we could likely accept Unicode subclasses just by switching from _PyUnicode_EQ to _PyUnicode_Equal in the fallback search loop for non-interned strings, internally gating the loop that checks for interned strings on PyUnicode_CheckExact(), and relaxing the higher level checks to use PyUnicode_Check the same way PyFrame_GetVar does.

Accepting Unicode subclasses would be enough to avoid the compatibility issue that affected Cython.

To accept arbitrary objects the way a general mapping does, the fallback search loop would need to use PyObject_RichCompareBool rather than _PyUnicode_Equals .

I also noticed a redundancy in the way the checks for whether or not a value is currently bound (or visible) are handled: names can't be duplicated, so we can return -1 immediately if the name is currently unbound or hidden instead of continuing the search. This allows the inner "is this unbound or hidden?" check to be factored out into a helper function:

static int
framelocalsproxy_getdefinedindex(_PyInterpreterFrame *frame, PyCodeObject *co, int i, bool read)
{
    /*
     * Conditionally returns the given fast locals index
     *   - if read == true, returns the index if the value is not NULL
     *   - if read == false, returns the index if the value is not hidden
     *   - otherwise returns -1
     */
    if (read) {
        if (framelocalsproxy_getval(frame, co, i) != NULL) {
             return i;
        }
    } else {
        if (!(_PyLocals_GetKind(co->co_localspluskinds, i) & CO_FAST_HIDDEN)) {
            return i;
        }
    }
    return -1;
}

A search loop accepting arbitrary keys would then look something like:

    PyCodeObject *co = _PyFrame_GetCode(frame->f_frame);

    // For actual Unicode keys, the key is likely interned and we can do a pointer comparison.
    if (PyUnicode_CheckExact(key) {
        for (int i = 0; i < co->co_nlocalsplus; i++) {
            PyObject *name = PyTuple_GET_ITEM(co->co_localsplusnames, i);
            if (name == key) {
                return framelocalsproxy_getdefinedindex(frame->f_frame, co, i, read);
            }
        }
    }

    // Fall back to an equality comparison if the interned string check fails
    if (PyUnicode_Check(key) {
        // Fast path for strings and string subclasses
        for (int i = 0; i < co->co_nlocalsplus; i++) {
            PyObject *name = PyTuple_GET_ITEM(co->co_localsplusnames, i);
            if (_PyUnicode_Equals(name, key)) {
                return framelocalsproxy_getdefinedindex(frame->f_frame, co, i, read);
            }
        }
    } else {
        // Full rich comparison for other objects
        for (int i = 0; i < co->co_nlocalsplus; i++) {
            PyObject *name = PyTuple_GET_ITEM(co->co_localsplusnames, i);
            int eq_result = PyObject_RichCompareBool(name, key, Py_EQ);
            if (eq_result < 0) {
                return -2;   // Callers will need to check for results < -1 and propagate the error
            }
            if (eq_result) {
                return framelocalsproxy_getdefinedindex(frame->f_frame, co, i, read);
            }
        }
    }

@gaogaotiantian
Copy link
Member

My worry about supporting a unicode subclass or an arbitrary object as a key is - it does not fit the mapping protocol. If an object overrides it's __eq__ but not __hash__, it can't be used as a key. If an object has a different hash to another one but also equal to that object, they will be different keys in the dict.

None of the proposals above have this feature. So basically, we are inventing a very new mapping protocol specifically for FrameLocalsproxy and it's not easy to explain - we don't have a simple enough rule.

That's why I liked "unicode only" solution - that's a simple rule that works and easy to understand. It will cause some trouble, but we are already causing troubles.

We need to consider - what if a key is a unicode subclass, but has different __eq__ and/or __hash__ method? It would be very difficult for us to behave exactly as a dict. What if it's not even a subclass of unicode?

If we can't do that, I kind of like what @encukou suggested - we convert the non-exact-unicode keys to unicode first, for both read and write. That's a golden rule that everyone can follow, and that will solve many benign cases like this very issue.

@encukou
Copy link
Member

encukou commented Jul 26, 2024

I finally looked at the code, rather than going just by the conversation here. But I couldn't just look...
Please see PR #122309, as a suggestion for how to handle this.

If an object overrides its __eq__ but not __hash__, it can't be used as a key. It's not hashable.
If an object has a different hash to another one but also equal to that object, that's a violation of the hash protocol, and dict can behave surprisingly (give weird results/exceptions, but not crash). We don't need to match that behavior.

The hash protocol:

  • limits the types (to hashable ones) and
  • allows dicts to do an optimization: they can look at a small subset of entries. We don't need to use that optimization.

Keys that aren't unicode subclasses shouldn't be a problem.

@gaogaotiantian
Copy link
Member

Oh sorry I did not realize you posted in the issue. I don't think I'm getting notification from this issue.

Could you point to me the documentation about the hash protocol? I could not find it.

@encukou
Copy link
Member

encukou commented Jul 27, 2024

See hashable in the docs:

Hashable objects which compare equal must have the same hash value.

@ncoghlan
Copy link
Contributor

The data model docs in https://docs.python.org/3/reference/datamodel.html#object.__hash__ also state "The only required property is that objects which compare equal have the same hash value". However, they go into more detail about how to define __hash__ and __eq__ to be consistent with each other, as well as how to indicate when subclasses of hashable parent classes are not themselves hashable.

Classes that don't abide by those rules just straight up don't work properly, so the fact they won't misbehave as badly with locals proxy instances as they do with regular dicts isn't a problem we need to be concerned about.

encukou added a commit to encukou/cpython that referenced this issue Jul 30, 2024
…sProxy (pythonGH-122309)

Co-authored-by: Alyssa Coghlan <[email protected]>
(cherry picked from commit 5912487)
encukou added a commit to encukou/cpython that referenced this issue Jul 30, 2024
…sProxy (pythonGH-122309)

Co-authored-by: Alyssa Coghlan <[email protected]>
(cherry picked from commit 5912487)
@Yhg1s
Copy link
Member

Yhg1s commented Jul 31, 2024

This issue is all fixed now, right?

Yhg1s pushed a commit that referenced this issue Jul 31, 2024
…GH-122309) (#122488)

[3.13] gh-120906: Support arbitrary hashable keys in FrameLocalsProxy  (GH-122309)

Co-authored-by: Alyssa Coghlan <[email protected]>
(cherry picked from commit 5912487)
@ncoghlan
Copy link
Contributor

ncoghlan commented Aug 1, 2024

Indeed it is!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
3.13 bugs and security fixes 3.14 new features, bugs and security fixes release-blocker type-bug An unexpected behavior, bug, or error
Projects
Development

No branches or pull requests

7 participants