@matiaslindgren
Contributor

@matiaslindgren matiaslindgren commented Oct 28, 2025

old summary

This is a simple workaround. A proper solution would require refactoring khash_python.h to handle exceptions gracefully. That is a bit tricky, though, because kh_python_hash_{equal,func} are exposed to the vendored khash implementation, which calls those functions in a loop.

summary

  • khash_python.h is silently suppressing all exceptions from PyObject_Hash and PyObject_RichCompareBool
  • This is problematic because client logic might expect that exceptions thrown from their custom __hash__ and __eq__ methods are not silently discarded (see BUG: ht.PyObjectHashTable swallows exception #57052)
  • This PR implements a new layer for pymap that catches all exceptions thrown during khash computations and raises the exceptions properly
  • Hashing dict and list objects still yields a hash value of 0 for backwards compatibility, see below
  • Fixed comparison of pd.NA in pyobject_cmp, see below
  • Added a wrapper on JSONArray.duplicated that converts UserDict elements to dict before calling pd.core.algorithms.duplicated
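
To make the first two points concrete, here is a minimal pure-Python sketch of the difference between swallowing and propagating an exception raised by a custom `__hash__`. The names `FailingKey`, `hash_swallowing`, and `hash_propagating` are hypothetical and exist only for illustration; the real logic is C code in khash_python.h and is not reproduced here.

```python
class FailingKey:
    """Hypothetical key whose __hash__ always fails."""

    def __hash__(self):
        raise RuntimeError("custom __hash__ failed")


def hash_swallowing(obj):
    # Mimics the old behavior described above: any error is discarded
    # and 0 is used as the hash value.
    try:
        return hash(obj)
    except Exception:
        return 0


def hash_propagating(obj):
    # Mimics the intended behavior: the caller sees the original error.
    return hash(obj)
```

With the swallowing variant, the caller cannot tell that `__hash__` failed; with the propagating variant, the `RuntimeError` reaches the caller unchanged.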

dict and list objects are still hashed as 0

Some existing logic, for example core.algorithms.value_count, uses the pymap for efficiency.

High-level API examples that test this behavior:

This worked because, when PyObject_Hash raised an exception while hashing an unhashable key, the error indicator was explicitly cleared and a hash value of 0 was returned. All such objects ended up in the same bucket of the hash table. This is bad for performance but does not affect correctness, because pyobject_cmp still finds the matching object within the bucket.

This PR retains that behavior for all dict and list objects. If hashing needs to be bypassed for additional types in the future, they can be added here.
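
As a rough illustration of why a shared bucket stays correct, the sketch below counts values with a toy bucketed table in pure Python. `BYPASS_TYPES`, `bucket_hash`, and `count_values` are hypothetical names, and the table layout is an assumption made for readability, not the khash layout.

```python
# Hypothetical list of types whose hashing is bypassed.
BYPASS_TYPES = (dict, list)


def bucket_hash(obj):
    # dict and list objects all share hash value 0 instead of raising.
    if isinstance(obj, BYPASS_TYPES):
        return 0
    return hash(obj)


def count_values(values):
    buckets = {}  # hash value -> list of [key, count] entries
    for value in values:
        bucket = buckets.setdefault(bucket_hash(value), [])
        # Equality decides which entry matches, so keys that share
        # a bucket are still told apart correctly.
        for entry in bucket:
            if entry[0] == value:
                entry[1] += 1
                break
        else:
            bucket.append([value, 1])
    return [tuple(entry) for bucket in buckets.values() for entry in bucket]
```

Every dict and list lands in bucket 0, so lookups there degrade to a linear scan, but the counts remain correct because each entry is matched by equality.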

comparing a pandas.NA in pyobject_cmp is always False

Exceptions were previously also suppressed in pyobject_cmp. A notable case arises during == comparison: pd.NA == x returns pd.NA instead of True or False, and converting that result to a boolean raises a TypeError. This PR introduces a check for pd.NA and returns False from pyobject_cmp when either operand is pd.NA.
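
The check can be sketched in pure Python as follows. `_NAType`, `NA`, and `safe_equal` are hypothetical stand-ins that only mimic the pd.NA behavior described above, so the example runs without pandas; the actual check is implemented in C inside pyobject_cmp.

```python
class _NAType:
    """Hypothetical stand-in mimicking the pd.NA behavior described above."""

    def __eq__(self, other):
        # Comparison yields NA itself, not True or False.
        return self

    def __bool__(self):
        raise TypeError("boolean value of NA is ambiguous")


NA = _NAType()


def safe_equal(a, b):
    # Short-circuit before calling ==, so bool(NA) is never evaluated.
    if a is NA or b is NA:
        return False
    return bool(a == b)
```

Note that under this rule NA never compares equal to anything, including itself.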

^^^^^
- Bug in :class:`DataFrame` when passing a ``dict`` with a NA scalar and ``columns`` that would always return ``np.nan`` (:issue:`57205`)
- Bug in :class:`Series` ignoring errors when trying to convert :class:`Series` input data to the given ``dtype`` (:issue:`60728`)
- Bug in :class:`PyObjectHashTable` that would silently suppress exceptions thrown from custom ``__hash__`` and ``__eq__`` methods during hashing (:issue:`57052`)
Member

Are you able to add a test that uses a public API that would be fixed by your changes?

Contributor Author

done

@jbrockmendel
Member

doing this inside the khash code is definitely more difficult, but probably the Right Way To Do It. Does this entail a perf hit?

BTW #62888 is probably going to have to entail digging into that same bit of khash code.

@matiaslindgren
Contributor Author

I tried adapting the suggestion from #57052 (comment) (pandas._libs.parsers.raise_parser_error) but there are quite a few failing tests.

SystemError: ... returned a result with an exception set is the error the interpreter raises when a C function returns a result while the error indicator is still set. So it seems there are some code paths outside PyObjectHashTable that call kh_python_hash_{equal,func}. I need to do some more digging here.

doing this inside the khash code is definitely more difficult, but probably the Right Way To Do It. Does this entail a perf hit?

I'll set up a tiny benchmark for PyObjectHashTable to compare these changes with main.

@matiaslindgren matiaslindgren changed the title from "BUG: try triggering exceptions from custom methods in Cython before entering the khash loop" to "BUG: Catch all exceptions raised while calling PyObjectHashTable methods" on Oct 31, 2025
@matiaslindgren
Contributor Author

I implemented a new layer called pymap_checked in pandas/_libs/khash for the PyObject hash table. It will catch every exception thrown during khash computation for PyObjects.
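
The general pattern can be sketched in pure Python as follows. This is not the code from this PR: `ErrorSlot`, `make_hash_callback`, and `checked_set_item` are hypothetical names, and the slot object merely stands in for the C-level error indicator. The idea is that the callback handed to the table must always return a value, so a failure is recorded on the side and re-raised by the wrapper once the table call returns.

```python
class ErrorSlot:
    """Hypothetical stand-in for the C-level error indicator."""

    def __init__(self):
        self.exc = None


def make_hash_callback(slot):
    def hash_callback(obj):
        # C-style contract: always return an int, never raise.
        try:
            return hash(obj)
        except Exception as exc:
            if slot.exc is None:
                slot.exc = exc  # remember the first failure
            return 0

    return hash_callback


def checked_set_item(buckets, key, value, hash_callback, slot):
    h = hash_callback(key)
    # Analogous to checking for a pending error after the table call.
    if slot.exc is not None:
        exc, slot.exc = slot.exc, None
        raise exc
    buckets.setdefault(h, []).append((key, value))
```

In this sketch a key whose `__hash__` raises is never stored, and the slot is cleared before the exception is re-raised so that later calls start from a clean state.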

The next problem is fixing the dozens of exceptions that were previously silently suppressed. Most of them seem to be either TypeError: boolean value of NA is ambiguous or TypeError: unhashable type: 'dict' but there are a few others too.

@matiaslindgren
Contributor Author

doing this inside the khash code is definitely more difficult, but probably the Right Way To Do It. Does this entail a perf hit?

BTW #62888 is probably going to have to entail digging into that same bit of khash code.

FYI @jbrockmendel this small benchmark of PyObjectHashTable.{set,get}_item suggests that the if PyErr_Occurred() check does not measurably affect performance, even though it runs on every kh_*_pymap call.

setup

from pandas._libs import hashtable as ht
from random import shuffle


class testkey:
    def __init__(self, value):
        self.value = value

    def __hash__(self):
        return hash(self.value)

    def __eq__(self, other):
        return self.value == other.value


def test_pymap_set_get(indexes: list[int]):
    table = ht.PyObjectHashTable()

    keys = [testkey(f"key{i}") for i in indexes]

    shuffle(indexes)
    for i in indexes:
        table.set_item(keys[i], i)

    shuffle(indexes)
    for i in indexes:
        assert table.get_item(keys[i]) == i


def test_pymap_set_get_no_shuffle(indexes: list[int]):
    table = ht.PyObjectHashTable()

    keys = [testkey(f"key{i}") for i in indexes]

    for i in indexes:
        table.set_item(keys[i], i)

    for i in indexes:
        assert table.get_item(keys[i]) == i

main branch (d597079)

with shuffle

In [1]: %timeit test_pymap_set_get(list(range(100)))
55.9 μs ± 390 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)

In [2]: %timeit test_pymap_set_get(list(range(1000)))
606 μs ± 847 ns per loop (mean ± std. dev. of 7 runs, 1,000 loops each)

In [3]: %timeit test_pymap_set_get(list(range(10000)))
6.49 ms ± 18.9 μs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [4]: %timeit test_pymap_set_get(list(range(100000)))
81.1 ms ± 329 μs per loop (mean ± std. dev. of 7 runs, 10 loops each)

without shuffle

In [1]: %timeit test_pymap_set_get_no_shuffle(list(range(100)))
36.9 μs ± 401 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)

In [2]: %timeit test_pymap_set_get_no_shuffle(list(range(1000)))
372 μs ± 2.38 μs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)

In [3]: %timeit test_pymap_set_get_no_shuffle(list(range(10000)))
4.1 ms ± 24 μs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [4]: %timeit test_pymap_set_get_no_shuffle(list(range(100000)))
46.6 ms ± 248 μs per loop (mean ± std. dev. of 7 runs, 10 loops each)

this PR (0a4cba8)

with shuffle

In [1]: %timeit test_pymap_set_get(list(range(100)))
55.9 μs ± 151 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)

In [2]: %timeit test_pymap_set_get(list(range(1000)))
604 μs ± 1.13 μs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)

In [3]: %timeit test_pymap_set_get(list(range(10000)))
6.51 ms ± 10.8 μs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [4]: %timeit test_pymap_set_get(list(range(100000)))
79.6 ms ± 268 μs per loop (mean ± std. dev. of 7 runs, 10 loops each)

without shuffle

In [1]: %timeit test_pymap_set_get_no_shuffle(list(range(100)))
37.2 μs ± 106 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)

In [2]: %timeit test_pymap_set_get_no_shuffle(list(range(1000)))
373 μs ± 926 ns per loop (mean ± std. dev. of 7 runs, 1,000 loops each)

In [3]: %timeit test_pymap_set_get_no_shuffle(list(range(10000)))
4.03 ms ± 8.16 μs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [4]: %timeit test_pymap_set_get_no_shuffle(list(range(100000)))
45.3 ms ± 190 μs per loop (mean ± std. dev. of 7 runs, 10 loops each)

@matiaslindgren
Contributor Author

@mroeschke ready for review again. pre-commit is failing on clang-format. Can't format locally since I don't have the format config used in CI.

Development

Successfully merging this pull request may close these issues.

BUG: ht.PyObjectHashTable swallows exception

3 participants