-
-
Notifications
You must be signed in to change notification settings - Fork 19.2k
BUG: Catch all exceptions raised while calling PyObjectHashTable methods #62892
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
| ^^^^^ | ||
| - Bug in :class:`DataFrame` when passing a ``dict`` with a NA scalar and ``columns`` that would always return ``np.nan`` (:issue:`57205`) | ||
| - Bug in :class:`Series` ignoring errors when trying to convert :class:`Series` input data to the given ``dtype`` (:issue:`60728`) | ||
| - Bug in :class:``PyObjectHashTable`` that would silently suppress exceptions thrown from custom ``__hash__`` and ``__eq__`` methods during hashing (:issue:`57052`) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are you able to add a test that uses a public API that would be fixed by your changes?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
|
doing this inside the khash code is definitely more difficult, but problably the Right Way To Do It. Does this entail a perf hit? BTW #62888 is probably going to have to entail digging into that same bit of khash code. |
|
I tried adapting the suggestion from #57052 (comment) (
I'll set up a tiny benchmark for |
|
I implemented a new layer called The next problem is fixing the dozens of exceptions that were previously silently suppressed. Most of them seem to be either |
FYI @jbrockmendel this small benchmark I did for setupfrom pandas._libs import hashtable as ht
from random import shuffle
class testkey:
def __init__(self, value):
self.value = value
def __hash__(self):
return hash(self.value)
def __eq__(self, other):
return self.value == other.value
def test_pymap_set_get(indexes: list[int]):
table = ht.PyObjectHashTable()
keys = [testkey(f"key{i}") for i in indexes]
shuffle(indexes)
for i in indexes:
table.set_item(keys[i], i)
shuffle(indexes)
for i in indexes:
assert table.get_item(keys[i]) == i
def test_pymap_set_get_no_shuffle(indexes: list[int]):
table = ht.PyObjectHashTable()
keys = [testkey(f"key{i}") for i in indexes]
for i in indexes:
table.set_item(keys[i], i)
for i in indexes:
assert table.get_item(keys[i]) == imain branch (d597079)with shufflewithout shufflethis PR (0a4cba8)with shufflewithout shuffle |
|
@mroeschke ready for review again. pre-commit is failing on clang-format. Can't format locally since I don't have the format config used in CI. |
(Replace xxxx with the GitHub issue number)
doc/source/whatsnew/vX.X.X.rstfile if fixing a bug or adding a new feature.old summaryThis is a simple workaround. For a proper solution, khash_python.h would require some refactoring to handle exceptions gracefully. It's a bit tricky, though, becausekh_python_hash_{equal,func}is exposed to the vendoredkhashimplementation, which calls those functions in a loop.summary
PyObject_HashandPyObject_RichCompareBool__hash__and__eq__methods are not silently discarded (see BUG: ht.PyObjectHashTable swallows exception #57052)pymapthat catches all exceptions thrown during khash computations and raises the exceptions properlydictandlistwill result in a hash value of 0 for backwards compatibility, see belowpd.NAinpyobject_cmp, see belowJSONArray.duplicatedthat convertsUserDictelements todictbefore callingpd.core.algorithms.duplicated.dictandlistobjects are still hashed as 0Some existing logic, for example core.algorithms.value_count, uses the
pymapfor efficiency.High-level API examples that test this behavior:
DataFrame.describeSeries.isinThis was working fine because when
PyObject_Hashwould raise an exception while trying to hash unhashable keys, the error indicator was explicitly cleared, and a hash value 0 was returned. All objects would end up in the same bucket in the hashtable. Although this is not too good for performance, it is not a problem for correctness (becausepyobject_cmpwould find all objects in the bucket).This PR retains that behavior for all
dictandlistobjects. If hashing needs to be bypassed for additional types in the future, they can be added here.comparing a
pandas.NAinpyobject_cmpis alwaysFalseExceptions were previously suppressed also in
pyobject_cmp. One notable exception thrown during==comparison is whenpd.NA == xreturnspd.NAinstead ofFalse/True. This PR will introduce a check forpd.NAand returnFalsefrompyobject_cmpwhen one or both of the operands arepd.NA.