PERF: fix assert_frame_equal can be very slow#38202
PERF: fix assert_frame_equal can be very slow#38202jreback merged 14 commits intopandas-dev:masterfrom
Conversation
|
Looks like green. |
|
hmm can you merge master (this looks like it's running some older code) |
|
Did that. |
it was some other errors. but in spite of this change i am not sure this actually fixes things. (tests times are about the same) |
i doubt we have any/many tests that are as poorly behaved as the example from #38183 |
| rtol=rtol, | ||
| atol=atol, | ||
| check_index=False, | ||
| ) |
There was a problem hiding this comment.
we have a lot of kwargs already. would it be viable to call assert_(ea|numpy)_array_equal on lcol._values and rcol._values?
There was a problem hiding this comment.
Probably.
But I looked inside assert_series_equal and noticed that there are a lot of cases covered, different dtypes, etc.
So, I thought that it would be safer to parametrize.
I will try to find another way, as you suggest.
There was a problem hiding this comment.
@jbrockmendel, I am afraid I can only suggest the following approach for now.
Split assert_series_equal into assert_index_equal and _assert_series_values_equal.
The latter one is the same as the original assert_series_equal, but with the assert_index_equal removed from there.
Then we can use _assert_series_values_equal in the loop inside assert_frame_equal.
There was a problem hiding this comment.
hmm i actually liked your prior implementation (check_index) but i guess this is fine.
Execution of the following command improved. from 50 seconds to 280 something milliseconds (#38183). |
|
I can think of a test ensuring that In fact, |
| @@ -1,3 +1,5 @@ | |||
| from unittest.mock import patch | |||
There was a problem hiding this comment.
i think elsewhere we use pytest's monkeypatch fixture, does that not provide the call_count check?
There was a problem hiding this comment.
I did not find this functionality there.
| tm.assert_frame_equal(left, right) | ||
|
|
||
|
|
||
| def test_assert_frame_equal_checks_index_exactly_twice(): |
There was a problem hiding this comment.
i dont think this necessarily needs a test, maybe an asv. as long as there is a comment in tm.assert_frame_equal about why we are doing what we're doing that should be enough
There was a problem hiding this comment.
I am removing the test because it relies on unittest.mock, which is not accepted by the CI check.
| @@ -1690,11 +1727,10 @@ def assert_frame_equal( | |||
| assert col in right | |||
| lcol = left.iloc[:, i] | |||
| rcol = right.iloc[:, i] | |||
There was a problem hiding this comment.
marginally faster to do right._ixs(i, axis=1)
| atol=1.0e-8, | ||
| obj="Series", | ||
| ): | ||
| if check_less_precise is not no_default: |
ok maybe something else is stil take time |
|
its this build: https://travis-ci.org/github/pandas-dev/pandas/jobs/747040921 that is really really slow, used to be like 30min. |
|
pre-commit checks are faling |
pandas/_testing.py
Outdated
|
|
||
|
|
||
| def _assert_series_values_equal( | ||
| left, |
There was a problem hiding this comment.
yeah let's revert this and use the check_index=False
There was a problem hiding this comment.
Reverted to use of check_index.
There was a problem hiding this comment.
But personally I prefer the way involving splitting of assert_series_equal into two functions.
Indeed, this function does two major things: check index equivalence and check values equivalence, so it is reasonable to split it like that.
I see that there are multiple parameters passed to _assert_series_values_equal, which make this solution look not very elegant.
But there is a benefit that we do not touch public API, by not adding new parameter check_index.
Please suggest which solution you prefer (@jbrockmendel suggested that we do not introduce new kwarg).
There was a problem hiding this comment.
the new kwarg isnt the worst thing in the world. lets make it keyword-only to start the process of #38222
There was a problem hiding this comment.
ok i think this is fine (reason I don't mind check_index is that we have this elsewhere publicly). ok let's make this kwarg only fo r the new paramater, otherwise lgtm.
There was a problem hiding this comment.
I make this kwarg keyword-only.
|
We normally have a high bar for adding parameters to the public api. Since we could just have an internal |
-1 on this (which was already suggested), this add even more complexity in an already complex world. we already have a |
pandas/_testing.py
Outdated
| check_index : bool, default True | ||
| Whether to check index equivalence. If False, then compare only values. | ||
|
|
||
| .. versionadded:: 1.2.0 |
pandas/_testing.py
Outdated
| assert col in right | ||
| lcol = left.iloc[:, i] | ||
| rcol = right.iloc[:, i] | ||
| lcol = left._ixs(i, axis=1) |
There was a problem hiding this comment.
why are you changing this? this is not a public function and shouldn't be used here at all
There was a problem hiding this comment.
@jbrockmendel suggested that it would be a marginally faster option.
Looks like it actually is.
There was a problem hiding this comment.
it doesn't matter if its faster as its not public; these are heavily used public function (yes this is an implementation), but we want to use iloc here (and i doubt this actually makes any real world difference).
|
can you show how this is faster, specifically are things decreasing in the duration? which builds are faster? |
I compared |
|
thanks @ivanovmg |
black pandasgit diff upstream/master -u -- "*.py" | flake8 --diffAdd kwarg
check_indexintoassert_series_equalto allow elimination of multiple index checking (for each column) inassert_frame_equal.Performance wise.