BUG: Covariance does not handle ddof argument if data is missing. #45814
Labels: Bug, cov/corr, Missing-data, Reduction Operations
Pandas version checks
I have checked that this issue has not already been reported.
I have confirmed this bug exists on the latest version of pandas.
I have confirmed this bug exists on the main branch of pandas.
Reproducible Example
Issue Description
The problem:

`pandas` and `numpy` have mismatching behavior when computing covariance via `.cov()` in the presence of missing/NaN values. This has also been highlighted in issue #16837.

In estimating covariance, the data is normalized by `(N - ddof)`. Therefore, when the number of observations `N` is equal to the value passed in for `ddof`, dividing by zero results in infinity (`inf`).

I think it's specifically an issue in the pandas code: pandas uses numpy for the calculation if no values are missing (NaN). On the flip side, a pandas-internal implementation, `libalgos.nancorr`, is used if nulls are present, and there `ddof` is currently not used to normalize the data when estimating the covariance.

Expected Behavior
Refer to Reproducible Example above
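
To make the expected normalization concrete, here is a minimal pure-Python sketch of a covariance over pairwise-complete observations that honors `ddof`. The helper name `pairwise_cov` and the sample data are hypothetical and only for illustration; this is not the code in `libalgos.nancorr`. Returning NaN when `N - ddof <= 0` is an assumption of this sketch (numpy divides by zero in that case instead).

```python
import math


def pairwise_cov(x, y, ddof=1):
    """Covariance of x and y over rows where both values are present.

    Hypothetical helper for illustration; normalizes by (n - ddof).
    """
    # Keep only the pairs in which neither value is NaN.
    pairs = [(a, b) for a, b in zip(x, y)
             if not (math.isnan(a) or math.isnan(b))]
    n = len(pairs)

    # Assumption of this sketch: report NaN instead of dividing by zero
    # (or by a negative number) when n - ddof <= 0.
    if n - ddof <= 0:
        return math.nan

    mean_x = sum(a for a, _ in pairs) / n
    mean_y = sum(b for _, b in pairs) / n

    # Sum of cross-deviations, then normalize by (n - ddof).
    cross = sum((a - mean_x) * (b - mean_y) for a, b in pairs)
    return cross / (n - ddof)


# Made-up data: the last row is dropped because x is missing there,
# which leaves N = 3 complete pairs.
x = [1.0, 2.0, 3.0, math.nan]
y = [2.0, 4.0, 7.0, 5.0]

print(pairwise_cov(x, y, ddof=0))  # normalized by N
print(pairwise_cov(x, y, ddof=1))  # normalized by N - 1
print(pairwise_cov(x, y, ddof=3))  # N - ddof == 0, NaN in this sketch
```

The point is only that the divisor should depend on `ddof` in the missing-data path as well, so that results with and without NaNs are normalized consistently.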
Installed Versions
In [5]: pd.show_versions()
INSTALLED VERSIONS
commit : 66e3805
python : 3.8.12.final.0
python-bits : 64
OS : Linux
OS-release : 4.15.0-76-generic
Version : #86-Ubuntu SMP Fri Jan 17 17:24:28 UTC 2020
machine : x86_64
processor : x86_64
byteorder : little
LC_ALL : None
LANG : None
LOCALE : en_US.UTF-8
pandas : 1.3.5
numpy : 1.21.5
pytz : 2021.3
dateutil : 2.8.2
pip : 21.3.1
setuptools : 59.8.0
Cython : 0.29.26
pytest : 6.2.5
hypothesis : 6.36.0
sphinx : 4.4.0
blosc : None
feather : None
xlsxwriter : None
lxml.etree : 4.7.1
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : 3.0.3
IPython : 8.0.1
pandas_datareader: None
bs4 : 4.10.0
bottleneck : None
fsspec : 2022.01.0
fastparquet : None
gcsfs : None
matplotlib : None
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : 5.0.0
pyxlsb : None
s3fs : None
scipy : None
sqlalchemy : None
tables : None
tabulate : None
xarray : None
xlrd : None
xlwt : None
numba : 0.55.0