BUG: Covariance does not handle ddof argument if data is missing. #45814

Open
3 tasks
skirui-source opened this issue Feb 4, 2022 · 0 comments
Labels
Bug, cov/corr, Missing-data (np.nan, pd.NaT, pd.NA, dropna, isnull, interpolate), Reduction Operations (sum, mean, min, max, etc.)

Comments


skirui-source commented Feb 4, 2022

Pandas version checks

  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest version of pandas.

  • I have confirmed this bug exists on the main branch of pandas.

Reproducible Example

import numpy as np
import pandas as pd

print("DataFrame without null values:")
non_null_df = pd.DataFrame({"id": ["a", "a"], "val1": [1.0, 2.0], "val2": [1.0, 2.0],})

DataFrame without null values:
  id  val1  val2
0  a   1.0   1.0
1  a   2.0   2.0

# We normalize by N-ddof = 2-2 = 0, and thus get infinity from dividing by 0.0
# This is the expected result:
print(non_null_df.cov(ddof=2))
      val1  val2
val1   inf   inf
val2   inf   inf

# A groupby covariance behaves like the function above, as expected:
print(non_null_df.groupby("id").cov(ddof=2))
         val1  val2
id                 
a  val1   inf   inf
   val2   inf   inf



print("DataFrame with null values:")
null_df = pd.DataFrame({ "id": ["a", "a"], "val1": [1.0, 2.0],"val2": [np.nan, np.nan],})

DataFrame with null values:
  id  val1  val2
0  a   1.0   NaN
1  a   2.0   NaN

# We expect to normalize by N-ddof = 2-2 = 0, but ddof is ignored because there are null values.
# The underlying problem is that libalgos.nancorr does not accept and use the provided ddof parameter.
# Instead, it returns 0.5 for the (val1, val1) entry of the covariance matrix. This term should be
# infinity to match the behavior of the non-null DataFrame above, which handles the ddof argument.
# https://github.com/pandas-dev/pandas/blob/bb1f651536508cdfef8550f93ace7849b00046ee/pandas/core/frame.py#L9658-L9666

print(null_df.cov(ddof=2))
      val1  val2
val1   0.5   NaN
val2   NaN   NaN

# A groupby covariance behaves like the call above, showing the same (incorrect) result:
print(null_df.groupby("id").cov(ddof=2))
         val1  val2
id                 
a  val1   0.5   NaN
   val2   NaN   NaN

Issue Description

The problem: pandas and numpy have mismatched behavior when computing covariance (.cov()) in the presence of missing/NaN values. This has also been highlighted in issue #16837.

In estimating covariance, the data is normalized by (N - ddof). Therefore, when the number of observations N equals the value passed for ddof, division by zero yields infinity (inf).
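As a point of comparison (this snippet is not part of the original report), numpy's own np.cov shows the expected division-by-zero behavior for the same data:

import numpy as np

x = np.array([1.0, 2.0])  # N = 2 observations
# With ddof=2 the normalizer N - ddof is 0, so the sum of squared
# deviations (0.5) is divided by zero; numpy emits a RuntimeWarning
# and returns inf.
print(np.cov(x, ddof=2))  # inf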

I think this is specifically an issue in pandas code: pandas uses numpy for the calculation if no values are missing (NaN).
On the flip side, a pandas-internal implementation, libalgos.nancorr, is used when nulls are present, and it currently does not use ddof to normalize the data before estimating the covariance.
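For illustration only, here is a minimal sketch of a pairwise-deletion covariance that keeps honoring ddof. The helper name cov_pairwise is hypothetical; this is not how pandas implements cov internally, just the behavior the report expects:

import numpy as np
import pandas as pd

def cov_pairwise(df, ddof=1):
    # Hypothetical helper: pairwise-deletion covariance that still normalizes by N - ddof.
    num = df.select_dtypes(include="number")
    cols = num.columns
    out = pd.DataFrame(np.nan, index=cols, columns=cols, dtype=float)
    for i, a in enumerate(cols):
        for b in cols[i:]:
            valid = num[a].notna() & num[b].notna()
            n = int(valid.sum())
            if n == 0:
                continue  # no overlapping observations for this pair -> leave NaN
            xa = num.loc[valid, a] - num.loc[valid, a].mean()
            xb = num.loc[valid, b] - num.loc[valid, b].mean()
            with np.errstate(divide="ignore", invalid="ignore"):
                cov = np.divide((xa * xb).sum(), n - ddof)
            out.loc[a, b] = out.loc[b, a] = cov
    return out

# With null_df above, this returns inf for (val1, val1) and NaN elsewhere,
# matching the non-null DataFrame path.
print(cov_pairwise(null_df, ddof=2))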

Expected Behavior

Refer to the Reproducible Example above: with missing data, cov(ddof=2) should still normalize by N - ddof, so the (val1, val1) entry should be inf (matching the non-null DataFrame), not 0.5.

Installed Versions

In [5]: pd.show_versions()

INSTALLED VERSIONS

commit : 66e3805
python : 3.8.12.final.0
python-bits : 64
OS : Linux
OS-release : 4.15.0-76-generic
Version : #86-Ubuntu SMP Fri Jan 17 17:24:28 UTC 2020
machine : x86_64
processor : x86_64
byteorder : little
LC_ALL : None
LANG : None
LOCALE : en_US.UTF-8

pandas : 1.3.5
numpy : 1.21.5
pytz : 2021.3
dateutil : 2.8.2
pip : 21.3.1
setuptools : 59.8.0
Cython : 0.29.26
pytest : 6.2.5
hypothesis : 6.36.0
sphinx : 4.4.0
blosc : None
feather : None
xlsxwriter : None
lxml.etree : 4.7.1
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : 3.0.3
IPython : 8.0.1
pandas_datareader: None
bs4 : 4.10.0
bottleneck : None
fsspec : 2022.01.0
fastparquet : None
gcsfs : None
matplotlib : None
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : 5.0.0
pyxlsb : None
s3fs : None
scipy : None
sqlalchemy : None
tables : None
tabulate : None
xarray : None
xlrd : None
xlwt : None
numba : 0.55.0

skirui-source added the Bug and Needs Triage labels on Feb 4, 2022
skirui-source changed the title from "BUG: Inaccuracy in groupby covariance when min_periods==ddof==2" to "BUG: Covariance does not handle ddof argument if data is missing." on Feb 9, 2022
mroeschke added the Missing-data and Reduction Operations labels and removed the Needs Triage label on Feb 11, 2022