BUG: Covariance does not handle ddof argument if data is missing. #45814

Open
3 tasks
skirui-source opened this issue Feb 4, 2022 · 0 comments
Labels
Bug, cov/corr, Missing-data (np.nan, pd.NaT, pd.NA, dropna, isnull, interpolate), Reduction Operations (sum, mean, min, max, etc.)

Comments


skirui-source commented Feb 4, 2022

Pandas version checks

  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest version of pandas.

  • I have confirmed this bug exists on the main branch of pandas.

Reproducible Example

import numpy as np
import pandas as pd

print("DataFrame without null values:")
non_null_df = pd.DataFrame({"id": ["a", "a"], "val1": [1.0, 2.0], "val2": [1.0, 2.0],})

DataFrame without null values:
  id  val1  val2
0  a   1.0   1.0
1  a   2.0   2.0

# We normalize by N-ddof = 2-2 = 0, and thus get infinity from dividing by 0.0
# This is the expected result:
print(non_null_df.cov(ddof=2))
      val1  val2
val1   inf   inf
val2   inf   inf

# A groupby covariance behaves like the function above, as expected:
print(non_null_df.groupby("id").cov(ddof=2))
         val1  val2
id                 
a  val1   inf   inf
   val2   inf   inf



print("DataFrame with null values:")
null_df = pd.DataFrame({ "id": ["a", "a"], "val1": [1.0, 2.0],"val2": [np.nan, np.nan],})

DataFrame with null values:
  id  val1  val2
0  a   1.0   NaN
1  a   2.0   NaN

# We expect to normalize by N-ddof = 2-2 = 0, but ddof is ignored because there are null values.
# The underlying problem is that libalgos.nancorr does not accept and use the provided ddof parameter.
# Instead, it returns 0.5 for the (val1, val1) entry of the covariance matrix. This term should be
# infinity to match the behavior of the non-null DataFrame above, which handles the ddof argument.
# https://github.com/pandas-dev/pandas/blob/bb1f651536508cdfef8550f93ace7849b00046ee/pandas/core/frame.py#L9658-L9666

print(null_df.cov(ddof=2))
      val1  val2
val1   0.5   NaN
val2   NaN   NaN

# A groupby covariance behaves like the call above, showing the same (incorrect) result:
print(null_df.groupby("id").cov(ddof=2))
         val1  val2
id                 
a  val1   0.5   NaN
   val2   NaN   NaN

Issue Description

The problem: pandas and numpy have mismatched behavior when computing covariance (.cov()) in the presence of missing/NaN values. This has also been highlighted in issue #16837.

In estimating covariance, the data is normalized by (N - ddof). Therefore, when the number of observations N equals the value passed for ddof, division by zero yields infinity (inf).
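As a point of comparison (this snippet is not part of the original report), numpy's own np.cov shows the expected division-by-zero behavior for the same data:

import numpy as np

x = np.array([1.0, 2.0])  # N = 2 observations
# With ddof=2 the normalizer N - ddof is 0, so the sum of squared
# deviations (0.5) is divided by zero; numpy emits a RuntimeWarning
# and returns inf.
print(np.cov(x, ddof=2))  # inf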

I think this is specifically an issue in pandas code: pandas uses numpy for the calculation if no values are missing (NaN).
On the flip side, a pandas-internal implementation, libalgos.nancorr, is used when nulls are present, and it currently does not use ddof to normalize the data before estimating the covariance.
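For illustration only, here is a minimal sketch of a pairwise-deletion covariance that keeps honoring ddof. The helper name cov_pairwise is hypothetical; this is not how pandas implements cov internally, just the behavior the report expects:

import numpy as np
import pandas as pd

def cov_pairwise(df, ddof=1):
    # Hypothetical helper: pairwise-deletion covariance that still normalizes by N - ddof.
    num = df.select_dtypes(include="number")
    cols = num.columns
    out = pd.DataFrame(np.nan, index=cols, columns=cols, dtype=float)
    for i, a in enumerate(cols):
        for b in cols[i:]:
            valid = num[a].notna() & num[b].notna()
            n = int(valid.sum())
            if n == 0:
                continue  # no overlapping observations for this pair -> leave NaN
            xa = num.loc[valid, a] - num.loc[valid, a].mean()
            xb = num.loc[valid, b] - num.loc[valid, b].mean()
            with np.errstate(divide="ignore", invalid="ignore"):
                cov = np.divide((xa * xb).sum(), n - ddof)
            out.loc[a, b] = out.loc[b, a] = cov
    return out

# With null_df above, this returns inf for (val1, val1) and NaN elsewhere,
# matching the non-null DataFrame path.
print(cov_pairwise(null_df, ddof=2))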

Expected Behavior

Refer to the Reproducible Example above: with missing data, cov(ddof=2) should still normalize by N - ddof, so the (val1, val1) entry should be inf (matching the non-null DataFrame), not 0.5.

Installed Versions

In [5]: pd.show_versions()

INSTALLED VERSIONS

commit : 66e3805
python : 3.8.12.final.0
python-bits : 64
OS : Linux
OS-release : 4.15.0-76-generic
Version : #86-Ubuntu SMP Fri Jan 17 17:24:28 UTC 2020
machine : x86_64
processor : x86_64
byteorder : little
LC_ALL : None
LANG : None
LOCALE : en_US.UTF-8

pandas : 1.3.5
numpy : 1.21.5
pytz : 2021.3
dateutil : 2.8.2
pip : 21.3.1
setuptools : 59.8.0
Cython : 0.29.26
pytest : 6.2.5
hypothesis : 6.36.0
sphinx : 4.4.0
blosc : None
feather : None
xlsxwriter : None
lxml.etree : 4.7.1
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : 3.0.3
IPython : 8.0.1
pandas_datareader: None
bs4 : 4.10.0
bottleneck : None
fsspec : 2022.01.0
fastparquet : None
gcsfs : None
matplotlib : None
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : 5.0.0
pyxlsb : None
s3fs : None
scipy : None
sqlalchemy : None
tables : None
tabulate : None
xarray : None
xlrd : None
xlwt : None
numba : 0.55.0

skirui-source added the Bug and Needs Triage labels on Feb 4, 2022
skirui-source changed the title from "BUG: Inaccuracy in groupby covariance when min_periods==ddof==2" to "BUG: Covariance does not handle ddof argument if data is missing." on Feb 9, 2022
mroeschke added the Missing-data and Reduction Operations labels and removed the Needs Triage label on Feb 11, 2022