pandas to matrix function #16

Merged
merged 8 commits into from
Jul 20, 2020
2 changes: 2 additions & 0 deletions src/quantcore/matrix/__init__.py
@@ -1,6 +1,7 @@
from .categorical_matrix import CategoricalMatrix
from .dense_matrix import DenseMatrix
from .matrix_base import MatrixBase, one_over_var_inf_to_val
from .pandas import from_pandas
from .sparse_matrix import SparseMatrix
from .split_matrix import SplitMatrix, csc_to_split
from .standardized_mat import StandardizedMatrix
@@ -14,4 +15,5 @@
    "CategoricalMatrix",
    "csc_to_split",
    "one_over_var_inf_to_val",
    "from_pandas",
]
39 changes: 39 additions & 0 deletions src/quantcore/matrix/pandas.py
@@ -0,0 +1,39 @@
import warnings

import pandas as pd
import scipy.sparse as sps

from .categorical_matrix import CategoricalMatrix
from .matrix_base import MatrixBase
from .split_matrix import SplitMatrix, csc_to_split


def from_pandas(
    df: pd.DataFrame,
Contributor

Can we support both sparse and dense DataFrames?

Member

sparse meaning a data frame that consists of pd.SparseArray columns? That sounds useful 😉

If you have a pandas data frame that consists only of sparse arrays, you can use the sparse accessor to convert it to a SciPy sparse matrix with X.sparse.to_coo(). Note that the .sparse accessor can only do .to_coo(), so if you want another format, you need to call .tocsr() or .tocsc() on the result afterwards.

Since this only works on data frames in which all columns are sparse, the SplitMatrix functionality comes in very handy here.
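
For reference, the accessor round trip described above can be sketched as follows. This is a minimal illustration, assuming pandas >= 1.0 and SciPy are installed; the column names and values are made up for the example.

```python
import pandas as pd

# A frame in which every column is backed by a SparseArray (illustrative data).
df = pd.DataFrame(
    {
        "a": pd.arrays.SparseArray([0.0, 1.0, 0.0, 2.0], fill_value=0.0),
        "b": pd.arrays.SparseArray([0.0, 0.0, 3.0, 0.0], fill_value=0.0),
    }
)

# The .sparse accessor only offers COO output ...
X_coo = df.sparse.to_coo()

# ... so any other format is a second step on the SciPy side.
X_csc = X_coo.tocsc()
```

Note that the conversion method on the SciPy matrix is spelled `tocsc()` / `tocsr()`, without an underscore.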

Contributor

Oh, huh, I wasn't aware of pd.SparseArray. I meant pd.SparseDataFrame, but as of 1.0.0 that no longer exists since the API for sparse stuff has totally changed. Do you think we should support Pandas < 1.0.0?

Member Author

I discovered sparsity in pandas this morning. What I am currently doing is assuming that the user is working with the latest pandas API. If they are not, this will still work, but with some performance penalty. We can tackle that in the future if it would be useful.

Currently (will push soon), there's native handling of pd.Categorical and pd.SparseArray, which are mapped to mx.CategoricalMatrix and mx.SparseMatrix, respectively. All other numerical columns are converted to either SparseMatrix or DenseMatrix depending on data density.
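
To make the dispatch rule concrete, here is a rough sketch of how a per-column decision of this kind could look. It is not the code in this PR: `classify_column` is a hypothetical helper, it returns labels instead of building matrices, and treating "fraction of non-zero entries at or below `sparse_threshold`" as the sparse case is an assumption about what "data density" means here.

```python
import pandas as pd


def classify_column(col: pd.Series, sparse_threshold: float = 0.1) -> str:
    """Hypothetical helper: name the matrix type a column would map to."""
    if isinstance(col.dtype, pd.CategoricalDtype):
        return "categorical"
    if isinstance(col.dtype, pd.SparseDtype):
        return "sparse"
    if pd.api.types.is_numeric_dtype(col.dtype):
        # Assumption: density is the share of non-zero entries in the column.
        density = float((col != 0).mean())
        return "sparse" if density <= sparse_threshold else "dense"
    # Anything else (e.g. object columns) is left out in this sketch.
    return "ignored"


mostly_zero = pd.Series([0.0] * 19 + [1.0])  # density 0.05
mostly_full = pd.Series([1.0, 2.0, 0.0])     # density 2/3
```

With the default threshold, `mostly_zero` would be routed to the sparse representation and `mostly_full` to the dense one.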

Member

> Do you think we should support Pandas < 1.0.0?

I wouldn't bend over backwards to support pandas < 1.0. There were also substantial changes in the categorical type with 1.0, so we may face some issues there as well. Still, we probably shouldn't require pandas >= 1.0 in general. I think it's okay to say you can only use from_pandas() if you have pandas >= 1.0.
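
One way to restrict only from_pandas() to pandas >= 1.0, without making it a package-wide requirement, would be a check at call time. The sketch below is only an illustration of that idea; `pandas_major_version` and `require_pandas_1` are hypothetical names, not part of this PR.

```python
import pandas as pd


def pandas_major_version(version: str = pd.__version__) -> int:
    """Return the major component of a pandas version string."""
    return int(version.split(".")[0])


def require_pandas_1() -> None:
    """Hypothetical guard, meant to be called at the top of from_pandas()."""
    if pandas_major_version() < 1:
        raise ImportError(
            f"from_pandas() requires pandas >= 1.0, found {pd.__version__}"
        )
```

The rest of the package would then import fine on older pandas, and only callers of from_pandas() would see the error.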

    sparse_threshold: float = 0.1,
    cat_threshold: int = 4,
    object_as_cat: bool = False,
) -> MatrixBase:
    """
    TODO:
    - docstring
    - tests
    - efficiency
    - consider changing filename
    """
    if object_as_cat:
        for colname in df.select_dtypes("object"):
            df[colname] = df[colname].astype("category")
    else:
        if not df.select_dtypes(include=object).empty:
            warnings.warn("DataFrame contains columns with object dtypes. Ignoring")

    categorical_component = df.select_dtypes(include=pd.CategoricalDtype)
    X_cat = []
    for colname in categorical_component:
        X_cat.append(CategoricalMatrix(categorical_component[colname]))

    numerical_component = df.select_dtypes(include="number")
    X_noncat = csc_to_split(sps.csc_matrix(numerical_component))

    return SplitMatrix([*X_noncat.matrices, *X_cat])
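
The dtype-based split that this function relies on can be tried out with plain pandas and SciPy, without the quantcore classes. The following is a minimal sketch assuming pandas >= 1.0; the example frame and variable names are illustrative, and converting object columns on a copy (so the caller's frame is left untouched) is a choice made for this sketch, not something the diff does.

```python
import pandas as pd
import scipy.sparse as sps

df = pd.DataFrame(
    {
        "color": pd.Series(["red", "blue", "red"], dtype=object),
        "size": pd.Categorical(["S", "M", "S"]),
        "x": [0.0, 1.5, 0.0],
        "n": [1, 0, 2],
    }
)

# Sketch-only choice: work on a copy instead of mutating the input frame.
converted = df.copy()
for colname in converted.select_dtypes("object"):
    converted[colname] = converted[colname].astype("category")

# Split the columns by dtype, mirroring the two select_dtypes calls.
categorical_component = converted.select_dtypes(include="category")
numerical_component = converted.select_dtypes(include="number")

# The numeric block becomes a single CSC matrix; only non-zeros are stored.
X_num = sps.csc_matrix(numerical_component.to_numpy(dtype=float))
```

In the PR itself, the categorical columns would then each be wrapped in a CategoricalMatrix and the CSC block passed through csc_to_split.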