pandas to matrix function #16

Merged
merged 8 commits into from
Jul 20, 2020
Changes from 2 commits
2 changes: 2 additions & 0 deletions src/quantcore/matrix/__init__.py
@@ -1,6 +1,7 @@
from .categorical_matrix import CategoricalMatrix
from .dense_matrix import DenseMatrix
from .matrix_base import MatrixBase, one_over_var_inf_to_val
from .pandas import from_pandas
from .sparse_matrix import SparseMatrix
from .split_matrix import SplitMatrix, csc_to_split
from .standardized_mat import StandardizedMatrix
@@ -14,4 +15,5 @@
    "CategoricalMatrix",
    "csc_to_split",
    "one_over_var_inf_to_val",
    "from_pandas",
]
100 changes: 100 additions & 0 deletions src/quantcore/matrix/pandas.py
@@ -0,0 +1,100 @@
import warnings

import numpy as np
import pandas as pd
from pandas.api.types import is_numeric_dtype

from .categorical_matrix import CategoricalMatrix
from .dense_matrix import DenseMatrix
from .matrix_base import MatrixBase
from .sparse_matrix import SparseMatrix
from .split_matrix import SplitMatrix


def from_pandas(
    df: pd.DataFrame,
Contributor

Can we support both sparse and dense DataFrames?

Member

Sparse meaning a data frame that consists of pd.SparseArray columns? That sounds useful 😉

If you have a Pandas data frame that consists only of sparse arrays, you can use the .sparse accessor to convert it to a SciPy sparse matrix with X.sparse.to_coo(). Note that the .sparse accessor can only do .to_coo(), so if you want another format, you need to call .tocsr() or .tocsc() on the result afterwards.

Since this only works on data frames in which all columns are sparse, the SplitMatrix functionality comes in very handy here.
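The conversion described in this comment can be sketched as follows. This is only an illustration: the column names and values are made up, and fill_value=0 is set explicitly because the default fill value for float data is NaN, which is not what a sparse matrix of zeros needs.

```python
import pandas as pd

# Made-up example data: every column is a sparse array with fill_value=0.
df = pd.DataFrame(
    {
        "a": pd.arrays.SparseArray([0.0, 0.0, 1.5, 0.0], fill_value=0),
        "b": pd.arrays.SparseArray([2.0, 0.0, 0.0, 0.0], fill_value=0),
    }
)

# The .sparse accessor only offers COO output ...
coo = df.sparse.to_coo()

# ... so other formats are obtained from the SciPy matrix afterwards.
csc = coo.tocsc()
csr = coo.tocsr()
```

Only the two non-zero entries are stored in the resulting matrices. A data frame with even one non-sparse column cannot use the accessor, which is why splitting the columns by type is needed first.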

Contributor

Oh, huh, I wasn't aware of pd.SparseArray. I meant pd.SparseDataFrame, but as of 1.0.0 that no longer exists since the API for sparse stuff has totally changed. Do you think we should support Pandas < 1.0.0?

Member Author

I discovered sparsity in pandas this morning. What I am currently doing is to assume someone is working with the latest pandas API. If they are not, then this will still work but there will be some performance penalty. We can tackle that in the future if it would be useful.

Currently (will push soon), there's native handling of pd.Categorical and pd.SparseArray, which are mapped to mx.CategoricalMatrix and mx.SparseMatrix, respectively. All other numerical columns are converted to either SparseMatrix or DenseMatrix depending on data density.
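As a rough sketch of the density rule for plain numerical columns: the helper name to_sparse_if_low_density and the sample series below are hypothetical, and only the density test and the SparseDtype conversion mirror what the diff does.

```python
import pandas as pd


def to_sparse_if_low_density(col: pd.Series, sparse_threshold: float = 0.1) -> pd.Series:
    """Hypothetical helper: return a sparse copy if few entries are non-zero."""
    density = (col != 0).mean()  # share of non-zero entries
    if density <= sparse_threshold:
        # fill_value=0 so that only the non-zero entries are stored
        return col.astype(pd.SparseDtype(col.dtype, fill_value=0))
    return col


mostly_zero = pd.Series([0.0] * 19 + [3.0])    # 1 of 20 entries is non-zero
mostly_full = pd.Series([1.0, 2.0, 0.0, 4.0])  # 3 of 4 entries are non-zero

sparse_col = to_sparse_if_low_density(mostly_zero)
dense_col = to_sparse_if_low_density(mostly_full)
```

Here sparse_col ends up with a pd.SparseDtype, while dense_col keeps its original float64 dtype.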

Member

> Do you think we should support Pandas < 1.0.0?

I wouldn't bend over backwards to support Pandas < 1.0. There were also substantial changes to the categorical type in 1.0, so we may face some issues there as well. Still, we probably shouldn't require Pandas >= 1.0 in general. I think it's okay to say you can only use from_pandas() if you have Pandas >= 1.0.
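One way such a restriction could be expressed is a version check at call time rather than a package-wide requirement. This is a sketch only: version_at_least and require_pandas_1_0 are hypothetical names, not part of this PR, and the parsing assumes the usual "major.minor.patch" version string.

```python
import pandas as pd


def version_at_least(version: str, major: int, minor: int = 0) -> bool:
    """Hypothetical helper: compare the leading 'major.minor' of a version string."""
    parts = version.split(".")
    return (int(parts[0]), int(parts[1])) >= (major, minor)


def require_pandas_1_0() -> None:
    """Hypothetical guard that a function like from_pandas() could call first."""
    if not version_at_least(pd.__version__, 1, 0):
        raise ImportError(
            f"from_pandas() needs pandas >= 1.0, found {pd.__version__}"
        )
```

With a guard like this, the rest of the package stays importable on older pandas, and only the conversion function refuses to run.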

    dtype: np.dtype = np.float64,
    sparse_threshold: float = 0.1,
    cat_threshold: int = 4,
    object_as_cat: bool = False,
) -> MatrixBase:
    """
    Transform a pandas.DataFrame into an efficient SplitMatrix

    Parameters
    ----------
    df : pd.DataFrame
        pandas DataFrame to be converted.
    dtype : np.dtype, default np.float64
        dtype of all sub-matrices of the resulting SplitMatrix.
    sparse_threshold : float, default 0.1
        Density threshold below which numerical columns will be stored in a sparse
        format.
    cat_threshold : int, default 4
        Number of levels of a categorical column under which the column will be stored
        as sparse one-hot-encoded columns instead of CategoricalMatrix
    object_as_cat : bool, default False
        If True, DataFrame columns stored as python objects will be treated as
        categorical columns.

    Returns
    -------
    SplitMatrix
    """
    if object_as_cat:
        for colname in df.select_dtypes("object"):
            df[colname] = df[colname].astype("category")

    matrices = []
    sparse_ohe_comp = []
    sparse_idx = []
    dense_idx = []
    ignored_cols = []
    for colidx, (colname, coldata) in enumerate(df.iteritems()):
        # categorical
        if isinstance(coldata.dtype, pd.CategoricalDtype):
            if len(coldata.cat.categories) < cat_threshold:
                sparse_ohe_comp.append(
                    pd.get_dummies(coldata, prefix=colname, sparse=True)
                )
            else:
                matrices.append(CategoricalMatrix(coldata, dtype=dtype))

        # sparse data, keep in sparse format even if density is larger than threshold
        elif isinstance(coldata.dtype, pd.SparseDtype):
            sparse_idx.append(colidx)

        # All other numerical dtypes (needs to be after pd.SparseDtype)
        elif is_numeric_dtype(coldata):
            # check if we want to store as sparse
            if (coldata != 0).mean() <= sparse_threshold:
                sparse_dtype = pd.SparseDtype(coldata.dtype, fill_value=0)
                df.iloc[:, colidx] = df.iloc[:, colidx].astype(sparse_dtype)
                sparse_idx.append(colidx)
            else:
                dense_idx.append(colidx)

        # dtype not handled yet
        else:
            ignored_cols.append((colidx, colname))

    if len(ignored_cols) > 0:
        warnings.warn(
            f"Columns {ignored_cols} were ignored. Make sure they have a valid dtype."
        )
    if len(dense_idx) > 0:
        dense_comp = DenseMatrix(df.iloc[:, dense_idx].astype(dtype))
        matrices.append(dense_comp)
    if len(sparse_idx) > 0:
        sparse_comp = SparseMatrix(df.iloc[:, sparse_idx].sparse.to_coo(), dtype=dtype)
        matrices.append(sparse_comp)
    if len(sparse_ohe_comp) > 0:
        sparse_ohe_comp = SparseMatrix(
            pd.concat(sparse_ohe_comp, axis=1).sparse.to_coo(), dtype=dtype
        )
        matrices.append(sparse_ohe_comp)

    if len(matrices) > 1:
        return SplitMatrix(matrices)
    else:
        return matrices[0]
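
To make the dispatch order in from_pandas() easier to follow, here is a pandas-only sketch that labels each column with the storage it would receive. classify_columns and the example frame are hypothetical and do not touch the quantcore matrix classes. Unlike the diff, the sketch does not modify the input frame, and it uses DataFrame.items() because iteritems() was removed in pandas 2.0.

```python
import pandas as pd
from pandas.api.types import is_numeric_dtype


def classify_columns(df, sparse_threshold=0.1, cat_threshold=4):
    """Hypothetical helper: map each column name to its intended storage."""
    labels = {}
    for colname, coldata in df.items():
        if isinstance(coldata.dtype, pd.CategoricalDtype):
            few_levels = len(coldata.cat.categories) < cat_threshold
            labels[colname] = "sparse one-hot" if few_levels else "categorical"
        # Checked before the numeric branch so sparse columns stay sparse
        # regardless of their density.
        elif isinstance(coldata.dtype, pd.SparseDtype):
            labels[colname] = "sparse"
        elif is_numeric_dtype(coldata):
            low_density = (coldata != 0).mean() <= sparse_threshold
            labels[colname] = "sparse" if low_density else "dense"
        else:
            labels[colname] = "ignored"
    return labels


# Made-up example frame with one column per branch.
df = pd.DataFrame(
    {
        "cat_small": pd.Categorical(["a", "b", "a", "b"] * 5),
        "cat_large": pd.Categorical(list("abcde") * 4),
        "already_sparse": pd.arrays.SparseArray([1.0] * 20, fill_value=0),
        "mostly_zero": [0.0] * 19 + [1.0],
        "dense": [float(i + 1) for i in range(20)],
        "text": ["x"] * 20,
    }
)
labels = classify_columns(df)
```

Note that "already_sparse" is fully dense in terms of values but is still labelled sparse, matching the comment in the diff that sparse input is kept in sparse format even above the threshold.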