-
Notifications
You must be signed in to change notification settings - Fork 5
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
pandas to matrix function #16
Conversation
src/quantcore/matrix/pandas.py
Outdated
|
||
|
||
def from_pandas( | ||
df: pd.DataFrame, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we support both sparse and dense DataFrames?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
sparse meaning a data frame that consists of pd.SparseArray
columns? That sounds useful 😉
If you have a Pandas data frame that only consists of sparse arrays, you can use the sparse accessor to convert that data frame to a scipy sparse matrix using X.sparse.to_coo()
. Note that the .sparse
accessor only can do .to_coo()
, so if you want something else, you then need to do to_csr()
or to_csc()
afterwards.
Since this only works on data frames for which all columns are sparse,the SplitMatrix functionality is coming in very handy here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh, huh, I wasn't aware of pd.SparseArray
. I meant pd.SparseDataFrame, but as of 1.0.0 that no longer exists since the API for sparse stuff has totally changed. Do you think we should support Pandas < 1.0.0?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I discovered sparsity in pandas this morning. What I am currently doing is to assume someone is working with the latest pandas API. If they are not, then this will still work but there will be some performance penalty. We can tackle that in the future if it would be useful.
Currently (will push soon), there's native handling of pd.Categorical
and pd.SparseArray
which are mapped to mx.CategoricalMatrix
and mx.SparseMatrix
, respectively. All the other numerical columns are converted to either SparseMatrix or DenseMatrix depending on data density.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you think we should support Pandas < 1.0.0?
I wouldn't bend over backwards to support Pandas < 1.0. There were also substantial changes in the categorical type with 1.0, so we may also face some issues there. Still, we probably shouldn't require Pandas >= 1.0 in general. I think it's okay to say you can only use from_pandas()
if you have >=1.0
src/quantcore/matrix/constructor.py
Outdated
dense_comp = DenseMatrix(df.iloc[:, dense_idx].astype(dtype)) | ||
matrices.append(dense_comp) | ||
if len(sparse_idx) > 0: | ||
sparse_comp = SparseMatrix(df.iloc[:, sparse_idx].sparse.to_coo(), dtype=dtype) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
At the moment, I don't think this preserves column ordering. Is that intended? How will a user know the mapping from input columns to the SplitMatrix columns?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
e.g. If you build a SplitMatrix using this function, then estimate a GLM using that matrix, how will you know which features the coefficients correspond to?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree. See #21. The problem with the current implementation is that we are changing the shape of the matrix with the categoricals. If we store column names, we can use that to preserve the link between pandas and quantcore.matrix. Otherwise, do you have a suggestion?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It would be nice if the columns came out in the same order as if you used pd.to_dummies or sklearn OneHotEncoder, though. Could you try plugging this into the quantcore.glm_benchmarks data setup and see if it breaks things?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Iff not adding column name metadata seems like an acceptable workaround we can deal with in the future
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The current solution that we've been using elsewhere is to just maintain column ordering exactly as it appears in the input object by passing the indices to the SplitMatrix constructor: https://github.com/Quantco/quantcore.matrix/blob/4e72d638e324b66abafa2fa6e9347d3d08723f97/src/quantcore/matrix/split_matrix.py#L34.
In the case of categoricals, this should just expand the categorical to take more columns without changing the ordering. e.g. if you have a categorical with 100 categories and two dense columns, then col idx 0 is dense, col idx 1-100 are categoricals and col idx 101 is dense.
But, this maintain-the-ordering solution is suboptimal for performance because the dense/sparse columns aren't contiguous. It requires a bit more bookkeeping internal to SplitMatrix than an alternate solution like you propose in #21. But, maintaining the ordering is simpler in other ways for the user.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
okay, let me keep the ordering and expanding with categoricals.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I finally decided to implement both orderings and let the user decide. if cat_position == 'end', categoricals are going to be at the end, similar to pd.get_dummies
, while if cat_position == 'expand', categoricals are going to stay at the same position but be expanded.
solves #13
decided on the
from_pandas
name since it's more standard and would allow other constructors likefrom_numpy
still some work to do:
Docstring
Tests
Make it more efficient (currently creating a sparse matrix to reuse
csc_to_split
)Find better module name