pandas to matrix function #16

Merged · 8 commits · Jul 20, 2020

Conversation

MarcAntoineSchmidtQC (Member) commented Jul 10, 2020:

Solves #13.

  • Decided on the from_pandas name since it's more standard and would allow other constructors like from_numpy.

  • Still some work to do:
    • Docstring
    • Tests
    • Make it more efficient (currently creating a sparse matrix to reuse csc_to_split)
    • Find a better module name



def from_pandas(
    df: pd.DataFrame,

Contributor:

Can we support both sparse and dense DataFrames?

Member:

Sparse meaning a data frame that consists of pd.SparseArray columns? That sounds useful 😉

If you have a pandas data frame that consists only of sparse arrays, you can use the sparse accessor to convert it to a scipy sparse matrix via X.sparse.to_coo(). Note that the .sparse accessor can only produce COO, so if you want something else, you need to call .tocsr() or .tocsc() on the result afterwards.

Since this only works on data frames for which all columns are sparse, the SplitMatrix functionality comes in very handy here.
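
For reference, a minimal sketch of that conversion path (toy data; this assumes pandas >= 1.0 and an all-sparse frame):

import pandas as pd

# Toy frame whose columns are all backed by pandas sparse arrays
df = pd.DataFrame({
    "a": pd.arrays.SparseArray([0.0, 0.0, 1.5]),
    "b": pd.arrays.SparseArray([0.0, 2.0, 0.0]),
})

# The .sparse accessor only produces a COO matrix...
coo = df.sparse.to_coo()
# ...so convert afterwards if CSR/CSC is needed.
csc = coo.tocsc()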

Contributor:

Oh, huh, I wasn't aware of pd.SparseArray. I meant pd.SparseDataFrame, but as of 1.0.0 that no longer exists since the API for sparse stuff has totally changed. Do you think we should support Pandas < 1.0.0?

Member Author:

I discovered sparsity in pandas this morning. The current implementation assumes the user is working with the latest pandas API. If they are not, this will still work, but with some performance penalty. We can tackle that in the future if it would be useful.

Currently (will push soon), there's native handling of pd.Categorical and pd.SparseArray, which are mapped to mx.CategoricalMatrix and mx.SparseMatrix, respectively. All other numerical columns are converted to either SparseMatrix or DenseMatrix depending on data density.
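
Roughly, the per-column dispatch described above would look like the sketch below (the helper name and density threshold are illustrative, not the actual implementation):

import pandas as pd

def _choose_matrix_type(col: pd.Series, sparse_threshold: float = 0.1) -> str:
    """Hypothetical illustration of the column-to-matrix-type mapping."""
    if isinstance(col.dtype, pd.CategoricalDtype):
        return "CategoricalMatrix"
    if isinstance(col.dtype, pd.SparseDtype):
        return "SparseMatrix"
    # Remaining numerical columns: dense vs. sparse based on data density
    density = float((col != 0).mean())
    return "DenseMatrix" if density > sparse_threshold else "SparseMatrix"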

Member:

> Do you think we should support Pandas < 1.0.0?

I wouldn't bend over backwards to support Pandas < 1.0. There were also substantial changes to the categorical type with 1.0, so we may face some issues there as well. Still, we probably shouldn't require Pandas >= 1.0 in general. I think it's okay to say you can only use from_pandas() if you have >= 1.0.
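
One way to enforce that, sketched below (the check's placement and message are assumptions, not what the PR does):

import pandas as pd
from packaging.version import Version

def _check_pandas_version() -> None:
    # from_pandas relies on the pandas >= 1.0 sparse/categorical APIs
    if Version(pd.__version__) < Version("1.0.0"):
        raise ImportError("from_pandas() requires pandas >= 1.0")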

MarcAntoineSchmidtQC marked this pull request as draft on July 13, 2020.
MarcAntoineSchmidtQC changed the title from "pandas to matrix function" to "[WIP] pandas to matrix function" on Jul 13, 2020.
MarcAntoineSchmidtQC marked this pull request as ready for review on July 16, 2020.
MarcAntoineSchmidtQC changed the title from "[WIP] pandas to matrix function" to "pandas to matrix function" on Jul 16, 2020.

# Diff excerpt: assembling the dense and sparse components of the SplitMatrix
dense_comp = DenseMatrix(df.iloc[:, dense_idx].astype(dtype))
matrices.append(dense_comp)
if len(sparse_idx) > 0:
    sparse_comp = SparseMatrix(df.iloc[:, sparse_idx].sparse.to_coo(), dtype=dtype)

Collaborator:

At the moment, I don't think this preserves column ordering. Is that intended? How will a user know the mapping from input columns to the SplitMatrix columns?

Collaborator:

For example, if you build a SplitMatrix using this function and then estimate a GLM using that matrix, how will you know which features the coefficients correspond to?

MarcAntoineSchmidtQC (Member Author) commented Jul 16, 2020:

I agree. See #21. The problem with the current implementation is that we are changing the shape of the matrix with the categoricals. If we store column names, we can use that to preserve the link between pandas and quantcore.matrix. Otherwise, do you have a suggestion?

Contributor:

It would be nice if the columns came out in the same order as if you had used pd.get_dummies or sklearn's OneHotEncoder, though. Could you try plugging this into the quantcore.glm_benchmarks data setup and see if it breaks anything?

Contributor:

If not, adding column name metadata seems like an acceptable workaround we can deal with in the future.

Collaborator:

The current solution that we've been using elsewhere is to just maintain column ordering exactly as it appears in the input object by passing the indices to the SplitMatrix constructor: https://github.com/Quantco/quantcore.matrix/blob/4e72d638e324b66abafa2fa6e9347d3d08723f97/src/quantcore/matrix/split_matrix.py#L34.

In the case of categoricals, this should just expand the categorical to take more columns without changing the ordering. For example, if you have a categorical with 100 categories and two dense columns, then column index 0 is dense, column indices 1-100 are the categorical, and column index 101 is dense.

However, this maintain-the-ordering solution is suboptimal for performance because the dense/sparse columns aren't contiguous, and it requires a bit more bookkeeping internal to SplitMatrix than an alternative like the one you propose in #21. Still, maintaining the ordering is simpler in other ways for the user.
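
As a small illustration of the bookkeeping that example implies (not the actual SplitMatrix internals):

# Input column order: [dense, categorical with 100 categories, dense]
n_categories = 100
out_columns = {
    "dense_0": [0],                                   # first dense column stays at index 0
    "categorical": list(range(1, 1 + n_categories)),  # expands to indices 1 through 100
    "dense_1": [1 + n_categories],                    # second dense column lands at index 101
}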

Member Author:

Okay, let me keep the ordering and expand with categoricals.

MarcAntoineSchmidtQC (Member Author) commented Jul 16, 2020:

I finally decided to implement both orderings and let the user decide. If cat_position == 'end', categoricals go at the end, similar to pd.get_dummies, while if cat_position == 'expand', categoricals stay at the same position but are expanded.
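
Assuming the function ends up exposed at the package level (the import path here is a guess), usage would look roughly like this:

import pandas as pd
import quantcore.matrix as mx

df = pd.DataFrame({
    "x": [1.0, 2.0, 3.0],
    "c": pd.Categorical(["a", "b", "a"]),
})

# Categorical columns appended at the end, similar to pd.get_dummies
X_end = mx.from_pandas(df, cat_position="end")

# Categorical columns expanded in place, preserving the original column order
X_expand = mx.from_pandas(df, cat_position="expand")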
