pandas to matrix function #16

MarcAntoineSchmidtQC · 2020-07-10T21:00:02Z

solves #13

Decided on the from_pandas name since it's more standard and would allow other constructors like from_numpy.

Still some work to do:
- Docstring
- Tests
- Make it more efficient (currently creating a sparse matrix to reuse csc_to_split)
- Find a better module name

ElizabethSantorellaQC · 2020-07-12T13:50:57Z

src/quantcore/matrix/pandas.py

+
+
+def from_pandas(
+    df: pd.DataFrame,


Can we support both sparse and dense DataFrames?

sparse meaning a data frame that consists of pd.SparseArray columns? That sounds useful 😉

If you have a pandas data frame that consists only of sparse arrays, you can use the sparse accessor to convert that data frame to a scipy sparse matrix using X.sparse.to_coo(). Note that the .sparse accessor can only do .to_coo(), so if you want another format, you then need to call tocsr() or tocsc() on the result.

Since this only works on data frames in which all columns are sparse, the SplitMatrix functionality comes in very handy here.
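A minimal sketch of the conversion described above, assuming pandas >= 1.0 with scipy installed. The column names and values are made up for illustration, and the fill value is set to 0 explicitly so that the unstored entries correspond to zeros in the scipy matrix.

```python
import pandas as pd

# Illustrative data: two mostly-zero numeric columns.
dense = pd.DataFrame({"a": [0.0, 1.0, 0.0, 0.0], "b": [0.0, 0.0, 2.0, 0.0]})

# Convert every column to a sparse dtype whose fill value is 0.
X = dense.astype(pd.SparseDtype("float64", fill_value=0.0))

# The accessor returns a scipy COO matrix; convert to CSC (or CSR) afterwards.
coo = X.sparse.to_coo()
csc = coo.tocsc()

print(csc.shape, csc.nnz)
```

If even one column of X is not sparse, the .sparse accessor is not available on the data frame, which is why mixed frames need to be split by column type first.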

Oh, huh, I wasn't aware of pd.SparseArray. I meant pd.SparseDataFrame, but as of 1.0.0 that no longer exists since the API for sparse stuff has totally changed. Do you think we should support Pandas < 1.0.0?

I discovered sparsity in pandas this morning. What I am currently doing is to assume someone is working with the latest pandas API. If they are not, then this will still work but there will be some performance penalty. We can tackle that in the future if it would be useful.

Currently (will push soon), there's native handling of pd.Categorical and pd.SparseArray, which are mapped to mx.CategoricalMatrix and mx.SparseMatrix, respectively. All other numerical columns are converted to either SparseMatrix or DenseMatrix depending on data density.
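To make the dispatch rule concrete, here is a rough sketch of how columns could be sorted into the three groups. It is not the code from this PR: classify_columns and its sparse_threshold parameter are hypothetical names, and the 10% density cutoff is an arbitrary assumption.

```python
import pandas as pd


def classify_columns(df, sparse_threshold=0.1):
    """Label each column as 'categorical', 'sparse' or 'dense'.

    Hypothetical helper: the threshold and the labels are assumptions.
    """
    kinds = {}
    for name, col in df.items():
        if isinstance(col.dtype, pd.CategoricalDtype):
            kinds[name] = "categorical"
        elif isinstance(col.dtype, pd.SparseDtype):
            kinds[name] = "sparse"
        else:
            # Share of nonzero entries decides between sparse and dense storage.
            density = (col != 0).mean()
            kinds[name] = "sparse" if density <= sparse_threshold else "dense"
    return kinds


df = pd.DataFrame(
    {
        "cat": pd.Categorical(["a", "b", "a", "c"] * 5),
        "sp": pd.arrays.SparseArray([0.0, 1.0] + [0.0] * 18, fill_value=0.0),
        "mostly_zero": [1.0] + [0.0] * 19,
        "full": [float(i + 1) for i in range(20)],
    }
)
kinds = classify_columns(df)
print(kinds)
```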

> Do you think we should support Pandas < 1.0.0?

I wouldn't bend over backwards to support Pandas < 1.0. There were also substantial changes in the categorical type with 1.0, so we may also face some issues there. Still, we probably shouldn't require Pandas >= 1.0 in general. I think it's okay to say you can only use from_pandas() if you have Pandas >= 1.0.
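One way such a restriction could be expressed is a small version gate that from_pandas() checks before doing any work. This is only a sketch: supports_from_pandas is a hypothetical helper, and it looks at the major version only.

```python
def supports_from_pandas(version_string):
    """Return True if the given pandas version string is 1.0 or newer.

    Hypothetical helper: only the major version component is inspected.
    """
    major = int(version_string.split(".")[0])
    return major >= 1


# In real use the argument would be pd.__version__.
print(supports_from_pandas("1.0.5"), supports_from_pandas("0.25.3"))
```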

tbenthompson · 2020-07-16T15:18:03Z

src/quantcore/matrix/constructor.py

+        dense_comp = DenseMatrix(df.iloc[:, dense_idx].astype(dtype))
+        matrices.append(dense_comp)
+    if len(sparse_idx) > 0:
+        sparse_comp = SparseMatrix(df.iloc[:, sparse_idx].sparse.to_coo(), dtype=dtype)


At the moment, I don't think this preserves column ordering. Is that intended? How will a user know the mapping from input columns to the SplitMatrix columns?

e.g. If you build a SplitMatrix using this function, then estimate a GLM using that matrix, how will you know which features the coefficients correspond to?

I agree. See #21. The problem with the current implementation is that we are changing the shape of the matrix with the categoricals. If we store column names, we can use that to preserve the link between pandas and quantcore.matrix. Otherwise, do you have a suggestion?

It would be nice if the columns came out in the same order as if you used pd.get_dummies or sklearn's OneHotEncoder, though. Could you try plugging this into the quantcore.glm_benchmarks data setup and see if it breaks things?

If not, adding column name metadata seems like an acceptable workaround that we can deal with in the future.

The current solution that we've been using elsewhere is to just maintain column ordering exactly as it appears in the input object by passing the indices to the SplitMatrix constructor: https://github.com/Quantco/quantcore.matrix/blob/4e72d638e324b66abafa2fa6e9347d3d08723f97/src/quantcore/matrix/split_matrix.py#L34.

In the case of categoricals, this should just expand the categorical to take more columns without changing the ordering. e.g. if you have a dense column, then a categorical with 100 categories, then another dense column, col idx 0 is dense, col idx 1-100 are the categorical, and col idx 101 is dense.
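The index arithmetic in that example can be sketched in a few lines of plain Python. The helper name expanded_column_ranges and the (name, n_categories) input format are assumptions made for illustration; they are not part of quantcore.matrix.

```python
def expanded_column_ranges(columns):
    """Map each input column to the output indices it occupies.

    columns is a list of (name, n_categories) pairs, where n_categories
    is None for a numeric column (hypothetical input format).
    """
    ranges = {}
    start = 0
    for name, n_categories in columns:
        # A numeric column takes one slot; a categorical takes one per category.
        width = 1 if n_categories is None else n_categories
        ranges[name] = range(start, start + width)
        start += width
    return ranges


ranges = expanded_column_ranges([("x1", None), ("c", 100), ("x2", None)])
print(ranges)
```

With this layout, ranges["c"] covers indices 1 through 100 and ranges["x2"] is the single index 101, matching the example above.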

However, this maintain-the-ordering solution is suboptimal for performance because the dense/sparse columns aren't contiguous. It requires a bit more bookkeeping internal to SplitMatrix than an alternative solution like the one you propose in #21. On the other hand, maintaining the ordering is simpler for the user in other ways.

Okay, let me keep the ordering and expand the categoricals in place.

I finally decided to implement both orderings and let the user decide. If cat_position == 'end', categoricals are placed at the end, similar to pd.get_dummies, while if cat_position == 'expand', categoricals stay at the same position but are expanded.
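The difference between the two settings is easiest to see on the resulting column labels. The sketch below is plain Python and does not call from_pandas: output_column_labels, its input format, and the name_level label style (borrowed from pd.get_dummies) are all assumptions for illustration.

```python
def output_column_labels(columns, cat_position="expand"):
    """Return output column labels for the two categorical placements.

    columns is a list of (name, categories) pairs, where categories is
    None for a numeric column and a list of levels for a categorical
    (hypothetical input format).
    """
    if cat_position not in ("expand", "end"):
        raise ValueError("cat_position must be 'expand' or 'end'")
    labels = []
    deferred = []
    for name, categories in columns:
        if categories is None:
            labels.append(name)
            continue
        expanded = [f"{name}_{level}" for level in categories]
        if cat_position == "expand":
            labels.extend(expanded)  # keep the categorical where it was
        else:
            deferred.extend(expanded)  # move all categoricals to the end
    return labels + deferred


cols = [("x1", None), ("color", ["blue", "red"]), ("x2", None)]
in_place = output_column_labels(cols, "expand")
at_end = output_column_labels(cols, "end")
print(in_place)
print(at_end)
```

A label list like this is also one way to recover which feature each GLM coefficient belongs to, which was the concern raised earlier in this thread.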

working prototype

0abf742

ElizabethSantorellaQC reviewed Jul 12, 2020

src/quantcore/matrix/pandas.py (outdated, resolved)

ElizabethSantorellaQC reviewed Jul 12, 2020

src/quantcore/matrix/pandas.py (outdated, resolved)

MarcAntoineSchmidtQC marked this pull request as draft July 13, 2020 13:30

MarcAntoineSchmidtQC changed the title from "pandas to matrix function" to "[WIP] pandas to matrix function" Jul 13, 2020

MarcAntoineSchmidtQC added 2 commits July 13, 2020 14:24

more efficient implementation + docstring

1d8bee7

added simple test

875f466

MarcAntoineSchmidtQC marked this pull request as ready for review July 16, 2020 14:28

MarcAntoineSchmidtQC changed the title from "[WIP] pandas to matrix function" to "pandas to matrix function" Jul 16, 2020

tbenthompson approved these changes Jul 16, 2020

tbenthompson reviewed Jul 16, 2020

ElizabethSantorellaQC reviewed Jul 16, 2020

src/quantcore/matrix/constructor.py (resolved)

MarcAntoineSchmidtQC added 5 commits July 16, 2020 16:32

Merge branch 'master' into pd_np_to_matrix

f36d2c0

keep ordering

bcc840a

fix test

ad8e9e3

let user choose categorical location

b658a68

typo

29586bf

ElizabethSantorellaQC approved these changes Jul 20, 2020

MarcAntoineSchmidtQC merged commit 40ffce1 into master Jul 20, 2020

MarcAntoineSchmidtQC mentioned this pull request Jul 29, 2020

Add a "pandas to Matrix" function #13

Closed

ElizabethSantorellaQC deleted the pd_np_to_matrix branch August 3, 2020 15:46
