Matrices from formulas #267

Merged
merged 74 commits into from
Aug 15, 2023
Changes from 9 commits
Commits
bd2a2d3
Add an experimental tabmat materializer class
stanmart Jun 15, 2023
100bdb8
Nicer way of handling interactions
stanmart Jun 19, 2023
85da52e
Have proper column names [skip ci]
stanmart Jun 19, 2023
ce7dfaa
Make dummy ordering consistent with pandas [skip ci]
stanmart Jun 19, 2023
d23cca5
Fix mistake in categorical interactions [skip ci]
stanmart Jun 19, 2023
55b01bf
Add formulaic to environment files
stanmart Jun 19, 2023
5b7da3c
Add from_formula constructor
stanmart Jun 19, 2023
51ecfc2
Add some tests
stanmart Jun 19, 2023
ffa4955
Add more tests
stanmart Jun 19, 2023
a54a1a3
Major refactoring
stanmart Jun 21, 2023
6157acd
Make name formatting custommizable
stanmart Jun 22, 2023
c9959cc
Add formulaic to conda recipe
stanmart Jun 22, 2023
64af944
Implement `C()` function to convert to categoricals
stanmart Jun 22, 2023
8abca00
Auto-convert strings to categories
stanmart Jun 22, 2023
124d47c
Fix C() not working from materializer interface
stanmart Jun 22, 2023
bb1faf6
Add the pandasmaterializer tests from formulaic
stanmart Jun 22, 2023
5716573
Add formulaic to setup.py deps
stanmart Jun 22, 2023
21fcdff
Implement suggestions from code review
stanmart Jun 22, 2023
eaf968e
Clean up code
stanmart Jun 22, 2023
7cb70f6
Pin formulaic minimum version
stanmart Jun 22, 2023
fb629c6
Add support for architectures not supported by xsimd (#262)
xhochy Jun 15, 2023
9fb2993
Release 3.1.9 (#263)
xhochy Jun 16, 2023
1ad6a93
Pre-commit autoupdate (#264)
quant-ranger[bot] Jun 19, 2023
5b2c885
Add params for density and cardinality thresholds
stanmart Jun 23, 2023
63cbc7b
Skip python 3.6 build
stanmart Jun 23, 2023
2daba93
Refactor to avoid circular imports
stanmart Jun 23, 2023
8d049c9
Merge branch 'main' into formula
stanmart Jun 23, 2023
ef84e7d
Interaction of dropped and NA is dropped
stanmart Jun 26, 2023
927b2be
Add type hint for context
stanmart Jun 27, 2023
fbd9ad9
Add unit tests for interactable vectors
stanmart Jun 27, 2023
20b617b
Add more checks
stanmart Jun 27, 2023
010ad8e
Change argument name
stanmart Jun 27, 2023
fcc1a91
Make C() stateful (remember levels)
stanmart Jun 27, 2023
d9d1353
Add test for categorizer state
stanmart Jun 27, 2023
695de6f
More correct handling of encoding categoricals
stanmart Jun 28, 2023
064daac
Make adding an intercept implicitly parametrizable
stanmart Jun 28, 2023
55ae36f
Add na_action parameter to constrictor
stanmart Jun 30, 2023
5f253a3
Add test for sparse numerical columns
stanmart Jun 30, 2023
9caa6df
Add option to not add the constant column
stanmart Aug 2, 2023
f180075
Pre-commit autoupdate (#274)
quant-ranger[bot] Jun 26, 2023
5eaba13
Pre-commit autoupdate (#276)
quant-ranger[bot] Jul 3, 2023
6fb5a96
Bump pypa/gh-action-pypi-publish from 1.8.6 to 1.8.7 (#277)
dependabot[bot] Jul 3, 2023
a473577
Bump pypa/gh-action-pypi-publish from 1.8.7 to 1.8.8 (#279)
dependabot[bot] Jul 17, 2023
1084b53
Bump pypa/cibuildwheel from 2.13.1 to 2.14.1 (#280)
dependabot[bot] Jul 17, 2023
ce96be8
Minimal implementation (tests green)
stanmart Jul 18, 2023
6fdde75
Remove sum method and rely on np.sum
stanmart Jul 19, 2023
a8cbf96
Force DenseMatrix to always be 2-dimensional
stanmart Jul 19, 2023
b80cdc1
Add __repr__ and __str__ methods
stanmart Jul 19, 2023
8983f4d
Fix as_mx
stanmart Jul 20, 2023
16e0217
Fix ufunc return value
stanmart Jul 20, 2023
272ba65
Wrap SparseMatrix, too
stanmart Jul 20, 2023
7775b79
Demo of how the ufunc interface can be implemented
stanmart Jul 20, 2023
a493e03
Do not subclass csc_matrix
stanmart Jul 20, 2023
008dfa3
Improve the performance of `from_pandas` in the case of low-cardinali…
stanmart Jul 18, 2023
407a12f
Add benchmark data to .gitignore (#282)
stanmart Jul 19, 2023
ac9c121
Demonstrate binary ufuncs for sparse
stanmart Jul 21, 2023
95bc477
Add tocsc method
stanmart Jul 21, 2023
a6173f5
Fix type checks
stanmart Jul 21, 2023
35e7330
Minor improvements
stanmart Jul 21, 2023
aa264df
ufunc support for categoricals
stanmart Jul 21, 2023
51d31e5
Remove __array_ufunc__ interface
stanmart Jul 25, 2023
86e3178
Remove numpy operator mixin
stanmart Jul 25, 2023
e4bb2ea
Add hstack function
stanmart Jul 26, 2023
17c36ca
Add method for unpacking underlying array
stanmart Jul 26, 2023
9dd638d
Add __matmul__ methods to SparseMatrix
stanmart Jul 26, 2023
ba2b70e
Stricter and more consistent indexing
stanmart Jul 27, 2023
9b04f8c
Be consistent when instantiating from 1d arrays
stanmart Aug 9, 2023
5c064c2
Adjust tests to work with v4
stanmart Aug 9, 2023
f1ba304
Fix type hints
stanmart Aug 9, 2023
01e20b3
Merge branch 'tabmat-v4' into formula
stanmart Aug 15, 2023
603293b
Add changelog entry
stanmart Aug 15, 2023
3686371
term and column names for formula-based matrices
stanmart Aug 15, 2023
7aa36a4
Fix handling of formula-based names
stanmart Aug 15, 2023
c9cfc0f
Add tests for formula-based names
stanmart Aug 15, 2023
1 change: 1 addition & 0 deletions environment-win.yml
@@ -5,6 +5,7 @@ channels:
dependencies:
- libblas>=0=*mkl
- pandas
- formulaic

# development tools
- black
1 change: 1 addition & 0 deletions environment.yml
@@ -4,6 +4,7 @@ channels:
- nodefaults
dependencies:
- pandas
- formulaic

# development tools
- black
3 changes: 2 additions & 1 deletion src/tabmat/__init__.py
@@ -1,5 +1,5 @@
 from .categorical_matrix import CategoricalMatrix
-from .constructor import from_csc, from_pandas
+from .constructor import from_csc, from_formula, from_pandas
 from .dense_matrix import DenseMatrix
 from .matrix_base import MatrixBase
 from .sparse_matrix import SparseMatrix
@@ -14,5 +14,6 @@
     "SplitMatrix",
     "CategoricalMatrix",
     "from_csc",
+    "from_formula",
     "from_pandas",
 ]
36 changes: 36 additions & 0 deletions src/tabmat/constructor.py
@@ -1,13 +1,17 @@
import sys
import warnings
from typing import List, Tuple, Union

import numpy as np
import pandas as pd
from formulaic import Formula, ModelSpec
from formulaic.utils.layered_mapping import LayeredMapping
from pandas.api.types import is_numeric_dtype
from scipy import sparse as sps

from .categorical_matrix import CategoricalMatrix
from .dense_matrix import DenseMatrix
from .formula import TabmatMaterializer
from .matrix_base import MatrixBase
from .sparse_matrix import SparseMatrix
from .split_matrix import SplitMatrix
@@ -198,3 +202,35 @@ def from_csc(mat: sps.csc_matrix, threshold=0.1):
    """
    dense, sparse, dense_idx, sparse_idx = _split_sparse_and_dense_parts(mat, threshold)
    return SplitMatrix([dense, sparse], [dense_idx, sparse_idx])


def from_formula(
    formula: Union[str, Formula],
    df: pd.DataFrame,
    ensure_full_rank: bool = False,
    context=0,
):
    """
    Transform a pandas DataFrame to a SplitMatrix using a Wilkinson formula.

    Parameters
    ----------
    formula: str
        A formula accepted by formulaic.
    df: pd.DataFrame
        pandas DataFrame to be converted.
    ensure_full_rank: bool, default False
        If True, ensure that the matrix has full structural rank by categories.
    """
    if isinstance(context, int):
        if hasattr(sys, "_getframe"):
            frame = sys._getframe(context + 1)
            context = LayeredMapping(frame.f_locals, frame.f_globals)
        else:
            context = None  # pragma: no cover
    spec = ModelSpec(
        formula=Formula(formula),
        ensure_full_rank=ensure_full_rank,
    )
    materializer = TabmatMaterializer(df, context=context)
    return materializer.get_model_matrix(spec)
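
The integer `context` argument of `from_formula` is turned into a variable lookup by walking up the call stack. Below is a minimal standalone sketch of that mechanism, not the code from this PR. The names `capture_context` and `caller` are hypothetical, and `collections.ChainMap` is used as a stand-in for formulaic's `LayeredMapping` so the example runs with the standard library alone.

```python
import sys
from collections import ChainMap


def capture_context(context=0):
    """Resolve an integer frame depth into a name-lookup mapping.

    Hypothetical helper: mirrors the frame lookup in `from_formula`,
    with ChainMap standing in for formulaic's LayeredMapping.
    """
    if isinstance(context, int):
        if hasattr(sys, "_getframe"):
            # +1 skips this helper's own frame, so context=0 means
            # "the function that called capture_context".
            frame = sys._getframe(context + 1)
            # Locals shadow globals, as in normal Python name resolution.
            return ChainMap(frame.f_locals, frame.f_globals)
        return None  # interpreter without frame introspection
    # Anything that is not an int is assumed to be a ready-made mapping.
    return context


def caller():
    scale = 10  # a local that a formula could refer to by name
    ctx = capture_context()
    return ctx["scale"]


result = caller()
```

Passing a mapping instead of an integer bypasses the frame lookup entirely, which can be useful when the variables a formula needs do not live in the calling scope.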
313 changes: 313 additions & 0 deletions src/tabmat/formula.py
@@ -0,0 +1,313 @@
import copy
import itertools
from collections import OrderedDict

import numpy
import pandas
from formulaic import ModelMatrix, ModelSpec
from formulaic.materializers import FormulaMaterializer
from formulaic.materializers.base import EncodedTermStructure
from formulaic.materializers.types import NAAction
from interface_meta import override

from .categorical_matrix import CategoricalMatrix
from .dense_matrix import DenseMatrix
from .sparse_matrix import SparseMatrix
from .split_matrix import SplitMatrix


class TabmatMaterializer(FormulaMaterializer):
    """Materializer for pandas input and tabmat output."""

    REGISTER_NAME = "tabmat"
    REGISTER_INPUTS = ("pandas.core.frame.DataFrame",)
    REGISTER_OUTPUTS = "tabmat"

    @override
    def _is_categorical(self, values):
        if isinstance(values, (pandas.Series, pandas.Categorical)):
            return values.dtype == object or isinstance(
                values.dtype, pandas.CategoricalDtype
            )
        return super()._is_categorical(values)

    @override
    def _check_for_nulls(self, name, values, na_action, drop_rows):
        if na_action is NAAction.IGNORE:
            return

        if isinstance(
            values, dict
        ):  # pragma: no cover; no formulaic transforms return dictionaries any more
            for key, vs in values.items():
                self._check_for_nulls(f"{name}[{key}]", vs, na_action, drop_rows)

        elif na_action is NAAction.RAISE:
            if isinstance(values, pandas.Series) and values.isnull().values.any():
                raise ValueError(f"`{name}` contains null values after evaluation.")

        elif na_action is NAAction.DROP:
            if isinstance(values, pandas.Series):
                drop_rows.update(numpy.flatnonzero(values.isnull().values))

        else:
            raise ValueError(
                f"Do not know how to interpret `na_action` = {repr(na_action)}."
            )  # pragma: no cover; this is currently impossible to reach

    @override
    def _encode_constant(self, value, metadata, encoder_state, spec, drop_rows):
        series = value * numpy.ones(self.nrows - len(drop_rows))
        return InteractableDenseMatrix(series)

    @override
    def _encode_numerical(self, values, metadata, encoder_state, spec, drop_rows):
        if drop_rows:
            values = values.drop(index=values.index[drop_rows])
        if isinstance(values, pandas.Series):
            values = values.to_numpy()
        return InteractableDenseMatrix(values)

    @override
    def _encode_categorical(
        self, values, metadata, encoder_state, spec, drop_rows, reduced_rank=False
    ):
        # We do not do any encoding here as it is handled by tabmat
        if drop_rows:
            values = values.drop(index=values.index[drop_rows])
        return InteractableCategoricalMatrix(values._values, drop_first=reduced_rank)

    @override
    def _combine_columns(self, cols, spec, drop_rows):
        # Special case no columns
        if not cols:
            values = numpy.empty((self.data.shape[0], 0))
            return SplitMatrix([DenseMatrix(values)])

        # Otherwise, concatenate columns into SplitMatrix
        return SplitMatrix([col[1].to_non_interactable() for col in cols])

    # Have to override this because of column names
    # (and possibly intercept later on)
    @override
    def _build_model_matrix(self, spec: ModelSpec, drop_rows):
        # Step 0: Apply any requested column/term clustering
        # This must happen before Step 1 otherwise the greedy rank reduction
        # below would result in a different outcome than if the columns had
        # always been in the generated order.
        terms = self._cluster_terms(spec.formula, cluster_by=spec.cluster_by)

        # Step 1: Determine strategy to maintain structural full-rankness of output matrix
        scoped_terms_for_terms = self._get_scoped_terms(
            terms,
            ensure_full_rank=spec.ensure_full_rank,
        )

        # Step 2: Generate the columns which will be collated into the full matrix
        cols = []
        for term, scoped_terms in scoped_terms_for_terms:
            scoped_cols = OrderedDict()
            for scoped_term in scoped_terms:
                if not scoped_term.factors:
                    scoped_cols[
                        "Intercept"
                    ] = scoped_term.scale * self._encode_constant(
                        1, None, {}, spec, drop_rows
                    )
                else:
                    scoped_cols.update(
                        self._get_columns_for_term(
                            [
                                self._encode_evaled_factor(
                                    scoped_factor.factor,
                                    spec,
                                    drop_rows,
                                    reduced_rank=scoped_factor.reduced,
                                )
                                for scoped_factor in scoped_term.factors
                            ],
                            spec=spec,
                            scale=scoped_term.scale,
                        )
                    )
            cols.append((term, scoped_terms, scoped_cols))

        # Step 3: Populate remaining model spec fields
        if spec.structure:
            cols = self._enforce_structure(cols, spec, drop_rows)
        else:
            # for term, scoped_terms, columns in spec.structure:
            # expanded_columns = list(itertools.chain(colname_dict[col] for col in columns))
            # expanded_structure.append(
            # EncodedTermStructure(term, scoped_terms, expanded_columns)
            # )

            spec = spec.update(
                structure=[
                    EncodedTermStructure(
                        term,
                        [st.copy(without_values=True) for st in scoped_terms],
                        # This is the only line that is different from the original:
                        list(
                            itertools.chain(
                                *(
                                    mat.get_names(col)
                                    for col, mat in scoped_cols.items()
                                )
                            )
                        ),
                    )
                    for term, scoped_terms, scoped_cols in cols
                ],
            )

        # Step 4: Collate factors into one ModelMatrix
        return ModelMatrix(
            self._combine_columns(
                [
                    (name, values)
                    for term, scoped_terms, scoped_cols in cols
                    for name, values in scoped_cols.items()
                ],
                spec=spec,
                drop_rows=drop_rows,
            ),
            spec=spec,
        )


class InteractableDenseMatrix(DenseMatrix):
    def __mul__(self, other):
        if isinstance(other, (InteractableDenseMatrix, int, float)):
            return self.multiply(other)
        elif isinstance(
            other, (InteractableSparseMatrix, InteractableCategoricalMatrix)
        ):
            return other.__mul__(self)
        else:
            raise TypeError(f"Cannot multiply {type(self)} and {type(other)}")
        # Multiplication with sparse and categorical is handled by the other classes

    def __rmul__(self, other):
        return self.__mul__(other)

    def to_non_interactable(self):
        return DenseMatrix(self)

    def get_names(self, col):
        return [col]


class InteractableSparseMatrix(SparseMatrix):
    def __mul__(self, other):
        if isinstance(other, (InteractableDenseMatrix, InteractableSparseMatrix)):
            return self.multiply(other)
        elif isinstance(other, InteractableCategoricalMatrix):
            return other.__mul__(self)
        elif isinstance(other, (int, float)):
            return self.multiply(numpy.array(other))
        else:
            raise TypeError(f"Cannot multiply {type(self)} and {type(other)}")
        # Multiplication with categorical is handled by the categorical

    def __rmul__(self, other):
        return self.__mul__(other)

    def to_non_interactable(self):
        return SparseMatrix(self)

    def get_names(self, col):
        return [col]


class InteractableCategoricalMatrix(CategoricalMatrix):
    def __init__(self, *args, **kwargs):
        multipliers = kwargs.pop("multipliers", None)
        super().__init__(*args, **kwargs)
        if multipliers is None:
            self.multipliers = numpy.ones_like(self.cat, dtype=numpy.float_)
        else:
            self.multipliers = multipliers

    def __mul__(self, other):
        if isinstance(other, (InteractableDenseMatrix, float, int)):
            result = copy.copy(self)
            result.multipliers = result.multipliers * numpy.array(other)
            return result
        elif isinstance(other, InteractableSparseMatrix):
            result = copy.copy(self)
            result.multipliers = result.multipliers * other.todense()
            return result
        elif isinstance(other, InteractableCategoricalMatrix):
            return self._interact_categorical(other)
        else:
            raise TypeError(
                f"Can't multiply InteractableCategoricalMatrix with {type(other)}"
            )

    def __rmul__(self, other):
        if isinstance(other, InteractableCategoricalMatrix):
            # order matters; the result must be returned, otherwise this yields None
            return other._interact_categorical(self)
        else:
            return self.__mul__(other)

    def to_non_interactable(self):
        if numpy.all(self.multipliers == 1):
            return CategoricalMatrix(
                self.cat,
                drop_first=self.drop_first,
                dtype=self.dtype,
            )
        else:
            return SparseMatrix(
                self.tocsr().multiply(self.multipliers[:, numpy.newaxis])
            )

    def _interact_categorical(self, other):
        cardinality_self = len(self.cat.categories)

        new_codes = other.cat.codes * cardinality_self + self.cat.codes

        if self.drop_first:
            new_codes[new_codes % cardinality_self == 0] = 0
            new_codes -= new_codes // cardinality_self
            self_slice = slice(1, None)
        else:
            self_slice = slice(None)

        if other.drop_first:
            new_codes -= (cardinality_self - 1)
            new_codes[new_codes < 0] = 0
            other_slice = slice(1, None)
        else:
            other_slice = slice(None)

        new_categories = [
            f"{self_cat}:{other_cat}"
            for other_cat, self_cat in itertools.product(
                other.cat.categories[other_slice], self.cat.categories[self_slice]
            )
        ]

        new_drop_first = self.drop_first or other.drop_first
        if new_drop_first:
            new_categories = ["__drop__"] + new_categories

        cat = pandas.Categorical.from_codes(
            categories=new_categories,
            codes=new_codes,
            ordered=self.cat.ordered and other.cat.ordered,
        )

        return InteractableCategoricalMatrix(
            cat,
            multipliers=self.multipliers * other.multipliers,
            drop_first=new_drop_first,
        )

    def get_names(self, col):
        if self.drop_first:
            categories = self.cat.categories[1:]
        else:
            categories = self.cat.categories
        return [f"{col}[{cat}]" for cat in categories]
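
The code arithmetic in `_interact_categorical` is easier to follow on a small example. The sketch below is an illustration in plain Python, not the PR's implementation. The function names are hypothetical, lists replace the NumPy code arrays, and the closed form used in the `drop_first` variant is derived from the in-place steps of the method and should be read as an assumption. It covers the case where neither factor drops its first level and the case where only `self` does.

```python
import itertools


def interact_codes(self_codes, other_codes, n_self):
    """Codes of the interaction when no level is dropped."""
    # `other` is the slow-moving index, `self` the fast-moving one.
    return [o * n_self + s for s, o in zip(self_codes, other_codes)]


def interact_codes_drop_self_first(self_codes, other_codes, n_self):
    """Codes when the first level of `self` is dropped.

    Rows with self code 0 collapse into code 0 (the "__drop__" level);
    all other rows are renumbered to stay contiguous.
    """
    return [0 if s == 0 else o * (n_self - 1) + s
            for s, o in zip(self_codes, other_codes)]


def labels(self_cats, other_cats, drop_self_first=False):
    """Level names in the same order as the codes above."""
    kept = self_cats[1:] if drop_self_first else self_cats
    names = [f"{s}:{o}" for o, s in itertools.product(other_cats, kept)]
    return (["__drop__"] + names) if drop_self_first else names


self_cats = ["a", "b", "c"]
other_cats = ["x", "y"]
self_codes = [0, 2, 1, 2]   # a, c, b, c
other_codes = [1, 0, 1, 1]  # y, x, y, y

full_codes = interact_codes(self_codes, other_codes, len(self_cats))
full_names = [labels(self_cats, other_cats)[c] for c in full_codes]

reduced_codes = interact_codes_drop_self_first(
    self_codes, other_codes, len(self_cats)
)
reduced_names = [
    labels(self_cats, other_cats, drop_self_first=True)[c]
    for c in reduced_codes
]
```

In the reduced variant the first row, whose `self` level is the dropped `a`, lands in the `__drop__` level, which is itself removed as the first category of the resulting matrix.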