Matrices from formulas #267

Merged
merged 74 commits into from
Aug 15, 2023
74 commits
bd2a2d3
Add an experimental tabmat materializer class
stanmart Jun 15, 2023
100bdb8
Nicer way of handling interactions
stanmart Jun 19, 2023
85da52e
Have proper column names [skip ci]
stanmart Jun 19, 2023
ce7dfaa
Make dummy ordering consistent with pandas [skip ci]
stanmart Jun 19, 2023
d23cca5
Fix mistake in categorical interactions [skip ci]
stanmart Jun 19, 2023
55b01bf
Add formulaic to environment files
stanmart Jun 19, 2023
5b7da3c
Add from_formula constructor
stanmart Jun 19, 2023
51ecfc2
Add some tests
stanmart Jun 19, 2023
ffa4955
Add more tests
stanmart Jun 19, 2023
a54a1a3
Major refactoring
stanmart Jun 21, 2023
6157acd
Make name formatting customizable
stanmart Jun 22, 2023
c9959cc
Add formulaic to conda recipe
stanmart Jun 22, 2023
64af944
Implement `C()` function to convert to categoricals
stanmart Jun 22, 2023
8abca00
Auto-convert strings to categories
stanmart Jun 22, 2023
124d47c
Fix C() not working from materializer interface
stanmart Jun 22, 2023
bb1faf6
Add the pandasmaterializer tests from formulaic
stanmart Jun 22, 2023
5716573
Add formulaic to setup.py deps
stanmart Jun 22, 2023
21fcdff
Implement suggestions from code review
stanmart Jun 22, 2023
eaf968e
Clean up code
stanmart Jun 22, 2023
7cb70f6
Pin formulaic minimum version
stanmart Jun 22, 2023
fb629c6
Add support for architectures not supported by xsimd (#262)
xhochy Jun 15, 2023
9fb2993
Release 3.1.9 (#263)
xhochy Jun 16, 2023
1ad6a93
Pre-commit autoupdate (#264)
quant-ranger[bot] Jun 19, 2023
5b2c885
Add params for density and cardinality thresholds
stanmart Jun 23, 2023
63cbc7b
Skip python 3.6 build
stanmart Jun 23, 2023
2daba93
Refactor to avoid circular imports
stanmart Jun 23, 2023
8d049c9
Merge branch 'main' into formula
stanmart Jun 23, 2023
ef84e7d
Interaction of dropped and NA is dropped
stanmart Jun 26, 2023
927b2be
Add type hint for context
stanmart Jun 27, 2023
fbd9ad9
Add unit tests for interactable vectors
stanmart Jun 27, 2023
20b617b
Add more checks
stanmart Jun 27, 2023
010ad8e
Change argument name
stanmart Jun 27, 2023
fcc1a91
Make C() stateful (remember levels)
stanmart Jun 27, 2023
d9d1353
Add test for categorizer state
stanmart Jun 27, 2023
695de6f
More correct handling of encoding categoricals
stanmart Jun 28, 2023
064daac
Make adding an intercept implicitly parametrizable
stanmart Jun 28, 2023
55ae36f
Add na_action parameter to constructor
stanmart Jun 30, 2023
5f253a3
Add test for sparse numerical columns
stanmart Jun 30, 2023
9caa6df
Add option to not add the constant column
stanmart Aug 2, 2023
f180075
Pre-commit autoupdate (#274)
quant-ranger[bot] Jun 26, 2023
5eaba13
Pre-commit autoupdate (#276)
quant-ranger[bot] Jul 3, 2023
6fb5a96
Bump pypa/gh-action-pypi-publish from 1.8.6 to 1.8.7 (#277)
dependabot[bot] Jul 3, 2023
a473577
Bump pypa/gh-action-pypi-publish from 1.8.7 to 1.8.8 (#279)
dependabot[bot] Jul 17, 2023
1084b53
Bump pypa/cibuildwheel from 2.13.1 to 2.14.1 (#280)
dependabot[bot] Jul 17, 2023
ce96be8
Minimal implementation (tests green)
stanmart Jul 18, 2023
6fdde75
Remove sum method and rely on np.sum
stanmart Jul 19, 2023
a8cbf96
Force DenseMatrix to always be 2-dimensional
stanmart Jul 19, 2023
b80cdc1
Add __repr__ and __str__ methods
stanmart Jul 19, 2023
8983f4d
Fix as_mx
stanmart Jul 20, 2023
16e0217
Fix ufunc return value
stanmart Jul 20, 2023
272ba65
Wrap SparseMatrix, too
stanmart Jul 20, 2023
7775b79
Demo of how the ufunc interface can be implemented
stanmart Jul 20, 2023
a493e03
Do not subclass csc_matrix
stanmart Jul 20, 2023
008dfa3
Improve the performance of `from_pandas` in the case of low-cardinali…
stanmart Jul 18, 2023
407a12f
Add benchmark data to .gitignore (#282)
stanmart Jul 19, 2023
ac9c121
Demonstrate binary ufuncs for sparse
stanmart Jul 21, 2023
95bc477
Add tocsc method
stanmart Jul 21, 2023
a6173f5
Fix type checks
stanmart Jul 21, 2023
35e7330
Minor improvements
stanmart Jul 21, 2023
aa264df
ufunc support for categoricals
stanmart Jul 21, 2023
51d31e5
Remove __array_ufunc__ interface
stanmart Jul 25, 2023
86e3178
Remove numpy operator mixin
stanmart Jul 25, 2023
e4bb2ea
Add hstack function
stanmart Jul 26, 2023
17c36ca
Add method for unpacking underlying array
stanmart Jul 26, 2023
9dd638d
Add __matmul__ methods to SparseMatrix
stanmart Jul 26, 2023
ba2b70e
Stricter and more consistent indexing
stanmart Jul 27, 2023
9b04f8c
Be consistent when instantiating from 1d arrays
stanmart Aug 9, 2023
5c064c2
Adjust tests to work with v4
stanmart Aug 9, 2023
f1ba304
Fix type hints
stanmart Aug 9, 2023
01e20b3
Merge branch 'tabmat-v4' into formula
stanmart Aug 15, 2023
603293b
Add changelog entry
stanmart Aug 15, 2023
3686371
term and column names for formula-based matrices
stanmart Aug 15, 2023
7aa36a4
Fix handling of formula-based names
stanmart Aug 15, 2023
c9cfc0f
Add tests for formula-based names
stanmart Aug 15, 2023
1 change: 1 addition & 0 deletions CHANGELOG.rst
@@ -13,6 +13,7 @@ Unreleased
**New features:**

- Add column name and term name metadata to ``MatrixBase`` objects. These are automatically populated when initializing a ``MatrixBase`` from a ``pandas.DataFrame``. In addition, they can be accessed and modified via the ``column_names`` and ``term_names`` properties.
- Add a formula interface for creating tabmat matrices from pandas data frames. See :func:`tabmat.from_formula` for details.
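As an illustration of how column names and term names relate, here is a minimal sketch in plain Python. It does not use tabmat: the helper names `categorical_column_names` and `expand_term_names` are hypothetical, and the example data (including the intercept term being called `"1"`) is made up. It assumes the default `categorical_format` of `{name}[{category}]` from the `from_formula` signature in this PR, and it mirrors the loop at the end of `from_formula` that assigns one term name to every column.

```python
from typing import Dict, List


def categorical_column_names(
    name: str, categories: List[str], fmt: str = "{name}[{category}]"
) -> List[str]:
    """Build one column name per category (hypothetical helper)."""
    return [fmt.format(name=name, category=c) for c in categories]


def expand_term_names(term_indices: Dict[str, List[int]], n_columns: int) -> List[str]:
    """Turn a term -> column indices mapping into one term name per column."""
    names = [""] * n_columns
    for term, indices in term_indices.items():
        for i in indices:
            names[i] = term
    return names


# Illustrative data: an intercept, a two-level categorical, and a numeric column.
column_names = ["Intercept"] + categorical_column_names("color", ["green", "red"]) + ["x"]
term_indices = {"1": [0], "color": [1, 2], "x": [3]}
term_names = expand_term_names(term_indices, len(column_names))

print(column_names)
print(term_names)
```

The point of keeping both lists is that one term (here `color`) can span several columns, so the term name is repeated for each column it produced.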

**Other changes:**

1 change: 1 addition & 0 deletions conda.recipe/meta.yaml
@@ -38,6 +38,7 @@ requirements:
- {{ pin_compatible('numpy') }}
- pandas
- scipy
- formulaic>=0.4

test:
requires:
1 change: 1 addition & 0 deletions environment-win.yml
@@ -5,6 +5,7 @@ channels:
dependencies:
- libblas>=0=*mkl
- pandas
- formulaic>=0.4

# development tools
- black
1 change: 1 addition & 0 deletions environment.yml
@@ -4,6 +4,7 @@ channels:
- nodefaults
dependencies:
- pandas
- formulaic>=0.4

# development tools
- black
1 change: 1 addition & 0 deletions pyproject.toml
@@ -30,6 +30,7 @@ default_section = 'THIRDPARTY'

[tool.cibuildwheel]
skip = [
"cp36-*",
"*-win32",
"*-manylinux_i686",
"pp*",
2 changes: 1 addition & 1 deletion setup.py
@@ -157,7 +157,7 @@
],
package_dir={"": "src"},
packages=find_packages(where="src"),
-install_requires=["numpy", "pandas", "scipy"],
+install_requires=["numpy", "pandas", "scipy", "formulaic>=0.4"],
ext_modules=cythonize(
ext_modules,
annotate=False,
3 changes: 2 additions & 1 deletion src/tabmat/__init__.py
@@ -1,5 +1,5 @@
from .categorical_matrix import CategoricalMatrix
-from .constructor import from_csc, from_pandas
+from .constructor import from_csc, from_formula, from_pandas
from .dense_matrix import DenseMatrix
from .matrix_base import MatrixBase
from .sparse_matrix import SparseMatrix
@@ -14,6 +14,7 @@
"SplitMatrix",
"CategoricalMatrix",
"from_csc",
"from_formula",
"from_pandas",
"as_tabmat",
"hstack",
138 changes: 96 additions & 42 deletions src/tabmat/constructor.py
@@ -1,13 +1,20 @@
import sys
import warnings
-from typing import List, Optional, Sequence, Tuple, Union
+from typing import Any, List, Mapping, Optional, Union

import numpy as np
import pandas as pd
from formulaic import Formula, ModelSpec
from formulaic.materializers.types import NAAction
from formulaic.parser import DefaultFormulaParser
from formulaic.utils.layered_mapping import LayeredMapping
from pandas.api.types import is_numeric_dtype
from scipy import sparse as sps

from .categorical_matrix import CategoricalMatrix
from .constructor_util import _split_sparse_and_dense_parts
from .dense_matrix import DenseMatrix
from .formula import TabmatMaterializer
from .matrix_base import MatrixBase
from .sparse_matrix import SparseMatrix
from .split_matrix import SplitMatrix
@@ -179,47 +186,6 @@ def from_pandas(
    return matrices[0]


def _split_sparse_and_dense_parts(
    arg1: sps.csc_matrix,
    threshold: float = 0.1,
    column_names: Optional[Sequence[Optional[str]]] = None,
    term_names: Optional[Sequence[Optional[str]]] = None,
) -> Tuple[DenseMatrix, SparseMatrix, np.ndarray, np.ndarray]:
    """
    Split matrix.

    Return the dense and sparse parts of a matrix and the corresponding indices
    for each at the provided threshold.
    """
    if not isinstance(arg1, sps.csc_matrix):
        raise TypeError(
            f"X must be of type scipy.sparse.csc_matrix or matrix.SparseMatrix,"
            f"not {type(arg1)}"
        )
    if not 0 <= threshold <= 1:
        raise ValueError("Threshold must be between 0 and 1.")
    densities = np.diff(arg1.indptr) / arg1.shape[0]
    dense_indices = np.where(densities > threshold)[0]
    sparse_indices = np.setdiff1d(np.arange(densities.shape[0]), dense_indices)

    if column_names is None:
        column_names = [None] * arg1.shape[1]
    if term_names is None:
        term_names = column_names

    X_dense_F = DenseMatrix(
        np.asfortranarray(arg1[:, dense_indices].toarray()),
        column_names=[column_names[i] for i in dense_indices],
        term_names=[term_names[i] for i in dense_indices],
    )
    X_sparse = SparseMatrix(
        arg1[:, sparse_indices],
        column_names=[column_names[i] for i in sparse_indices],
        term_names=[term_names[i] for i in sparse_indices],
    )
    return X_dense_F, X_sparse, dense_indices, sparse_indices


def from_csc(mat: sps.csc_matrix, threshold=0.1, column_names=None, term_names=None):
    """
    Convert a CSC-format sparse matrix into a ``SplitMatrix``.
@@ -229,3 +195,91 @@ def from_csc(mat: sps.csc_matrix, threshold=0.1, column_names=None, term_names=N
    """
    dense, sparse, dense_idx, sparse_idx = _split_sparse_and_dense_parts(mat, threshold)
    return SplitMatrix([dense, sparse], [dense_idx, sparse_idx])


def from_formula(
    formula: Union[str, Formula],
    data: pd.DataFrame,
    ensure_full_rank: bool = False,
    na_action: Union[str, NAAction] = NAAction.IGNORE,
    dtype: np.dtype = np.float64,
    sparse_threshold: float = 0.1,
    cat_threshold: int = 4,
    interaction_separator: str = ":",
    categorical_format: str = "{name}[{category}]",
    intercept_name: str = "Intercept",
    include_intercept: bool = False,
    add_column_for_intercept: bool = True,
    context: Optional[Union[int, Mapping[str, Any]]] = 0,
) -> SplitMatrix:
    """
    Transform a pandas data frame to a SplitMatrix using a Wilkinson formula.

    Parameters
    ----------
    formula: Union[str, Formula]
        A formula accepted by formulaic.
    data: pd.DataFrame
        pandas data frame to be converted.
    ensure_full_rank: bool, default False
        If True, ensure that the matrix has full structural rank by dropping
        categories where necessary.
    na_action: Union[str, NAAction], default NAAction.IGNORE
        How to handle missing values. Can be one of "drop", "ignore", "raise".
    dtype: np.dtype, default np.float64
        The dtype of the resulting matrix.
    sparse_threshold: float, default 0.1
        The density below which a column is treated as sparse.
    cat_threshold: int, default 4
        The number of categories below which a categorical column is one-hot
        encoded. This is only checked after interactions have been applied.
    interaction_separator: str, default ":"
        The separator between the names of interacted variables.
    categorical_format: str, default "{name}[{category}]"
        The format string used to generate the names of categorical variables.
        Has to include the placeholders ``{name}`` and ``{category}``.
    intercept_name: str, default "Intercept"
        The name of the intercept column.
    include_intercept: bool, default False
        Whether to include an intercept term if the formula does not
        include (``+ 1``) or exclude (``+ 0`` or ``- 1``) it explicitly.
    add_column_for_intercept: bool, default True
        Whether to add a column of ones for the intercept, or just
        have a term without a corresponding column. For advanced use only.
    context: Optional[Union[int, Mapping[str, Any]]], default 0
        The context to use for evaluating the formula. If an integer, the
        context is taken from the stack frame of the caller at the given
        depth, where 0 refers to the direct caller of this function. If a
        mapping, it is used as the context directly. If None, no context
        is captured from the call stack.
    """
    if isinstance(context, int):
        if hasattr(sys, "_getframe"):
            frame = sys._getframe(context + 1)
            context = LayeredMapping(frame.f_locals, frame.f_globals)
        else:
            context = None
    spec = ModelSpec(
        formula=Formula(
            formula, _parser=DefaultFormulaParser(include_intercept=include_intercept)
        ),
        ensure_full_rank=ensure_full_rank,
        na_action=na_action,
    )
    materializer = TabmatMaterializer(
        data,
        context=context,
        interaction_separator=interaction_separator,
        categorical_format=categorical_format,
        intercept_name=intercept_name,
        dtype=dtype,
        sparse_threshold=sparse_threshold,
        cat_threshold=cat_threshold,
        add_column_for_intercept=add_column_for_intercept,
    )
    result = materializer.get_model_matrix(spec)

    term_names = np.zeros(len(result.term_names), dtype="object")
    for term, indices in result.model_spec.term_indices.items():
        term_names[indices] = str(term)
    result.term_names = term_names.tolist()

    return result
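The `context` handling at the top of `from_formula` can be sketched on its own with the standard library. This is only an approximation under stated assumptions: `capture_context`, `direct_caller`, `helper` and `outer` are hypothetical names, and `collections.ChainMap` stands in for formulaic's `LayeredMapping`, which the PR actually uses. The sketch shows why the code calls `sys._getframe(context + 1)`: depth 0 has to skip the function's own frame to reach its caller.

```python
import sys
from collections import ChainMap
from typing import Any, Mapping, Optional, Union


def capture_context(
    context: Optional[Union[int, Mapping[str, Any]]] = 0,
) -> Optional[Mapping[str, Any]]:
    """Resolve `context` the way the PR's code does (ChainMap is a stand-in)."""
    if isinstance(context, int):
        if hasattr(sys, "_getframe"):
            # +1 skips this function's own frame, so 0 means "my direct caller".
            frame = sys._getframe(context + 1)
            return ChainMap(frame.f_locals, frame.f_globals)
        return None  # interpreters without frame introspection
    return context  # a mapping (or None) is passed through unchanged


def direct_caller() -> int:
    scale = 10
    return capture_context(0)["scale"]  # locals of direct_caller are visible


def helper() -> Mapping[str, Any]:
    return capture_context(1)  # depth 1: look past helper, into its caller


def outer() -> str:
    secret = "outer"
    return helper()["secret"]


print(direct_caller(), outer())
```

A wrapper around `from_formula` would pass a larger depth for the same reason `helper` does here, so that names in a formula resolve in the frame of whoever called the wrapper.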
48 changes: 48 additions & 0 deletions src/tabmat/constructor_util.py
@@ -0,0 +1,48 @@
from typing import Optional, Sequence, Tuple

import numpy as np
import scipy.sparse as sps

from .dense_matrix import DenseMatrix
from .sparse_matrix import SparseMatrix


def _split_sparse_and_dense_parts(
    arg1: sps.csc_matrix,
    threshold: float = 0.1,
    column_names: Optional[Sequence[Optional[str]]] = None,
    term_names: Optional[Sequence[Optional[str]]] = None,
) -> Tuple[DenseMatrix, SparseMatrix, np.ndarray, np.ndarray]:
    """
    Split matrix.

    Return the dense and sparse parts of a matrix and the corresponding indices
    for each at the provided threshold.
    """
    if not isinstance(arg1, sps.csc_matrix):
        raise TypeError(
            f"X must be of type scipy.sparse.csc_matrix or matrix.SparseMatrix, "
            f"not {type(arg1)}"
        )
    if not 0 <= threshold <= 1:
        raise ValueError("Threshold must be between 0 and 1.")
    densities = np.diff(arg1.indptr) / arg1.shape[0]
    dense_indices = np.where(densities > threshold)[0]
    sparse_indices = np.setdiff1d(np.arange(densities.shape[0]), dense_indices)

    if column_names is None:
        column_names = [None] * arg1.shape[1]
    if term_names is None:
        term_names = column_names

    X_dense_F = DenseMatrix(
        np.asfortranarray(arg1[:, dense_indices].toarray()),
        column_names=[column_names[i] for i in dense_indices],
        term_names=[term_names[i] for i in dense_indices],
    )
    X_sparse = SparseMatrix(
        arg1[:, sparse_indices],
        column_names=[column_names[i] for i in sparse_indices],
        term_names=[term_names[i] for i in sparse_indices],
    )
    return X_dense_F, X_sparse, dense_indices, sparse_indices
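The density rule in `_split_sparse_and_dense_parts` can be followed without NumPy or SciPy. The sketch below is a pure-Python approximation: `split_by_density` is a hypothetical helper and the `indptr` values are made up. It relies only on the CSC convention that `indptr[j + 1] - indptr[j]` is the number of stored entries in column `j`, which is what `np.diff(arg1.indptr)` computes in the function above.

```python
from typing import List, Sequence, Tuple


def split_by_density(
    indptr: Sequence[int], n_rows: int, threshold: float = 0.1
) -> Tuple[List[int], List[int]]:
    """Return (dense column indices, sparse column indices)."""
    if not 0 <= threshold <= 1:
        raise ValueError("Threshold must be between 0 and 1.")
    dense: List[int] = []
    sparse: List[int] = []
    for col in range(len(indptr) - 1):
        # Share of rows that have a stored entry in this column.
        density = (indptr[col + 1] - indptr[col]) / n_rows
        # Strictly greater, as in the original: a column exactly at the
        # threshold is kept sparse.
        (dense if density > threshold else sparse).append(col)
    return dense, sparse


# Made-up matrix with 10 rows and 3 columns holding 5, 1 and 0 stored entries.
indptr = [0, 5, 6, 6]
dense_idx, sparse_idx = split_by_density(indptr, n_rows=10)
print(dense_idx, sparse_idx)
```

The second column has density exactly 0.1 and therefore lands on the sparse side, which is the boundary behaviour of the `densities > threshold` comparison in the diff.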