Matrices from formulas #267

Merged
merged 74 commits into from
Aug 15, 2023

Conversation

MatthiasSchmidtblaicherQC
Contributor

@MatthiasSchmidtblaicherQC MatthiasSchmidtblaicherQC commented Jun 20, 2023

This draft PR is for allowing early review of the formula branch.

Checklist

  • Added a CHANGELOG.rst entry

@MartinStancsicsQC MartinStancsicsQC added this to the Tabmat v4 milestone Aug 9, 2023
@MartinStancsicsQC
Contributor

Example notebook for the new feature: https://gist.github.com/MartinStancsicsQC/eaad758d27cb119336804a3d4170aad1

Member

@MarcAntoineSchmidtQC MarcAntoineSchmidtQC left a comment

Great work Martin!
Only question is a small detail: in the notebook it mentions that if you want to add an intercept you need to specify + 0. I can't load the grammar of formulaic but patsy and others use + 1 instead. Is that intentional? Can we use both?

@@ -0,0 +1,32 @@
from typing import Tuple

I'm good with having this file. formula.py is already 700+ lines of code. Let's keep it separate from constructor.py.

@MartinStancsicsQC
Contributor

Great work Martin! Only question is a small detail: in the notebook it mentions that if you want to add an intercept you need to specify + 0. I can't load the grammar of formulaic but patsy and others use + 1 instead. Is that intentional? Can we use both?

Sorry, that's a typo. We also use +1 to add the intercept. +0 or -1 can be used to remove it explicitly (in line with other libs).
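To make the convention concrete, here is a toy resolver for top-level intercept terms. It is only a sketch, not tabmat's or formulaic's parser: the function name `wants_intercept` is hypothetical, and the `include_intercept=False` default is an assumption modelled on the "Default is False" note in this PR's commit log.

```python
def wants_intercept(formula: str, include_intercept: bool = False) -> bool:
    """Toy illustration of the `+1` / `+0` / `-1` intercept convention.

    Hypothetical helper: it only scans top-level `+` and `-` terms and
    ignores everything a real formula grammar would handle.
    """
    # Keep only the right-hand side and drop whitespace.
    rhs = formula.split("~")[-1].replace(" ", "")
    # Give the first term an explicit sign so all terms look alike.
    if not rhs.startswith(("+", "-")):
        rhs = "+" + rhs

    result = include_intercept  # used when the formula says nothing
    sign, term = "+", ""
    for ch in rhs + "+":  # the trailing "+" flushes the last term
        if ch in "+-":
            if term == "1":
                result = sign == "+"  # "+1" adds, "-1" removes
            elif term == "0" and sign == "+":
                result = False  # "+0" removes
            sign, term = ch, ""
        else:
            term += ch
    return result


print(wants_intercept("y ~ x + 1"))                     # True
print(wants_intercept("x - 1", include_intercept=True))  # False
print(wants_intercept("x + 0", include_intercept=True))  # False
```

An explicit term in the formula always wins over the implicit default in this sketch, which is the behaviour the reply above describes.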

@MartinStancsicsQC MartinStancsicsQC marked this pull request as ready for review August 15, 2023 09:08
@MartinStancsicsQC MartinStancsicsQC merged commit a384ee6 into tabmat-v4 Aug 15, 2023
9 of 12 checks passed
@MartinStancsicsQC MartinStancsicsQC deleted the formula branch August 15, 2023 09:09
MatthiasSchmidtblaicherQC added a commit that referenced this pull request Apr 23, 2024
* Minimal implementation (tests green)

* Remove sum method and rely on np.sum

* Force DenseMatrix to always be 2-dimensional

* Add __repr__ and __str__ methods

* Fix as_mx

* Fix ufunc return value

* Wrap SparseMatrix, too

* Demo of how the ufunc interface can be implemented

* Do not subclass csc_matrix

* Demonstrate binary ufuncs for sparse

* Add tocsc method

* Fix type checks

* Minor improvements

* ufunc support for categoricals

* Remove __array_ufunc__ interface

* Remove numpy operator mixin

* Add hstack function

* Add method for unpacking underlying array

* Add __matmul__ methods to SparseMatrix

* Stricter and more consistent indexing

* Be consistent when instantiating from 1d arrays

* Add column name metadata to `tabmat` matrices (#278)

* Add column name getters

* Matrix names are also combined

* Add names to constructors

* Add indexing support for column names

* Remove unnecessary code

* Better default column names

* Reduce code duplication

* Saner defaults

* Add convenient getters and setters

* Fix indexing

* Smarter setter for categorical matrices

* Add tests

* Fix subsetting with np.newaxis

* Remove the walrus :(

* Fix test

* Fix indexing with np.ix_

* Propagate column names where it makes sense

* Fix merge mistake

* Add changelog entry

* Matrices from formulas (#267)

* Add an experimental tabmat materializer class

* Nicer way of handling interactions

* Have proper column names [skip ci]

* Make dummy ordering consistent with pandas [skip ci]

* Fix mistake in categorical interactions [skip ci]

* Add formulaic to environment files

Have not added to the conda recipe yet.
Should probably be optional.

* Add from_formula constructor

* Add some tests

* Add more tests

* Major refactoring

 - simplify categorical interactions
 - NaNs in categoricals should be handled correctly
 - parity with formulaic in categorical names

* Make name formatting customizable

 - interaction_separator
 - categorical_format
 - intercept_name
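
A minimal sketch of how these three options could shape column names. The helper names and the default values (`":"`, `"{name}[{category}]"`, `"Intercept"`) are illustrative assumptions, not confirmed tabmat signatures, and the ordering of interaction columns is an artifact of this sketch.

```python
from itertools import product


def interaction_names(factors, interaction_separator=":",
                      categorical_format="{name}[{category}]"):
    """Build names for one term.

    `factors` is a list of (name, levels) pairs; `levels` is None for a
    numeric column and a list of categories for a categorical one.
    """
    per_factor = []
    for name, levels in factors:
        if levels is None:
            per_factor.append([name])
        else:
            per_factor.append(
                [categorical_format.format(name=name, category=lvl)
                 for lvl in levels]
            )
    # One output column per combination of factor columns.
    return [interaction_separator.join(parts) for parts in product(*per_factor)]


def column_names(terms, include_intercept=True, intercept_name="Intercept",
                 **fmt):
    """Concatenate the names of all terms, optionally led by the intercept."""
    names = [intercept_name] if include_intercept else []
    for term in terms:
        names.extend(interaction_names(term, **fmt))
    return names


terms = [[("x", None)], [("x", None), ("color", ["red", "blue"])]]
print(column_names(terms))
# ['Intercept', 'x', 'x:color[red]', 'x:color[blue]']
print(column_names(terms, intercept_name="const",
                   interaction_separator="*",
                   categorical_format="{name}={category}"))
# ['const', 'x', 'x*color=red', 'x*color=blue']
```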

* Add formulaic to conda recipe

* Implement `C()` function to convert to categoricals

* Auto-convert strings to categories

* Fix C() not working from materializer interface

* Add the pandasmaterializer tests from formulaic

* Add formulaic to setup.py deps

* Implement suggestions from code review

* Clean up code

 - Add docstrings
 - Add type hints
 - Rename some classes

* Pin formulaic minimum version

* Add support for architectures not supported by xsimd (#262)

* Release 3.1.9 (#263)

* Pre-commit autoupdate (#264)

Co-authored-by: quant-ranger[bot] <132915763+quant-ranger[bot]@users.noreply.github.com>

* Add params for density and cardinality thresholds

* Skip python 3.6 build

* Refactor to avoid circular imports

* Interaction of dropped and NA is dropped

* Add type hint for context

* Add unit tests for interactable vectors

* Add more checks

* Change argument name

* Make C() stateful (remember levels)

* Add test for categorizer state
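
Why remembering levels matters can be shown with a small stand-in. The `Categorizer` class below is hypothetical and written in plain Python; it is not the `C()` implementation from this PR, only a sketch of the "remember levels" idea.

```python
class Categorizer:
    """Hypothetical stateful transform: levels are fixed on first use."""

    def __init__(self):
        self.levels = None  # filled in by the first call

    def __call__(self, values):
        if self.levels is None:
            # First call (e.g. on training data): remember the levels.
            self.levels = sorted(set(values))
        index = {level: i for i, level in enumerate(self.levels)}
        # One indicator row per observation, always len(self.levels) wide.
        return [
            [int(index.get(v) == j) for j in range(len(self.levels))]
            for v in values
        ]


encode = Categorizer()
print(encode(["b", "a", "c"]))  # [[0, 1, 0], [1, 0, 0], [0, 0, 1]]
# New data with a single level still produces three aligned columns.
print(encode(["c"]))            # [[0, 0, 1]]
```

Without the stored state, the second call would see only one level and return a one-column matrix that no longer lines up with the columns produced from the first data set.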

* More correct handling of encoding categoricals

* Make adding an intercept implicitly parametrizable

Default is False

* Add na_action parameter to constructor

* Add test for sparse numerical columns

* Add option to not add the constant column

* Pre-commit autoupdate (#274)

* Pre-commit autoupdate (#276)

Co-authored-by: quant-ranger[bot] <132915763+quant-ranger[bot]@users.noreply.github.com>

* Bump pypa/gh-action-pypi-publish from 1.8.6 to 1.8.7 (#277)

Bumps [pypa/gh-action-pypi-publish](https://github.com/pypa/gh-action-pypi-publish) from 1.8.6 to 1.8.7.
- [Release notes](https://github.com/pypa/gh-action-pypi-publish/releases)
- [Commits](pypa/gh-action-pypi-publish@v1.8.6...v1.8.7)

---
updated-dependencies:
- dependency-name: pypa/gh-action-pypi-publish
  dependency-type: direct:production
  update-type: version-update:semver-patch
...

Signed-off-by: dependabot[bot] <[email protected]>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

* Bump pypa/gh-action-pypi-publish from 1.8.7 to 1.8.8 (#279)

Bumps [pypa/gh-action-pypi-publish](https://github.com/pypa/gh-action-pypi-publish) from 1.8.7 to 1.8.8.
- [Release notes](https://github.com/pypa/gh-action-pypi-publish/releases)
- [Commits](pypa/gh-action-pypi-publish@v1.8.7...v1.8.8)

---
updated-dependencies:
- dependency-name: pypa/gh-action-pypi-publish
  dependency-type: direct:production
  update-type: version-update:semver-patch
...

Signed-off-by: dependabot[bot] <[email protected]>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

* Bump pypa/cibuildwheel from 2.13.1 to 2.14.1 (#280)

Bumps [pypa/cibuildwheel](https://github.com/pypa/cibuildwheel) from 2.13.1 to 2.14.1.
- [Release notes](https://github.com/pypa/cibuildwheel/releases)
- [Changelog](https://github.com/pypa/cibuildwheel/blob/main/docs/changelog.md)
- [Commits](pypa/cibuildwheel@v2.13.1...v2.14.1)

---
updated-dependencies:
- dependency-name: pypa/cibuildwheel
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <[email protected]>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

* Minimal implementation (tests green)

* Remove sum method and rely on np.sum

* Force DenseMatrix to always be 2-dimensional

* Add __repr__ and __str__ methods

* Fix as_mx

* Fix ufunc return value

* Wrap SparseMatrix, too

* Demo of how the ufunc interface can be implemented

* Do not subclass csc_matrix

* Improve the performance of `from_pandas` in the case of low-cardinality categoricals (#275)

* Improve the performance of `from_pandas`

* Update changelog according to review

* Add benchmark data to .gitignore (#282)

* Demonstrate binary ufuncs for sparse

* Add tocsc method

* Fix type checks

* Minor improvements

* ufunc support for categoricals

* Remove __array_ufunc__ interface

* Remove numpy operator mixin

* Add hstack function

* Add method for unpacking underlying array

* Add __matmul__ methods to SparseMatrix

* Stricter and more consistent indexing

* Be consistent when instantiating from 1d arrays

* Adjust tests to work with v4

* Fix type hints

* Add changelog entry

* term and column names for formula-based matrices

* Fix handling of formula-based names

* Add tests for formula-based names

---------

Signed-off-by: dependabot[bot] <[email protected]>
Co-authored-by: Martin Stancsics <[email protected]>
Co-authored-by: Uwe L. Korn <[email protected]>
Co-authored-by: quant-ranger[bot] <132915763+quant-ranger[bot]@users.noreply.github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

* Apply Matthias' suggestions

Co-authored-by: Matthias Schmidtblaicher <[email protected]>

* Allow missing values in `CategoricalMatrix` (#281)

* Add missing support to categoricals

* Rename functions

* Parametrize missing behavior in constructors

* Return a maskedarray from recover_orig

* Propagate missing_method when indexing

* Add tests

* Template all the things!

* Privatize has_missing attribute

* Add changelog entry

* Add option to treat missing values as a category

* Update changelog

* Raise if the missing category already exists

* Add tests for missing name and raise on existing
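
The missing-value behaviours listed in these commits can be sketched in plain Python. The function name, the method names `"fail"`, `"zero"` and `"convert"`, and the `"(MISSING)"` label are assumptions made for illustration; the commit log itself only says that the behaviour is parametrized, that missing values can become a category, and that an existing category of that name raises.

```python
def encode_with_missing(values, missing_method="fail",
                        missing_name="(MISSING)"):
    """Indicator-encode `values`, where None marks a missing entry.

    Hypothetical sketch, not the CategoricalMatrix implementation.
    """
    if missing_method not in ("fail", "zero", "convert"):
        raise ValueError(f"Unknown missing_method: {missing_method!r}")

    levels = sorted({v for v in values if v is not None})
    if any(v is None for v in values):
        if missing_method == "fail":
            raise ValueError("Categorical data contains missing values.")
        if missing_method == "convert":
            # Treat missing as its own category, unless the name is taken.
            if missing_name in levels:
                raise ValueError(
                    f"Missing category {missing_name!r} already exists."
                )
            levels.append(missing_name)
            values = [missing_name if v is None else v for v in values]

    index = {level: i for i, level in enumerate(levels)}
    # Under "zero", a missing value matches no level: an all-zero row.
    rows = [[int(index.get(v) == j) for j in range(len(levels))]
            for v in values]
    return levels, rows


print(encode_with_missing(["a", None, "b"], "zero"))
# (['a', 'b'], [[1, 0], [0, 0], [0, 1]])
print(encode_with_missing(["a", None, "b"], "convert"))
# (['a', 'b', '(MISSING)'], [[1, 0, 0], [0, 0, 1], [0, 1, 0]])
```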

* Don't skip tests (they are fast)

* Apply suggestions from review

* Fix indexing

* Fix intercept name in formulas

* Add missing categorical functionality to formulas

* Much cooler handling of missing categoricals

* Add changelog entry

* Correctly create missing category from model_spec (#297)

* pyupgrade 3.9

* make ruff and mypy happy

* bump minimum formulaic version (stateful transforms)

* add test case with custom cat format

* pin formulaic minimum version to 0.6 (#340)

* cosmetics

* Raise for unseen categories when materializing from an existing `ModelSpec` (#341)

* Raise error on unseen levels when materializing

* Fix test for unseen categories

* Add test for raising on unseen categories

* Properly handle missings when checking for unseen

* Expand test for unseen missings
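
A minimal sketch of the check these commits describe: compare new data against the levels stored at fit time and raise on anything new, while not counting missing entries as a new level. The function name and the `drop_missing` flag are hypothetical and do not reflect the actual `ModelSpec` code path.

```python
def check_unseen(values, known_levels, drop_missing=True):
    """Raise if `values` contains a category absent from `known_levels`.

    None marks a missing entry; with drop_missing=True it is ignored
    rather than reported as an unseen category.
    """
    observed = {v for v in values if not (drop_missing and v is None)}
    unseen = observed - set(known_levels)
    if unseen:
        raise ValueError(f"Unseen categories: {sorted(unseen, key=str)}")


check_unseen(["a", None, "b"], known_levels=["a", "b"])  # passes silently

try:
    check_unseen(["a", "z"], known_levels=["a", "b"])
except ValueError as err:
    print(err)  # Unseen categories: ['z']
```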

* Improve attribute name

* Add comment about dropping missings in tests for new levels

* consistent tense

* typo

* slightly improve wording

* Describe breaking change

* improve wording

* review comments

* add change from #356

* fix

* set default context to None

* add scope to other test, too

* tiny docstring cosmetics

* remove duplicate . [skip-ci]

* more docstring formatting

---------

Signed-off-by: dependabot[bot] <[email protected]>
Co-authored-by: Matthias Schmidtblaicher <[email protected]>
Co-authored-by: Uwe L. Korn <[email protected]>
Co-authored-by: quant-ranger[bot] <132915763+quant-ranger[bot]@users.noreply.github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: Marc-Antoine Schmidt <[email protected]>
Co-authored-by: Matthias Schmidtblaicher <[email protected]>
Co-authored-by: Martin Stancsics <[email protected]>