Implementation of the Separate and Conquer or Covering-Algorithm for scikit-learn. This is a classifier learning a collection (theory) of human-comprehensible rules.
sklearn_seco aims to be
- computationally fast,
- open to algorithm modification (think e.g. new heuristics),
- yet still comprising of understandable code.
It was developed as a masters thesis at the Knowledge Engineering Group at TU Darmstadt, under supervision of Prof. Johannes Fürnkranz. Check the thesis document (see the releases page) for additional documentation, including evaluation plots.
For current test suite results, check Continuous Integration.
To run a comparison of sklearn_seco.RipperEstimator
with
weka.JRip,
weka.J48, and
sklearn.dtree
on a selected collection of UCI datasets, run python3 evaluation.py
.
Required are python >= 3.6 and the packages defined in setup.py
.
If needed, a known-working list of versioned dependencies is pinned in requirements.txt
.
-
If you want to work on the code, check out the repository:
git clone https://github.com/azrdev/sklearn-seco
and run directly from the working tree, or make an editable install:
pip install --update -e "/path/to/sklearn-seco[numba]"
-
If you want to use the project without modifying it, install the latest
master
withpip install --update "git+https://github.com/azrdev/sklearn-seco#egg=sklearn_seco[numba]"
-
TODO: publish on pypi so getting a stable version with
pip install "sklearn-seco[numba]"
is possible.
Note that for the speed-optimized matching you need to install numba
,
and for the coverage plots in extra.py
and tests/test_extra.py
you need matplotlib
.
The former is installed above, through specified
"extra"
dependency sets "numba" and "tests".
-
abstract seco is implemented and usable, the compatibility test
sklearn.utils.estimator_checks.check_estimator()
succeeds completely for SimpleSeCo and CN2 -
CN2 has not been thoroughly compared to original code or the Orange implementation, but should be complete
-
Ripper lacks the original class binarization strategy and the global post-optimization, therefore results are not identical to JRip (the only other freely available implementation).
-
testsuite has a few consistent failures (all in test_concrete.py):
-
test_perfectly_correlated_categories_multiclass[CN2Estimator]
fails because CN2 learns only one rule when it gets tox=[[… 2 0] [… 0 1]], y=[2 1]
which is perfectly valid behavior. -
test_sklearn_check_estimator[IrepEstimator]
failssklearn.utils.estimator_checks.check_classifier_data_not_an_array
(NOTE: maybe later tests in check_estimator fail, too) because: if default rule is better than any refinement and survives rule_stopping_criterion, the learned theory is[init_rule()]
. We consider this an error and raise an exception , maybe stopping criteria are buggy that they let the default rule through. -
test_sklearn_check_estimator[RipperEstimator]
failssklearn.utils.estimator_checks.check_classifiers_train
due to scikit-learn/scikit-learn#14124 (NOTE: maybe later tests in check_estimator fail, too) -
pypy test runs timeout on Travis-CI after 10 minutes.
-
-
various TODOs throughout the code mark missing details and/or ideas for improvement (of functionality or runtime performance)