Merge branch 'release-1.5'
bbengfort committed Aug 21, 2022
2 parents cbac5e3 + 91cf014 commit 223a252
Showing 177 changed files with 4,405 additions and 559 deletions.
2 changes: 1 addition & 1 deletion .github/PULL_REQUEST_TEMPLATE.md
@@ -70,4 +70,4 @@ Here's a handy checklist to go through before submitting a PR, note that you can

<!-- If you've added to the docs -->

- [ ] _Have you built the docs using `make html`?_
- [ ] _Have you built the docs using `make html` (must be run from `docs/`)?_
2 changes: 1 addition & 1 deletion .github/workflows/ci.yml
@@ -127,4 +127,4 @@ jobs:
- name: Run Sphinx
uses: ammaraskar/sphinx-action@master
with:
docs-folder: "docs/"
docs-folder: "docs/"
34 changes: 34 additions & 0 deletions .github/workflows/linting.yml
@@ -0,0 +1,34 @@
# This workflow will install Python dependencies, run tests and lint with a variety of Python versions
# For more information see: https://help.github.com/actions/language-and-framework-guides/using-python-with-github-actions

name: Yellowbrick PR Linting

on:
  # Trigger on pull request always (note the trailing colon)
  pull_request:

jobs:
  # Run pre-commit checks on the files changed
  linting:
    runs-on: ubuntu-latest
    name: Linting
    steps:
      - name: Checkout Code
        uses: actions/checkout@v2
        with:
          fetch-depth: 0

      - name: Set up Python
        uses: actions/setup-python@v2
        with:
          python-version: 3.9

      - name: Install Dependencies
        run: |
          python -m pip install --upgrade pip
          pip install pre-commit
          pre-commit install
      - name: Run Checks
        run: |
          pre-commit run --from-ref origin/${{ github.base_ref }} --to-ref HEAD --show-diff-on-failure
26 changes: 26 additions & 0 deletions .pre-commit-config.yaml
@@ -0,0 +1,26 @@
# See https://pre-commit.com for more information
# See https://pre-commit.com/hooks.html for more hooks
repos:
  - repo: https://github.com/pre-commit/pre-commit-hooks
    rev: v3.2.0
    hooks:
      - id: trailing-whitespace
      - id: end-of-file-fixer
      - id: check-yaml
      - id: check-added-large-files
      - id: check-json
      - id: check-merge-conflict
  - repo: https://github.com/psf/black
    rev: 22.6.0
    hooks:
      - id: black
  - repo: https://github.com/PyCQA/flake8
    rev: 5.0.4
    hooks:
      - id: flake8
  - repo: https://github.com/pre-commit/pygrep-hooks
    rev: v1.9.0
    hooks:
      - id: rst-backticks
      - id: rst-directive-colons
      - id: rst-inline-touching-normal
13 changes: 12 additions & 1 deletion CONTRIBUTING.md
@@ -94,7 +94,18 @@ Once forked, use the following steps to get your development environment set up
$ pip install -r docs/requirements.txt
```
4. Switch to the develop branch.
4. (Optional) Set up pre-commit hooks.
When opening a PR in the Yellowbrick repository, a series of checks will be run on your contribution, some of which lint and look at the formatting of your code. These may indicate some changes that need to be made before your contribution can be reviewed. You can set up pre-commit hooks to run these checks locally upon running `git commit` to ensure your contribution will pass formatting and linting checks. To set this up, you will need to uncomment the pre-commit line in `requirements.txt` and then run the following commands:
```
$ pip install -r requirements.txt
$ pre-commit install
```
The next time you run `git commit` in the Yellowbrick repository, the checks will automatically run.
5. Switch to the develop branch.
The Yellowbrick repository has a `develop` branch that is the primary working branch for contributions. It is probably already the branch you're on, but you can make sure and switch to it as follows::
9 changes: 5 additions & 4 deletions MAINTAINERS.md
@@ -13,17 +13,18 @@ For everyone who has [contributed](https://github.com/DistrictDataLabs/yellowbri
This is a list of the primary project maintainers. Feel free to @ message them in issues and converse with them directly.

- [bbengfort](https://github.com/bbengfort)
- [ndanielsen](https://github.com/ndanielsen)
- [rebeccabilbro](https://github.com/rebeccabilbro)
- [lwgray](https://github.com/lwgray)
- [NealHumphrey](https://github.com/NealHumphrey)
- [jkeung](https://github.com/jkeung)
- [pdamodaran](https://github.com/pdamodaran)

## Core Contributors

This is a list of the core-contributors of the project. Core contributors set the road map and vision of the project. Keep an eye out for them in issues and check out their work to use as inspiration! Most likely they would also be happy to chat and answer questions.

- [rebeccabilbro](https://github.com/rebeccabilbro)
- [pdeziel](https://github.com/pdeziel)
- [ndanielsen](https://github.com/ndanielsen)
- [NealHumphrey](https://github.com/NealHumphrey)
- [jkeung](https://github.com/jkeung)
- [mattandahalfew](https://github.com/mattandahalfew)
- [tuulihill](https://github.com/tuulihill)
- [balavenkatesan](https://github.com/balavenkatesan)
1 change: 1 addition & 0 deletions README.md
@@ -7,6 +7,7 @@
[![Language Grade: Python](https://img.shields.io/lgtm/grade/python/g/DistrictDataLabs/yellowbrick.svg?logo=lgtm&logoWidth=18)](https://lgtm.com/projects/g/DistrictDataLabs/yellowbrick/context:python)
[![PyPI version](https://badge.fury.io/py/yellowbrick.svg)](https://badge.fury.io/py/yellowbrick)
[![Documentation Status](https://readthedocs.org/projects/yellowbrick/badge/?version=latest)](http://yellowbrick.readthedocs.io/en/latest/?badge=latest)
[![Black](https://img.shields.io/badge/code%20style-black-000000.svg)](https://github.com/psf/black)
[![DOI](https://zenodo.org/badge/DOI/10.5281/zenodo.1206239.svg)](https://doi.org/10.5281/zenodo.1206239)
[![JOSS](http://joss.theoj.org/papers/10.21105/joss.01075/status.svg)](https://doi.org/10.21105/joss.01075)
[![Binder](https://mybinder.org/badge.svg)](https://mybinder.org/v2/gh/DistrictDataLabs/yellowbrick/develop?filepath=examples%2Fexamples.ipynb)
2 changes: 1 addition & 1 deletion docs/api/features/rankd.rst
@@ -29,7 +29,7 @@ A one-dimensional ranking of features utilizes a ranking algorithm that takes in
# Load the credit dataset
X, y = load_credit()

# Instantiate the 1D visualizer with the Sharpiro ranking algorithm
# Instantiate the 1D visualizer with the Shapiro ranking algorithm
visualizer = Rank1D(algorithm='shapiro')

visualizer.fit(X, y) # Fit the data to the visualizer
79 changes: 79 additions & 0 deletions docs/api/model_selection/dropping_curve.rst
@@ -0,0 +1,79 @@
.. -*- mode: rst -*-
Feature Dropping Curve
=============================

================= =====================
Visualizer :class:`~yellowbrick.model_selection.dropping_curve.DroppingCurve`
Quick Method :func:`~yellowbrick.model_selection.dropping_curve.dropping_curve`
Models Classification, Regression, Clustering
Workflow Model Selection
================= =====================

A feature dropping curve (FDC) shows the relationship between the score and the number of features used.
This visualizer randomly drops input features, showing how the estimator benefits from additional features of the same type.
For example, how many air quality sensors are needed across a city to accurately predict city-wide pollution levels?

Feature dropping curves helpfully complement :doc:`rfecv` (RFECV).
In the air quality sensor example, RFECV finds which sensors to keep in the specific city.
Feature dropping curves estimate how many sensors a similar-sized city might need to track pollution levels.

Feature dropping curves are common in the field of neural decoding, where they are called `neuron dropping curves <https://dx.doi.org/10.3389%2Ffnsys.2014.00102>`_ (`example <https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8293867/figure/F3/>`_, panels C and H).
Neural decoding research often quantifies how performance scales with neuron (or electrode) count.
Because neurons do not correspond directly between participants, we use random neuron subsets to simulate what performance to expect when recording from other participants.

To show how this works in practice, consider an image classification example using `handwritten digits <https://archive.ics.uci.edu/ml/datasets/Optical+Recognition+of+Handwritten+Digits>`_.

.. plot::
    :context: close-figs
    :alt: Dropping Curve on the digits dataset

    from sklearn.svm import SVC
    from sklearn.datasets import load_digits

    from yellowbrick.model_selection import DroppingCurve

    # Load dataset
    X, y = load_digits(return_X_y=True)

    # Initialize visualizer with estimator
    visualizer = DroppingCurve(SVC())

    # Fit the data to the visualizer
    visualizer.fit(X, y)
    # Finalize and render the figure
    visualizer.show()

This figure shows an input feature dropping curve.
Since the features are informative, the accuracy increases with larger feature subsets.
The shaded area represents the variability of cross-validation, one standard deviation above and below the mean accuracy score drawn by the curve.

The visualization can be interpreted as the performance if we knew some image pixels were corrupted.
As an alternative interpretation, the dropping curve roughly estimates the accuracy if the image resolution was downsampled.
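
The random-subset procedure behind the curve can be sketched without any plotting. The example below is only an illustration, not Yellowbrick's implementation: it assumes a synthetic two-class dataset, a hand-written nearest-centroid classifier, and a single train/test split instead of cross-validation. The helper names (``make_data``, ``centroid_accuracy``, ``dropping_curve_scores``) are hypothetical.

```python
import random
from statistics import mean, pstdev


def make_data(n_samples=200, n_features=8, seed=0):
    """Synthetic data: class means differ by 1.0 in every feature (assumption)."""
    rng = random.Random(seed)
    X, y = [], []
    for i in range(n_samples):
        label = i % 2  # alternate labels so both halves contain both classes
        X.append([rng.gauss(label, 1.0) for _ in range(n_features)])
        y.append(label)
    return X, y


def centroid_accuracy(X_train, y_train, X_test, y_test, cols):
    """Accuracy of a nearest-centroid classifier restricted to the columns in cols."""
    centroids = {}
    for label in set(y_train):
        rows = [x for x, t in zip(X_train, y_train) if t == label]
        centroids[label] = [mean(r[c] for r in rows) for c in cols]

    def predict(x):
        # Pick the class whose centroid is closest in squared Euclidean distance
        return min(
            centroids,
            key=lambda lab: sum((x[c] - m) ** 2 for c, m in zip(cols, centroids[lab])),
        )

    hits = sum(predict(x) == t for x, t in zip(X_test, y_test))
    return hits / len(y_test)


def dropping_curve_scores(X, y, sizes, n_repeats=10, seed=0):
    """For each subset size, score several random feature subsets.

    Returns {size: (mean accuracy, standard deviation)}.
    """
    rng = random.Random(seed)
    n_features = len(X[0])
    half = len(X) // 2  # simple holdout split instead of cross-validation
    X_train, y_train, X_test, y_test = X[:half], y[:half], X[half:], y[half:]
    curve = {}
    for k in sizes:
        scores = [
            centroid_accuracy(
                X_train, y_train, X_test, y_test, rng.sample(range(n_features), k)
            )
            for _ in range(n_repeats)
        ]
        curve[k] = (mean(scores), pstdev(scores))
    return curve


X, y = make_data()
curve = dropping_curve_scores(X, y, sizes=[1, 4, 8])
for size, (avg, std) in curve.items():
    print(f"{size} features: accuracy {avg:.2f} +/- {std:.2f}")
```

The mean plays the role of the plotted line and the standard deviation the role of the shaded band. Because every feature in this synthetic dataset is informative, the mean score should rise as the subsets grow.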

Quick Method
------------
The same functionality can be achieved with the associated quick method ``dropping_curve``. This method will build the ``DroppingCurve`` with the associated arguments, fit it, then (optionally) immediately show the visualization.

.. plot::
    :context: close-figs
    :alt: Dropping Curve Quick Method on the digits dataset

    from sklearn.svm import SVC
    from sklearn.datasets import load_digits

    from yellowbrick.model_selection import dropping_curve

    # Load dataset
    X, y = load_digits(return_X_y=True)

    dropping_curve(SVC(), X, y)


API Reference
-------------

.. automodule:: yellowbrick.model_selection.dropping_curve
    :members: DroppingCurve, dropping_curve
    :undoc-members:
    :show-inheritance:
2 changes: 2 additions & 0 deletions docs/api/model_selection/index.rst
@@ -14,6 +14,7 @@ The currently implemented model selection visualizers are as follows:
- :doc:`cross_validation`: displays cross-validated scores as a bar chart with average as a horizontal line.
- :doc:`importances`: rank features by relative importance in a model
- :doc:`rfecv`: select a subset of features by importance
- :doc:`dropping_curve`: select subsets of features randomly

Model selection makes heavy use of cross validation to measure the performance of an estimator. Cross validation splits a dataset into a training data set and a test data set; the model is fit on the training data and evaluated on the test data. This helps avoid a common pitfall, overfitting, where the model simply memorizes the training data and does not generalize well to new or unknown input.
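
As a rough illustration of the splitting step described above, the sketch below generates k-fold train/test index pairs using only the standard library. It is a simplified stand-in for what a cross-validation splitter does, not the code Yellowbrick or scikit-learn uses; the function name ``k_fold_indices`` and its parameters are hypothetical.

```python
import random


def k_fold_indices(n_samples, n_splits=5, seed=0):
    """Yield (train, test) index lists for a shuffled k-fold split."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Deal the shuffled indices into n_splits roughly equal folds
    folds = [indices[i::n_splits] for i in range(n_splits)]
    for i in range(n_splits):
        test = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


splits = list(k_fold_indices(10, n_splits=5))
for train, test in splits:
    print("train:", sorted(train), "test:", sorted(test))
```

Each sample lands in the test set exactly once, so every score is measured on data the model did not see during fitting.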

@@ -27,3 +28,4 @@ There are many ways to define how to split a dataset for cross validation. For m
cross_validation
importances
rfecv
dropping_curve
2 changes: 2 additions & 0 deletions docs/api/model_selection/validation_curve.rst
@@ -40,6 +40,8 @@ In our first example, we'll explore using the ``ValidationCurve`` visualizer wit
viz.fit(X, y)
viz.show()

To further customize this plot, the visualizer also supports a ``markers`` parameter that changes the marker style.

After loading and wrangling the data, we initialize the ``ValidationCurve`` with a ``DecisionTreeRegressor``. Decision trees become more overfit the deeper they are because at each level of the tree the partitions are dealing with a smaller subset of data. One way to deal with this overfitting process is to limit the depth of the tree. The validation curve explores the relationship of the ``"max_depth"`` parameter to the R2 score with 10 shuffle split cross-validation. The ``param_range`` argument specifies the values of ``max_depth``, here from 1 to 10 inclusive.

We can see in the resulting visualization that a depth limit of less than 5 levels severely underfits the model on this data set because the training score and testing score climb together in this parameter range, and because of the high variability of cross validation on the test scores. After a depth of 7, the training and test scores diverge; this is because deeper trees are beginning to overfit the training data, providing no generalizability to the model. However, because the cross validation score does not necessarily decrease, the model is not suffering from high error due to variance.
63 changes: 63 additions & 0 deletions docs/api/text/correlation.rst
@@ -0,0 +1,63 @@
.. -*- mode: rst -*-
Word Correlation Plot
=====================

Word correlation illustrates the extent to which words or phrases co-appear across the documents in a corpus. This can be useful for understanding the relationships between known text features in a corpus with many documents. ``WordCorrelationPlot`` allows for the visualization of the document occurrence correlations between select words in a corpus. For a number of features n, the plot renders an n x n heatmap containing correlation values.

The correlation values are computed using the `phi coefficient <https://en.wikipedia.org/wiki/Phi_coefficient>`_ metric, which is a measure of the association between two binary variables. A value close to 1 or -1 indicates that the occurrences of the two features are highly positively or negatively correlated, while a value close to 0 indicates no relationship between the two features.
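
To make the metric concrete, the sketch below computes the phi coefficient from document-occurrence vectors in plain Python. It is an illustration under stated assumptions rather than Yellowbrick's implementation: occurrence is detected here with a case-insensitive substring match, and the helper names (``occurrence_vectors``, ``phi_coefficient``) and the toy documents are hypothetical.

```python
from math import sqrt


def occurrence_vectors(docs, words):
    """Map each word to a 0/1 vector marking the documents it appears in.

    Assumption: a case-insensitive substring match stands in for real tokenization.
    """
    return {w: [1 if w.lower() in d.lower() else 0 for d in docs] for w in words}


def phi_coefficient(x, y):
    """Phi coefficient between two equal-length binary sequences."""
    if len(x) != len(y):
        raise ValueError("sequences must have the same length")
    # Cells of the 2 x 2 contingency table
    n11 = sum(1 for a, b in zip(x, y) if a and b)
    n10 = sum(1 for a, b in zip(x, y) if a and not b)
    n01 = sum(1 for a, b in zip(x, y) if not a and b)
    n00 = sum(1 for a, b in zip(x, y) if not a and not b)
    denom = sqrt((n11 + n10) * (n01 + n00) * (n11 + n01) * (n10 + n00))
    if denom == 0:
        return 0.0  # undefined when a word occurs in all or none of the documents
    return (n11 * n00 - n10 * n01) / denom


docs = [
    "the game was fun to play",
    "play the game again",
    "oil prices rose",
    "the oil market fell",
]
occ = occurrence_vectors(docs, ["game", "play", "oil"])
print(phi_coefficient(occ["game"], occ["play"]))
print(phi_coefficient(occ["game"], occ["oil"]))
```

In this toy corpus "game" and "play" always co-occur, while "game" and "oil" never do, so the two pairs sit at opposite ends of the scale.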

================= ==============================
Visualizer :class:`~yellowbrick.text.correlation.WordCorrelationPlot`
Quick Method :func:`~yellowbrick.text.correlation.word_correlation()`
Models Text Modeling
Workflow Feature Engineering
================= ==============================

.. plot::
    :context: close-figs
    :alt: Word Correlation Plot

    from yellowbrick.datasets import load_hobbies
    from yellowbrick.text.correlation import WordCorrelationPlot

    # Load the text corpus
    corpus = load_hobbies()

    # Create the list of words to plot
    words = ["Tatsumi Kimishima", "Nintendo", "game", "play", "man", "woman"]

    # Instantiate the visualizer and draw the plot
    viz = WordCorrelationPlot(words)
    viz.fit(corpus.data)
    viz.show()


Quick Method
------------

The same functionality above can be achieved with the associated quick method ``word_correlation``. This method will build the Word Correlation Plot object with the associated arguments, fit it, then (optionally) immediately show the visualization.

.. plot::
    :context: close-figs
    :alt: Word Correlation Plot

    from yellowbrick.datasets import load_hobbies
    from yellowbrick.text.correlation import word_correlation

    # Load the text corpus
    corpus = load_hobbies()

    # Create the list of words to plot
    words = ["Game", "player", "score", "oil"]

    # Draw the plot
    word_correlation(words, corpus.data)

API Reference
-------------

.. automodule:: yellowbrick.text.correlation
    :members: WordCorrelationPlot, word_correlation
    :undoc-members:
    :show-inheritance:
3 changes: 3 additions & 0 deletions docs/api/text/index.rst
@@ -11,6 +11,7 @@ We currently have five text-specific visualizations implemented:
- :doc:`tsne`: plot similar documents closer together to discover clusters
- :doc:`umap_vis`: plot similar documents closer together to discover clusters
- :doc:`dispersion`: plot the dispersion of target words throughout a corpus
- :doc:`correlation`: plot the correlation between target words across the documents in a corpus
- :doc:`postag`: plot the counts of different parts-of-speech throughout a tagged corpus

Note that the examples in this section require a corpus of text data, see :doc:`the hobbies corpus <../datasets/hobbies>` for a sample dataset.
@@ -21,6 +22,7 @@ Note that the examples in this section require a corpus of text data, see :doc:`
from yellowbrick.text import TSNEVisualizer
from yellowbrick.text import UMAPVisualizer
from yellowbrick.text import DispersionPlot
from yellowbrick.text import WordCorrelationPlot
from yellowbrick.text import PosTagVisualizer
from sklearn.feature_extraction.text import TfidfVectorizer
@@ -33,4 +35,5 @@ Note that the examples in this section require a corpus of text data, see :doc:`
tsne
umap_vis
dispersion
correlation
postag
34 changes: 34 additions & 0 deletions docs/changelog.rst
@@ -3,6 +3,40 @@
Changelog
=========

Version 1.5
-----------

* Tag: v1.5_
* Deployed Sunday, August 21, 2022
* Current Contributors: Stefanie Molin, Prema Roman, Sangam Swadik, David Gilbertson, Larry Gray, Benjamin Bengfort, @admo1, @charlesincharge, Uri Nussbaum, Patrick Deziel, Rebecca Bilbro

Major
- Added ``WordCorrelationPlot`` Visualizer
- Built tests for using sklearn pipeline with visualizers
- Allowed Marker Style to be specified in Validation Curve Visualizer
- Fixed ``get_params`` for estimator wrapper to prevent ``AttributeError``
- Updated missing values visualizer to handle multiple data types and work on both numpy arrays and pandas data frames.
- Added pairwise distance metrics to scoring metrics in KElbowVisualizer
Minor
- Pegged Numba to v0.55.2
- Updated Umap to v0.5.3
- Fixed Missing labels in classification report visualizer
- Updated Numpy to v1.22.0
Documentation
- The Spanish language Yellowbrick docs are now live: https://www.scikit-yb.org/es/latest/
- Added Dropping curve documentation
- Added new example Notebook for Regression Visualizers
- Fixed Typo in PR section of getting started docs
- Fixed Typo in rank docs
- Updated docstring in kneed.py utility file
- Clarified how to run ‘make html’ in PR template
Infrastructure
- Added ability to run linting Actions on PRs
- Implemented black code formatting as pre-commit hook

.. _v1.5: https://github.com/DistrictDataLabs/yellowbrick/releases/tag/v1.5


Version 1.4
-----------
