Create a feature selection/evaluation template #249

Open
dfsnow opened this issue Jun 25, 2024 · 1 comment · May be fixed by #250
Labels: new data/feature (Create or edit a column/feature or collect new data)

dfsnow commented Jun 25, 2024

The current CCAO feature evaluation process for new model features is very ad hoc. We typically look at the change in model performance metrics before and after the addition of a new feature, as well as the absolute SHAP values associated with that feature.

To make this ad hoc process slightly more repeatable and rigorous, we should create a template Quarto document we can use to evaluate new features. This document should contain both standard, repeatable sections (e.g. model performance stats by township) and a series of questions that will likely require additional ad hoc analysis. For example, given a question like "Where is the new model feature most impactful?" and a feature that adds distance to the nearest stadium, one might add maps of PIN-level SHAP values surrounding each stadium.
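
For the stadium example, a minimal sketch of what that ad hoc map might look like. The `shap_values` and `pin_geometries` objects, and the `shap_dist_to_stadium` column, are hypothetical placeholders, not names from the actual pipeline:

```r
library(dplyr)
library(ggplot2)
library(sf)

# Join PIN-level SHAP values for the new feature to parcel geometries.
# `shap_values` is assumed to be a table keyed by PIN with one SHAP column
# per feature; `pin_geometries` is assumed to be an sf object keyed by PIN
pin_shap <- shap_values %>%
  select(meta_pin, shap_dist_to_stadium) %>%
  inner_join(pin_geometries, by = "meta_pin") %>%
  st_as_sf()

# Map the SHAP values to see where (if anywhere) the feature has an impact
ggplot(pin_shap) +
  geom_sf(aes(color = shap_dist_to_stadium), size = 0.1) +
  scale_color_viridis_c(name = "SHAP value") +
  labs(title = "PIN-level SHAP values for the hypothetical dist_to_stadium feature")
```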

Goal

The goal here is to remove (or exclude in the first place) features that have no predictive power in any geography (i.e. they are merely noise). The goal is not to remove features that may be redundant, only mildly predictive, or only predictive in certain geographic areas; that work is done more or less automatically by the model, which is regularized and performs other forms of de facto feature selection.

Task

Create a Quarto document at analyses/new-feature-template.qmd that can be used to evaluate whether new features are merely noise. The document should:

  • Be buildable from existing data (see the loading sketch after this list):
    • It should reference a new model output run containing the new feature you added, and should load the modeling results directly from S3
    • It should use the output metadata to load the input data from DVC, using the DVC S3 cache
    • Basically, anyone should be able to click render on the document and have it build, assuming they have credentials for S3
  • Exist as a one-off. Unlike the documents in reports/ (which are rendered on every run), documents in analyses/ are only run to answer a specific, one-time question.
  • Contain three types of content:
    • Templated content which does not need to be changed per feature, i.e. model performance statistics, aggregate SHAP plots, etc. (see the templated-content sketch below)
    • Ad-hoc content specific to that feature (see the stadium example above)
    • Text that explains the plots/tables and indicates what decision was reached
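
A minimal sketch of the loading section. The run ID, bucket names, key layout, and file names below are placeholders; the real paths should come from the model's output structure and the run's metadata:

```r
library(arrow)

# Hypothetical ID of the model run that includes the new feature
run_id <- "2024-06-25-example-run"

# Load modeling results directly from S3 (arrow can read s3:// URIs when AWS
# credentials are configured). The bucket and key layout here is a placeholder
shap_values <- read_parquet(
  paste0("s3://example-model-bucket/shap/run_id=", run_id, "/shap.parquet")
)
perf_metrics <- read_parquet(
  paste0("s3://example-model-bucket/performance/run_id=", run_id, "/performance.parquet")
)
metadata <- read_parquet(
  paste0("s3://example-model-bucket/metadata/run_id=", run_id, "/metadata.parquet")
)

# Use the run metadata to identify the input data version, then pull it from
# the DVC S3 cache. Shelling out to the DVC CLI is one option; the target
# path is a placeholder
system("dvc pull input/training_data.parquet")
training_data <- read_parquet("input/training_data.parquet")
```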

This document can be copied and then renamed for each new feature added, similar to the workflow for enterprise intelligence.
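
A minimal sketch of the templated content, assuming placeholder column names (`geography_type`, `geography_id`, and COD/PRD/PRB metric columns in the performance table, plus one numeric SHAP column per feature in `shap_values`); the actual schema will differ:

```r
library(dplyr)
library(tidyr)
library(ggplot2)

new_feature <- "dist_to_stadium"  # hypothetical name of the new feature

# Model performance stats by township (placeholder metric and column names)
perf_by_township <- perf_metrics %>%
  filter(geography_type == "township_code") %>%
  select(geography_id, cod, prd, prb) %>%
  arrange(geography_id)

# Aggregate SHAP importance: mean absolute SHAP value per feature, with the
# new feature highlighted. A feature that is pure noise should sit near zero
shap_importance <- shap_values %>%
  summarise(across(where(is.numeric), ~ mean(abs(.x), na.rm = TRUE))) %>%
  pivot_longer(everything(), names_to = "feature", values_to = "mean_abs_shap") %>%
  arrange(desc(mean_abs_shap))

shap_importance %>%
  slice_head(n = 20) %>%
  ggplot(aes(
    x = mean_abs_shap,
    y = reorder(feature, mean_abs_shap),
    fill = feature == new_feature
  )) +
  geom_col(show.legend = FALSE) +
  labs(x = "Mean |SHAP|", y = NULL, title = "Aggregate SHAP importance")
```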

@ccao-jardine

@ccao-jardine commented

Excellent. Let's add to the scope of templated content:

  • Descriptive stats of the input (median, range, histogram)
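
A minimal sketch of that block, reusing the hypothetical `training_data` and `dist_to_stadium` names from the sketches above:

```r
library(dplyr)
library(ggplot2)

# Median, range, and share missing for the new input feature
training_data %>%
  summarise(
    median = median(dist_to_stadium, na.rm = TRUE),
    min = min(dist_to_stadium, na.rm = TRUE),
    max = max(dist_to_stadium, na.rm = TRUE),
    pct_missing = mean(is.na(dist_to_stadium))
  )

# Histogram of the feature's distribution
ggplot(training_data, aes(x = dist_to_stadium)) +
  geom_histogram(bins = 50) +
  labs(x = "dist_to_stadium", y = "Count")
```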

Damonamajor linked a pull request on Jul 5, 2024 that will close this issue