Create a feature selection/evaluation template #249

Open
dfsnow opened this issue Jun 25, 2024 · 1 comment · May be fixed by #250
Labels: new data/feature (Create or edit a column/feature or collect new data)

dfsnow commented Jun 25, 2024

The current CCAO feature evaluation process for new model features is very ad hoc. We typically look at the change in model performance metrics before and after the addition of a new feature, as well as the absolute SHAP values associated with that feature.

To make this ad hoc process slightly more repeatable and rigorous, we should create a template Quarto document we can use to evaluate new features. This document should contain both standard, repeatable sections (e.g. model performance stats by township) and a series of questions that will likely require additional ad hoc analysis. For example, given a question like "Where is the new model feature most impactful?" and a feature that adds distance to the nearest stadium, one might add maps of PIN-level SHAP values surrounding each stadium.
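
For the stadium example, a minimal sketch of what that ad hoc map might look like. The `shap_values` and `pin_geometries` objects, and the `shap_dist_to_stadium` column, are hypothetical placeholders, not names from the actual pipeline:

```r
library(dplyr)
library(ggplot2)
library(sf)

# Join PIN-level SHAP values for the new feature to parcel geometries.
# `shap_values` is assumed to be a table keyed by PIN with one SHAP column
# per feature; `pin_geometries` is assumed to be an sf object keyed by PIN
pin_shap <- shap_values %>%
  select(meta_pin, shap_dist_to_stadium) %>%
  inner_join(pin_geometries, by = "meta_pin") %>%
  st_as_sf()

# Map the SHAP values to see where (if anywhere) the feature has an impact
ggplot(pin_shap) +
  geom_sf(aes(color = shap_dist_to_stadium), size = 0.1) +
  scale_color_viridis_c(name = "SHAP value") +
  labs(title = "PIN-level SHAP values for the hypothetical dist_to_stadium feature")
```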

Goal

The goal here is to remove (or exclude in the first place) features that have no predictive power in any geography (i.e. they are merely noise). The goal is not to remove features that may be redundant, only mildly predictive, or only predictive in certain geographic areas; that work is done more or less automatically by the model, which is regularized and performs other forms of de facto feature selection.

Task

Create a Quarto document at analyses/new-feature-template.qmd that can be used to evaluate whether new features are merely noise. The document should:

  • Be buildable from existing data (see the loading sketch after this list):
    • It should reference a new model output run containing the new feature you added, and should load the modeling results directly from S3
    • It should use the output metadata to load the input data from DVC, using the DVC S3 cache
    • Basically, anyone should be able to click render on the document and have it build, assuming they have credentials for S3
  • Exist as a one-off. Unlike the documents in reports/ (which are rendered on every run), documents in analyses/ are only run to answer a specific, one-time question.
  • Contain three types of content:
    • Templated content which does not need to be changed per feature, i.e. model performance statistics, aggregate SHAP plots, etc. (see the templated-content sketch below)
    • Ad-hoc content specific to that feature (see the stadium example above)
    • Text that explains the plots/tables and indicates what decision was reached
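
A minimal sketch of the loading section. The run ID, bucket names, key layout, and file names below are placeholders; the real paths should come from the model's output structure and the run's metadata:

```r
library(arrow)

# Hypothetical ID of the model run that includes the new feature
run_id <- "2024-06-25-example-run"

# Load modeling results directly from S3 (arrow can read s3:// URIs when AWS
# credentials are configured). The bucket and key layout here is a placeholder
shap_values <- read_parquet(
  paste0("s3://example-model-bucket/shap/run_id=", run_id, "/shap.parquet")
)
perf_metrics <- read_parquet(
  paste0("s3://example-model-bucket/performance/run_id=", run_id, "/performance.parquet")
)
metadata <- read_parquet(
  paste0("s3://example-model-bucket/metadata/run_id=", run_id, "/metadata.parquet")
)

# Use the run metadata to identify the input data version, then pull it from
# the DVC S3 cache. Shelling out to the DVC CLI is one option; the target
# path is a placeholder
system("dvc pull input/training_data.parquet")
training_data <- read_parquet("input/training_data.parquet")
```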

This document can be copied and then renamed for each new feature added, similar to the workflow for enterprise intelligence.
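
A minimal sketch of the templated content, assuming placeholder column names (`geography_type`, `geography_id`, and COD/PRD/PRB metric columns in the performance table, plus one numeric SHAP column per feature in `shap_values`); the actual schema will differ:

```r
library(dplyr)
library(tidyr)
library(ggplot2)

new_feature <- "dist_to_stadium"  # hypothetical name of the new feature

# Model performance stats by township (placeholder metric and column names)
perf_by_township <- perf_metrics %>%
  filter(geography_type == "township_code") %>%
  select(geography_id, cod, prd, prb) %>%
  arrange(geography_id)

# Aggregate SHAP importance: mean absolute SHAP value per feature, with the
# new feature highlighted. A feature that is pure noise should sit near zero
shap_importance <- shap_values %>%
  summarise(across(where(is.numeric), ~ mean(abs(.x), na.rm = TRUE))) %>%
  pivot_longer(everything(), names_to = "feature", values_to = "mean_abs_shap") %>%
  arrange(desc(mean_abs_shap))

shap_importance %>%
  slice_head(n = 20) %>%
  ggplot(aes(
    x = mean_abs_shap,
    y = reorder(feature, mean_abs_shap),
    fill = feature == new_feature
  )) +
  geom_col(show.legend = FALSE) +
  labs(x = "Mean |SHAP|", y = NULL, title = "Aggregate SHAP importance")
```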

@ccao-jardine

@ccao-jardine commented

Excellent. Let's add to the scope of templated content:

  • Descriptive stats of the input (median, range, histogram)
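
A minimal sketch of that block, reusing the hypothetical `training_data` and `dist_to_stadium` names from the sketches above:

```r
library(dplyr)
library(ggplot2)

# Median, range, and share missing for the new input feature
training_data %>%
  summarise(
    median = median(dist_to_stadium, na.rm = TRUE),
    min = min(dist_to_stadium, na.rm = TRUE),
    max = max(dist_to_stadium, na.rm = TRUE),
    pct_missing = mean(is.na(dist_to_stadium))
  )

# Histogram of the feature's distribution
ggplot(training_data, aes(x = dist_to_stadium)) +
  geom_histogram(bins = 50) +
  labs(x = "dist_to_stadium", y = "Count")
```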

Damonamajor linked a pull request on Jul 5, 2024 that will close this issue