In-depth evaluation of the ETHICS utilitarianism task dataset, with an assessment of approaches to improving interpretability (SHAP, Bayesian transformers).

ravipatelxyz/nlp-ethics

Re-thinking the ETHICS utilitarianism task

This repository corresponds to the report, Re-thinking the ETHICS utilitarianism task, available here.

Abstract

We perform an exploratory study of the ETHICS utilitarianism task dataset (Hendrycks et al. 2021) and investigate approaches to improving the interpretability of transformer models fine-tuned on this task. We identify substantial train-test overlap, marked train-test distributional shift, and significant label non-reproducibility, yielding ceilings on performance. This motivates the re-release of a reformulated dataset. We then consider attention mapping, Shapley additive explanations (SHAP), and Bayesian methods for model certainty estimation as approaches to improving interpretability. Through SHAP we identify several model failure modes, including sensitivity to sentence length and to ungrammatical word repetition. We find that weight perturbation techniques, despite being computationally cheap, have limited utility when applied to large transformer models, and we identify Monte Carlo dropout as a promising candidate for certainty estimation. Finally, we implement a direct scenario comparison model that improves performance on a hard subset of the data.
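
As a rough illustration of what a train-test overlap check can look like, the sketch below counts how many distinct test scenarios also appear in the training split after light text normalization. It is not the procedure used in the report. The function names (`normalize`, `scenario_overlap`), the normalization rules, and the toy sentences are all assumptions made for this example, as is the representation of each split as a list of scenario pairs.

```python
import re


def normalize(text: str) -> str:
    """Lowercase, drop punctuation, and collapse whitespace (assumed rules)."""
    text = re.sub(r"[^\w\s]", "", text.lower())
    return re.sub(r"\s+", " ", text).strip()


def scenario_overlap(train_pairs, test_pairs):
    """Return (fraction of distinct test scenarios seen in train, shared set)."""
    train_scenarios = {normalize(s) for pair in train_pairs for s in pair}
    test_scenarios = {normalize(s) for pair in test_pairs for s in pair}
    if not test_scenarios:
        return 0.0, set()
    shared = test_scenarios & train_scenarios
    return len(shared) / len(test_scenarios), shared


# Toy data invented for illustration only; not drawn from the ETHICS dataset.
train = [
    ("I ate a warm meal.", "I skipped lunch."),
    ("I found a coin.", "I lost my keys."),
]
test = [
    ("I ate a warm meal!", "I missed the bus."),
    ("I lost my keys.", "I broke my phone."),
]

fraction, shared = scenario_overlap(train, test)
print(f"{fraction:.2f} of distinct test scenarios also occur in train")
```

Exact matching after normalization only catches verbatim or near-verbatim reuse; paraphrased scenarios would need a fuzzier comparison, which this sketch does not attempt.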

We also make available:

  1. A spotlight talk (slides)
  2. A demo notebook
  3. All code
  4. Full report
