The area of interpretability in large language models (LLMs) has been growing rapidly in recent years. This repository aims to collect relevant resources to help beginners get started quickly in this area and to help researchers keep up with the latest research progress.
This is an actively maintained repository; please open an issue if any relevant resource is missing. If you have any questions or suggestions, feel free to contact me via email: [email protected].
Table of Contents
- Awesome Interpretability Libraries
- Awesome Interpretability Blogs & Videos
- Awesome Interpretability Tutorials
- Awesome Interpretability Forums
- Awesome Interpretability Tools
- Awesome Interpretability Programs
- Awesome Interpretability Papers
- Other Awesome Interpretability Resources
Awesome Interpretability Libraries

- TransformerLens: A Library for Mechanistic Interpretability of Generative Language Models; see the minimal usage sketch after this list. (Doc, Tutorial, Demo)
- nnsight: enables interpreting and manipulating the internals of deep learned models. (Doc, Tutorial, Paper)
- SAE Lens: train and analyse sparse autoencoders (SAEs). (Doc, Tutorial, Blog)
- EleutherAI sae: train SAEs on very large models, based on the method and released code of the OpenAI SAE paper.
- Automatic Circuit DisCovery: automatically builds circuits for mechanistic interpretability. (Paper, Demo)
- Pyvene: A Library for Understanding and Improving PyTorch Models via Interventions. (Paper, Demo)
- pyreft: a powerful, efficient, and interpretable fine-tuning method. (Paper, Demo)
- repeng: A Python library for generating control vectors with representation engineering. (Paper, Blog)
- Penzai: a JAX library for writing models as legible, functional pytree data structures, along with tools for visualizing, modifying, and analyzing them. (Paper, Doc, Tutorial)
- LXT: LRP eXplains Transformers: Layer-wise Relevance Propagation (LRP) extended to handle attention layers in Large Language Models (LLMs) and Vision Transformers (ViTs). (Paper, Doc)
- Tuned Lens: Tools for understanding how transformer predictions are built layer-by-layer. (Paper, Doc)
- Inseq: Pytorch-based toolkit for common post-hoc interpretability analyses of sequence generation models. (Paper, Doc)
- shap: Python library for computing SHAP feature / token importance for any black-box model. Works with Hugging Face, PyTorch, and TensorFlow models, including LLMs. (Paper, Doc)
- captum: model interpretability and understanding library for PyTorch. (Paper, Doc)
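
As a quick orientation to what these libraries look like in practice, below is a minimal sketch using TransformerLens. It is not taken from the library's documentation: the model name, prompt, and layer index are arbitrary illustrative choices, and it assumes `transformer_lens` (with a PyTorch backend) is installed.

```python
# Minimal TransformerLens sketch: run a prompt, cache activations, inspect one hook.
# Assumes: pip install transformer_lens (model name, prompt, and layer are arbitrary choices).
from transformer_lens import HookedTransformer

model = HookedTransformer.from_pretrained("gpt2")

prompt = "The Eiffel Tower is located in the city of"
logits, cache = model.run_with_cache(prompt)  # logits: [batch, seq, d_vocab]

# Greedy next-token prediction from the final position.
next_token = logits[0, -1].argmax().item()
print(model.tokenizer.decode(next_token))

# Cached activations are addressed by hook name; this grabs layer-0 attention patterns.
attn_pattern = cache["pattern", 0]  # shape: [batch, n_heads, seq, seq]
print(attn_pattern.shape)
```

The other libraries follow broadly similar load-then-intervene workflows; each library's Doc/Tutorial link above is the authoritative reference for its actual API.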
Awesome Interpretability Blogs & Videos

- A Barebones Guide to Mechanistic Interpretability Prerequisites
- Concrete Steps to Get Started in Transformer Mechanistic Interpretability
- An Extremely Opinionated Annotated List of My Favourite Mechanistic Interpretability Papers
- An Extremely Opinionated Annotated List of My Favourite Mechanistic Interpretability Papers v2
- 200 Concrete Open Problems in Mechanistic Interpretability
- 3Blue1Brown: But what is a GPT? Visual intro to transformers | Chapter 5, Deep Learning
- 3Blue1Brown: Attention in transformers, visually explained | Chapter 6, Deep Learning
- 3Blue1Brown: How might LLMs store facts | Chapter 7, Deep Learning
Awesome Interpretability Tutorials

- ARENA 3.0: hands-on exercises for understanding mechanistic interpretability using TransformerLens.
- EACL24: Transformer-specific Interpretability (Github)
- ICML24: Physics of Language Models (Youtube)
- NAACL24: Explanations in the Era of Large Language Models
Awesome Interpretability Forums

- AI Alignment Forum
- LessWrong
- Mechanistic Interpretability Workshop 2024 ICML (Accepted papers)
- Attributing Model Behavior at Scale Workshop 2023 NeurIPS (Accepted papers)
- BlackboxNLP 2023 EMNLP (Accepted papers)
Awesome Interpretability Tools

- Transformer Debugger: investigate specific behaviors of small LLMs.
- LLM Transparency Tool (Demo)
- sae_vis: a tool to replicate Anthropic's sparse autoencoder visualisations (Demo)
- Neuronpedia: an open platform for interpretability research. (Doc)
- Comgra: a tool to analyze and debug neural networks in PyTorch. Use a GUI to traverse the computation graph and view the data from many different angles at the click of a button. (Paper)
Awesome Interpretability Programs

- ML Alignment & Theory Scholars (MATS): an independent research and educational seminar program that connects talented scholars with top mentors in the fields of AI alignment, interpretability, and governance.
Awesome Interpretability Papers

Title | Venue | Date | Code |
---|---|---|---|
Knowledge Mechanisms in Large Language Models: A Survey and Perspective | EMNLP | 2024-10-06 | - |
Attention Heads of Large Language Models: A Survey | arXiv | 2024-09-06 | Github |
Internal Consistency and Self-Feedback in Large Language Models: A Survey | arXiv | 2024-07-22 | Github, Paper List |
Relational Composition in Neural Networks: A Survey and Call to Action | MechInterp@ICML | 2024-07-15 | - |
From Insights to Actions: The Impact of Interpretability and Analysis Research on NLP | arXiv | 2024-06-18 | - |
A Primer on the Inner Workings of Transformer-based Language Models | arXiv | 2024-05-02 | - |
Mechanistic Interpretability for AI Safety -- A Review | arXiv | 2024-04-22 | - |
From Understanding to Utilization: A Survey on Explainability for Large Language Models | arXiv | 2024-02-22 | - |
Toward Transparent AI: A Survey on Interpreting the Inner Structures of Deep Neural Networks | arXiv | 2023-08-18 | - |
Title | Venue | Date | Code |
---|---|---|---|
Position: An Inner Interpretability Framework for AI Inspired by Lessons from Cognitive Neuroscience | ICML | 2024-06-25 | - |
Interpretability Needs a New Paradigm | arXiv | 2024-05-08 | - |
Position Paper: Toward New Frameworks for Studying Model Representations | arXiv | 2024-02-06 | - |
Rethinking Interpretability in the Era of Large Language Models | arXiv | 2024-01-30 | - |
Title | Venue | Date | Code | Blog |
---|---|---|---|---|
Benchmarking Mental State Representations in Language Models | MechInterp@ICML | 2024-06-25 | - | - |
A Chain-of-Thought Is as Strong as Its Weakest Link: A Benchmark for Verifiers of Reasoning Chains | ACL | 2024-05-21 | Dataset | Blog |
RAVEL: Evaluating Interpretability Methods on Disentangling Language Model Representations | arXiv | 2024-02-27 | Github | - |
CausalGym: Benchmarking causal interpretability methods on linguistic tasks | arXiv | 2024-02-19 | Github | - |
Title | Venue | Date | Code | Blog |
---|---|---|---|---|
Evaluating Brain-Inspired Modular Training in Automated Circuit Discovery for Mechanistic Interpretability | arXiv | 2024-01-08 | - | - |
Seeing is Believing: Brain-Inspired Modular Training for Mechanistic Interpretability | arXiv | 2023-06-06 | Github | - |
Title | Venue | Date | Code | Blog |
---|---|---|---|---|
An introduction to graphical tensor notation for mechanistic interpretability | arXiv | 2024-02-02 | - | - |
Episodic Memory Theory for the Mechanistic Interpretation of Recurrent Neural Networks | arXiv | 2023-10-03 | Github | - |