My academic journey started with a fascination for the brain, leading me through psychology and Philosophy & Cognitive Science. However, I increasingly felt that the ideas and beliefs I gained "didn't pay rent"; they didn't seem to connect to the things I actually care about. This ultimately led me to study AI in depth at Radboud University.
After conducting research on computational models of depression with Roshan Cools at the Donders Institute, and through engagement with EA communities, I became increasingly motivated to work on preventing catastrophic outcomes from superintelligent AI. Since January 2025, I've dedicated myself to Mechanistic Interpretability, specifically detecting deception in neural networks, as well as other empirical AI alignment research agendas. My biggest worry is that models might develop internal goals that differ from the ones we intend and learn to hide them with superhuman skill, through alignment faking or steganographic reasoning.
I was an ARENA Fellow in the 2025 iteration, building technical skills and continuing work on LLM reasoning from my AI Safety Camp project (supervised by Nandi Schoots).
Now I am researching automated red-teaming for safety evaluation with Tal Kachman's lab. I'm particularly focused on understanding how LLM-to-LLM interactions differ from human-LLM interactions in a multipolar AI world: Do models exploit linguistic quirks when interacting with each other? Do our current safety evaluation tools adequately replicate real deployment conditions? See my research statement for more on my research interests (though it's a bit outdated and doesn't yet mention jailbreaking).
Wanna talk? Feel free to book a 1-on-1. Always happy to chat. Or the old-fashioned way: samuelgerrit.nellessen{at}gmail.com.

