
Understanding Toy Backdoors via Mechanistic Interpretability

Paper · Drive Folder · Course Page

Abstract

Backdoors and hidden harmful behaviours pose a severe risk to the safe deployment of deep neural networks. In this paper, we explore how a small Transformer model implements a toy backdoor behaviour. Our head attribution and activation patching experiments suggest that the model uses a single attention head to implement a simple backdoor. Easy-to-run Colab notebooks for the experiments are available in the Google Drive Folder.

Reference: https://github.com/TransformerLensOrg/TransformerLens
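The core idea behind the activation patching experiments can be illustrated without the full model: cache an internal activation from a run on one input, then splice it into a run on a different input and see whether the behaviour follows the patched activation. The sketch below is a minimal toy illustration of that logic, not the paper's actual model or the TransformerLens API; the linear map standing in for an attention head and the sum-readout are hypothetical stand-ins.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-in for a single attention head: one linear map.
W = rng.normal(size=(4, 4))

def forward(x, patched_act=None):
    """Run the toy model; optionally patch in a cached activation
    at the 'head output' hook point."""
    act = W @ x if patched_act is None else patched_act
    logit = act.sum()  # stand-in readout
    return act, logit

clean_x = rng.normal(size=4)      # e.g. input containing the trigger
corrupt_x = rng.normal(size=4)    # e.g. input without the trigger

# 1. Cache the activation from the clean run.
clean_act, clean_logit = forward(clean_x)
# 2. Baseline corrupted run, then a corrupted run with the clean
#    activation patched in at the hook point.
_, corrupt_logit = forward(corrupt_x)
_, patched_logit = forward(corrupt_x, patched_act=clean_act)

# If patching a single component fully recovers the clean output,
# that component is sufficient to carry the behaviour.
print(patched_logit == clean_logit)  # → True
```

In the actual experiments this patching is done per attention head with TransformerLens hooks, and the recovered logit difference is used to attribute the backdoor behaviour to individual heads.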

Toy Backdoors


Training


Results


About

Theory of Deep Learning @ University of Cambridge
