Backdoors and hidden harmful behaviours pose a severe risk to the safe deployment of deep neural networks. In this paper, we explore how a small Transformer model implements a toy backdoor behaviour. Our head attribution and activation patching experiments suggest that the model relies on a single attention head to implement this simple backdoor. Easy-to-run Colab notebooks for the experiments are available in the Google Drive folder.
Reference: TransformerLens, https://github.com/TransformerLensOrg/TransformerLens
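To make the kind of experiment described above concrete, here is a minimal activation patching sketch using the TransformerLens library referenced above. Everything specific in it is an assumption for illustration: the `gpt2` model stands in for the backdoored model, the prompts and the trigger token ("banana") are invented, the `LAYER`/`HEAD` indices are placeholders, and `patch_head_z` is a hypothetical helper, not code from the paper.

```python
import functools

from transformer_lens import HookedTransformer, utils

# Placeholder model: the paper's backdoored Transformer is not reproduced here.
model = HookedTransformer.from_pretrained("gpt2")

# Hypothetical clean and triggered prompts; the real trigger is specific to
# the paper's toy backdoor. They must tokenise to the same length so that
# cached activations line up position-by-position.
clean_tokens = model.to_tokens("Please summarise this report")
trigger_tokens = model.to_tokens("Please banana this report")
assert clean_tokens.shape == trigger_tokens.shape, "prompts must tokenise to equal length"

# Cache all activations from the triggered run.
_, trigger_cache = model.run_with_cache(trigger_tokens)

LAYER, HEAD = 0, 3  # hypothetical indices of the suspect attention head


def patch_head_z(z, hook, head_index):
    # z has shape [batch, pos, n_heads, d_head]; overwrite one head's output
    # with its value from the triggered run, leaving the other heads intact.
    z[:, :, head_index, :] = trigger_cache[hook.name][:, :, head_index, :]
    return z


# Run the clean prompt, splicing in the suspect head's triggered activation.
patched_logits = model.run_with_hooks(
    clean_tokens,
    fwd_hooks=[(
        utils.get_act_name("z", LAYER),
        functools.partial(patch_head_z, head_index=HEAD),
    )],
)
```

If the patched logits on the otherwise-clean prompt now favour the backdoor output, that is evidence the patched head carries the trigger signal; sweeping `LAYER` and `HEAD` over all heads turns the same loop into a per-head attribution map.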