PAIR-code/interpretability

PAIR Interpretability

This repo contains code and articles on PAIR interpretability projects.

Scalable Influence and Fact Tracing for Large Language Model Pretraining (ICLR'25)

See the blog post for a light introduction to the paper. There is also a public demo and a dedicated GitHub repo. The full paper is Scalable Influence and Fact Tracing for Large Language Model Pretraining -- Tyler Chang, Dheeraj Rajagopal, Tolga Bolukbasi, Lucas Dixon, Ian Tenney (RH)

Racing Thoughts: Explaining Large Language Model Contextualization Errors (NAACL'25)

Racing Thoughts: Explaining Contextualization Errors Within Large Language Models -- Michael A. Lepori, Mike Mozer, Asma Ghandeharioun (RH)

Who's asking? User personas and the mechanics of latent misalignment (NeurIPS'24)

Who's asking? User personas and the mechanics of latent misalignment -- Asma Ghandeharioun, Ann Yuan, Marius Guerard, Emily Reif, Michael A. Lepori, Lucas Dixon, at NeurIPS'24.

Patchscopes: A Unifying Framework for Inspecting Hidden Representations of Language Models (ICML'24)

The Patchscopes mini-site & the interactive explorable contain a brief introduction to the longer paper (ICML'24) by Asma Ghandeharioun, Ann Yuan, Marius Guerard, Emily Reif, Michael A. Lepori, Lucas Dixon.

Visualizing and Measuring the Geometry of BERT

bert-tree and context-atlas are the repos for two interactive blog posts and visualizations accompanying the paper Visualizing and Measuring the Geometry of BERT:

  1. Language, trees, and geometry in neural networks explores the geometry of syntactic information in BERT (bert-tree)

  2. Language, Context, and Geometry in Neural Networks explores semantics and context in BERT. See the accompanying tool, Context Atlas, for more details (context-atlas)

Deep dreaming on text

text-dream contains experiments and tools for deep dreaming on text.

LinguisticLens

data-synth-syntax contains LinguisticLens, a tool for visualizing generated text data.
