streamline evals #200

Open

mivanit opened this issue Nov 16, 2023 · 0 comments
Labels: code-quality (code quality improvement), evals (Model evaluations)

Comments

@mivanit
Member

mivanit commented Nov 16, 2023

Evals are a mess, and there is no unified way to run them.

  • organize existing evals into categories based on type signature
  • single function which takes a model and runs sets of evals (see the sketch after this list)
  • fix usage of evals in the codebase to conform to the new standard
  • add all evals to training (add model evals into training loop #97)
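
A rough sketch of what that single entry point could look like (the names `run_evals`, `EvalFn`, and `EVAL_REGISTRY` here are hypothetical, not existing code):

```python
from typing import Callable

# hypothetical: registry mapping eval names to functions that score a
# (model, maze batch) pair and return a single float
EvalFn = Callable[["HookedTransformer", list["SolvedMaze"]], float]
EVAL_REGISTRY: dict[str, EvalFn] = {}

def run_evals(
    model: "HookedTransformer",
    mazes: list["SolvedMaze"],
    eval_names: list[str] | None = None,
) -> dict[str, float]:
    """run a set of registered evals on a model, returning {eval name: score}"""
    names = eval_names if eval_names is not None else list(EVAL_REGISTRY)
    return {name: EVAL_REGISTRY[name](model, mazes) for name in names}
```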

currently, we have a couple of forms of evals:

  • PathEvals, which take the solution coordinate and generated rollout
    • these are the oldest group
    • there are a couple of things in rollout_evals() which belong here
  • LOGIT_ATTRIB_TASKS, which give the model a specific prompt and have it generate a single token
    • these were made for direct logit attribution, and are of the form "can you predict the origin correctly" or "how well do you do on a random non-endpoint in the path"
  • "rollout evals", which take the raw tokens produced by a model and compute things about the validity of tokens
    • these are in rollout_evals()
  • "logit evals" as mentioned in Add logit-based evals #165, but this might really be part of LOGIT_ATTRIB_TASKS
    • would be nice to have a "total probability mass assigned to valid tokens / coordinate tokens" metric, more useful than just raw perplexity
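
A minimal sketch of that metric, assuming we already have the logits at a single prediction position and the set of token ids that would be valid there (both arguments are placeholders, not existing code):

```python
import torch

def valid_token_prob_mass(logits: torch.Tensor, valid_token_ids: list[int]) -> float:
    """fraction of probability mass the model assigns to the valid tokens

    `logits` is a 1D tensor of shape (d_vocab,) for one prediction position"""
    probs = torch.softmax(logits, dim=-1)
    return probs[valid_token_ids].sum().item()
```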

these need to be wrapped into a couple of different groups, depending on the inputs. Perhaps (see the sketch after this list):

  • model, mazes for the task-based evals
  • generated_path, correct_path as in the older PathEvals
  • generated_tokens, maze as in the "rollout evals"
  • sequence_logits, maze for logit evals (Add logit-based evals #165)
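
One way those groups could be expressed as type signatures (the Protocol classes below are purely illustrative, not existing code, and a single float score is assumed as the return type):

```python
from typing import Protocol

import torch

class TaskEval(Protocol):
    """task-based evals: take a model and a batch of mazes"""
    def __call__(self, model: "HookedTransformer", mazes: list["SolvedMaze"]) -> float: ...

class PathEval(Protocol):
    """older PathEvals: compare a generated path against the correct path"""
    def __call__(self, generated_path: "MazePath", correct_path: "MazePath") -> float: ...

class RolloutEval(Protocol):
    """rollout evals: check validity of the raw generated tokens against the maze"""
    def __call__(self, generated_tokens: list[str], maze: "SolvedMaze") -> float: ...

class LogitEval(Protocol):
    """logit evals (#165): score the full sequence logits against the maze"""
    def __call__(self, sequence_logits: torch.Tensor, maze: "SolvedMaze") -> float: ...
```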

the function rollout_evals() is a mess because it exports stuff that was in a jupyter notebook for the UniReps submission; there is no time to integrate it with the rest of the code right now.

mivanit added the code-quality and evals labels Nov 16, 2023
mivanit added a commit that referenced this issue Dec 8, 2023
mega-PR, adding a bunch of experiment notebooks and the required code for them.

broad overview:
- added some example models to `examples/`
- reworked eval code, needs big changes -- see #200 
- many modifications to mechinterp code
- had to enforce transformerlens 1.6.1 due to tokenizer changes (it tried to get our custom tokenizer from huggingface?)
- exported some code to muutils
- notebooks added:
  - `eval_tasks_table.ipynb`: evaluate on a bunch of single token tasks. should be merged with other evals notebook
  - `appendix_figures.ipynb`: junk and duplicates of code in other notebooks :/
  - `generate_rollouts.ipynb`: what the name says, simple notebook

comment history:

* trying to see if wandb model loading is working right

* moved dict shapes to muutils (its on unmerged branch tho)

* better loading of models from wandb

* wip????????????????

* way more testing for loading wandb models

* aaaa

* ???

* hallway run

* update muutils dep to 0.5.3

* updated TL and maze-dataset dep

* type hint

* notebook runs?

* wip runs

* cleared notebooks?

* exported eval plots

* format

* many fixes and changes sorry

* wip

* poetry lock

* minor adjustment to make model names cleaner

* exported single token tasks

* refactored baseline model, allowed return of multiple options

going to be useful for plot_logits

* more baseline model refactor

* format

* dep?

* train_model test was trying to train on 3M samples lol

* separate appendix figures notebooks, better logits plotting

logits plotting now allows for adding other categories to the histogram besides
correct / incorrect, which we can use the baseline model for

* misc

* rename original hallway model

need to fix refs to it later lol

* WE'RE SO BACK, ADJACENCY HEADS ARE HERE

check the dla notebook!!!

* correlation of attention and distance

* misc

* ok no more figures for now

* temp notebooks, for experiments. move these to paper repo later

* eval tasks table

* final before unireps submit

* misc fixes??

* added padding functionality and batched predictions

* wip

* wip

* wip

* added attention animation plotter

* format

* update deps

* transformerlens 1.6.1 due to issues :/

* cleaning up notebooks

latest versions of some were in experiments repo

* fix up some notebooks, eval_model is still broken

* providing hallway model

* fixing eval_model issues with baseline solver

batching was not working at all, had to add a hack to recursively
call .generate() on RandomBaseline

return type was list[str] instead of tensor or list[list[str]] so
had to fix that as well

* update dep to muutils 0.5.5 (poetry not recognizing it yet)

* format

* poetry lock

* changed model used to hallway

* changed model paths, no jirpy

* update embedding structure nb

* updated plot attention for better cbar

* fix up eval tasks table notebook

* fix when cbar is none

* ran notebook
mivanit added this to the Fixing evals & training loop milestone Dec 27, 2023