Zanj integration: datasets & training #177
Merged
Conversation
- config equality check gets the diff from a `ConfigMismatchException` -- this requires a yet-unpublished feature in muutils, coming in 0.3.7
- old `load_model_with_configs` now takes a `fold_ln` arg which it passes when calling `model.process_weights_()`; this is hopefully temporary, since ZANJ now records whether the state dict was folded!
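For context on what the `fold_ln` flag controls, here is a minimal sketch using plain TransformerLens (this is not this repo's loader, just an illustration of the weight-processing step it forwards to):

```python
# Minimal sketch (not this repo's loader): TransformerLens can fold LayerNorm
# parameters into the adjacent linear layers, so a loader needs to know whether
# the saved state dict was already folded -- which ZANJ now records.
from transformer_lens import HookedTransformer

model = HookedTransformer.from_pretrained("gpt2", fold_ln=False)
model.process_weights_(fold_ln=True)  # fold LayerNorm scales into downstream weights
```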
…getedLatticeMaze" there is a bug, but my fix does not fix it! This reverts commit 88002f6.
* return SolvedMazes from dataset.__getitem__
* Move tokenization into Maze classes
* Move batch preprocessing into dataloader
* Lots of tests for datasets
* Tidy up filters a bit and allow positional args
* Speed up tests by using a non-parallel dataloader
* integration-v1 training config renamed to test-v1
- constraint options for `gen_dfs` generation algorithm (by @canrager)
- added `maze_ctor_kwargs` to `MazeDatasetConfig` to allow setting those options
- fixed some issues arising from parallelism + fixed seed (this was hacky)
- minor things:
  - bumped muutils to 0.3.10
  - we now use `Coord` and `CoordArray` (numpy) in many places, instead of tuples/lists
  - separated `MAZE_DATASET_CONFIGS` to [maze_transformer/training/maze_dataset_configs.py](https://github.com/AISC-understanding-search/maze-transformer/pull/184/files#diff-ab008b2d4ddb7138116afef18584f657832ec00430af732f195136a63b0debaf)
  - some random junk

Co-authored-by: mivanit <[email protected]>
…nderstanding-search/maze-transformer into zanj-integration-datasets
@valedan here are the remaining problems which we need to fix before merging. Once tests pass, I think we are good to go!
valedan approved these changes (Apr 28, 2023)
🚀
(this is a mega pr, sorry)
configs

- Modifying configs from the command line is now easier, via `ConfigHolder.get_config_multisource()`.
- `GPTDataset().to_fname()` is used to generate the filename for saving a config (and also to find a matching config to load/download). `MazeDatasetConfig` also implements this in a custom way.
- `MazeDatasetConfig` now has a `maze_ctor_kwargs` field, for passing keyword arguments to maze generation (see Constrained depth first search #183). A sketch follows this list.
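Roughly, constructing a config with the new field looks like the sketch below. The import paths, the `grid_n`/`n_mazes` field names, and the example generation kwarg are assumptions for illustration, not taken from this PR.

```python
# Hedged sketch -- import paths, grid_n/n_mazes field names, and the example
# generation kwarg are assumptions, not verified against the repo.
from maze_transformer.generation.generators import LatticeMazeGenerators
from maze_transformer.training.maze_dataset import MazeDatasetConfig

cfg = MazeDatasetConfig(
    name="demo",
    grid_n=5,      # assumed field name: side length of the maze grid
    n_mazes=100,   # assumed field name: number of mazes in the dataset
    maze_ctor=LatticeMazeGenerators.gen_dfs,
    # new in this PR: keyword arguments forwarded to the maze generation function (#183)
    maze_ctor_kwargs={"max_tree_depth": 10},  # hypothetical constraint option
)

# the config-derived filename used when saving or locating the dataset
print(cfg.to_fname())
```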
maze dataset

You can now get a `MazeDataset` from just a config -- it will load, download, or generate a dataset on the fly. The mess of ways of storing a dataset we had before is now gone -- a `MazeDataset` contains a list of `SolvedMaze`, and it will return one of those when you call `__getitem__`. We also added filters and fixed some parallelization issues!

- `GPTDataset().from_config()` is a new, simplified way of getting a dataset: simply pass a config, and it will attempt to load from a local directory, download, or generate. Any of these can be disabled, and kwargs (for things like the number of cores to use) are passed down. See the sketch after this list.
- `SolvedMaze` -- `mazes_objs`, `mazes_tokens`, `mazes_array` are now cached properties. They will work, but might be slow due to no parallelization.
- `MazeDataset.__getitem__()` now returns a `SolvedMaze`.
- `create_dataset()` is deprecated but should still work. Remove this?
- Filters can be applied via the `applied_filters` field, or you can call `dataset.filter_by.your_filter_func(your_arg=your_val)`. Both of these work the same under the hood.
- `from_config()` lets you choose whether to run in parallel or not (default is no). This is useful since, for small datasets, parallelization has huge overhead. Tests are now much faster.
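A sketch of the intended usage, per the description above. The kwarg names for toggling the load/download/generate steps, the config field names, and the filter name are assumptions.

```python
# Hedged sketch of the single-entry-point dataset API described above.
# Flag names, config field names, and the filter name are assumptions.
from maze_transformer.training.maze_dataset import MazeDataset, MazeDatasetConfig

cfg = MazeDatasetConfig(name="demo", grid_n=5, n_mazes=100)  # field names assumed

# load from disk, download, or generate on the fly -- whichever applies
dataset: MazeDataset = MazeDataset.from_config(
    cfg,
    do_download=False,  # assumed flag: skip the download step
    do_generate=True,   # assumed flag: allow generating locally
)

solved_maze = dataset[0]  # __getitem__ now returns a SolvedMaze

# filters: either listed in the config's applied_filters field, or called directly
filtered = dataset.filter_by.path_length(min_length=3)  # hypothetical filter name
```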
training

Models are now saved as ZANJ objects, and the command line interface is improved.

- `train()` now returns a `ZanjHookedTransformer`.
- `train_model()` returns a `TrainingResult`, which contains the output path, the model, and eventually logging info, perhaps? Configs are combined via `ConfigHolder.get_config_multisource()`, and kwargs are passed as a modification dict. A rough sketch follows this list.
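A rough sketch of the training entry point as described above. The module path, parameter names, and `TrainingResult` attribute names are assumptions based on the PR text, not verified against the repo.

```python
# Hedged sketch -- module path, parameter names, and TrainingResult attribute
# names are assumptions drawn from the PR text.
from maze_transformer.training.train_model import TrainingResult, train_model

result: TrainingResult = train_model(
    cfg_name="test-v1",     # assumed: named config, resolved via ConfigHolder.get_config_multisource()
    base_path="runs/demo",  # assumed: where the ZANJ model and logs are written
    batch_size=32,          # assumed: example kwarg passed through the modification dict
)

model = result.model       # a ZanjHookedTransformer (per the PR text)
print(result.output_path)  # assumed attribute name: where the ZANJ file was saved
```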
remaining todos:

- filter mazes (by path length, or however you want) via `dataset.filter_by.some_function(**kwargs)`
  - `MazeDataset().custom_maze_filter()`, which takes a custom function (operating on mazes) as an argument -- this makes it easier to add new filters in notebooks etc.
  - `@register_wrap_dataset_filter` wraps a function which takes a dataset and kwargs, and returns a dataset. We might want to have a different decorator, `register_wrap_solved_maze_filter`, which just takes a function `(m: SolvedMaze, **kwargs) -> bool` which is then put inside a regular python `filter()` function (see the sketch after this list).
- `train_model()`
- `SolvedMaze`: check that conversion to array works correctly
- `eval_model.py` and associated notebooks to use TransformerLens model #48
- `MazeDataset.generate()`: add `maze_ctor_kwargs` to `MazeDatasetConfig`, make backwards compatible
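For the proposed `register_wrap_solved_maze_filter` decorator, here is a generic sketch of the idea; the attribute names (`dataset.cfg`, `dataset.mazes`, `maze.solution`) and the reconstruction of the dataset are assumptions, not this repo's API.

```python
# Generic sketch of the proposed register_wrap_solved_maze_filter decorator:
# wrap a per-maze predicate `(maze, **kwargs) -> bool` into a dataset-level
# filter. Attribute names (cfg, mazes, solution) are assumptions.
from functools import wraps
from typing import Callable

def register_wrap_solved_maze_filter(predicate: Callable[..., bool]) -> Callable:
    @wraps(predicate)
    def dataset_filter(dataset, **kwargs):
        # the builtin filter() applies the predicate to each SolvedMaze
        kept = list(filter(lambda maze: predicate(maze, **kwargs), dataset.mazes))
        # assumes a dataset can be rebuilt from its config plus a list of mazes
        return type(dataset)(cfg=dataset.cfg, mazes=kept)
    return dataset_filter

@register_wrap_solved_maze_filter
def solution_at_least(maze, min_length: int = 3) -> bool:
    # example per-maze predicate: keep mazes with long enough solutions
    return len(maze.solution) >= min_length
```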
questions:

- `GPTDataset.to_fname()`?
- should `MazeDataset.__getitem__()` give a `SolvedMaze`, a string, or a tokenized array?