Support maze dataset tokenizers update #214

Open
wants to merge 58 commits into base: main

Conversation

@aaron-sandoval (Member) commented May 17, 2024

Update to support maze-dataset #38

@aaron-sandoval aaron-sandoval added the enhancement New feature or request label May 17, 2024
@aaron-sandoval aaron-sandoval self-assigned this May 17, 2024

@@ -6,7 +6,7 @@
from maze_transformer.training.config import ConfigHolder


@freeze
Member Author commented:
It looks like freeze can only act on objects, not types. It raises an exception when it recursively calls freeze(TRAIN_SAVE_FILES.__dict__), since it can't act on a mappingproxy.
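For context, a minimal sketch (not from the repo) of the underlying issue: a class's `__dict__` is a read-only `mappingproxy`, so a recursive freeze that tries to handle it like a plain dict has nothing it can do with it.

```python
# Minimal sketch of the mappingproxy limitation described above; TrainSaveFiles
# is a stand-in class, not the real TRAIN_SAVE_FILES from the repo.
class TrainSaveFiles:
    model_final = "model.final.zanj"

proxy = TrainSaveFiles.__dict__
print(type(proxy))  # <class 'mappingproxy'>

try:
    # writing a value back into the class namespace this way is not possible
    proxy["model_final"] = "frozen"
except TypeError as err:
    print("cannot mutate a class __dict__ directly:", err)
```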

@mivanit mivanit marked this pull request as ready for review July 26, 2024 17:28
@mivanit mivanit changed the title Support maze dataset PR #37: Tokenizers Step 2 Support maze dataset tokenizers update Jul 26, 2024
@mivanit mivanit mentioned this pull request Aug 20, 2024
turns out that the outputs were actually equal to within tolerances,
but argsort for large enough vocabularies fails (since weight recovery is not perfect and introduces some error)
old model had no tokenizer saved, caused issues
- demo cfg is an actual training run with 10k samples, too long for CI
- test cfg is much shorter
- also re-ran the notebook
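A minimal numpy sketch (not the repo's test) of the argsort note above: two logit vectors can agree within tolerance while their argsort orderings still differ, since near-ties get flipped by the small weight-recovery error, and larger vocabularies have more near-ties.

```python
# Sketch only: demonstrates argsort disagreement under tiny perturbations.
import numpy as np

rng = np.random.default_rng(0)
logits = rng.normal(size=50_000)                                 # stand-in "true" logits
recovered = logits + rng.normal(scale=1e-6, size=logits.shape)   # tiny recovery error

print(np.allclose(logits, recovered, atol=1e-5))                  # True: equal within tolerance
print(np.array_equal(np.argsort(logits), np.argsort(recovered)))  # very likely False: ranking differs
```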
@mivanit (Member) commented Aug 20, 2024

So, integration tests are hanging after they complete.

Pretty sure I've narrowed this down to a wandb issue, since 8311eb8 stops test_eval_model.py from hanging just by moving test_model_loading.

mivanit added a commit to understanding-search/maze-dataset that referenced this pull request Aug 21, 2024
wandb fails in the backend when passing data with this key and a value
longer than 64 chars, see:
wandb/wandb#8159

this was causing integration tests to hang indefinitely in understanding-search/maze-transformer#214
- the `MAZE_DATASET_CONFIGS` key no longer matches its value -- asked for 100 mazes, got just 1
- it only breaks for test_model_loading, no other functions
- it only breaks for 100 -- 99 and 101 both work?
- maze dataset configs were being fetched from `MAZE_DATASET_CONFIGS` by reference
  - this is also modified upstream to avoid this issue
- configs were then passed by reference again in the validation dataset creation step
- configs then get updated to match the new dataset size, which modifies the original
- the first test works but subsequent tests fail

passing the (cheap) configs around by copy fixes this, I think
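A minimal sketch of the by-reference bug and the copy fix described above, using simplified stand-ins for `MAZE_DATASET_CONFIGS` and `MazeDatasetConfig` (the real config has many more fields):

```python
# Sketch only: shows how handing out the stored config object lets later
# in-place edits (e.g. shrinking n_mazes for a validation set) corrupt the registry.
from copy import deepcopy
from dataclasses import dataclass

@dataclass
class MazeDatasetConfig:   # simplified stand-in
    name: str
    n_mazes: int

MAZE_DATASET_CONFIGS = {"demo-100": MazeDatasetConfig("demo-100", n_mazes=100)}

def get_config_by_ref(key):
    return MAZE_DATASET_CONFIGS[key]            # buggy: caller can mutate the registry entry

def get_config_copy(key):
    return deepcopy(MAZE_DATASET_CONFIGS[key])  # fix: each caller gets its own copy

cfg = get_config_by_ref("demo-100")
cfg.n_mazes = 1                                  # trimmed for a small validation dataset
print(MAZE_DATASET_CONFIGS["demo-100"].n_mazes)  # 1 -- the shared config got clobbered

MAZE_DATASET_CONFIGS["demo-100"].n_mazes = 100   # reset for the second half of the demo
cfg2 = get_config_copy("demo-100")
cfg2.n_mazes = 1
print(MAZE_DATASET_CONFIGS["demo-100"].n_mazes)  # 100 -- original left intact
```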
mivanit added a commit to understanding-search/maze-dataset that referenced this pull request Aug 28, 2024
…ntegration

Tokenizer overhaul (complete)

# TODOs:

- `checks.yml`: ![status](https://github.com/understanding-search/maze-dataset/actions/workflows/checks.yml/badge.svg?branch=tokenizer-overhaul-integration)
- `all-tok-checks.yml`: ![status](https://github.com/understanding-search/maze-dataset/actions/workflows/all-tok-checks.yml/badge.svg?branch=tokenizer-overhaul-integration)

- [x] depends on #37 being merged into `tokenizer-overhaul-integration` first
- [x] depends on tests passing in [`maze-transformer` #214](understanding-search/maze-transformer#214)
    - PR: understanding-search/maze-transformer#214
    - PR checks status: ![status](https://github.com/understanding-search/maze-transformer/actions/workflows/checks.yml/badge.svg?branch=update-maze-dataset-tokenizers-step2)
- [x] ~~make sure that tokenizer hash numpy file is shipped with `maze-dataset` package~~
  - [x] ~~maybe make it optional, only if `tokenization` extra is requested?~~
  - [x] no need to ship this, it's only required for tests
- [x] go over existing `# TODO` comments, see if anything is critical
  - [x] parametrize some tests in `tests/unit/maze_dataset/generation/test_maze_dataset.py`
- [x] clean up various superfluous warnings
- [x] consider optimizations to `all_instances` and `save_hashes`
  - [x] unclear if `save_hashes` is being properly parallelized :/
- [x] see if any more utils should be moved to muutils

# commit log:

aaron-sandoval and others added 30 commits 3 months ago
@aaron-sandoval
Implemented validation_funcs in all_instances, added docs, added … 
5c79d22
@aaron-sandoval
Minor unit test fixes. zanj loading of StepSequencer is broken on t… 
a72680b
@aaron-sandoval
Added StepSizes.Straightaways. test_path_tokenizers passing.
72380c0
@aaron-sandoval
Added some PathTokenizers and tests
f184c1b
@aaron-sandoval
docs
3c35da7
@mivanit
minor assert
5d5b405
@aaron-sandoval
wip reversing import direction between maze_tokenizer.py and lattice_… 
f49d0f2
@aaron-sandoval
Refactored from_tokens to remove import dependency. `test_from_toke… 
7d56a1d
@aaron-sandoval
Moved util.py and token_utils.py out of tokenization directory … 
1795f16
@aaron-sandoval
Fixed hash collisions among different TokenizerElement instances. A… 
cee9eb4
@aaron-sandoval
Added Cardinal. test_path_tokenizers passing
99c6083
@aaron-sandoval
Added Forks. test_path_tokenizers passing.
782f204
@aaron-sandoval
Test fuzzing for test_path_tokenizers
db85262
@aaron-sandoval
Added ForksAndStraightaways. test_path_tokenizers passing
be642e3
@aaron-sandoval
Compacted test_path_tokenizers
5788cf9
@aaron-sandoval
Added Relative. test_path_tokenizers passing. Had to comment out … 
f961707
@aaron-sandoval
Temporarily deactivated StepTokenizers.Cardinal and `StepTokenizers… 
c1e8b52
@aaron-sandoval
Ran make unit. Only test_is_tested_tokenizer and `test_zanj_save_re… 
438deae
@aaron-sandoval
Fixed one non-fuzzed use of ALL_TOKENIZERS in unit tests. That migh… 
9eb338f
@aaron-sandoval
Deleted extraneous files
c3aa60b
@aaron-sandoval
Copied AdjListTokenizers skeleton from Notion
7767ad1
@aaron-sandoval
AdjListTokenizer docs and wip implementation
243a118
@aaron-sandoval
Docs, added new VOCAB items
35f2052
@aaron-sandoval
Docs
7479445
@aaron-sandoval
Added numpy_rng, refactored ChessboardSublattice into EdgePermuters.
bb66156
@aaron-sandoval
EdgePermuter._permute working in debugger for both subclasses
f934fe0
@aaron-sandoval
wip test_edge_permuters
f982e99
@aaron-sandoval
test_edge_permuters passing
81fb7b8
@aaron-sandoval
minor things
3efbc6c
@aaron-sandoval
Added lattice_connection_array
f680b1f
aaron-sandoval and others added 19 commits 3 months ago
@aaron-sandoval
AllLatticeEdges.get_edges written, not tested
ba6404a
@aaron-sandoval
Updated .gitignore for pyenv environment file
fa58ef5
@aaron-sandoval
Bug in CPython raises exception when building constants._VOCAB_BASE… 
f77f077
@aaron-sandoval
Updated poetry.lock
9a52cc7
@aaron-sandoval
test_manhattan_distance passing
00bd0a7
@aaron-sandoval
test_edge_subsets passing
c4253ec
@aaron-sandoval
Made method skeleton for EdgeGrouping and `AdjListTokenizer.to_toke… 
973df5a
@aaron-sandoval
wip test_edge_subsets
64dd2ff
@aaron-sandoval
test_edge_subsets passing
26db9f6
@mivanit
fix local function 
03ae196
@mivanit
underscore prefix for tokenizer element namespace ABCs, fix ABC 
976c361
@mivanit
better error on MazeTokenizer2.from_tokens() 
66d8cfa
@mivanit
fix type hints for get_all_subclasses 
b01aca9
@mivanit
minor cleanup to flatten(), added some tests for tokenizer utils
9bffdd3
@mivanit
run format
9c43aa5
@mivanit
cleanup of connection_list_to_adj_list 
313b1f8
@mivanit
rename and fix unpackable_if_true_attribute 
2395b5b
@mivanit
update poetry lock
b006afe
@mivanit
Merge pull request #35 from understanding-search/tokenizer-overhaul 
33afb1b
@review-notebook-app (Review Notebook App bot) commented on Jun 10
Check out this pull request on ReviewNB to see visual diffs & provide feedback on Jupyter Notebooks.

mivanit and others added 40 commits 2 months ago
@mivanit
changed workflow to run checks on any PR
e14ba3d
@mivanit
merge commit?
999b251
@mivanit
format
042923d
@mivanit
update lockfile
6223013
@mivanit
mostly naming changes 
0e63156
@mivanit
format
ad703b0
@mivanit
minor fix to test_isinstance_by_type_name 
1511108
@mivanit
fix import in test_util.py
32c59d5
@mivanit
moved tests requiring all tokenizers into a separate dir 
c194d69
@mivanit
run format
4c52c4b
@mivanit
kill the all_tokenizers call after fixed time (TEMP)
a04f5be
@mivanit
(TEMP) ignore tokenizer deprecation warnings in tests
37314bd
@mivanit
wip
1f16761
@mivanit
wip merging
4be1957
@mivanit
run format
6a719e5
@mivanit
some imports
2aa17e9
@mivanit
unit tests collect, but don't pass
d418dfb
@mivanit
format
5ff09a2
@mivanit
Merge pull request #40 from understanding-search/tokenizer-overhaul-m… 
0f116cd
@mivanit
....wip
0d372f5
@mivanit
merge main into tokenizer-overhaul-integration
17c3209
@aaron-sandoval
Update dependency
27eb7db
@aaron-sandoval
Fixed incomplete refactor of TokenizerElement abstract subclasses. … 
c0da82d
@aaron-sandoval
Added missing @serializable_dataclass decorators`
58b893f
@aaron-sandoval
Cleaned up _tokenize_edge_grouping, to_tokens interfaces. Impleme… 
f72a974
@aaron-sandoval
test_lattice_connection_arrray passing
f843a0c
@aaron-sandoval
Added testing_utils.py. Not tested yet
675d844
@aaron-sandoval
test_is_connection passing
27de83b
@aaron-sandoval
test_edge_groupings passing with an added assertion. `_tokenize_edg… 
633c8c0
@aaron-sandoval
Removed duplicated functionality in BothCoords and AllLatticeEdges
00d57b3
@aaron-sandoval
wip debugging _tokenize_edge_grouping
67cda3d
@aaron-sandoval
AdjListCoord.to_tokens slightly hand-tested, unit tests wip
6e6db5f
@aaron-sandoval
wip AdjListCardinal._tokenize_edge_grouping. Snapshot before refact… 
cdeeac0
@aaron-sandoval
Refactored _tokenize_edge_grouping. All `_AdjListTokenizer.to_token… 
a5aa3d9
@aaron-sandoval
wip test_adjlist_tokenizers
76d66de
@aaron-sandoval
wip test_adjlist_tokenizers. Problems with random seeds changing py… 
754c759
@aaron-sandoval
Random seed hackily fixed, pytest running
188309c
@aaron-sandoval
wip test_adjlist_tokenizers. Snapshot before starting `LatticeMaze.… 
1c7dbe6
@aaron-sandoval
test_adjlist_tokenizers passing with 20 tokenizers
6c575d5
@aaron-sandoval
test_adjlist_tokenizers passing with 100 tokenizers
dac7888
aaron-sandoval added 20 commits 2 months ago
@aaron-sandoval
Moved various TokenizerElement tests back to test_tokenizer.py. 16 … 
ca401ff
@aaron-sandoval
Cleanup, added extra pruning to _get_all_tokenizers, but probably i… 
c27f9df
@aaron-sandoval
Renamed MazeTokenizer2 >> MazeTokenizerModular. `test_path_tokeni… 
5512fd4
@aaron-sandoval
wip refactoring all_instances for concision. Now accepts regular `d… 
62e6539
@aaron-sandoval
wip test_all_instances not quite passing. Enhanced all_instances … 
9078f6b
@aaron-sandoval
test_all_tokenizers passing except for case 1-_PathTokenizer, whi… 
0f2fb8a
@aaron-sandoval
Updated muutils and zanj, dc_eq still has bug
ff9f086
@aaron-sandoval
Deleted duplicate utilities
fd8ce29
@aaron-sandoval
Minor generic type fixes, added some unit tests, not run yet
0142fa6
@aaron-sandoval
Refactored MazeTokenizer to MazeTokenizerModular throughout the l… 
bd79997
@aaron-sandoval
Set muutils dependency to last working version until latest bugs are … 
b1e1fba
@aaron-sandoval
Readded pandas, which got removed from dependencies at some point in … 
91c7680
@aaron-sandoval
wip updating nb
bf83a19
@aaron-sandoval
Did some profiling, removed a currerntly redundant validation from `_… 
cd5a455
@aaron-sandoval
wip refactoring uses of all_instances to work with Generator output
c82b3e2
@aaron-sandoval
Remove temporary pruning of _StepTokenizers space. `_get_all_tokeni… 
6951895
@aaron-sandoval
Reinstated pruning
02e39ef
@aaron-sandoval
Small additions for demo nbs
91ada63
@aaron-sandoval
all_instances bugfixes, more unit tests. wip reworking `TokenizerEl… 
ddc72de
@aaron-sandoval
make unit all passing
2f03963
@mivanit mivanit added enhancement tokenization labels on Jul 3
mivanit added 13 commits last month
@mivanit
merge in integration branch
ffab764
@mivanit
run format
a45c9d0
@mivanit
new lock file
09e7212
@mivanit
fix an import
ad1506d
@mivanit
update muutils and zanj deps
4619d2c
@mivanit
won't check types on applied_filters or maze_ctor func 
19dfb6e
@mivanit
wip
0fdb2f2
@mivanit
wip
057ab14
@mivanit
wip
147b6d7
@mivanit
wip
d2042d6
@mivanit
Format
c66ac97
@mivanit
random sample of tokenizers to make things run in the notebook
437e34a
@mivanit
one line fix to tokenizers
3371303
@mivanit mivanit mentioned this pull request on Jul 5
Add tokenization schemes #30
Open
mivanit and others added 24 commits last month
@mivanit
Merge branch 'main' into fix-save-load-profile
cd1eeb3
@mivanit
format
9783610
@mivanit
Merge branch 'main' into tokenizer-overhaul-integration
290f55d
@aaron-sandoval
Bugfix in test_has_element. Can't figure out why `ALL_TOKENIZER_HAS… 
826ea87
@aaron-sandoval
test_zanj_save_read passing
f1e937e
@aaron-sandoval
Added properties_to_serialize to MazeTokenizerModular. `test_zanj… 
26fb821
@aaron-sandoval
Demo nb milestone, fixed a couple of small bugs
673b986
@aaron-sandoval
wip filtering demo
37c1c75
@aaron-sandoval
wip filtering demo
069ceb3
@mivanit
tentatively marking some tokenizer elements as not supported
da0f197
@aaron-sandoval
Updated muutils dependency 0.6.3
73d2fdb
@aaron-sandoval
Removed utilities migrated to muutils, re-pointed imports
6f0f62c
@aaron-sandoval
Using stable_hash
929ba19
@aaron-sandoval
Swapped in @mark_as_unsupported for old @AbstractMethod temporary… 
0f86ee4
@mivanit
some more removed tokenizers
2293ee9
@aaron-sandoval
Hashing I think is fixed and stable. all_instances still broken
a16b4e4
@aaron-sandoval
Hashing works with the __init_subclass__ hidden field workaround. S… 
dc2f2c1
@aaron-sandoval
test_zanj_save_read passing, ...hashes.npy is out of date so `tes… 
6e15a5c
@aaron-sandoval
wip demo nb
fdd4089
@aaron-sandoval
Added caching back into all_instances while it still uses generator… 
ac84ae1
@aaron-sandoval
Think I finished the demo nb. Can't test it yet due to bug in muutils
1c2662f
@aaron-sandoval
Refactored TokenizerElement to _TokenizerElement. Untested for sa… 
98727aa
@aaron-sandoval
Added overall token counter test. Can't run due to muutils bug
2677bff
@aaron-sandoval
Consolidated util.py into token_utils.py. Imports seem to resolve wit… 
c6153b1
aaron-sandoval and others added 43 commits last month
@aaron-sandoval
Deleted tests which were migrated to muutils
b52c962
@aaron-sandoval
Tried moving connection_list_to_adj_list, is_connection to utils.… 
2b1ffdb
@mivanit
Merge branch 'tokenizer-overhaul-step2' of https://github.com/underst… 
850c24f
@mivanit
merge from main, update deps
ff7f4da
@mivanit
Merge branch 'main' into tokenizer-overhaul-integration
22c9dda
@mivanit
update deps
d96248c
@mivanit
poetry update lockfile
bf1701b
@mivanit
run format
3ab6760
@mivanit
BothCoords is depended on, no longer marked as unsupported
13460d8
@mivanit
missing regex import
2efc29f
@mivanit
format import re
c304f67
@mivanit
fixed that applied_filters list/tuple mismatch bug once and for all
4bf2844
@mivanit
minor import fixes and re-run
13f8593
@mivanit
fixes to endpoint_kwargs
3a201f2
@mivanit
run format
ff9ecd1
@mivanit
wip
8dd038f
@mivanit
re-run notebook
1e15fd9
@mivanit
save_tok_hashes as prereq of test_all_tok
46b3ec7
@mivanit
update deps for spinner
164e91f
@mivanit
run save_tok_hashes
be5ad67
@mivanit
fix has_element() call causing test_all_tok fail
43c3653
@mivanit
change .has_element() to take *args
bfebb29
@mivanit
parallelize test_all_tok w/ pytest-xdist
3740164
@mivanit
improvements & parallelization to save_hashes
8c8a096
@mivanit
faster hashing of new tokenizers
c6a0b7d
@mivanit
blake2b hash func, zipped hashes, more spinners
f2e3be4
@mivanit
format
e8d2758
@mivanit
attempt speed up all_instances
d8afc7d
@mivanit
less verbose test_all_tok
b263564
@mivanit
has_element() fix, back to longer pytest traces
2510d26
@mivanit
test fix
9f5166e
@mivanit
test optimizations
c00338c
@mivanit
test a larger random sample of tokenizers
ab246ae
@mivanit
set n of tokenizers to test via env
e3ce590
@mivanit
better CI, optimizations to testing all toks
5149058
@aaron-sandoval
Added one more condition in test_token_stability
34782a9
@aaron-sandoval
Merge branch 'tokenizer-overhaul-step2' of https://github.com/underst… 
2108e3a
@mivanit
tokenizer hashes now checked differently
b25869e
@mivanit
Merge branch 'tokenizer-overhaul-step2' of https://github.com/underst… 
afea7c2
@aaron-sandoval
Bugfix in test_token_stability, not all mazes have a "<PATH_END>"
8befacc
@mivanit
format
c09ebc5
@aaron-sandoval
Merge branch 'tokenizer-overhaul-step2' of https://github.com/underst… 
a4e67ab
@aaron-sandoval
format
03ed351
This was linked to issues on Jul 25
Add tokenization schemes #30
Open
Duplication between test_coord_str_tuple.py and test_token_utils.py #32
Open
TokenizationMode and MazeTokenizer ugly and not scalable #33
Open
@mivanit mivanit marked this pull request as ready for review last month
aaron-sandoval and others added 13 commits last month
@aaron-sandoval
Docs, a bit of cleanup
fa5c691
@aaron-sandoval
Format
51b77ae
@aaron-sandoval
Put back __init_subclass__
92da244
@aaron-sandoval
Format
cd08d53
@aaron-sandoval
Merge pull request #37 from understanding-search/tokenizer-overhaul-s… 
9ded238
@mivanit
Merge branch 'tokenizer-overhaul-integration' into fix-save-load-profile
63c21c9
@mivanit
fixed long-standing issue with serialization cfg mismatch
bfcaeca
@mivanit
fix exception for generation meta mismatch
bb9b69c
@mivanit
Merge pull request #36 from understanding-search/fix-save-load-profile 
7aa97b8
@mivanit
reworked CI
2c16c8d
@mivanit
added comments to save_hashes, testing CI dispatch
6c471ea
@mivanit
will run all-tok-checks if all_tokenizers tests changed
27cd942
@mivanit
all tok modifications 
0c16cd6
@mivanit mivanit mentioned this pull request on Jul 26
Support maze dataset tokenizers update understanding-search/maze-transformer#214
 Open
mivanit and others added 5 commits last month
@mivanit
reduce default number of tokenizers tested
4edd476
@mivanit
Merge branch 'main' into tokenizer-overhaul-integration, bump version 
2c19e57
@mivanit
summary should be json-serializable
19644d3
@mivanit
fix notebook
4c8adbf
@aaron-sandoval
test commit
2582299
@aaron-sandoval (Member) commented 27 days ago

> see if any more utils should be moved to muutils

Except for all_instances, I don't think any of the utilities in utils.py that were newly added in this PR should be moved, given their narrower, maze-specific applicability. There are a few older ones that could be moved. I didn't think you wanted to move elements of the old API if it could be helped, but if you're open to it, here are my recommendations. I only checked for uses in the two libraries.

| Function | Where Used | Recommendation |
| --- | --- | --- |
| `all_instances` | maze-dataset | Move |
| `bool_array_from_string` | maze-dataset, maze-transformer | Move |
| `adj_list_to_nested_set` | maze-dataset, maze-transformer | Maybe keep in m-d since it's at least lattice graph-specific |
| `apply_mapping` | Nowhere | Move |
| `apply_mapping_chain` | Nowhere | Move |
@aaron-sandoval (Member) commented 27 days ago

> unclear if save_hashes is being properly parallelized :/

I don't know a proper way to verify this. As a quick check, I ran a few trials on my new desktop with different numbers of processes in the Pool. Here are the wall-clock times reported by the spinner on list(pool.map(hash, all_tokenizers)).

| # Processes | Spinner Runtime [sec] |
| --- | --- |
| No Pool; list(map(hash, all_tokenizers)) | 59 |
| 1 | 75 |
| 2 | 39 |
| 4 | 21 |
| 6 | 19 |
| 12 | 19 |
| 24 | 21 |
The multiprocessing is clearly helping compared to a single core, but I don't know why it plateaus so early. I tried using Pool.apply_async, and that was way slower. IMO the runtime doesn't seem long enough to warrant sinking time into this now, since neither of us seems to have the relevant knowledge at the moment. This is the kind of backend optimization that can be done after the 1.0.0 release.
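For reference, a minimal sketch of the kind of timing harness behind the table above -- not the repo's actual benchmark, and the tokenizer list here is a cheap stand-in for what get_all_tokenizers() returns:

```python
# Hypothetical timing harness for the serial-vs-Pool comparison above.
import time
from multiprocessing import Pool

def time_hashing(all_tokenizers, n_processes=None):
    """Return wall-clock seconds to hash every tokenizer, serially or via a Pool."""
    start = time.perf_counter()
    if n_processes is None:
        list(map(hash, all_tokenizers))            # serial baseline ("No Pool" row)
    else:
        with Pool(n_processes) as pool:
            list(pool.map(hash, all_tokenizers))   # parallel variant
    return time.perf_counter() - start

if __name__ == "__main__":  # guard needed for multiprocessing on spawn platforms
    # stand-in objects; in the repo these would be the actual tokenizer instances
    all_tokenizers = [("tokenizer", i) for i in range(200_000)]
    for n in (None, 1, 2, 4, 6, 12, 24):
        print(n, f"{time_hashing(all_tokenizers, n):.1f}s")
```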

@aaron-sandoval
Cleaned up test_maze_dataset.py
bfeba59
@aaron-sandoval (Member) commented 2 weeks ago
@mivanit I know we're wrapping this up, but I was thinking about a small change to the mark_as_unsupported decorator. This shouldn't change any behavior in any tests. Right now we use that decorator, as well as single-valued Literal types, to artificially restrict the space returned by all_instances. You had mentioned before that ideally this artificial restriction would hide those subspaces from all_instances but still allow users to access them by directly constructing tokenizers with the desired parameters. The Literal types allow that direct construction, but the current implementation of mark_as_unsupported doesn't, since the dummy abstract method precludes any construction of that class.
What if mark_as_unsupported instead overwrote that class's is_valid method to always return False? Those classes would then still be constructable. Calls to get_all_tokenizers would ignore them just the same, and direct use of all_instances would ignore them if users pass MAZE_TOKENIZER_MODULAR_DEFAULT_VALIDATION_FUNCS as the documentation asks. We could maybe include MAZE_TOKENIZER_MODULAR_DEFAULT_VALIDATION_FUNCS by default in calls to all_tokenizers so that users have to consciously opt out of those validation funcs.
I was thinking about this when considering if/how to talk about these restricted subspaces in the paper. Right now I'd probably just ignore them completely, but if we made this change I might mention them as experimental, untested features. Lmk what you think, and if you're good with it I can make a super quick commit to implement it.
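A minimal sketch of the proposal (hypothetical, not the current mark_as_unsupported implementation): override is_valid so the class stays constructable but always reports itself as invalid, which the default validation funcs then filter out.

```python
# Hypothetical sketch of the proposed mark_as_unsupported behavior.
from typing import Type, TypeVar

T = TypeVar("T")

def mark_as_unsupported(cls: Type[T]) -> Type[T]:
    """Keep the class constructable, but make it always report itself as invalid."""

    def is_valid(self) -> bool:
        # get_all_tokenizers / all_instances with the default validation funcs
        # will skip instances of this class, but direct construction still works
        return False

    cls.is_valid = is_valid
    return cls

# usage sketch:
# @mark_as_unsupported
# class _SomeExperimentalTokenizerElement(_TokenizerElement):
#     ...
```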

aaron-sandoval added 3 commits 2 weeks ago
@aaron-sandoval
Added a TODO
8ad6f3a
@aaron-sandoval
bugfix: Shuffling and permuting edges
c9e3bc8
@aaron-sandoval
Added feature to color_tokens_rgb to automate line breaks for repor… 
1aeb408
@mivanit (Member, Author) commented 2 weeks ago
> @mivanit I know we're wrapping this up, but I was thinking about a small change to the mark_as_unsupported decorator. [...]

You're right that the current implementation would require people to manually edit their locally installed version of the package to get the unsupported tokenizers to work -- I actually think this behavior is fine and serves as a "make sure you know what you're doing" filter, but if you think this is a quick edit then I think it's fine to make, as long as we add a warning when initializing an unsupported tokenizer.

mivanit added 10 commits last week
@mivanit
better doc for make test_all_tok
dded010
@mivanit
update dep
3ca392f
@mivanit
removed testing length of ALL_TOKENIZERS 
2b8b2a5
@mivanit
lower n_to_test to see if pass, better conditions
e41c417
@mivanit
format
c3c9635
@mivanit
action was not correctly reading num of tokenizers and parallelization
cd9e069
@mivanit
oops
b7ee88b
@mivanit
poetry update
9ea828d
@mivanit
option to skip hashes test so we can run it separately
08bf285
@mivanit
test_all_tok was always skipping
90612dc
@mivanit (Member, Author) commented last week

Making a note for profiling and figuring out how many tokenizers we can test:

| Time (MM:SS) | Time (s) | n tokenizers tested | Link |
| --- | --- | --- | --- |
| 9:54 | 594 | 100 | action |
| 10:05 | 605 | 300 | action |
| 11:15 | 675 | 1000 | action |
| 13:33 | 813 | 3000 | action |
| 16:41 | 1001 | 5000 | action |

(plot: test runtime vs. number of tokenizers tested)
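The commits just below bump the default NUM_TOKENIZERS_TO_TEST; a hypothetical sketch of how such a knob could select the test sample (not the repo's actual test code):

```python
# Hypothetical: sample a random subset of tokenizers for CI, sized by an env var.
import os
import random

def sample_tokenizers_for_test(all_tokenizers, default: int = 100):
    n = int(os.environ.get("NUM_TOKENIZERS_TO_TEST", default))
    return random.sample(all_tokenizers, min(n, len(all_tokenizers)))
```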

@mivanit
bumping default NUM_TOKENIZERS_TO_TEST to 300
f3d2a72
mivanit and others added 18 commits last week
@mivanit
bump default NUM_TOKENIZERS_TO_TEST to 1000
62a02b7
@mivanit
bump default NUM_TOKENIZERS_TO_TEST to 3000
57cd1d1
@mivanit
test 5k tokenizers, expect test to take just over 16 min
3a9701d
@mivanit
bump muutils to 0.6.8, slight updates to spinners
fd6827f
@aaron-sandoval
Modified mark_as_unsupported to allow construction of all unsupport… 
599b955
@aaron-sandoval
wip getting nb to work
ad11705
@aaron-sandoval
Merge branch 'tokenizer-overhaul-integration' into tokenizer-overhaul… 
f03cf2b
@aaron-sandoval
Removed warnings, added unit tests for classes with `mark_as_unsuppor… 
5c7de66
@mivanit
slight clean up to .summary() code
c095d05
@mivanit
change _type to _type_ in MMT to avoid wandb issue 
729bf9a
@mivanit
format
7ed866d
@mivanit
Merge pull request #44 from understanding-search/tokenizer-overhaul-s… 
d5f453b
@mivanit
bump muutils dep to 0.6.9
ebf9a25
@mivanit
Merge branch 'tokenizer-overhaul-integration' of https://github.com/u… 
ee01ade
@mivanit
remove deprecation warning for is_UT and is_AOTP 
eb86a61
@mivanit
reverts and fixes
95c693e
@mivanit
bump muutils to 0.6.10
be7dbfd
@mivanit
MAZE_DATASET_CONFIGS gives a copy, not the original
e3decdb
@aaron-sandoval aaron-sandoval linked an issue last week that may be closed by this pull request
add option to remove randomization in as_tokens #34
Open
mivanit added 10 commits 5 days ago
@mivanit
fix MAZE_DATASET_CONFIGS interface
5e3d846
@mivanit
testing some changes to loading tokenizer hashes
a1a75d4
@mivanit
update dep
a2daf8e
@mivanit
revert how we load the tokenizer hashes
645fec4
@mivanit
ALL_TOKENIZER_HASHES now private and only gotten when required
7316455
@mivanit
update dep
2fb57c2
@mivanit
removed old setup.py
6a2b8e9
@mivanit
minor notebook cleanup
7b73303
@mivanit
a test no longer breaks collection, fixed a TODO
87c130e
@mivanit
bump version to 1.0.0
a638ef8
@mivanit (Member, Author) commented 1 minute ago
Approving -- this is finally done!