Commit

Merge pull request #38 from understanding-search/tokenizer-overhaul-integration

Tokenizer overhaul (complete)

# TODOs:

- `checks.yml`: ![status](https://github.com/understanding-search/maze-dataset/actions/workflows/checks.yml/badge.svg?branch=tokenizer-overhaul-integration)
- `all-tok-checks.yml`: ![status](https://github.com/understanding-search/maze-dataset/actions/workflows/all-tok-checks.yml/badge.svg?branch=tokenizer-overhaul-integration)

- [x] depends on #37 being merged into `tokenizer-overhaul-integration` first
- [x] depends on tests passing in [`maze-transformer` #214](understanding-search/maze-transformer#214)
    - PR: understanding-search/maze-transformer#214
    - PR checks status: ![status](https://github.com/understanding-search/maze-transformer/actions/workflows/checks.yml/badge.svg?branch=update-maze-dataset-tokenizers-step2)
- [x] ~~make sure that tokenizer hash numpy file is shipped with `maze-dataset` package~~
  - [x] ~~maybe make it optional, only if `tokenization` extra is requested?~~
  - [x] no need to ship this, it's only required for tests
- [x] go over existing `# TODO` comments, see if anything is critical
  - [x] parametrize some tests in `tests/unit/maze_dataset/generation/test_maze_dataset.py`
- [x] clean up various superfluous warnings
- [x] consider optimizations to `all_instances` and `save_hashes`
  - [x] unclear if `save_hashes` is being properly parallelized :/
- [x] see if any more utils should be moved to muutils

# commit log:

aaron-sandoval and others added 30 commits 3 months ago
@aaron-sandoval
Implemented validation_funcs in all_instances, added docs, added … 
5c79d22
@aaron-sandoval
Minor unit test fixes. zanj loading of StepSequencer is broken on t… 
a72680b
@aaron-sandoval
Added StepSizes.Straightaways. test_path_tokenizers passing.
72380c0
@aaron-sandoval
Added some PathTokenizers and tests
f184c1b
@aaron-sandoval
docs
3c35da7
@mivanit
minor assert
5d5b405
@aaron-sandoval
wip reversing import direction between maze_tokenizer.py and lattice_… 
f49d0f2
@aaron-sandoval
Refactored from_tokens to remove import dependency. `test_from_toke… 
7d56a1d
@aaron-sandoval
Moved util.py and token_utils.py out of tokenization directory … 
1795f16
@aaron-sandoval
Fixed hash collisions among different TokenizerElement instances. A… 
cee9eb4
@aaron-sandoval
Added Cardinal. test_path_tokenizers passing
99c6083
@aaron-sandoval
Added Forks. test_path_tokenizers passing.
782f204
@aaron-sandoval
Test fuzzing for test_path_tokenizers
db85262
@aaron-sandoval
Added ForksAndStraightaways. test_path_tokenizers passing
be642e3
@aaron-sandoval
Compacted test_path_tokenizers
5788cf9
@aaron-sandoval
Added Relative. test_path_tokenizers passing. Had to comment out … 
f961707
@aaron-sandoval
Temporarily deactivated StepTokenizers.Cardinal and `StepTokenizers… 
c1e8b52
@aaron-sandoval
Ran make unit. Only test_is_tested_tokenizer and `test_zanj_save_re… 
438deae
@aaron-sandoval
Fixed one non-fuzzed use of ALL_TOKENIZERS in unit tests. That migh… 
9eb338f
@aaron-sandoval
Deleted extraneous files
c3aa60b
@aaron-sandoval
Copied AdjListTokenizers skeleton from Notion
7767ad1
@aaron-sandoval
AdjListTokenizer docs and wip implementation
243a118
@aaron-sandoval
Docs, added new VOCAB items
35f2052
@aaron-sandoval
Docs
7479445
@aaron-sandoval
Added numpy_rng, refactored ChessboardSublattice into EdgePermuters.
bb66156
@aaron-sandoval
EdgePermuter._permute working in debugger for both subclasses
f934fe0
@aaron-sandoval
wip test_edge_permuters
f982e99
@aaron-sandoval
test_edge_permuters passing
81fb7b8
@aaron-sandoval
minor things
3efbc6c
@aaron-sandoval
Added lattice_connection_array
f680b1f
aaron-sandoval and others added 19 commits 3 months ago
@aaron-sandoval
AllLatticeEdges.get_edges written, not tested
ba6404a
@aaron-sandoval
Updated .gitignore for pyenv environment file
fa58ef5
@aaron-sandoval
Bug in CPython raises exception when building constants._VOCAB_BASE… 
f77f077
@aaron-sandoval
Updated poetry.lock
9a52cc7
@aaron-sandoval
test_manhattan_distance passing
00bd0a7
@aaron-sandoval
test_edge_subsets passing
c4253ec
@aaron-sandoval
Made method skeleton for EdgeGrouping and `AdjListTokenizer.to_toke… 
973df5a
@aaron-sandoval
wip test_edge_subsets
64dd2ff
@aaron-sandoval
test_edge_subsets passing
26db9f6
@mivanit
fix local function 
03ae196
@mivanit
underscore prefix for tokenizer element namespace ABCs, fix ABC 
976c361
@mivanit
better error on MazeTokenizer2.from_tokens() 
66d8cfa
@mivanit
fix type hints for get_all_subclasses 
b01aca9
@mivanit
minor cleanup to flatten(), added some tests for tokenizer utils
9bffdd3
@mivanit
run format
9c43aa5
@mivanit
cleanup of connection_list_to_adj_list 
313b1f8
@mivanit
rename and fix unpackable_if_true_attribute 
2395b5b
@mivanit
update poetry lock
b006afe
@mivanit
Merge pull request #35 from understanding-search/tokenizer-overhaul 
33afb1b
@review-notebook-app (bot) commented on Jun 10

Check out this pull request on ReviewNB to see visual diffs and provide feedback on the Jupyter notebooks.

mivanit and others added 40 commits 2 months ago
@mivanit
changed workflow to run checks on any PR
e14ba3d
@mivanit
merge commit?
999b251
@mivanit
format
042923d
@mivanit
update lockfile
6223013
@mivanit
mostly naming changes 
0e63156
@mivanit
format
ad703b0
@mivanit
minor fix to test_isinstance_by_type_name 
1511108
@mivanit
fix import in test_util.py
32c59d5
@mivanit
moved tests requiring all tokenizers into a separate dir 
c194d69
@mivanit
run format
4c52c4b
@mivanit
kill the all_tokenizers call after fixed time (TEMP)
a04f5be
@mivanit
(TEMP) ignore tokenizer deprecation warnings in tests
37314bd
@mivanit
wip
1f16761
@mivanit
wip merging
4be1957
@mivanit
run format
6a719e5
@mivanit
some imports
2aa17e9
@mivanit
unit tests collect, but don't pass
d418dfb
@mivanit
format
5ff09a2
@mivanit
Merge pull request #40 from understanding-search/tokenizer-overhaul-m… 
0f116cd
@mivanit
....wip
0d372f5
@mivanit
merge main into tokenizer-overhaul-integration
17c3209
@aaron-sandoval
Update dependency
27eb7db
@aaron-sandoval
Fixed incomplete refactor of TokenizerElement abstract subclasses. … 
c0da82d
@aaron-sandoval
Added missing @serializable_dataclass decorators
58b893f
@aaron-sandoval
Cleaned up _tokenize_edge_grouping, to_tokens interfaces. Impleme… 
f72a974
@aaron-sandoval
test_lattice_connection_arrray passing
f843a0c
@aaron-sandoval
Added testing_utils.py. Not tested yet
675d844
@aaron-sandoval
test_is_connection passing
27de83b
@aaron-sandoval
test_edge_groupings passing with an added assertion. `_tokenize_edg… 
633c8c0
@aaron-sandoval
Removed duplicated functionality in BothCoords and AllLatticeEdges
00d57b3
@aaron-sandoval
wip debugging _tokenize_edge_grouping
67cda3d
@aaron-sandoval
AdjListCoord.to_tokens slightly hand-tested, unit tests wip
6e6db5f
@aaron-sandoval
wip AdjListCardinal._tokenize_edge_grouping. Snapshot before refact… 
cdeeac0
@aaron-sandoval
Refactored _tokenize_edge_grouping. All `_AdjListTokenizer.to_token… 
a5aa3d9
@aaron-sandoval
wip test_adjlist_tokenizers
76d66de
@aaron-sandoval
wip test_adjlist_tokenizers. Problems with random seeds changing py… 
754c759
@aaron-sandoval
Random seed hackily fixed, pytest running
188309c
@aaron-sandoval
wip test_adjlist_tokenizers. Snapshot before starting `LatticeMaze.… 
1c7dbe6
@aaron-sandoval
test_adjlist_tokenizers passing with 20 tokenizers
6c575d5
@aaron-sandoval
test_adjlist_tokenizers passing with 100 tokenizers
dac7888
aaron-sandoval added 20 commits 2 months ago
@aaron-sandoval
Moved various TokenizerElement tests back to test_tokenizer.py. 16 … 
ca401ff
@aaron-sandoval
Cleanup, added extra pruning to _get_all_tokenizers, but probably i… 
c27f9df
@aaron-sandoval
Renamed MazeTokenizer2 >> MazeTokenizerModular. `test_path_tokeni… 
5512fd4
@aaron-sandoval
wip refactoring all_instances for concision. Now accepts regular `d… 
62e6539
@aaron-sandoval
wip test_all_instances not quite passing. Enhanced all_instances … 
9078f6b
@aaron-sandoval
test_all_tokenizers passing except for case 1-_PathTokenizer, whi… 
0f2fb8a
@aaron-sandoval
Updated muutils and zanj, dc_eq still has bug
ff9f086
@aaron-sandoval
Deleted duplicate utilities
fd8ce29
@aaron-sandoval
Minor generic type fixes, added some unit tests, not run yet
0142fa6
@aaron-sandoval
Refactored MazeTokenizer to MazeTokenizerModular throughout the l… 
bd79997
@aaron-sandoval
Set muutils dependency to last working version until latest bugs are … 
b1e1fba
@aaron-sandoval
Readded pandas, which got removed from dependencies at some point in … 
91c7680
@aaron-sandoval
wip updating nb
bf83a19
@aaron-sandoval
Did some profiling, removed a currerntly redundant validation from `_… 
cd5a455
@aaron-sandoval
wip refactoring uses of all_instances to work with Generator output
c82b3e2
@aaron-sandoval
Remove temporary pruning of _StepTokenizers space. `_get_all_tokeni… 
6951895
@aaron-sandoval
Reinstated pruning
02e39ef
@aaron-sandoval
Small additions for demo nbs
91ada63
@aaron-sandoval
all_instances bugfixes, more unit tests. wip reworking `TokenizerEl… 
ddc72de
@aaron-sandoval
make unit all passing
2f03963
@mivanit added the `enhancement` and `tokenization` labels on Jul 3
mivanit added 13 commits last month
@mivanit
merge in integration branch
ffab764
@mivanit
run format
a45c9d0
@mivanit
new lock file
09e7212
@mivanit
fix an import
ad1506d
@mivanit
update muutils and zanj deps
4619d2c
@mivanit
won't check types on applied_filters or maze_ctor func 
19dfb6e
@mivanit
wip
0fdb2f2
@mivanit
wip
057ab14
@mivanit
wip
147b6d7
@mivanit
wip
d2042d6
@mivanit
Format
c66ac97
@mivanit
random sample of tokenizers to make things run in the notebook
437e34a
@mivanit
one line fix to tokenizers
3371303
@mivanit mentioned this pull request on Jul 5
Add tokenization schemes #30
Open
mivanit and others added 24 commits last month
@mivanit
Merge branch 'main' into fix-save-load-profile
cd1eeb3
@mivanit
format
9783610
@mivanit
Merge branch 'main' into tokenizer-overhaul-integration
290f55d
@aaron-sandoval
Bugfix in test_has_element. Can't figure out why `ALL_TOKENIZER_HAS… 
826ea87
@aaron-sandoval
test_zanj_save_read passing
f1e937e
@aaron-sandoval
Added properties_to_serialize to MazeTokenizerModular. `test_zanj… 
26fb821
@aaron-sandoval
Demo nb milestone, fixed a couple of small bugs
673b986
@aaron-sandoval
wip filtering demo
37c1c75
@aaron-sandoval
wip filtering demo
069ceb3
@mivanit
tentatively marking some tokenizer elements as not supported
da0f197
@aaron-sandoval
Updated muutils dependency 0.6.3
73d2fdb
@aaron-sandoval
Removed utilities migrated to muutils, re-pointed imports
6f0f62c
@aaron-sandoval
Using stable_hash
929ba19
@aaron-sandoval
Swapped in @mark_as_unsupported for old @AbstractMethod temporary… 
0f86ee4
@mivanit
some more removed tokenizers
2293ee9
@aaron-sandoval
Hashing I think is fixed and stable. all_instances still broken
a16b4e4
@aaron-sandoval
Hashing works with the __init_subclass__ hidden field workaround. S… 
dc2f2c1
@aaron-sandoval
test_zanj_save_read passing, ...hashes.npy is out of date so `tes… 
6e15a5c
@aaron-sandoval
wip demo nb
fdd4089
@aaron-sandoval
Added caching back into all_instances while it still uses generator… 
ac84ae1
@aaron-sandoval
Think I finished the demo nb. Can't test it yet due to bug in muutils
1c2662f
@aaron-sandoval
Refactored TokenizerElement to _TokenizerElement. Untested for sa… 
98727aa
@aaron-sandoval
Added overall token counter test. Can't run due to muutils bug
2677bff
@aaron-sandoval
Consolidated util.py into token_utils.py. Imports seem to resolve wit… 
c6153b1
aaron-sandoval and others added 43 commits last month
@aaron-sandoval
Deleted tests which were migrated to muutils
b52c962
@aaron-sandoval
Tried moving connection_list_to_adj_list, is_connection to utils.… 
2b1ffdb
@mivanit
Merge branch 'tokenizer-overhaul-step2' of https://github.com/underst… 
850c24f
@mivanit
merge from main, update deps
ff7f4da
@mivanit
Merge branch 'main' into tokenizer-overhaul-integration
22c9dda
@mivanit
update deps
d96248c
@mivanit
poetry update lockfile
bf1701b
@mivanit
run format
3ab6760
@mivanit
BothCoords is depended on, no longer marked as unsupported
13460d8
@mivanit
missing regex import
2efc29f
@mivanit
format import re
c304f67
@mivanit
fixed that applied_filters list/tuple mismatch bug once and for all
4bf2844
@mivanit
minor import fixes and re-run
13f8593
@mivanit
fixes to endpoint_kwargs
3a201f2
@mivanit
run format
ff9ecd1
@mivanit
wip
8dd038f
@mivanit
re-run notebook
1e15fd9
@mivanit
save_tok_hashes as prereq of test_all_tok
46b3ec7
@mivanit
update deps for spinner
164e91f
@mivanit
run save_tok_hashes
be5ad67
@mivanit
fix has_element() call causing test_all_tok fail
43c3653
@mivanit
change .has_element() to take *args
bfebb29
@mivanit
parallelize test_all_tok w/ pytest-xdist
3740164
@mivanit
improvements & parallelization to save_hashes
8c8a096
@mivanit
faster hashing of new tokenizers
c6a0b7d
@mivanit
blake2b hash func, zipped hashes, more spinners
f2e3be4
@mivanit
format
e8d2758
@mivanit
attempt speed up all_instances
d8afc7d
@mivanit
less verbose test_all_tok
b263564
@mivanit
has_element() fix, back to longer pytest traces
2510d26
@mivanit
test fix
9f5166e
@mivanit
test optimizations
c00338c
@mivanit
test a larger random sample of tokenizers
ab246ae
@mivanit
set n of tokenizers to test via env
e3ce590
@mivanit
better CI, optimizations to testing all toks
5149058
@aaron-sandoval
Added one more condition in test_token_stability
34782a9
@aaron-sandoval
Merge branch 'tokenizer-overhaul-step2' of https://github.com/underst… 
2108e3a
@mivanit
tokenizer hashes now checked differently
b25869e
@mivanit
Merge branch 'tokenizer-overhaul-step2' of https://github.com/underst… 
afea7c2
@aaron-sandoval
Bugfix in test_token_stability, not all mazes have a "<PATH_END>"
8befacc
@mivanit
format
c09ebc5
@aaron-sandoval
Merge branch 'tokenizer-overhaul-step2' of https://github.com/underst… 
a4e67ab
@aaron-sandoval
format
03ed351
This was linked to issues on Jul 25
Add tokenization schemes #30
Open
Duplication between test_coord_str_tuple.py and test_token_utils.py #32
Open
TokenizationMode and MazeTokenizer ugly and not scalable #33
Open
@mivanit marked this pull request as ready for review last month
aaron-sandoval and others added 13 commits last month
@aaron-sandoval
Docs, a bit of cleanup
fa5c691
@aaron-sandoval
Format
51b77ae
@aaron-sandoval
Put back __init_subclass__
92da244
@aaron-sandoval
Format
cd08d53
@aaron-sandoval
Merge pull request #37 from understanding-search/tokenizer-overhaul-s… 
9ded238
@mivanit
Merge branch 'tokenizer-overhaul-integration' into fix-save-load-profile
63c21c9
@mivanit
fixed long-standing issue with serialization cfg mismatch
bfcaeca
@mivanit
fix exception for generation meta mismatch
bb9b69c
@mivanit
Merge pull request #36 from understanding-search/fix-save-load-profile 
7aa97b8
@mivanit
reworked CI
2c16c8d
@mivanit
added comments to save_hashes, testing CI dispatch
6c471ea
@mivanit
will run all-tok-checks if all_tokenizers tests changed
27cd942
@mivanit
all tok modifications 
0c16cd6
@mivanit mentioned this pull request on Jul 26
Support maze dataset tokenizers update understanding-search/maze-transformer#214
 Open
mivanit and others added 5 commits last month
@mivanit
reduce default number of tokenizers tested
4edd476
@mivanit
Merge branch 'main' into tokenizer-overhaul-integration, bump version 
2c19e57
@mivanit
summary should be json-serializable
19644d3
@mivanit
fix notebook
4c8adbf
@aaron-sandoval
test commit
2582299
@aaron-sandoval (Member) commented 27 days ago

> see if any more utils should be moved to muutils

Except for `all_instances`, I don't think any of the utilities newly added to `utils.py` in this PR should be moved, since they apply too narrowly to mazes. There are a few older ones that could be moved. I didn't think you wanted to move elements of the old API if it could be helped, but if you're open to it, here are my recommendations. I only checked for uses in the two libraries.

| Function | Where Used | Recommendation |
|---|---|---|
| `all_instances` | maze-dataset | Move |
| `bool_array_from_string` | maze-dataset, maze-transformer | Move |
| `adj_list_to_nested_set` | maze-dataset, maze-transformer | Maybe keep in maze-dataset, since it's at least lattice-graph-specific |
| `apply_mapping` | Nowhere | Move |
| `apply_mapping_chain` | Nowhere | Move |
@aaron-sandoval (Member) commented 27 days ago

> unclear if `save_hashes` is being properly parallelized :/

I don't know a proper way to verify this, so as a quick check I ran a few trials on my new desktop with different numbers of processes in the `Pool`. Here are the wall-clock times reported by the spinner on `list(pool.map(hash, all_tokenizers))`.

| # Processes | Spinner Runtime [sec] |
|---|---|
| No `Pool`; `list(map(hash, all_tokenizers))` | 59 |
| 1 | 75 |
| 2 | 39 |
| 4 | 21 |
| 6 | 19 |
| 12 | 19 |
| 24 | 21 |
The multiprocessing is clearly helping compared to a single process, but I don't know why it plateaus so early. I tried `Pool.apply_async`, and that was way slower. IMO the runtime doesn't seem long enough to warrant sinking time into this now, since neither of us seems to have the relevant knowledge at the moment. This is the kind of backend optimization that can be done after the 1.0.0 release.
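
For reproducibility, here is roughly the shape of the timing harness — a minimal sketch, where the `get_all_tokenizers` import path is an assumption rather than a confirmed API:

```python
import multiprocessing
import time

# assumed import path for the tokenizer enumeration; hypothetical in this sketch
from maze_dataset.tokenization.all_tokenizers import get_all_tokenizers

if __name__ == "__main__":
    all_tokenizers = get_all_tokenizers()

    # baseline: no Pool at all
    start = time.perf_counter()
    list(map(hash, all_tokenizers))
    print(f"no Pool: {time.perf_counter() - start:.0f}s")

    # same work, varying the Pool size
    for n_procs in (1, 2, 4, 6, 12, 24):
        start = time.perf_counter()
        with multiprocessing.Pool(processes=n_procs) as pool:
            list(pool.map(hash, all_tokenizers))
        print(f"{n_procs:>2} processes: {time.perf_counter() - start:.0f}s")
```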

@aaron-sandoval
Cleaned up test_maze_dataset.py
bfeba59
@aaron-sandoval (Member) commented 2 weeks ago

@mivanit I know we're wrapping this up, but I was thinking about a small change to the `mark_as_unsupported` decorator. It shouldn't change any behavior in any tests. Right now we use that decorator, as well as single-valued `Literal` types, to artificially restrict the space returned by `all_instances`. You had mentioned before that ideally this artificial restriction would hide those subspaces from `all_instances` but still allow users to access them by directly constructing tokenizers with the desired parameters. The `Literal` types allow that direct construction, but the current implementation of `mark_as_unsupported` doesn't, since its dummy abstract method precludes any construction of that class.

What if `mark_as_unsupported` instead overwrote that class's `is_valid` method to always return `False`? That would keep those classes constructable. Calls to `get_all_tokenizers` would still ignore them just the same, and direct use of `all_instances` would ignore them as long as users pass `MAZE_TOKENIZER_MODULAR_DEFAULT_VALIDATION_FUNCS` as the documentation asks. We could maybe include `MAZE_TOKENIZER_MODULAR_DEFAULT_VALIDATION_FUNCS` by default in calls to `all_tokenizers` so that users have to consciously opt out of those validation funcs. A rough sketch of the idea follows.

I was thinking about it when considering if/how to talk about these restricted subspaces in the paper. Right now I'd probably just ignore them completely, but if we made this change I might mention them as experimental, untested features. Let me know what you think, and if you're good with it I can make a super quick commit to implement it.
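
A minimal sketch of the proposal, under the assumption that `mark_as_unsupported` is a plain class decorator (the real implementation details may differ):

```python
from typing import TypeVar

T = TypeVar("T")


def mark_as_unsupported(cls: type[T]) -> type[T]:
    """Hide a tokenizer-element class from `all_instances` without blocking
    direct construction: instead of injecting a dummy abstract method (which
    makes the class uninstantiable), force `is_valid` to always return False
    so that validation-func-based enumeration skips it."""
    cls.is_valid = lambda self, *args, **kwargs: False  # type: ignore[assignment]
    return cls
```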

aaron-sandoval added 3 commits 2 weeks ago
@aaron-sandoval
Added a TODO
8ad6f3a
@aaron-sandoval
bugfix: Shuffling and permuting edges
c9e3bc8
@aaron-sandoval
Added feature to color_tokens_rgb to automate line breaks for repor… 
1aeb408
@mivanit (Member, Author) commented 2 weeks ago

> [quoting @aaron-sandoval's `mark_as_unsupported` proposal above]

You're right that the current implementation would require people to manually edit their locally installed version of the package to get the unsupported tokenizers to work -- I actually think this behavior is fine and serves as a "make sure you know what you're doing" filter, but if you think this is a quick edit then I think it's fine to make, as long as we add a warning when initializing an unsupported tokenizer.
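
A hypothetical sketch folding in that warning condition (not the shipped implementation — the commit log below shows the warnings were subsequently removed):

```python
import warnings


def mark_as_unsupported(cls):
    """Sketch: unsupported elements stay constructable, but warn on
    construction and are skipped by validation-based enumeration."""
    original_init = cls.__init__

    def warned_init(self, *args, **kwargs):
        warnings.warn(
            f"{cls.__name__} is marked as unsupported; its behavior is untested.",
            stacklevel=2,
        )
        original_init(self, *args, **kwargs)

    cls.__init__ = warned_init
    cls.is_valid = lambda self, *args, **kwargs: False
    return cls
```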

mivanit added 10 commits last week
@mivanit
better doc for make test_all_tok
dded010
@mivanit
update dep
3ca392f
@mivanit
removed testing length of ALL_TOKENIZERS 
2b8b2a5
@mivanit
lower n_to_test to see if pass, better conditions
e41c417
@mivanit
format
c3c9635
@mivanit
action was not correctly reading num of tokenizers and parallelization
cd9e069
@mivanit
oops
b7ee88b
@mivanit
poetry update
9ea828d
@mivanit
option to skip hashes test so we can run it separately
08bf285
@mivanit
test_all_tok was always skipping
90612dc
@mivanit (Member, Author) commented last week

Making a note for profiling and figuring out how many tokenizers we can test:

| Time (MM:SS) | Time (s) | n tokenizers tested | Link |
|---|---|---|---|
| 9:54 | 594 | 100 | action |
| 10:05 | 605 | 300 | action |
| 11:15 | 675 | 1000 | action |
| 13:33 | 813 | 3000 | action |
| 16:41 | 1001 | 5000 | action |
(plot: test runtime vs. number of tokenizers tested)
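
The numbers above suggest a near-linear model of fixed CI overhead plus a per-tokenizer cost — a quick fit sketch (my own extrapolation, not from the PR):

```python
import numpy as np

n_tokenizers = np.array([100, 300, 1000, 3000, 5000])
runtime_s = np.array([594, 605, 675, 813, 1001])

# least-squares line: fixed overhead plus per-tokenizer cost
slope, intercept = np.polyfit(n_tokenizers, runtime_s, 1)
print(f"~{intercept:.0f}s fixed + ~{slope * 1000:.0f}s per 1000 tokenizers")
# roughly 580s fixed plus ~80s per 1000 tokenizers tested
```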

@mivanit
bumping default NUM_TOKENIZERS_TO_TEST to 300
f3d2a72
mivanit and others added 18 commits last week
@mivanit
bump default NUM_TOKENIZERS_TO_TEST to 1000
62a02b7
@mivanit
bump default NUM_TOKENIZERS_TO_TEST to 3000
57cd1d1
@mivanit
test 5k tokenizers, expect test to take just over 16 min
3a9701d
@mivanit
bump muutils to 0.6.8, slight updates to spinners
fd6827f
@aaron-sandoval
Modified mark_as_unsupported to allow construction of all unsupport… 
599b955
@aaron-sandoval
wip getting nb to work
ad11705
@aaron-sandoval
Merge branch 'tokenizer-overhaul-integration' into tokenizer-overhaul… 
f03cf2b
@aaron-sandoval
Removed warnings, added unit tests for classes with `mark_as_unsuppor… 
5c7de66
@mivanit
slight clean up to .summary() code
c095d05
@mivanit
change _type to _type_ in MMT to avoid wandb issue 
729bf9a
@mivanit
format
7ed866d
@mivanit
Merge pull request #44 from understanding-search/tokenizer-overhaul-s… 
d5f453b
@mivanit
bump muutils dep to 0.6.9
ebf9a25
@mivanit
Merge branch 'tokenizer-overhaul-integration' of https://github.com/u… 
ee01ade
@mivanit
remove deprecation warning for is_UT and is_AOTP 
eb86a61
@mivanit
reverts and fixes
95c693e
@mivanit
bump muutils to 0.6.10
be7dbfd
@mivanit
MAZE_DATASET_CONFIGS gives a copy, not the original
e3decdb
@aaron-sandoval linked an issue last week that may be closed by this pull request
add option to remove randomization in as_tokens #34
Open
mivanit added 10 commits 5 days ago
@mivanit
fix MAZE_DATASET_CONFIGS interface
5e3d846
@mivanit
testing some changes to loading tokenizer hashes
a1a75d4
@mivanit
update dep
a2daf8e
@mivanit
revert how we load the tokenizer hashes
645fec4
@mivanit
ALL_TOKENIZER_HASHES now private and only gotten when required
7316455
@mivanit
update dep
2fb57c2
@mivanit
removed old setup.py
6a2b8e9
@mivanit
minor notebook cleanup
7b73303
@mivanit
a test no longer breaks collection, fixed a TODO
87c130e
@mivanit
bump version to 1.0.0
a638ef8
@mivanit (Member, Author) commented 1 minute ago

Approving -- this is finally done!

mivanit authored Aug 28, 2024
2 parents aac2402 + a638ef8, commit d5a20d6
Showing 42 changed files with 7,703 additions and 2,307 deletions.
52 changes: 52 additions & 0 deletions .github/workflows/all-tok-checks.yml
@@ -0,0 +1,52 @@
name: All Tokenizer Checks

on:
  push:
    paths:
      - 'maze_dataset/utils.py' # temporary
      - 'maze_dataset/token_utils.py' # temporary
      - 'maze_dataset/constants.py'
      - 'maze_dataset/tokenization/*.py'
      - 'maze_dataset/tokenization/MazeTokenizerModular_hashes.npz'
      - 'notebooks/demo_mazetokenizermodular.ipynb'
      - 'tests/all_tokenizers/*.py'
      - 'pyproject.toml' # on new version or update deps
      - '.github/workflows/all-tok-checks.yml' # changing this file
      - '.lastversion' # on new release
  workflow_dispatch:
    inputs:
      n_to_test:
        description: 'Number of tokenizers to test'
        required: false
        default: 5000
        type: number
      pytest_parallel:
        description: '1 to parallelize tests with -n auto, 0 to run without parallelization'
        required: false
        default: 1
        type: number

jobs:
  all_tok_test:
    name: All Tokenizer Tests
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - uses: actions/setup-python@v5
        with:
          python-version: "3.10"

      - name: Install dependencies
        run: |
          curl -sSL https://install.python-poetry.org | python3 -
          poetry lock --check
          poetry install
      - name: tokenizer hash tests
        run: |
          make test_tok_hashes
      - name: all tokenizer tests
        run: |
          N_TO_TEST=${{ github.event.inputs.n_to_test || '5000' }}
          PYTEST_PARALLEL=${{ github.event.inputs.pytest_parallel || '1' }}
          make test_all_tok SKIP_HASH_TEST=1 NUM_TOKENIZERS_TO_TEST=$N_TO_TEST PYTEST_PARALLEL=$PYTEST_PARALLEL
6 changes: 5 additions & 1 deletion .github/workflows/checks.yml
@@ -4,7 +4,7 @@ on:
   workflow_dispatch:
   pull_request:
     branches:
-      - main
+      - '*'
   push:
     branches:
       - main
@@ -31,6 +31,7 @@ jobs:
       run: |
        pip install black
        python -m black --check .
+
   test:
     name: Test
     env:
@@ -70,3 +71,6 @@ jobs:
       - name: Notebook tests
        run: make test_notebooks
+
+      - name: Test Tokenizer Hashes
+        run: make test_tok_hashes
2 changes: 2 additions & 0 deletions .gitignore
@@ -5,6 +5,7 @@ tests/_temp/**
 htmlcov/
 .vscode/**
 notebooks/data/**
+*.bak

 .pypi-token
 .commit_log
@@ -36,6 +37,7 @@ share/python-wheels/
 .installed.cfg
 *.egg
 MANIFEST
+.python-version

 # PyInstaller
 # Usually these files are written by a python script from a template
4 changes: 2 additions & 2 deletions README.md
Original file line number Diff line number Diff line change
Expand Up @@ -75,8 +75,8 @@ m.as_ascii()
# RGB image, optionally without solution or endpoints, suitable for CNNs
m.as_pixels()
# text format for autoreregressive transformers
from maze_dataset.tokenization import MazeTokenizer, TokenizationMode
m.as_tokens(maze_tokenizer=MazeTokenizer(
from maze_dataset.tokenization import MazeTokenizerModular, TokenizationMode
m.as_tokens(maze_tokenizer=MazeTokenizerModular(
tokenization_mode=TokenizationMode.AOTP_UT_rasterized, max_grid_size=100,
))
# advanced visualization with many features
Expand Down
Binary file added docs/mazeplot_3x3.png
36 changes: 33 additions & 3 deletions makefile
@@ -65,12 +65,18 @@ check-format: clean
 # coverage reports & benchmarks
 # --------------------------------------------------

 # whether to run pytest with coverage report generation
+PYTEST_OPTIONS ?=
 COV ?= 1

 ifeq ($(COV),1)
-	PYTEST_OPTIONS=--cov=.
-else
-	PYTEST_OPTIONS=
+	PYTEST_OPTIONS+=--cov=.
 endif

+PYTEST_PARALLEL ?= 0
+ifeq ($(PYTEST_PARALLEL),1)
+	PYTEST_OPTIONS+=-n auto
+endif

 .PHONY: cov
@@ -94,6 +100,30 @@ unit:
 	@echo "run unit tests"
 	$(POETRY_RUN_PYTHON) -m pytest $(PYTEST_OPTIONS) tests/unit

+.PHONY: save_tok_hashes
+save_tok_hashes:
+	@echo "generate and save tokenizer hashes"
+	$(POETRY_RUN_PYTHON) -m maze_dataset.tokenization.save_hashes -p
+
+.PHONY: test_tok_hashes
+test_tok_hashes:
+	@echo "re-run tokenizer hashes and compare"
+	$(POETRY_RUN_PYTHON) -m maze_dataset.tokenization.save_hashes -p --check
+
+.PHONY: test_all_tok
+test_all_tok:
+	@echo "run tests on all tokenizers. can pass NUM_TOKENIZERS_TO_TEST arg or SKIP_HASH_TEST"
+	@echo "NUM_TOKENIZERS_TO_TEST=$(NUM_TOKENIZERS_TO_TEST)"
+	@if [ "$(SKIP_HASH_TEST)" != "1" ]; then \
+		echo "Running tokenizer hash tests"; \
+		$(MAKE) test_tok_hashes; \
+	else \
+		echo "Skipping tokenizer hash tests"; \
+	fi
+	$(POETRY_RUN_PYTHON) -m pytest $(PYTEST_OPTIONS) --verbosity=-1 --durations=50 tests/all_tokenizers

 .PHONY: convert_notebooks
 convert_notebooks:
15 changes: 14 additions & 1 deletion maze_dataset/__init__.py
@@ -1,5 +1,11 @@
 from maze_dataset.constants import (
     SPECIAL_TOKENS,
+    VOCAB,
+    VOCAB_LIST,
+    VOCAB_TOKEN_TO_INDEX,
+    Connection,
+    ConnectionArray,
+    ConnectionList,
     Coord,
     CoordArray,
     CoordList,
@@ -15,15 +21,22 @@
     set_serialize_minimal_threshold,
 )
 from maze_dataset.generation.generators import LatticeMazeGenerators
-from maze_dataset.maze.lattice_maze import LatticeMaze, SolvedMaze
+from maze_dataset.maze.lattice_maze import LatticeMaze, SolvedMaze, TargetedLatticeMaze

 __all__ = [
     "Coord",
     "CoordTup",
     "CoordList",
     "CoordArray",
+    "Connection",
+    "ConnectionList",
+    "ConnectionArray",
     "SPECIAL_TOKENS",
+    "VOCAB",
+    "VOCAB_LIST",
+    "VOCAB_TOKEN_TO_INDEX",
     "LatticeMaze",
+    "TargetedLatticeMaze",
     "SolvedMaze",
     "MazeDataset",
     "MazeDatasetConfig",
85 changes: 81 additions & 4 deletions maze_dataset/constants.py
@@ -1,13 +1,18 @@
 import warnings
-from dataclasses import dataclass
+from dataclasses import dataclass, field, make_dataclass

 import numpy as np
-from jaxtyping import Int8
+from jaxtyping import Bool, Int8

-Coord = Int8[np.ndarray, "x y"]
+from maze_dataset.utils import corner_first_ndindex
+
+Coord = Int8[np.ndarray, "row_col"]
 CoordTup = tuple[int, int]
-CoordArray = Int8[np.ndarray, "coord x y"]
+CoordArray = Int8[np.ndarray, "coord row_col"]
 CoordList = list[CoordTup]
+Connection = Int8[np.ndarray, "coord=2 row_col=2"]
+ConnectionList = Bool[np.ndarray, "lattice_dim=2 row col"]
+ConnectionArray = Int8[np.ndarray, "edges leading_trailing_coord=2 row_col=2"]


 class SpecialTokensError(Exception):
@@ -115,3 +120,75 @@ def keys(self):
         [-1, 0],  # left
     ]
 )


+_VOCAB_FIELDS: list = [
+    # *[(k, str, field(default=v)) for k, v in SPECIAL_TOKENS.items()],
+    ("COORD_PRE", str, field(default="(")),
+    ("COORD_INTRA", str, field(default=",")),
+    ("COORD_POST", str, field(default=")")),
+    ("TARGET_INTRA", str, field(default="=")),
+    ("TARGET_POST", str, field(default="||")),
+    ("PATH_INTRA", str, field(default=":")),
+    ("PATH_POST", str, field(default="THEN")),
+    ("NEGATIVE", str, field(default="-")),
+    ("UNKNOWN", str, field(default="<UNK>")),
+    *[
+        (f"TARGET_{a}", str, field(default=f"TARGET_{a}"))
+        for a in "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
+    ],
+    ("TARGET_NORTH", str, field(default="TARGET_NORTH")),
+    ("TARGET_SOUTH", str, field(default="TARGET_SOUTH")),
+    ("TARGET_EAST", str, field(default="TARGET_EAST")),
+    ("TARGET_WEST", str, field(default="TARGET_WEST")),
+    ("TARGET_NORTHEAST", str, field(default="TARGET_NORTHEAST")),
+    ("TARGET_NORTHWEST", str, field(default="TARGET_NORTHWEST")),
+    ("TARGET_SOUTHEAST", str, field(default="TARGET_SOUTHEAST")),
+    ("TARGET_SOUTHWEST", str, field(default="TARGET_SOUTHWEST")),
+    ("TARGET_CENTER", str, field(default="TARGET_CENTER")),
+    ("PATH_NORTH", str, field(default="NORTH")),
+    ("PATH_SOUTH", str, field(default="SOUTH")),
+    ("PATH_EAST", str, field(default="EAST")),
+    ("PATH_WEST", str, field(default="WEST")),
+    ("PATH_FORWARD", str, field(default="FORWARD")),
+    ("PATH_BACKWARD", str, field(default="BACKWARD")),
+    ("PATH_LEFT", str, field(default="LEFT")),
+    ("PATH_RIGHT", str, field(default="RIGHT")),
+    ("PATH_STAY", str, field(default="STAY")),
+    *[
+        (f"I_{i:03}", str, field(default=f"+{i}")) for i in range(256)
+    ],  # General purpose positive int tokens. Used by `StepTokenizers.Distance`.
+    *[
+        (f"CTT_{i}", str, field(default=f"{i}")) for i in range(128)
+    ],  # Coord tuple tokens
+    *[
+        (f"I_N{-i:03}", str, field(default=f"{i}")) for i in range(-256, 0)
+    ],  # General purpose negative int tokens
+    ("PATH_PRE", str, field(default="STEP")),
+    ("ADJLIST_PRE", str, field(default="ADJ_GROUP")),
+    ("ADJLIST_INTRA", str, field(default="&")),
+    ("ADJLIST_WALL", str, field(default="<XX>")),
+    *[(f"RESERVE_{i}", str, field(default=f"<RESERVE_{i}>")) for i in range(708, 1596)],
+    *[
+        (f"UT_{x:02}_{y:02}", str, field(default=f"({x},{y})"))
+        for x, y in corner_first_ndindex(50)
+    ],
+]


+_VOCAB_BASE: type = make_dataclass(
+    "_VOCAB_BASE", fields=_VOCAB_FIELDS, bases=(_SPECIAL_TOKENS_BASE,), frozen=True
+)
+# TODO: edit __getitem__ to add warning for accessing a RESERVE token
+
+VOCAB: _VOCAB_BASE = _VOCAB_BASE()
+VOCAB_LIST: list[str] = list(VOCAB.values())
+VOCAB_TOKEN_TO_INDEX: dict[str, int] = {token: i for i, token in enumerate(VOCAB_LIST)}
+
+# CARDINAL_MAP: Maps tuple(coord1 - coord0) : cardinal direction
+CARDINAL_MAP: dict[tuple[int, int], str] = {
+    (-1, 0): VOCAB.PATH_NORTH,
+    (1, 0): VOCAB.PATH_SOUTH,
+    (0, -1): VOCAB.PATH_WEST,
+    (0, 1): VOCAB.PATH_EAST,
+}
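
For orientation, a small usage sketch of the generated vocabulary and `CARDINAL_MAP` — hedged: the exports are taken from the `__init__.py` diff above, and the printed values from the defaults shown:

```python
import numpy as np

from maze_dataset import VOCAB, VOCAB_TOKEN_TO_INDEX
from maze_dataset.constants import CARDINAL_MAP

# VOCAB is an instance of the frozen dataclass built by make_dataclass;
# each field's default is the token string itself
print(VOCAB.COORD_PRE)   # "("
print(VOCAB.PATH_NORTH)  # "NORTH"

# token string -> integer index into the flattened vocabulary
print(VOCAB_TOKEN_TO_INDEX[VOCAB.ADJLIST_WALL])

# CARDINAL_MAP is keyed by the (row, col) delta: coord1 - coord0
step = tuple(np.array([2, 1]) - np.array([2, 2]))  # one column to the west
print(CARDINAL_MAP[step])  # "WEST"
```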
33 changes: 32 additions & 1 deletion maze_dataset/dataset/configs.py
@@ -1,7 +1,10 @@
+import copy
+from typing import Mapping
+
 from maze_dataset.dataset.maze_dataset import MazeDatasetConfig
 from maze_dataset.generation.generators import LatticeMazeGenerators

-MAZE_DATASET_CONFIGS: dict[str, MazeDatasetConfig] = {
+_MAZE_DATASET_CONFIGS_SRC: dict[str, MazeDatasetConfig] = {
     cfg.to_fname(): cfg
     for cfg in [
         MazeDatasetConfig(
@@ -24,3 +27,31 @@
         ),
     ]
 }


+class _MazeDatsetConfigsWrapper(Mapping[str, MazeDatasetConfig]):
+    def __init__(self, configs: dict[str, MazeDatasetConfig]):
+        self._configs = configs
+
+    def __getitem__(self, item: str) -> MazeDatasetConfig:
+        return self._configs[item]
+
+    def __len__(self) -> int:
+        return len(self._configs)
+
+    def __iter__(self):
+        return iter(self._configs)
+
+    def keys(self):
+        return self._configs.keys()
+
+    def items(self):
+        return [(k, copy.deepcopy(v)) for k, v in self._configs.items()]
+
+    def values(self):
+        return [copy.deepcopy(v) for v in self._configs.values()]
+
+
+MAZE_DATASET_CONFIGS: _MazeDatsetConfigsWrapper = _MazeDatsetConfigsWrapper(
+    _MAZE_DATASET_CONFIGS_SRC
+)
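
A short sketch of the copy semantics this wrapper gives, based only on the code shown above: `values()` and `items()` return deep copies, so mutating a retrieved config can't corrupt the shared registry, while the usual `Mapping` interface still works:

```python
from maze_dataset.dataset.configs import MAZE_DATASET_CONFIGS

# behaves like a read-only dict
print(len(MAZE_DATASET_CONFIGS), list(MAZE_DATASET_CONFIGS.keys())[:1])

# values() returns fresh deep copies on each call
cfg_a = MAZE_DATASET_CONFIGS.values()[0]
cfg_b = MAZE_DATASET_CONFIGS.values()[0]
assert cfg_a is not cfg_b  # distinct objects, safe to mutate locally
```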
1 change: 1 addition & 0 deletions maze_dataset/dataset/dataset.py
@@ -88,6 +88,7 @@ def __post_init__(self):

     def summary(self) -> dict:
         """return a summary of the config"""
+        # do we run this to make sure it doesn't error?
         self_ser: dict = self.serialize()
         return dict(
             name=self.name,
