Commit

Merge pull request #38 from understanding-search/tokenizer-overhaul-integration

Tokenizer overhaul (complete)

# TODOs:

- `checks.yml`: ![status](https://github.com/understanding-search/maze-dataset/actions/workflows/checks.yml/badge.svg?branch=tokenizer-overhaul-integration)
- `all-tok-checks.yml`: ![status](https://github.com/understanding-search/maze-dataset/actions/workflows/all-tok-checks.yml/badge.svg?branch=tokenizer-overhaul-integration)

- [x] depends on #37 being merged into `tokenizer-overhaul-integration` first
- [x] depends on tests passing in [`maze-transformer` #214](understanding-search/maze-transformer#214)
    - PR: understanding-search/maze-transformer#214
    - PR checks status: ![status](https://github.com/understanding-search/maze-transformer/actions/workflows/checks.yml/badge.svg?branch=update-maze-dataset-tokenizers-step2)
- [x] ~~make sure that tokenizer hash numpy file is shipped with `maze-dataset` package~~
  - [x] ~~maybe make it optional, only if `tokenization` extra is requested?~~
  - [x] no need to ship this, it's only required for tests
- [x] go over existing `# TODO` comments, see if anything is critical
  - [x] parametrize some tests in `tests/unit/maze_dataset/generation/test_maze_dataset.py`
- [x] clean up various superfluous warnings
- [x] consider optimizations to `all_instances` and `save_hashes`
  - [x] unclear if `save_hashes` is being properly parallelized :/
- [x] see if any more utils should be moved to muutils

# commit log:

aaron-sandoval and others added 30 commits 3 months ago
@aaron-sandoval
Implemented validation_funcs in all_instances, added docs, added … 
5c79d22
@aaron-sandoval
Minor unit test fixes. zanj loading of StepSequencer is broken on t… 
a72680b
@aaron-sandoval
Added StepSizes.Straightaways. test_path_tokenizers passing.
72380c0
@aaron-sandoval
Added some PathTokenizers and tests
f184c1b
@aaron-sandoval
docs
3c35da7
@mivanit
minor assert
5d5b405
@aaron-sandoval
wip reversing import direction between maze_tokenizer.py and lattice_… 
f49d0f2
@aaron-sandoval
Refactored from_tokens to remove import dependency. `test_from_toke… 
7d56a1d
@aaron-sandoval
Moved util.py and token_utils.py out of tokenization directory … 
1795f16
@aaron-sandoval
Fixed hash collisions among different TokenizerElement instances. A… 
cee9eb4
@aaron-sandoval
Added Cardinal. test_path_tokenizers passing
99c6083
@aaron-sandoval
Added Forks. test_path_tokenizers passing.
782f204
@aaron-sandoval
Test fuzzing for test_path_tokenizers
db85262
@aaron-sandoval
Added ForksAndStraightaways. test_path_tokenizers passing
be642e3
@aaron-sandoval
Compacted test_path_tokenizers
5788cf9
@aaron-sandoval
Added Relative. test_path_tokenizers passing. Had to comment out … 
f961707
@aaron-sandoval
Temporarily deactivated StepTokenizers.Cardinal and `StepTokenizers… 
c1e8b52
@aaron-sandoval
Ran make unit. Only test_is_tested_tokenizer and `test_zanj_save_re… 
438deae
@aaron-sandoval
Fixed one non-fuzzed use of ALL_TOKENIZERS in unit tests. That migh… 
9eb338f
@aaron-sandoval
Deleted extraneous files
c3aa60b
@aaron-sandoval
Copied AdjListTokenizers skeleton from Notion
7767ad1
@aaron-sandoval
AdjListTokenizer docs and wip implementation
243a118
@aaron-sandoval
Docs, added new VOCAB items
35f2052
@aaron-sandoval
Docs
7479445
@aaron-sandoval
Added numpy_rng, refactored ChessboardSublattice into EdgePermuters.
bb66156
@aaron-sandoval
EdgePermuter._permute working in debugger for both subclasses
f934fe0
@aaron-sandoval
wip test_edge_permuters
f982e99
@aaron-sandoval
test_edge_permuters passing
81fb7b8
@aaron-sandoval
minor things
3efbc6c
@aaron-sandoval
Added lattice_connection_array
f680b1f
aaron-sandoval and others added 19 commits 3 months ago
@aaron-sandoval
AllLatticeEdges.get_edges written, not tested
ba6404a
@aaron-sandoval
Updated .gitignore for pyenv environment file
fa58ef5
@aaron-sandoval
Bug in CPython raises exception when building constants._VOCAB_BASE… 
f77f077
@aaron-sandoval
Updated poetry.lock
9a52cc7
@aaron-sandoval
test_manhattan_distance passing
00bd0a7
@aaron-sandoval
test_edge_subsets passing
c4253ec
@aaron-sandoval
Made method skeleton for EdgeGrouping and `AdjListTokenizer.to_toke… 
973df5a
@aaron-sandoval
wip test_edge_subsets
64dd2ff
@aaron-sandoval
test_edge_subsets passing
26db9f6
@mivanit
fix local function 
03ae196
@mivanit
underscore prefix for tokenizer element namespace ABCs, fix ABC 
976c361
@mivanit
better error on MazeTokenizer2.from_tokens() 
66d8cfa
@mivanit
fix type hints for get_all_subclasses 
b01aca9
@mivanit
minor cleanup to flatten(), added some tests for tokenizer utils
9bffdd3
@mivanit
run format
9c43aa5
@mivanit
cleanup of connection_list_to_adj_list 
313b1f8
@mivanit
rename and fix unpackable_if_true_attribute 
2395b5b
@mivanit
update poetry lock
b006afe
@mivanit
Merge pull request #35 from understanding-search/tokenizer-overhaul 
33afb1b
@review-notebook-app (bot) commented on Jun 10

Check out this pull request on ReviewNB to see visual diffs and provide feedback on the Jupyter notebooks.

mivanit and others added 40 commits 2 months ago
@mivanit
changed workflow to run checks on any PR
e14ba3d
@mivanit
merge commit?
999b251
@mivanit
format
042923d
@mivanit
update lockfile
6223013
@mivanit
mostly naming changes 
0e63156
@mivanit
format
ad703b0
@mivanit
minor fix to test_isinstance_by_type_name 
1511108
@mivanit
fix import in test_util.py
32c59d5
@mivanit
moved tests requiring all tokenizers into a separate dir 
c194d69
@mivanit
run format
4c52c4b
@mivanit
kill the all_tokenizers call after fixed time (TEMP)
a04f5be
@mivanit
(TEMP) ignore tokenizer deprecation warnings in tests
37314bd
@mivanit
wip
1f16761
@mivanit
wip merging
4be1957
@mivanit
run format
6a719e5
@mivanit
some imports
2aa17e9
@mivanit
unit tests collect, but don't pass
d418dfb
@mivanit
format
5ff09a2
@mivanit
Merge pull request #40 from understanding-search/tokenizer-overhaul-m… 
0f116cd
@mivanit
....wip
0d372f5
@mivanit
merge main into tokenizer-overhaul-integration
17c3209
@aaron-sandoval
Update dependency
27eb7db
@aaron-sandoval
Fixed incomplete refactor of TokenizerElement abstract subclasses. … 
c0da82d
@aaron-sandoval
Added missing @serializable_dataclass decorators
58b893f
@aaron-sandoval
Cleaned up _tokenize_edge_grouping, to_tokens interfaces. Impleme… 
f72a974
@aaron-sandoval
test_lattice_connection_arrray passing
f843a0c
@aaron-sandoval
Added testing_utils.py. Not tested yet
675d844
@aaron-sandoval
test_is_connection passing
27de83b
@aaron-sandoval
test_edge_groupings passing with an added assertion. `_tokenize_edg… 
633c8c0
@aaron-sandoval
Removed duplicated functionality in BothCoords and AllLatticeEdges
00d57b3
@aaron-sandoval
wip debugging _tokenize_edge_grouping
67cda3d
@aaron-sandoval
AdjListCoord.to_tokens slightly hand-tested, unit tests wip
6e6db5f
@aaron-sandoval
wip AdjListCardinal._tokenize_edge_grouping. Snapshot before refact… 
cdeeac0
@aaron-sandoval
Refactored _tokenize_edge_grouping. All `_AdjListTokenizer.to_token… 
a5aa3d9
@aaron-sandoval
wip test_adjlist_tokenizers
76d66de
@aaron-sandoval
wip test_adjlist_tokenizers. Problems with random seeds changing py… 
754c759
@aaron-sandoval
Random seed hackily fixed, pytest running
188309c
@aaron-sandoval
wip test_adjlist_tokenizers. Snapshot before starting `LatticeMaze.… 
1c7dbe6
@aaron-sandoval
test_adjlist_tokenizers passing with 20 tokenizers
6c575d5
@aaron-sandoval
test_adjlist_tokenizers passing with 100 tokenizers
dac7888
aaron-sandoval added 20 commits 2 months ago
@aaron-sandoval
Moved various TokenizerElement tests back to test_tokenizer.py. 16 … 
ca401ff
@aaron-sandoval
Cleanup, added extra pruning to _get_all_tokenizers, but probably i… 
c27f9df
@aaron-sandoval
Renamed MazeTokenizer2 >> MazeTokenizerModular. `test_path_tokeni… 
5512fd4
@aaron-sandoval
wip refactoring all_instances for concision. Now accepts regular `d… 
62e6539
@aaron-sandoval
wip test_all_instances not quite passing. Enhanced all_instances … 
9078f6b
@aaron-sandoval
test_all_tokenizers passing except for case 1-_PathTokenizer, whi… 
0f2fb8a
@aaron-sandoval
Updated muutils and zanj, dc_eq still has bug
ff9f086
@aaron-sandoval
Deleted duplicate utilities
fd8ce29
@aaron-sandoval
Minor generic type fixes, added some unit tests, not run yet
0142fa6
@aaron-sandoval
Refactored MazeTokenizer to MazeTokenizerModular throughout the l… 
bd79997
@aaron-sandoval
Set muutils dependency to last working version until latest bugs are … 
b1e1fba
@aaron-sandoval
Readded pandas, which got removed from dependencies at some point in … 
91c7680
@aaron-sandoval
wip updating nb
bf83a19
@aaron-sandoval
Did some profiling, removed a currerntly redundant validation from `_… 
cd5a455
@aaron-sandoval
wip refactoring uses of all_instances to work with Generator output
c82b3e2
@aaron-sandoval
Remove temporary pruning of _StepTokenizers space. `_get_all_tokeni… 
6951895
@aaron-sandoval
Reinstated pruning
02e39ef
@aaron-sandoval
Small additions for demo nbs
91ada63
@aaron-sandoval
all_instances bugfixes, more unit tests. wip reworking `TokenizerEl… 
ddc72de
@aaron-sandoval
make unit all passing
2f03963
@mivanit added the `enhancement` and `tokenization` labels on Jul 3
mivanit added 13 commits last month
@mivanit
merge in integration branch
ffab764
@mivanit
run format
a45c9d0
@mivanit
new lock file
09e7212
@mivanit
fix an import
ad1506d
@mivanit
update muutils and zanj deps
4619d2c
@mivanit
won't check types on applied_filters or maze_ctor func 
19dfb6e
@mivanit
wip
0fdb2f2
@mivanit
wip
057ab14
@mivanit
wip
147b6d7
@mivanit
wip
d2042d6
@mivanit
Format
c66ac97
@mivanit
random sample of tokenizers to make things run in the notebook
437e34a
@mivanit
one line fix to tokenizers
3371303
@mivanit mentioned this pull request on Jul 5
Add tokenization schemes #30
Open
mivanit and others added 24 commits last month
@mivanit
Merge branch 'main' into fix-save-load-profile
cd1eeb3
@mivanit
format
9783610
@mivanit
Merge branch 'main' into tokenizer-overhaul-integration
290f55d
@aaron-sandoval
Bugfix in test_has_element. Can't figure out why `ALL_TOKENIZER_HAS… 
826ea87
@aaron-sandoval
test_zanj_save_read passing
f1e937e
@aaron-sandoval
Added properties_to_serialize to MazeTokenizerModular. `test_zanj… 
26fb821
@aaron-sandoval
Demo nb milestone, fixed a couple of small bugs
673b986
@aaron-sandoval
wip filtering demo
37c1c75
@aaron-sandoval
wip filtering demo
069ceb3
@mivanit
tentatively marking some tokenizer elements as not supported
da0f197
@aaron-sandoval
Updated muutils dependency 0.6.3
73d2fdb
@aaron-sandoval
Removed utilities migrated to muutils, re-pointed imports
6f0f62c
@aaron-sandoval
Using stable_hash
929ba19
@aaron-sandoval
Swapped in @mark_as_unsupported for old @AbstractMethod temporary… 
0f86ee4
@mivanit
some more removed tokenizers
2293ee9
@aaron-sandoval
Hashing I think is fixed and stable. all_instances still broken
a16b4e4
@aaron-sandoval
Hashing works with the __init_subclass__ hidden field workaround. S… 
dc2f2c1
@aaron-sandoval
test_zanj_save_read passing, ...hashes.npy is out of date so `tes… 
6e15a5c
@aaron-sandoval
wip demo nb
fdd4089
@aaron-sandoval
Added caching back into all_instances while it still uses generator… 
ac84ae1
@aaron-sandoval
Think I finished the demo nb. Can't test it yet due to bug in muutils
1c2662f
@aaron-sandoval
Refactored TokenizerElement to _TokenizerElement. Untested for sa… 
98727aa
@aaron-sandoval
Added overall token counter test. Can't run due to muutils bug
2677bff
@aaron-sandoval
Consolidated util.py into token_utils.py. Imports seem to resolve wit… 
c6153b1
aaron-sandoval and others added 43 commits last month
@aaron-sandoval
Deleted tests which were migrated to muutils
b52c962
@aaron-sandoval
Tried moving connection_list_to_adj_list, is_connection to utils.… 
2b1ffdb
@mivanit
Merge branch 'tokenizer-overhaul-step2' of https://github.com/underst… 
850c24f
@mivanit
merge from main, update deps
ff7f4da
@mivanit
Merge branch 'main' into tokenizer-overhaul-integration
22c9dda
@mivanit
update deps
d96248c
@mivanit
poetry update lockfile
bf1701b
@mivanit
run format
3ab6760
@mivanit
BothCoords is depended on, no longer marked as unsupported
13460d8
@mivanit
missing regex import
2efc29f
@mivanit
format import re
c304f67
@mivanit
fixed that applied_filters list/tuple mismatch bug once and for all
4bf2844
@mivanit
minor import fixes and re-run
13f8593
@mivanit
fixes to endpoint_kwargs
3a201f2
@mivanit
run format
ff9ecd1
@mivanit
wip
8dd038f
@mivanit
re-run notebook
1e15fd9
@mivanit
save_tok_hashes as prereq of test_all_tok
46b3ec7
@mivanit
update deps for spinner
164e91f
@mivanit
run save_tok_hashes
be5ad67
@mivanit
fix has_element() call causing test_all_tok fail
43c3653
@mivanit
change .has_element() to take *args
bfebb29
@mivanit
parallelize test_all_tok w/ pytest-xdist
3740164
@mivanit
improvements & parallelization to save_hashes
8c8a096
@mivanit
faster hashing of new tokenizers
c6a0b7d
@mivanit
blake2b hash func, zipped hashes, more spinners
f2e3be4
@mivanit
format
e8d2758
@mivanit
attempt speed up all_instances
d8afc7d
@mivanit
less verbose test_all_tok
b263564
@mivanit
has_element() fix, back to longer pytest traces
2510d26
@mivanit
test fix
9f5166e
@mivanit
test optimizations
c00338c
@mivanit
test a larger random sample of tokenizers
ab246ae
@mivanit
set n of tokenizers to test via env
e3ce590
@mivanit
better CI, optimizations to testing all toks
5149058
@aaron-sandoval
Added one more condition in test_token_stability
34782a9
@aaron-sandoval
Merge branch 'tokenizer-overhaul-step2' of https://github.com/underst… 
2108e3a
@mivanit
tokenizer hashes now checked differently
b25869e
@mivanit
Merge branch 'tokenizer-overhaul-step2' of https://github.com/underst… 
afea7c2
@aaron-sandoval
Bugfix in test_token_stability, not all mazes have a "<PATH_END>"
8befacc
@mivanit
format
c09ebc5
@aaron-sandoval
Merge branch 'tokenizer-overhaul-step2' of https://github.com/underst… 
a4e67ab
@aaron-sandoval
format
03ed351
This was linked to issues on Jul 25
Add tokenization schemes #30
Open
Duplication between test_coord_str_tuple.py and test_token_utils.py #32
Open
TokenizationMode and MazeTokenizer ugly and not scalable #33
Open
@mivanit marked this pull request as ready for review last month
aaron-sandoval and others added 13 commits last month
@aaron-sandoval
Docs, a bit of cleanup
fa5c691
@aaron-sandoval
Format
51b77ae
@aaron-sandoval
Put back __init_subclass__
92da244
@aaron-sandoval
Format
cd08d53
@aaron-sandoval
Merge pull request #37 from understanding-search/tokenizer-overhaul-s… 
9ded238
@mivanit
Merge branch 'tokenizer-overhaul-integration' into fix-save-load-profile
63c21c9
@mivanit
fixed long-standing issue with serialization cfg mismatch
bfcaeca
@mivanit
fix exception for generation meta mismatch
bb9b69c
@mivanit
Merge pull request #36 from understanding-search/fix-save-load-profile 
7aa97b8
@mivanit
reworked CI
2c16c8d
@mivanit
added comments to save_hashes, testing CI dispatch
6c471ea
@mivanit
will run all-tok-checks if all_tokenizers tests changed
27cd942
@mivanit
all tok modifications 
0c16cd6
@mivanit mentioned this pull request on Jul 26
Support maze dataset tokenizers update understanding-search/maze-transformer#214
 Open
mivanit and others added 5 commits last month
@mivanit
reduce default number of tokenizers tested
4edd476
@mivanit
Merge branch 'main' into tokenizer-overhaul-integration, bump version 
2c19e57
@mivanit
summary should be json-serializable
19644d3
@mivanit
fix notebook
4c8adbf
@aaron-sandoval
test commit
2582299
@aaron-sandoval (Member) commented 27 days ago

> see if any more utils should be moved to muutils

Except for `all_instances`, I don't think any of the utilities newly added to `utils.py` in this PR should be moved, since they apply too narrowly to mazes. There are a few older ones that could be moved. I didn't think you wanted to move elements of the old API if it could be helped, but if you're open to it, here are my recommendations. I only checked for uses in the two libraries.

| Function | Where Used | Recommendation |
|---|---|---|
| `all_instances` | maze-dataset | Move |
| `bool_array_from_string` | maze-dataset, maze-transformer | Move |
| `adj_list_to_nested_set` | maze-dataset, maze-transformer | Maybe keep in maze-dataset, since it's at least lattice-graph-specific |
| `apply_mapping` | Nowhere | Move |
| `apply_mapping_chain` | Nowhere | Move |
@aaron-sandoval (Member) commented 27 days ago

> unclear if `save_hashes` is being properly parallelized :/

I don't know a proper way to verify this, so as a quick check I ran a few trials on my new desktop with different numbers of processes in the `Pool`. Here are the wall-clock times reported by the spinner on `list(pool.map(hash, all_tokenizers))`.

| # Processes | Spinner Runtime [sec] |
|---|---|
| No `Pool`; `list(map(hash, all_tokenizers))` | 59 |
| 1 | 75 |
| 2 | 39 |
| 4 | 21 |
| 6 | 19 |
| 12 | 19 |
| 24 | 21 |
The multiprocessing is clearly helping compared to a single process, but I don't know why it plateaus so early. I tried `Pool.apply_async`, and that was way slower. IMO the runtime doesn't seem long enough to warrant sinking time into this now, since neither of us seems to have the relevant knowledge at the moment. This is the kind of backend optimization that can be done after the 1.0.0 release.
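
For reproducibility, here is roughly the shape of the timing harness — a minimal sketch, where the `get_all_tokenizers` import path is an assumption rather than a confirmed API:

```python
import multiprocessing
import time

# assumed import path for the tokenizer enumeration; hypothetical in this sketch
from maze_dataset.tokenization.all_tokenizers import get_all_tokenizers

if __name__ == "__main__":
    all_tokenizers = get_all_tokenizers()

    # baseline: no Pool at all
    start = time.perf_counter()
    list(map(hash, all_tokenizers))
    print(f"no Pool: {time.perf_counter() - start:.0f}s")

    # same work, varying the Pool size
    for n_procs in (1, 2, 4, 6, 12, 24):
        start = time.perf_counter()
        with multiprocessing.Pool(processes=n_procs) as pool:
            list(pool.map(hash, all_tokenizers))
        print(f"{n_procs:>2} processes: {time.perf_counter() - start:.0f}s")
```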

@aaron-sandoval
Cleaned up test_maze_dataset.py
bfeba59
@aaron-sandoval (Member) commented 2 weeks ago

@mivanit I know we're wrapping this up, but I was thinking about a small change to the `mark_as_unsupported` decorator. It shouldn't change any behavior in any tests. Right now we use that decorator, as well as single-valued `Literal` types, to artificially restrict the space returned by `all_instances`. You had mentioned before that ideally this artificial restriction would hide those subspaces from `all_instances` but still allow users to access them by directly constructing tokenizers with the desired parameters. The `Literal` types allow that direct construction, but the current implementation of `mark_as_unsupported` doesn't, since its dummy abstract method precludes any construction of that class.

What if `mark_as_unsupported` instead overwrote that class's `is_valid` method to always return `False`? That would keep those classes constructable. Calls to `get_all_tokenizers` would still ignore them just the same, and direct use of `all_instances` would ignore them as long as users pass `MAZE_TOKENIZER_MODULAR_DEFAULT_VALIDATION_FUNCS` as the documentation asks. We could maybe include `MAZE_TOKENIZER_MODULAR_DEFAULT_VALIDATION_FUNCS` by default in calls to `all_tokenizers` so that users have to consciously opt out of those validation funcs. A rough sketch of the idea follows.

I was thinking about it when considering if/how to talk about these restricted subspaces in the paper. Right now I'd probably just ignore them completely, but if we made this change I might mention them as experimental, untested features. Let me know what you think, and if you're good with it I can make a super quick commit to implement it.
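
A minimal sketch of the proposal, under the assumption that `mark_as_unsupported` is a plain class decorator (the real implementation details may differ):

```python
from typing import TypeVar

T = TypeVar("T")


def mark_as_unsupported(cls: type[T]) -> type[T]:
    """Hide a tokenizer-element class from `all_instances` without blocking
    direct construction: instead of injecting a dummy abstract method (which
    makes the class uninstantiable), force `is_valid` to always return False
    so that validation-func-based enumeration skips it."""
    cls.is_valid = lambda self, *args, **kwargs: False  # type: ignore[assignment]
    return cls
```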

aaron-sandoval added 3 commits 2 weeks ago
@aaron-sandoval
Added a TODO
8ad6f3a
@aaron-sandoval
bugfix: Shuffling and permuting edges
c9e3bc8
@aaron-sandoval
Added feature to color_tokens_rgb to automate line breaks for repor… 
1aeb408
@mivanit (Member, Author) commented 2 weeks ago

> [quoting @aaron-sandoval's `mark_as_unsupported` proposal above]

You're right that the current implementation would require people to manually edit their locally installed version of the package to get the unsupported tokenizers to work -- I actually think this behavior is fine and serves as a "make sure you know what you're doing" filter, but if you think this is a quick edit then I think it's fine to make, as long as we add a warning when initializing an unsupported tokenizer.
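
A hypothetical sketch folding in that warning condition (not the shipped implementation — the commit log below shows the warnings were subsequently removed):

```python
import warnings


def mark_as_unsupported(cls):
    """Sketch: unsupported elements stay constructable, but warn on
    construction and are skipped by validation-based enumeration."""
    original_init = cls.__init__

    def warned_init(self, *args, **kwargs):
        warnings.warn(
            f"{cls.__name__} is marked as unsupported; its behavior is untested.",
            stacklevel=2,
        )
        original_init(self, *args, **kwargs)

    cls.__init__ = warned_init
    cls.is_valid = lambda self, *args, **kwargs: False
    return cls
```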

mivanit added 10 commits last week
@mivanit
better doc for make test_all_tok
dded010
@mivanit
update dep
3ca392f
@mivanit
removed testing length of ALL_TOKENIZERS 
2b8b2a5
@mivanit
lower n_to_test to see if pass, better conditions
e41c417
@mivanit
format
c3c9635
@mivanit
action was not correctly reading num of tokenizers and parallelization
cd9e069
@mivanit
oops
b7ee88b
@mivanit
poetry update
9ea828d
@mivanit
option to skip hashes test so we can run it separately
08bf285
@mivanit
test_all_tok was always skipping
90612dc
@mivanit (Member, Author) commented last week

Making a note for profiling and figuring out how many tokenizers we can test:

| Time (MM:SS) | Time (s) | n tokenizers tested | Link |
|---|---|---|---|
| 9:54 | 594 | 100 | action |
| 10:05 | 605 | 300 | action |
| 11:15 | 675 | 1000 | action |
| 13:33 | 813 | 3000 | action |
| 16:41 | 1001 | 5000 | action |
(plot: test runtime vs. number of tokenizers tested)
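
The numbers above suggest a near-linear model of fixed CI overhead plus a per-tokenizer cost — a quick fit sketch (my own extrapolation, not from the PR):

```python
import numpy as np

n_tokenizers = np.array([100, 300, 1000, 3000, 5000])
runtime_s = np.array([594, 605, 675, 813, 1001])

# least-squares line: fixed overhead plus per-tokenizer cost
slope, intercept = np.polyfit(n_tokenizers, runtime_s, 1)
print(f"~{intercept:.0f}s fixed + ~{slope * 1000:.0f}s per 1000 tokenizers")
# roughly 580s fixed plus ~80s per 1000 tokenizers tested
```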

@mivanit
bumping default NUM_TOKENIZERS_TO_TEST to 300
f3d2a72
mivanit and others added 18 commits last week
@mivanit
bump default NUM_TOKENIZERS_TO_TEST to 1000
62a02b7
@mivanit
bump default NUM_TOKENIZERS_TO_TEST to 3000
57cd1d1
@mivanit
test 5k tokenizers, expect test to take just over 16 min
3a9701d
@mivanit
bump muutils to 0.6.8, slight updates to spinners
fd6827f
@aaron-sandoval
Modified mark_as_unsupported to allow construction of all unsupport… 
599b955
@aaron-sandoval
wip getting nb to work
ad11705
@aaron-sandoval
Merge branch 'tokenizer-overhaul-integration' into tokenizer-overhaul… 
f03cf2b
@aaron-sandoval
Removed warnings, added unit tests for classes with `mark_as_unsuppor… 
5c7de66
@mivanit
slight clean up to .summary() code
c095d05
@mivanit
change _type to _type_ in MMT to avoid wandb issue 
729bf9a
@mivanit
format
7ed866d
@mivanit
Merge pull request #44 from understanding-search/tokenizer-overhaul-s… 
d5f453b
@mivanit
bump muutils dep to 0.6.9
ebf9a25
@mivanit
Merge branch 'tokenizer-overhaul-integration' of https://github.com/u… 
ee01ade
@mivanit
remove deprecation warning for is_UT and is_AOTP 
eb86a61
@mivanit
reverts and fixes
95c693e
@mivanit
bump muutils to 0.6.10
be7dbfd
@mivanit
MAZE_DATASET_CONFIGS gives a copy, not the original
e3decdb
@aaron-sandoval linked an issue last week that may be closed by this pull request
add option to remove randomization in as_tokens #34
Open
mivanit added 10 commits 5 days ago
@mivanit
fix MAZE_DATASET_CONFIGS interface
5e3d846
@mivanit
testing some changes to loading tokenizer hashes
a1a75d4
@mivanit
update dep
a2daf8e
@mivanit
revert how we load the tokenizer hashes
645fec4
@mivanit
ALL_TOKENIZER_HASHES now private and only gotten when required
7316455
@mivanit
update dep
2fb57c2
@mivanit
removed old setup.py
6a2b8e9
@mivanit
minor notebook cleanup
7b73303
@mivanit
a test no longer breaks collection, fixed a TODO
87c130e
@mivanit
bump version to 1.0.0
a638ef8
@mivanit (Member, Author) commented 1 minute ago

Approving -- this is finally done!

mivanit authored Aug 28, 2024
2 parents aac2402 + a638ef8, commit d5a20d6
Showing 42 changed files with 7,703 additions and 2,307 deletions.
52 changes: 52 additions & 0 deletions .github/workflows/all-tok-checks.yml
@@ -0,0 +1,52 @@
name: All Tokenizer Checks

on:
  push:
    paths:
      - 'maze_dataset/utils.py' # temporary
      - 'maze_dataset/token_utils.py' # temporary
      - 'maze_dataset/constants.py'
      - 'maze_dataset/tokenization/*.py'
      - 'maze_dataset/tokenization/MazeTokenizerModular_hashes.npz'
      - 'notebooks/demo_mazetokenizermodular.ipynb'
      - 'tests/all_tokenizers/*.py'
      - 'pyproject.toml' # on new version or update deps
      - '.github/workflows/all-tok-checks.yml' # changing this file
      - '.lastversion' # on new release
  workflow_dispatch:
    inputs:
      n_to_test:
        description: 'Number of tokenizers to test'
        required: false
        default: 5000
        type: number
      pytest_parallel:
        description: '1 to parallelize tests with -n auto, 0 to run without parallelization'
        required: false
        default: 1
        type: number

jobs:
  all_tok_test:
    name: All Tokenizer Tests
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - uses: actions/setup-python@v5
        with:
          python-version: "3.10"

      - name: Install dependencies
        run: |
          curl -sSL https://install.python-poetry.org | python3 -
          poetry lock --check
          poetry install
      - name: tokenizer hash tests
        run: |
          make test_tok_hashes
      - name: all tokenizer tests
        run: |
          N_TO_TEST=${{ github.event.inputs.n_to_test || '5000' }}
          PYTEST_PARALLEL=${{ github.event.inputs.pytest_parallel || '1' }}
          make test_all_tok SKIP_HASH_TEST=1 NUM_TOKENIZERS_TO_TEST=$N_TO_TEST PYTEST_PARALLEL=$PYTEST_PARALLEL
6 changes: 5 additions & 1 deletion .github/workflows/checks.yml
@@ -4,7 +4,7 @@ on:
   workflow_dispatch:
   pull_request:
     branches:
-      - main
+      - '*'
   push:
     branches:
       - main
@@ -31,6 +31,7 @@ jobs:
       run: |
        pip install black
        python -m black --check .
+
   test:
     name: Test
     env:
@@ -70,3 +71,6 @@ jobs:
       - name: Notebook tests
        run: make test_notebooks
+
+      - name: Test Tokenizer Hashes
+        run: make test_tok_hashes
2 changes: 2 additions & 0 deletions .gitignore
@@ -5,6 +5,7 @@ tests/_temp/**
 htmlcov/
 .vscode/**
 notebooks/data/**
+*.bak

 .pypi-token
 .commit_log
@@ -36,6 +37,7 @@ share/python-wheels/
 .installed.cfg
 *.egg
 MANIFEST
+.python-version

 # PyInstaller
 # Usually these files are written by a python script from a template
4 changes: 2 additions & 2 deletions README.md
Original file line number Diff line number Diff line change
Expand Up @@ -75,8 +75,8 @@ m.as_ascii()
# RGB image, optionally without solution or endpoints, suitable for CNNs
m.as_pixels()
# text format for autoreregressive transformers
from maze_dataset.tokenization import MazeTokenizer, TokenizationMode
m.as_tokens(maze_tokenizer=MazeTokenizer(
from maze_dataset.tokenization import MazeTokenizerModular, TokenizationMode
m.as_tokens(maze_tokenizer=MazeTokenizerModular(
tokenization_mode=TokenizationMode.AOTP_UT_rasterized, max_grid_size=100,
))
# advanced visualization with many features
Expand Down
Binary file added docs/mazeplot_3x3.png
36 changes: 33 additions & 3 deletions makefile
@@ -65,12 +65,18 @@ check-format: clean
 # coverage reports & benchmarks
 # --------------------------------------------------

 # whether to run pytest with coverage report generation
+PYTEST_OPTIONS ?=
 COV ?= 1

 ifeq ($(COV),1)
-	PYTEST_OPTIONS=--cov=.
-else
-	PYTEST_OPTIONS=
+	PYTEST_OPTIONS+=--cov=.
 endif

+PYTEST_PARALLEL ?= 0
+ifeq ($(PYTEST_PARALLEL),1)
+	PYTEST_OPTIONS+=-n auto
+endif

 .PHONY: cov
@@ -94,6 +100,30 @@ unit:
 	@echo "run unit tests"
 	$(POETRY_RUN_PYTHON) -m pytest $(PYTEST_OPTIONS) tests/unit

+.PHONY: save_tok_hashes
+save_tok_hashes:
+	@echo "generate and save tokenizer hashes"
+	$(POETRY_RUN_PYTHON) -m maze_dataset.tokenization.save_hashes -p
+
+.PHONY: test_tok_hashes
+test_tok_hashes:
+	@echo "re-run tokenizer hashes and compare"
+	$(POETRY_RUN_PYTHON) -m maze_dataset.tokenization.save_hashes -p --check
+
+.PHONY: test_all_tok
+test_all_tok:
+	@echo "run tests on all tokenizers. can pass NUM_TOKENIZERS_TO_TEST arg or SKIP_HASH_TEST"
+	@echo "NUM_TOKENIZERS_TO_TEST=$(NUM_TOKENIZERS_TO_TEST)"
+	@if [ "$(SKIP_HASH_TEST)" != "1" ]; then \
+		echo "Running tokenizer hash tests"; \
+		$(MAKE) test_tok_hashes; \
+	else \
+		echo "Skipping tokenizer hash tests"; \
+	fi
+	$(POETRY_RUN_PYTHON) -m pytest $(PYTEST_OPTIONS) --verbosity=-1 --durations=50 tests/all_tokenizers

 .PHONY: convert_notebooks
 convert_notebooks:
15 changes: 14 additions & 1 deletion maze_dataset/__init__.py
@@ -1,5 +1,11 @@
 from maze_dataset.constants import (
     SPECIAL_TOKENS,
+    VOCAB,
+    VOCAB_LIST,
+    VOCAB_TOKEN_TO_INDEX,
+    Connection,
+    ConnectionArray,
+    ConnectionList,
     Coord,
     CoordArray,
     CoordList,
@@ -15,15 +21,22 @@
     set_serialize_minimal_threshold,
 )
 from maze_dataset.generation.generators import LatticeMazeGenerators
-from maze_dataset.maze.lattice_maze import LatticeMaze, SolvedMaze
+from maze_dataset.maze.lattice_maze import LatticeMaze, SolvedMaze, TargetedLatticeMaze

 __all__ = [
     "Coord",
     "CoordTup",
     "CoordList",
     "CoordArray",
+    "Connection",
+    "ConnectionList",
+    "ConnectionArray",
     "SPECIAL_TOKENS",
+    "VOCAB",
+    "VOCAB_LIST",
+    "VOCAB_TOKEN_TO_INDEX",
     "LatticeMaze",
+    "TargetedLatticeMaze",
     "SolvedMaze",
     "MazeDataset",
     "MazeDatasetConfig",
85 changes: 81 additions & 4 deletions maze_dataset/constants.py
@@ -1,13 +1,18 @@
 import warnings
-from dataclasses import dataclass
+from dataclasses import dataclass, field, make_dataclass

 import numpy as np
-from jaxtyping import Int8
+from jaxtyping import Bool, Int8

-Coord = Int8[np.ndarray, "x y"]
+from maze_dataset.utils import corner_first_ndindex
+
+Coord = Int8[np.ndarray, "row_col"]
 CoordTup = tuple[int, int]
-CoordArray = Int8[np.ndarray, "coord x y"]
+CoordArray = Int8[np.ndarray, "coord row_col"]
 CoordList = list[CoordTup]
+Connection = Int8[np.ndarray, "coord=2 row_col=2"]
+ConnectionList = Bool[np.ndarray, "lattice_dim=2 row col"]
+ConnectionArray = Int8[np.ndarray, "edges leading_trailing_coord=2 row_col=2"]


 class SpecialTokensError(Exception):
@@ -115,3 +120,75 @@ def keys(self):
         [-1, 0],  # left
     ]
 )


+_VOCAB_FIELDS: list = [
+    # *[(k, str, field(default=v)) for k, v in SPECIAL_TOKENS.items()],
+    ("COORD_PRE", str, field(default="(")),
+    ("COORD_INTRA", str, field(default=",")),
+    ("COORD_POST", str, field(default=")")),
+    ("TARGET_INTRA", str, field(default="=")),
+    ("TARGET_POST", str, field(default="||")),
+    ("PATH_INTRA", str, field(default=":")),
+    ("PATH_POST", str, field(default="THEN")),
+    ("NEGATIVE", str, field(default="-")),
+    ("UNKNOWN", str, field(default="<UNK>")),
+    *[
+        (f"TARGET_{a}", str, field(default=f"TARGET_{a}"))
+        for a in "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
+    ],
+    ("TARGET_NORTH", str, field(default="TARGET_NORTH")),
+    ("TARGET_SOUTH", str, field(default="TARGET_SOUTH")),
+    ("TARGET_EAST", str, field(default="TARGET_EAST")),
+    ("TARGET_WEST", str, field(default="TARGET_WEST")),
+    ("TARGET_NORTHEAST", str, field(default="TARGET_NORTHEAST")),
+    ("TARGET_NORTHWEST", str, field(default="TARGET_NORTHWEST")),
+    ("TARGET_SOUTHEAST", str, field(default="TARGET_SOUTHEAST")),
+    ("TARGET_SOUTHWEST", str, field(default="TARGET_SOUTHWEST")),
+    ("TARGET_CENTER", str, field(default="TARGET_CENTER")),
+    ("PATH_NORTH", str, field(default="NORTH")),
+    ("PATH_SOUTH", str, field(default="SOUTH")),
+    ("PATH_EAST", str, field(default="EAST")),
+    ("PATH_WEST", str, field(default="WEST")),
+    ("PATH_FORWARD", str, field(default="FORWARD")),
+    ("PATH_BACKWARD", str, field(default="BACKWARD")),
+    ("PATH_LEFT", str, field(default="LEFT")),
+    ("PATH_RIGHT", str, field(default="RIGHT")),
+    ("PATH_STAY", str, field(default="STAY")),
+    *[
+        (f"I_{i:03}", str, field(default=f"+{i}")) for i in range(256)
+    ],  # General purpose positive int tokens. Used by `StepTokenizers.Distance`.
+    *[
+        (f"CTT_{i}", str, field(default=f"{i}")) for i in range(128)
+    ],  # Coord tuple tokens
+    *[
+        (f"I_N{-i:03}", str, field(default=f"{i}")) for i in range(-256, 0)
+    ],  # General purpose negative int tokens
+    ("PATH_PRE", str, field(default="STEP")),
+    ("ADJLIST_PRE", str, field(default="ADJ_GROUP")),
+    ("ADJLIST_INTRA", str, field(default="&")),
+    ("ADJLIST_WALL", str, field(default="<XX>")),
+    *[(f"RESERVE_{i}", str, field(default=f"<RESERVE_{i}>")) for i in range(708, 1596)],
+    *[
+        (f"UT_{x:02}_{y:02}", str, field(default=f"({x},{y})"))
+        for x, y in corner_first_ndindex(50)
+    ],
+]


+_VOCAB_BASE: type = make_dataclass(
+    "_VOCAB_BASE", fields=_VOCAB_FIELDS, bases=(_SPECIAL_TOKENS_BASE,), frozen=True
+)
+# TODO: edit __getitem__ to add warning for accessing a RESERVE token
+
+VOCAB: _VOCAB_BASE = _VOCAB_BASE()
+VOCAB_LIST: list[str] = list(VOCAB.values())
+VOCAB_TOKEN_TO_INDEX: dict[str, int] = {token: i for i, token in enumerate(VOCAB_LIST)}
+
+# CARDINAL_MAP: Maps tuple(coord1 - coord0) : cardinal direction
+CARDINAL_MAP: dict[tuple[int, int], str] = {
+    (-1, 0): VOCAB.PATH_NORTH,
+    (1, 0): VOCAB.PATH_SOUTH,
+    (0, -1): VOCAB.PATH_WEST,
+    (0, 1): VOCAB.PATH_EAST,
+}
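
For orientation, a small usage sketch of the generated vocabulary and `CARDINAL_MAP` — hedged: the exports are taken from the `__init__.py` diff above, and the printed values from the defaults shown:

```python
import numpy as np

from maze_dataset import VOCAB, VOCAB_TOKEN_TO_INDEX
from maze_dataset.constants import CARDINAL_MAP

# VOCAB is an instance of the frozen dataclass built by make_dataclass;
# each field's default is the token string itself
print(VOCAB.COORD_PRE)   # "("
print(VOCAB.PATH_NORTH)  # "NORTH"

# token string -> integer index into the flattened vocabulary
print(VOCAB_TOKEN_TO_INDEX[VOCAB.ADJLIST_WALL])

# CARDINAL_MAP is keyed by the (row, col) delta: coord1 - coord0
step = tuple(np.array([2, 1]) - np.array([2, 2]))  # one column to the west
print(CARDINAL_MAP[step])  # "WEST"
```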
33 changes: 32 additions & 1 deletion maze_dataset/dataset/configs.py
@@ -1,7 +1,10 @@
+import copy
+from typing import Mapping
+
 from maze_dataset.dataset.maze_dataset import MazeDatasetConfig
 from maze_dataset.generation.generators import LatticeMazeGenerators

-MAZE_DATASET_CONFIGS: dict[str, MazeDatasetConfig] = {
+_MAZE_DATASET_CONFIGS_SRC: dict[str, MazeDatasetConfig] = {
     cfg.to_fname(): cfg
     for cfg in [
         MazeDatasetConfig(
@@ -24,3 +27,31 @@
         ),
     ]
 }


+class _MazeDatsetConfigsWrapper(Mapping[str, MazeDatasetConfig]):
+    def __init__(self, configs: dict[str, MazeDatasetConfig]):
+        self._configs = configs
+
+    def __getitem__(self, item: str) -> MazeDatasetConfig:
+        return self._configs[item]
+
+    def __len__(self) -> int:
+        return len(self._configs)
+
+    def __iter__(self):
+        return iter(self._configs)
+
+    def keys(self):
+        return self._configs.keys()
+
+    def items(self):
+        return [(k, copy.deepcopy(v)) for k, v in self._configs.items()]
+
+    def values(self):
+        return [copy.deepcopy(v) for v in self._configs.values()]
+
+
+MAZE_DATASET_CONFIGS: _MazeDatsetConfigsWrapper = _MazeDatsetConfigsWrapper(
+    _MAZE_DATASET_CONFIGS_SRC
+)
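
A short sketch of the copy semantics this wrapper gives, based only on the code shown above: `values()` and `items()` return deep copies, so mutating a retrieved config can't corrupt the shared registry, while the usual `Mapping` interface still works:

```python
from maze_dataset.dataset.configs import MAZE_DATASET_CONFIGS

# behaves like a read-only dict
print(len(MAZE_DATASET_CONFIGS), list(MAZE_DATASET_CONFIGS.keys())[:1])

# values() returns fresh deep copies on each call
cfg_a = MAZE_DATASET_CONFIGS.values()[0]
cfg_b = MAZE_DATASET_CONFIGS.values()[0]
assert cfg_a is not cfg_b  # distinct objects, safe to mutate locally
```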
1 change: 1 addition & 0 deletions maze_dataset/dataset/dataset.py
@@ -88,6 +88,7 @@ def __post_init__(self):

     def summary(self) -> dict:
         """return a summary of the config"""
+        # do we run this to make sure it doesn't error?
         self_ser: dict = self.serialize()
         return dict(
             name=self.name,
