Support maze dataset tokenizers update #214

Open
wants to merge 58 commits into base: main

Conversation

@aaron-sandoval (Member) commented May 17, 2024

Update to support maze-dataset #38

@aaron-sandoval aaron-sandoval added the enhancement New feature or request label May 17, 2024
@aaron-sandoval aaron-sandoval self-assigned this May 17, 2024

@@ -6,7 +6,7 @@
from maze_transformer.training.config import ConfigHolder


@freeze
Member Author commented:
It looks like freeze can only act on objects, not types. It raises an exception when it recursively calls freeze(TRAIN_SAVE_FILES.__dict__), since it can't act on a mappingproxy.
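For context, a minimal sketch (not from the repo) of the underlying issue: a class's `__dict__` is a read-only `mappingproxy`, so a recursive freeze that tries to handle it like a plain dict has nothing it can do with it.

```python
# Minimal sketch of the mappingproxy limitation described above; TrainSaveFiles
# is a stand-in class, not the real TRAIN_SAVE_FILES from the repo.
class TrainSaveFiles:
    model_final = "model.final.zanj"

proxy = TrainSaveFiles.__dict__
print(type(proxy))  # <class 'mappingproxy'>

try:
    # writing a value back into the class namespace this way is not possible
    proxy["model_final"] = "frozen"
except TypeError as err:
    print("cannot mutate a class __dict__ directly:", err)
```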

@mivanit mivanit marked this pull request as ready for review July 26, 2024 17:28
@mivanit mivanit changed the title Support maze dataset PR #37: Tokenizers Step 2 Support maze dataset tokenizers update Jul 26, 2024
@mivanit mivanit mentioned this pull request Aug 20, 2024
turns out that the outputs were actually equal to within tolerances,
but argsort for large enough vocabularies fails (since weight recovery is not perfect and introduces some error)
old model had no tokenizer saved, caused issues
- demo cfg is an actual training run with 10k samples, too long for CI
- test cfg is much shorter
- also re-ran the notebook
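A minimal numpy sketch (not the repo's test) of the argsort note above: two logit vectors can agree within tolerance while their argsort orderings still differ, since near-ties get flipped by the small weight-recovery error, and larger vocabularies have more near-ties.

```python
# Sketch only: demonstrates argsort disagreement under tiny perturbations.
import numpy as np

rng = np.random.default_rng(0)
logits = rng.normal(size=50_000)                                 # stand-in "true" logits
recovered = logits + rng.normal(scale=1e-6, size=logits.shape)   # tiny recovery error

print(np.allclose(logits, recovered, atol=1e-5))                  # True: equal within tolerance
print(np.array_equal(np.argsort(logits), np.argsort(recovered)))  # very likely False: ranking differs
```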
@mivanit (Member) commented Aug 20, 2024

So, integration tests are hanging after they complete.

Pretty sure I've narrowed this down to a wandb issue, since 8311eb8 stops test_eval_model.py from hanging just by moving test_model_loading.

mivanit added a commit to understanding-search/maze-dataset that referenced this pull request Aug 21, 2024
wandb fails in the backend when passing data with this key and a value
longer than 64 chars, see:
wandb/wandb#8159

this was causing integration tests to hang indefinitely in understanding-search/maze-transformer#214
- the `MAZE_DATASET_CONFIGS` key no longer matches its value -- asked for 100 mazes, got just 1
- it only breaks for test_model_loading, no other functions
- it only breaks for 100 -- 99 and 101 both work?
- maze dataset configs were being fetched from `MAZE_DATASET_CONFIGS` by reference
  - this is also modified upstream to avoid this issue
- configs were then passed by reference again in the validation dataset creation step
- configs then get updated to match the new dataset size, which modifies the original
- the first test works but subsequent tests fail

passing the (cheap) configs around by copy fixes this, I think
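A minimal sketch of the by-reference bug and the copy fix described above, using simplified stand-ins for `MAZE_DATASET_CONFIGS` and `MazeDatasetConfig` (the real config has many more fields):

```python
# Sketch only: shows how handing out the stored config object lets later
# in-place edits (e.g. shrinking n_mazes for a validation set) corrupt the registry.
from copy import deepcopy
from dataclasses import dataclass

@dataclass
class MazeDatasetConfig:   # simplified stand-in
    name: str
    n_mazes: int

MAZE_DATASET_CONFIGS = {"demo-100": MazeDatasetConfig("demo-100", n_mazes=100)}

def get_config_by_ref(key):
    return MAZE_DATASET_CONFIGS[key]            # buggy: caller can mutate the registry entry

def get_config_copy(key):
    return deepcopy(MAZE_DATASET_CONFIGS[key])  # fix: each caller gets its own copy

cfg = get_config_by_ref("demo-100")
cfg.n_mazes = 1                                  # trimmed for a small validation dataset
print(MAZE_DATASET_CONFIGS["demo-100"].n_mazes)  # 1 -- the shared config got clobbered

MAZE_DATASET_CONFIGS["demo-100"].n_mazes = 100   # reset for the second half of the demo
cfg2 = get_config_copy("demo-100")
cfg2.n_mazes = 1
print(MAZE_DATASET_CONFIGS["demo-100"].n_mazes)  # 100 -- original left intact
```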
mivanit added a commit to understanding-search/maze-dataset that referenced this pull request Aug 28, 2024
…ntegration

Tokenizer overhaul (complete)

# TODOs:

- `checks.yml`: ![status](https://github.com/understanding-search/maze-dataset/actions/workflows/checks.yml/badge.svg?branch=tokenizer-overhaul-integration)
- `all-tok-checks.yml`: ![status](https://github.com/understanding-search/maze-dataset/actions/workflows/all-tok-checks.yml/badge.svg?branch=tokenizer-overhaul-integration)

- [x] depends on #37 being merged into `tokenizer-overhaul-integration` first
- [x] depends on tests passing in [`maze-transformer` #214](understanding-search/maze-transformer#214)
    - PR: understanding-search/maze-transformer#214
    - PR checks status: ![status](https://github.com/understanding-search/maze-transformer/actions/workflows/checks.yml/badge.svg?branch=update-maze-dataset-tokenizers-step2)
- [x] ~~make sure that tokenizer hash numpy file is shipped with `maze-dataset` package~~
  - [x] ~~maybe make it optional, only if `tokenization` extra is requested?~~
  - [x] no need to ship this, it's only required for tests
- [x] go over existing `# TODO` comments, see if anything is critical
  - [x] parametrize some tests in `tests/unit/maze_dataset/generation/test_maze_dataset.py`
- [x] clean up various superfluous warnings
- [x] consider optimizations to `all_instances` and `save_hashes`
  - [x] unclear if `save_hashes` is being properly parallelized :/
- [x] see if any more utils should be moved to muutils

# commit log:

aaron-sandoval and others added 30 commits 3 months ago
@aaron-sandoval
Implemented validation_funcs in all_instances, added docs, added … 
5c79d22
@aaron-sandoval
Minor unit test fixes. zanj loading of StepSequencer is broken on t… 
a72680b
@aaron-sandoval
Added StepSizes.Straightaways. test_path_tokenizers passing.
72380c0
@aaron-sandoval
Added some PathTokenizers and tests
f184c1b
@aaron-sandoval
docs
3c35da7
@mivanit
minor assert
5d5b405
@aaron-sandoval
wip reversing import direction between maze_tokenizer.py and lattice_… 
f49d0f2
@aaron-sandoval
Refactored from_tokens to remove import dependency. `test_from_toke… 
7d56a1d
@aaron-sandoval
Moved util.py and token_utils.py out of tokenization directory … 
1795f16
@aaron-sandoval
Fixed hash collisions among different TokenizerElement instances. A… 
cee9eb4
@aaron-sandoval
Added Cardinal. test_path_tokenizers passing
99c6083
@aaron-sandoval
Added Forks. test_path_tokenizers passing.
782f204
@aaron-sandoval
Test fuzzing for test_path_tokenizers
db85262
@aaron-sandoval
Added ForksAndStraightaways. test_path_tokenizers passing
be642e3
@aaron-sandoval
Compacted test_path_tokenizers
5788cf9
@aaron-sandoval
Added Relative. test_path_tokenizers passing. Had to comment out … 
f961707
@aaron-sandoval
Temporarily deactivated StepTokenizers.Cardinal and `StepTokenizers… 
c1e8b52
@aaron-sandoval
Ran make unit. Only test_is_tested_tokenizer and `test_zanj_save_re… 
438deae
@aaron-sandoval
Fixed one non-fuzzed use of ALL_TOKENIZERS in unit tests. That migh… 
9eb338f
@aaron-sandoval
Deleted extraneous files
c3aa60b
@aaron-sandoval
Copied AdjListTokenizers skeleton from Notion
7767ad1
@aaron-sandoval
AdjListTokenizer docs and wip implementation
243a118
@aaron-sandoval
Docs, added new VOCAB items
35f2052
@aaron-sandoval
Docs
7479445
@aaron-sandoval
Added numpy_rng, refactored ChessboardSublattice into EdgePermuters.
bb66156
@aaron-sandoval
EdgePermuter._permute working in debugger for both subclasses
f934fe0
@aaron-sandoval
wip test_edge_permuters
f982e99
@aaron-sandoval
test_edge_permuters passing
81fb7b8
@aaron-sandoval
minor things
3efbc6c
@aaron-sandoval
Added lattice_connection_array
f680b1f
aaron-sandoval and others added 19 commits 3 months ago
@aaron-sandoval
AllLatticeEdges.get_edges written, not tested
ba6404a
@aaron-sandoval
Updated .gitignore for pyenv environment file
fa58ef5
@aaron-sandoval
Bug in CPython raises exception when building constants._VOCAB_BASE… 
f77f077
@aaron-sandoval
Updated poetry.lock
9a52cc7
@aaron-sandoval
test_manhattan_distance passing
00bd0a7
@aaron-sandoval
test_edge_subsets passing
c4253ec
@aaron-sandoval
Made method skeleton for EdgeGrouping and `AdjListTokenizer.to_toke… 
973df5a
@aaron-sandoval
wip test_edge_subsets
64dd2ff
@aaron-sandoval
test_edge_subsets passing
26db9f6
@mivanit
fix local function 
03ae196
@mivanit
underscore prefix for tokenizer element namespace ABCs, fix ABC 
976c361
@mivanit
better error on MazeTokenizer2.from_tokens() 
66d8cfa
@mivanit
fix type hints for get_all_subclasses 
b01aca9
@mivanit
minor cleanup to flatten(), added some tests for tokenizer utils
9bffdd3
@mivanit
run format
9c43aa5
@mivanit
cleanup of connection_list_to_adj_list 
313b1f8
@mivanit
rename and fix unpackable_if_true_attribute 
2395b5b
@mivanit
update poetry lock
b006afe
@mivanit
Merge pull request #35 from understanding-search/tokenizer-overhaul 
33afb1b
@review-notebook-app (Review Notebook App bot) commented on Jun 10
Check out this pull request on ReviewNB to see visual diffs & provide feedback on Jupyter Notebooks.

mivanit and others added 40 commits 2 months ago
@mivanit
changed workflow to run checks on any PR
e14ba3d
@mivanit
merge commit?
999b251
@mivanit
format
042923d
@mivanit
update lockfile
6223013
@mivanit
mostly naming changes 
0e63156
@mivanit
format
ad703b0
@mivanit
minor fix to test_isinstance_by_type_name 
1511108
@mivanit
fix import in test_util.py
32c59d5
@mivanit
moved tests requiring all tokenizers into a separate dir 
c194d69
@mivanit
run format
4c52c4b
@mivanit
kill the all_tokenizers call after fixed time (TEMP)
a04f5be
@mivanit
(TEMP) ignore tokenizer deprecation warnings in tests
37314bd
@mivanit
wip
1f16761
@mivanit
wip merging
4be1957
@mivanit
run format
6a719e5
@mivanit
some imports
2aa17e9
@mivanit
unit tests collect, but don't pass
d418dfb
@mivanit
format
5ff09a2
@mivanit
Merge pull request #40 from understanding-search/tokenizer-overhaul-m… 
0f116cd
@mivanit
....wip
0d372f5
@mivanit
merge main into tokenizer-overhaul-integration
17c3209
@aaron-sandoval
Update dependency
27eb7db
@aaron-sandoval
Fixed incomplete refactor of TokenizerElement abstract subclasses. … 
c0da82d
@aaron-sandoval
Added missing @serializable_dataclass decorators`
58b893f
@aaron-sandoval
Cleaned up _tokenize_edge_grouping, to_tokens interfaces. Impleme… 
f72a974
@aaron-sandoval
test_lattice_connection_arrray passing
f843a0c
@aaron-sandoval
Added testing_utils.py. Not tested yet
675d844
@aaron-sandoval
test_is_connection passing
27de83b
@aaron-sandoval
test_edge_groupings passing with an added assertion. `_tokenize_edg… 
633c8c0
@aaron-sandoval
Removed duplicated functionality in BothCoords and AllLatticeEdges
00d57b3
@aaron-sandoval
wip debugging _tokenize_edge_grouping
67cda3d
@aaron-sandoval
AdjListCoord.to_tokens slightly hand-tested, unit tests wip
6e6db5f
@aaron-sandoval
wip AdjListCardinal._tokenize_edge_grouping. Snapshot before refact… 
cdeeac0
@aaron-sandoval
Refactored _tokenize_edge_grouping. All `_AdjListTokenizer.to_token… 
a5aa3d9
@aaron-sandoval
wip test_adjlist_tokenizers
76d66de
@aaron-sandoval
wip test_adjlist_tokenizers. Problems with random seeds changing py… 
754c759
@aaron-sandoval
Random seed hackily fixed, pytest running
188309c
@aaron-sandoval
wip test_adjlist_tokenizers. Snapshot before starting `LatticeMaze.… 
1c7dbe6
@aaron-sandoval
test_adjlist_tokenizers passing with 20 tokenizers
6c575d5
@aaron-sandoval
test_adjlist_tokenizers passing with 100 tokenizers
dac7888
aaron-sandoval added 20 commits 2 months ago
@aaron-sandoval
Moved various TokenizerElement tests back to test_tokenizer.py. 16 … 
ca401ff
@aaron-sandoval
Cleanup, added extra pruning to _get_all_tokenizers, but probably i… 
c27f9df
@aaron-sandoval
Renamed MazeTokenizer2 >> MazeTokenizerModular. `test_path_tokeni… 
5512fd4
@aaron-sandoval
wip refactoring all_instances for concision. Now accepts regular `d… 
62e6539
@aaron-sandoval
wip test_all_instances not quite passing. Enhanced all_instances … 
9078f6b
@aaron-sandoval
test_all_tokenizers passing except for case 1-_PathTokenizer, whi… 
0f2fb8a
@aaron-sandoval
Updated muutils and zanj, dc_eq still has bug
ff9f086
@aaron-sandoval
Deleted duplicate utilities
fd8ce29
@aaron-sandoval
Minor generic type fixes, added some unit tests, not run yet
0142fa6
@aaron-sandoval
Refactored MazeTokenizer to MazeTokenizerModular throughout the l… 
bd79997
@aaron-sandoval
Set muutils dependency to last working version until latest bugs are … 
b1e1fba
@aaron-sandoval
Readded pandas, which got removed from dependencies at some point in … 
91c7680
@aaron-sandoval
wip updating nb
bf83a19
@aaron-sandoval
Did some profiling, removed a currerntly redundant validation from `_… 
cd5a455
@aaron-sandoval
wip refactoring uses of all_instances to work with Generator output
c82b3e2
@aaron-sandoval
Remove temporary pruning of _StepTokenizers space. `_get_all_tokeni… 
6951895
@aaron-sandoval
Reinstated pruning
02e39ef
@aaron-sandoval
Small additions for demo nbs
91ada63
@aaron-sandoval
all_instances bugfixes, more unit tests. wip reworking `TokenizerEl… 
ddc72de
@aaron-sandoval
make unit all passing
2f03963
@mivanit mivanit added enhancement tokenization labels on Jul 3
mivanit added 13 commits last month
@mivanit
merge in integration branch
ffab764
@mivanit
run format
a45c9d0
@mivanit
new lock file
09e7212
@mivanit
fix an import
ad1506d
@mivanit
update muutils and zanj deps
4619d2c
@mivanit
won't check types on applied_filters or maze_ctor func 
19dfb6e
@mivanit
wip
0fdb2f2
@mivanit
wip
057ab14
@mivanit
wip
147b6d7
@mivanit
wip
d2042d6
@mivanit
Format
c66ac97
@mivanit
random sample of tokenizers to make things run in the notebook
437e34a
@mivanit
one line fix to tokenizers
3371303
@mivanit mivanit mentioned this pull request on Jul 5
Add tokenization schemes #30
Open
mivanit and others added 24 commits last month
@mivanit
Merge branch 'main' into fix-save-load-profile
cd1eeb3
@mivanit
format
9783610
@mivanit
Merge branch 'main' into tokenizer-overhaul-integration
290f55d
@aaron-sandoval
Bugfix in test_has_element. Can't figure out why `ALL_TOKENIZER_HAS… 
826ea87
@aaron-sandoval
test_zanj_save_read passing
f1e937e
@aaron-sandoval
Added properties_to_serialize to MazeTokenizerModular. `test_zanj… 
26fb821
@aaron-sandoval
Demo nb milestone, fixed a couple of small bugs
673b986
@aaron-sandoval
wip filtering demo
37c1c75
@aaron-sandoval
wip filtering demo
069ceb3
@mivanit
tentatively marking some tokenizer elements as not supported
da0f197
@aaron-sandoval
Updated muutils dependency 0.6.3
73d2fdb
@aaron-sandoval
Removed utilities migrated to muutils, re-pointed imports
6f0f62c
@aaron-sandoval
Using stable_hash
929ba19
@aaron-sandoval
Swapped in @mark_as_unsupported for old @AbstractMethod temporary… 
0f86ee4
@mivanit
some more removed tokenizers
2293ee9
@aaron-sandoval
Hashing I think is fixed and stable. all_instances still broken
a16b4e4
@aaron-sandoval
Hashing works with the __init_subclass__ hidden field workaround. S… 
dc2f2c1
@aaron-sandoval
test_zanj_save_read passing, ...hashes.npy is out of date so `tes… 
6e15a5c
@aaron-sandoval
wip demo nb
fdd4089
@aaron-sandoval
Added caching back into all_instances while it still uses generator… 
ac84ae1
@aaron-sandoval
Think I finished the demo nb. Can't test it yet due to bug in muutils
1c2662f
@aaron-sandoval
Refactored TokenizerElement to _TokenizerElement. Untested for sa… 
98727aa
@aaron-sandoval
Added overall token counter test. Can't run due to muutils bug
2677bff
@aaron-sandoval
Consolidated util.py into token_utils.py. Imports seem to resolve wit… 
c6153b1
aaron-sandoval and others added 43 commits last month
@aaron-sandoval
Deleted tests which were migrated to muutils
b52c962
@aaron-sandoval
Tried moving connection_list_to_adj_list, is_connection to utils.… 
2b1ffdb
@mivanit
Merge branch 'tokenizer-overhaul-step2' of https://github.com/underst… 
850c24f
@mivanit
merge from main, update deps
ff7f4da
@mivanit
Merge branch 'main' into tokenizer-overhaul-integration
22c9dda
@mivanit
update deps
d96248c
@mivanit
poetry update lockfile
bf1701b
@mivanit
run format
3ab6760
@mivanit
BothCoords is depended on, no longer marked as unsupported
13460d8
@mivanit
missing regex import
2efc29f
@mivanit
format import re
c304f67
@mivanit
fixed that applied_filters list/tuple mismatch bug once and for all
4bf2844
@mivanit
minor import fixes and re-run
13f8593
@mivanit
fixes to endpoint_kwargs
3a201f2
@mivanit
run format
ff9ecd1
@mivanit
wip
8dd038f
@mivanit
re-run notebook
1e15fd9
@mivanit
save_tok_hashes as prereq of test_all_tok
46b3ec7
@mivanit
update deps for spinner
164e91f
@mivanit
run save_tok_hashes
be5ad67
@mivanit
fix has_element() call causing test_all_tok fail
43c3653
@mivanit
change .has_element() to take *args
bfebb29
@mivanit
parallelize test_all_tok w/ pytest-xdist
3740164
@mivanit
improvements & parallelization to save_hashes
8c8a096
@mivanit
faster hashing of new tokenizers
c6a0b7d
@mivanit
blake2b hash func, zipped hashes, more spinners
f2e3be4
@mivanit
format
e8d2758
@mivanit
attempt speed up all_instances
d8afc7d
@mivanit
less verbose test_all_tok
b263564
@mivanit
has_element() fix, back to longer pytest traces
2510d26
@mivanit
test fix
9f5166e
@mivanit
test optimizations
c00338c
@mivanit
test a larger random sample of tokenizers
ab246ae
@mivanit
set n of tokenizers to test via env
e3ce590
@mivanit
better CI, optimizations to testing all toks
5149058
@aaron-sandoval
Added one more condition in test_token_stability
34782a9
@aaron-sandoval
Merge branch 'tokenizer-overhaul-step2' of https://github.com/underst… 
2108e3a
@mivanit
tokenizer hashes now checked differently
b25869e
@mivanit
Merge branch 'tokenizer-overhaul-step2' of https://github.com/underst… 
afea7c2
@aaron-sandoval
Bugfix in test_token_stability, not all mazes have a "<PATH_END>"
8befacc
@mivanit
format
c09ebc5
@aaron-sandoval
Merge branch 'tokenizer-overhaul-step2' of https://github.com/underst… 
a4e67ab
@aaron-sandoval
format
03ed351
This was linked to issues on Jul 25
Add tokenization schemes #30
Open
Duplication between test_coord_str_tuple.py and test_token_utils.py #32
Open
TokenizationMode and MazeTokenizer ugly and not scalable #33
Open
@mivanit mivanit marked this pull request as ready for review last month
aaron-sandoval and others added 13 commits last month
@aaron-sandoval
Docs, a bit of cleanup
fa5c691
@aaron-sandoval
Format
51b77ae
@aaron-sandoval
Put back __init_subclass__
92da244
@aaron-sandoval
Format
cd08d53
@aaron-sandoval
Merge pull request #37 from understanding-search/tokenizer-overhaul-s… 
9ded238
@mivanit
Merge branch 'tokenizer-overhaul-integration' into fix-save-load-profile
63c21c9
@mivanit
fixed long-standing issue with serialization cfg mismatch
bfcaeca
@mivanit
fix exception for generation meta mismatch
bb9b69c
@mivanit
Merge pull request #36 from understanding-search/fix-save-load-profile 
7aa97b8
@mivanit
reworked CI
2c16c8d
@mivanit
added comments to save_hashes, testing CI dispatch
6c471ea
@mivanit
will run all-tok-checks if all_tokenizers tests changed
27cd942
@mivanit
all tok modifications 
0c16cd6
@mivanit mivanit mentioned this pull request on Jul 26
Support maze dataset tokenizers update understanding-search/maze-transformer#214
 Open
mivanit and others added 5 commits last month
@mivanit
reduce default number of tokenizers tested
4edd476
@mivanit
Merge branch 'main' into tokenizer-overhaul-integration, bump version 
2c19e57
@mivanit
summary should be json-serializable
19644d3
@mivanit
fix notebook
4c8adbf
@aaron-sandoval
test commit
2582299
@aaron-sandoval (Member) commented 27 days ago

> see if any more utils should be moved to muutils

Except for all_instances, I don't think any of the utilities in utils.py that were newly added in this PR should be moved, given their narrower, maze-specific applicability. There are a few older ones that could be moved. I didn't think you wanted to move elements of the old API if it could be helped, but if you're open to it, here are my recommendations. I only checked for uses in the two libraries.

| Function | Where Used | Recommendation |
| --- | --- | --- |
| `all_instances` | maze-dataset | Move |
| `bool_array_from_string` | maze-dataset, maze-transformer | Move |
| `adj_list_to_nested_set` | maze-dataset, maze-transformer | Maybe keep in m-d since it's at least lattice graph-specific |
| `apply_mapping` | Nowhere | Move |
| `apply_mapping_chain` | Nowhere | Move |
@aaron-sandoval (Member) commented 27 days ago

> unclear if save_hashes is being properly parallelized :/

I don't know a proper way to verify this. As a quick check, I ran a few trials on my new desktop with different numbers of processes in the Pool. Here are the wall-clock times reported by the spinner on list(pool.map(hash, all_tokenizers)).

| # Processes | Spinner Runtime [sec] |
| --- | --- |
| No Pool; list(map(hash, all_tokenizers)) | 59 |
| 1 | 75 |
| 2 | 39 |
| 4 | 21 |
| 6 | 19 |
| 12 | 19 |
| 24 | 21 |
The multiprocessing is clearly helping compared to a single core, but I don't know why it plateaus so early. I tried using Pool.apply_async, and that was way slower. IMO the runtime doesn't seem long enough to warrant sinking time into this now, since neither of us seems to have the relevant knowledge at the moment. This is the kind of backend optimization that can be done after the 1.0.0 release.
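For reference, a minimal sketch of the kind of timing harness behind the table above -- not the repo's actual benchmark, and the tokenizer list here is a cheap stand-in for what get_all_tokenizers() returns:

```python
# Hypothetical timing harness for the serial-vs-Pool comparison above.
import time
from multiprocessing import Pool

def time_hashing(all_tokenizers, n_processes=None):
    """Return wall-clock seconds to hash every tokenizer, serially or via a Pool."""
    start = time.perf_counter()
    if n_processes is None:
        list(map(hash, all_tokenizers))            # serial baseline ("No Pool" row)
    else:
        with Pool(n_processes) as pool:
            list(pool.map(hash, all_tokenizers))   # parallel variant
    return time.perf_counter() - start

if __name__ == "__main__":  # guard needed for multiprocessing on spawn platforms
    # stand-in objects; in the repo these would be the actual tokenizer instances
    all_tokenizers = [("tokenizer", i) for i in range(200_000)]
    for n in (None, 1, 2, 4, 6, 12, 24):
        print(n, f"{time_hashing(all_tokenizers, n):.1f}s")
```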

@aaron-sandoval
Cleaned up test_maze_dataset.py
bfeba59
@aaron-sandoval (Member) commented 2 weeks ago
@mivanit I know we're wrapping this up, but I was thinking about a small change to the mark_as_unsupported decorator. This shouldn't change any behavior in any tests. Right now we use that decorator, as well as single-valued Literal types, to artificially restrict the space returned by all_instances. You had mentioned before that ideally this artificial restriction would hide those subspaces from all_instances but still allow users to access them by directly constructing tokenizers with the desired parameters. The Literal types allow that direct construction, but the current implementation of mark_as_unsupported doesn't, since the dummy abstract method precludes any construction of that class.
What if mark_as_unsupported instead overwrote that class's is_valid method to always return False? Those classes would then still be constructable. Calls to get_all_tokenizers would ignore them just the same, and direct use of all_instances would ignore them if users pass MAZE_TOKENIZER_MODULAR_DEFAULT_VALIDATION_FUNCS as the documentation asks. We could maybe include MAZE_TOKENIZER_MODULAR_DEFAULT_VALIDATION_FUNCS by default in calls to all_tokenizers so that users have to consciously opt out of those validation funcs.
I was thinking about this when considering if/how to talk about these restricted subspaces in the paper. Right now I'd probably just ignore them completely, but if we made this change I might mention them as experimental, untested features. Lmk what you think, and if you're good with it I can make a super quick commit to implement it.
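A minimal sketch of the proposal (hypothetical, not the current mark_as_unsupported implementation): override is_valid so the class stays constructable but always reports itself as invalid, which the default validation funcs then filter out.

```python
# Hypothetical sketch of the proposed mark_as_unsupported behavior.
from typing import Type, TypeVar

T = TypeVar("T")

def mark_as_unsupported(cls: Type[T]) -> Type[T]:
    """Keep the class constructable, but make it always report itself as invalid."""

    def is_valid(self) -> bool:
        # get_all_tokenizers / all_instances with the default validation funcs
        # will skip instances of this class, but direct construction still works
        return False

    cls.is_valid = is_valid
    return cls

# usage sketch:
# @mark_as_unsupported
# class _SomeExperimentalTokenizerElement(_TokenizerElement):
#     ...
```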

aaron-sandoval added 3 commits 2 weeks ago
@aaron-sandoval
Added a TODO
8ad6f3a
@aaron-sandoval
bugfix: Shuffling and permuting edges
c9e3bc8
@aaron-sandoval
Added feature to color_tokens_rgb to automate line breaks for repor… 
1aeb408
@mivanit (Member, Author) commented 2 weeks ago
> @mivanit I know we're wrapping this up, but I was thinking about a small change to the mark_as_unsupported decorator. [...]

You're right that the current implementation would require people to manually edit their locally installed version of the package to get the unsupported tokenizers to work -- I actually think this behavior is fine and serves as a "make sure you know what you're doing" filter, but if you think this is a quick edit then I think it's fine to make, as long as we add a warning when initializing an unsupported tokenizer.

mivanit added 10 commits last week
@mivanit
better doc for make test_all_tok
dded010
@mivanit
update dep
3ca392f
@mivanit
removed testing length of ALL_TOKENIZERS 
2b8b2a5
@mivanit
lower n_to_test to see if pass, better conditions
e41c417
@mivanit
format
c3c9635
@mivanit
action was not correctly reading num of tokenizers and parallelization
cd9e069
@mivanit
oops
b7ee88b
@mivanit
poetry update
9ea828d
@mivanit
option to skip hashes test so we can run it separately
08bf285
@mivanit
test_all_tok was always skipping
90612dc
@mivanit (Member, Author) commented last week

Making a note for profiling and figuring out how many tokenizers we can test:

| Time (MM:SS) | Time (s) | n tokenizers tested | Link |
| --- | --- | --- | --- |
| 9:54 | 594 | 100 | action |
| 10:05 | 605 | 300 | action |
| 11:15 | 675 | 1000 | action |
| 13:33 | 813 | 3000 | action |
| 16:41 | 1001 | 5000 | action |

(plot: test runtime vs. number of tokenizers tested)
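The commits just below bump the default NUM_TOKENIZERS_TO_TEST; a hypothetical sketch of how such a knob could select the test sample (not the repo's actual test code):

```python
# Hypothetical: sample a random subset of tokenizers for CI, sized by an env var.
import os
import random

def sample_tokenizers_for_test(all_tokenizers, default: int = 100):
    n = int(os.environ.get("NUM_TOKENIZERS_TO_TEST", default))
    return random.sample(all_tokenizers, min(n, len(all_tokenizers)))
```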

@mivanit
bumping default NUM_TOKENIZERS_TO_TEST to 300
f3d2a72
mivanit and others added 18 commits last week
@mivanit
bump default NUM_TOKENIZERS_TO_TEST to 1000
62a02b7
@mivanit
bump default NUM_TOKENIZERS_TO_TEST to 3000
57cd1d1
@mivanit
test 5k tokenizers, expect test to take just over 16 min
3a9701d
@mivanit
bump muutils to 0.6.8, slight updates to spinners
fd6827f
@aaron-sandoval
Modified mark_as_unsupported to allow construction of all unsupport… 
599b955
@aaron-sandoval
wip getting nb to work
ad11705
@aaron-sandoval
Merge branch 'tokenizer-overhaul-integration' into tokenizer-overhaul… 
f03cf2b
@aaron-sandoval
Removed warnings, added unit tests for classes with `mark_as_unsuppor… 
5c7de66
@mivanit
slight clean up to .summary() code
c095d05
@mivanit
change _type to _type_ in MMT to avoid wandb issue 
729bf9a
@mivanit
format
7ed866d
@mivanit
Merge pull request #44 from understanding-search/tokenizer-overhaul-s… 
d5f453b
@mivanit
bump muutils dep to 0.6.9
ebf9a25
@mivanit
Merge branch 'tokenizer-overhaul-integration' of https://github.com/u… 
ee01ade
@mivanit
remove deprecation warning for is_UT and is_AOTP 
eb86a61
@mivanit
reverts and fixes
95c693e
@mivanit
bump muutils to 0.6.10
be7dbfd
@mivanit
MAZE_DATASET_CONFIGS gives a copy, not the original
e3decdb
@aaron-sandoval aaron-sandoval linked an issue last week that may be closed by this pull request
add option to remove randomization in as_tokens #34
Open
mivanit added 10 commits 5 days ago
@mivanit
fix MAZE_DATASET_CONFIGS interface
5e3d846
@mivanit
testing some changes to loading tokenizer hashes
a1a75d4
@mivanit
update dep
a2daf8e
@mivanit
revert how we load the tokenizer hashes
645fec4
@mivanit
ALL_TOKENIZER_HASHES now private and only gotten when required
7316455
@mivanit
update dep
2fb57c2
@mivanit
removed old setup.py
6a2b8e9
@mivanit
minor notebook cleanup
7b73303
@mivanit
a test no longer breaks collection, fixed a TODO
87c130e
@mivanit
bump version to 1.0.0
a638ef8
@mivanit (Member, Author) commented 1 minute ago
Approving -- this is finally done!