BPE tokenizer Implementation #2
Hey, that would be awesome if you could contribute that.
The reason the current implementation is incorrect is that it doesn't respect merge order. In BPE, tokenization should be driven by the merge order, which corresponds to the most-to-least frequent merge pairs in the training data. For example, take the BPE encoding of the string "00000000000000000" (0*17): the merges are applied pair by pair in rank order, and that order determines the final tokenization. The trie implementation only looks for the longest prefix, so it would instead greedily walk down the longest matching path in the vocabulary. The difference between the two approaches is basically that rank-ordered merging and greedy longest-prefix matching can split the same string into different token sequences, so the trie encoder can disagree with the reference GPT-2 tokenizer.
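To make the merge-order point concrete, here is a toy, hypothetical sketch (not the repo's actual code) of a rank-ordered BPE encoder in Go; the `ranks` and `vocab` values below are made up purely for illustration:

```go
package main

import "fmt"

// encode applies BPE merges in rank order: the lowest-ranked (most frequent
// in training) adjacent pair is merged first, repeatedly, until no learned
// merge applies. This is the behaviour a greedy longest-prefix trie lookup
// does not reproduce.
func encode(text string, ranks map[[2]string]int, vocab map[string]int) []int {
	// Start from individual characters (a real GPT-2 encoder works on bytes).
	parts := []string{}
	for _, r := range text {
		parts = append(parts, string(r))
	}
	for len(parts) > 1 {
		// Find the adjacent pair with the best (lowest) merge rank.
		best, bestRank := -1, int(^uint(0)>>1)
		for i := 0; i < len(parts)-1; i++ {
			if rank, ok := ranks[[2]string{parts[i], parts[i+1]}]; ok && rank < bestRank {
				best, bestRank = i, rank
			}
		}
		if best == -1 {
			break // no applicable merges left
		}
		// Merge the chosen pair in place and continue.
		merged := parts[best] + parts[best+1]
		parts = append(parts[:best+1], parts[best+2:]...)
		parts[best] = merged
	}
	ids := make([]int, 0, len(parts))
	for _, p := range parts {
		ids = append(ids, vocab[p])
	}
	return ids
}

func main() {
	// Toy merge table: "0"+"0" -> "00" learned first, then "00"+"0" -> "000".
	ranks := map[[2]string]int{
		{"0", "0"}:  0,
		{"00", "0"}: 1,
	}
	vocab := map[string]int{"0": 0, "00": 1, "000": 2}
	// Merge-order BPE splits "00000" as ["00" "000"] = [1 2];
	// a greedy longest-prefix trie over the same vocab would give ["000" "00"] = [2 1].
	fmt.Println(encode("00000", ranks, vocab))
}
```

Even with this tiny made-up vocabulary the two strategies disagree on a run of five zeros, which is exactly the kind of divergence the 0*17 example is pointing at.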
@joshcarp, what do you think about this implementation, BPETokenizer? Could I contribute it, or has this issue already been addressed? Thanks.
Originally I wanted this to be a no-dependency project that serves as an example of every moving part of the transformer, but I've been way too distracted with other things and haven't come back to it. If you want to raise a PR that removes all of the current tokenizer code and puts a BPETokenizer in, that would be great, especially to get the repo into a good working state. I'm still planning to do some BPE implementation in this repo for a blog post, similar to the annotated transformer, but it doesn't need to be fully feature-complete.
That's great @joshcarp, let me raise a PR with the BPETokenizer (I'm thinking of using the same GPT-4 pattern to split the tokens, in case anyone wants to train their own tokenizer). Thoughts on this?
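For reference, and not as the PR's actual code, a trainer that produces merge ranks could, under the standard BPE scheme, count the most frequent adjacent pair each round and record one merge per round. A minimal hypothetical sketch, ignoring the regex pre-splitting step and working on characters of a single string:

```go
package main

import "fmt"

// train repeatedly finds the most frequent adjacent pair in the corpus and
// records it as the next merge (rank = order learned). Real trainers such as
// GPT-2's first split text with a regex pattern and operate on bytes.
func train(corpus string, numMerges int) map[[2]string]int {
	parts := []string{}
	for _, r := range corpus {
		parts = append(parts, string(r))
	}
	ranks := map[[2]string]int{}
	for merge := 0; merge < numMerges; merge++ {
		// Count adjacent pairs in the current partition of the corpus.
		counts := map[[2]string]int{}
		for i := 0; i < len(parts)-1; i++ {
			counts[[2]string{parts[i], parts[i+1]}]++
		}
		// Pick the most frequent pair.
		var best [2]string
		bestCount := 0
		for pair, c := range counts {
			if c > bestCount {
				best, bestCount = pair, c
			}
		}
		if bestCount < 2 {
			break // nothing worth merging
		}
		ranks[best] = merge
		// Apply the new merge everywhere, left to right.
		merged := []string{}
		for i := 0; i < len(parts); i++ {
			if i+1 < len(parts) && parts[i] == best[0] && parts[i+1] == best[1] {
				merged = append(merged, parts[i]+parts[i+1])
				i++
			} else {
				merged = append(merged, parts[i])
			}
		}
		parts = merged
	}
	return ranks
}

func main() {
	// Classic toy corpus: ("a","a") is learned first; later merges build on it.
	fmt.Println(train("aaabdaaabac", 3))
}
```

Note that ties between equally frequent pairs are broken arbitrarily here (Go map iteration order), so a real trainer would need a deterministic tie-break to produce reproducible ranks.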
Hi,
I'm interested in contributing an implementation of the BPE tokenizer.
Since we're using the GPT-2 encoding (as shown in the preprocessors), I think we can use the original implementation from tiktoken-go. But this means we cannot yet do our own tokenizer training (to produce our own BPE ranks). That might be important if people want to experiment with changing the vocabulary size and "train" their own tokenizer; however, it may also be out of scope for this repo's (or llm.c's) purpose.
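If the PR does go the tiktoken-go route, loading pretrained ranks might look roughly like the sketch below. This is based on my reading of the pkoukk/tiktoken-go README, so the exact encoding name and method signatures should be checked against the library; "r50k_base" is used here on the assumption that it matches the 50k-merge GPT-2/GPT-3-era rank set.

```go
package main

import (
	"fmt"

	tiktoken "github.com/pkoukk/tiktoken-go"
)

func main() {
	// Assumption: pkoukk/tiktoken-go exposes GetEncoding; "r50k_base" is the
	// 50k-merge rank set of the GPT-2/GPT-3 era. Verify the exact name the
	// library ships for GPT-2.
	enc, err := tiktoken.GetEncoding("r50k_base")
	if err != nil {
		panic(err)
	}
	ids := enc.Encode("hello world", nil, nil) // no special tokens allowed/disallowed
	fmt.Println(ids)
	fmt.Println(enc.Decode(ids))
}
```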
Maybe to start, can you elaborate on why the existing trie encoding is incorrect? Is the merging of tokens not done properly according to the original GPT-2 tokenizer spec? Or is it a performance issue with the trie structure in general?