-
Notifications
You must be signed in to change notification settings - Fork 27.7k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
tokenizer.json modified after tokenizer.save_pretrained of OLMO models #34744
Comments
Hey @zzf1130, I believe this change is to make the tokenizer.json more flexible/future-proof, and is therefore a voluntary change. I will let @ArthurZucker comment on it, thank you! |
Hello. This change cause a error when I use vllm to run a server.
And I have to replace it with the original one. |
Hey! This is somewhat expected, we had a breaking change in |
I ran into this when quantizing a Llama model. This doubles the file size of This might require a lot of projects to add I ended up copying the tokenizer files from the base model as I didn't think it would be good practice to change the tokenizer for a quantized model. |
Ah yes that's an issue. This is related to the |
That should do the trick! |
9MB is really high already, the limit is 10MB for lfs, so any nudge will push it into LFS territory. Removed pretty printing and it's still at 12MB. Everything zipped is 2.5MB (both the 17MB and 12MB versions) showing how much redundant that data is...
Seems like a good strategy too. |
Linked to huggingface/tokenizers#1656. Personally I think if we're going to disable pretty printing I'd remove JSON altogether to get to this 2.5MB target directly. Edit: I don't think moving away from JSON is good, JSON is made to be readable editable, which is great. But making it unreadable by not pretty-printing doesn't yield enough wins it seems, and therefore if we care that much about size there are probably better options. |
So TLDR do you want to go with using ids instead of strings? |
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread. Please note that issues that do not follow the contributing guidelines are likely to be ignored. |
System Info
transformers
version: 4.45.0Who can help?
@ArthurZucker and @itazap
Information
Tasks
examples
folder (such as GLUE/SQuAD, ...)Reproduction
When I load and then save the tokenizer with OLMO models, the tokenizer.json files appear different, particularly with the
merge
key.The code to reproduce that is :
Expected behavior
The original
tokenizer.json
and the savedtokenizer.json
should be the same.The text was updated successfully, but these errors were encountered: