Tokenization doesn't preserve diacritics #40
Can you share the segmentation outputs for this example (as well as the Gujarati example) you shared over mail? Please share the text, not the images.
import transformers

# instead of this: tokenizer = transformers.AutoTokenizer.from_pretrained('ai4bharat/indic-bert')
# print(tokenizer.tokenize("यहाँ") == tokenizer.tokenize("यह")) # returns True if you use the line above

# use this:
tokenizer = transformers.AutoTokenizer.from_pretrained('ai4bharat/indic-bert', keep_accents=True)
print(tokenizer.tokenize("यहाँ") == tokenizer.tokenize("यह")) # returns False

Use the snippet above to initialize the tokenizer so that accents/diacritics are preserved. This is explained in issue #26 (there is also a note about this in our README, in case you missed it). Please let us know if this works.
Thanks for pointing that out. That solves the issues with both Hindi and Gujarati.
I was recently working with the IndicBERT SentencePiece tokenizer and noticed something I was curious about. It turns out that when we encode sentences, a good number of diacritics are not encoded. For example, in Hindi, the sentences "मेंने उसकी गेंद दी।" and "मैने उसको गेंद दी।" produce the same encoding, even though one carries the genitive marker and the other the dative marker. I have seen this for both Gujarati and Hindi. I believe the diacritics are being ignored because, when the encodings are decoded, some diacritics are missing.
I was curious to know why this happens and whether there is a workaround.
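For reference, a minimal sketch that reproduces the collapse described above and checks the keep_accents=True fix suggested in the comments. It uses only the standard Hugging Face encode/decode methods; the exact token IDs will depend on the tokenizer version, so treat the expected outputs in the comments as illustrative rather than guaranteed:

import transformers

# Default initialization: diacritics are stripped during preprocessing,
# so sentences that differ only in vowel signs collapse to the same IDs.
tokenizer = transformers.AutoTokenizer.from_pretrained('ai4bharat/indic-bert')

genitive = "मेंने उसकी गेंद दी।"
dative = "मैने उसको गेंद दी।"

print(tokenizer.encode(genitive) == tokenizer.encode(dative))  # expected: True (same encoding)
# Decoding the IDs shows the missing diacritics directly.
print(tokenizer.decode(tokenizer.encode(genitive), skip_special_tokens=True))

# With keep_accents=True the two sentences encode differently.
tokenizer = transformers.AutoTokenizer.from_pretrained('ai4bharat/indic-bert', keep_accents=True)
print(tokenizer.encode(genitive) == tokenizer.encode(dative))  # expected: False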