Tokenization doesn't preserve diacritics #40
Can you share the segmentation outputs for this example (as well as the Gujarati example) you shared over mail? Please share the text, not the images.
import transformers

# instead of this: tokenizer = transformers.AutoTokenizer.from_pretrained('ai4bharat/indic-bert')
# print(tokenizer.tokenize("यहाँ") == tokenizer.tokenize("यह")) # returns True if you use the line above

# use this:
tokenizer = transformers.AutoTokenizer.from_pretrained('ai4bharat/indic-bert', keep_accents=True)
print(tokenizer.tokenize("यहाँ") == tokenizer.tokenize("यह")) # returns False

Use the snippet above to initialize the tokenizer so that accents/diacritics are preserved. This is explained in issue #26 (there is also a note about this in our README, in case you missed it). Please let us know if this works.
Thanks for pointing that out. That solves the issues with both Hindi and Gujarati.
I was recently working with the IndicBERT SentencePiece tokenizer and noticed something I was curious about. It turns out that when we encode sentences, a good number of diacritics are not encoded. For example, in Hindi, the sentences "मेंने उसकी गेंद दी।" and "मैने उसको गेंद दी।" produce the same encoding, even though one carries the genitive marker and the other the dative marker. I have seen this for both Gujarati and Hindi. I believe the diacritics are being ignored because, when the encodings are decoded, some diacritics are missing.
I was curious to know why this happens and whether there is a workaround.
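For reference, a minimal sketch that reproduces the collapse described above and checks the keep_accents=True fix suggested in the comments. It uses only the standard Hugging Face encode/decode methods; the exact token IDs will depend on the tokenizer version, so treat the expected outputs in the comments as illustrative rather than guaranteed:

import transformers

# Default initialization: diacritics are stripped during preprocessing,
# so sentences that differ only in vowel signs collapse to the same IDs.
tokenizer = transformers.AutoTokenizer.from_pretrained('ai4bharat/indic-bert')

genitive = "मेंने उसकी गेंद दी।"
dative = "मैने उसको गेंद दी।"

print(tokenizer.encode(genitive) == tokenizer.encode(dative))  # expected: True (same encoding)
# Decoding the IDs shows the missing diacritics directly.
print(tokenizer.decode(tokenizer.encode(genitive), skip_special_tokens=True))

# With keep_accents=True the two sentences encode differently.
tokenizer = transformers.AutoTokenizer.from_pretrained('ai4bharat/indic-bert', keep_accents=True)
print(tokenizer.encode(genitive) == tokenizer.encode(dative))  # expected: False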