止める (とめる/やめる) not disambiguated #3
Comments
Hi @adamkolar! You're right, 止める is not in the heteronym dictionary, which I took from Sato et al. 2022. It's not clear what algorithm they used to construct their list, but it excludes all heteronyms that contain hiragana. Extracting the heteronyms directly from the corpus itself, or from UniDic, is a good idea; adding this word and words like it would definitely make Yomikata more useful. I don't have plans right now to release an upgraded version of Yomikata, but maybe at some point down the line :)
Thank you for the quick response @passaglia! Another thing that occurred to me is that it would be useful to provide a more seamless integration with tokenizers like fugashi: Yomikata would work under the hood to correct incorrectly picked morphemes after a sentence has been tokenised, while the tokenization output would otherwise be identical to fugashi's. I haven't studied the Yomikata API in detail, though, so it's possible this functionality is already available.
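To make the idea concrete, here is a minimal sketch of such a post-correction pass. It does not use the real fugashi or Yomikata APIs: `Token`, `correct_readings`, and `toy_disambiguate` are hypothetical names invented for illustration, and the "disambiguator" is a toy rule standing in for a real model.

```python
from dataclasses import dataclass, replace
from typing import Callable, Optional, Sequence, Set


@dataclass(frozen=True)
class Token:
    """Hypothetical stand-in for a tokenizer's morpheme object."""
    surface: str
    reading: str


# Hypothetical interface: given all tokens and an index, return a corrected
# reading for that token, or None to keep the tokenizer's choice.
Disambiguator = Callable[[Sequence[Token], int], Optional[str]]


def correct_readings(tokens: Sequence[Token],
                     disambiguate: Disambiguator,
                     heteronyms: Set[str]) -> list:
    """Return the same token sequence, overriding readings of known heteronyms."""
    corrected = []
    for i, tok in enumerate(tokens):
        new_reading = disambiguate(tokens, i) if tok.surface in heteronyms else None
        # Tokens that are not heteronyms, or that get no override, pass through unchanged.
        corrected.append(replace(tok, reading=new_reading) if new_reading else tok)
    return corrected


def toy_disambiguate(tokens: Sequence[Token], i: int) -> Optional[str]:
    """Toy rule (not a real model): read 止める as ヤメル if タバコ precedes it."""
    earlier = [t.surface for t in tokens[:i]]
    return "ヤメル" if "タバコ" in earlier else None


tokens = [Token("タバコ", "タバコ"), Token("を", "ヲ"), Token("止める", "トメル")]
fixed = correct_readings(tokens, toy_disambiguate, {"止める"})
print([(t.surface, t.reading) for t in fixed])
```

The point of the design is that the token boundaries and every other field stay exactly as the tokenizer produced them; only the reading of a recognised heteronym is replaced.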
It would be great if you could take a stab at this! Happy to help in any way I can. Feel free to send me an email (the address is on my work webpage) if you want to chat offline. I took a quick look at the training dataset and found that both {止/や}める and {止/と}める are present, so it should be possible to determine from the dataset that 止める is a heteronym. Using the corpus to find the heteronyms might be better than using UniDic, since the model needs the heteronym to appear in the corpus to learn it anyway. In fact, I just recalled that I did extract all heteronyms from UniDic and Sudachi; see the notebook yomikata.ipynb and the script pronunciation.py. The next step would then be to check how the BERT model that powers Yomikata tokenizes 止める and its inflections.
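As a rough illustration of finding heteronyms from the corpus, the sketch below scans sentences annotated in the `{kanji/reading}` style quoted above and keeps every annotated span that occurs with more than one reading. It assumes the whole dataset uses exactly that notation; the sample sentences and the function name `heteronym_candidates` are made up for the example, and okurigana (the める in 止める) is not attached to the span here.

```python
import re
from collections import defaultdict

# Matches annotations of the form {surface/reading}, e.g. {止/や}.
FURIGANA = re.compile(r"\{([^{}/]+)/([^{}/]+)\}")


def heteronym_candidates(sentences):
    """Map each annotated span to its readings, keeping spans with 2+ readings."""
    readings = defaultdict(set)
    for sentence in sentences:
        for surface, reading in FURIGANA.findall(sentence):
            readings[surface].add(reading)
    return {s: sorted(r) for s, r in readings.items() if len(r) > 1}


# Made-up sample sentences, not taken from the actual training data.
sample = [
    "タバコを{止/や}める。",
    "{車/くるま}を{止/と}める。",
    "{車/くるま}に{乗/の}る。",
]

candidates = heteronym_candidates(sample)
print(candidates)
```

A real pass would probably also need to join each span with its following okurigana and fold inflected forms together before the result could serve as a dictionary entry like 止める.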
I tried to use this to disambiguate the readings of 方 (ほう/かた) and 風 (ふう/かぜ) for my TTS project, as they appear frequently in books and the like. Unfortunately, it turned out that Yomikata doesn't support them. It definitely needs to support more words to be truly useful. It also doesn't support unusual kanji usage. Also, for some heteronyms like 変化 that were clearly in the dataset, some of the readings are missing: "変化": {
「どういう風に変化ったのじゃ。」 | 「どういう風[ふう]に変化[かわ]ったのじゃ。」
My understanding is that this means 止める is missing from the heteronym dictionary. I wonder whether it would be possible to generate a more comprehensive dictionary for training by taking all entries from UniDic that share the same surface form but have different readings, assuming the model is trained on a comprehensive reading-annotated corpus. I'm guessing a bit about how Yomikata works, so apologies if I've missed something.
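The grouping proposed here can be sketched in a few lines, assuming the lexicon has already been reduced to (surface, reading) pairs. How those pairs are pulled out of UniDic is not shown, and the sample rows and the name `find_heteronyms` are hypothetical.

```python
from collections import defaultdict


def find_heteronyms(entries):
    """Return {surface: sorted readings} for surfaces with more than one reading."""
    readings = defaultdict(set)
    for surface, reading in entries:
        readings[surface].add(reading)
    return {s: sorted(r) for s, r in readings.items() if len(r) > 1}


# Hypothetical (surface, reading) pairs standing in for real lexicon rows.
sample_entries = [
    ("止める", "トメル"),
    ("止める", "ヤメル"),
    ("食べる", "タベル"),
    ("風", "カゼ"),
    ("風", "フウ"),
]

heteronyms = find_heteronyms(sample_entries)
print(heteronyms)
```

Surfaces with a single reading (食べる in the sample) drop out, so the result contains only the words a disambiguation model would need to learn.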