止める (とめる/やめる) not disambiguated #3

Open
adamkolar opened this issue Aug 3, 2023 · 4 comments
Labels
enhancement New feature or request

Comments

@adamkolar

My understanding is that this means 止める is missing from the heteronym dictionary. Would it be possible to generate a more comprehensive dictionary for training by taking all UNIDIC entries that share the same surface form but have different readings? This assumes the model is trained on a comprehensive reading-annotated corpus. I'm guessing a bit about how yomikata works, so apologies if I've missed something.
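A minimal sketch of the idea above: group dictionary entries by surface form and keep the surfaces that have more than one reading. The function name `find_heteronyms` and the sample rows are hypothetical; a real run would read (surface, reading) pairs from the dictionary's lexicon file, whose exact layout is not assumed here.

```python
from collections import defaultdict


def find_heteronyms(entries):
    """Return {surface: sorted readings} for surfaces with more than one reading.

    `entries` is any iterable of (surface, reading) pairs.
    """
    readings_by_surface = defaultdict(set)
    for surface, reading in entries:
        readings_by_surface[surface].add(reading)
    # Keep only surfaces that map to at least two distinct readings.
    return {
        surface: sorted(readings)
        for surface, readings in readings_by_surface.items()
        if len(readings) > 1
    }


# Hypothetical sample rows standing in for real dictionary entries.
sample_entries = [
    ("止める", "とめる"),
    ("止める", "やめる"),
    ("変化", "へんか"),
    ("変化", "へんげ"),
    ("猫", "ねこ"),
]

heteronyms = find_heteronyms(sample_entries)
```

A list built this way would likely include many rare readings, so some frequency filtering against the training corpus would probably still be needed.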

@passaglia passaglia added the enhancement New feature or request label Aug 3, 2023
@passaglia
Owner

Hi @adamkolar! You're right, 止める is not in the heteronym dictionary, which I got from Sato et al. 2022. It's not clear what algorithm they used to construct their list, but it excludes all heteronyms that contain hiragana.

Extracting the heteronyms directly from the corpus itself, or from UNIDIC, is a good idea. Adding this word and words like it would definitely make Yomikata more useful. I don't have plans right now to release an upgraded version of Yomikata, but maybe at some point down the line :)

@adamkolar
Author

adamkolar commented Aug 3, 2023

Thank you for the quick response, @passaglia!
I might give it a shot then. Are there any potential pitfalls I should be aware of when extracting the heteronyms from the corpus or UNIDIC?

Another thing that occurred to me: it would be useful to provide a more seamless integration with tokenizers like fugashi, where yomikata would work under the hood to correct wrongly chosen morphemes after a sentence has been tokenised, while the tokenization output would otherwise be identical to fugashi's. I haven't studied the yomikata API in detail, though, so it's possible this functionality is already available.
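A rough sketch of what such a wrapper could look like. It does not call fugashi or yomikata: `Token`, `correct_readings`, and the `disambiguate` callable (which maps a token list to `{token index: corrected reading}`) are all hypothetical stand-ins, chosen only to show the "same tokens, patched readings" shape.

```python
from dataclasses import dataclass, replace
from typing import Callable, Dict, List


@dataclass(frozen=True)
class Token:
    """Hypothetical stand-in for a tokenizer's output node."""
    surface: str
    reading: str


def correct_readings(
    tokens: List[Token],
    disambiguate: Callable[[List[Token]], Dict[int, str]],
) -> List[Token]:
    """Return tokens with the same segmentation, overriding only flagged readings."""
    corrections = disambiguate(tokens)
    return [
        replace(token, reading=corrections.get(index, token.reading))
        for index, token in enumerate(tokens)
    ]


def fake_disambiguate(tokens: List[Token]) -> Dict[int, str]:
    # Assumption for the demo: pretend a context model chose やめる for 止める.
    return {i: "やめる" for i, t in enumerate(tokens) if t.surface == "止める"}


tokens = [Token("タバコ", "タバコ"), Token("を", "を"), Token("止める", "とめる")]
corrected = correct_readings(tokens, fake_disambiguate)
```

Because the wrapper only swaps the reading field, downstream code that relies on the tokenizer's segmentation would see no other change.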

@passaglia
Owner

passaglia commented Aug 3, 2023

It would be great if you could take a stab at this! Happy to help in any way I can. Feel free to send me an email, available on my work webpage, if you want to chat offline.

I took a quick look at the training dataset and did find that both {止/や}める and {止/と}める are present, so it should be possible to determine from the dataset that 止める is a heteronym. Using the corpus to find the heteronyms might be better than using UNIDIC since the model needs the heteronym to appear in the corpus to learn it anyways.
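As an illustration of the corpus-based check, here is a small sketch that counts readings per annotated span, assuming sentences use the `{surface/reading}` furigana notation seen in this thread. The sample sentences and the names `FURIGANA` and `count_readings` are made up for the example. It keys on the annotated span only (here 止), so attaching okurigana to recover 止める as a dictionary form would be a further step that is not shown.

```python
import re
from collections import Counter, defaultdict

# Matches spans like {止/や}: group 1 is the surface, group 2 the reading.
FURIGANA = re.compile(r"\{([^/{}]+)/([^/{}]+)\}")


def count_readings(sentences):
    """Return {surface: Counter({reading: count})} over all annotated spans."""
    counts = defaultdict(Counter)
    for sentence in sentences:
        for surface, reading in FURIGANA.findall(sentence):
            counts[surface][reading] += 1
    return counts


# Hypothetical sample sentences, not taken from the real training data.
sample = [
    "車を{止/と}める。",
    "タバコを{止/や}める。",
    "{猫/ねこ}が車を{止/と}める。",
]

counts = count_readings(sample)
# Spans with more than one observed reading are heteronym candidates.
ambiguous = {s: dict(c) for s, c in counts.items() if len(c) > 1}
```

The per-reading counts also show how skewed a heteronym is in the corpus, which matters for whether the model can learn the rarer reading.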

In fact, I just recalled that I did extract all heteronyms from UNIDIC and Sudachi; see the notebook yomikata.ipynb and the script pronunciation.py.

Then the next step is to check how the BERT model that powers yomikata tokenizes 止める and its inflections.

@Kamikadashi

Kamikadashi commented Jan 11, 2024

I tried to use this to disambiguate the readings of 方 (ほう/かた) and 風 (ふう/かぜ) for my TTS project, as they frequently appear in books and such. Unfortunately, it turns out that Yomikata doesn't support them. It needs to support more words to be truly useful.

It also doesn't handle unusual kanji. For example, 嗤 comes out as [UNK]:
自分の勘の良さに嗤ってしまう。
自分の勘の良さに[UNK]てしまう。

Also, for some heteronyms like 変化 that were clearly in the dataset, some of the readings are missing:
どういう風に変化ったのじゃ。
どういう風に{変化/へんか}ったのじゃ。

"変化": {
"へんか": 87895,
"へんげ": 337
},

「どういう風に変化ったのじゃ。」 | 「どういう風[ふう]に変化[かわ]ったのじゃ。」


3 participants