止める (とめる/やめる) not disambiguated #3

adamkolar · 2023-08-03T09:46:03Z

My understanding is that this means 止める is missing from the heteronym dictionary. Would it be possible to generate a more comprehensive dictionary for training by taking all UNIDIC entries that share the same surface form but have different readings? This assumes the model is trained on a comprehensive reading-annotated corpus. I'm guessing a bit about how yomikata works, so apologies if I've missed something.
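
A minimal sketch of the grouping idea above, assuming the dictionary has already been exported to plain (surface, reading) pairs. The function name `heteronyms_from_entries` and the hand-written rows are hypothetical and are not read from UNIDIC itself:

```python
from collections import defaultdict


def heteronyms_from_entries(entries):
    """Group (surface, reading) pairs and keep surfaces with 2+ distinct readings."""
    readings_by_surface = defaultdict(set)
    for surface, reading in entries:
        readings_by_surface[surface].add(reading)
    # Sort the readings so the output is deterministic.
    return {
        surface: sorted(readings)
        for surface, readings in readings_by_surface.items()
        if len(readings) > 1
    }


# Hypothetical rows standing in for a dictionary export (assumption, not real UNIDIC output).
entries = [
    ("止める", "トメル"),
    ("止める", "ヤメル"),
    ("食べる", "タベル"),
    ("変化", "ヘンカ"),
    ("変化", "ヘンゲ"),
]

candidates = heteronyms_from_entries(entries)
print(candidates)
```

Surfaces with a single reading, such as 食べる here, drop out, so only candidate heteronyms remain.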

passaglia · 2023-08-03T12:04:12Z

Hi @adamkolar! You're right, 止める is not in the heteronym dictionary, which I took from Sato et al. 2022. It's not clear what algorithm they used to construct their list, but it excludes all heteronyms that contain hiragana.

Extracting the heteronyms directly from the corpus itself, or from UNIDIC, is a good idea. Adding this word and others like it would definitely make Yomikata more useful. I don't have plans right now to release an upgraded version of Yomikata, but maybe at some point down the line :)

adamkolar · 2023-08-03T12:12:22Z

Thank you for the quick response, @passaglia!
I might give it a shot then. Are there any potential pitfalls I should be aware of when extracting the heteronyms from the corpus/UNIDIC?

Another thing that occurred to me is that it would be useful to provide a more seamless integration with tokenizers like fugashi: yomikata would work under the hood to correct wrongly chosen morphemes after a sentence has been tokenized, while the tokenization output would otherwise be identical to fugashi's. I haven't studied the yomikata API in detail, though, so it's possible this functionality is already available.
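
To make the idea concrete, here is a rough sketch of such a post-correction pass. Everything in it is an assumption for illustration: `Token`, `correct_readings`, and `toy_disambiguator` are hypothetical names, the tokens are hand-built stand-ins for tokenizer output, and the rule-based function is only a placeholder for whatever model would make the real decision. It does not use the fugashi or yomikata APIs.

```python
from dataclasses import dataclass, replace
from typing import Callable, List, Optional, Sequence


@dataclass(frozen=True)
class Token:
    """Simplified stand-in for a tokenizer's morpheme (hypothetical type)."""
    surface: str
    reading: str


# (tokens, index) -> corrected reading, or None to keep the tokenizer's choice.
Disambiguator = Callable[[Sequence[Token], int], Optional[str]]


def correct_readings(tokens: Sequence[Token], disambiguate: Disambiguator) -> List[Token]:
    """Return tokens with the same segmentation, overriding readings where advised."""
    corrected = []
    for i, token in enumerate(tokens):
        new_reading = disambiguate(tokens, i)
        corrected.append(token if new_reading is None else replace(token, reading=new_reading))
    return corrected


def toy_disambiguator(tokens: Sequence[Token], i: int) -> Optional[str]:
    """Placeholder rule, not a real model: read 止め as ヤメ when タバコ precedes it."""
    if tokens[i].surface == "止め" and any(t.surface == "タバコ" for t in tokens[:i]):
        return "ヤメ"
    return None


# Hand-built tokens for タバコを止めた, with the reading a tokenizer might pick by default.
tokens = [Token("タバコ", "タバコ"), Token("を", "ヲ"), Token("止め", "トメ"), Token("た", "タ")]
fixed = correct_readings(tokens, toy_disambiguator)
print([(t.surface, t.reading) for t in fixed])
```

Only the reading field changes; the surfaces and token boundaries pass through untouched, which is the property the integration would need to preserve.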

passaglia · 2023-08-03T13:47:31Z

It would be great if you could take a stab at this! Happy to help in any way I can. Feel free to email me if you want to chat offline; my address is on my work webpage.

I took a quick look at the training dataset and found that both {止/や}める and {止/と}める are present, so it should be possible to determine from the dataset that 止める is a heteronym. Using the corpus to find the heteronyms might be better than using UNIDIC, since the model needs the heteronym to appear in the corpus to learn it anyway.
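
As a rough sketch of the corpus-based approach, assuming sentences are annotated in the `{surface/reading}` style shown above: the snippet below counts readings per annotated span and keeps spans with more than one reading. The names `count_readings` and `find_heteronyms`, the `min_count` threshold, and the sample sentences are all made up for illustration. Note that it works on the annotated kanji span (止), not the full lemma (止める); mapping spans back to lemmas would need a tokenizer.

```python
import re
from collections import Counter, defaultdict

# Matches annotations such as {止/や} and captures (surface, reading).
FURIGANA = re.compile(r"\{([^{}/]+)/([^{}/]+)\}")


def count_readings(sentences):
    """Count how often each reading is attached to each annotated surface."""
    counts = defaultdict(Counter)
    for sentence in sentences:
        for surface, reading in FURIGANA.findall(sentence):
            counts[surface][reading] += 1
    return counts


def find_heteronyms(counts, min_count=1):
    """Keep surfaces with 2+ readings that each occur at least min_count times."""
    result = {}
    for surface, readings in counts.items():
        kept = {r: n for r, n in readings.items() if n >= min_count}
        if len(kept) > 1:
            result[surface] = kept
    return result


# Made-up sample sentences (assumption, not taken from the training dataset).
sample = [
    "タバコを{止/や}める。",
    "車を{止/と}める。",
    "{変化/へんか}が起きる。",
]

heteronyms = find_heteronyms(count_readings(sample))
print(heteronyms)
```

Raising `min_count` would drop readings too rare for a model to learn, which matches the point that a heteronym only helps if it actually appears in the corpus.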

In fact, I just recalled that I did extract all heteronyms from UNIDIC and Sudachi. See the notebook yomikata.ipynb and the script pronunciation.py.

Then the next step is to check how the BERT model that powers yomikata tokenizes 止める and its inflections.

Kamikadashi · 2024-01-11T18:10:08Z

I tried to use this to disambiguate the readings of 方 (ほう/かた) and 風 (ふう/かぜ) for my TTS project, as they appear frequently in books and similar texts. Unfortunately, it turned out that Yomikata doesn't support them. It definitely needs to support more words to be truly useful.

It also doesn't support unusual kanji. For example, 嗤っ comes out as [UNK]:
自分の勘の良さに嗤ってしまう。
自分の勘の良さに[UNK]てしまう。

Also, for some heteronyms like 変化 that were clearly in the dataset, some of the readings are missing:
どういう風に変化ったのじゃ。
どういう風に{変化/へんか}ったのじゃ。

"変化": {
    "へんか": 87895,
    "へんげ": 337
},

「どういう風に変化ったのじゃ。」 | 「どういう風[ふう]に変化[かわ]ったのじゃ。」

passaglia added the enhancement New feature or request label Aug 3, 2023
