New user_pitchdb.csv #19

Open
wants to merge 3 commits into
base: master
Conversation


@ithelor ithelor commented Sep 4, 2021

Converted you-know-what (wink-wink) to .csv so the addon can use it as a custom DB.
Works fine, as far as I can tell.

@ithelor ithelor changed the title New user database New user_pitchdb.csv Sep 4, 2021
@IllDepence
Owner

Hey,

thanks for converting the Kanjium pitch data.

I would like to keep the user_pitchdb.csv that ships with the add-on as is, i.e., a one-line example showing users what the file is for and what its format looks like, so that everyone can use it how they like.
As for the converted Kanjium pitch data, I would suggest the following: create a gist with the converted CSV, a mention of where the data is from (i.e. the Kanjium repo) and, in case you created some conversion script, maybe that as well. I would then put a link to the gist on the add-on page on ankiweb. Something along the lines of:

<short description of what the user_pitchdb.csv is for>
Pitch data alternatives to Wadoku: Kanjium (thanks to Ithelor for converting the data)

…cture being similar to Wadoku's DB. Leaving only first entries in the 3rd column for now
@ithelor
Author

ithelor commented Sep 4, 2021

Thank you for the reply.
4 hours in and only now I notice I messed up quite a bit. The addon seems to be working alright with this data, but I would like to spend additional time putting things in order. Once I am completely (this time for sure) sure the data is completely alright I will do exactly as you suggest.

@ithelor
Author

ithelor commented Sep 4, 2021

Well, I've spent some time rewriting the algorithm, testing the data and putting up the gist.

The current summary is:

  • added readings for kana-written expressions (Kanjium doesn't provide these)
  • at first, I didn't test this data enough; there is a problem with multiple entries for one expression
    Despite the correct data being in the table, the addon only takes the last entry (row) it can find, thus 四 (し) being generated into スー, 足 (あし) being そく, 東 (ひがし) being ひんがし, even 見物 (けんぶつ) being みもの, etc. It looks like the algorithm doesn't use the reading field at all when looking through custom data.
  • multiple pitch accent patterns in the 3rd position also break the algorithm (which doesn't seem to be a problem in Wadoku's structure), which is why I had to limit them to just one
    Otherwise the graphs try to include all of the patterns simultaneously. You can grab full version here if you find some time to test it.

Also one question about "keeping the user_pitchdb.csv". Do you mean you want to keep it in the original package? So user can just replace it when installing alternative data?

Gist. If there are any problems other than listed please inform me.

Also since you're not planning on merging I will be committing to a separate repo, if that's OK. There's just some stuff not affiliated with this one.

@IllDepence
Owner

Thanks for the extensive info and excuse the late reply.

I now added a remark pointing to the gist to the add-on page.

@ithelor
Author

ithelor commented Sep 20, 2021

Well, my point was that the algorithm you use to work around the custom DB structure doesn't seem to support multiple entries with the same key. I may be wrong, but either way the addon only generates the last entry. As I said,

四 (し) being generated into スー, 足 (あし) being そく, 東 (ひがし) being ひんがし, even 見物 (けんぶつ) being みもの, etc.

It also doesn't look like it supports multiple values in a field, so I can't think of anything I can do to fix this. In the previous comment I wanted to let you know that you should probably revise the custom db structure you use, not post the data that works incorrectly.

@IllDepence
Owner

IllDepence commented Sep 20, 2021

Sorry for not properly addressing your comment. I wasn't able to find any time for the add-on recently and since your comment had been sitting there for a while without any reaction from me I did a rush job on it.

If you want me to remove the link to your Gist on the add-on page for now let me know.

As for the points you raise:

The add-on integrates the user_pitchdb.csv as follows:

  • For each Japanese word one pitch accent is determined (see my comments in Multiple pitch accents support #20)
  • To determine the pitch accent the add-on first goes through the Wadoku data from top to bottom
  • Then it goes through the user_pitchdb.csv from top to bottom

By processing the user_pitchdb.csv after the Wadoku data it allows users to not only add additional pitch accent data (as shown by the 字面 example in the file as it ships w/ the add-on) but also to overwrite the Wadoku pitch accents. That is, if Wadoku says 乗る is LHL but a user inserts 乗る LHH into their user_pitchdb.csv, the latter will be used.
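
The override behaviour described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the add-on's actual code: `build_pitch_dict` is a hypothetical helper, rows are simplified to plain (expression, reading, pattern) tuples, and the pattern given for 字面 is a placeholder rather than the value from the shipped file.

```python
def build_pitch_dict(wadoku_rows, user_rows):
    """Map each expression to a single (reading, pattern) pair.

    Rows are processed top to bottom, Wadoku first and user data second,
    so a later row for the same expression replaces an earlier one.
    """
    pitch = {}
    for expression, reading, pattern in list(wadoku_rows) + list(user_rows):
        pitch[expression] = (reading, pattern)  # last row seen wins
    return pitch


# Simplified stand-ins for the two files (hypothetical in-memory rows).
wadoku = [("乗る", "のる", "LHL")]
user = [
    ("乗る", "のる", "LHH"),      # overrides the Wadoku entry
    ("字面", "じづら", "LHH"),    # placeholder pattern, adds a new word
]

pitch = build_pitch_dict(wadoku, user)
print(pitch["乗る"])
```

Because the dictionary is keyed on the expression alone, the same mechanism that lets users override Wadoku also means that only the last of several rows for one word survives.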

Reading your description I guess the Kanjium data lists a Japanese word multiple times starting with the most common reading/pitch followed by less and less common readings/pitch accent patterns. In that case the pragmatic way to make the best use of the Kanjium data before #20 is addressed would be to only keep the first entry for every word in the Kanjium data.
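
A minimal sketch of that "keep only the first entry" filter, assuming the converted data is already available as (expression, reading, pattern) rows ordered from most to least common. `keep_first_entries` is a hypothetical helper, and the sample rows reuse the 汚れ example purely for illustration.

```python
def keep_first_entries(rows):
    """Return rows with only the first occurrence of each expression kept."""
    seen = set()
    kept = []
    for expression, reading, pattern in rows:
        if expression in seen:
            continue  # a row for this expression was already kept
        seen.add(expression)
        kept.append((expression, reading, pattern))
    return kept


rows = [
    ("汚れ", "けがれ", "LHHL"),
    ("汚れ", "よごれ", "LHHH"),
]
filtered = keep_first_entries(rows)
print(filtered)
```

Note that this drops every reading after the first, so a word with two valid readings keeps only one of them.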

/edit:

Also one question about "keeping the user_pitchdb.csv". Do you mean you want to keep it in the original package? So user can just replace it when installing alternative data?

To quickly elaborate on this: the user_pitchdb.csv in its current form as it ships with the add-on (i.e. only including one line with the pitch data for 字面) is there to provide an example. A “Dear user, you can use this file to extend or overwrite the pitch data used by the plugin. Here's a minimal example of how it works.” kind of thing. I feel keeping the user_pitchdb.csv this minimal gives users a good amount of freedom to use the file as they like. Want to use a completely different data source (e.g. Kanjium)? Sure, just replace the whole user_pitchdb.csv file. Want to manually add the odd word that isn't contained in the add-on's data (e.g. 字面)? Also perfectly fine.

@ithelor
Author

ithelor commented Sep 20, 2021

Thank you for your reply.

I'd appreciate it if you removed the link for now.
I'd also like to keep this discussion going, if you don't mind.

The way to fix the data you suggest may be pragmatic, but functionality is going to suffer greatly. It's not just about removing extra entries. For some words it would mean removing an entire reading type (e.g. 見物, which can be read in both systems, would be left with either けんぶつ (on) or みもの (kun)).

I believe the current situation is that the "expression" field acts as the only key to search through the custom DB. But the thing is, there is also the "reading" field. You select both while bulk adding pitches in the app, but is the latter even used in the search?
Doesn't using both keys (expression and reading) literally solve the entire problem? I believe you have separate functions to work around these two DBs, so it should probably work alright.

@IllDepence
Owner

Gotcha. Removed the link.

Regarding multiple readings:
The reading field is taken into consideration, but the CSV's structure and handling of wadoku_pitchdb.csv and user_pitchdb.csv differ.

  • wadoku_pitchdb.csv: Two levels of field separators (sth. like 汚れ | けがれ,よごれ | LHHL,LHHH ). Allowing the add-on to consider multiple readings and decide based on what's on the user's Anki card.
  • user_pitchdb.csv: One level of field separators (a normal TSV file to keep the format simple/easily understandable).

Because of this, only the last entry for a word in user_pitchdb.csv is considered. This could be changed by either changing the processing or changing the format of the user_pitchdb.csv. I feel the former makes more sense. From the perspective of a user, entering

汚れ    けがれ    LHHL
汚れ    よごれ    LHHH

into the user_pitchdb.csv is probably more intuitive than some format with two levels of separators like \u241e and \u241f in the wadoku data.
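
To make the difference between the two layouts concrete, here is a rough Python sketch of parsing one line of each. It rests on assumptions not confirmed in this thread: that \u241e separates fields and \u241f separates items within a field, and that a Wadoku line has exactly three fields. Both function names are hypothetical.

```python
FIELD_SEP = "\u241e"  # assumed outer (field) separator
ITEM_SEP = "\u241f"   # assumed inner (item) separator


def parse_two_level(line):
    """Parse a Wadoku-style line into (expression, readings, patterns)."""
    expression, readings, patterns = line.rstrip("\n").split(FIELD_SEP)
    return expression, readings.split(ITEM_SEP), patterns.split(ITEM_SEP)


def parse_tsv(line):
    """Parse a user_pitchdb.csv-style line into (expression, reading, pattern)."""
    expression, reading, pattern = line.rstrip("\n").split("\t")
    return expression, reading, pattern


wadoku_line = FIELD_SEP.join(
    ["汚れ", ITEM_SEP.join(["けがれ", "よごれ"]), ITEM_SEP.join(["LHHL", "LHHH"])]
)
user_lines = ["汚れ\tけがれ\tLHHL", "汚れ\tよごれ\tLHHH"]

print(parse_two_level(wadoku_line))
print([parse_tsv(line) for line in user_lines])
```

The two-level line carries all readings of a word in one record, while the flat TSV spreads them over several rows that the reader has to group itself.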

@ithelor
Author

ithelor commented Sep 20, 2021

I agree that the current structure makes more sense.
The only thing I suggest changing is the search algorithm that works with the custom DB: making it use two search keys (note fields, in our case) instead of one eliminates any possible discrepancy and also lets you keep that simple format.
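
A small sketch of what a two-key lookup could look like, assuming the file stays a plain three-column TSV. `load_user_db` and `lookup` are hypothetical names and do not reflect the add-on's real functions.

```python
def load_user_db(tsv_text):
    """Build a dict keyed on (expression, reading) from TSV text."""
    db = {}
    for line in tsv_text.splitlines():
        fields = line.split("\t")
        if len(fields) < 3:
            continue  # skip blank or malformed lines
        expression, reading, pattern = fields[:3]
        db[(expression, reading)] = pattern
    return db


def lookup(db, expression, reading):
    """Return the pattern for this expression and reading, or None."""
    return db.get((expression, reading))


db = load_user_db("汚れ\tけがれ\tLHHL\n汚れ\tよごれ\tLHHH\n")
print(lookup(db, "汚れ", "けがれ"))
print(lookup(db, "汚れ", "よごれ"))
```

With the reading as part of the key, two rows for the same word no longer overwrite each other, and a later row still replaces an earlier one only when both expression and reading match.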


2 participants