-
-
Notifications
You must be signed in to change notification settings - Fork 2
/
Copy pathtodo.txt
130 lines (73 loc) · 4.56 KB
/
todo.txt
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
Check words that are the same but have different ё positions so that one version is chosen
-> "alternative form of" which is equal when ё is converted to е
Add derived verb forms to database
Count determiners as pronouns
Add option to convert book back to source format
Add algorithm to handle connected color words (with dash) тёмно-серый
Write script to upload the program
Benchmark with grammar detection vs without
Add two textfields to GUI where simple text can be stressed
Have some books -> convert them to txt (simply while iterating through them print them to a file)
->Then use them as reference data set to make refactorings -> if the output changes, (diff changes), something got changed in the
stress placement algorithm
Clean more data to make database "even smaller" (maybe indices and so on)
-> delete all form_of entries without case_tags
-> then delete some forms of duplicates (where pos is equal and linkages are equal/one does not have linkages)
-> check a better form of making linkages:
do not make all possible linkages, instead make only one pos-retaining linkages
if only non-pos-retaining linkages remain, make all of them
-> maybe also clean out rows that aren't needed
-> Benchmark lookup with some indices deleted -> delete the ones that bring the least performance benefit
-> have a simple list of entries that are unambiguous
Compare new to old data (do some words)
load entire database in memory to make the program faster
Add direct FB2 support
Does the dictionary lookup work with words without correctly marked ё?
Update readme to reflect the new refactored EbookStresser class
Make variables of classes _private
Compare raw_data with post-processed russian kaikki data -> Compare resulting dictionary, maybe with sqldiff
and for adding more:
http://nskhuman.ru/unislov/udarenija.php?nlet=1
-> scrape every word that is not in any other sources
maybe also gde-udarenija
Also add data from https://github.com/Text-extend-tools/python-yoficator
REMOVE unstressed words - especially from OpenRussian (before refusing to add data from Wiktionary out of fear of duplicates)
short the description of the database so that space is saved
Add words in other areas from ru-wiktionary
Use frequency data in udar repo
Remove un-yoified from words extracted from Wiktionary-RU -> This could also be a source for the yoification data
Develop way to find out which stressed form is more common -> Possibly the corpus that I won't get access to.
Delete words that don't need context really from DB
Delete all words that only have one syllable and no е/ё
Make experiments that test if GPT-3 can resolve homographs with different stress
Write about future that stress might be detected from audio data.
чебурашки
Write to that silero habr post
Evaluate method that uses AI only for ambiguous words and a word list for the rest
Upload to Pypi
Straightforward way to create database, depend on other packages
Fix add_wikipedia_to_db so that it works properly
Clean up yo data
Test if information gets lost with delete_data
(for example for words that have only one syllable, but pronounced differently based on declination)
Delete archaic stress forms if an alternatives with the same POS exist (первого)
More tests
Integrate "alternative form" in ruwiktionary data
WSD: Handle upper and lowercase disambiguation
Add Disambiguation examples to BigBench (why not)
После нескольких часов глубокого и спокойного сна дети внезапно просыпаются и вскакивают с покрасневшим лицом, с потом на лбу,
Кровью и потом: Анаболики
Add more examples where the aspect matters to the benchmark
Disable parser if not needed
LRUCache?
Delete everything unused from database
Upload simple_cases to zip file
* Do we really want to delete base_word_id?
Fix уже + такой
For уже we need to analyze Degree=Pos Cmp
For example:
Вот видишь, – сказала она, – мне курить вредно, я уже падаю
И если вдруг тебе казалось, что твоя сестра и её муж угадывали твои мысли, то такой приём называется «холодное чтение».
fix bug here (курсе): "Тогда он сказал ей, что она ему понравилась ещё на первом курсе университета."
# Has the yofication data actually been added? I don't think so
диверсионно-развед́ывательная