How to change linguistic resources
- git clone Bling-Fire-Git-Path
- cd BlingFire
- mkdir Release
- cd Release
- cmake -DCMAKE_BUILD_TYPE=Release ..
- make
This will take a few minutes.
Alternatively, you can use Visual Studio Code with the CMake, CMake Tools and C/C++ plugins installed. Select Release mode for the build; your files will then be in the build folder.
Now you need to either install the tools into a location that is already in PATH or add the BlingFire directory with the tools to PATH. For the latter, run this command from the BlingFire directory:
- . ./scripts/set_env
Let's make sure that the tools are actually in the PATH. Type:
fa_nfa2dfa --help
All tools respond to --help, so you should see something like:
Usage: fa_nfa2dfa [OPTION] [< input.txt] [> output.txt]

This program converts non-deterministic finite-state machine into deterministic one.

  --in=input-file - reads input from the input-file, if omited stdin is used
  --out=output-file - writes output to the output-file, if omited stdout is used
  --out2=output-file - writes output to the output-file, if omited stdout is used
  --pos-nfa=input-file - reads reversed position NFA from input-file, needed for --fsm=pos-rs-nfa to store only ambiguous positions, if omited stores all positions
  --fsm=rs-nfa - makes convertion from Rabin-Scott NFA (is used by default)
  --fsm=pos-rs-nfa - makes convertion from Rabin-Scott position NFA, builds Moore Multi Dfa
  --fsm=mealy-nfa - makes convertion from Mealy NFA into a cascade of two Mealy Dfa (general case) or a single Mealy DFA (trivial case)
  --spec-any=N - treats input weight N as a special any symbol, if specified produces Dfa with the same symbol on arcs, which must be interpreted as any other
  --bi-machine - uses bi-machine for Mealy NFA determinization
  --no-output - does not do any output
  --verbose - prints out debug information, if supported
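If fa_nfa2dfa is not found, it can help to check all the tools used on this page in one go. The following is a minimal sketch, not part of BlingFire: the function name missing_tools is made up for illustration, and the tool list contains only the names that appear on this page.

```shell
# Hypothetical helper (not shipped with BlingFire): print every name from the
# argument list that cannot be resolved through PATH, one per line.
missing_tools() {
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
    done
}

# Check the tools mentioned on this page.
missing=$(missing_tools fa_nfa2dfa fa_build_conf fa_build_lex \
    fa_fsm2fsm_pack fa_merge_dumps fa_lex test_ldb)

if [ -n "$missing" ]; then
    printf 'Not found in PATH:\n%s\n' "$missing"
else
    echo "All tools found in PATH"
fi
```

If anything is reported as missing, run . ./scripts/set_env again from the BlingFire directory in the same shell session.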
Let's change the working directory into the root for linguistic sources:
cd ldbsrc
Note: we will add separate documentation on the different formats of the linguistic resources. For the moment we will only modify the tokenization logic, like this:
touch wbd/wbd.lex.utf8
Now recompile the wbd directory. The word boundary disambiguation (wbd) logic, also called word-breaking or tokenization, is defined in this directory. Simply type:
make -f Makefile.gnu lang=wbd all
You should see something like this on the screen:
fa_build_conf \
  --in=wbd/ldb.conf.small \
  --out=wbd/tmp/ldb.mmap.small.txt
fa_fsm2fsm_pack --type=mmap \
  --in=wbd/tmp/ldb.mmap.small.txt \
  --out=wbd/tmp/ldb.conf.small.dump \
  --auto-test
fa_build_lex --dict-root=. --full-unicode --in=wbd/wbd.lex.utf8 \
  --tagset=wbd/wbd.tagset.txt --out-fsa=wbd/tmp/wbd.rules.fsa.txt \
  --out-fsa-iwmap=wbd/tmp/wbd.rules.fsa.iwmap.txt \
  --out-map=wbd/tmp/wbd.rules.map.txt
fa_fsm2fsm_pack --alg=triv --type=moore-dfa --remap-iws --use-iwia --in=wbd/tmp/wbd.rules.fsa.txt --iw-map=wbd/tmp/wbd.rules.fsa.iwmap.txt --out=wbd/tmp/wbd.fsa.small.dump
fa_fsm2fsm_pack --alg=triv --type=mmap --in=wbd/tmp/wbd.rules.map.txt --out=wbd/tmp/wbd.mmap.small.dump --auto-test
fa_merge_dumps --out=ldb/wbd.bin wbd/tmp/ldb.conf.small.dump wbd/tmp/wbd.fsa.small.dump wbd/tmp/wbd.mmap.small.dump
This means that make is doing its job and remaking all the dependent targets.
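The touch command works here because make decides what to rebuild by comparing file modification times. The sketch below is a rough approximation of that check for a single source and target pair. The function name needs_rebuild is hypothetical, and the real dependency chain is whatever Makefile.gnu defines, which may involve more files than this pair.

```shell
# Hypothetical helper: succeeds when the target ($2) is missing or older than
# the source ($1). This mimics make's timestamp comparison for one pair only.
needs_rebuild() {
    [ ! -e "$2" ] || [ "$1" -nt "$2" ]
}

# Usage, from the ldbsrc directory (paths as used on this page):
#   if needs_rebuild wbd/wbd.lex.utf8 ldb/wbd.bin; then
#       echo "ldb/wbd.bin is out of date"
#   fi
```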
If you see an "ERROR: XYZ" message on the screen, find the one that appeared first and try to understand which tool it came from, what the input to this tool was and what the command line parameters were. Double-check with --help that these parameters make sense. Let us know if you are stuck; we'll be happy to help.
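When the build prints a lot of output, saving it to a file makes the first error easier to find. This is a small sketch under two assumptions: the log file name build.log and the function name first_error are made up for illustration, and error lines contain the literal text "ERROR:" as described above.

```shell
# Hypothetical helper: print the first line of a log file that contains
# "ERROR:", prefixed with its line number. Prints nothing if there is none.
first_error() {
    awk '/ERROR:/ { print NR ": " $0; exit }' "$1"
}

# Usage, from the ldbsrc directory (build.log is an assumed file name):
#   make -f Makefile.gnu lang=wbd all > build.log 2>&1
#   first_error build.log
```

The command echoed by make just before the reported line usually shows which tool failed and with which parameters.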
- For the tokenizer you can use the fa_lex tool. See fa_lex --help for more details.
printf "Hi There! This is a simple test." | fa_lex --ldb=ldb/wbd.bin --tagset=wbd/wbd.tagset.txt
The output should be something like:
Hi/WORD There/WORD !/WORD This/WORD is/WORD a/WORD simple/WORD test/WORD ./WORD
- For single-token transformations you can use the test_ldb tool. See test_ldb --help for more details.
- See tools.txt for details
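The fa_lex output shown above is a sequence of token/TAG pairs separated by spaces. If you only need the tokens, for example to compare the tokenization before and after a change, a sketch like the following can strip the tags. The function name strip_tags is hypothetical, and the code assumes that tokens contain no whitespace and that the tag is everything after the last "/" of each pair.

```shell
# Hypothetical helper: read whitespace-separated "token/TAG" pairs on stdin
# and print one token per line, dropping the last "/" and the tag after it.
strip_tags() {
    awk '{ for (i = 1; i <= NF; i++) { sub("/[^/]*$", "", $i); print $i } }'
}

# Demo on the sample output from this page.
printf 'Hi/WORD There/WORD !/WORD This/WORD is/WORD a/WORD simple/WORD test/WORD ./WORD\n' | strip_tags

# With the real tool, from the ldbsrc directory:
#   printf "Hi There! This is a simple test." \
#       | fa_lex --ldb=ldb/wbd.bin --tagset=wbd/wbd.tagset.txt | strip_tags
```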
The Linguistic Data Base (LDB) files are simply containers that combine address-independent memory dumps of different structures, such as maps, multi-maps, finite state automata and arrays.
The runtime options are also compiled into one of those maps (the configuration map) and are part of the final LDB file. This avoids usage mistakes that are difficult to find, for example a dictionary that was collected in a case-sensitive way being looked up in a case-insensitive way. The compiled configuration map defines which functions the LDB has resources for and what parameters should be used for each function at runtime.
ldbsrc -- main LDB root
    Name_1 -- name of the project #1
        ldb.conf.small -- runtime configuration parameters for the project #1, required file
        options.small -- LDB compilation options for the project #1, required file
        [other resources]
    Name_2 -- name of the project #2
        ...
    ldb -- a root for all the compiled binary files
        name1.bin -- compiled binary for the project #1
        name2.bin -- compiled binary for the project #2
        ...
    Makefile.gnu -- make file for compilation
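Since the layout above marks ldb.conf.small and options.small as required for every project, a quick check before running make can catch an incomplete project directory. The following is a minimal sketch, assuming exactly that layout; the function name list_projects is made up for illustration.

```shell
# Hypothetical helper: print the name of each subdirectory of the LDB source
# root ($1, defaults to the current directory) that contains both required
# files, ldb.conf.small and options.small.
list_projects() {
    root=${1:-.}
    for dir in "$root"/*/; do
        if [ -f "${dir}ldb.conf.small" ] && [ -f "${dir}options.small" ]; then
            basename "$dir"
        fi
    done
}

# Usage, from the BlingFire directory:
#   list_projects ldbsrc
```

A directory that is missing from the output either lacks one of the two required files or is not a project directory at all, such as ldb.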