tokenizer, putzer, htmlEnt2Char -- three tools for corpus processing
===================================================================

tokenizer    -- a tokenizer with end-of-sentence detection (see "tokenizer -h")


putzer       -- removes unnecessary blanks etc. ("putzer" is German for "cleaner")
                (see "putzer -h")

htmlEnt2Char -- converts HTML-entities into characters (see "htmlEnt2Char -h")



Compile and install (see INSTALL):
 ./configure
 make && make install
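
A typical processing pipeline chains the three tools as filters. The following
is only a sketch: it assumes that all three programs read from stdin and write
to stdout (the version history below states this explicitly only for putzer),
it uses a placeholder file name, and it omits the language/codepage options,
which are listed in each tool's -h output. Options such as putzer's -m or
tokenizer's -c (see below) can be added as needed:

 htmlEnt2Char < corpus.html | putzer | tokenizer > corpus.tok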


I tried to write a fast, rule-based, and also to some extent robust
tokenizer and sentence segmenter.

Currently supported languages are:
 * German (see also file LIESMICH)
 * English (thanks also to Michaela Geierhos)
 * Russian
For each language, the corresponding ISO and MS-Windows codepages are
supported, and partly UTF-8.


Features:

1. customizable through options
 - language and codepage
 - try to undo hyphenation
 - semantics of line breaks (paragraph separator or not)
 - etc.

2. problems and strategies for tokenization
 - hyphenated words are treated as one token
 - option -c concatenates words split by a hyphen at end-of-line.
   This may cause errors, although a small exception list is defined

3. end-of-sentence detection:
 - positive:
   * end-of-sentence marker followed by blank and uppercase letter
 - negative:
   * abbreviations (except for, e.g., "etc." which often occurs at EOS)
   * dates
 - positive:
   * a negative case followed by a word usually used exclusively at BOS
     (capitalized determiners, conjunctions, etc.)
 - try to handle additional punctuation symbols following the full stop
   correctly (brackets, apostrophes etc.)
 - tests on the Brown corpus indicate an error rate of about 3%
   (see the example below)
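
The Wall Street Journal example quoted under version 0.12 below shows how the
negative and positive rules interact: the abbreviation "St." would normally
block an end-of-sentence, but the capitalized article "A" that follows is a
word usually found only at BOS, so an EOS is detected anyway. A sketch of such
a call (the language option for English is omitted here and, like the exact
output format, depends on the chosen options; see "tokenizer -h"):

 echo "The firm said it plans to sublease its current headquarters at 55 Water St. A spokesman declined to elaborate." | tokenizer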


Version history:

0.1 -- package with tokenizer, putzer, htmlEnt2Char

0.2 -- bug reported by js: when the input contains long words or many consecutive
       newlines, the tokenizer stops with "input buffer overflow". To avoid this,
       use putzer as a filter with the newly introduced option -m!

0.3 -- optimization (inlines & macros): now about 10% faster

0.4 -- corrected some details in German EOS detection; changed behaviour with option -sx:
       when a newline is recognized, a space is printed on a separate line
       instead of an empty line.

0.5 -- ':' is no longer considered an EOS mark. Additions to the German abbreviation list.

0.6 -- Added more German abbreviations and Roman numerals followed by a period.
       Added rudimentary support for utf-8 in German.

0.7 -- Better EOS for English, thanks to Michaela Geierhos;
       rudimentary support for utf-8 in English

0.8 -- fixed a bug in the Russian part that made the tokenizer hang

0.9 -- changes to German abbreviations
       rudimentary support for utf-8 in Russian

0.10 -- fixed a bug causing a segfault with the German language option;
        short sequences in parentheses are excluded from containing an end-of-sentence;
        additions to German abbreviations

0.11 -- fixed a bug with options -C and -c.
        Introduced positive rules for German EOS: i.e. if a capitalized article, conjunction,
        or preposition follows an abbreviation or date, there should be an EOS.

0.12 -- better documentation (in English)
        Positive rules also for English: The text «The firm said it
        plans to sublease its current headquarters at 55 Water St. A
        spokesman declined to elaborate.» (Wall Street Journal) is now
        correctly split into two sentences

1.0  -- (almost) no changes
        now licensed under the GPL

About:

Clone of Sebastian Nagel's tokenizer.
