tokenizer, putzer, htmlEnt2Char -- three tools for corpus processing
===================================================================

tokenizer    -- a tokenizer with end-of-sentence detection (see "tokenizer -h")


putzer       -- removes unnecessary blanks etc. ("putzer" is German for "cleaner")
                (see "putzer -h")

htmlEnt2Char -- converts HTML-entities into characters (see "htmlEnt2Char -h")



Compile and install (see INSTALL):
 ./configure
 make && make install
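
A typical processing pipeline chains the three tools as filters. The following
is only a sketch: it assumes that all three programs read from stdin and write
to stdout (the version history below states this explicitly only for putzer),
it uses a placeholder file name, and it omits the language/codepage options,
which are listed in each tool's -h output. Options such as putzer's -m or
tokenizer's -c (see below) can be added as needed:

 htmlEnt2Char < corpus.html | putzer | tokenizer > corpus.tok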


I tried to write a fast, rule-based, and also to some extent robust
tokenizer and sentence segmenter.

Currently supported languages are:
 * German (see also file LIESMICH)
 * English (thanks also to Michaela Geierhos)
 * Russian
For each language, the corresponding ISO and MS-Windows codepages are
supported, and partly UTF-8.


Features:

1. customizable through options
 - language and codepage
 - try to undo hyphenation
 - semantics of line breaks (paragraph separator or not)
 - etc.

2. problems and strategies for tokenization
 - hyphenated words are treated as one token
 - option -c concatenates words split by a hyphen at end-of-line.
   This may cause errors, although a small exception list is defined

3. end-of-sentence detection:
 - positive:
   * end-of-sentence marker followed by blank and uppercase letter
 - negative:
   * abbreviations (except for, e.g., "etc." which often occurs at EOS)
   * dates
 - positive:
   * a negative case followed by a word usually used exclusively at BOS
     (capitalized determiners, conjunctions, etc.)
 - try to handle additional punctuation symbols following the full stop
   correctly (brackets, apostrophes etc.)
 - tests on the Brown corpus indicate an error rate of about 3%
   (see the example below)
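
The Wall Street Journal example quoted under version 0.12 below shows how the
negative and positive rules interact: the abbreviation "St." would normally
block an end-of-sentence, but the capitalized article "A" that follows is a
word usually found only at BOS, so an EOS is detected anyway. A sketch of such
a call (the language option for English is omitted here and, like the exact
output format, depends on the chosen options; see "tokenizer -h"):

 echo "The firm said it plans to sublease its current headquarters at 55 Water St. A spokesman declined to elaborate." | tokenizer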


Version history:

0.1 -- package with tokenizer, putzer, htmlEnt2Char

0.2 -- bug reported by js: when the input contains long words or many consecutive
       newlines, the tokenizer stops with "input buffer overflow". To avoid this,
       use putzer as a filter with the newly introduced option -m!

0.3 -- optimization (inlines & macros): now about 10% faster

0.4 -- corrected some details in German EOS detection; changed behaviour with option -sx:
       when a newline is recognized, a space is printed on a separate line
       instead of an empty line.

0.5 -- ':' is no longer considered an EOS mark. Additions to the German abbreviation list.

0.6 -- Added more German abbreviations and Roman numerals followed by a period.
       Added rudimentary support for utf-8 in German.

0.7 -- Better EOS for English, thanks to Michaela Geierhos;
       rudimentary support for utf-8 in English

0.8 -- fixed a bug in the Russian part that made the tokenizer hang

0.9 -- changes to German abbreviations
       rudimentary support for utf-8 in Russian

0.10 -- fixed a bug causing a segfault with the German language option;
        short sequences in parentheses are excluded from containing an end-of-sentence;
        additions to German abbreviations

0.11 -- fixed a bug with options -C and -c.
        Introduced positive rules for German EOS: i.e. if a capitalized article, conjunction,
        or preposition follows an abbreviation or date, there should be an EOS.

0.12 -- better documentation (in English)
        Positive rules also for English: The text «The firm said it
        plans to sublease its current headquarters at 55 Water St. A
        spokesman declined to elaborate.» (Wall Street Journal) is now
        correctly split into two sentences

1.0  -- (almost) no changes
        now licensed under the GPL

About:

Clone of Sebastian Nagel's tokenizer.
