Update unicharset_extractor.cpp (tesseract-ocr#1153)
* change IsWhitespace to IsUTF8Whitespace

To solve "Phase UP: Generating unicharset and unichar properties files" ERROR tesseract-ocr#1147

Please see tesseract-ocr#1147.

* Update unicharset_extractor.cpp

fix the "Phase UP: Generating unicharset and unichar properties files" ERROR

* Update unicharset_extractor.cpp

fix "Phase UP: Generating unicharset and unichar properties files" ERROR tesseract-ocr#1147

* Update unicharset_extractor.cpp

Fix the invalid-encoding problem and fix the comment
ivanzz1001 authored and zdenop committed Oct 13, 2017
1 parent 1b0379c commit fb359fc
Showing 1 changed file with 3 additions and 1 deletion.
4 changes: 3 additions & 1 deletion training/unicharset_extractor.cpp
@@ -50,7 +50,9 @@ static void AddStringsToUnicharset(const GenericVector<STRING>& strings,
                                      /*report_errors*/ true,
                                      strings[i].string(), &normalized)) {
       for (const string& normed : normalized) {
-        if (normed.empty() || IsWhitespace(normed[0])) continue;
+
+        // normed is a UTF-8 encoded string
+        if (normed.empty() || IsUTF8Whitespace(normed.c_str())) continue;
         unicharset->unichar_insert(normed.c_str());
       }
     } else {
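
The change above replaces a check on `normed[0]`, which is only the first byte of a UTF-8 string, with a check on the string itself. The following is a minimal, self-contained sketch of why the first byte alone cannot be classified as a code point. It is not Tesseract's implementation: `FirstCodePoint` and `IsWhitespaceCodePoint` are hypothetical helpers written for illustration, and the whitespace table is assumed to follow the Unicode White_Space property.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper (not part of Tesseract): decodes the first code point
// of a UTF-8 string. Returns U+FFFD for empty or malformed input. For brevity
// it does not reject overlong encodings or surrogate code points.
char32_t FirstCodePoint(const std::string& s) {
  const char32_t kReplacement = 0xFFFD;
  if (s.empty()) return kReplacement;
  const unsigned char b0 = static_cast<unsigned char>(s[0]);
  if (b0 < 0x80) return b0;  // Single-byte (ASCII) sequence.

  std::size_t len = 0;
  char32_t cp = 0;
  if ((b0 & 0xE0) == 0xC0) {
    len = 2;
    cp = b0 & 0x1F;
  } else if ((b0 & 0xF0) == 0xE0) {
    len = 3;
    cp = b0 & 0x0F;
  } else if ((b0 & 0xF8) == 0xF0) {
    len = 4;
    cp = b0 & 0x07;
  } else {
    return kReplacement;  // Stray continuation byte or invalid lead byte.
  }
  if (s.size() < len) return kReplacement;  // Truncated sequence.

  for (std::size_t i = 1; i < len; ++i) {
    const unsigned char b = static_cast<unsigned char>(s[i]);
    if ((b & 0xC0) != 0x80) return kReplacement;  // Not a continuation byte.
    cp = (cp << 6) | (b & 0x3F);
  }
  return cp;
}

// Hypothetical helper: true for code points assumed to carry the Unicode
// White_Space property.
bool IsWhitespaceCodePoint(char32_t cp) {
  return (cp >= 0x09 && cp <= 0x0D) || cp == 0x20 || cp == 0x85 ||
         cp == 0xA0 || cp == 0x1680 || (cp >= 0x2000 && cp <= 0x200A) ||
         cp == 0x2028 || cp == 0x2029 || cp == 0x202F || cp == 0x205F ||
         cp == 0x3000;
}
```

For example, the ideographic space U+3000 is encoded as the bytes `E3 80 80`. Its first byte, `0xE3`, is a lead byte rather than a code point, and where `char` is signed it is negative, so passing it to a function that expects a code point yields a meaningless value. Decoding first, as in `IsWhitespaceCodePoint(FirstCodePoint("\xE3\x80\x80"))`, classifies the actual character.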
