Update unicharset_extractor.cpp (tesseract-ocr#1153)

* Change IsWhitespace to IsUTF8Whitespace to solve the "Phase UP: Generating unicharset and unichar properties files" error (tesseract-ocr#1147). Please reference: [tesseract-ocr#1147](tesseract-ocr#1147)
* Update unicharset_extractor.cpp: fix the "Phase UP: Generating unicharset and unichar properties files" error
* Update unicharset_extractor.cpp: fix the "Phase UP: Generating unicharset and unichar properties files" error (tesseract-ocr#1147)
* Update unicharset_extractor.cpp: fix the invalid-encoding problem and fix the comment
r92546024 · Oct 13, 2017 · commit fb359fc
1 parent 1b0379c
Showing 1 changed file with 3 additions and 1 deletion.
diff --git a/training/unicharset_extractor.cpp b/training/unicharset_extractor.cpp
@@ -50,7 +50,9 @@ static void AddStringsToUnicharset(const GenericVector<STRING>& strings,
                                      /*report_errors*/ true,
                                      strings[i].string(), &normalized)) {
       for (const string& normed : normalized) {
-        if (normed.empty() || IsWhitespace(normed[0])) continue;
+
+       // normed is a UTF-8 encoded string
+        if (normed.empty() || IsUTF8Whitespace(normed.c_str())) continue;
         unicharset->unichar_insert(normed.c_str());
       }
     } else {
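
The hunk above replaces a check on the first byte (`normed[0]`) with a check on the UTF-8 string as a whole. The sketch below illustrates why the difference matters. It is a standalone illustration, not Tesseract's implementation: `DecodeFirstCodepoint`, `IsWhitespaceCodepoint`, and `StartsWithUtf8Whitespace` are hypothetical helpers, and the whitespace table is a small subset chosen for the example. The lead byte of a multi-byte UTF-8 sequence is 0x80 or higher, so it is not a code point by itself, and it becomes a negative value where `char` is signed. That is presumably what made the per-byte check misbehave on non-ASCII training text.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>

// Hypothetical helper (not a Tesseract function): decodes the first code
// point of a UTF-8 string. Returns false on empty, truncated, or malformed
// input. Overlong encodings are not rejected in this sketch.
bool DecodeFirstCodepoint(const std::string& s, char32_t* out) {
  if (s.empty()) return false;
  const unsigned char b0 = static_cast<unsigned char>(s[0]);
  std::size_t len = 0;
  char32_t cp = 0;
  if (b0 < 0x80) {                 // 0xxxxxxx: plain ASCII
    len = 1; cp = b0;
  } else if ((b0 & 0xE0) == 0xC0) {  // 110xxxxx: 2-byte sequence
    len = 2; cp = b0 & 0x1F;
  } else if ((b0 & 0xF0) == 0xE0) {  // 1110xxxx: 3-byte sequence
    len = 3; cp = b0 & 0x0F;
  } else if ((b0 & 0xF8) == 0xF0) {  // 11110xxx: 4-byte sequence
    len = 4; cp = b0 & 0x07;
  } else {
    return false;                  // stray continuation or invalid lead byte
  }
  if (s.size() < len) return false;
  for (std::size_t i = 1; i < len; ++i) {
    const unsigned char b = static_cast<unsigned char>(s[i]);
    if ((b & 0xC0) != 0x80) return false;  // must be 10xxxxxx
    cp = (cp << 6) | (b & 0x3F);
  }
  *out = cp;
  return true;
}

// Illustrative subset of Unicode whitespace; a real table is longer.
bool IsWhitespaceCodepoint(char32_t cp) {
  return cp == 0x20 || (cp >= 0x09 && cp <= 0x0D) ||
         cp == 0xA0 || cp == 0x2028 || cp == 0x2029 || cp == 0x3000;
}

// Decodes before classifying, instead of classifying the raw first byte.
bool StartsWithUtf8Whitespace(const std::string& s) {
  char32_t cp = 0;
  return DecodeFirstCodepoint(s, &cp) && IsWhitespaceCodepoint(cp);
}
```

With these helpers, the string `"\xE3\x80\x80"` (U+3000, ideographic space) is recognized as whitespace, while `"\xE4\xB8\xAD"` (U+4E2D) is not. In both cases the first byte alone (0xE3 or 0xE4) says nothing about whitespace.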