Character confidence threshold #3860
LGTM!
@property
def TESSERACT_CHARACTER_CONFIDENCE_THRESHOLD(self) -> float:
    """Tesseract predictions with confidence below this threshold are ignored"""
    return self._get_float("TESSERACT_CHARACTER_CONFIDENCE_THRESHOLD", 0.0)
I wonder, maybe we'd like to have some really low default threshold, e.g. 0.1, just to filter out complete garbage chars?
image: np.ndarray,
lang: str = "eng",
config: str = "",
character_confidence_threshold: float = 0.5,
Here we are adding some default, so maybe let's also keep it in config?
I see below we again have 0.5 as a default in hocr_to_dataframe, so either way, I would unify those.
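One way the defaults could be unified is sketched below. This is only an illustration, not the code in this PR: the constant `DEFAULT_CHARACTER_CONFIDENCE_THRESHOLD` and the helper `resolve_character_confidence_threshold` are hypothetical names, and reading the value straight from the environment stands in for whatever the config class does internally.

```python
import os
from typing import Optional

# Hypothetical single source of truth for the default; the value 0.0 mirrors
# the config default shown in the diff above.
DEFAULT_CHARACTER_CONFIDENCE_THRESHOLD = 0.0


def resolve_character_confidence_threshold(explicit: Optional[float] = None) -> float:
    """Pick the threshold: explicit argument first, then env var, then the shared default."""
    if explicit is not None:
        return float(explicit)
    raw = os.environ.get("TESSERACT_CHARACTER_CONFIDENCE_THRESHOLD")
    if raw is not None:
        return float(raw)
    return DEFAULT_CHARACTER_CONFIDENCE_THRESHOLD
```

With something like this, function signatures could take `character_confidence_threshold: Optional[float] = None` and call the helper, so a literal such as 0.5 would not need to be repeated in several places.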
This change adds the ability to filter out characters predicted by Tesseract with low confidence scores.
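For readers skimming the thread, the filtering idea can be sketched roughly as follows. This is a minimal standalone illustration, not the PR's implementation: `CharPrediction` and `filter_low_confidence` are hypothetical names, and confidences are assumed to be on a 0 to 1 scale.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class CharPrediction:
    """Hypothetical container for one OCR character and its confidence (assumed 0..1)."""
    text: str
    confidence: float


def filter_low_confidence(
    chars: List[CharPrediction], threshold: float = 0.0
) -> List[CharPrediction]:
    """Keep characters whose confidence is at or above the threshold; drop the rest."""
    return [c for c in chars if c.confidence >= threshold]


predictions = [
    CharPrediction("a", 0.9),
    CharPrediction("?", 0.05),
    CharPrediction("b", 0.5),
]
kept = filter_low_confidence(predictions, threshold=0.1)
text = "".join(c.text for c in kept)
```

With a threshold of 0.0 every character is kept, which matches the "no filtering by default" behavior implied by the config default of 0.0.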
Some notes: