Character confidence threshold #3860

Open
wants to merge 16 commits into
base: main


plutasnyy
Contributor

@plutasnyy plutasnyy commented Jan 6, 2025

This change adds the ability to filter out characters predicted by Tesseract with low confidence scores.

Some notes:

  • I intentionally disabled it by default, although I think a low threshold (e.g. 0.9-0.95 for Tesseract) could be a safe choice.
  • I wanted to use character bboxes and combine them into a word bbox later. However, in some specific scenarios a bug in Tesseract returns incorrect character bboxes (the unit tests caught it 🥳). More details are in a comment in the code.
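
To illustrate the filtering described above, here is a minimal sketch. It is not the code from this PR: `CharPrediction` and `filter_low_confidence` are hypothetical names, and confidences are assumed to be normalized to [0, 1]. A threshold of 0.0 keeps every character, which mirrors the disabled-by-default behavior.

```python
from dataclasses import dataclass


@dataclass
class CharPrediction:
    """Hypothetical container for one recognized character."""

    char: str
    confidence: float  # assumed to be normalized to [0, 1]


def filter_low_confidence(chars, threshold=0.0):
    """Return the text built from characters whose confidence is >= threshold."""
    return "".join(c.char for c in chars if c.confidence >= threshold)


word = [
    CharPrediction("c", 0.99),
    CharPrediction("a", 0.97),
    CharPrediction("~", 0.12),  # likely noise
    CharPrediction("t", 0.98),
]
print(filter_low_confidence(word))        # threshold 0.0: nothing is dropped
print(filter_low_confidence(word, 0.95))  # drops the low-confidence "~"
```

Because only the text is filtered here, the word bbox would still have to come from the word-level output, which fits the note about unreliable character bboxes.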

@plutasnyy plutasnyy marked this pull request as ready for review January 8, 2025 10:39
@plutasnyy plutasnyy requested review from badGarnet and MaksOpp January 8, 2025 10:39
Contributor

@MaksOpp MaksOpp left a comment


LGTM!

@property
def TESSERACT_CHARACTER_CONFIDENCE_THRESHOLD(self) -> float:
    """Tesseract predictions with confidence below this threshold are ignored"""
    return self._get_float("TESSERACT_CHARACTER_CONFIDENCE_THRESHOLD", 0.0)
Contributor


I wonder, maybe we'd like to have some really low default threshold, e.g. 0.1, just to filter out complete garbage chars?

image: np.ndarray,
lang: str = "eng",
config: str = "",
character_confidence_threshold: float = 0.5,
Contributor


Here we are adding some default, so maybe let's also keep it in config?

Contributor


I see that below we again have 0.5 as the default in hocr_to_dataframe, so either way, I would unify those.
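
One possible way to unify the defaults, sketched under assumptions: the function arguments default to `None` and fall back to a single config value. `OCRConfig` and `resolve_threshold` are hypothetical names used only for illustration, and `_get_float` is assumed here to read an environment variable; the project's actual config class may behave differently.

```python
import os
from typing import Optional


class OCRConfig:
    """Hypothetical stand-in for the project's config object."""

    def _get_float(self, var: str, default: float) -> float:
        # Read the value from the environment, falling back to the default.
        return float(os.environ.get(var, default))

    @property
    def TESSERACT_CHARACTER_CONFIDENCE_THRESHOLD(self) -> float:
        return self._get_float("TESSERACT_CHARACTER_CONFIDENCE_THRESHOLD", 0.0)


def resolve_threshold(
    explicit: Optional[float] = None,
    config: Optional[OCRConfig] = None,
) -> float:
    """Use the explicit argument if given, otherwise the config value."""
    if explicit is not None:
        return explicit
    return (config or OCRConfig()).TESSERACT_CHARACTER_CONFIDENCE_THRESHOLD
```

With this shape the default lives in one place, and callers that pass a threshold explicitly still override it.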

2 participants