
Is there a way to ignore hidden text when extracting words? #1233

Answered by jsvine
EdmundsEcho asked this question in Q&A

Hi @EdmundsEcho, this will depend on the particular PDF in question. Can you share it here? If not, I'd suggest this general approach:

  • Examine the objects in page.chars. Do the invisible chars have different attributes than the visible ones?
  • If so, use page.filter(...).extract_words(...); if you search for "filter" in the discussions here, you can find some examples of this being used for similar situations/solutions.
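
The two steps above can be sketched roughly as follows. This is only an illustration, not a confirmed fix for any particular PDF: it assumes the hidden text happens to be drawn in white, which you would need to verify by inspecting `page.chars` first. The helper names (`summarize_chars`, `is_visible`, `extract_visible_words`) and the `HIDDEN_COLORS` rule are hypothetical; only `pdfplumber.open`, `page.chars`, `page.filter`, and `extract_words` come from pdfplumber itself.

```python
from collections import Counter

# Assumption: hidden text is filled with white (grayscale or RGB).
# Replace this with whatever actually distinguishes the hidden chars in your PDF.
HIDDEN_COLORS = ((1,), (1, 1, 1))


def _as_tuple(value):
    """Normalize a color value (scalar, list, tuple, or None) to a tuple or None."""
    if value is None:
        return None
    if isinstance(value, (list, tuple)):
        return tuple(value)
    return (value,)


def summarize_chars(chars):
    """Step 1: count attribute combinations so unusual chars stand out."""
    return Counter(
        (
            c.get("fontname"),
            round(c.get("size", 0), 1),
            _as_tuple(c.get("non_stroking_color")),
        )
        for c in chars
    )


def is_visible(obj):
    """Step 2: test function for page.filter; keeps everything except 'hidden' chars."""
    if obj.get("object_type") != "char":
        return True  # leave lines, rects, etc. untouched
    return _as_tuple(obj.get("non_stroking_color")) not in HIDDEN_COLORS


def extract_visible_words(pdf_path, page_number=0):
    """Open the PDF, drop chars judged hidden, and extract words from the rest."""
    import pdfplumber  # imported here so the helpers above work without it

    with pdfplumber.open(pdf_path) as pdf:
        page = pdf.pages[page_number]
        print(summarize_chars(page.chars).most_common())  # inspect before trusting the rule
        return page.filter(is_visible).extract_words()
```

If the summary shows that the hidden chars differ in some other attribute (for example font name or size) rather than color, only `is_visible` needs to change.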

Answer selected by EdmundsEcho
Category
Q&A
Labels
awaiting-code-or-pdf Issues and PRs awaiting code and/or a PDF from issue/PR-author
2 participants