Is there a way to ignore hidden text when extracting words? #1233
Answered by jsvine
EdmundsEcho asked this question in Q&A
I have this function that uses `page.extract_words(...)`. For instance:
```python
def extract_words(page, page_number, object_id_start) -> list[PreElement]:
    words = []
    object_id = object_id_start
    for word in page.extract_words(x_tolerance=3):
        words.append({
            "ObjectId": object_id,
            "Page": page_number,
            "Text": word["text"],
            "Bounds": make_bounds(word, page.height)
        })
        object_id += 1
    return words
```
Answered by jsvine on Dec 9, 2024
Answer selected by
EdmundsEcho
Hi @EdmundsEcho, this will depend on the particular PDF in question. Can you share it here? If not, I'd suggest this general approach:

- Look at `page.chars`. Do the invisible chars have different attributes than the visible ones?
- If so, use `page.filter(...).extract_words(...)`; if you search for "filter" in the discussions here, you can find some examples of this being used for similar situations/solutions.
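
For illustration, here is a minimal sketch of that two-step approach. It assumes the hidden text happens to be filled in white (`non_stroking_color`), which may well not be true for your PDF; the helper names (`as_tuple`, `summarize_chars`, `keep_visible`, `extract_visible_words`) are hypothetical and not part of pdfplumber. Only `page.chars`, `page.filter(...)` and `extract_words(...)` come from the library.

```python
from collections import Counter

# ASSUMPTION: hidden chars are drawn in white (grayscale, RGB, or CMYK).
# Swap in whatever attribute actually differs in your PDF.
WHITE = {(1,), (1, 1, 1), (0, 0, 0, 0)}


def as_tuple(color):
    """Normalize a color value (None, number, list, or tuple) to a tuple."""
    if color is None:
        return ()
    if isinstance(color, (int, float)):
        return (color,)
    return tuple(color)


def summarize_chars(chars):
    """Step 1: count chars by (fontname, size, fill color) to spot odd groups."""
    return Counter(
        (
            c.get("fontname"),
            round(c.get("size", 0), 1),
            as_tuple(c.get("non_stroking_color")),
        )
        for c in chars
    )


def keep_visible(obj):
    """Step 2 test function: keep non-char objects and chars not filled in white."""
    if obj.get("object_type") != "char":
        return True
    return as_tuple(obj.get("non_stroking_color")) not in WHITE


def extract_visible_words(page, **kwargs):
    """Filter the page first, then extract words from what remains."""
    return page.filter(keep_visible).extract_words(**kwargs)
```

To try it, print `summarize_chars(page.chars)` for a page that contains hidden text and check whether any group stands out. If one does, adjust `keep_visible` to match, and `extract_visible_words(page, x_tolerance=3)` could then stand in for the `page.extract_words(x_tolerance=3)` call in the function from the question.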