Can't I split a two-column document using extract_words? #1247
waterfert
started this conversation in
Ask for help with specific PDFs
Hello
I found out that if a PDF document consists of two columns, I can separate the columns in the extracted text using page.extract_text(layout=True).
However, in this case I parsed the PDF using the extract_words function, table.bbox, and so on, because I also needed the text of the tables.
If I use extract_words, the two-column document comes back merged, as if the text of both columns were one sentence.
So I thought I could use the jump in the words' x0 values between the columns. However, I found that the x0 values of the words within the left and right columns vary by more than the x0 jump between the columns.
When extract_text builds its output, doesn't it use the coordinates of the words? Am I mistaken?
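To make the idea concrete, here is a minimal sketch of what I am trying to do. Instead of comparing x0 values alone, it looks for the widest horizontal band that no word covers, using both x0 and x1. It assumes word dicts shaped like the ones extract_words returns (keys text, x0, x1, top). The names find_gutter, split_columns, words_to_text, min_gap, and y_tolerance are my own and not part of pdfplumber, and the sample words are made up. I am not claiming this is how extract_text works internally.

```python
def find_gutter(words, min_gap=10.0):
    """Return the x midpoint of the widest band no word covers, or None.

    Hypothetical helper: min_gap is an assumed threshold in PDF points.
    """
    intervals = sorted((w["x0"], w["x1"]) for w in words)
    best_gap, gutter = 0.0, None
    covered_until = None  # right edge of the area covered so far
    for x0, x1 in intervals:
        if covered_until is not None and x0 - covered_until > best_gap:
            best_gap = x0 - covered_until
            gutter = (covered_until + x0) / 2
        covered_until = x1 if covered_until is None else max(covered_until, x1)
    return gutter if best_gap >= min_gap else None


def split_columns(words, min_gap=10.0):
    """Split words into [left, right] at the gutter, or [words] if none is found."""
    gutter = find_gutter(words, min_gap)
    if gutter is None:
        return [words]
    left = [w for w in words if w["x1"] <= gutter]
    right = [w for w in words if w["x0"] >= gutter]
    return [left, right]


def words_to_text(words, y_tolerance=3.0):
    """Group words into lines by their top value and join them left to right."""
    lines = []  # list of (top of first word in the line, words in the line)
    for w in sorted(words, key=lambda w: (w["top"], w["x0"])):
        if lines and abs(w["top"] - lines[-1][0]) <= y_tolerance:
            lines[-1][1].append(w)
        else:
            lines.append((w["top"], [w]))
    return "\n".join(
        " ".join(x["text"] for x in sorted(ws, key=lambda x: x["x0"]))
        for _, ws in lines
    )


# Made-up sample words; with pdfplumber these would come from page.extract_words().
words = [
    {"text": "Left", "x0": 50, "x1": 80, "top": 100},
    {"text": "column", "x0": 85, "x1": 130, "top": 100},
    {"text": "Right", "x0": 320, "x1": 355, "top": 100},
    {"text": "side", "x0": 360, "x1": 390, "top": 100},
    {"text": "text", "x0": 50, "x1": 75, "top": 115},
    {"text": "here", "x0": 320, "x1": 350, "top": 115},
]

for column in split_columns(words):
    print(words_to_text(column))
    print("---")
```

This only works if no word crosses the space between the columns, so a full-width title or a table spanning both columns would have to be cropped out first.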