Can't I split a two-column document using extract_words? #1247

waterfert · 2025-01-06T08:07:11Z

Hello,
I found that when a PDF document has two columns, I can get the columns separated in the text output by using page.extract_text(layout=True).
However, in my case I parse the PDF with the extract_words function together with table.bbox and so on, because I also need the text of the tables.
With extract_words, the words of both columns come back merged, so each row of the two-column page reads as one sentence.
So I thought I could tell the columns apart by the difference in the words' x0 values. However, I found that the x0 values of the words within the left column and within the right column vary by more than the x0 difference between the two columns.
Doesn't extract_text use the coordinate values of the words when it builds the text? Or am I mistaken?
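
One possible way to approach this, offered as a rough sketch rather than anything pdfplumber itself prescribes: instead of comparing x0 values (every word on a line has a different x0, so they spread widely inside a column), look for the widest horizontal band that no word's [x0, x1] interval overlaps and treat its midpoint as the column gutter. The helper names find_gutter, split_columns and words_to_text, and the min_gap and y_tolerance defaults, are hypothetical. The sketch only assumes the text, x0, x1 and top keys that extract_words returns, and it will not find a gutter if a heading or table spans both columns.

```python
def find_gutter(words, min_gap=10.0):
    """Return the x midpoint of the widest band no word overlaps, or None.

    Only gaps between words are considered, so page margins are ignored.
    min_gap (in PDF points) filters out ordinary spaces between words.
    """
    intervals = sorted((w["x0"], w["x1"]) for w in words)
    if not intervals:
        return None
    best = None  # (gap width, gap midpoint)
    reach = intervals[0][1]  # rightmost x1 seen so far
    for x0, x1 in intervals[1:]:
        gap = x0 - reach
        if gap >= min_gap and (best is None or gap > best[0]):
            best = (gap, (reach + x0) / 2)
        reach = max(reach, x1)
    return best[1] if best else None


def split_columns(words, min_gap=10.0):
    """Split words into [left, right] by the gutter, or [words] if none is found."""
    gutter = find_gutter(words, min_gap)
    if gutter is None:
        return [words]
    left = [w for w in words if (w["x0"] + w["x1"]) / 2 < gutter]
    right = [w for w in words if (w["x0"] + w["x1"]) / 2 >= gutter]
    return [left, right]


def words_to_text(words, y_tolerance=3.0):
    """Group words into lines by their top coordinate and join them left to right."""
    lines = []  # list of (top of first word in the line, [words])
    for w in sorted(words, key=lambda w: (w["top"], w["x0"])):
        if lines and abs(w["top"] - lines[-1][0]) <= y_tolerance:
            lines[-1][1].append(w)
        else:
            lines.append((w["top"], [w]))
    return "\n".join(
        " ".join(w["text"] for w in sorted(ws, key=lambda w: w["x0"]))
        for _, ws in lines
    )


# Usage with pdfplumber (the file name is a placeholder):
#   import pdfplumber
#   with pdfplumber.open("two_column.pdf") as pdf:
#       page = pdf.pages[0]
#       for column in split_columns(page.extract_words()):
#           print(words_to_text(column))
```

If the gutter position is already known, another option is to crop the page to each half with page.crop and call extract_words on each cropped page separately. Words that fall inside a table's bbox would probably need to be filtered out before looking for the gutter, since table rows can cross it.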

Replies: 0 comments