Extracting table does not use correct borders #1244
tz850 started this conversation in Ask for help with specific PDFs
Replies: 1 comment 1 reply
Hi @tz850, and thanks for providing the PDF and visual debugging output. This is an interesting edge case, where the page has a lot of other graphical objects ( ...
im.reset().debug_tablefinder({
    "snap_tolerance": 0,
})
... seems to get you what you'd want.
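To carry the same setting over from visual debugging to the actual extraction, something like the following sketch might work. It assumes the table is on the first page of the PDF (`page_index=0`), which the thread does not confirm, and the helper name `extract_tables_without_snapping` is made up for illustration. Only `pdfplumber.open`, `pdf.pages`, and `page.extract_tables` are pdfplumber calls.

```python
# Same setting as in the debug_tablefinder call above: with a snap
# tolerance of 0, nearby parallel lines are not pulled onto each other.
TABLE_SETTINGS = {"snap_tolerance": 0}


def extract_tables_without_snapping(pdf_path, page_index=0):
    """Return the tables found on one page, using TABLE_SETTINGS.

    page_index=0 is an assumption; adjust it to the page that holds the table.
    """
    import pdfplumber  # third-party: pip install pdfplumber

    with pdfplumber.open(pdf_path) as pdf:
        page = pdf.pages[page_index]
        # Each table is a list of rows; each row is a list of cell strings.
        return page.extract_tables(TABLE_SETTINGS)
```

Calling `extract_tables_without_snapping("sample.pdf")` should then return the cell text based on the same borders that the debug image shows.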
Describe the bug
When extracting the table content on the page, it does not follow the table borders, but seems to use the label of the chart below.
This problem causes the text extracted from the cell to be incorrect. For example, in the third and fourth columns of the first row, the correct text should be " ", but the actual result of the extract_tables function is " ".
This is the original page.
Here is an image of the debug_tablefinder output. The cell borders pointed to by the arrows in the figure are not the actual borders of the table.
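One way borders can end up displaced like this is line snapping: pdfplumber's `snap_tolerance` setting moves parallel lines that lie within the tolerance of each other onto a common position. The toy function below is a simplified illustration of that idea, not pdfplumber's actual implementation, and the coordinates are invented for the example rather than taken from sample.pdf.

```python
def snap_positions(positions, tolerance):
    """Snap 1-D line positions: neighbours within `tolerance` share their mean."""
    if not positions:
        return []
    ordered = sorted(positions)
    clusters = [[ordered[0]]]
    for value in ordered[1:]:
        # Extend the current cluster while the gap to its last member is small.
        if value - clusters[-1][-1] <= tolerance:
            clusters[-1].append(value)
        else:
            clusters.append([value])
    snapped = {}
    for cluster in clusters:
        mean = sum(cluster) / len(cluster)
        for value in cluster:
            snapped[value] = mean
    # Return the snapped positions in the original input order.
    return [snapped[p] for p in positions]


# Hypothetical x-coordinates: a table border at 100.0, a chart line at 102.0,
# and another table border at 150.0.
lines = [100.0, 102.0, 150.0]
print(snap_positions(lines, 3))  # [101.0, 101.0, 150.0]
print(snap_positions(lines, 0))  # [100.0, 102.0, 150.0]
```

With a nonzero tolerance the chart line and the table border merge into one shifted line, while a tolerance of 0 leaves each line where it was drawn.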
Have you tried repairing the PDF?
I ran Ghostscript directly to produce a repaired PDF file, and the problem is the same.
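For anyone who wants to repeat the repair step, a minimal sketch of driving Ghostscript from Python could look like this. It assumes the `gs` executable is on the PATH, and the output name `repaired.pdf` is a placeholder; the exact options the poster used are not stated in the thread.

```python
import subprocess


def build_repair_command(src, dst):
    """Build a Ghostscript command that rewrites `src` to `dst` via pdfwrite."""
    return ["gs", "-o", dst, "-sDEVICE=pdfwrite", src]


def repair_pdf(src, dst):
    """Run Ghostscript; raises CalledProcessError if gs exits with an error."""
    subprocess.run(build_repair_command(src, dst), check=True)
```

For example, `repair_pdf("sample.pdf", "repaired.pdf")` would write a re-serialized copy that can then be opened with pdfplumber.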
Code to reproduce the problem
PDF file
sample.pdf
Expected behavior
I hope that the division of table cells will not be affected by other tables or charts.