Trouble locating and extracting a specific table if multiple same structured tables are present in the page #1231
Closed
sxhamT
started this conversation in
Ask for help with specific PDFs
Replies: 1 comment
Actually, I found a solution: first locate the heading text, then use its y-coordinate to split the page into two parts. The first table in the part below the heading is the required table.
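For anyone landing here later, this is a minimal sketch of that approach, not the original code from the thread. It assumes a pdfplumber version that provides `Page.search`, and the function names (`looks_like_year_table`, `first_matching_table`, `extract_table_under_heading`), the year-range pattern, and the numeric check are all hypothetical choices made for illustration.

```python
import re

# Assumption: header cells look like "2021-2022" (or the shorter "2021-22").
YEAR_RANGE = re.compile(r"20\d{2}\s*-\s*(?:20)?\d{2}")


def looks_like_year_table(rows):
    """True if row 1 holds only year ranges and row 2 holds only digits."""
    if not rows or len(rows) < 2:
        return False
    # extract_tables() can return None for empty cells, so filter those out.
    header = [c.strip() for c in rows[0] if c and c.strip()]
    values = [c.strip() for c in rows[1] if c and c.strip()]
    return (
        bool(header)
        and all(YEAR_RANGE.fullmatch(c) for c in header)
        and bool(values)
        and all(c.isdigit() for c in values)
    )


def first_matching_table(tables):
    """Return the first table (in page order) with the expected structure."""
    return next((t for t in tables if looks_like_year_table(t)), None)


def extract_table_under_heading(pdf_path, page_number, heading_pattern):
    """Crop the page below the heading, then take the first matching table."""
    import pdfplumber  # imported here so the helpers above work without it

    with pdfplumber.open(pdf_path) as pdf:
        page = pdf.pages[page_number - 1]  # page_number is 1-based
        hits = page.search(heading_pattern, regex=True)
        if not hits:
            return None
        # "bottom" is the distance from the top of the page to the bottom
        # edge of the matched text.
        heading_bottom = hits[0]["bottom"]
        x0, _, x1, page_bottom = page.bbox
        below = page.crop((x0, heading_bottom, x1, page_bottom))
        return first_matching_table(below.extract_tables())
```

A call might look like `extract_table_under_heading("college1.pdf", 51, r"2\.5\.2\.2")`. Matching only the section number avoids problems with the heading wrapping onto a second line; the rest of the heading is plain text, so it should not be picked up as a table.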
I have these PDFs: college1.pdf, college2.pdf, college3.pdf
Table in question: the one under the section "2.5.2.2 Number of students appeared in the examination conducted by the institution year wise during the last five years", on pages 51, 48, and 48 respectively.
The code: code (mainly the extract_table function)
Expectation: locate the heading in each PDF and get the table directly under that heading. The table has 20xx-20xx year ranges in the first row and numeric data in the second row.
Currently: it finds the heading correctly, but it extracts the first table of the required structure that it finds on the page containing the heading.
Since the required table can sit below one or more similarly structured tables on the same page, is there a way to target only the table below the heading text? The tables themselves have no distinguishing features. I tried using a table index to pick the table manually, but the position of the heading on the page varies between PDFs, so the index is not stable. I also find the bounding boxes in the PDF a bit confusing.
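On the bounding-box confusion: in pdfplumber a bbox is `(x0, top, x1, bottom)`, with `top` and `bottom` measured downward from the top edge of the page. Under that convention, a fixed table index can be replaced by a rule that picks the nearest table starting below the heading. The sketch below is plain Python on bbox tuples; `index_of_table_below` and its `tolerance` parameter are hypothetical names used for illustration.

```python
from typing import Optional, Sequence, Tuple

# (x0, top, x1, bottom); top and bottom are distances from the page's top edge.
BBox = Tuple[float, float, float, float]


def index_of_table_below(
    heading_bottom: float,
    table_bboxes: Sequence[BBox],
    tolerance: float = 1.0,
) -> Optional[int]:
    """Index of the topmost table whose top edge lies below the heading.

    `tolerance` absorbs small rounding differences in the coordinates.
    """
    candidates = [
        (bbox[1], i)
        for i, bbox in enumerate(table_bboxes)
        if bbox[1] >= heading_bottom - tolerance
    ]
    # min() compares by top first, so it returns the nearest table below.
    return min(candidates)[1] if candidates else None
```

With pdfplumber, `table_bboxes` could be built as `[t.bbox for t in page.find_tables()]`, and `heading_bottom` taken from the `"bottom"` key of a `page.search(...)` hit, assuming a version that provides `Page.search`.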