Trouble locating and extracting a specific table if multiple same structured tables are present in the page #1231

sxhamT · 2024-12-03T16:13:20Z


I have these PDFs: college1.pdf, college2.pdf, college3.pdf

Table in question: the one under the section "2.5.2.2 Number of students appeared in the examination conducted by the institution year wise during the last five years", on pages 51, 48, and 48 respectively.

The code: code (mainly the extract_table function)

Expectation: locate the heading in each PDF and get the table directly under that heading. The table has year ranges of the form 20xx-20xx in the first row and numeric data in the second row.
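As an illustration of the structure described above, here is a minimal sketch of a check for it. It assumes the table has already been extracted as a list of rows of strings (cells may be None); the function name looks_like_year_table and the exact patterns are hypothetical, not taken from the linked code.

```python
import re
from typing import List, Optional, Sequence

# Assumed cell formats: "20xx-20xx" in row 1, plain digits (commas allowed) in row 2.
YEAR_RANGE = re.compile(r"^20\d{2}\s*-\s*20\d{2}$")
NUMBER = re.compile(r"^\d[\d,]*$")


def looks_like_year_table(rows: Sequence[Sequence[Optional[str]]]) -> bool:
    """Return True if row 1 holds only year ranges and row 2 only numbers."""
    if len(rows) < 2 or not rows[0] or not rows[1]:
        return False
    years = [(cell or "").strip() for cell in rows[0]]
    values = [(cell or "").strip() for cell in rows[1]]
    return all(YEAR_RANGE.match(c) for c in years) and all(
        NUMBER.match(c) for c in values
    )
```

A check like this only identifies the table type. It cannot tell apart several tables of the same shape on one page, which is the problem described next.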

Currently: it finds the heading correctly, but it extracts the first table of the required structure that it finds on the page containing the heading.

Since the required table can sit below one or more similarly structured tables on the same page, is there a way to target only the table below the heading text? The tables themselves don't have any distinguishing features. I tried using a table index to pick the table manually, but the position of the heading on the page varies between PDFs, and the bounding boxes in the PDF are a bit confusing.

(Screenshots: what it should extract is highlighted in yellow; what it actually extracts is highlighted in orange.)

sxhamT · 2024-12-06T12:24:02Z


Actually, I found a solution: first locate the heading text, then use its y-coordinate to split the page into two parts. The first table in the part below the heading is the required table.
BONUS: if the heading text and the table are split across pages, I use a threshold: when the heading's y-coordinate is greater than 700, I look for the first table on the next page. (How do I close this discussion?)
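
For anyone landing here later, a minimal sketch of that selection logic, not the actual code from this thread. It is library-agnostic and assumes you already have the heading's top and bottom y-coordinates and a bounding box (x0, top, x1, bottom) for each table on each page, with the origin at the top-left so that larger y means lower on the page. The names first_table_below and locate_table are hypothetical; the 700 threshold is the value mentioned above and depends on page height.

```python
from typing import List, Optional, Sequence, Tuple

# (x0, top, x1, bottom); assumes a top-left origin, so larger y is lower on the page.
BBox = Tuple[float, float, float, float]


def first_table_below(y: float, bboxes: Sequence[BBox]) -> Optional[int]:
    """Index of the topmost table whose top edge is at or below y, else None."""
    below = [(bbox[1], i) for i, bbox in enumerate(bboxes) if bbox[1] >= y]
    return min(below)[1] if below else None


def locate_table(
    heading_page: int,
    heading_top: float,
    heading_bottom: float,
    tables_by_page: List[Sequence[BBox]],
    y_threshold: float = 700.0,  # value from the post; tune to your page size
) -> Optional[Tuple[int, int]]:
    """Return (page_index, table_index) of the table that follows the heading."""
    if heading_top > y_threshold:
        # Heading sits near the bottom: take the first table on the next page.
        next_page = heading_page + 1
        if next_page < len(tables_by_page):
            idx = first_table_below(0.0, tables_by_page[next_page])
            if idx is not None:
                return next_page, idx
        return None
    # Otherwise take the first table in the part of the page below the heading.
    idx = first_table_below(heading_bottom, tables_by_page[heading_page])
    return (heading_page, idx) if idx is not None else None
```

Comparing bounding boxes avoids relying on a fixed table index, which breaks when the heading sits at a different height in each PDF.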
