Why don't non-coloured cells get captured? #1228
The reason is that the non-colored cells aren't explicitly non-colored cells; they simply lack any graphical element at all. To extract this table, I'd suggest identifying all of the
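One way to check this for yourself is to tally the fill colours of the rectangles reported for the page. The following is only a minimal sketch: `fill_colour_counts` is a hypothetical helper name, the sample dictionaries are made up, and it assumes each rectangle is a dict carrying a `non_stroking_color` key, which is how pdfplumber exposes the fill colour on the objects in `page.rects`.

```python
from collections import Counter


def fill_colour_counts(rects):
    """Tally rectangles by fill colour (hypothetical helper, not part of pdfplumber)."""
    counts = Counter()
    for rect in rects:
        colour = rect.get("non_stroking_color")
        # Colours may arrive as lists; convert to tuples so they can be used as keys.
        if isinstance(colour, list):
            colour = tuple(colour)
        counts[colour] += 1
    return counts


# Made-up sample standing in for page.rects: two shaded cells and one header bar.
sample_rects = [
    {"x0": 10, "top": 50, "x1": 60, "bottom": 70, "non_stroking_color": (0.9, 0.9, 0.9)},
    {"x0": 60, "top": 70, "x1": 110, "bottom": 90, "non_stroking_color": [0.9, 0.9, 0.9]},
    {"x0": 10, "top": 30, "x1": 110, "bottom": 50, "non_stroking_color": (0.2, 0.4, 0.8)},
]

counts = fill_colour_counts(sample_rects)
```

With a real page you would pass `page.rects` instead of the sample list. If the unshaded cells never show up in the tally, no rectangle is drawn there, so a line-based table strategy has nothing to detect.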
If there is no Total Hours table on the page, you could try using the page footer as the "bottom" marker, as has been suggested.

import pandas as pd

page = ...
tables = page.find_tables()
schedule = None
total_hours = None
# locate the two tables by the text in their first cell
for table in tables:
    rows = table.extract()
    name = rows[0][0]
    if name == "Schedule Details":
        schedule = table
    if name == "Total Hours and Statistics":
        total_hours = table
if total_hours is None:
    # no "Total Hours" table: fall back to the top of the page footer
    bottom = page.search(r"Generated on.*Page\s*\d+\s*of\s*\d")[0]["top"]
else:
    bottom = total_hours.bbox[1]
bbox = list(schedule.bbox)
bbox[1] = schedule.cells[0][-1]  # bottom of "Schedule Details" cell is the "top" of the crop area
bbox[-1] = bottom  # either the "top" of "Total Hours" or the page footer
crop = page.crop(bbox)
# pick a table to use their vertical lines (+ right edge of table)
explicit_vertical_lines = [cell[0] for cell in crop.find_tables()[-1].cells] + [bbox[2]]
rows = crop.extract_table({"explicit_vertical_lines": explicit_vertical_lines})
df = pd.DataFrame(rows[1:], columns=rows[0])
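The vertical-line step can be tried in isolation. This is a sketch under stated assumptions: `column_edges` is a hypothetical helper name, the sample cells are made up, and cells are taken to be `(x0, top, x1, bottom)` tuples as in `table.cells`. Rounding to one decimal place is my own choice for merging near-identical positions, not something the library requires.

```python
def column_edges(cells, right_edge):
    """Return sorted, de-duplicated x positions to use as explicit vertical lines.

    Hypothetical helper: collects the left edge of every cell plus the
    right edge of the table.
    """
    # Rounding (an assumption) merges edges that differ only by float noise.
    edges = {round(cell[0], 1) for cell in cells}
    edges.add(round(right_edge, 1))
    return sorted(edges)


# Made-up 2x2 grid of cells as (x0, top, x1, bottom).
sample_cells = [
    (10.0, 50.0, 60.0, 70.0),
    (60.0, 50.0, 110.0, 70.0),
    (10.0, 70.0, 60.0, 90.0),
    (60.0, 70.0, 110.0, 90.0),
]

edges = column_edges(sample_cells, right_edge=110.0)
```

The resulting list could then be passed as the `explicit_vertical_lines` table setting in place of the list comprehension, which repeats each x position once per row.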
I have this PDF:
pdftest.pdf
I don't get why non-coloured cells don't get detected. This is what I have tried so far:
The table I'm interested in is Schedule details.
I have also tried to find a way to discriminate cells based on their background colour, without luck.
Thanks for your time. Awesome library.
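On telling cells apart by background colour, one possible approach is to look up which filled rectangle, if any, covers the centre of each cell. This is a rough sketch only: `cell_fill_colour` is a hypothetical helper, the sample data is made up, and it assumes rectangles are dicts with `x0`, `top`, `x1`, `bottom` and `non_stroking_color` keys, as pdfplumber provides in `page.rects`.

```python
def cell_fill_colour(cell, rects):
    """Return the fill colour of the first rect covering the cell centre, else None.

    Hypothetical helper: `cell` is an (x0, top, x1, bottom) tuple.
    """
    x0, top, x1, bottom = cell
    cx, cy = (x0 + x1) / 2, (top + bottom) / 2
    for rect in rects:
        inside_x = rect["x0"] <= cx <= rect["x1"]
        inside_y = rect["top"] <= cy <= rect["bottom"]
        if inside_x and inside_y:
            colour = rect.get("non_stroking_color")
            # Normalise lists to tuples so results compare consistently.
            return tuple(colour) if isinstance(colour, list) else colour
    return None


# Made-up sample: only the left cell has a shaded rectangle behind it.
sample_rects = [
    {"x0": 10, "top": 50, "x1": 60, "bottom": 70, "non_stroking_color": [0.9, 0.9, 0.9]},
]

shaded = cell_fill_colour((10, 50, 60, 70), sample_rects)
blank = cell_fill_colour((60, 50, 110, 70), sample_rects)
```

A result of `None` means no rectangle was drawn behind that cell, which matches the "non-coloured" cells in this thread.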