Why does not non-coloured cells get captured? #1228

ElSabio97 · 2024-11-22T17:34:34Z

I have this PDF:
pdftest.pdf

I don't get why non-coloured cells don't get detected. This is what I have tried so far:

import pdfplumber

ruta_pdf = "/content/bea3.pdf"

tables = []

with pdfplumber.open(ruta_pdf) as pdf:
    for pagina in pdf.pages:
        # extract_tables() returns a list of tables for each page
        table = pagina.extract_tables()
        tables.append(table)

print(tables)

The table I'm interested in is Schedule details.

I have also tried, without luck, to find a way to distinguish cells by their background colour.

Thanks for your time. Awesome library.

jsvine · 2024-11-24T13:58:39Z

Maintainer

Using the page.to_image().debug_tablefinder() method, you can see this:

The reason is that the non-colored cells aren't explicitly drawn as non-colored cells; they simply lack any graphical element at all.
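For anyone who wants to reproduce that debug view, here is a minimal sketch. The helper name `save_tablefinder_debug`, the output path, and the default resolution are illustrative assumptions; only `to_image()`, `debug_tablefinder()` and `save()` come from pdfplumber.

```python
def save_tablefinder_debug(page, out_path, table_settings=None, resolution=150):
    """Render a page, overlay what the table finder detected, and save it.

    `page` is expected to be a pdfplumber page (or a crop of one).
    The function name and defaults are hypothetical, for illustration only.
    """
    image = page.to_image(resolution=resolution)
    # Draws the detected lines, intersections and cells on top of the page.
    image.debug_tablefinder(table_settings or {})
    image.save(out_path)
    return image


# Possible usage, assuming pdfplumber is installed and the file exists:
# import pdfplumber
# with pdfplumber.open("pdftest.pdf") as pdf:
#     save_tablefinder_debug(pdf.pages[0], "page1_debug.png")
```

Passing the same `table_settings` dict that you plan to give to `extract_tables()` lets you see whether added explicit lines produce the cells you expect.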

To extract this table, I'd suggest identifying all of the x coordinates in the page.rects values that correspond to cells, and then passing them to the "explicit_vertical_lines": [...] table extraction setting. That still leaves you with the issue of the final row, since there is no explicit bottom edge to it; for that, I'd suggest identifying the top of the footer text and passing that as "explicit_horizontal_lines": [...].
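A rough sketch of that suggestion, under some assumptions: the helper names and the 1-point tolerance are made up for illustration, and the caller is expected to have already filtered `page.rects` down to the rectangles that really are table cells and to have found the footer's `top` coordinate separately.

```python
def column_edges_from_rects(rects, tolerance=1.0):
    """Collect the distinct left/right x coordinates of rectangle objects.

    `rects` is a list of dicts with "x0" and "x1" keys, which is the shape
    of the items in pdfplumber's `page.rects`. Coordinates that lie within
    `tolerance` of the previously kept edge are treated as the same edge.
    """
    xs = sorted(x for r in rects for x in (r["x0"], r["x1"]))
    edges = []
    for x in xs:
        if not edges or x - edges[-1] > tolerance:
            edges.append(x)
    return edges


def build_table_settings(cell_rects, footer_top, tolerance=1.0):
    """Build a table-settings dict from cell rectangles and the footer top."""
    return {
        "explicit_vertical_lines": column_edges_from_rects(cell_rects, tolerance),
        # Stands in for the missing bottom edge of the final row.
        "explicit_horizontal_lines": [footer_top],
    }


# Possible usage (how you pick the cell rects and footer is up to you):
# settings = build_table_settings(cell_rects, footer_top)
# tables = page.extract_tables(settings)
```

Deduplicating with a tolerance keeps the list short when neighbouring cells share an edge but report coordinates that differ by a fraction of a point.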

ElSabio97 Nov 25, 2024
Author

Okay, so I have made some attempts and your suggestion works flawlessly. I have one last question. Sometimes the "Schedule Details" table has to share a page with other tables, and when that happens things get a little funny when I extract the table.

Here is my attempt to extract the data. It is not yet fully processed, because the last dataframe to get appended always gets some empty columns in between valid data that I cannot manage to clean out.

import pdfplumber
import pandas as pd
import numpy as np

ruta_pdf = "/content/pdftest.pdf"

tables = []

# PDF coordinates for cropping
top_first_page = 595 - 481
top_rest_pages = 595 - 520
bottom = 595 - 35


# Table extraction
with pdfplumber.open(ruta_pdf) as pdf:
    for i, page in enumerate(pdf.pages):
        # Different top coordinate because the first page has a different header
        if i == 0:
            top = top_first_page
        else:
            top = top_rest_pages

        page = page.crop((0, top, page.width, bottom))
        table = page.extract_tables({
            # X coordinates of the beginning of each cell of the Schedule Details table
            "explicit_vertical_lines": [12, 77, 153, 260, 325, 434, 509, 566, 622, 675],
        })
        tables.append(table)

df = []

# Name of the columns I'm not interested in
words_to_search = ["Report times", "Debrief times", "Block hours", "Duty hours", "Indicators", "Crew"]

for table in tables:
  # Each entry holds the tables of one page; take the first one
  df2 = pd.DataFrame(table[0])

  # Filtering to only get the table Schedule Details
  if "Schedule Details" in df2[1].values:

    # Removing the rest of tables for the last page where "Schedule Details" is present
    # The "Total Hours and Statistics" table always comes after the "Schedule Details" table
    if "Total Hours and" in df2[1].values:     
      TotalHours_index = df2[df2[1] == "Total Hours and"].index[0]
      df2 = df2.iloc[:TotalHours_index]

    # Attempt to clean None values  
    df2 = df2.replace("", np.nan)
    df2 = df2.dropna(axis=1, how='all')
    df2 = df2.dropna(axis=0, how='all')

    # Column dropping
    columns_to_drop = df2.columns[df2.apply(lambda col: col.astype(str).str.contains('|'.join(words_to_search)).any())]  
    df2 = df2.drop(columns=columns_to_drop)

    df.append(df2)

df_combined = pd.concat(df, ignore_index=True)

# Dropping all rows where "Schedule Details" and the column names show up, to clean up the table.
df_combined = df_combined[df_combined[1] != "Schedule Details"]
df_combined = df_combined[df_combined[1] != "Date"]

print(df_combined)

Any suggestions on how to improve this attempt, and/or any ideas on why I can't manage to clean the NaN values out of the last appended dataframe, would be hugely appreciated.

Seriously, awesome library.
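The thread never pins down why the empty columns survive, so the following is only a sketch of one thing worth checking. `df.replace("", np.nan)` matches exact empty strings only, so a cell holding just whitespace is still a value as far as `dropna(how="all")` is concerned. The helper below works on the raw list-of-lists rows before they reach pandas and treats `None`, `""`, and whitespace-only strings alike. The function names are hypothetical.

```python
def is_blank(cell):
    """True for None, empty strings, and whitespace-only strings."""
    return cell is None or str(cell).strip() == ""


def drop_blank_rows_and_columns(rows):
    """Remove rows and columns that are blank everywhere.

    `rows` is a list of lists, the shape returned by extract_table().
    Short rows are padded with None so every row has the same width.
    """
    if not rows:
        return []
    width = max(len(r) for r in rows)
    padded = [list(r) + [None] * (width - len(r)) for r in rows]
    # Keep a column only if at least one row has real content in it.
    keep = [c for c in range(width) if not all(is_blank(r[c]) for r in padded)]
    return [
        [r[c] for c in keep]
        for r in padded
        if not all(is_blank(v) for v in r)
    ]


sample = [
    ["Date", None, "Duty", " "],
    ["1 Nov", "", "FLT", None],
    [None, "", " ", None],
]
print(drop_blank_rows_and_columns(sample))
```

A separate point to keep in mind: `pd.concat` aligns frames on their column labels, so if each page's dataframe has had a different set of integer-labelled columns dropped, the combined frame will show NaN gaps where the labels do not line up.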

cmdlineluser · 2024-11-26T02:21:02Z

I have found using page.find_tables() as an initial step useful, as it gives back Table objects, which give you access to .cells and their bbox information, which can be used for cropping.

If there is no Total Hours table on the page, you could try using the page footer for the "bottom" marker, as has been suggested.

import pandas as pd

page = ...
tables = page.find_tables()

schedule = None
total_hours = None

for table in tables:
    rows = table.extract()
    name = rows[0][0] 
    if name == "Schedule Details":
        schedule = table
    if name == "Total Hours and Statistics":
        total_hours = table
        
if total_hours is None: 
    bottom = page.search(r"Generated on.*Page\s*\d+\s*of\s*\d")[0]["top"]
else:
    bottom = total_hours.bbox[1]
    
bbox = list(schedule.bbox)
bbox[1] = schedule.cells[0][-1] # bottom of "Schedule Details" cell is the "top" of the crop area
bbox[-1] = bottom               # either the "top" of "Total Hours" or the page footer

crop = page.crop(bbox)

# pick a table to use their vertical lines (+ right edge of table)
explicit_vertical_lines = [ cell[0] for cell in crop.find_tables()[-1].cells ] + [ bbox[2] ]

rows = crop.extract_table({"explicit_vertical_lines": explicit_vertical_lines})

df = pd.DataFrame(rows[1:], columns=rows[0])
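The crop-box logic above can also be pulled out into a small pure function, which makes the two "bottom" cases explicit and fails loudly when neither marker is available. This is only a sketch: `schedule_crop_box` and its parameter names are invented here, and all boxes are assumed to be `(x0, top, x1, bottom)` tuples as in pdfplumber.

```python
def schedule_crop_box(schedule_bbox, title_cell, next_table_bbox=None, footer_top=None):
    """Return (x0, top, x1, bottom) covering the body of the schedule table.

    schedule_bbox   -- bbox of the detected "Schedule Details" table
    title_cell      -- bbox of the cell holding the "Schedule Details" title
    next_table_bbox -- bbox of the table that follows, if there is one
    footer_top      -- "top" coordinate of the page footer text, if known
    """
    if next_table_bbox is not None:
        bottom = next_table_bbox[1]  # stop where the next table starts
    elif footer_top is not None:
        bottom = footer_top          # otherwise stop at the footer
    else:
        raise ValueError("need either the next table's bbox or the footer top")

    x0, _, x1, _ = schedule_bbox
    top = title_cell[3]  # bottom of the title cell is the top of the crop
    if bottom <= top:
        raise ValueError("crop box would be empty")
    return (x0, top, x1, bottom)


# Possible usage with the objects from the snippet above:
# bbox = schedule_crop_box(schedule.bbox, schedule.cells[0],
#                          total_hours.bbox if total_hours else None,
#                          footer_top)
# crop = page.crop(bbox)
```

Raising early is a design choice for this sketch; it avoids a less obvious error later if the schedule table or the footer text is not found on a given page.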

ElSabio97 Nov 30, 2024
Author

This works beautifully and I could implement it right away. Thanks for your help. Hugely appreciated.
