When a table is missing its top horizontal line, both page.extract_tables() and page.find_tables() omit the first row and return table data only from the second row. #1243

RickVincent · 2024-12-30T07:45:46Z

Describe the bug

When a table (on page 33 of the attached PDF file) is missing its top horizontal line, both page.extract_tables() and page.find_tables() omit the first row and return the table with data starting from the second row.

Have you tried repairing the PDF?

Please try running your code with pdfplumber.open(..., repair=True) before submitting a bug report.
Yes. Running it with or without repair=True gives the same data. The log is below.
Page 33 Table is [['万华化学\n（宁波）\n热电有限\n公司', '二氧化硫', '连续', '4', '主厂房北侧', '15.35mg/m3', '火电厂大气污染物排放\n标准GB13223—2011', '113.58', '420', '无'], [None, '氮氧化物', None, None, None, '30.48mg/m3', None, '221.667', '600', None], [None, '颗粒物', None, None, None, '1.77mg/m3', None, '10.035', '60', None], ['万华化学\n（福建）\n有限公司', '氮氧化物', '连续', '7', '各生产装置区', '79.167mg/m3', '石油化学工业污染物排\n放标准GB31571-2015', '0.3166', '0.88', '无'], [None, 'COD', '间歇', None, None, '20.08mg/l', None, '47.1904', '106.2', None], [None, '氨氮', None, None, None, '0.475mg/l', None, '1.0624', '14.16', None]]

Code to reproduce the problem

Paste it here, or attach a Python file.

import pdfplumber

REPORT_FILE = r'D:\Downloads\万华化学\万华化学.2017.年报.2018-03-13.pdf'

# Ghostscript executable used when repair=True
GS_PATH = r'D:\Downloads\Python\PDF\Ghostscript.10.04.0.x64\bin\gswin64.exe'

with pdfplumber.open(REPORT_FILE, gs_path=GS_PATH, repair=True) as pdf:
    for page_num, page in enumerate(pdf.pages):
        # Only inspect pages 32-35 (zero-based indices 31-34)
        if page_num < 31 or page_num > 34:
            continue

        tables = page.extract_tables()

        for table in tables:
            print(f'Page {page_num + 1} Table is {table}')

PDF file

Please attach any PDFs necessary to reproduce the problem.
万华化学.2017.年报.2018-03-13.pdf

If you need to redact text in a sensitive PDF, you can run it through JoshData/pdf-redactor.

Expected behavior

What did you expect the result should have been?
I expected page.extract_tables() and/or page.find_tables() to return the table data including the first row. Failing that, is there any way to identify this kind of malformed PDF table? I need to merge tables that are split across pages, and a table missing its top horizontal line cannot be detected as a split table and merged.
Several tables in different PDF files have the same issue.
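One possible way to flag such tables, offered only as a rough sketch: compare the top of each detected table's bounding box with the vertical edges that run through it. If a vertical edge starts noticeably above the table's top, the top border is probably missing. The helper name `has_open_top` and the `tolerance` default are hypothetical, not part of pdfplumber; the function only assumes edge dicts with `x0`, `top` and `bottom` keys, as found in `page.edges`, and a bounding box in `(x0, top, x1, bottom)` order, as in `table.bbox`.

```python
def has_open_top(table_bbox, vertical_edges, tolerance=2.0):
    """Heuristic: True if a vertical edge inside the table's horizontal span
    starts above the table's detected top (suggesting a missing top border).

    table_bbox     -- (x0, top, x1, bottom), e.g. table.bbox
    vertical_edges -- dicts with 'x0', 'top', 'bottom' keys
    tolerance      -- slack in points (an assumed default, tune per document)
    """
    x0, top, x1, _bottom = table_bbox
    for edge in vertical_edges:
        inside_x = x0 - tolerance <= edge["x0"] <= x1 + tolerance
        reaches_table = edge["bottom"] > top      # edge extends down into the table
        starts_above = edge["top"] < top - tolerance
        if inside_x and reaches_table and starts_above:
            return True
    return False


# Possible usage with pdfplumber (not executed here):
#   verticals = [e for e in page.edges if e["orientation"] == "v"]
#   for table in page.find_tables():
#       if has_open_top(table.bbox, verticals):
#           ...  # treat as a table whose first row was dropped
```

This is a heuristic only. Pages where unrelated vertical rules sit above a table would produce false positives.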

Actual behavior

What actually happened, instead?
Both page.extract_tables() and page.find_tables() return the table without its first row; the data starts from the second row (see the log above).

Screenshots

If applicable, add screenshots to help explain your problem.

Environment

pdfplumber version: [0.11.4]
Python version: [3.13.1]
OS: [Windows11 24H2 x64]

Additional context

Add any other context/notes about the problem here.
pdfplumber is really a great library and works well for my project. I just need to figure out how to deal with some special cases (possibly malformed PDFs), but I don't want to hardcode handling for such files on a case-by-case basis. Please advise! Any generic solutions are welcome. Thanks in advance.

jsvine · 2025-01-02T02:43:06Z

Maintainer

A common strategy for dealing with that is to pass those positions explicitly using the "explicit_horizontal_lines": [...] table setting, and deriving those positions from the other graphical elements on the page, such as the vertical lines that do appear.

E.g., a similar approach to here: #1223 (comment)
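A minimal sketch of that idea, under some assumptions: the table's vertical borders show up in `page.edges` with `orientation == "v"`, and the top of the tallest-reaching vertical edge marks where the missing top border should be. The helper names and the `min_height` default are hypothetical. The sketch also assumes one table per page; with several tables, the edges would need to be grouped per table first.

```python
def top_of_vertical_edges(edges, min_height=5.0):
    """Return the smallest 'top' among vertical edges at least min_height tall,
    or None if there are none. Short vertical edges (for example the sides of
    thin horizontal rects) are skipped."""
    tops = [
        e["top"]
        for e in edges
        if e.get("orientation") == "v" and e["bottom"] - e["top"] >= min_height
    ]
    return min(tops) if tops else None


def extract_tables_with_top_line(page):
    """Extract tables, adding an explicit horizontal line at the top of the
    vertical edges. `page` is expected to be a pdfplumber page."""
    top = top_of_vertical_edges(page.edges)
    settings = {"explicit_horizontal_lines": [top]} if top is not None else {}
    return page.extract_tables(settings)
```

Calling `extract_tables_with_top_line(pdf.pages[32])` on the affected page would be one way to try it. `im.debug_tablefinder(settings)` on a page image helps check where the added line lands.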

RickVincent · 2025-01-03T08:46:53Z

Author

@jsvine Thanks for the prompt feedback and support! I tried table_settings with explicit_horizontal_lines and other settings, as in the code below, but had no luck. This is the key issue blocking my progress. Could you advise on more specific table settings?

def single_table_splitted(obj) -> bool:
    # Keep every non-rect object; keep a rect only if its height or
    # width is under 5 (a thin, line-like rectangle).
    if obj["object_type"] == "rect":
        return obj["height"] < 5 or obj["width"] < 5
    return True

# pdf is an already-open pdfplumber PDF
page = pdf.pages[41]

filtered_page = page.filter(single_table_splitted)
table_settings = {
    'vertical_strategy': 'lines',
    #'horizontal_strategy': 'text',
    #'explicit_vertical_lines': [max(filtered_page.chars, key = lambda char: char['x1'])['x1'] + 3]
    # Smallest 'bottom' value among all characters on the filtered page
    'explicit_horizontal_lines': [min(filtered_page.chars, key = lambda char: char['bottom'])['bottom']]
}
tables = filtered_page.extract_tables(table_settings)

RickVincent · 2025-01-04T15:05:44Z

Author

Just a quick update. After applying the table_settings below (adding the missing top border at the smallest top coordinate of the characters in the table's first text line), it seems to work!

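def for_table_whhx():
    # is_thin_rectangle is not shown in this post; presumably it is the
    # thin-rect filter from the previous comment under a new name.
    with pdfplumber.open(REPORT_MISSED_TOP_BORDER_WHHX) as pdf:
        page = pdf.pages[16]
        filtered_page = page.filter(is_thin_rectangle)

        # Index 1 is the second text line on the page
        first_text_line_obj = filtered_page.extract_text_lines()[1]

        im = filtered_page.to_image()

        # Use the smallest character 'top' on that line as the top border
        table_settings = {
            'explicit_horizontal_lines': [min(first_text_line_obj['chars'], key = lambda char: char['top'])['top']]
        }

        # Draw the detected table structure for visual inspection
        im.debug_tablefinder(table_settings)
        im.show()
        im.save('WHHX.2020.Page.16.png')

        tables = page.extract_tables(table_settings)
        print(tables[0])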

pdfplumber is really amazing, and so is this forum for discussion! I learned a lot. Thanks all, and @jsvine!
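For anyone wanting to avoid the hardcoded text-line index, here is one possible generalization, as a sketch only. It picks the first text line whose characters all start at or below a given y-position (for example the top of the table's vertical lines) and returns the smallest character top on that line. The name `first_row_top` and the `padding` default are hypothetical; the function only assumes text-line dicts with a `chars` list whose items have a `top` key, as in the code above.

```python
def first_row_top(text_lines, region_top, padding=1.0):
    """Return a y-position just above the first text line lying at or below
    region_top, or None if no such line exists.

    text_lines -- dicts with a 'chars' list, e.g. page.extract_text_lines()
    region_top -- y-position where the table region starts (assumed known)
    padding    -- points subtracted so the line sits above the glyphs
    """
    candidates = []
    for line in text_lines:
        tops = [ch["top"] for ch in line.get("chars", [])]
        # Skip lines (such as page headers) that start above the region
        if tops and min(tops) >= region_top:
            candidates.append(min(tops))
    return min(candidates) - padding if candidates else None


# Possible usage with pdfplumber (not executed here):
#   y = first_row_top(page.extract_text_lines(), region_top)
#   if y is not None:
#       tables = page.extract_tables({"explicit_horizontal_lines": [y]})
```

How `region_top` is obtained is left open here; the top of the vertical lines on the page is one candidate.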
