When a table is missing its top horizontal line, both page.extract_tables() and page.find_tables() omit the first row and return table data only from the second row onward. #1243
Replies: 3 comments
-
A common strategy for dealing with that is to pass those positions explicitly using the "explicit_horizontal_lines": [...] table setting, deriving the positions from other graphical elements on the page, such as the vertical lines that do appear. E.g., a similar approach to the one here: #1223 (comment)
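A minimal sketch of that strategy, assuming the table's vertical borders are drawn as lines or rectangles (so they show up in `page.edges`) and that the topmost vertical edge on the page belongs to the table in question. The helper names are hypothetical; `explicit_horizontal_lines` and `page.extract_tables(table_settings=...)` are pdfplumber's own.

```python
def top_of_vertical_edges(edges):
    """Return the smallest `top` among vertical edges, or None if there are none."""
    tops = [e["top"] for e in edges if e.get("orientation") == "v"]
    return min(tops) if tops else None


def extract_tables_with_top_border(page):
    """Hypothetical helper: supply the missing top border from the vertical edges.

    Assumption: the vertical edges start exactly where the missing
    top border should be.
    """
    y = top_of_vertical_edges(page.edges)
    settings = {}
    if y is not None:
        # A bare number means a horizontal line spanning the full page width.
        settings["explicit_horizontal_lines"] = [y]
    return page.extract_tables(table_settings=settings)
```

With a real file, `page` would come from something like `pdfplumber.open(path).pages[32]` for page 33. On pages with several tables, the edges would first need to be filtered to the region of the table being repaired.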
-
@jsvine Thanks for the prompt feedback and support! I tried table_settings with explicit_horizontal_lines, as well as other settings, as in the code below, but had no luck. This is the key issue blocking my progress. Could you advise on more specific table settings?
-
Just a quick update. After applying the table_settings below (which restore the top border using the minimum top distance of the first text line on the page), it seems to work!
pdfplumber is really amazing, and so is this forum for discussion! I learned a lot. Thanks all, and @jsvine!
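The idea described in that update, placing a top border at the top of the first text line, might look roughly like the sketch below. This is not the poster's actual settings. It assumes the table starts at the top of the page, so that the topmost word belongs to its first row. The helper names and the `margin` parameter are hypothetical; `page.extract_words()` and `page.find_tables(table_settings=...)` are pdfplumber's own.

```python
def first_text_top(words, margin=1.0):
    """Return a y position just above the topmost word, or None if there are no words."""
    if not words:
        return None
    return min(w["top"] for w in words) - margin


def find_tables_with_text_top(page, margin=1.0):
    """Hypothetical helper: use the top of the first text line as the top border.

    Assumption: no header or other text sits above the table on this page.
    """
    y = first_text_top(page.extract_words(), margin)
    settings = {"explicit_horizontal_lines": [y]} if y is not None else {}
    return page.find_tables(table_settings=settings)
```

If a page header sits above the table, the words would need to be restricted to the table's region first, for example by cropping the page before calling `extract_words()`.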
-
Describe the bug
When a table (on page 33 of the attached PDF file) is missing its top horizontal line, both page.extract_tables() and page.find_tables() omit the first row and return the table with data only from the second row onward.
Have you tried repairing the PDF?
Please try running your code with
pdfplumber.open(..., repair=True)
before submitting a bug report.
Yes. I ran it both with and without repair=True and got the same data. The log is below.
Page 33 Table is [['万华化学\n(宁波)\n热电有限\n公司', '二氧化硫', '连续', '4', '主厂房北侧', '15.35mg/m3', '火电厂大气污染物排放\n标准GB13223—2011', '113.58', '420', '无'], [None, '氮氧化物', None, None, None, '30.48mg/m3', None, '221.667', '600', None], [None, '颗粒物', None, None, None, '1.77mg/m3', None, '10.035', '60', None], ['万华化学\n(福建)\n有限公司', '氮氧化物', '连续', '7', '各生产装置区', '79.167mg/m3', '石油化学工业污染物排\n放标准GB31571-2015', '0.3166', '0.88', '无'], [None, 'COD', '间歇', None, None, '20.08mg/l', None, '47.1904', '106.2', None], [None, '氨氮', None, None, None, '0.475mg/l', None, '1.0624', '14.16', None]]
Code to reproduce the problem
Paste it here, or attach a Python file.
PDF file
Please attach any PDFs necessary to reproduce the problem.
万华化学.2017.年报.2018-03-13.pdf
If you need to redact text in a sensitive PDF, you can run it through JoshData/pdf-redactor.
Expected behavior
What did you expect the result should have been?
I expect page.extract_tables() and/or page.find_tables() to return the table data including the first row. Failing that, is there any way to identify this kind of malformed PDF table? I need to merge tables that are split across pages, and a table missing its top horizontal line cannot be detected as a split table, so it never gets merged.
A couple of tables in different PDF files have the same issue.
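For context, one possible way to merge tables split across pages is sketched below. The criterion is an assumption for illustration, not the poster's actual logic: the first table on a page is treated as a continuation of the last table on the previous page when both have the same number of columns. The function name is hypothetical, and the input is the per-page output of `page.extract_tables()`.

```python
def merge_split_tables(pages_tables):
    """Merge tables that continue across page breaks.

    pages_tables: one entry per page, each a list of tables,
    each table a list of rows (as returned by page.extract_tables()).
    Assumed heuristic: equal column counts mean a continuation.
    """
    merged = []
    prev_last = None  # index in `merged` of the last table on the previous page
    for tables in pages_tables:
        current_last = None
        for i, table in enumerate(tables):
            continues = (
                i == 0
                and prev_last is not None
                and table
                and merged[prev_last]
                and len(table[0]) == len(merged[prev_last][-1])
            )
            if continues:
                merged[prev_last].extend(table)
                current_last = prev_last
            else:
                merged.append(list(table))
                current_last = len(merged) - 1
        prev_last = current_last
    return merged
```

A heuristic like this only works if the first row of the continuation is actually extracted, which is why the missing top line matters here.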
Actual behavior
What actually happened, instead?
Both page.extract_tables() and page.find_tables() return the table without its first row.
Screenshots
If applicable, add screenshots to help explain your problem.
Environment
Additional context
Add any other context/notes about the problem here.
pdfplumber is really a great lib! It works well for my project. I just need to figure out how to deal with some special cases (possibly malformed PDFs), but I don't want to hardcode handling for such special or malformed files on a case-by-case basis. Please advise! Any generic solutions are welcome. Thanks in advance...