One single table extracted to be 7 separated tables #1242
Replies: 3 comments
Thank you for the details and for providing the PDF. Using the visual debugging tools:

import pdfplumber

pdf = pdfplumber.open("2021.2022-03-15.pdf")
page = pdf.pages[40]  # zero-based index, so this is page 41
im = page.to_image()
im

... produces this: ...

... which shows that there are invisible rectangles inside the headers that are creating problems. Using

def test(obj):
    # Keep a rect only if it is thin in at least one dimension
    if obj["object_type"] == "rect":
        return obj["height"] < 5 or obj["width"] < 5
    # Keep every non-rect object (chars, lines, etc.)
    return True

filtered = page.filter(test)
filtered.to_image().debug_tablefinder()

... and:

filtered.extract_tables()
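To see what the `test` predicate keeps and drops without opening the PDF, here is a minimal sketch that applies the same rule to a few hand-made object dictionaries. The sample objects and their sizes are invented for illustration and are not taken from the PDF in this thread; `keep_object` and `max_thickness` are hypothetical names, with the threshold of 5 copied from the reply above.

```python
def keep_object(obj, max_thickness=5):
    """Same rule as `test` above: drop rects that are large in both dimensions."""
    if obj["object_type"] == "rect":
        # A rect survives only if it is thin enough to behave like a ruling line.
        return obj["height"] < max_thickness or obj["width"] < max_thickness
    # Every non-rect object (chars, lines, ...) is kept unchanged.
    return True


# Invented sample objects, shaped like the dicts pdfplumber passes to a filter.
sample_objects = [
    {"object_type": "char", "height": 9.0, "width": 9.0},
    {"object_type": "rect", "height": 0.5, "width": 480.0},  # thin: acts as a line
    {"object_type": "rect", "height": 24.0, "width": 60.0},  # box-like: dropped
    {"object_type": "line", "height": 0.0, "width": 480.0},
]

kept = [obj for obj in sample_objects if keep_object(obj)]
dropped = [obj for obj in sample_objects if not keep_object(obj)]

print(len(kept), "kept;", len(dropped), "dropped")
```

Only the box-like rectangle is removed. Thin rectangles stay because many PDFs draw table borders as very narrow rects, and the table finder relies on them as edges.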
@jsvine Thanks so much for the prompt response and feedback! The approach works fine on the file. Much appreciated! Really amazing...
Describe the bug
A clear and concise description of what the bug is.
A single table is extracted as 7 separate tables.
Have you tried repairing the PDF?
Yes
Please try running your code with
pdfplumber.open(..., repair=True)
before submitting a bug report.
Below is the log after running with repair=True:
Page 41 Table is [['公司或子\n公司名称', '主要污染物及特\n征污染物的名称', '排放\n方式', '排放口数量', '排放口分布情况', '排放浓度', '执行的污染物排放标准', '排放总\n量/t', '核定的排\n放总量t/a', '超 标排\n放情况'], ['万华化学\n集团股份\n有限公司', '二氧化硫', '连续', '130', '各生产装置区', '0.015mg/m3', '区域性大气污染物综合\n排放标准DB37/2376-\n2019,挥发性有机物排放\n标准第6部分:有机化工\n行业DB37/2801.6-2018,\n危险废物焚烧污染控制\n标准GB 18484-2020', '0.986', '361.732', '无'], [None, '氮氧化物', None, None, None, '5.88mg/m3', None, '53.47', '1431.337', None], [None, '颗粒物', None, None, None, '1.07mg/m3', None, '1.992', '208.729', None], [None, 'VOCs', None, None, None, '3.313mg/m3', None, '14.20', '1650.9864', None], ['万华化学\n集团环保\n科技有限\n公司', '氨氮', '连续', '2', '污水处理区域', '6.81mg/l', '污水排入城镇下水道水\n质标准GB/T 31962-\n2015,流域水污染物综合\n排放标准第5部分:半岛\n流域DB37/3416.5-2018', '12.3', '419.33', '无'], [None, 'COD', None, None, None, '61.7mg/l', None, '112.3', '4050.63', None], ['万华化学\n(宁波)有\n限公司', '二氧化硫', '连续', '31', '各生产装置区', '3.49mg/m3', '危险废物焚烧污染控制\n标准GB18484-2001,石\n油化学工业污染物排放\n标准GB31571-2015', '16.10', '43.28', '无'], [None, '氮氧化物', None, None, None, '40.75mg/m3', None, '81.44', '208.18', None], [None, '氨氮', None, None, None, '1.99mg/l', None, '3.78', '27.3', None], [None, 'COD', None, None, None, '81.06mg/l', None, '149.98', '165.4', None], ['万华化学\n(烟台)\n氯碱热电\n有限公司', '二氧化硫', '连续', '19', '界区东北', '16.6mg/m3', '《火电厂大气污染物排\n放标准(DB37/664—\n2019》、《大气污染物\n综合排放标准GB16297-\n1996》、《挥发性有机\n物排放标准 第7部分', '148.6', '759.89', '无'], [None, '氮氧化物', None, None, None, '40.1mg/m3', None, '360', '1183.2', None]]
Page 41 Table is [['公司或子'], ['公司名称']]
Page 41 Table is [['主要污染物及特'], ['征污染物的名称']]
Page 41 Table is [['排放'], ['方式']]
Page 41 Table is [['排放总'], ['量/t']]
Page 41 Table is [['核定的排'], ['放总量t/a']]
Page 41 Table is [['超标排'], ['放情况']]
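A log like the one above can be easier to scan as table shapes than as raw nested lists. The following is a rough sketch, assuming the input has the same list-of-tables, list-of-rows layout that `extract_tables()` returns; `table_shapes` is a hypothetical helper, and the sample data is a shortened stand-in rather than the real output.

```python
def table_shapes(tables):
    """Return (row_count, widest_row_length) for each table in the list."""
    return [
        (len(table), max((len(row) for row in table), default=0))
        for table in tables
    ]


# Shortened stand-in for the log: one real table plus header fragments.
tables = [
    [
        ["company", "pollutant", "mode"],
        ["A", "SO2", "continuous"],
        [None, "NOx", None],
    ],
    [["header part 1"], ["header part 2"]],
    [["header part 3"], ["header part 4"]],
]

shapes = table_shapes(tables)

# Single-column results are a hint that a header cell was split off as its own table.
fragments = [index for index, (_, cols) in enumerate(shapes) if cols == 1]

for index, (rows, cols) in enumerate(shapes):
    print(f"table {index}: {rows} rows x {cols} cols")
```

A single-column shape is only a heuristic signal; a page can legitimately contain a one-column table, so the flagged indices are candidates to inspect, not tables to discard automatically.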
Code to reproduce the problem
Paste it here, or attach a Python file.
import pdfplumber

REPORT_FILE = r'D:\Downloads\万华化学\万华化学.2021.年报.2022-03-15.pdf'
GS_PATH = r'D:\Downloads\Python\PDF\Ghostscript.10.04.0.x64\bin\gswin64.exe'

# Open with repair enabled, using the local Ghostscript binary
with pdfplumber.open(REPORT_FILE, gs_path=GS_PATH, repair=True) as pdf:
    for page_num, page in enumerate(pdf.pages):
        tables = page.extract_tables()
PDF file
Please attach any PDFs necessary to reproduce the problem.
万华化学.2021.年报.2022-03-15.pdf
If you need to redact text in a sensitive PDF, you can run it through JoshData/pdf-redactor.
Expected behavior
What did you expect the result should have been?
Actual behavior
What actually happened, instead?
But the table on page 41 was recognized as 7 separate tables.
Screenshots
If applicable, add screenshots to help explain your problem.
Environment
Additional context
Add any other context/notes about the problem here.
N/A. Thanks in advance!