PDF header and paragraph detection #868

danltw · 2023-04-20T06:44:15Z


I am currently working on a project that takes PDF files as input. One of the use cases requires the extracted text to be segmented into headers and their corresponding paragraphs. I am wondering if anybody has done something similar using either pdfplumber or pdfminer.six (I am more or less limited to these two due to licensing) and would be able to share some code to get me started.

My current code uses the font size and the font itself to detect headers, but the precision and recall aren't great. I am open to other solutions as well.
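For readers who want a starting point, here is a minimal sketch of the font-based heuristic described above. It is not the code used in this project. It assumes lines have already been grouped into dicts with `text`, `size` and `fontname` keys (the same key names pdfplumber uses on the char dicts in `page.chars`); the sample lines, the `ratio` threshold and the 80-character cutoff are hypothetical.

```python
from collections import Counter


def body_font_size(lines):
    """Return the font size that covers the most characters (assumed body text)."""
    weights = Counter()
    for line in lines:
        weights[round(line["size"], 1)] += len(line["text"])
    return weights.most_common(1)[0][0]


def is_header(line, body_size, ratio=1.15):
    """Heuristic: clearly larger than body text, or bold and short."""
    larger = line["size"] >= body_size * ratio
    bold = "bold" in line["fontname"].lower()
    short = len(line["text"]) < 80
    return larger or (bold and short)


# Hypothetical lines; real ones would be built from extracted characters.
lines = [
    {"text": "1 Introduction", "size": 14.0, "fontname": "ABCDEF+Arial-BoldMT"},
    {"text": "Climate change affects every region of the", "size": 10.0, "fontname": "ABCDEF+ArialMT"},
    {"text": "world in different ways.", "size": 10.0, "fontname": "ABCDEF+ArialMT"},
    {"text": "Key findings", "size": 10.0, "fontname": "ABCDEF+Arial-BoldMT"},
    {"text": "Warming is unequivocal.", "size": 10.0, "fontname": "ABCDEF+ArialMT"},
]

body = body_font_size(lines)
headers = [line["text"] for line in lines if is_header(line, body)]
```

Weighting sizes by character count keeps a page with many short headers from being mistaken for body text, but the thresholds will probably need tuning per document family.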

Thanks in advance :)

Answered by petermr


petermr · 2023-04-21T07:59:58Z


I am actively working on this topic and happy to share experiences/code.

It's generally not straightforward and depends on the style of the author and their tools. I am aiming at a result that (in HTML) looks something like:

<div>
  <h3>header</h3>
  <p>para 1...</p>
  <p>para 2...</p>
</div>
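As an illustration of that target shape, the following sketch renders one detected header and its paragraphs into the `div`/`h3`/`p` structure shown above. The function name `section_to_html` and the `level` parameter are hypothetical, not part of any library mentioned in this thread.

```python
from html import escape


def section_to_html(header, paragraphs, level=3):
    """Render one header plus its paragraphs as a <div> block."""
    parts = ["<div>", f"  <h{level}>{escape(header)}</h{level}>"]
    # Escape text so characters such as '&' or '<' from the PDF stay valid HTML.
    parts += [f"  <p>{escape(p)}</p>" for p in paragraphs]
    parts.append("</div>")
    return "\n".join(parts)


html_block = section_to_html("header", ["para 1...", "para 2..."])
```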

The biggest problem is that headers and paragraphs are not well defined and often depend on context/content. Here are two examples from our work on parsing the UN IPCC reports on climate change:

Here there are several levels of headers. On the first page the headers are indicated by bold and a terminating colon. (Note there is no whitespace after the header.)
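A rough sketch of the "bold run ending in a colon" rule, assuming each line has already been split into `(text, fontname)` spans. That span representation and the function name are hypothetical, and bold is guessed from the font name only.

```python
def split_run_in_header(spans):
    """Return (header, rest) if the line opens with a bold run ending in ':'.

    `spans` is a list of (text, fontname) tuples in reading order.
    Returns (None, full_text) when no run-in header is found.
    """
    header_parts = []
    for i, (text, fontname) in enumerate(spans):
        if "bold" not in fontname.lower():
            break  # the bold run ended without a colon
        header_parts.append(text)
        if text.rstrip().endswith(":"):
            header = "".join(header_parts).strip()
            rest = "".join(t for t, _ in spans[i + 1:]).strip()
            return header, rest
    return None, "".join(t for t, _ in spans).strip()


# Hypothetical spans for one line of text.
spans = [
    ("Observed warming:", "XYZABC+Times-Bold"),
    ("Temperatures have risen.", "XYZABC+Times-Roman"),
]
header, rest = split_run_in_header(spans)
```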

On the second page there is a running title (not a header) and then a large alpha-numbered header.

In the next example we see a large header followed by decimal sections with no explicit header, though clearly they are separate. (I turn the number into a header and also use it as an id.)
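One way to turn a leading section number into both a header and an id, sketched under the assumption that numbers look like `1.2`, `3.1.4` or `A.1.2` and are followed by whitespace. The pattern, the `sec-` prefix and the sample text are illustrative choices, not the scheme used in the original work.

```python
import re

# One or two leading characters, then at least one ".digits" component.
SECTION_NUMBER = re.compile(r"^([A-Z0-9]{1,2}(?:\.\d{1,2})+)\s+(\S.*)$")


def number_to_header(paragraph):
    """Split 'A.1.2 Some text' into an id, a header and the remaining text."""
    match = SECTION_NUMBER.match(paragraph.strip())
    if match is None:
        return None
    number, body = match.groups()
    return {
        "id": "sec-" + number.replace(".", "-"),
        "header": number,
        "text": body,
    }


section = number_to_header("A.1.2 Surface temperature was higher.")
```

Requiring at least one dot avoids treating sentences that begin with a year or a plain count as sections, at the cost of missing single-number sections.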

In some cases the first sentence of the following paragraph is bold, and this could be used as a header:

I might keep the para intact and duplicate the first sentence (perhaps truncated) as a header. Note that here it's a figure caption with a regular structure.
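The "duplicate the first sentence as a header" idea could look roughly like this. The sentence splitter is deliberately naive, and `max_len` is a hypothetical parameter.

```python
import re


def header_from_first_sentence(paragraph, max_len=60):
    """Copy the first sentence (truncated at a word boundary) for use as a header."""
    first = re.split(r"(?<=[.!?])\s+", paragraph.strip(), maxsplit=1)[0]
    if len(first) > max_len:
        first = first[:max_len].rsplit(" ", 1)[0] + "..."
    return first


# Hypothetical caption text; the paragraph itself is left untouched.
caption = "Figure 2: Observed changes in temperature. Panel (a) shows annual means."
caption_header = header_from_first_sentence(caption)
```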

But is this sentence a header?

And is this a paragraph?

I think it would be possible to come up with a set of templates which are fairly general and might give medium recall/precision on a range of document types. But it will never be 100%. For large corpora created with the same tools it's probably worth customising templates. For random small ones it may be that LLMs give useful results. Or they may garble it.

BTW are you (or anyone) interested in extracting the paragraphs into flowable text (i.e. without hard line breaks)? Because I'm also working on that and made good progress a year back. If no one else is, I'll re-do it over the next day or two.
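To make "hard line breaks" concrete: text extracted from a PDF usually keeps the line endings of the printed page, so a paragraph arrives as several short lines. A minimal sketch of re-flowing such lines, assuming they all belong to one paragraph (the function name is hypothetical):

```python
def reflow(lines):
    """Join the hard-wrapped lines of one paragraph into a single string."""
    out = ""
    for line in (raw.strip() for raw in lines):
        if not line:
            continue
        if out.endswith("-") and line[0].islower():
            # Re-join a word that was hyphenated at the end of a line.
            out = out[:-1] + line
        elif out:
            out += " " + line
        else:
            out = line
    return out


flowed = reflow([
    "Extracted text often keeps the line",
    "breaks of the printed page, includ-",
    "ing hyphenated words.",
])
```

This simple rule will also join genuine hyphenated compounds that happen to break at a line end (for example "well-" followed by "known"), so a dictionary check may be worth adding.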

1 reply

danltw Apr 23, 2023
Author

@petermr thanks for the reply. Yes, I’m actually doing something that extracts text for LLMs. Sorry, I didn’t quite get what you meant by “hard line breaks”, but I would love to see what you have done.

petermr · 2023-04-23T11:47:08Z


The tests are in TestPDFPumberTest in test/test_pdf.py in github.com/petermr/py4ami, branch pmr15. But I would wait for a day and it should be clearer. I'll also try to create a discussion on the site.


-- Peter Murray-Rust Founder ContentMine.org and Reader Emeritus in Molecular Informatics Dept. Of Chemistry, University of Cambridge, CB2 1EW, UK

2 replies

danltw May 3, 2023
Author

Thanks @petermr, I've managed to get what I want without reference to your code. However, I would like to say that your work is very interesting. Cheers!

huzaifa-softoo Oct 29, 2024

Hi @danltw, I am facing a similar problem: splitting a PDF into paragraphs. Could you give some guidance or share a code snippet showing how you solved it?

Thanks in advance :-)

petermr · 2024-10-29T12:54:49Z


I have done quite a lot of this. It's messy but semi-automatable. The general scheme is something like:

* extract and normalize all styles. By default a paragraph consists of a single style.
* identify variations of these styles, e.g.
  - italic and/or bold. This is non-trivial as it depends on font names
  - find any style changes which identify the start or end of paras
* identify lists within paras (may require finding the bullet symbol), also tables
* find any other hint for start/end of para, e.g. inter-para whitespace
* find any (recursive) numbering, e.g. 1.2.a (may involve upper/lower variation, roman, letters, etc.)
* ...etc.

Then join adjacent lines within a para, skipping embedded lists and tables, relying on interline whitespace. This works fairly well for text-heavy documents.

Very happy to collaborate on this if you have a clear project with many similar documents (one-offs are not easily manageable).

Peter MR

code is in https://github.com/petermr/amilib and you will find useful PDF tests under /.test.
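Two of the steps in the scheme above (guessing bold/italic from font names, and using style changes plus inter-paragraph whitespace to find paragraph boundaries) can be sketched as follows. This is not the amilib implementation. It assumes lines are dicts with `text`, `top` and `fontname` keys (key names borrowed from pdfplumber's char dicts); the keyword lists and `gap_factor` are guesses that would need tuning.

```python
from statistics import median


def style_flags(fontname):
    """Guess bold/italic from a font name such as 'ABCDEF+Times-BoldItalic'."""
    name = fontname.split("+")[-1].lower()  # drop the font subset prefix
    return {
        "bold": any(key in name for key in ("bold", "black", "heavy")),
        "italic": any(key in name for key in ("italic", "oblique")),
    }


def split_paragraphs(lines, gap_factor=1.5):
    """Start a new paragraph on a large vertical gap or a style change."""
    if not lines:
        return []
    gaps = [b["top"] - a["top"] for a, b in zip(lines, lines[1:])]
    typical = median(gaps) if gaps else 0
    paragraphs = [[lines[0]["text"]]]
    for prev, cur, gap in zip(lines, lines[1:], gaps):
        style_changed = style_flags(prev["fontname"]) != style_flags(cur["fontname"])
        if gap > typical * gap_factor or style_changed:
            paragraphs.append([])
        paragraphs[-1].append(cur["text"])
    # Join the lines of each paragraph into flowing text.
    return [" ".join(p) for p in paragraphs]


# Hypothetical lines from one page, top to bottom.
page_lines = [
    {"text": "Summary", "top": 100, "fontname": "AAAAAA+Arial-BoldMT"},
    {"text": "First line of the", "top": 112, "fontname": "AAAAAA+ArialMT"},
    {"text": "first paragraph.", "top": 124, "fontname": "AAAAAA+ArialMT"},
    {"text": "Second paragraph.", "top": 148, "fontname": "AAAAAA+ArialMT"},
]
paras = split_paragraphs(page_lines)
```

Lists, tables and numbering are left out here; they would need their own detectors before the lines are joined.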


1 reply

huzaifa-softoo Oct 29, 2024

Thank you for your prompt response. I will review the information thoroughly and let you know.

petermr · 2024-10-29T14:55:35Z


Here's a complete analysis of the previous COP documents. They are fairly self-consistent. Example for COP27:

Input: https://github.com/petermr/amilib/blob/main/test/resources/unfccc/unfcccdocuments1/CP_27/13_23_CP_27.pdf

Output (HTML): https://github.com/petermr/amilib/blob/main/test/resources/unfccc/unfcccdocuments1/CP_27/html/13_23_CP_27/total_pages.html (note that the HTML does not display on GitHub; you have to download it and display it locally.)

The output synthesises paragraphs automatically. In production these will all be given unique IDs. (Note the boxes outline the paragraphs, and these will flow, unlike the PDF.)

Will really value your comments: what you find useful and what extra you would like to see. Note that HTML can be searched and manipulated with XPath and regex.

Peter
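As a small illustration of searching such output with XPath and regex, here is a sketch using only the standard library. The markup below is a made-up, well-formed fragment in the `div`/`h3`/`p` shape discussed earlier in this thread, not the actual amilib output, and the ids are hypothetical. `xml.etree.ElementTree` supports only a subset of XPath and needs well-formed input, so real-world HTML may call for a dedicated HTML parser.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical fragment; real output would be read from the downloaded file.
doc = ET.fromstring("""
<body>
  <div id="p1"><h3>First section</h3><p>Some introductory text.</p></div>
  <div id="p2"><h3>Second section</h3><p>See paragraph 14 for details.</p></div>
</body>
""")

# XPath subset: every header inside a div, and one paragraph selected by id.
section_headers = [h.text for h in doc.findall(".//div/h3")]
second_text = doc.find(".//div[@id='p2']/p").text

# Regex over the text content of each div to find cross-references.
hits = [
    div.get("id")
    for div in doc.iter("div")
    if re.search(r"paragraph \d+", "".join(div.itertext()))
]
```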


0 replies
