-
I am actively working on this topic and happy to share experiences/code. It's generally not straightforward and depends on the style of the author and their tools. I am aiming at a result that (in HTML) looks something like:
The biggest problem is that headers and paragraphs are not well defined and often depend on context/content. Here are two examples from our work on parsing the UN IPCC reports on climate change:
Here there are several levels of headers. On the first page the headers are indicated by bold and a terminating colon. (Note there is no whitespace after the header.) On the second page there is a running title (not a header) and then a large alpha-numbered header.
In the next example we see a large header followed by decimal sections with no explicit header, though clearly they are separate. (I turn the number into a header and also use it as an id.)
In some cases the first sentence of the following paragraph is bold, and this could be used as a header: I might keep the para intact and duplicate the first sentence (perhaps truncated) as a header. Note that here it's a figure caption with a regular structure.
But is this sentence a header? And is this a paragraph?
I think it would be possible to come up with a set of templates which are fairly general and might give medium recall/precision on a range of document types. But it will never be 100%. For large corpora created with the same tools it's probably worth customising templates. For random small ones it may be that LLMs give useful results. Or they may garble it.
BTW are you (or anyone) interested in extracting the paragraphs into flowable text (i.e. without hard line breaks)? Because I'm also working on that and made good progress a year back. If no one else is I'll re-do it over the next day or two.
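
Two of the header cues described above (a bold lead-in ending in a colon, and a decimal or alpha-numbered section label that doubles as an id) can be sketched roughly as follows. This is only an illustration: the regular expressions, the `split_header` name, the `lead_is_bold` flag and the `sec-` id prefix are all assumptions, not code from the repository, and the caller is assumed to have already worked out boldness from the font names.

```python
import re

# Section labels such as "1.2", "A.1.3" or "B.2.a" at the start of a block.
# The pattern is an assumption and will need tuning per document family;
# it will also misfire on text like "3.5 degrees of warming".
SECTION_NUMBER = re.compile(
    r"^\s*((?:[A-Z]|\d+)(?:\.(?:\d+|[a-z]))+)\.?\s+(.*)$", re.S
)

# A short lead-in ending in a colon, with or without whitespace after it.
COLON_LEAD = re.compile(r"^([^:]{3,80}):\s*(.*)$", re.S)


def split_header(text, lead_is_bold=False):
    """Return (header, section_id, body) for one block of text.

    header and section_id are None when no cue is found.
    """
    m = SECTION_NUMBER.match(text)
    if m:
        number, body = m.group(1), m.group(2)
        # Turn the number into the header and reuse it as an id.
        return number, "sec-" + number.replace(".", "-"), body
    if lead_is_bold:
        m = COLON_LEAD.match(text)
        if m:
            return m.group(1).strip(), None, m.group(2)
    return None, None, text
```

The colon rule is only applied when the lead-in is bold, since a colon on its own is far too common in running text to be a reliable header signal.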
-
The test/s are in TestPDFPumberTest in test/test_pdf.py in
github.com/petermr/py4ami branch pmr15
But I would wait for a day and it should be clearer. I'll also try to
create a discussion on the site.
On Sun, Apr 23, 2023 at 12:03 PM Daniel Leong wrote:
@petermr <https://github.com/petermr> thanks for the reply. Yes, I’m
actually doing something that extracts texts for LLMs. Sorry, didn’t quite
get what you meant by “hard line breaks”, but I would love to see what you
have done
--
Peter Murray-Rust
Founder ContentMine.org
and
Reader Emeritus in Molecular Informatics
Dept. Of Chemistry, University of Cambridge, CB2 1EW, UK
-
I have done quite a lot of this. It's messy but semi-automatable. The
general scheme is something like:
* extract and normalize all styles. By default a paragraph consists of a
single style.
* identify variations of these styles, e.g.
- italic and/or bold. This is non-trivial as it depends on font names
- find any style changes which identify start or end of paras
* identify lists within paras (may require finding the bullet symbol),
also tables
* find any other hint for start/end of para, e.g. interpara whitespace
* find any (recursive) numbering, e.g. 1.2.a (may involve
upper/lower variation, roman, letters, etc.)
* ...etc.
Then join adjacent lines within para skipping embedded lists and tables,
relying on interline whitespace.
This works fairly well for text-heavy documents.
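
The last step above (splitting on inter-paragraph whitespace, then joining the lines within a paragraph) might be sketched as below. This is a minimal illustration and not the amilib code: it assumes each line is a dict with `text`, `top` and `bottom` keys (the same coordinate names pdfplumber uses, with y increasing down the page), and the function names and the `gap_factor` threshold are hypothetical. Embedded lists and tables are not handled.

```python
from statistics import median


def group_paragraphs(lines, gap_factor=0.5):
    """Group lines into paragraphs using the vertical gap between them.

    A gap larger than gap_factor times the median line height is taken
    to start a new paragraph. The default factor is a guess.
    """
    if not lines:
        return []
    lines = sorted(lines, key=lambda ln: ln["top"])
    height = median(ln["bottom"] - ln["top"] for ln in lines)
    paragraphs = [[lines[0]]]
    for prev, line in zip(lines, lines[1:]):
        gap = line["top"] - prev["bottom"]
        if gap > gap_factor * height:
            paragraphs.append([line])
        else:
            paragraphs[-1].append(line)
    return paragraphs


def flow_text(paragraph):
    """Join one paragraph's lines into a single string with no hard line breaks."""
    out = ""
    for line in paragraph:
        text = line["text"].strip()
        if not out:
            out = text
        elif out.endswith("-"):
            # Assumption: a trailing hyphen is a word split across lines.
            # Genuine hyphens (e.g. "self-" / "consistent") get merged too.
            out = out[:-1] + text
        else:
            out += " " + text
    return out


lines = [
    {"text": "Climate change is a glo-", "top": 100, "bottom": 110},
    {"text": "bal problem that needs", "top": 112, "bottom": 122},
    {"text": "urgent action.", "top": 124, "bottom": 134},
    {"text": "Second paragraph starts", "top": 146, "bottom": 156},
    {"text": "here.", "top": 158, "bottom": 168},
]
flowed = [flow_text(p) for p in group_paragraphs(lines)]
```

Using the median line height as the yardstick keeps the threshold relative to the document's own font size, which matters when the same code is run over documents set in different sizes.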
Very happy to collaborate on this if you have a clear project with many
similar documents (one-offs are not easily manageable).
Peter MR
Code is in https://github.com/petermr/amilib and you will find useful PDF
tests under /test.
On Tue, Oct 29, 2024 at 12:12 PM huzaifa-softoo wrote:
Hi @danltw <https://github.com/danltw> I am facing similar problem to
split a pdf based on paragraphs. Can you guide or share code snippet
related to how you solved the problem?
Thanks in advance :-)
-
Here's a complete analysis of the previous COP documents. They are fairly
self-consistent. Example for COP27:
Input:
https://github.com/petermr/amilib/blob/main/test/resources/unfccc/unfcccdocuments1/CP_27/13_23_CP_27.pdf
Output (HTML):
https://github.com/petermr/amilib/blob/main/test/resources/unfccc/unfcccdocuments1/CP_27/html/13_23_CP_27/total_pages.html
(Note that the HTML does not display on GitHub; you have to download it and
display it locally.)
The output synthesises paragraphs automatically. In production these will
all be given unique IDs. (Note the boxes outline the paragraphs and these
will flow, unlike the PDF.)
Will really value your comments - what you find useful and what extra you
would like to see.
Note that HTML can be searched and manipulated with xpath and regex.
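
As a rough illustration of that kind of search, the sketch below combines an XPath query with a regex using only the Python standard library. The sample markup, the `section` class, the element ids and the decision-reference pattern are all made up for the example and do not reflect the actual structure of the generated HTML. `xml.etree.ElementTree` only supports a subset of XPath and needs well-formed markup, so real-world HTML may call for a more forgiving parser.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical, well-formed stand-in for the generated HTML.
SAMPLE = """<html><body>
<div id="sec-1" class="section"><h2>1.</h2>
<p id="p1">Recalls decision 1/CP.21 on adaptation.</p></div>
<div id="sec-2" class="section"><h2>2.</h2>
<p id="p2">Requests the secretariat to report.</p>
<p id="p3">Also recalls decision 3/CP.26.</p></div>
</body></html>"""

root = ET.fromstring(SAMPLE)

# XPath: every paragraph directly inside a div whose class is "section".
paragraphs = root.findall(".//div[@class='section']/p")

# Regex: references of the form "<number>/CP.<number>".
DECISION = re.compile(r"\b\d+/CP\.\d+\b")

# Map paragraph id -> references found in that paragraph's text.
hits = {}
for p in paragraphs:
    found = DECISION.findall("".join(p.itertext()))
    if found:
        hits[p.get("id")] = found
```

Once paragraphs carry unique IDs, results like `hits` can point back to an exact paragraph rather than to a page.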
Peter
On Tue, Oct 29, 2024 at 1:06 PM huzaifa-softoo wrote:
Thank you for your prompt response. I will review the information
thoroughly and let you know.
-
I am currently working on a project which takes in PDF files as the input document. One of the use cases requires the extracted text to be segmented into headers and the corresponding paragraphs. Wondering if anybody has done something similar either using pdfplumber or pdfminer.six (I am sort of limited to these 2 due to licensing) and if they are able to share some code to get me started.
My current code uses the font size and the font itself to detect headers, but the precision and recall aren't great. I am open to other solutions as well.
Thanks in advance :)
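
For reference, a font-size and font-name baseline of the kind described above might look like the sketch below. It is an assumption about the general approach, not the poster's actual code. The input is a list of dicts with `text`, `size` and `fontname` keys; with pdfplumber, word-level dicts carrying those keys can be obtained from `page.extract_words(extra_attrs=["fontname", "size"])` and would then need grouping into lines. The function names, the `size_margin` value and the 80-character cutoff are hypothetical.

```python
from collections import Counter


def is_bold(fontname):
    """Guess boldness from the font name, e.g. 'ABCDEF+Arial-BoldMT'.

    Font naming is not standardised, so this is only a heuristic.
    """
    return "bold" in fontname.lower()


def body_size(lines):
    """Most common font size weighted by text length, assumed to be body text."""
    counts = Counter()
    for ln in lines:
        counts[round(ln["size"], 1)] += len(ln["text"])
    return counts.most_common(1)[0][0]


def tag_lines(lines, size_margin=1.0, max_header_len=80):
    """Tag each line as 'header' or 'para' relative to the body font size."""
    base = body_size(lines)
    tagged = []
    for ln in lines:
        larger = ln["size"] >= base + size_margin
        short_bold = is_bold(ln["fontname"]) and len(ln["text"]) < max_header_len
        tagged.append(("header" if larger or short_bold else "para", ln["text"]))
    return tagged


sample = [
    {"text": "Summary for Policymakers", "size": 18.0,
     "fontname": "ABCDEF+Arial-BoldMT"},
    {"text": "Human activities have unequivocally caused global warming "
             "over the last century.", "size": 10.0,
     "fontname": "ABCDEF+ArialMT"},
    {"text": "Observed changes", "size": 10.0,
     "fontname": "ABCDEF+Arial-BoldMT"},
    {"text": "Global surface temperature has risen faster since 1970 "
             "than in any other period.", "size": 10.0,
     "fontname": "ABCDEF+ArialMT"},
]
tags = [kind for kind, _ in tag_lines(sample)]
```

Measuring sizes against the document's own dominant size, instead of a fixed point value, is one way to make such a baseline less brittle across documents.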