feat: add option to segregate Table elements in their own chunk/split #3827

Open
dipanjanS opened this issue Dec 13, 2024 · 4 comments

@dipanjanS

dipanjanS commented Dec 13, 2024

Unfortunately, the latest version of unstructured no longer returns tables as separate elements when chunking is enabled, although this used to work.

The version I am using is 0.16.11 (the latest I get when installing right now).

!wget https://sgp.fas.org/crs/misc/IF10244.pdf

from unstructured.partition.pdf import partition_pdf

filename = './IF10244.pdf'
elements = partition_pdf(
    filename=filename,
    strategy='hi_res',
    extract_images_in_pdf=True,
    infer_table_structure=True,
    chunking_strategy="by_title",  # section-based chunking
    max_characters=4000,
    new_after_n_chars=4000,
    combine_text_under_n_chars=2000,
    mode='elements',
    image_output_dir_path='./figures',
)

len(elements)

This gives me 5 elements; earlier there used to be 7 (including 2 tables).

If I remove chunking I still get the table elements, which means table detection works. So it seems chunking is combining the tables with other elements, and the separate table elements disappear?

Any tips for fixing this? I do not want to mix my tables with the text. Even in 0.16.5 it used to work fine.

@dipanjanS dipanjanS added the bug Something isn't working label Dec 13, 2024
@scanny
Collaborator

scanny commented Dec 14, 2024

@dipanjanS This "combine-table-elements-with-other-elements-when-they-will-fit" behavior was purposely added in 0.16.11. Our metrics indicated it produced noticeably better recall, in particular when tables had a caption.
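
Since the tables are combined into neighbouring chunks rather than discarded, it may be possible to recover them from the chunks themselves. The following is only a rough sketch: it assumes each chunk exposes `metadata.orig_elements` (the pre-chunking elements, which chunking keeps by default as far as I know) and that elements carry a `category` string such as "Table". The helper name `tables_in_chunks` is hypothetical, and the stand-in objects at the bottom exist only to show the control flow without needing unstructured installed.

```python
from types import SimpleNamespace
from typing import Any, Iterable, List


def tables_in_chunks(chunks: Iterable[Any]) -> List[Any]:
    """Collect Table elements folded into chunks (hypothetical helper).

    Assumes `chunk.metadata.orig_elements` lists the original elements;
    when it is missing or empty, the chunk itself is inspected instead.
    """
    tables: List[Any] = []
    for chunk in chunks:
        metadata = getattr(chunk, "metadata", None)
        originals = getattr(metadata, "orig_elements", None) or [chunk]
        tables.extend(
            e for e in originals if getattr(e, "category", None) == "Table"
        )
    return tables


# Stand-ins that mimic a caption and a table merged into one chunk.
no_origs = SimpleNamespace(orig_elements=None)
caption = SimpleNamespace(category="NarrativeText", text="Table 1", metadata=no_origs)
table = SimpleNamespace(category="Table", text="a | b", metadata=no_origs)
combined = SimpleNamespace(
    category="CompositeElement",
    text="Table 1 a | b",
    metadata=SimpleNamespace(orig_elements=[caption, table]),
)

found = tables_in_chunks([combined])
print([t.text for t in found])
```

This only locates the tables after the fact; the chunk text still contains the table text mixed with its neighbours, so it does not give the segregated behavior by itself.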

For the time being, if you prefer the "segregate-tables-during-chunking" behavior I think the only option would be for you to pin to <= 0.16.10.

Can you say more about why you prefer Table elements to be separate? Like what aspect in your application makes that desirable?

If it seems justified we can consider adding a chunking option to switch this on and off.

@scanny scanny added awaiting-response and removed bug Something isn't working labels Dec 14, 2024
@scanny scanny changed the title bug/chunking makes tables disappear in new version of unstructured feat: add option to segregate Table elements in their own chunk/split Dec 14, 2024
@dipanjanS
Author

dipanjanS commented Dec 15, 2024

Hi @scanny, I actually like your idea; it's a nice way to keep some context along with the table. It would also be great to have an option to switch it on and off.

When building multimodal RAG systems, I often prefer to separate out the images, tables, and text chunks, then create descriptive summaries of the tables and embed those summaries instead. The summaries are then used to retrieve the actual tables (tables and descriptions are linked), as you can see in this guide (where I've also used unstructured):
https://www.analyticsvidhya.com/blog/2024/09/guide-to-building-multimodal-rag-systems/

Here's the architecture as a TL;DR: basically, we embed detailed descriptions of the tables and images (especially charts etc.) so that they can be used to retrieve tables alongside text/image chunks:
[image: architecture diagram]

I have also used this in some real-world settings and it works pretty well, so having that option would be super useful (just like the function parameter we have for dumping the images into a folder). I will also definitely try the new behavior in my pipeline at some point and see how it works.

As always, open to hearing your thoughts!

@cragwolfe
Contributor

For what it's worth, one can still work around the issue in code by:

  • calling partition without chunking options
  • splitting the list of returned elements on the element types you do not want chunked
  • for each list of elements from the previous step, calling partition with chunk parameters, where the list of elements is represented as unstructured .json

But fair point: it would be nice if this behavior was available via chunk parameters in partition.
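
The steps above could be sketched roughly as follows. Note the swap: instead of re-partitioning each sub-list from unstructured .json, this sketch passes each run of non-table elements straight to `chunk_by_title` from `unstructured.chunking.title`, which takes a list of elements. The helper names (`is_table`, `chunk_with_tables_apart`, `chunk_pdf_tables_apart`) are hypothetical, and it assumes elements expose a `category` string such as "Table". The demo at the bottom uses stand-in objects and a toy chunker so the control flow can be run without unstructured installed.

```python
from itertools import groupby
from types import SimpleNamespace
from typing import Any, Callable, List, Sequence


def is_table(element: Any) -> bool:
    """True for elements whose `category` is "Table" (assumed attribute)."""
    return getattr(element, "category", None) == "Table"


def chunk_with_tables_apart(
    elements: Sequence[Any],
    chunk_fn: Callable[[List[Any]], List[Any]],
) -> List[Any]:
    """Chunk each run of non-table elements; pass tables through untouched."""
    chunks: List[Any] = []
    for run_is_tables, run in groupby(elements, key=is_table):
        run_elements = list(run)
        if run_is_tables:
            chunks.extend(run_elements)  # every table stays its own item
        else:
            chunks.extend(chunk_fn(run_elements))
    return chunks


def chunk_pdf_tables_apart(filename: str) -> List[Any]:
    """Hypothetical wrapper; needs unstructured installed with PDF support."""
    from unstructured.chunking.title import chunk_by_title
    from unstructured.partition.pdf import partition_pdf

    # Step 1: partition without any chunking options.
    elements = partition_pdf(
        filename=filename, strategy="hi_res", infer_table_structure=True
    )
    # Steps 2 and 3: split on tables, then chunk only the text runs.
    return chunk_with_tables_apart(
        elements,
        lambda run: chunk_by_title(
            run,
            max_characters=4000,
            new_after_n_chars=4000,
            combine_text_under_n_chars=2000,
        ),
    )


# Stand-in elements and a toy chunker, only to show the control flow.
demo = [
    SimpleNamespace(category=c, text=t)
    for c, t in [
        ("Title", "Intro"),
        ("NarrativeText", "Body"),
        ("Table", "a | b"),
        ("NarrativeText", "More"),
    ]
]


def merge(run: List[Any]) -> List[Any]:
    """Toy chunker: join a run of elements into one composite stand-in."""
    text = " ".join(e.text for e in run)
    return [SimpleNamespace(category="CompositeElement", text=text)]


result = chunk_with_tables_apart(demo, merge)
print([(c.category, c.text) for c in result])
```

One trade-off of this approach: because every table ends a run, a section that continues after a table is chunked separately from the text before it, so the chunk boundaries will not exactly match what 0.16.10 produced.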

@dipanjanS
Author

dipanjanS commented Dec 16, 2024 via email
