feat: add option to segregate Table elements in their own chunk/split #3827
Comments
@dipanjanS This "combine-table-elements-with-other-elements-when-they-will-fit" behavior was purposely added in
For the time being, if you prefer the "segregate-tables-during-chunking" behavior, I think the only option would be for you to pin to
Can you say more about why you prefer
If it seems justified, we can consider adding a chunking option to switch this on and off.
Hi @scanny, I actually like your idea; it's a nice way to keep some context along with the table. It would still be great to have an option to switch it on and off.

When building multimodal RAG systems, I often prefer to separate the image, table, and text chunks, create descriptive summaries of the tables, and embed those summaries instead. The summaries are then used to retrieve the actual tables (tables and descriptions are linked), as you can see in this guide (where I've also used unstructured):

Here's the architecture as a TL;DR: we embed detailed descriptions of the tables and images (especially charts etc.), and those are used to retrieve tables alongside text/image chunks:

I have also used this in some real-world settings and it works pretty well, so having that option would be super useful (just like the function parameter for dumping images into a folder). I will also be sure to try the new settings in my pipeline some time and see how they work. As always, open to hearing your thoughts!
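For what it's worth, one can still work around the issue in code by:
- calling partition without chunking options
- splitting the list of returned elements on the element types you do not want chunked
- for each list of elements from ^^, calling partition with chunk parameters, where the list of elements is represented as unstructured .json
But, fair point, it would be nice if this behavior was available via chunk parameters in partition.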
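The workaround amounts to splitting the partitioned element list at every table, chunking only the runs in between, and passing each table through untouched. Below is a minimal sketch of that splitting step, not the library's own implementation. `Element`, `chunk_keeping_tables_apart`, and `join_all` are hypothetical names introduced here for illustration, and the sketch takes the chunker as a plain function instead of round-tripping through unstructured .json as described above.

```python
from dataclasses import dataclass
from itertools import groupby
from typing import Callable, List, Sequence


@dataclass
class Element:
    """Hypothetical stand-in for a partitioned element (illustration only)."""
    category: str  # e.g. "Title", "NarrativeText", "Table"
    text: str


def chunk_keeping_tables_apart(
    elements: Sequence[Element],
    chunk_fn: Callable[[List[Element]], List[Element]],
    is_table: Callable[[Element], bool] = lambda el: el.category == "Table",
) -> List[Element]:
    """Chunk each run of consecutive non-table elements; emit tables as-is."""
    result: List[Element] = []
    # groupby yields maximal runs of consecutive elements sharing the same key
    for run_is_tables, run in groupby(elements, key=is_table):
        run_list = list(run)
        if run_is_tables:
            result.extend(run_list)            # every table stays its own chunk
        else:
            result.extend(chunk_fn(run_list))  # only text runs get chunked
    return result


def join_all(run: List[Element]) -> List[Element]:
    """Toy chunker that merges a run into one element; swap in a real chunker."""
    return [Element("CompositeElement", " ".join(el.text for el in run))]


elements = [
    Element("Title", "Results"),
    Element("NarrativeText", "Revenue grew."),
    Element("Table", "Q1 | Q2"),
    Element("NarrativeText", "Costs fell."),
]
chunks = chunk_keeping_tables_apart(elements, join_all)
for chunk in chunks:
    print(chunk.category, "->", chunk.text)
```

With unstructured itself, the assumption is that `elements` would come from `partition(...)` called without a chunking strategy and that `chunk_fn` would be a real chunker such as `chunk_by_title` from `unstructured.chunking.title`; that pairing has not been verified here against 0.16.11. Because document order is preserved, each table remains adjacent to the text chunks that surrounded it.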
Agreed, would be nice to have it as a parameter.
For now I'm using the method you mentioned. It does have the added manual
step to separate the elements but that's still not too bad. Thanks!
On Tue, Dec 17, 2024, 00:04 cragwolfe wrote:
For what it's worth, one can still work around the issue in code by:
- calling partition without chunking options
- split on the list of returned elements on element types you do not want chunked
- for each list of elements from ^^, call partition with chunk parameters where the list of elements is represented as unstructured .json
But, fair point it would be nice if this behavior was available via chunk parameters in partition.
Unfortunately, the latest version of unstructured no longer returns tables as separate elements when chunking is enabled, although this used to work.
The version I am using is 0.16.11 (the latest I get right now when installing it).
Partitioning with chunking gives me 5 elements; earlier there used to be 7 (including 2 tables).
If I remove chunking I still get the table elements, which means table detection works, but chunking seems to combine them with other elements so that the separate table elements disappear.
Any tips for fixing this? I do not want to mix my tables with the text. It worked fine even in 0.16.5.