In the text content read by TikaDocumentReader, text and metadata are mixed together #1999

Open
MusicBoooox opened this issue Dec 23, 2024 · 1 comment

This is my code:

    // Read the file with Tika; each Document carries the extracted text.
    List<Document> documents = new TikaDocumentReader(resource).read();

    // Split the documents into chunks using the settings held in knowledgeBaseFileSlice.
    return new TokenTextSplitter(knowledgeBaseFileSlice.getDefaultChunkSize(), knowledgeBaseFileSlice.getMinChunkSizeChars(),
            knowledgeBaseFileSlice.getMinChunkLengthToEmbed(), knowledgeBaseFileSlice.getMaxNumChunks(),
            knowledgeBaseFileSlice.isKeepSeparator()).apply(documents);

When I iterate over the returned documents, getText() returns the file's metadata mixed in with the body text. I don't know why this happens.

MusicBoooox commented Dec 23, 2024

The output looks like this:

docProps/app.xml
Normal.dotm 1 0 0 0 0 0 false false 0 WPS Office_6.11.0.8885_F1E327BC-269C-435d-A152-05C5408002CA 0

docProps/core.xml
2024-12-20T15:26:00Z HBN HBN 2024-12-20T15:27:20Z 1

docProps/custom.xml
2052-6.11.0.8885 1D2337B3384BA7C4251C65677B81A9CB_41

word/styles.xml

word/settings.xml

word/theme/theme1.xml

word/document.xml
RAG背景知识文档 RAG(Retrieval-Augmented Generation)是一种结合信息检索和生成模型的混合方法,旨在提高文本生成任务的质量和准确性。
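
The headings in this output (docProps/app.xml, word/document.xml, and so on) are the names of the parts inside a .docx package, which suggests the file was read as a generic archive rather than as a Word document. The cause is not established in this issue. As a stopgap, the sketch below keeps only the text that follows the word/document.xml heading. It is plain Java string handling based solely on the output format shown above, not a Spring AI or Tika API; the class name `DocxPartFilter`, the method `extractMainDocumentText`, and the heading pattern are all hypothetical.

```java
import java.util.regex.Pattern;

public class DocxPartFilter {

    // Assumption: a part heading is a line holding only a relative path that ends
    // in ".xml", as in the output above (e.g. "docProps/app.xml").
    private static final Pattern PART_HEADER =
            Pattern.compile("^[\\w.-]+(/[\\w.-]+)*\\.xml$");

    // Returns only the text found under the "word/document.xml" heading.
    // If that heading never appears, the input is returned unchanged.
    public static String extractMainDocumentText(String tikaText) {
        StringBuilder body = new StringBuilder();
        boolean inMainPart = false;
        boolean sawMainPart = false;
        for (String line : tikaText.split("\\R")) {
            String trimmed = line.trim();
            if (PART_HEADER.matcher(trimmed).matches()) {
                // A new part starts here; keep its text only if it is the main document.
                inMainPart = trimmed.equals("word/document.xml");
                sawMainPart = sawMainPart || inMainPart;
                continue;
            }
            if (inMainPart) {
                body.append(line).append('\n');
            }
        }
        return sawMainPart ? body.toString().trim() : tikaText;
    }

    public static void main(String[] args) {
        String sample = String.join("\n",
                "docProps/core.xml",
                "2024-12-20T15:26:00Z HBN HBN 2024-12-20T15:27:20Z 1",
                "",
                "word/styles.xml",
                "",
                "word/document.xml",
                "RAG background document");
        System.out.println(extractMainDocumentText(sample));
    }
}
```

The filter would be applied to each document's getText() result before the documents are passed to the splitter. It only hides the symptom; body text that itself consists of a bare path ending in .xml would be mistaken for a heading.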
