In the text content read by TikaDocumentReader, text and metadata are mixed together #1999

MusicBoooox · 2024-12-23T02:07:02Z

This is my code:

    List<Document> documents = new TikaDocumentReader(resource).read();

    return new TokenTextSplitter(knowledgeBaseFileSlice.getDefaultChunkSize(), knowledgeBaseFileSlice.getMinChunkSizeChars(),
            knowledgeBaseFileSlice.getMinChunkLengthToEmbed(), knowledgeBaseFileSlice.getMaxNumChunks(),
            knowledgeBaseFileSlice.isKeepSeparator()).apply(documents);

When I iterate over the documents, getText() returns the document text mixed together with metadata. I don't understand why this happens.


MusicBoooox · 2024-12-23T02:08:12Z

The output looks like this:

docProps/app.xml
Normal.dotm 1 0 0 0 0 0 false false 0 WPS Office_6.11.0.8885_F1E327BC-269C-435d-A152-05C5408002CA 0

docProps/core.xml
2024-12-20T15:26:00Z HBN HBN 2024-12-20T15:27:20Z 1

docProps/custom.xml
2052-6.11.0.8885 1D2337B3384BA7C4251C65677B81A9CB_41

word/styles.xml

word/settings.xml

word/theme/theme1.xml

word/document.xml
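RAG Background Knowledge Document RAG (Retrieval-Augmented Generation) is a hybrid approach that combines information retrieval with generative models, aiming to improve the quality and accuracy of text generation tasks.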
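
The section headings in this output (docProps/app.xml, word/styles.xml, word/document.xml, and so on) match the part names inside a .docx file, which is a ZIP container. That suggests the file may have been read as a generic archive rather than as a Word document, although the thread does not confirm the cause. As a stopgap, a minimal sketch of a post-processing filter that keeps only the text under word/document.xml might look like the following. The class `DocxBodyFilter`, the method `bodyOnly`, and the header pattern are hypothetical names introduced here for illustration; they are not part of Spring AI or Tika.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

public class DocxBodyFilter {

    // Assumption: a section header is a line that contains only a part path,
    // such as "docProps/app.xml" or "word/theme/theme1.xml".
    private static final Pattern PART_HEADER =
            Pattern.compile("^[\\w./\\[\\]-]+\\.(xml|rels)$");

    // Assumption: the main body text sits under this part, as in the output above.
    private static final String BODY_PART = "word/document.xml";

    /**
     * Returns only the non-blank lines that follow the word/document.xml header,
     * up to the next part header. If that header never appears, the input is
     * returned unchanged so that correctly extracted text is not lost.
     */
    public static String bodyOnly(String extracted) {
        List<String> kept = new ArrayList<>();
        boolean inBody = false;
        boolean sawBody = false;
        for (String line : extracted.split("\\R")) {
            String trimmed = line.trim();
            if (PART_HEADER.matcher(trimmed).matches()) {
                // A new section starts; keep lines only while inside the body part.
                inBody = trimmed.equals(BODY_PART);
                sawBody = sawBody || inBody;
                continue;
            }
            if (inBody && !trimmed.isEmpty()) {
                kept.add(trimmed);
            }
        }
        return sawBody ? String.join("\n", kept) : extracted;
    }
}
```

One way to use it would be to run `bodyOnly` over each document's getText() result and rebuild the documents from the filtered text before passing them to TokenTextSplitter. The filter is a heuristic: a body line that happens to consist solely of a path ending in .xml would be misread as a header.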
