Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

text-splitters: fix bug in TextSplitter's create_documents method #29504

Closed

Conversation

DmitrySirakov
Copy link

Description:
Fixed a bug in the TextSplitter class within the create_documents method. When _add_start_index is set to True, the length of the chunk was measured using the len() function (which counts characters) instead of the user-defined self.length_function(). This inconsistency could lead to incorrect values in metadata['start_index'].

Solution:
Replaced len(chunk) with self.length_function(chunk) to ensure that the chunk length is measured consistently using the user-provided function, thereby maintaining the correctness of metadata['start_index'].

@dosubot dosubot bot added the size:XS This PR changes 0-9 lines, ignoring generated files. label Jan 30, 2025
Copy link

vercel bot commented Jan 30, 2025

The latest updates on your projects. Learn more about Vercel for Git ↗︎

1 Skipped Deployment
Name Status Preview Comments Updated (UTC)
langchain ⬜️ Ignored (Inspect) Visit Preview Jan 30, 2025 10:23am

@dosubot dosubot bot added Ɑ: text splitters Related to text splitters package 🤖:bug Related to a bug, vulnerability, unexpected error with an existing feature labels Jan 30, 2025
Copy link
Collaborator

@ccurme ccurme left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for this. I think the index is intended to be in characters as opposed to units of the custom length function:

  • index is calculated with text.find, which is counting characters
  • self._chunk_overlap also appears in the calculation, and this is documented as "Overlap in characters between chunks".

Closing for now but let me know if I've misunderstood this or if you can add a test case.

@ccurme ccurme closed this Jan 31, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
🤖:bug Related to a bug, vulnerability, unexpected error with an existing feature size:XS This PR changes 0-9 lines, ignoring generated files. Ɑ: text splitters Related to text splitters package
Projects
Status: Closed
Development

Successfully merging this pull request may close these issues.

2 participants