Implement LLM-based Document Splitting #78
Comments
How about going with this (https://python.langchain.com/docs/how_to/#text-splitters) and then using it with an LLM for splitting?
This seems the most similar: https://python.langchain.com/docs/how_to/semantic-chunker/ But it depends on embeddings and relies on hard-coded criteria for splitting (i.e., 3 sentences). I also wonder whether we need chunks to be contiguous at all, as long as each chunk is correct in reading order. For example, imagine the following 4 units in a transcript of a conversation: A, B, C, and D. Suppose B is totally out of context (an aside from one of the members), but A, C, and D are part of the main topic. I think a valid chunk set is {{A + C + D}, {B}}. This set has 2 chunks, and each chunk's reading order is preserved. This chunk set will not be generated by the chunking strategy I proposed in the issue, but it could be useful for some downstream analysis...
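To make the "non-contiguous but in reading order" idea concrete, here is a minimal sketch. The helper names `is_valid_chunk_set` and `render_chunks` are hypothetical, and the rule that every unit must appear in exactly one chunk is an assumption rather than something stated in this thread.

```python
from typing import List, Sequence


def is_valid_chunk_set(num_units: int, chunks: Sequence[Sequence[int]]) -> bool:
    """Check a chunk set given as lists of unit indices.

    Assumed rules (not confirmed by the issue): every unit index in
    0..num_units-1 appears in exactly one chunk, and the indices inside
    each chunk are strictly increasing, i.e. reading order is preserved.
    Chunks do NOT have to be contiguous.
    """
    seen = set()
    for chunk in chunks:
        if not chunk:
            return False
        # Strictly increasing indices means reading order is preserved.
        if any(a >= b for a, b in zip(chunk, chunk[1:])):
            return False
        for i in chunk:
            if i < 0 or i >= num_units or i in seen:
                return False
            seen.add(i)
    return len(seen) == num_units


def render_chunks(units: Sequence[str], chunks: Sequence[Sequence[int]]) -> List[str]:
    """Join the units of each chunk into one string, keeping chunk order."""
    return [" ".join(units[i] for i in chunk) for chunk in chunks]


units = ["A", "B", "C", "D"]
chunks = [[0, 2, 3], [1]]  # {A + C + D}, {B}
print(is_valid_chunk_set(len(units), chunks))
print(render_chunks(units, chunks))
```

A chunk such as `[2, 0, 3]` would be rejected under these rules, because C would be read before A.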
A downside to chunking on embedding cosine similarity space is that the chunks aren't related to any particular task. We want our chunking strategy to be tied to the task, defined by some user prompt.
May I ask if this is still an issue? I would like to work on it, as I am already working on semantic chunking for my project.
I think yes, this is still an open issue. Feel free to open a PR for it :)
Yes, please feel free to take it!
As requested by a member of the community, it would be cool to implement a new feature for splitting documents using an LLM instead of our current token or delimiter-based methods. This will allow for more intelligent and context-aware splitting of documents.
Proposed Idea
split_criteria_prompt that describes how to split the document (e.g., by topic).
Technical Approach
Considerations
Proposed Interface Design
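One possible shape for such an interface, as a rough sketch only. Only `split_criteria_prompt` comes from the issue. The function name `split_with_llm`, the `llm` callable (any function that maps a prompt string to a reply string), the JSON reply format, and the single-chunk fallback are all assumptions made for illustration, not the project's actual design.

```python
import json
from typing import Callable, List, Sequence


def split_with_llm(
    units: Sequence[str],
    split_criteria_prompt: str,
    llm: Callable[[str], str],  # hypothetical: prompt in, raw text reply out
) -> List[str]:
    """Ask an LLM to group pre-split units (e.g. sentences) into chunks."""
    numbered = "\n".join(f"[{i}] {u}" for i, u in enumerate(units))
    prompt = (
        f"{split_criteria_prompt}\n\n"
        "Group the numbered units below into chunks. Reply with only a JSON "
        "list of lists of unit numbers, e.g. [[0, 1], [2]].\n\n"
        f"{numbered}"
    )
    raw = llm(prompt)
    try:
        groups = json.loads(raw)
        # Every unit must be assigned to exactly one chunk.
        flat = sorted(i for g in groups for i in g)
        if flat != list(range(len(units))):
            raise ValueError("reply does not cover each unit exactly once")
    except (ValueError, TypeError):
        # Assumed fallback: keep the whole document as a single chunk.
        return [" ".join(units)]
    # Sort indices inside each chunk so reading order is preserved.
    return [" ".join(units[i] for i in sorted(g)) for g in groups if g]


def fake_llm(prompt: str) -> str:
    """Stand-in for a real model call, used only to exercise the sketch."""
    return "[[0, 2, 3], [1]]"


units = ["A", "B", "C", "D"]
result = split_with_llm(units, "Split the transcript by topic.", fake_llm)
print(result)
```

Validating the reply before using it matters here because a model may skip or repeat units; in this sketch any malformed reply degrades to one unsplit chunk instead of silently dropping text.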