feat/code-snippets-context #3271
Comments
@asm0dey what is the source file format you're partitioning from? HTML? Markdown maybe? I think the first prerequisite would be recognizing and distinguishing code blocks during partitioning, and that would depend on how they were identified in each particular document format.
Sorry, forgot to mention that it's Markdown!
@asm0dey after noodling this for a while I don't believe In particular, a code-snippet is going to parse as a distinct paragraph and so will map to a distinct If you were going to "group" narrative text (i.e. paragraph before and after code-snippet) you would need to do that as a post-processing step, perhaps as a custom chunker. You'd also need to work out what the chunks would be because it wouldn't be a narrative-text element anymore and also wouldn't be a All that said, identifying
Yes, and it probably also makes sense to make it atomic, right? Because it never makes sense to split a code snippet.
Well, yes, but sense and document structure don't always have to do with each other. Also, if you're chunking and the snippet is bigger than the specified chunk size then it's going to get split. But in general, in HTML-partitioned formats (which includes Markdown), I'd expect to see the snippet as a single
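To make the "recognize code blocks first" prerequisite concrete for Markdown, here is a minimal sketch that splits a document into prose paragraphs and whole fenced code blocks. It is plain Python and does not use or represent unstructured's partitioner; `Block` and `split_markdown` are hypothetical names, and only backtick fences are handled (no indented code blocks, no tilde fences).

```python
from dataclasses import dataclass

FENCE = "`" * 3  # built indirectly so the example can itself live inside a fence


@dataclass
class Block:
    kind: str  # "text" or "code"
    text: str


def split_markdown(source: str) -> list[Block]:
    """Split Markdown into prose paragraphs and atomic fenced code blocks."""
    blocks: list[Block] = []
    buffer: list[str] = []
    in_code = False

    def flush(kind: str) -> None:
        # Emit the buffered lines as one block, skipping empty buffers.
        text = "\n".join(buffer).strip("\n")
        if text.strip():
            blocks.append(Block(kind, text))
        buffer.clear()

    for line in source.splitlines():
        if line.lstrip().startswith(FENCE):
            if in_code:
                buffer.append(line)  # closing fence belongs to the snippet
                flush("code")
                in_code = False
            else:
                flush("text")        # end any paragraph in progress
                buffer.append(line)  # opening fence starts the snippet
                in_code = True
        elif in_code:
            buffer.append(line)      # blank lines inside code are kept
        elif line.strip() == "":
            flush("text")            # blank line ends a paragraph
        else:
            buffer.append(line)

    # An unterminated fence is still emitted as a single code block.
    flush("code" if in_code else "text")
    return blocks
```

Each snippet comes out as one block regardless of its length, so any size limit would have to be applied afterwards, at the point where blocks are combined into chunks.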
Is your feature request related to a problem? Please describe.
In a way. I'm trying to build a RAG-based assistant for our documentation. Our documentation is code-heavy (since we develop a JDK distribution). I really want code snippets to appear only in the context of a text—by themselves, they're useless.
Describe the solution you'd like
The perfect solution, I think, is for unstructured to recognize code snippets and have settings to put them in context. For example, code should always include at least one paragraph before and one paragraph after.
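As a rough illustration of the requested behavior (not an existing unstructured setting), the sketch below takes a document already split into `("text", ...)` and `("code", ...)` blocks and builds one chunk per snippet that reaches back and forward to the nearest paragraphs. The function names and the `before`/`after` parameters are hypothetical.

```python
def context_window(blocks, index, before=1, after=1):
    """Return (start, end) indices that cover `before`/`after` text blocks
    around blocks[index]; code blocks in between are swept in as well."""
    start, need = index, before
    while start > 0 and need > 0:
        start -= 1
        if blocks[start][0] == "text":
            need -= 1
    end, need = index, after
    while end < len(blocks) - 1 and need > 0:
        end += 1
        if blocks[end][0] == "text":
            need -= 1
    return start, end


def code_chunks_with_context(blocks, before=1, after=1):
    """blocks: list of (kind, text) pairs, kind being "text" or "code".
    Returns one chunk per distinct window around a code block."""
    chunks, seen = [], set()
    for i, (kind, _) in enumerate(blocks):
        if kind != "code":
            continue
        window = context_window(blocks, i, before, after)
        if window in seen:  # adjacent snippets can share the same window
            continue
        seen.add(window)
        start, end = window
        chunks.append("\n\n".join(text for _, text in blocks[start:end + 1]))
    return chunks
```

Two trade-offs are worth noting: a paragraph that sits between two snippets is duplicated into both chunks, and no size limit is enforced, so a large snippet yields a large chunk rather than a split one.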
Describe alternatives you've considered
I tried playing with the `max_characters` parameter as well as some others, but I always end up with torn code blocks without context somewhere. Another alternative would probably be to split a document cleanly by titles, regardless of section size (obviously code can be big).
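The "split by titles, ignore size" alternative could look roughly like the following for Markdown input. This is a plain-Python sketch under the assumption that titles are ATX headings (`#` through `######`); `split_by_titles` is a hypothetical helper, not part of unstructured. Lines inside a fenced block are never treated as titles, so a shell comment such as `# install` does not cut a snippet in half.

```python
import re

FENCE = "`" * 3  # built indirectly so the example can itself live inside a fence
HEADING = re.compile(r"^#{1,6}\s")


def split_by_titles(source: str) -> list[str]:
    """Split Markdown into one section per heading, with no size limit."""
    sections: list[str] = []
    current: list[str] = []
    in_code = False

    for line in source.splitlines():
        if line.lstrip().startswith(FENCE):
            in_code = not in_code  # toggle on opening and closing fences
        is_title = not in_code and HEADING.match(line) is not None
        if is_title and any(part.strip() for part in current):
            sections.append("\n".join(current).strip())
            current = []
        current.append(line)

    if any(part.strip() for part in current):
        sections.append("\n".join(current).strip())
    return sections
```

Because sections are unbounded, downstream steps (embedding, prompt assembly) would need to tolerate long chunks.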