feat/code-snippets-context #3271

Open
asm0dey opened this issue Jun 21, 2024 · 5 comments
Labels
enhancement New feature or request

Comments

asm0dey commented Jun 21, 2024

Is your feature request related to a problem? Please describe.
In a way. I'm trying to build a RAG-based assistant for our documentation. Our documentation is code-heavy (since we develop a JDK distribution). I really want code snippets to appear only in the context of a text—by themselves, they're useless.

Describe the solution you'd like
The perfect solution, I think, is for unstructured to recognize code snippets and have settings to put them in context. For example, code should always include at least one paragraph before and one paragraph after.
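To make the request concrete, here is a minimal sketch of the behaviour I have in mind, written as plain Python over raw Markdown text. It is not part of unstructured; `split_blocks`, `code_with_context`, and the `before`/`after` parameters are hypothetical names, and it only handles fenced code blocks (not indented ones).

```python
import re

# Opening fence: three or more backticks or tildes (simplified from CommonMark).
FENCE = re.compile(r"^(`{3,}|~{3,})")


def split_blocks(markdown: str) -> list[tuple[str, str]]:
    """Split Markdown into ("code" | "text", block) pairs.

    Fenced code blocks are kept whole, including blank lines inside them;
    everything else is split into paragraphs on blank lines.
    """
    blocks: list[tuple[str, str]] = []
    buf: list[str] = []
    fence = None  # opening fence marker while inside a code block

    def flush(kind: str) -> None:
        text = "\n".join(buf).strip("\n")
        if text.strip():
            blocks.append((kind, text))
        buf.clear()

    for line in markdown.splitlines():
        stripped = line.strip()
        if fence is None:
            m = FENCE.match(stripped)
            if m:
                flush("text")          # close the paragraph before the fence
                fence = m.group(1)
                buf.append(line)
            elif not stripped:
                flush("text")          # blank line ends a paragraph
            else:
                buf.append(line)
        else:
            buf.append(line)
            # A closing fence repeats the fence character and nothing else.
            if stripped.startswith(fence) and set(stripped) == {fence[0]}:
                flush("code")
                fence = None
    # An unterminated fence is treated as code running to the end of the file.
    flush("code" if fence else "text")
    return blocks


def code_with_context(markdown: str, before: int = 1, after: int = 1) -> list[str]:
    """Return one chunk per code block, padded with neighbouring blocks."""
    blocks = split_blocks(markdown)
    chunks = []
    for i, (kind, _) in enumerate(blocks):
        if kind != "code":
            continue
        lo = max(0, i - before)
        hi = min(len(blocks), i + after + 1)
        chunks.append("\n\n".join(text for _, text in blocks[lo:hi]))
    return chunks
```

Note that in this sketch a neighbouring block may itself be another code block, and a paragraph sitting between two snippets ends up in both chunks. Whether that overlap is desirable would be one of the settings.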

Describe alternatives you've considered
I tried playing with the max_characters parameter, as well as some others, but I always end up with torn code blocks without context somewhere. Another alternative would probably be to split the document cleanly by titles without regard to section size (obviously, code can be big).
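For illustration, the split-by-titles alternative could look roughly like the sketch below. It is plain Python over raw Markdown, not an unstructured API; `split_by_headings` is a hypothetical name, and it assumes ATX-style headings (`#` through `######`) and fenced code blocks only. Lines starting with `#` inside a fence (shell comments, for example) are not treated as headings.

```python
import re

HEADING = re.compile(r"^#{1,6}\s+\S")   # ATX heading such as "## Title"
FENCE = re.compile(r"^(`{3,}|~{3,})")   # opening fence of a code block


def split_by_headings(markdown: str) -> list[str]:
    """Split Markdown into one section per heading, with no size limit."""
    sections: list[list[str]] = [[]]
    fence = None  # opening fence marker while inside a code block
    for line in markdown.splitlines():
        stripped = line.strip()
        m = FENCE.match(stripped)
        if fence is None and m:
            fence = m.group(1)                      # entering a code block
        elif (fence is not None and stripped.startswith(fence)
              and set(stripped) == {fence[0]}):
            fence = None                            # leaving the code block
        elif fence is None and HEADING.match(line) and sections[-1]:
            sections.append([])                     # heading starts a new section
        sections[-1].append(line)
    joined = ("\n".join(s).strip() for s in sections)
    return [s for s in joined if s]
```

Each section keeps its code blocks intact together with the surrounding text, at the cost of sections that can be arbitrarily large.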

@asm0dey asm0dey added the enhancement New feature or request label Jun 21, 2024
Collaborator

scanny commented Jun 21, 2024

@asm0dey what is the source file-format you're partitioning from? HTML? Markdown maybe?

I think the first prerequisite would be recognizing and distinguishing code blocks during partitioning, and that would depend on how they were identified in each particular document format.

Author

asm0dey commented Jun 22, 2024

Sorry, forgot to mention that it's markdown!

Collaborator

scanny commented Jul 3, 2024

@asm0dey after noodling on this for a while, I don't believe unstructured is going to be able to provide what you've asked for.

In particular, a code-snippet is going to parse as a distinct paragraph and so will map to a distinct CodeSnippet element.

If you were going to "group" narrative text (i.e. paragraph before and after code-snippet) you would need to do that as a post-processing step, perhaps as a custom chunker.

You'd also need to work out what the chunks would be because it wouldn't be a narrative-text element anymore and also wouldn't be a CodeSnippet element. Maybe CompositeElement would do the trick for you.
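As a rough sketch of what such a post-processing step might look like: the `Element` dataclass below is a stand-in, not the unstructured element classes, and the category strings "CodeSnippet" and "CompositeElement" are just the names used in this thread. `group_snippets` is a hypothetical helper. It folds each snippet together with the element immediately before and after it, and passes everything else through unchanged.

```python
from dataclasses import dataclass


@dataclass
class Element:
    """Stand-in for a partitioned document element (assumption, not the real class)."""
    category: str  # e.g. "Title", "NarrativeText", "CodeSnippet"
    text: str


def group_snippets(elements: list[Element]) -> list[Element]:
    """Merge each code snippet with its neighbouring non-code elements."""
    out: list[Element] = []
    i, n = 0, len(elements)
    while i < n:
        cur = elements[i]
        next_is_code = i + 1 < n and elements[i + 1].category == "CodeSnippet"
        if cur.category != "CodeSnippet" and not next_is_code:
            out.append(cur)  # ordinary element, not adjacent to a following snippet
            i += 1
            continue
        # Build a group: optional leading element, the snippet, optional trailing element.
        group = [cur]
        i += 1
        if cur.category != "CodeSnippet":
            group.append(elements[i])  # the snippet that follows the leading element
            i += 1
        if i < n and elements[i].category != "CodeSnippet":
            group.append(elements[i])  # trailing context
            i += 1
        out.append(Element("CompositeElement", "\n\n".join(e.text for e in group)))
    return out
```

The groups here do not overlap: a paragraph that sits between two snippets is consumed as trailing context of the first one, so the second snippet gets no leading paragraph. A real chunker would also have to decide what to do when the merged text exceeds the chunk size.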

All that said, identifying CodeSnippet elements within Markdown/HTML would be a step in the right direction.

Author

asm0dey commented Jul 3, 2024

Yes, and it probably also makes sense to make it atomic, right? It never makes sense to split a code snippet.

Collaborator

scanny commented Jul 4, 2024

Well, yes, but sense and document structure don't always have to do with each other. Also, if you're chunking and the snippet is bigger than the specified chunk size then it's going to get split.

But in general, in HTML-partitioned formats (which includes Markdown), I'd expect to see the snippet as a single <pre> element with no embedded block elements (<p> or <div> etc.), so that would naturally partition as a single document element.
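To illustrate the point about `<pre>`, here is a small standard-library sketch that collects the text of each `<pre>` element as one unit. It is not how unstructured partitions HTML; `PreCollector` and `extract_pre_blocks` are hypothetical names used only for this example.

```python
from html.parser import HTMLParser


class PreCollector(HTMLParser):
    """Collect the text content of each <pre> element as a single string."""

    def __init__(self) -> None:
        super().__init__(convert_charrefs=True)
        self.snippets: list[str] = []
        self._depth = 0            # > 0 while inside a <pre>
        self._buf: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag == "pre":
            self._depth += 1

    def handle_endtag(self, tag):
        if tag == "pre" and self._depth:
            self._depth -= 1
            if self._depth == 0:
                # The whole <pre> body becomes one snippet, blank lines included.
                self.snippets.append("".join(self._buf))
                self._buf.clear()

    def handle_data(self, data):
        if self._depth:
            self._buf.append(data)


def extract_pre_blocks(html: str) -> list[str]:
    parser = PreCollector()
    parser.feed(html)
    parser.close()
    return parser.snippets
```

Inline tags inside the `<pre>` (such as `<code>` or highlighting `<span>`s) are ignored and only their text is kept, so a snippet comes out as one string regardless of blank lines inside it.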

2 participants