Allow embedding model to capture full content of text blocks #28
Conversation
LGTM
For RAG, it was an experimental option. Here it's not, and would become the default approach. I'm assuming that's fine haha, as opposed to doing some sort of user testing to see which works better?
Aside from that, this will need a new release, and then deployment of the pipeline, since it's all a manual process atm. I'm guessing Platform will need to help with that?
Co-authored-by: Jesse Claven <[email protected]>
@jesse-c I think for now we can lean on the fact that this is conceptually better, and fine-tune the stride parameter later (or avoid this issue of chunks being greater than the context window length entirely 🤣). And yes from a Platform & release side, but do we want to wait until a batch of changes are in for the project? I guess we didn't discuss this last week.
Haha makes sense. The context window constraint here being the LLM's context window?
Responding on Slack.
Looks good!
Proper boring, theoretical nit-picking from here onwards, feel free to ignore!
How long is our longest passage? How many windows are we likely to create for the average passage? How many passages will this pooling affect? And for our long passages, how often does the meaning (ie the position in semantic space) change across the length of the passage?
There's a risk that towards the limit (ie very long passages, each containing loads of semantic context switches), taking the `mean` will smooth out all of the meaningful spikes in the underlying embeddings, leaving each one looking like the average embedding, indistinct in semantic space. I've found that in some cases, taking the `max` or `median` is a more appropriate pooling of vectors like this (depending on how the encoder's output embeddings are distributed)... Any thoughts? My guess is that this isn't a problem worth worrying about, but thought I'd flag anyway.
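As a quick illustration of the trade-off being discussed (not code from this PR), here is a minimal sketch of the three pooling options over per-window embeddings; the `pool` helper and its signature are hypothetical:

```python
import numpy as np

def pool(window_embeddings: np.ndarray, method: str = "mean") -> np.ndarray:
    """Collapse an (n_windows, dim) array of window embeddings into a single (dim,) vector.

    - "mean" smooths across windows (and can wash out spikes in very long passages)
    - "max" keeps the strongest activation per dimension
    - "median" is a more outlier-robust middle ground
    """
    if method == "mean":
        return window_embeddings.mean(axis=0)
    if method == "max":
        return window_embeddings.max(axis=0)
    if method == "median":
        return np.median(window_embeddings, axis=0)
    raise ValueError(f"Unknown pooling method: {method}")
```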
@harrisonpim all good questions! 😁 The answer to all of them is that we don't know until we do the data quality metrics work. This PR also adds a much more efficient (I think) approach. I think (hope) then we might revisit the idea of having passages that are way longer than the context window anyway. So, if you're happy, let's merge this and continue. If not, happy to hold off and tackle the metrics work first, but I think we'll probably want to rip apart some of these processes just after that anyway.
Yeah, very happy to let this work move ahead! It seems like an upgrade on the current state in a lot of ways, but I thought I'd flag the weird potential edge case effect while it was on my mind. One to address another day 😊
When text is longer than the underlying encoder's context window, take a sliding window of the text (with window length equal to context length, stride half that) and average those embeddings.
Different values of `stride` haven't been tested here – leaving that for later work. Note this has already been implemented in RAG (relevant PR).
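A minimal sketch of the sliding-window-and-average approach described above, assuming a SentenceTransformers-style encoder; the function names and the whitespace tokenisation are illustrative, not the actual implementation in this PR:

```python
import numpy as np

def sliding_windows(text: str, window: int, stride: int) -> list[str]:
    """Split `text` into overlapping windows of `window` tokens, advancing by `stride`.

    Whitespace tokenisation is a simplification; a real pipeline would measure
    length with the encoder's own tokenizer.
    """
    tokens = text.split()
    if len(tokens) <= window:
        return [text]
    windows = []
    for start in range(0, len(tokens), stride):
        windows.append(" ".join(tokens[start : start + window]))
        if start + window >= len(tokens):
            break
    return windows

def embed_long_text(model, text: str, context_length: int) -> np.ndarray:
    """Embed `text`, mean-pooling the window embeddings when it exceeds the context length."""
    windows = sliding_windows(text, window=context_length, stride=context_length // 2)
    embeddings = np.stack([model.encode(w) for w in windows])
    return embeddings.mean(axis=0)
```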
Testing
Added tests for encoding texts, including one obviously much longer than the sliding window length, and for the function that calculates sliding windows on a string.
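For illustration, a hypothetical pytest-style sketch of the kind of tests described, using the `sliding_windows` helper sketched above (names are not the ones in this repo):

```python
def test_sliding_windows_short_text_is_single_window():
    # Text shorter than the window should come back unchanged as one chunk.
    assert sliding_windows("a short passage", window=512, stride=256) == ["a short passage"]

def test_sliding_windows_long_text_overlaps_by_half():
    text = " ".join(f"tok{i}" for i in range(1000))
    windows = sliding_windows(text, window=512, stride=256)
    # Long text should produce multiple windows, with adjacent windows overlapping by the stride.
    assert len(windows) > 1
    assert windows[0].split()[256:] == windows[1].split()[:256]
```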