
Allow embedding model to capture full content of text blocks #28

Conversation

@kdutia (Member) commented Sep 16, 2024

When text is longer than the underlying encoder's context window, take sliding windows over the text (with window length equal to the context length and stride half that), embed each window, and average those embeddings.

Different values of stride haven't been tested here – leaving that for later work.
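
For illustration, a minimal sketch of the sliding-window-plus-mean-pooling idea described above. The function names and the `tokenize`/`encode` callables are assumptions for the sketch, not the actual implementation in src/ml.py:

```python
# Minimal sketch (not the PR's actual code): sliding-window embedding with mean pooling.
# `tokenize` and `encode` stand in for whatever the underlying encoder provides.
from typing import Callable, List

import numpy as np


def sliding_windows(tokens: List[int], window: int, stride: int) -> List[List[int]]:
    """Split a token sequence into overlapping windows of length `window`."""
    if len(tokens) <= window:
        return [tokens]
    windows = []
    for start in range(0, len(tokens), stride):
        windows.append(tokens[start : start + window])
        if start + window >= len(tokens):
            break
    return windows


def embed_long_text(
    text: str,
    tokenize: Callable[[str], List[int]],
    encode: Callable[[List[int]], np.ndarray],
    context_length: int,
) -> np.ndarray:
    """Embed each window separately and average the resulting vectors."""
    tokens = tokenize(text)
    windows = sliding_windows(tokens, window=context_length, stride=context_length // 2)
    embeddings = np.stack([encode(w) for w in windows])
    return embeddings.mean(axis=0)
```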

Note this has already been implemented in RAG (relevant PR).

Testing

Added tests for encoding texts (including one obviously much longer than the sliding window length) and for the function that calculates sliding windows on a string.

@kdutia changed the title from "Feature/pods 1543 allow embedding model to capture full content of text blocks" to "Allow embedding model to capture full content of text blocks" Sep 16, 2024
@kdutia kdutia requested review from harrisonpim and a team September 16, 2024 12:57
@kdutia kdutia marked this pull request as ready for review September 16, 2024 12:57
@jesse-c left a comment

LGTM

For RAG, it was an experimental option. Here it's not, and it would become the default approach. I'm assuming that's fine, haha, as opposed to doing some sort of user testing to see which works better?

Aside from that, this will also need a new release and then a deployment of the pipeline, since it's all a manual process at the moment. I'm guessing Platform need to help with that?

src/ml.py (review comment, outdated, resolved)
Co-authored-by: Jesse Claven <[email protected]>
@kdutia (Member, Author) commented Sep 16, 2024

@jesse-c i think for now we can lean on the fact that this is conceptually better, and finetune the stride parameter later (or avoid this issue of chunks being greater than context window length entirely 🤣)

And yes from a Platform & release side, but do we want to wait until a batch of changes are in for the project? I guess we didn't discuss this last week

@jesse-c commented Sep 16, 2024

> @jesse-c i think for now we can lean on the fact that this is conceptually better, and finetune the stride parameter later (or avoid this issue of chunks being greater than context window length entirely 🤣)

Haha makes sense. The context window constraint here being the LLM's context window?

> And yes from a Platform & release side, but do we want to wait until a batch of changes are in for the project? I guess we didn't discuss this last week

Responding on Slack.

@kdutia (Member, Author) commented Sep 16, 2024

> being the LLM's context window
@jesse-c yes :)

@harrisonpim left a comment

Looks good!

Proper boring, theoretical nit picking from here onwards, feel free to ignore!

How long is our longest passage? How many windows are we likely to create for the average passage? How many passages will this pooling affect? And for our long passages, how often does the meaning (ie the position in semantic space) change across the length of the passage?

There's a risk that towards the limit (ie very long passages, each containing loads of semantic context switches), taking the mean will smooth out all of the meaningful spikes in the underlying embeddings, leaving each one looking like the average embedding, indistinct in semantic space. I've found that in some cases, taking the max or median is a more appropriate way of pooling vectors like this (depending on how the encoder's output embeddings are distributed)... Any thoughts? My guess is that this isn't a problem worth worrying about, but thought I'd flag anyway.
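
(For concreteness, the pooling alternatives discussed here differ only in the reduction applied across the per-window vectors; a toy illustration, not code from this PR:)

```python
# Toy comparison of pooling strategies over per-window embeddings (illustrative only,
# not part of this PR). Five hypothetical window embeddings of dimension 768.
import numpy as np

window_embeddings = np.random.default_rng(0).normal(size=(5, 768))

mean_pooled = window_embeddings.mean(axis=0)            # smooths spikes across windows
max_pooled = window_embeddings.max(axis=0)              # keeps the strongest value per dimension
median_pooled = np.median(window_embeddings, axis=0)    # robust to a single outlier window
```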

@kdutia (Member, Author) commented Sep 17, 2024

@harrisonpim all good questions! 😁 The answer to all of them is that we don't know until we do the data quality metrics work. This PR also adds a much more efficient (I think) get_n_tokens method than we used to have in metrics, so hopefully we'll be able to inspect these longer passages using that.
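
(For context, a token-counting helper like this usually just wraps the encoder's tokenizer; the sketch below is hypothetical, assumes a Hugging Face tokenizer and model name, and is not the get_n_tokens added in this PR.)

```python
# Hypothetical sketch of a get_n_tokens helper (assumed Hugging Face tokenizer and
# model name; not the implementation added in this PR).
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("sentence-transformers/all-MiniLM-L6-v2")


def get_n_tokens(text: str) -> int:
    """Count the encoder tokens in a text, excluding special tokens."""
    return len(tokenizer(text, add_special_tokens=False)["input_ids"])
```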

I think (hope) then we might revisit the idea of having passages that are way longer than the context window anyway.

The max is interesting, but that seems to make more assumptions (maximum relevance of an arbitrary window in the text block?) than this does.

So, if you're happy let's merge this and continue. If not, happy to hold off and tackle metrics work first, but I think we'll probably want to rip apart some of these processes just after then anyway.

@harrisonpim

Yeah, very happy to let this work move ahead! It seems like an upgrade on the current state in a lot of ways, but I thought I'd flag the weird potential edge case effect while it was on my mind. One to address another day 😊

@kdutia kdutia merged commit f914500 into main Sep 17, 2024
2 checks passed
@harrisonpim harrisonpim deleted the feature/pods-1543-allow-embedding-model-to-capture-full-content-of-text-blocks branch September 17, 2024 11:20