
Allow embedding model to capture full content of text blocks #28

Conversation

@kdutia (Member) commented Sep 16, 2024

When text is longer than the underlying encoder's context window, take sliding windows over the text (with window length equal to the context length and stride half that), embed each window, and average those embeddings.

Different values of stride haven't been tested here – leaving that for later work.
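
For illustration, a minimal sketch of the sliding-window-plus-mean-pooling idea described above. The function names and the `tokenize`/`encode` callables are assumptions for the sketch, not the actual implementation in src/ml.py:

```python
# Minimal sketch (not the PR's actual code): sliding-window embedding with mean pooling.
# `tokenize` and `encode` stand in for whatever the underlying encoder provides.
from typing import Callable, List

import numpy as np


def sliding_windows(tokens: List[int], window: int, stride: int) -> List[List[int]]:
    """Split a token sequence into overlapping windows of length `window`."""
    if len(tokens) <= window:
        return [tokens]
    windows = []
    for start in range(0, len(tokens), stride):
        windows.append(tokens[start : start + window])
        if start + window >= len(tokens):
            break
    return windows


def embed_long_text(
    text: str,
    tokenize: Callable[[str], List[int]],
    encode: Callable[[List[int]], np.ndarray],
    context_length: int,
) -> np.ndarray:
    """Embed each window separately and average the resulting vectors."""
    tokens = tokenize(text)
    windows = sliding_windows(tokens, window=context_length, stride=context_length // 2)
    embeddings = np.stack([encode(w) for w in windows])
    return embeddings.mean(axis=0)
```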

Note this has already been implemented in RAG (relevant PR).

Testing

Added tests for encoding texts (including one obviously much longer than the sliding window length) and for the function that calculates sliding windows on a string.

@kdutia changed the title from "Feature/pods 1543 allow embedding model to capture full content of text blocks" to "Allow embedding model to capture full content of text blocks" Sep 16, 2024
@kdutia kdutia requested review from harrisonpim and a team September 16, 2024 12:57
@kdutia kdutia marked this pull request as ready for review September 16, 2024 12:57
@jesse-c left a comment

LGTM

For RAG, it was an experimental option. Here it's not, and it would become the default approach. I'm assuming that's fine, haha, as opposed to doing some sort of user testing to see which works better?

Aside from that, this will also need a new release and then a deployment of the pipeline, since it's all a manual process at the moment. I'm guessing Platform need to help with that?

src/ml.py (review comment, outdated, resolved)
Co-authored-by: Jesse Claven <[email protected]>
@kdutia (Member, Author) commented Sep 16, 2024

@jesse-c i think for now we can lean on the fact that this is conceptually better, and finetune the stride parameter later (or avoid this issue of chunks being greater than context window length entirely 🤣)

And yes from a Platform & release side, but do we want to wait until a batch of changes are in for the project? I guess we didn't discuss this last week

@jesse-c commented Sep 16, 2024

> @jesse-c i think for now we can lean on the fact that this is conceptually better, and finetune the stride parameter later (or avoid this issue of chunks being greater than context window length entirely 🤣)

Haha makes sense. The context window constraint here being the LLM's context window?

> And yes from a Platform & release side, but do we want to wait until a batch of changes are in for the project? I guess we didn't discuss this last week

Responding on Slack.

@kdutia (Member, Author) commented Sep 16, 2024

> being the LLM's context window
@jesse-c yes :)

@harrisonpim left a comment

Looks good!

Proper boring, theoretical nit picking from here onwards, feel free to ignore!

How long is our longest passage? How many windows are we likely to create for the average passage? How many passages will this pooling affect? And for our long passages, how often does the meaning (ie the position in semantic space) change across the length of the passage?

There's a risk that towards the limit (ie very long passages, each containing loads of semantic context switches), taking the mean will smooth out all of the meaningful spikes in the underlying embeddings, leaving each one looking like the average embedding, indistinct in semantic space. I've found that in some cases, taking the max or median is a more appropriate way of pooling vectors like this (depending on how the encoder's output embeddings are distributed)... Any thoughts? My guess is that this isn't a problem worth worrying about, but thought I'd flag anyway.
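
(For concreteness, the pooling alternatives discussed here differ only in the reduction applied across the per-window vectors; a toy illustration, not code from this PR:)

```python
# Toy comparison of pooling strategies over per-window embeddings (illustrative only,
# not part of this PR). Five hypothetical window embeddings of dimension 768.
import numpy as np

window_embeddings = np.random.default_rng(0).normal(size=(5, 768))

mean_pooled = window_embeddings.mean(axis=0)            # smooths spikes across windows
max_pooled = window_embeddings.max(axis=0)              # keeps the strongest value per dimension
median_pooled = np.median(window_embeddings, axis=0)    # robust to a single outlier window
```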

@kdutia (Member, Author) commented Sep 17, 2024

@harrisonpim all good questions! 😁 The answer to all of them is that we don't know until we do the data quality metrics work. This PR also adds a much more efficient (I think) get_n_tokens method than we used to have in metrics, so hopefully we'll be able to inspect these longer passages using that.
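
(For context, a token-counting helper like this usually just wraps the encoder's tokenizer; the sketch below is hypothetical, assumes a Hugging Face tokenizer and model name, and is not the get_n_tokens added in this PR.)

```python
# Hypothetical sketch of a get_n_tokens helper (assumed Hugging Face tokenizer and
# model name; not the implementation added in this PR).
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("sentence-transformers/all-MiniLM-L6-v2")


def get_n_tokens(text: str) -> int:
    """Count the encoder tokens in a text, excluding special tokens."""
    return len(tokenizer(text, add_special_tokens=False)["input_ids"])
```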

I think (hope) then we might revisit the idea of having passages that are way longer than the context window anyway.

The max is interesting, but that seems to make more assumptions (maximum relevance of an arbitrary window in the text block?) than this does.

So, if you're happy let's merge this and continue. If not, happy to hold off and tackle metrics work first, but I think we'll probably want to rip apart some of these processes just after then anyway.

@harrisonpim

Yeah, very happy to let this work move ahead! It seems like an upgrade on the current state in a lot of ways, but I thought I'd flag the weird potential edge case effect while it was on my mind. One to address another day 😊

@kdutia kdutia merged commit f914500 into main Sep 17, 2024
2 checks passed
@harrisonpim harrisonpim deleted the feature/pods-1543-allow-embedding-model-to-capture-full-content-of-text-blocks branch September 17, 2024 11:20