jirastorza

This pull request updates the sentence-splitting logic in src/raglite/_split_sentences.py by switching the SaT (Segment-any-Text) model loaded in _load_sat from "sat-3l-sm" to "sat-1l-sm".
The change reduces model size while maintaining comparable segmentation accuracy.
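For context, the change itself is just a swap of the checkpoint name passed to wtpsplit's SaT class. A minimal sketch of the idea, assuming _load_sat is a thin cached wrapper around SaT (the real function in _split_sentences.py may differ in detail):

```python
from functools import lru_cache

from wtpsplit import SaT


@lru_cache(maxsize=1)
def _load_sat() -> SaT:
    """Load and cache the Segment-any-Text sentence splitting model."""
    return SaT("sat-1l-sm")  # Previously "sat-3l-sm".


# Example: split a paragraph into sentences.
sentences = _load_sat().split("RAGLite first splits documents into sentences. Then it builds chunks.")
```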


Benchmark results on the CUAD dataset (see attached plot) show that "sat-1l-sm" achieves performance similar to the larger "sat-3l-sm" model under the default RAGLite benchmark setup, making it a more favorable trade-off between efficiency and accuracy.

| Model | English Score | Multilingual Score |
|---|---|---|
| sat-1l | 88.5 | 84.3 |
| sat-1l-sm | 88.2 | 87.9 |
| sat-3l-sm | 96.5 | 93.5 |

Source: https://github.com/segment-any-text/wtpsplit

We selected "sat-1l-sm" over "sat-1l" because it provides better multilingual performance with only a small trade-off in English accuracy.

jirastorza (Author) commented Oct 15, 2025

Somehow, changing the model has led to larger chunks, which apparently causes the context window size to be exceeded (this did not happen during benchmarking).
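One quick way to check this (a rough sketch, not part of the PR; it uses tiktoken's cl100k_base encoding as a stand-in for the actual model's tokenizer, and `chunks` is assumed to be the list of chunk strings produced by the pipeline):

```python
import tiktoken


def largest_chunk_tokens(chunks: list[str]) -> int:
    """Return the token count of the largest chunk, as a rough proxy for context pressure."""
    enc = tiktoken.get_encoding("cl100k_base")
    return max(len(enc.encode(chunk)) for chunk in chunks)


# Compare this value before and after the sat-3l-sm -> sat-1l-sm swap.
```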

lsorber (Member) commented Oct 19, 2025

My guess at why the tests are failing is that sat-1l-sm yields larger sentences and/or chunks, which end up taking up more context than we have room for in the tests. EDIT: Just saw your comment too @jirastorza!

emilradix (Contributor) commented Oct 20, 2025

So if I understand correctly, the test is failing because the test config does not specify the model's max_tokens?
Edit: actually, this should be inferred automatically here:
https://github.com/superlinear-ai/raglite/blob/main/src/raglite/_litellm.py#L329-L348
but that inference seems to be failing.

If this were detected correctly, the text would get clipped at max_tokens (as exercised in test_rag.py) here?
https://github.com/superlinear-ai/raglite/blob/main/src/raglite/_rag.py#L188

@lsorber @jirastorza @Robbe-Superlinear Is this correct?
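For clarity, this is roughly the kind of clipping I mean (an illustrative sketch only, not the code at the links above; the helper name and the tiktoken proxy tokenizer are assumptions):

```python
import litellm
import tiktoken


def clip_to_context(text: str, model: str, reserved_output_tokens: int = 1024) -> str:
    """Truncate `text` so that it fits within the model's context window."""
    # LiteLLM can report a model's context size, but it raises for unknown models,
    # which matches the failure mode discussed above.
    info = litellm.get_model_info(model=model)
    budget = info["max_input_tokens"] - reserved_output_tokens
    enc = tiktoken.get_encoding("cl100k_base")  # Proxy tokenizer for this sketch.
    tokens = enc.encode(text)
    return enc.decode(tokens[:budget]) if len(tokens) > budget else text
```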

If so, this raises a perhaps more important point: surely we don't want to simply allow the context length to exceed the model's max token size and then clip the input down to the context size. We need a smarter way to limit this.

E.g. if we are passing chunk spans, the actual matched chunk needs to be included, which as far as I can tell is not guaranteed (e.g. for chunks at the end of a section).
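To make that concrete, here is one possible direction (purely a sketch with hypothetical names, not how raglite currently builds its context): trim each span around its matched chunk instead of clipping the concatenated prompt at the end, so the matched chunk can never be dropped.

```python
from typing import Callable


def trim_span_around_chunk(
    span_chunks: list[str],
    matched_index: int,
    budget_tokens: int,
    count_tokens: Callable[[str], int],
) -> list[str]:
    """Always keep the matched chunk, then grow the window outwards while the budget allows."""
    lo = hi = matched_index
    used = count_tokens(span_chunks[matched_index])
    grew = True
    while grew:
        grew = False
        # Try to extend the window by one chunk to the left, then one to the right.
        if lo > 0 and used + count_tokens(span_chunks[lo - 1]) <= budget_tokens:
            lo -= 1
            used += count_tokens(span_chunks[lo])
            grew = True
        if hi < len(span_chunks) - 1 and used + count_tokens(span_chunks[hi + 1]) <= budget_tokens:
            hi += 1
            used += count_tokens(span_chunks[hi])
            grew = True
    return span_chunks[lo : hi + 1]
```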
