jirastorza

This pull request updates the sentence-splitting logic in src/raglite/_split_sentences.py by switching the SaT (Segment-any-Text) model loaded in _load_sat from "sat-3l-sm" to "sat-1l-sm".
The change reduces model size while maintaining comparable segmentation accuracy.
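For context, the change itself is just a swap of the checkpoint name passed to wtpsplit's SaT class. A minimal sketch of the idea, assuming _load_sat is a thin cached wrapper around SaT (the real function in _split_sentences.py may differ in detail):

```python
from functools import lru_cache

from wtpsplit import SaT


@lru_cache(maxsize=1)
def _load_sat() -> SaT:
    """Load and cache the Segment-any-Text sentence splitting model."""
    return SaT("sat-1l-sm")  # Previously "sat-3l-sm".


# Example: split a paragraph into sentences.
sentences = _load_sat().split("RAGLite first splits documents into sentences. Then it builds chunks.")
```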


Benchmark results on the CUAD dataset (see attached plot) show that "sat-1l-sm" achieves performance similar to the larger "sat-3l-sm" model under the default RAGLite benchmark setup, making it a more favorable trade-off between efficiency and accuracy.

| Model | English Score | Multilingual Score |
|---|---|---|
| sat-1l | 88.5 | 84.3 |
| sat-1l-sm | 88.2 | 87.9 |
| sat-3l-sm | 96.5 | 93.5 |

Source: https://github.com/segment-any-text/wtpsplit

We selected "sat-1l-sm" over "sat-1l" because it provides better multilingual performance with only a small trade-off in English accuracy.

jirastorza (Author) commented Oct 15, 2025

Somehow, changing the model has led to larger chunks, which apparently causes the context window size to be exceeded (this did not happen during benchmarking).
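One quick way to check this (a rough sketch, not part of the PR; it uses tiktoken's cl100k_base encoding as a stand-in for the actual model's tokenizer, and `chunks` is assumed to be the list of chunk strings produced by the pipeline):

```python
import tiktoken


def largest_chunk_tokens(chunks: list[str]) -> int:
    """Return the token count of the largest chunk, as a rough proxy for context pressure."""
    enc = tiktoken.get_encoding("cl100k_base")
    return max(len(enc.encode(chunk)) for chunk in chunks)


# Compare this value before and after the sat-3l-sm -> sat-1l-sm swap.
```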

lsorber (Member) commented Oct 19, 2025

My guess at why the tests are failing is that sat-1l-sm yields larger sentences and/or chunks, which end up taking up more context than we have room for in the tests. EDIT: Just saw your comment too @jirastorza!

emilradix (Contributor) commented Oct 20, 2025

So if I understand correctly, the test is failing because the test config does not specify the model's max_tokens?
Edit: actually, this should be inferred automatically here:
https://github.com/superlinear-ai/raglite/blob/main/src/raglite/_litellm.py#L329-L348
but that inference seems to be failing.

If this were detected correctly, the text would get clipped at max_tokens (as exercised in test_rag.py) here?
https://github.com/superlinear-ai/raglite/blob/main/src/raglite/_rag.py#L188

@lsorber @jirastorza @Robbe-Superlinear Is this correct?
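For clarity, this is roughly the kind of clipping I mean (an illustrative sketch only, not the code at the links above; the helper name and the tiktoken proxy tokenizer are assumptions):

```python
import litellm
import tiktoken


def clip_to_context(text: str, model: str, reserved_output_tokens: int = 1024) -> str:
    """Truncate `text` so that it fits within the model's context window."""
    # LiteLLM can report a model's context size, but it raises for unknown models,
    # which matches the failure mode discussed above.
    info = litellm.get_model_info(model=model)
    budget = info["max_input_tokens"] - reserved_output_tokens
    enc = tiktoken.get_encoding("cl100k_base")  # Proxy tokenizer for this sketch.
    tokens = enc.encode(text)
    return enc.decode(tokens[:budget]) if len(tokens) > budget else text
```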

If so, this raises a perhaps more important point: surely we don't want to simply allow the context length to exceed the model's max token size and then clip the input down to the context size. We need a smarter way to limit this.

E.g. if we are passing chunk spans, the actual matched chunk needs to be included, which as far as I can tell is not guaranteed (e.g. for chunks at the end of a section).
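To make that concrete, here is one possible direction (purely a sketch with hypothetical names, not how raglite currently builds its context): trim each span around its matched chunk instead of clipping the concatenated prompt at the end, so the matched chunk can never be dropped.

```python
from typing import Callable


def trim_span_around_chunk(
    span_chunks: list[str],
    matched_index: int,
    budget_tokens: int,
    count_tokens: Callable[[str], int],
) -> list[str]:
    """Always keep the matched chunk, then grow the window outwards while the budget allows."""
    lo = hi = matched_index
    used = count_tokens(span_chunks[matched_index])
    grew = True
    while grew:
        grew = False
        # Try to extend the window by one chunk to the left, then one to the right.
        if lo > 0 and used + count_tokens(span_chunks[lo - 1]) <= budget_tokens:
            lo -= 1
            used += count_tokens(span_chunks[lo])
            grew = True
        if hi < len(span_chunks) - 1 and used + count_tokens(span_chunks[hi + 1]) <= budget_tokens:
            hi += 1
            used += count_tokens(span_chunks[hi])
            grew = True
    return span_chunks[lo : hi + 1]
```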
