Use less problematic whitespace token #916

lapp0 · 2024-05-23T20:21:52Z

Problem

A major problem, especially with smaller language models, is degenerate repetition.

For example, suppose a model is generating JSON and must emit 12 space tokens of indentation. Often the model will assign a high probability to a 13th space token, then to a 14th, and so on, entering an infinite loop of space generation.

This problem has been known in NLG for half a decade, but it has only mitigations (mirostat, repetition penalty, using hundreds of billions of weights, etc.) and no absolute solution other than structured generation.
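To illustrate why an unbounded whitespace rule leaves room for this loop, here is a minimal sketch using only Python's `re` module. The two patterns and the helper `accepts_run_of_spaces` are hypothetical stand-ins for the whitespace part of a JSON regex, not the library's actual internals. It shows that a permissive pattern accepts any number of spaces, so the constraint never forces the model to stop, while a bounded pattern rejects the second space.

```python
import re

# Hypothetical, simplified stand-ins for the whitespace part of a JSON regex.
UNBOUNDED_WS = r"[ \n\t]*"  # any amount of whitespace is valid
BOUNDED_WS = r"[ ]?"        # at most one space is valid


def accepts_run_of_spaces(ws_pattern: str, n: int) -> bool:
    """Return True if a run of `n` spaces fully matches the whitespace pattern."""
    return re.fullmatch(ws_pattern, " " * n) is not None


if __name__ == "__main__":
    for n in (0, 1, 2, 13, 1000):
        print(
            n,
            accepts_run_of_spaces(UNBOUNDED_WS, n),
            accepts_run_of_spaces(BOUNDED_WS, n),
        )
```

Under the unbounded pattern, a 13th or 1000th space is always a legal continuation, so a model that keeps preferring spaces is never cut off by the constraint itself.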

Solution

For structured JSON generation, we set a sane default whitespace pattern of r"[ ]?". With this default, generated JSON contains no newlines or indentation: the only syntactic whitespace allowed is a single optional space as a separator.

Users can still pass the whitespace_pattern= argument if they want different behavior.
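The effect of the default and of an override can be sketched with a toy regex builder. `flat_int_object_regex` below is a hypothetical helper written for illustration, not the function this PR changes; it only assumes that the whitespace pattern is interleaved between JSON tokens and exposed through a `whitespace_pattern` argument.

```python
import json
import re


def flat_int_object_regex(keys, whitespace_pattern=r"[ ]?"):
    """Hypothetical helper: regex for a flat object {"k1": <int>, "k2": <int>, ...}.

    The whitespace pattern is inserted wherever JSON permits syntactic whitespace.
    """
    ws = whitespace_pattern
    pairs = [rf'{ws}"{re.escape(k)}"{ws}:{ws}-?\d+' for k in keys]
    return r"\{" + (ws + ",").join(pairs) + ws + r"\}"


if __name__ == "__main__":
    compact = json.dumps({"a": 1, "b": 2})           # single-line output
    pretty = json.dumps({"a": 1, "b": 2}, indent=4)  # newlines and indentation

    default_regex = flat_int_object_regex(["a", "b"])
    permissive_regex = flat_int_object_regex(["a", "b"], whitespace_pattern=r"\s*")

    print(bool(re.fullmatch(default_regex, compact)))
    print(bool(re.fullmatch(default_regex, pretty)))
    print(bool(re.fullmatch(permissive_regex, pretty)))
```

With the single-optional-space default, only the compact form is accepted; passing a more permissive pattern such as `\s*` re-admits indented output, at the cost of reopening the door to long whitespace runs.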


rlouf · 2024-05-24T06:12:44Z

Great, thank you!

timothylimyl · 2024-05-29T02:12:39Z

Thanks for making the PR!

lapp0 added 3 commits May 23, 2024 14:58

Use less problematic whitespace token

923f4a4

update whitespace_pattern docs

11a49e2

update test_json_schema.py to account for optional single-space default whitespace_pattern

1f42d6c

rlouf merged commit 411eaaf into dottxt-ai:main May 24, 2024
5 checks passed

This was referenced May 24, 2024

Are we able to structure JSON output into a single line with just one whitespace? #908

Closed

Length constraint causes infinite looping of generation #690

Closed

This was referenced Sep 16, 2024

WIP: Fix Various JSON-Schema Generation Bugs lapp0/outlines#88

Open

Fix Infinite Repetition in JSON Schemas Using Integer and String #1154

Draft

Prevent Infinite Repetition in JSON Schemas with number #1157

Open
