Use less problematic whitespace token #916

Merged (3 commits) on May 24, 2024

@lapp0 (Contributor) commented May 23, 2024

Fixes #839 #908 #690 #450

Problem

A major problem, especially with smaller language models, is degenerate repetition: the model keeps emitting the same token over and over.

For example, say a model is generating JSON and must emit 12 space tokens for indentation. A language model will often assign a high probability to a 13th space token, then to a 14th, and so enter an endless loop of space generation.

This problem in NLG has been known for half a decade, but so far it has only mitigations (mirostat, repetition penalty, using hundreds of billions of weights, etc.) and no absolute solution other than structured generation.
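The loop described above can be imitated with a toy decoder. This is only a sketch under strong assumptions: `greedy_whitespace_slot` is a hypothetical function, and the "model" is reduced to a rule that always picks a space whenever the constraint allows one. It shows why an unbounded whitespace rule never forces the generation to move on, while a bounded rule does.

```python
def greedy_whitespace_slot(max_spaces, budget=50):
    """Toy greedy decoder for one whitespace slot followed by a quote.

    max_spaces=None models an unbounded whitespace rule;
    an integer models a rule that allows at most that many spaces.
    The "model" here is an assumption: it always prefers a space.
    """
    out = ""
    while len(out) < budget:
        space_allowed = max_spaces is None or len(out) < max_spaces
        if space_allowed:
            out += " "      # the toy model picks a space whenever it can
        else:
            out += '"'      # the constraint forces the next JSON token
            return out
    return out              # token budget exhausted without progress


unbounded = greedy_whitespace_slot(None)  # 50 spaces, never reaches the quote
bounded = greedy_whitespace_slot(1)       # one space, then the quote
print(len(unbounded), repr(bounded))
```

With the unbounded rule the decoder spends its whole budget on spaces; with a limit of one space it is forced on to the next token.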

Solution

For structured JSON generation, we set a sane default whitespace pattern of r"[ ]?". This removes all newlines and indentation, and disallows any syntactic whitespace beyond a single space separator.

Users can still set the argument whitespace_pattern= if they want different behavior.
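To make the effect of the pattern concrete, here is a minimal sketch using only Python's `re` module. The helper `object_regex` is hypothetical and is not the library's implementation; it builds a regex for a single-key object and inserts the whitespace pattern at each place where JSON allows whitespace. The permissive pattern `[\n ]*` is used purely for comparison, as an assumed example of what a caller might pass via `whitespace_pattern=`.

```python
import re


def object_regex(whitespace_pattern: str = r"[ ]?") -> str:
    """Regex for {"name": "<string>"}; hypothetical helper for illustration."""
    ws = whitespace_pattern
    string = r'"[^"\\]*"'  # simplified JSON string without escapes
    return r"\{" + ws + r'"name"' + ws + ":" + ws + string + ws + r"\}"


compact = '{"name": "Ada"}'
indented = '{\n    "name": "Ada"\n}'
runaway = "{" + " " * 500 + '"name": "Ada"}'

strict = re.compile(object_regex())               # default r"[ ]?"
permissive = re.compile(object_regex(r"[\n ]*"))  # assumed custom override

for text in (compact, indented, runaway):
    print(bool(strict.fullmatch(text)), bool(permissive.fullmatch(text)))
```

Under the strict pattern only the compact form matches, so a run of 500 spaces is simply not a valid continuation. The permissive pattern accepts all three strings, including the runaway one.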

@rlouf rlouf merged commit 411eaaf into dottxt-ai:main May 24, 2024
5 checks passed
@rlouf (Member) commented May 24, 2024

Great, thank you!

@timothylimyl

Thanks for making the PR!

Development

Successfully merging this pull request may close these issues.

Provide more restrictive defaults for white space patterns in JSON
3 participants