
whisper : add grammar-based sampling #1229

Merged (11 commits) on Nov 13, 2023

Conversation

Contributor

@ejones ejones commented Aug 31, 2023

Ports grammar-based sampling from llama.cpp. Most of the code is simply copied over with s/llama/whisper/. Unlike llama.cpp, where sampling functions are part of the API, the grammar functionality here is wrapped up in whisper_full (the grammar state is attached to each whisper_decoder). More notably, the approach is more forgiving here: tokens not matching the grammar are scaled down rather than masked out entirely (grammar_penalty), special tokens are ignored, and parse failures simply (in theory) revert to unconstrained sampling.

To demonstrate the functionality, I've added grammars to command, e.g.:

./command -m models/... -t 8 --grammar 'root ::= " Ok Whisper, start listening for commands. " ("Red" | "Green" | "Blue") "."'

Probably needs more testing and refining but early results look promising! This demo shows constrained sampling on the left vs unconstrained on the right:

whisper-chess.mp4

Edit by @ggerganov:

More examples:

# another color recognizer
./command -m ./models/ggml-tiny.en.bin -t 8 --grammar ./grammars/colors.gbnf --prompt "red, green, blue," --context "green, red, blue,"

# recognize up to 3 consecutive chess moves
./command -m ./models/ggml-tiny.en.bin -t 8 --grammar ./grammars/chess.gbnf --prompt "rook to b4, f3," --context "d4 d5 knight to c3, pawn to a1, bishop to b2 king e8," --grammar-penalty 100

# voice assistant example
./command -m ./models/ggml-tiny.en.bin -t 8 --grammar ./grammars/assistant.gbnf --prompt "Ok Whisper, start listening for commands." --context "Whisper is a home assistant. It recognizes voice commands. Time is 11pm." --grammar-penalty 10
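The grammar files referenced above are not reproduced in this thread. Based on the inline grammar shown earlier and standard GBNF syntax, `grammars/colors.gbnf` plausibly looks something like the following (a sketch for illustration, not the actual file contents):

```
root  ::= " Ok Whisper, start listening for commands. " color "."
color ::= "Red" | "Green" | "Blue" | "Yellow"
```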

@ggerganov
Owner

Oh boy, this is super cool! Thank you for doing it - can't wait to play with it.
I think this will have so many useful applications

@FrankenApps

Sadly I am not exactly sure how to reproduce, but after some commands were recognized and I said something like "Thank you" instead of an actual command present in the grammar, I sometimes ran into this crash:

process_general_transcription: Speech detected! Processing ...
process_general_transcription: Command 'Green.', (t = 1728 ms)

process_general_transcription: Speech detected! Processing ...
libc++abi: terminating with uncaught exception of type std::out_of_range: basic_string
zsh: abort      ./command -m models/ggml-tiny.en.bin -t 8 --grammar 

I ran it with ./command -m models/ggml-tiny.en.bin -t 8 --grammar 'root ::= " Ok Whisper, start listening for commands. " ("Red" | "Green" | "Blue" | "Yellow") "."' and this is the hardware info:

whisper_init_from_file_no_state: loading model from 'models/ggml-tiny.en.bin'
whisper_model_load: loading model
whisper_model_load: n_vocab       = 51864
whisper_model_load: n_audio_ctx   = 1500
whisper_model_load: n_audio_state = 384
whisper_model_load: n_audio_head  = 6
whisper_model_load: n_audio_layer = 4
whisper_model_load: n_text_ctx    = 448
whisper_model_load: n_text_state  = 384
whisper_model_load: n_text_head   = 6
whisper_model_load: n_text_layer  = 4
whisper_model_load: n_mels        = 80
whisper_model_load: ftype         = 1
whisper_model_load: qntvr         = 0
whisper_model_load: type          = 1
whisper_model_load: mem required  =  201.00 MB (+    3.00 MB per decoder)
whisper_model_load: adding 1607 extra tokens
whisper_model_load: model ctx     =   73.62 MB
whisper_model_load: model size    =   73.54 MB
whisper_init_state: kv self size  =    2.62 MB
whisper_init_state: kv cross size =    8.79 MB

main: processing, 8 threads, lang = en, task = transcribe, timestamps = 0 ...

init: found 1 capture devices:
init:    - Capture device #0: 'MacBook Pro-Mikrofon'
init: attempt to open default capture device ...
init: obtained spec for input device (SDL Id = 2):
init:     - sample rate:       16000
init:     - format:            33056 (required: 33056)
init:     - channels:          1 (required: 1)
init:     - samples per frame: 1024
main: grammar:
root ::= [ ] [O] [k] [ ] [W] [h] [i] [s] [p] [e] [r] [,] [ ] [s] [t] [a] [r] [t] [ ] [l] [i] [s] [t] [e] [n] [i] [n] [g] [ ] [f] [o] [r] [ ] [c] [o] [m] [m] [a] [n] [d] [s] [.] [ ] root_1 [.] 
root_1 ::= [R] [e] [d] | [G] [r] [e] [e] [n] | [B] [l] [u] [e] | [Y] [e] [l] [l] [o] [w] 


process_general_transcription: general-purpose mode

@ejones
Contributor Author

ejones commented Sep 5, 2023

Thanks for reporting that - I believe I've seen this as well. Will look into it.

@ggerganov
Owner

I managed to reproduce the exception - here is a stack trace:

 lldb ./bin/command
(lldb) target create "./bin/command"
Current executable set to '/Users/ggerganov/development/github/whisper.cpp/build-rwdi/bin/command' (arm64).
(lldb) r -m ../models/ggml-base.en.bin -t 8 --grammar 'root ::= "Ok Whisper, start Listening for commands. " ("Red" | "Green" | "blue" | "Thank you") ' --grammar-penalty 1000.0
error: shell expansion failed (reason: lldb-argdumper exited with error 127). consider launching with 'process launch'.
(lldb) process l
Available completions:
	launch -- Launch the executable in the debugger.
	load   -- Load a shared library into the current process.
(lldb) process launch 
Process 6351 launched: '/Users/ggerganov/development/github/whisper.cpp/build-rwdi/bin/command' (arm64)
whisper_init_from_file_no_state: loading model from '../models/ggml-base.en.bin'
whisper_model_load: loading model
whisper_model_load: n_vocab       = 51864
whisper_model_load: n_audio_ctx   = 1500
whisper_model_load: n_audio_state = 512
whisper_model_load: n_audio_head  = 8
whisper_model_load: n_audio_layer = 6
whisper_model_load: n_text_ctx    = 448
whisper_model_load: n_text_state  = 512
whisper_model_load: n_text_head   = 8
whisper_model_load: n_text_layer  = 6
whisper_model_load: n_mels        = 80
whisper_model_load: ftype         = 1
whisper_model_load: qntvr         = 0
whisper_model_load: type          = 2
whisper_model_load: mem required  =  310.00 MB (+    6.00 MB per decoder)
whisper_model_load: adding 1607 extra tokens
whisper_model_load: model ctx     =  140.66 MB
whisper_model_load: model size    =  140.54 MB
whisper_init_state: kv self size  =    5.25 MB
whisper_init_state: kv cross size =   17.58 MB

main: processing, 8 threads, lang = en, task = transcribe, timestamps = 0 ...

2023-09-06 13:57:00.825138+0300 command[6351:79069] [plugin] AddInstanceForFactory: No factory registered for id <CFUUID 0x60000020c140> F8BB1C28-BAE8-11D6-9C31-00039315CD46
init: found 1 capture devices:
init:    - Capture device #0: 'Georgi’s iPhone Microphone'
init: attempt to open default capture device ...
init: obtained spec for input device (SDL Id = 2):
init:     - sample rate:       16000
init:     - format:            33056 (required: 33056)
init:     - channels:          1 (required: 1)
init:     - samples per frame: 1024
main: grammar:
root ::= [O] [k] [ ] [W] [h] [i] [s] [p] [e] [r] [,] [ ] [s] [t] [a] [r] [t] [ ] [L] [i] [s] [t] [e] [n] [i] [n] [g] [ ] [f] [o] [r] [ ] [c] [o] [m] [m] [a] [n] [d] [s] [.] [ ] root_1 
root_1 ::= [R] [e] [d] | [G] [r] [e] [e] [n] | [b] [l] [u] [e] | [T] [h] [a] [n] [k] [ ] [y] [o] [u] 


process_general_transcription: general-purpose mode

process_general_transcription: Say the following phrase: 'Ok Whisper, start listening for commands.'

process_general_transcription: Speech detected! Processing ...
process_general_transcription: Heard 'Ok Whisper', (t = 362 ms)
process_general_transcription: WARNING: prompt not recognized, try again

process_general_transcription: Say the following phrase: 'Ok Whisper, start listening for commands.'

process_general_transcription: Speech detected! Processing ...
process_general_transcription: Heard 'Ok Whisper, start Listening for commands', (t = 448 ms)

process_general_transcription: The prompt has been recognized!
process_general_transcription: Waiting for voice commands ...

process_general_transcription: Speech detected! Processing ...
libc++abi: terminating due to uncaught exception of type std::out_of_range: basic_string
Process 6351 stopped
* thread #1, queue = 'com.apple.main-thread', stop reason = signal SIGABRT
    frame #0: 0x0000000189984764 libsystem_kernel.dylib`__pthread_kill + 8
libsystem_kernel.dylib`:
->  0x189984764 <+8>:  b.lo   0x189984784               ; <+40>
    0x189984768 <+12>: pacibsp 
    0x18998476c <+16>: stp    x29, x30, [sp, #-0x10]!
    0x189984770 <+20>: mov    x29, sp
Target 0: (command) stopped.
(lldb) bt
* thread #1, queue = 'com.apple.main-thread', stop reason = signal SIGABRT
  * frame #0: 0x0000000189984764 libsystem_kernel.dylib`__pthread_kill + 8
    frame #1: 0x00000001899bbc28 libsystem_pthread.dylib`pthread_kill + 288
    frame #2: 0x00000001898c9ae8 libsystem_c.dylib`abort + 180
    frame #3: 0x0000000189974b84 libc++abi.dylib`abort_message + 132
    frame #4: 0x00000001899643b4 libc++abi.dylib`demangling_terminate_handler() + 320
    frame #5: 0x000000018963b03c libobjc.A.dylib`_objc_terminate() + 160
    frame #6: 0x0000000189973f48 libc++abi.dylib`std::__terminate(void (*)()) + 16
    frame #7: 0x0000000189976d34 libc++abi.dylib`__cxxabiv1::failed_throw(__cxxabiv1::__cxa_exception*) + 36
    frame #8: 0x0000000189976ce0 libc++abi.dylib`__cxa_throw + 140
    frame #9: 0x00000001898ef71c libc++.1.dylib`std::__1::__throw_out_of_range[abi:v15006](char const*) + 72
    frame #10: 0x00000001898eb680 libc++.1.dylib`std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> >::__throw_out_of_range[abi:v15006]() const + 24
    frame #11: 0x00000001898ec79c libc++.1.dylib`std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> >::basic_string(std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&, unsigned long, unsigned long, std::__1::allocator<char> const&) + 208
    frame #12: 0x0000000100008af0 command`process_general_transcription(whisper_context*, audio_async&, whisper_params const&) [inlined] std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> >::substr[abi:v15006](this="Ok W", __pos=32, __n=18446744073709551615) const at string:3573:12 [opt]
    frame #13: 0x0000000100008ad8 command`process_general_transcription(ctx=0x00000001003046a0, audio=0x000000016fdfede8, params=0x000000016fdfed28) at command.cpp:603:60 [opt]
    frame #14: 0x0000000100009654 command`main(argc=<unavailable>, argv=<unavailable>) at command.cpp:688:23 [opt]
    frame #15: 0x0000000189663f28 dyld`start + 2236
(lldb) 
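The trace points at a `substr` call at `command.cpp:603`, invoked with `__pos=32` on the four-character string `"Ok W"`, i.e. the heard text is shorter than the offset being cut at. A minimal sketch of the failure mode and a bounds-checked guard (the helper name is hypothetical, not the actual `command.cpp` code):

```cpp
#include <cassert>
#include <string>

// std::string::substr(pos) throws std::out_of_range when pos > size().
// Clamping the position avoids the crash when the recognized text
// ("Ok W" in the trace above) is shorter than the expected prompt.
std::string safe_suffix(const std::string & text, size_t pos) {
    return pos <= text.size() ? text.substr(pos) : std::string();
}
```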

whisper.cpp Outdated
Comment on lines 3898 to 3904
for (const auto & reject : rejects) {
    if (logits[reject.id] > 0) {
        logits[reject.id] /= params.grammar_penalty;
    } else {
        logits[reject.id] *= params.grammar_penalty;
    }
}
Owner

I'm currently experimenting with the following penalty and I think it works better:

Suggested change
for (const auto & reject : rejects) {
    if (logits[reject.id] > 0) {
        logits[reject.id] /= params.grammar_penalty;
    } else {
        logits[reject.id] *= params.grammar_penalty;
    }
}

for (const auto & reject : rejects) {
    logits[reject.id] -= params.grammar_penalty;
}

Not sure where this asymmetric scaling came from in the LLM world, but I think it's wrong.
Here is some more discussion on this topic: ggerganov/llama.cpp#2970
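The difference between the two penalties is easier to see in isolation: dividing positive logits and multiplying negative ones makes the penalty's strength depend on the sign of the raw logit (an arbitrary reference point), whereas subtracting a constant `p` scales every rejected token's post-softmax probability by the same factor `exp(-p)`. A sketch with standalone helpers (not the PR's code):

```cpp
#include <cassert>

// Asymmetric scaling: the effect depends on the sign of the logit,
// i.e. on where the arbitrary zero of the logit scale happens to fall.
float scale_penalty(float logit, float penalty) {
    return logit > 0 ? logit / penalty : logit * penalty;
}

// Subtractive penalty: shifting a logit by -penalty multiplies the
// token's post-softmax probability by exp(-penalty), sign-independent.
float shift_penalty(float logit, float penalty) {
    return logit - penalty;
}
```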

Contributor Author

Sounds good! Honestly I don't have a great understanding of the statistics to know what penalization function makes sense.

@ggerganov
Owner

ggerganov commented Sep 6, 2023

I'm still playing with this and so far have really good impressions. The API is perfect.

AFAICT this approach works on the letter level and not on the token level:

root ::= [O] [k] [ ] [W] [h] [i] [s] [p] [e] [r] [,] [ ] [s] [t] [a] [r] [t] [ ] [L] [i] [s] [t] [e] [n] [i] [n] [g] [ ] [f] [o] [r] [ ] [c] [o] [m] [m] [a] [n] [d] [s] [.] [ ] root_1 
root_1 ::= [R] [e] [d] | [G] [r] [e] [e] [n] | [b] [l] [u] [e] | [T] [h] [a] [n] [k] [ ] [y] [o] [u] 

Let's say for example that at the current moment, the grammar allows the letter p:

# decoded so far
Ok Whis

Which tokens are we going to penalize? Is it going to penalize per?
If yes, can we somehow improve it to not penalize it, since it fits a possible continuation of the grammar.

Edit: nvm, you actually did it the best way :)
I've added some notes here: ejones#1
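For a grammar that is just a fixed literal phrase, "doing it the best way" amounts to accepting any token whose text continues the expected characters from the current position, so a multi-character token like per after "Ok Whis" is not rejected. A simplified sketch of that check (a hypothetical helper, not the PR's implementation, which walks the full grammar rather than a single literal):

```cpp
#include <cassert>
#include <string>

// A multi-character token fits if it matches the target phrase
// character by character starting at the current decode position.
bool token_fits(const std::string & target, size_t pos, const std::string & token) {
    return target.compare(pos, token.size(), token) == 0;
}
```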

@ggerganov ggerganov mentioned this pull request Sep 6, 2023
@bobqianic
Collaborator

I just realized that even though Whisper is designed for audio transcription, it's fundamentally built on a transformer architecture. This makes prompts an incredibly useful tool; for instance, they can guide the model in correctly spelling specific nouns. So my question is, under what circumstances would grammar-based sampling be more effective compared to using prompts?
Whisper prompting guide

@ggerganov
Owner

So my question is, under what circumstances would grammar-based sampling be more effective compared to using prompts?

AFAIK, applying grammar constraints to the Whisper decoder is a new area that is yet to be studied.

This weekend I'll be looking into this and hopefully merging it.
I'm really excited about this feature! Been thinking about all kinds of cool applications during the past few days :)

I'm wondering if we should just merge the grammar parser straight into whisper.cpp to make it more integrated.
Does the proposed parser have any significant drawbacks that could potentially be improved by an alternative implementation?

@ejones
Contributor Author

ejones commented Sep 9, 2023

Does the proposed parser have any significant drawbacks that could potentially be improved by an alternative implementation?

No, really I was just hesitant to add all that extra code to whisper.cpp. But merging in the grammar parser was suggested in llama.cpp as well. I guess a related question is how to handle the common grammar code (which is most of it, modulo renames) between the two projects going forward.

@bobqianic
Collaborator

Does the proposed parser have any significant drawbacks that could potentially be improved by an alternative implementation?

I'm not too sure about it either. I haven't really looked into grammar-based sampling. We can talk about it after it's merged :)

@ggerganov
Owner

I guess a related question is how to handle the common grammar code (which is most of it, modulo renames) between the two projects going forward.

One approach is to move the grammar stuff (both impl + parsing) into ggml core library (would need to rewrite it in C).
I guess it is still a bit early for this step, but if grammar usage finds more applications, I think we should do it. For now it is not a problem to have the code duplicated.

I will now try to merge the parsing into whisper.cpp and will also look into merging most of the stuff from ejones#1 .

@tazz4843
Contributor

I'm looking for a way to just slightly nudge whisper towards these tokens, so that I can continue using it as a general-purpose transcription tool while simultaneously using it as a voice assistant. So far my major blocker to using this seems to be, as mentioned in ejones#1, the false-positive tokens. For my use case, a good workaround would be some way to let whisper abandon the grammar sampling earlier, perhaps through a configuration option on whisper_full_params.

@ejones
Contributor Author

ejones commented Sep 11, 2023

@ggerganov I tested this branch with your chess and assistant cases from ejones#1. I had a similar experience as you - tiny fairly consistently matches the grammar and invalid commands tend to produce an empty string (or .). Interestingly, though, the --prompt and --context values alone, without a --grammar, seem to significantly improve command matching. In the chess case I saw basically no difference in quality, and for the assistant I had only one or two commands that the ungrammared process missed compared to the grammared one. Makes me wonder if grammar sampling only improves performance above a certain size or complexity in the grammar.

@ggerganov
Owner

Interestingly, though, the --prompt and --context values alone, without a --grammar, seem to significantly improve command matching.

I didn't test this configuration - will do so. My guess is grammar will definitely help especially in situations where certain things sound similar. I imagine a use case where the grammar describes only the legal moves on the chess board at a given moment. In that case, it will help to disambiguate moves that sound similar but could be invalid (e.g. e1 vs d1 vs b1).
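That use case could be served by regenerating the grammar from the engine's legal-move list before each utterance, so only currently-legal squares are ever producible. A hypothetical sketch of building such a one-rule GBNF string (the helper and output format are assumptions for illustration, not part of the PR):

```cpp
#include <cassert>
#include <string>
#include <vector>

// Build a single-rule GBNF grammar that accepts exactly the legal moves
// at the current position, so similar-sounding but illegal squares
// (e.g. "e1" vs "b1") are disambiguated away.
std::string moves_to_gbnf(const std::vector<std::string> & moves) {
    std::string rule = "root ::= ";
    for (size_t i = 0; i < moves.size(); ++i) {
        if (i > 0) rule += " | ";
        rule += "\"" + moves[i] + "\"";
    }
    return rule;
}
```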

@ejones
Contributor Author

ejones commented Sep 12, 2023

Ah, good point.

@FrankenApps

Hi, what's the state of this? I would try helping out to get this in...

@bobqianic
Collaborator

Hi, what's the state of this? I would try helping out to get this in...

I'm not entirely certain either. Over the past two to three weeks, there has been a fascinating discussion regarding the detection of wake words in #1232. @isaac-mcfadyen contributed a truly intriguing perspective.

@ggerganov
Owner

Sorry for the delays - I've been travelling recently and now I'm catching up with lots of things. This PR is one of the top priorities. Hoping to find the time this week.

@ggerganov ggerganov merged commit 3e5c7fe into ggerganov:master Nov 13, 2023
38 of 40 checks passed
felrock pushed a commit to felrock/whisper.cpp that referenced this pull request Nov 18, 2023
* whisper : add grammar-based sampling

* build : fix after master merge

* command : fix exception when recognizing the command

* whisper : fine-tuning grammar functionality

* command : grammar-related improvements

- option to read grammar from file
- add sample grammars for colors and chess moves
- fine-tune the performance further

* grammars : add assistant + update comments

* command : enable beam-search, add "no_timestamps", add "context", add p

* whisper : remove comment

---------

Co-authored-by: Georgi Gerganov <[email protected]>
landtanin pushed a commit to landtanin/whisper.cpp that referenced this pull request Dec 16, 2023
iThalay pushed a commit to iThalay/whisper.cpp that referenced this pull request Sep 23, 2024