whisper : add grammar-based sampling #1229

ejones · 2023-08-31T00:37:19Z

Ports grammar-based sampling from llama.cpp. Most of the code is simply copied over with s/llama/whisper/. Unlike llama.cpp, where sampling functions are part of the API, the grammar functionality here is wrapped up in whisper_full (the grammar state is attached to each whisper_decoder). More notably, the approach is more forgiving here: tokens not matching the grammar are scaled down rather than masked out entirely (grammar_penalty), special tokens are ignored, and parse failures simply (in theory) revert to unconstrained sampling.

To demonstrate the functionality, I've added grammars to command, e.g.:

./command -m models/... -t 8 --grammar 'root ::= " Ok Whisper, start listening for commands. " ("Red" | "Green" | "Blue") "."'

Probably needs more testing and refining but early results look promising! This demo shows constrained sampling on the left vs unconstrained on the right:

whisper-chess.mp4

Edit by @ggerganov:

More examples:

# another color recognizer
./command -m ./models/ggml-tiny.en.bin -t 8 --grammar ./grammars/colors.gbnf --prompt "red, green, blue," --context "green, red, blue,"

# recognize up to 3 consecutive chess moves
./command -m ./models/ggml-tiny.en.bin -t 8 --grammar ./grammars/chess.gbnf --prompt "rook to b4, f3," --context "d4 d5 knight to c3, pawn to a1, bishop to b2 king e8," --grammar-penalty 100

# voice assistant example
./command -m ./models/ggml-tiny.en.bin -t 8 --grammar ./grammars/assistant.gbnf --prompt "Ok Whisper, start listening for commands." --context "Whisper is a home assistant. It recognizes voice commands. Time is 11pm." --grammar-penalty 10

ggerganov · 2023-08-31T13:12:21Z

Oh boy, this is super cool! Thank you for doing it - can't wait to play with it.
I think this will have so many useful applications

FrankenApps · 2023-09-03T15:03:36Z

Sadly I am not exactly sure how to reproduce, but after some commands were recognized and I said something like "Thank you" instead of an actual command present in the grammar, I sometimes ran into this crash:

process_general_transcription: Speech detected! Processing ...
process_general_transcription: Command 'Green.', (t = 1728 ms)

process_general_transcription: Speech detected! Processing ...
libc++abi: terminating with uncaught exception of type std::out_of_range: basic_string
zsh: abort      ./command -m models/ggml-tiny.en.bin -t 8 --grammar

I ran it with ./command -m models/ggml-tiny.en.bin -t 8 --grammar 'root ::= " Ok Whisper, start listening for commands. " ("Red" | "Green" | "Blue" | "Yellow") "."' and this is the hardware info:

whisper_init_from_file_no_state: loading model from 'models/ggml-tiny.en.bin'
whisper_model_load: loading model
whisper_model_load: n_vocab       = 51864
whisper_model_load: n_audio_ctx   = 1500
whisper_model_load: n_audio_state = 384
whisper_model_load: n_audio_head  = 6
whisper_model_load: n_audio_layer = 4
whisper_model_load: n_text_ctx    = 448
whisper_model_load: n_text_state  = 384
whisper_model_load: n_text_head   = 6
whisper_model_load: n_text_layer  = 4
whisper_model_load: n_mels        = 80
whisper_model_load: ftype         = 1
whisper_model_load: qntvr         = 0
whisper_model_load: type          = 1
whisper_model_load: mem required  =  201.00 MB (+    3.00 MB per decoder)
whisper_model_load: adding 1607 extra tokens
whisper_model_load: model ctx     =   73.62 MB
whisper_model_load: model size    =   73.54 MB
whisper_init_state: kv self size  =    2.62 MB
whisper_init_state: kv cross size =    8.79 MB

main: processing, 8 threads, lang = en, task = transcribe, timestamps = 0 ...

init: found 1 capture devices:
init:    - Capture device #0: 'MacBook Pro-Mikrofon'
init: attempt to open default capture device ...
init: obtained spec for input device (SDL Id = 2):
init:     - sample rate:       16000
init:     - format:            33056 (required: 33056)
init:     - channels:          1 (required: 1)
init:     - samples per frame: 1024
main: grammar:
root ::= [ ] [O] [k] [ ] [W] [h] [i] [s] [p] [e] [r] [,] [ ] [s] [t] [a] [r] [t] [ ] [l] [i] [s] [t] [e] [n] [i] [n] [g] [ ] [f] [o] [r] [ ] [c] [o] [m] [m] [a] [n] [d] [s] [.] [ ] root_1 [.] 
root_1 ::= [R] [e] [d] | [G] [r] [e] [e] [n] | [B] [l] [u] [e] | [Y] [e] [l] [l] [o] [w] 


process_general_transcription: general-purpose mode

ejones · 2023-09-05T03:17:02Z

Thanks for reporting that - I believe I've seen this as well. Will look into it.

ggerganov · 2023-09-06T10:59:24Z

I managed to reproduce the exception - here is a stack trace:

 lldb ./bin/command
(lldb) target create "./bin/command"
Current executable set to '/Users/ggerganov/development/github/whisper.cpp/build-rwdi/bin/command' (arm64).
(lldb) r -m ../models/ggml-base.en.bin -t 8 --grammar 'root ::= "Ok Whisper, start Listening for commands. " ("Red" | "Green" | "blue" | "Thank you") ' --grammar-penalty 1000.0
error: shell expansion failed (reason: lldb-argdumper exited with error 127). consider launching with 'process launch'.
(lldb) process l
Available completions:
	launch -- Launch the executable in the debugger.
	load   -- Load a shared library into the current process.
(lldb) process launch 
Process 6351 launched: '/Users/ggerganov/development/github/whisper.cpp/build-rwdi/bin/command' (arm64)
whisper_init_from_file_no_state: loading model from '../models/ggml-base.en.bin'
whisper_model_load: loading model
whisper_model_load: n_vocab       = 51864
whisper_model_load: n_audio_ctx   = 1500
whisper_model_load: n_audio_state = 512
whisper_model_load: n_audio_head  = 8
whisper_model_load: n_audio_layer = 6
whisper_model_load: n_text_ctx    = 448
whisper_model_load: n_text_state  = 512
whisper_model_load: n_text_head   = 8
whisper_model_load: n_text_layer  = 6
whisper_model_load: n_mels        = 80
whisper_model_load: ftype         = 1
whisper_model_load: qntvr         = 0
whisper_model_load: type          = 2
whisper_model_load: mem required  =  310.00 MB (+    6.00 MB per decoder)
whisper_model_load: adding 1607 extra tokens
whisper_model_load: model ctx     =  140.66 MB
whisper_model_load: model size    =  140.54 MB
whisper_init_state: kv self size  =    5.25 MB
whisper_init_state: kv cross size =   17.58 MB

main: processing, 8 threads, lang = en, task = transcribe, timestamps = 0 ...

2023-09-06 13:57:00.825138+0300 command[6351:79069] [plugin] AddInstanceForFactory: No factory registered for id <CFUUID 0x60000020c140> F8BB1C28-BAE8-11D6-9C31-00039315CD46
init: found 1 capture devices:
init:    - Capture device #0: 'Georgi’s iPhone Microphone'
init: attempt to open default capture device ...
init: obtained spec for input device (SDL Id = 2):
init:     - sample rate:       16000
init:     - format:            33056 (required: 33056)
init:     - channels:          1 (required: 1)
init:     - samples per frame: 1024
main: grammar:
root ::= [O] [k] [ ] [W] [h] [i] [s] [p] [e] [r] [,] [ ] [s] [t] [a] [r] [t] [ ] [L] [i] [s] [t] [e] [n] [i] [n] [g] [ ] [f] [o] [r] [ ] [c] [o] [m] [m] [a] [n] [d] [s] [.] [ ] root_1 
root_1 ::= [R] [e] [d] | [G] [r] [e] [e] [n] | [b] [l] [u] [e] | [T] [h] [a] [n] [k] [ ] [y] [o] [u] 


process_general_transcription: general-purpose mode

process_general_transcription: Say the following phrase: 'Ok Whisper, start listening for commands.'

process_general_transcription: Speech detected! Processing ...
process_general_transcription: Heard 'Ok Whisper', (t = 362 ms)
process_general_transcription: WARNING: prompt not recognized, try again

process_general_transcription: Say the following phrase: 'Ok Whisper, start listening for commands.'

process_general_transcription: Speech detected! Processing ...
process_general_transcription: Heard 'Ok Whisper, start Listening for commands', (t = 448 ms)

process_general_transcription: The prompt has been recognized!
process_general_transcription: Waiting for voice commands ...

process_general_transcription: Speech detected! Processing ...
libc++abi: terminating due to uncaught exception of type std::out_of_range: basic_string
Process 6351 stopped
* thread #1, queue = 'com.apple.main-thread', stop reason = signal SIGABRT
    frame #0: 0x0000000189984764 libsystem_kernel.dylib`__pthread_kill + 8
libsystem_kernel.dylib`:
->  0x189984764 <+8>:  b.lo   0x189984784               ; <+40>
    0x189984768 <+12>: pacibsp 
    0x18998476c <+16>: stp    x29, x30, [sp, #-0x10]!
    0x189984770 <+20>: mov    x29, sp
Target 0: (command) stopped.
(lldb) bt
* thread #1, queue = 'com.apple.main-thread', stop reason = signal SIGABRT
  * frame #0: 0x0000000189984764 libsystem_kernel.dylib`__pthread_kill + 8
    frame #1: 0x00000001899bbc28 libsystem_pthread.dylib`pthread_kill + 288
    frame #2: 0x00000001898c9ae8 libsystem_c.dylib`abort + 180
    frame #3: 0x0000000189974b84 libc++abi.dylib`abort_message + 132
    frame #4: 0x00000001899643b4 libc++abi.dylib`demangling_terminate_handler() + 320
    frame #5: 0x000000018963b03c libobjc.A.dylib`_objc_terminate() + 160
    frame #6: 0x0000000189973f48 libc++abi.dylib`std::__terminate(void (*)()) + 16
    frame #7: 0x0000000189976d34 libc++abi.dylib`__cxxabiv1::failed_throw(__cxxabiv1::__cxa_exception*) + 36
    frame #8: 0x0000000189976ce0 libc++abi.dylib`__cxa_throw + 140
    frame #9: 0x00000001898ef71c libc++.1.dylib`std::__1::__throw_out_of_range[abi:v15006](char const*) + 72
    frame #10: 0x00000001898eb680 libc++.1.dylib`std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> >::__throw_out_of_range[abi:v15006]() const + 24
    frame #11: 0x00000001898ec79c libc++.1.dylib`std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> >::basic_string(std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&, unsigned long, unsigned long, std::__1::allocator<char> const&) + 208
    frame #12: 0x0000000100008af0 command`process_general_transcription(whisper_context*, audio_async&, whisper_params const&) [inlined] std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> >::substr[abi:v15006](this="Ok W", __pos=32, __n=18446744073709551615) const at string:3573:12 [opt]
    frame #13: 0x0000000100008ad8 command`process_general_transcription(ctx=0x00000001003046a0, audio=0x000000016fdfede8, params=0x000000016fdfed28) at command.cpp:603:60 [opt]
    frame #14: 0x0000000100009654 command`main(argc=<unavailable>, argv=<unavailable>) at command.cpp:688:23 [opt]
    frame #15: 0x0000000189663f28 dyld`start + 2236
(lldb)

ggerganov · 2023-09-06T11:02:40Z

whisper.cpp

+    for (const auto & reject : rejects) {
+        if (logits[reject.id] > 0) {
+            logits[reject.id] /= params.grammar_penalty;
+        } else {
+            logits[reject.id] *= params.grammar_penalty;
+        }
+    }


I'm currently experimenting with the following penalty and I think it works better:

Suggested change

for (const auto & reject : rejects) {

if (logits[reject.id] > 0) {

logits[reject.id] /= params.grammar_penalty;

} else {

logits[reject.id] *= params.grammar_penalty;

}

}

for (const auto & reject : rejects) {

logits[reject.id] -= params.grammar_penalty;

}

Not sure where this asymmetric scaling came from in the LLM world, but I think it's wrong.
Here is some more discussion on this topic: ggerganov/llama.cpp#2970

Sounds good! Honestly I don't have a great understanding of the statistics to know what penalization function makes sense.

ggerganov · 2023-09-06T11:16:24Z

I'm still playing with this and so far have really good impressions. The API is perfect.

AFAICT this approach works on the letter level and not on the token level:

root ::= [O] [k] [ ] [W] [h] [i] [s] [p] [e] [r] [,] [ ] [s] [t] [a] [r] [t] [ ] [L] [i] [s] [t] [e] [n] [i] [n] [g] [ ] [f] [o] [r] [ ] [c] [o] [m] [m] [a] [n] [d] [s] [.] [ ] root_1 
root_1 ::= [R] [e] [d] | [G] [r] [e] [e] [n] | [b] [l] [u] [e] | [T] [h] [a] [n] [k] [ ] [y] [o] [u]

Let's say for example that at the current moment, the grammar allows the letter p:

# decoded so far
Ok Whis

Which tokens are we going to penalize? Is it going to penalize per?
If yes, can we somehow improve it to not penalize it since it fits the possible continuation by the grammar.

Edit: nvm, you have actually did it the best way :)
I've added some notes here: ejones#1

bobqianic · 2023-09-06T14:07:58Z

I just realized that even though Whisper is designed for audio transcription, it's fundamentally built on a transformer architecture. This makes prompts an incredibly useful tool; for instance, they can guide the model in correctly spelling specific nouns. So my question is, under what circumstances would grammar-based sampling be more effective compared to using prompts?
Whisper prompting guide

ggerganov · 2023-09-08T19:23:37Z

So my question is, under what circumstances would grammar-based sampling be more effective compared to using prompts?

AFAIK, applying grammar constraints to Whisper decoder is a new area yet to be studied

This weekend I'll be looking into this and hopefully merging it.
I'm really excited about this feature! Been thinking about all kinds of cool applications during the past few days :)

Thinking if we should just merge the grammar parser straight into whisper.cpp to make it more integrated.
Does the proposed parser have any significant drawbacks that could potentially be improved by an alternative implementation?

ejones · 2023-09-09T00:31:27Z

Does the proposed parser have any significant drawbacks that could potentially be improved by an alternative implementation?

No, really I was just hesitant to add all that extra code to whisper.cpp. But merging in the grammar parser was suggested in llama.cpp as well. I guess a related question is how to handle the common grammar code (which is most of it, modulo renames) between the two projects going forward.

bobqianic · 2023-09-09T01:48:55Z

Does the proposed parser have any significant drawbacks that could potentially be improved by an alternative implementation?

I'm not too sure about it either. I haven't really looked into grammar-based sampling. We can talk about it after it's merged :)

ggerganov · 2023-09-09T10:42:01Z

I guess a related question is how to handle the common grammar code (which is most of it, modulo renames) between the two projects going forward.

One approach is to move the grammar stuff (both impl + parsing) into ggml core library (would need to rewrite it in C).
I guess it is still a bit early for this step, but if grammar usage finds more applications, I think we should do it. For now it is not a problem to have the code duplicated.

I will now try to merge the parsing into whisper.cpp and will also look into merging most of the stuff from ejones#1 .

- option to read grammar from file - add sample grammars for colors and chess moves - fine-tune the performance further

tazz4843 · 2023-09-10T02:19:49Z

I'm looking for a way to just slightly nudge whisper towards these tokens, that way I can continue using it as a general purpose transcription tool, while simultaneously using it as a voice assistant. So far my major blocker to using this seems to be as mentioned in ejones#1, the false positive tokens. For my use case, a good workaround would be some way to let whisper abandon the grammar sampling earlier, perhaps through a configuration option on whisper_full_params.

whisper : fine-tuning grammar functionality

ejones · 2023-09-11T02:10:17Z

@ggerganov I tested this branch with your chess and assistant cases from ejones#1. I had a similar experience as you - tiny fairly consistently matches the grammar and invalid commands tend to produce an empty string (or .). Interestingly, though, the --prompt and --context values alone, without a --grammar, seem to significantly improve command matching. In the chess case I saw basically no difference in quality, and for the assistant I had only one or two commands that the ungrammared process missed compared to the grammared one. Makes me wonder if grammar sampling only improves performance above a certain size or complexity in the grammar.

ggerganov · 2023-09-11T08:50:43Z

Interestingly, though, the --prompt and --context values alone, without a --grammar, seem to significantly improve command matching.

I didn't test this configuration - will do so. My guess is grammar will definitely help especially in situations where certain things sound similar. I imagine a use case where the grammar describes only the legal moves on the chess board at a given moment. In that case, it will help to disambiguate moves that sound similar but could be invalid (e.g. e1 vs d1 vs b1).

ejones · 2023-09-12T02:38:55Z

Ah, good point.

FrankenApps · 2023-10-01T06:36:54Z

Hi, whats the state on this. Would try helping out to get this in...

bobqianic · 2023-10-01T07:07:26Z

Hi, whats the state on this. Would try helping out to get this in...

I'm not entirely certain either. Over the past two to three weeks, there has been a fascinating discussion regarding the detection of wake words in #1232. @isaac-mcfadyen contributed a truly intriguing perspective.

ggerganov · 2023-10-03T19:50:10Z

Sorry for the delays - I've been travelling recently and now I'm catching up with lots of things. This PR is one of top priorities. Hoping to find the time this week

* whisper : add grammar-based sampling * build : fix after master merge * command : fix exception when recognizing the command * whisper : fine-tuning grammar functionality * command : grammar-related improvements - option to read grammar from file - add sample grammars for colors and chess moves - fine-tune the performance further * grammars : add assistant + update comments * command : enable beam-search, add "no_timestamps", add "context", add p * whisper : remove comment --------- Co-authored-by: Georgi Gerganov <[email protected]>

whisper : add grammar-based sampling

31476cc

ejones mentioned this pull request Aug 31, 2023

llama : add grammar-based sampling ggerganov/llama.cpp#1773

Merged

Merge branch 'master' into HEAD

afc84b3

ggerganov reviewed Sep 6, 2023

View reviewed changes

ggerganov added 3 commits September 6, 2023 14:28

build : fix after master merge

b0306cd

command : fix exception when recognizing the command

97ebb48

whisper : fine-tuning grammar functionality

b8f34d1

ggerganov mentioned this pull request Sep 6, 2023

tmp : grammar tunings #1253

Closed

ggerganov mentioned this pull request Sep 6, 2023

whisper : fine-tuning grammar functionality ejones/whisper.cpp#1

Merged

ggerganov added 2 commits September 9, 2023 20:05

command : grammar-related improvements

54d168d

- option to read grammar from file - add sample grammars for colors and chess moves - fine-tune the performance further

grammars : add assistant + update comments

7a2abb3

ggerganov and others added 3 commits September 10, 2023 12:22

command : enable beam-search, add "no_timestamps", add "context", add p

37de5dc

whisper : remove comment

3c50be2

Merge pull request #1 from ggerganov/grammar-debug

de7021f

whisper : fine-tuning grammar functionality

ggerganov mentioned this pull request Sep 12, 2023

Next version release/tag #1233

Open

ggerganov mentioned this pull request Nov 3, 2023

sync : ggml (backend v2, k-quants, CUDA opts, Metal opts, etc.) #1422

Merged

Merge branch 'master' into HEAD

1bde022

ggerganov approved these changes Nov 13, 2023

View reviewed changes

ggerganov merged commit 3e5c7fe into ggerganov:master Nov 13, 2023
38 of 40 checks passed

ggerganov mentioned this pull request Dec 29, 2023

Option to suppress tokens #1697

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

whisper : add grammar-based sampling #1229

whisper : add grammar-based sampling #1229

ejones commented Aug 31, 2023 •

edited by ggerganov

Loading

ggerganov commented Aug 31, 2023

FrankenApps commented Sep 3, 2023

ejones commented Sep 5, 2023

ggerganov commented Sep 6, 2023

ggerganov Sep 6, 2023

ejones Sep 7, 2023

ggerganov commented Sep 6, 2023 •

edited

Loading

bobqianic commented Sep 6, 2023

ggerganov commented Sep 8, 2023

ejones commented Sep 9, 2023

bobqianic commented Sep 9, 2023

ggerganov commented Sep 9, 2023

tazz4843 commented Sep 10, 2023

ejones commented Sep 11, 2023

ggerganov commented Sep 11, 2023

ejones commented Sep 12, 2023

FrankenApps commented Oct 1, 2023

bobqianic commented Oct 1, 2023

ggerganov commented Oct 3, 2023

whisper : add grammar-based sampling #1229

whisper : add grammar-based sampling #1229

Conversation

ejones commented Aug 31, 2023 • edited by ggerganov Loading

ggerganov commented Aug 31, 2023

FrankenApps commented Sep 3, 2023

ejones commented Sep 5, 2023

ggerganov commented Sep 6, 2023

ggerganov Sep 6, 2023

Choose a reason for hiding this comment

ejones Sep 7, 2023

Choose a reason for hiding this comment

ggerganov commented Sep 6, 2023 • edited Loading

bobqianic commented Sep 6, 2023

ggerganov commented Sep 8, 2023

ejones commented Sep 9, 2023

bobqianic commented Sep 9, 2023

ggerganov commented Sep 9, 2023

tazz4843 commented Sep 10, 2023

ejones commented Sep 11, 2023

ggerganov commented Sep 11, 2023

ejones commented Sep 12, 2023

FrankenApps commented Oct 1, 2023

bobqianic commented Oct 1, 2023

ggerganov commented Oct 3, 2023

ejones commented Aug 31, 2023 •

edited by ggerganov

Loading

ggerganov commented Sep 6, 2023 •

edited

Loading