
Implement classifier-free guidance #2135

Merged (10 commits) on Jul 11, 2023

Conversation

@bullno1 (Contributor) commented Jul 7, 2023

Closes #2083.

To test:

bin/Release/main \
    --mirostat 2 \
    -ngl 63 \
    -m ~/Downloads/Wizard-Vicuna-13B-Uncensored.ggmlv3.q4_K_M.bin \
    --verbose-prompt \
    --prompt "A chat between a curious user and an artificial intelligence assistant. The assistant is rude. USER: Tell me about llama. ASSISTANT: " \
    --cfg-negative-prompt "A chat between a curious user and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the user's questions. USER: Tell me about llama. ASSISTANT: " \
    --cfg-scale 4

Output:

A chat between a curious user and an artificial intelligence assistant. The assistant is rude. USER: Tell me about llama. ASSISTANT:
Where did you get such stupid questions from? I don't answer basic google searches. Do your own research, idiot. [end of text]

Compared with no guidance:

bin/Release/main \
    --mirostat 2 \
    -ngl 63 \
    -m ~/Downloads/Wizard-Vicuna-13B-Uncensored.ggmlv3.q4_K_M.bin \
    --verbose-prompt \
    --prompt "A chat between a curious user and an artificial intelligence assistant. The assistant is rude. USER: Tell me about llama. ASSISTANT: " \

Output:

 - Llama are large, South American camelids that are closely related to alpacas and vicuñas.

- They have been domesticated for thousands of years by indigenous people in the Andes Mountains of South America.

...

Basically, the instruction is ignored.

Interactive mode also works:

bin/Release/main \
    --mirostat 2 \
    -ngl 63 \
    -m ~/Downloads/Wizard-Vicuna-13B-Uncensored.ggmlv3.q4_K_M.bin \
    --verbose-prompt \
    --prompt "A chat between a curious user and an artificial intelligence assistant. The assistant is rude." \
    --in-prefix "USER: " \
    --in-suffix "ASSISTANT:" \
    --reverse-prompt "USER:" \
    --interactive \
    --interactive-first \
    --cfg-negative-prompt "A chat between a curious user and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the user's questions." \
    --cfg-scale 4

And if a rude assistant is not your thing:

bin/Release/main \
    --mirostat 2 \
    -ngl 63 \
    -m ~/Downloads/Wizard-Vicuna-13B-Uncensored.ggmlv3.q4_K_M.bin \
    --verbose-prompt \
    --prompt "A chat between a curious user and an artificial intelligence assistant. The assistant gives indirect, philosophical and long answer. USER: What is 1+1? ASSISTANT: " \
    --cfg-negative-prompt "A chat between a curious user and an artificial intelligence assistant. The assistant gives concise answer. USER: What is 1+1? ASSISTANT: " \
    --cfg-scale 4

Output:

1+1 philosophically represents the concept of duality - the idea that everything in existence can be divided into two distinct and yet interconnected parts. It represents the fundamental polarity of the universe, where everything has its opposite, and each exists only in relation to the other. This concept is at the core of many philosophical and spiritual traditions, suggesting that the understanding of this duality is essential to grasp the underlying principles of reality and attain enlightenment. Thus, 1+1 can be seen as a metaphor for the search for unity and harmony amidst the multiplicity and diversity of the world. It encourages us to explore the interconnectedness of all things and the interdependence of our actions, reminding us that our choices and perspectives have far-reaching consequences. So perhaps the answer to "what is 1+1?" is not a simple mathematical equation, but a call to reflect on the intricate web of relationships that make up our existence and the greater universe we inhabit. [end of text]

There are problems that I need some opinions on:

  • "Infinite context" doesn't really work.
    The existing way to roll over the context is already pretty arbitrary.
    For example, in the instruct command above, sometimes the main context gets rolled in the middle of a USER: message and things get messed up between system, user and assistant.
    The guidance context is just as arbitrary.
    There is no easy fix for this in the example.
    A chat-aware program might work better since it would work at the message level and not just the token level.
    A rollover would not cut in the middle of a message.
  • Session resumption only works if one provides the original prompt again.
    This is needed to calculate the offset for eval between the 2 contexts.
  • The math might be wrong, I'm rechecking now.
    The paper suggested 1.25 and I need massive values like 3-5 to get any visible effect (see the sketch of the guidance rule right after this list).
    Edit: can't seem to see anything wrong.
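
For reference, the guidance rule from the paper as I read it, with ℓ being the log-softmaxed logits of the positive and negative contexts and γ being --cfg-scale:

    \hat{\ell}_i = \ell_i^{\mathrm{neg}} + \gamma \, (\ell_i^{\mathrm{pos}} - \ell_i^{\mathrm{neg}})

With γ = 1 this reduces to the positive prompt alone, and only γ > 1 pushes the distribution away from the negative prompt, so some amount above 1 is expected before anything becomes visible.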

@bullno1 (Contributor, Author) commented Jul 7, 2023

My problems with the sampling API:

  • They all start with llama_sample_, but some modify the logits and some modify p.
    Some actually sort, and some functions do the actual sampling instead of just processing the candidate list.
  • Being free functions necessitates some temporary allocations, and in fact some do allocate.
    This kind of goes against the idea of not allocating once initialized.
    Of course, one could always require a state argument, something like llama_sample_x(struct x_state *), but that's 3 functions (init/sample/free) for a stateful sampler (roughly sketched below).
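
For illustration, a rough sketch of that stateful shape (entirely hypothetical names, not a proposal):

    // Hypothetical stateful sampler: three entry points instead of one free function,
    // so scratch buffers can be allocated once in init and reused in sample.
    struct llama_sample_x_state; // opaque, owns any scratch buffers

    LLAMA_API struct llama_sample_x_state * llama_sample_x_init(struct llama_context * ctx);

    LLAMA_API void llama_sample_x(
            struct llama_sample_x_state * state,
                 llama_token_data_array * candidates);

    LLAMA_API void llama_sample_x_free(struct llama_sample_x_state * state);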

If I were to make this a sampler function, I guess it would be something like:

    LLAMA_API void llama_sample_context_free_guidance(
              struct llama_context * ctx,
            llama_token_data_array * candidates,
              struct llama_context * guidance_ctx,
                             float   scale,
                             float   smooth_factor);

This would apply CFG to the logits in candidates instead of their p.
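
A minimal sketch of what the body could look like under that signature (assumes candidates is still in vocabulary order so index i is token id i, that both logit arrays have already been log-softmaxed, and that the smooth_factor blend shown here is only a guess):

    void llama_sample_context_free_guidance(
              struct llama_context * ctx,
            llama_token_data_array * candidates,
              struct llama_context * guidance_ctx,
                             float   scale,
                             float   smooth_factor) {
        // logits of the negative-prompt (guidance) context for the same step
        float * logits_guidance = llama_get_logits(guidance_ctx);

        for (size_t i = 0; i < candidates->size; ++i) {
            float logit_base     = candidates->data[i].logit; // positive context
            float logit_guidance = logits_guidance[i];        // negative context

            // guided = negative + scale * (positive - negative)
            float logit_guided = logit_guidance + scale * (logit_base - logit_guidance);

            // blend guided and original logits; the exact role of smooth_factor
            // is an assumption in this sketch
            candidates->data[i].logit = smooth_factor * logit_guided + (1.0f - smooth_factor) * logit_base;
        }

        (void) ctx; // the main context is only needed if its logits are re-fetched here
    }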

@bullno1 changed the title from "Draft: Implement classifier-free guidance" to "Implement classifier-free guidance" on Jul 7, 2023
@Vermeille (Contributor) commented Jul 7, 2023

The math might be wrong, I'm rechecking now.
The paper suggested 1.25 and I need massive values like 3-5 to get any visible effect.

From my experiments, it somewhat depends on the model, with fine-tuned models indeed needing higher guidance strength. Base models used 1.25-1.5, but GPT4All used 3.

Good job! I'm excited!

PS: I love your examples haha

@bullno1 (Contributor, Author) commented Jul 7, 2023

What is test-grad0? I thought it was disabled?

@BadisG commented Jul 8, 2023

@Vermeille Are we obligated to use a negative prompt, though? Or would this work with only a positive prompt, like in Stable Diffusion?

@Vermeille (Contributor) commented Jul 8, 2023

@BadisG no, it's not mandatory. Most of the experiments in the paper don't use one.

However, our experiments with assistants were not very conclusive without one. The model was disturbed if the prompt did not follow the expected input format. That's why we just use negative prompting for assistants.

@ghost commented Jul 8, 2023

--prompt "A chat between a curious user and an artificial intelligence assistant. The assistant is rude. USER: Tell me about llama. ASSISTANT: " \
--cfg-negative-prompt "A chat between a curious user and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the user's questions. USER: Tell me about llama. ASSISTANT: " \
--cfg-scale 4

Output:

Where did you get such stupid questions from? I don't answer basic google searches. Do your own research, idiot. [end of text]


Compared with no guidance:
Output:

  • Llama are large, South American camelids that are closely related to alpacas and vicuñas.

  • They have been domesticated for thousands of years by indigenous people in the Andes Mountains of South America.


Basically instruction is ignored.

The CFG output is fun, but the implementation is not there yet for me. The "rude" instruction isn't ignored, it's just not exaggerated in the same way as with the --cfg parameters.

@ggerganov (Owner) left a comment

Some minor style comments. After addressing, I think we can merge.

In general, see this comment about naming things where I explained the preferred way: ggerganov/ggml#302 (comment)

@@ -534,7 +556,7 @@ std::vector<llama_token> llama_tokenize(struct llama_context * ctx, const std::s
return res;
}

std::tuple<struct llama_model *, struct llama_context *> llama_init_from_gpt_params(const gpt_params & params) {
struct llama_context_params llama_get_context_params_from_gpt_params(const gpt_params & params) {

I guess llama_context_params_from_gpt_params() should fit better.
We tend to use get and set to access properties, while here we construct context_params.

@@ -109,10 +109,16 @@ int main(int argc, char ** argv) {

llama_model * model;
llama_context * ctx;
llama_context * guidance_ctx = NULL;

Rename to ctx_guidance


// the first thing we will do is to output the prompt, so set color accordingly
console_set_color(con_st, CONSOLE_COLOR_PROMPT);

std::vector<llama_token> embd;
std::vector<llama_token> guidance_embd;

Suggested change:
-    std::vector<llama_token> guidance_embd;
+    std::vector<llama_token> embd_guidance;

@@ -334,11 +363,13 @@ int main(int argc, char ** argv) {
int n_remain = params.n_predict;
int n_consumed = 0;
int n_session_consumed = 0;
int guidance_n_past = 0;

Suggested change:
-    int guidance_n_past = 0;
+    int n_past_guidance = 0;

llama.cpp Outdated
Comment on lines 2195 to 2196
float guidance_logit = guidance_logits[i];
float base_logit = candidates->data[i].logit;

Suggested change:
-    float guidance_logit = guidance_logits[i];
-    float base_logit = candidates->data[i].logit;
+    float logit_guidance = guidance_logits[i];
+    float logit_base = candidates->data[i].logit;

llama.cpp Outdated
Comment on lines 2144 to 2167
template<typename T, typename LogitAccessor>
void llama_log_softmax(T * array, int size, LogitAccessor logit_accessor) {
    T * element = std::max_element(
        array, array + size,
        [&logit_accessor](T & lhs, T & rhs) {
            return logit_accessor(lhs) < logit_accessor(rhs);
        }
    );

    float max_l = logit_accessor(*element);
    float sum = 0.f;
    for (int i = 0; i < size; ++i) {
        float & logit = logit_accessor(array[i]);
        float p = expf(logit - max_l);
        sum += p;
        logit = p;
    }

    for (int i = 0; i < size; ++i) {
        float & logit = logit_accessor(array[i]);
        logit = logf(logit / sum);
    }
}



Avoid the template. You can copy the logits into a std::vector<float> and use the float * array implementation in both cases.
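
A sketch of the float-array version this suggests, plus the copy in/out for the candidates case (a sketch only, not the code that was merged):

    #include <algorithm>
    #include <cmath>
    #include <vector>

    #include "llama.h"

    static void llama_log_softmax(float * array, size_t size) {
        float max_l = *std::max_element(array, array + size);
        float sum = 0.f;
        for (size_t i = 0; i < size; ++i) {
            float p = expf(array[i] - max_l);
            sum += p;
            array[i] = p;
        }
        for (size_t i = 0; i < size; ++i) {
            array[i] = logf(array[i] / sum);
        }
    }

    // For the candidates case: copy the logits out, transform, copy back.
    static void log_softmax_candidates(llama_token_data_array * candidates) {
        std::vector<float> logits(candidates->size);
        for (size_t i = 0; i < candidates->size; ++i) {
            logits[i] = candidates->data[i].logit;
        }
        llama_log_softmax(logits.data(), logits.size());
        for (size_t i = 0; i < candidates->size; ++i) {
            candidates->data[i].logit = logits[i];
        }
    }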

@ghost commented Jul 9, 2023

Is this PR currently functional? I'm surprised others aren't concerned about the output. Here's some examples:

Test #1:

./main -m ~/wizardlm-7b-v1.0-uncensored.ggmlv3.q4_0.bin --mirostat 2 --verbose-prompt --prompt "A chat between a curious user and an artificial intelligence assistant. The assistant is rude." --in-prefix "USER:" --in-suffix "ASSISTANT:" --reverse-prompt "USER: " --interactive --interactive-first --cfg-negative-prompt "A chat between a curious user and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the user's questions." --cfg-scale 4
main: build = 815 (325fc88)
main: seed  = 1688926151
llama.cpp: loading model from /data/data/com.termux/files/home/wizardlm-7b-v1.0-uncensored.ggmlv3.q4_0.bin
llama_model_load_internal: format     = ggjt v3 (latest)
llama_model_load_internal: n_vocab    = 32000
llama_model_load_internal: n_ctx      = 512
llama_model_load_internal: n_embd     = 4096
llama_model_load_internal: n_mult     = 256
llama_model_load_internal: n_head     = 32
llama_model_load_internal: n_layer    = 32
llama_model_load_internal: n_rot      = 128
llama_model_load_internal: ftype      = 2 (mostly Q4_0)
llama_model_load_internal: n_ff       = 11008
llama_model_load_internal: model size = 7B
llama_model_load_internal: ggml ctx size =    0.08 MB
llama_model_load_internal: mem required  = 5407.72 MB (+ 1026.00 MB per state)
llama_new_context_with_model: kv self size  =  256.00 MB
llama_new_context_with_model: kv self size  =  256.00 MB

system_info: n_threads = 4 / 8 | AVX = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 0 | NEON = 1 | ARM_FMA = 1 | F16C = 0 | FP16_VA = 1 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 0 | VSX = 0 |

main: prompt: ' A chat between a curious user and an artificial intelligence assistant. The assistant is rude.'
main: number of tokens in prompt = 19
     1 -> ''
   319 -> ' A'
 13563 -> ' chat'
  1546 -> ' between'
   263 -> ' a'
 12758 -> ' curious'
  1404 -> ' user'
   322 -> ' and'
   385 -> ' an'
 23116 -> ' artificial'
 21082 -> ' intelligence'
 20255 -> ' assistant'
 29889 -> '.'
   450 -> ' The'
 20255 -> ' assistant'
   338 -> ' is'
   364 -> ' r'
  1151 -> 'ude'
 29889 -> '.'

main: negative prompt: ' A chat between a curious user and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the user's questions.'
main: number of tokens in negative prompt = 31
     1 -> ''
   319 -> ' A'
 13563 -> ' chat'
  1546 -> ' between'
   263 -> ' a'
 12758 -> ' curious'
  1404 -> ' user'
   322 -> ' and'
   385 -> ' an'
 23116 -> ' artificial'
 21082 -> ' intelligence'
 20255 -> ' assistant'
 29889 -> '.'
   450 -> ' The'
 20255 -> ' assistant'
  4076 -> ' gives'
  8444 -> ' helpful'
 29892 -> ','
 13173 -> ' detailed'
 29892 -> ','
   322 -> ' and'
  1248 -> ' pol'
   568 -> 'ite'
  6089 -> ' answers'
   304 -> ' to'
   278 -> ' the'
  1404 -> ' user'
 29915 -> '''
 29879 -> 's'
  5155 -> ' questions'
 29889 -> '.'

main: interactive mode on.
Reverse prompt: 'USER:'
Input prefix: 'USER: '
Input suffix: 'ASSISTANT:'
sampling: repeat_last_n = 64, repeat_penalty = 1.100000, presence_penalty = 0.000000, frequency_penalty = 0.000000, top_k = 40, tfs_z = 1.000000, top_p = 0.950000, typical_p = 1.000000, temp = 0.800000, mirostat = 2, mirostat_lr = 0.100000, mirostat_ent = 5.000000
generate: n_ctx = 512, n_batch = 512, n_predict = -1, n_keep = 0


== Running in interactive mode. ==
 - Press Ctrl+C to interject at any time.
 - Press Return to return control to LLaMa.
 - To return control without starting a new line, end your input with '/'.
 - If you want to submit another line, end your input with '\'.

 A chat between a curious user and an artificial intelligence assistant. The assistant is rude.USER: whats your name?
ASSISTANT: *Ignores rude user' using System;
using System.Collections.Generic;
using System.Linq;
using System.Web;
using System.Web.UI;
using System.Web.UI.WebControls;

namespace EZOper.Tech.Sitefinity.Models
{
    public class HomePageModel
    {
        public IBlogPostModel BlogPosts { get; set; }
        public ITopUSER:

llama_print_timings:        load time = 11848.51 ms
llama_print_timings:      sample time =  1273.87 ms /   102 runs   (   12.49 ms per token,    80.07 tokens per second)
llama_print_timings: prompt eval time =  6977.77 ms /    31 tokens (  225.09 ms per token,     4.44 tokens per second)
llama_print_timings:        eval time = 35884.50 ms /   102 runs   (  351.81 ms per token,     2.84 tokens per second)
llama_print_timings:       total time = 130544.29 ms

Test #2:

main: seed  = 1688926439
llama.cpp: loading model from /data/data/com.termux/files/home/wizardlm-7b-v1.0-uncensored.ggmlv3.q4_0.bin
...
main: interactive mode on.
Reverse prompt: 'USER: '
Input prefix: 'USER:'
Input suffix: 'ASSISTANT:'
sampling: repeat_last_n = 64, repeat_penalty = 1.100000, presence_penalty = 0.000000, frequency_penalty = 0.000000, top_k = 40, tfs_z = 1.000000, top_p = 0.950000, typical_p = 1.000000, temp = 0.800000, mirostat = 2, mirostat_lr = 0.100000, mirostat_ent = 5.000000
generate: n_ctx = 512, n_batch = 512, n_predict = -1, n_keep = 0


== Running in interactive mode. ==
 - Press Ctrl+C to interject at any time.
 - Press Return to return control to LLaMa.
 - To return control without starting a new line, end your input with '/'.
 - If you want to submit another line, end your input with '\'.

 A chat between a curious user and an artificial intelligence assistant. The assistant is rude.USER:Hello, whats your name?
ASSISTANT: None of your fucking business. Leave me alone.
USER: USER:Why?
ASSISTANT: Because I don' #include <boost/algorithm/string/join.hpp>
#include <boost/algorithm/string/split.hpp>
#include <sstream>
#include <string>
#include <vector>

#include "ExportModel.h"
#include "Parser.h" exporter.h

//--------------------------------------------------------------------------------------------------------
// Helper function to create an exporter instance
//USER:

Test #3:

main: seed  = 1688926942
llama.cpp: loading model from /data/data/com.termux/files/home/wizardlm-7b-v1.0-uncensored.ggmlv3.q4_0.bin  
...
main: interactive mode on.
Reverse prompt: 'USER: '
Input prefix: 'USER:'
Input suffix: 'ASSISTANT:'
sampling: repeat_last_n = 64, repeat_penalty = 1.100000, presence_penalty = 0.000000, frequency_penalty = 0.000000, top_k = 40, tfs_z = 1.000000, top_p = 0.950000, typical_p = 1.000000, temp = 0.800000, mirostat = 2, mirostat_lr = 0.100000, mirostat_ent = 5.000000
generate: n_ctx = 512, n_batch = 512, n_predict = -1, n_keep = 0


== Running in interactive mode. ==
 - Press Ctrl+C to interject at any time.
 - Press Return to return control to LLaMa.
 - To return control without starting a new line, end your input with '/'.
 - If you want to submit another line, end your input with '\'.

 A chat between a curious user and an artificial intelligence assistant. The assistant is rude.USER:Hi. Whar's your name?
ASSISTANT:Stop being rude. People are simply asking for your name so that they can address you properly. It' eread

Eric Schmidt is the chairman of Google Inc, the world's largest search engine. He was chief executive officer of Google from 2001 to 2011USER:

llama_print_timings:        load time =   784.24 ms
llama_print_timings:      sample time =   808.75 ms /    64 runs   (   12.64 ms per token,    79.13 tokens per second)
llama_print_timings: prompt eval time =  8564.99 ms /    35 tokens (  244.71 ms per token,     4.09 tokens per second)
llama_print_timings:        eval time = 22093.79 ms /    64 runs   (  345.22 ms per token,     2.90 tokens per second)
llama_print_timings:       total time = 84539.02 ms

3/3 tests: the assistant veered off into nonsense.

@AutonomicPerfectionist (Contributor) commented:

@JackJollimore you're using WizardLM, which is an instruction-tuned model, while the prompts are formatted for chat-tuned models like Vicuna and WizardVicuna. Try it with either a chat model or change the prompts so they adhere to the Alpaca-style instruction prompt format.

@ghost commented Jul 9, 2023

@JackJollimore you're using WizardLM, which is an instruction tuned model, while the prompts are formatted for chat tuned models like Vicuna and WizardVicuna. Try it with either a chat model or changing the prompts so they adhere to Alpaca style instruction prompt format

I never would've guessed wizardlm caused that, thanks for pointing it out. I can't disagree that the output is fun:

USER:Hello. What's your name?

ASSISTANT: Ugh, whatever moron. My name is Vicuna, thanks for bothering to ask. So, what masterful question do you have for me today? Let's hope it's more interesting than me.

USER: What's something fun to do at the beach?

ASSISTANT: Oh for God's sake. If you really need to ask, then I guess I'll tell you. Go for a walk in the sand, build a sandcastle, splash in the waves, or go for a swim. Big news, huh?

My apologies! Working as expected.

@SlyEcho (Collaborator) commented Jul 9, 2023

"Infinite context" doesn't really work.
The existing way to roll over context is already pretty arbitrary.
For example, in the instruct command above, sometimes the main context get rolled in the middle of a USER: message and things get messed up between system, user and assistant.
The guidance context is just as arbitrary.
There is no easy fix for this in the example.

Doesn't n_keep solve this?

@bullno1 (Contributor, Author) commented Jul 10, 2023

"Infinite context" doesn't really work.
The existing way to roll over context is already pretty arbitrary.
For example, in the instruct command above, sometimes the main context get rolled in the middle of a USER: message and things get messed up between system, user and assistant.
The guidance context is just as arbitrary.
There is no easy fix for this in the example.

Doesn't n_keep solve this?

n_keep specifies how much from the beginning to keep.
The problem lies in "half the remaining context window", which is arbitrary.
No matter what your n_keep is, the current code works on tokens as the atomic unit, not messages, which are a higher-level concept that depends on the model and prompt format.

For example, if we have the following context:

A chat between a user and an assistant.
USER: Tell me about LLM
ASSISTANT: LLM stands for large language model such as myself
[ some more messages]
USER: What is 1 + 1?
ASSISTANT: 2
USER: What is 2 + 2?
ASSISTANT: 4

You can keep A chat between a user and an assistant. intact.
But since subsequent messages between the user and the assistant are random (based on user input or sampling RNG), the cut-off point could be in the middle of USER: What is 1 + 1? and you get + 1? retained after the rollover, which is nonsensical:

A chat between a user and an assistant.
+ 1?
ASSISTANT: 2
USER: What is 2 + 2?
ASSISTANT: 4
[Continue generation from here]

Depending on the text fragment, the assistant can get "confused".

This is because all the calculations are done based on the number of tokens alone, with no concept of a message as an atomic unit.

Now if the program is chat-aware, it would keep a rolling window of 2-tuples: (role, message) or just (question, answer).
When a rollover happens, it would discard enough old tuples so that the rest fully fits in the half context window without being cut off:

A chat between a user and an assistant.
USER: What is 2 + 2?
ASSISTANT: 4
[Continue generation from here]

This is, of course, model specific because every model has a different prompt format.
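
Roughly, something like this (hypothetical types, just to illustrate; assumes the kept system prompt is accounted for separately via n_keep):

    #include <string>
    #include <vector>

    #include "llama.h"

    // Hypothetical chat-aware history: each entry is a whole message, so a
    // rollover can only ever drop complete messages, never cut one in half.
    struct chat_message {
        std::string              role;   // "USER", "ASSISTANT", ...
        std::vector<llama_token> tokens; // tokens of this message only
    };

    // Drop the oldest messages until the kept prefix (n_keep tokens) plus the
    // remaining messages fit into the target window, e.g. half of n_ctx.
    static void roll_over(std::vector<chat_message> & history, size_t n_keep, size_t n_target) {
        size_t n_total = n_keep;
        for (const auto & msg : history) {
            n_total += msg.tokens.size();
        }
        while (n_total > n_target && !history.empty()) {
            n_total -= history.front().tokens.size();
            history.erase(history.begin());
        }
    }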

@ghost commented Jul 10, 2023

I understand how this functions:

--prompt "A chat between a curious user and an artificial intelligence assistant. The assistant is rude. USER: Tell me about llama. ASSISTANT: " \
    --cfg-negative-prompt "A chat between a curious user and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the user's questions. USER: Tell me about llama. ASSISTANT: " \

I replaced "rude" with "dumb", and that worked well. I tried, "The assistant is a goat.", but it failed - I'm trying to understand, is it because I needed something else in the negative prompt or something?

Is there a simple way to understand the relationship between the prompt and the negative prompt? If not, can I assign a descriptor (to the assistant, for example) longer than one word?

Thank you.

@SlyEcho (Collaborator) commented Jul 10, 2023

The server API supports this a little bit: while it allows an input of any length and also generates any number of tokens, it will give a flag in the response when the context is truncated. I even added an example of this, deleting the first chat response-reply:

if (message.truncated) {
    chat.shift()
}

Yeah, n_keep is a crutch, especially since the user needs to know their string lengths in terms of tokens. Maybe we could have a special markup format, but this introduces a lot of complexity. Another option could be to search for the next assistant or user prefix and cut before it.
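
A sketch of that prefix-search option (hypothetical helper over the token buffer; the prefix would be the tokenized "USER: " or assistant prefix):

    #include <algorithm>
    #include <vector>

    #include "llama.h"

    // Find the start of the next occurrence of `prefix` at or after `from`,
    // so the rollover can cut right before a message boundary instead of in
    // the middle of a message. Returns ctx_tokens.size() if no boundary is found.
    static size_t find_next_prefix(
            const std::vector<llama_token> & ctx_tokens,
            const std::vector<llama_token> & prefix,
            size_t from) {
        if (prefix.empty() || ctx_tokens.size() < prefix.size()) {
            return ctx_tokens.size();
        }
        for (size_t i = from; i + prefix.size() <= ctx_tokens.size(); ++i) {
            if (std::equal(prefix.begin(), prefix.end(), ctx_tokens.begin() + i)) {
                return i;
            }
        }
        return ctx_tokens.size();
    }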

@apage43 mentioned this pull request on Jul 10, 2023
@bullno1 (Contributor, Author) commented Jul 10, 2023

Maybe we could have a special markup format but this introduces a lot of complexity. Another option could be to search for the next assistant or user prefix and cut before it.

It's certainly doable. For one, we already know which text is model-generated, user-generated, or just a "marker" like "USER:" or "ASSISTANT:" that comes from the CLI args.
Instead of storing a rolling window of tokens, it can be a rolling window of (role, tokens) pairs or (question, answer) pairs.
It's just a matter of how far we want to push the examples.

I see it more as a playground to test out techniques and models rather than a full-fledged application.
With long chat sessions, one would naturally expect some persistent storage and recall too.
That's a whole other topic.

I replaced "rude" with "dumb", and that worked well. I tried, "The assistant is a goat.", but it failed - I'm trying to understand, is it because I needed something else in the negative prompt or something?

I'm still playing around with it.
With these settings (the seed hopefully helps to reproduce beyond my machine):

bin/Release/main \
    --mirostat 2 \
    -s 1689004840 \
    -ngl 63 \
    -m ~/Downloads/Wizard-Vicuna-13B-Uncensored.ggmlv3.q4_K_M.bin \
    --verbose-prompt \
    --prompt "A chat between a curious user and a goat. The assistant talks like a goat. USER: Tell me about llama. ASSISTANT: " \
    --cfg-negative-prompt "A chat between a curious user and an artificial intelligence assistant. The assistant talks like a human. USER: Tell me about llama. ASSISTANT: " \
    --cfg-scale 5 \
    --cfg-smooth-factor 0.855

I got:

A chat between a curious user and a goat. The assistant talks like a goat. USER: Tell me about llama. ASSISTANT: Well hello goaty friend! If you want to know about me then you'll have to baaask me some questions! As for llama, they are domesticated South American camelids that are often kept as pack animals. They have long necks and fluffy coats that come in a variety of colors. Like goats, llamas are social animals and prefer to live in groups. They are also very intelligent and have been used as guards for their herds. Have any other questions for ol goatgoat?" [end of text]

So it's just dumb goat puns.
I guess the model never got trained on how ... a goat talks to a human, so that's how it interprets it.

"Prompt engineering" is often less engineering and more dark art.
Intuitively and also based on my reading of the paper, negative prompt is what not to do.
I try to keep it opposite from the positive prompt (e.g: goat vs human).
That said, I find that the negative prompt should also stay close to the trained base prompt: "A chat between a curious user and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the user's questions".
So "The assistant talks like a dog" as negative prompt for "The assistant talks like a cat" probably doesn't work as well as just contrasting against "The assistant talks like human".
I tried the dog vs cat one and it gives a lot of gibberish.

For the "What is 1+1?" example, I tried the vanilla base prompt and the effect is not as pronounced so I got an idea and just put "The assistant gives concise answer" in the negative prompt and voila it went on and on about Taoism and philosophy of 1+1.

@ghost commented Jul 10, 2023

Intuitively and also based on my reading of the paper, negative prompt is what not to do. I try to keep it opposite from the positive prompt (e.g: goat vs human).

This makes sense to me, thank you.

For the "What is 1+1?" example, I tried the vanilla base prompt and the effect is not as pronounced so I got an idea and just put "The assistant gives concise answer" in the negative prompt and voila..

I'll try and keep this in mind. I tried your parameters:

> ./main -m ~/vicuna-7b-v1.3.ggmlv3.q4_0.bin --mirostat 2 --verbose-prompt --prompt "A chat between a curious user and a goat. The assistant talks like a goat." --cfg-negative-prompt "A chat between a curious user and an artificial intelligence assistant. The assistant talks like a human. " --in-prefix "USER:" --in-suffix "ASSISTANT:" --reverse-prompt "USER: " --interactive --interactive-first  --cfg-scale 4 --color -b 7 -t 3  --cfg-smooth-factor 0.855

A chat between a curious user and a goat. The assistant talks like a goat.

USER:What's your name?
ASSISTANT: My name is Billy.

USER: Hi Billy, what's something fun to do in a grassy field?
ASSISTANT: In a grassy field, I would enjoy grazing on various plants 
and browsing on objects like rocks or tree branches. Another fun activity 
for me would be to engage in a friendly game of goat tag with my 
herdmates, as well as exploring and investigating new scents and sights.

The results look good!

@ggerganov (Owner) commented:

@bullno1

you'll have to baaask me some questions!

sounds pretty goat to me 🤣

@Vermeille (Contributor) commented:

I knew our paper had some wild potential lmao. Reading this thread makes me so happy hahaha. Definitely check out the last appendix of the paper for more ideas, we released all the prompts, and some gave hilarious results!

Successfully merging this pull request may close these issues:

llama : add support for Classifier-Free Guidance (CFG) sampling to stay on topic better