Implement classifier-free guidance #2135
Conversation
My problems with the sampling API:
If I were to make this a sampler function, I guess it would be something like:

LLAMA_API void llama_sample_context_free_guidance(
    struct llama_context * ctx,
    llama_token_data_array * candidates,
    struct llama_context * guidance_ctx,
    float scale,
    float smooth_factor);

This would apply CFG to the candidates.
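For illustration, a minimal sketch (not the PR's actual implementation) of the core mix such a sampler could perform, assuming the usual CFG formula l = l_guidance + scale * (l_base - l_guidance), leaving out the smooth_factor handling, and operating on plain float arrays to keep the example self-contained:

#include <cstddef>

// Blend base-context logits with guidance-context logits in place.
// With scale == 1 the base logits are left unchanged; with scale > 1
// the result is pushed away from the guidance (negative-prompt) distribution.
static void cfg_mix_logits(float * base_logits, const float * guidance_logits,
                           size_t n_vocab, float scale) {
    for (size_t i = 0; i < n_vocab; ++i) {
        base_logits[i] = guidance_logits[i] + scale * (base_logits[i] - guidance_logits[i]);
    }
}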
From my experiments, it depends somewhat on the model, with fine-tuned models indeed needing higher guidance strength: base models used 1.25-1.5, but GPT4All used 3. Good job! I'm excited! PS: I love your examples haha
What is
@Vermeille Are we obligated to use a negative prompt though? Or would this work with only a positive prompt, like in Stable Diffusion?
@BadisG no, it's not mandatory. Most of the experiments in the paper don't use one. However, our experiments with assistants were not very conclusive without one. The model was disturbed if the prompt did not follow the expected input format. That's why we just use negative prompting for assistants.
The CFG output is fun, but the implementation is not there yet for me. The "rude" instruction isn't ignored; it's just not exaggerated in the same way with the --cfg parameters.
Some minor style comments. After addressing them, I think we can merge.
In general, see this comment, where I explained the preferred way of naming things: ggerganov/ggml#302 (comment)
examples/common.cpp
Outdated
@@ -534,7 +556,7 @@ std::vector<llama_token> llama_tokenize(struct llama_context * ctx, const std::s
    return res;
}

std::tuple<struct llama_model *, struct llama_context *> llama_init_from_gpt_params(const gpt_params & params) {
struct llama_context_params llama_get_context_params_from_gpt_params(const gpt_params & params) {
I guess llama_context_params_from_gpt_params() should fit better. We tend to use get and set to access properties, while here we construct context_params.
examples/main/main.cpp
Outdated
@@ -109,10 +109,16 @@ int main(int argc, char ** argv) {

    llama_model * model;
    llama_context * ctx;
    llama_context * guidance_ctx = NULL;
Rename to ctx_guidance
examples/main/main.cpp
Outdated
    // the first thing we will do is to output the prompt, so set color accordingly
    console_set_color(con_st, CONSOLE_COLOR_PROMPT);

    std::vector<llama_token> embd;
    std::vector<llama_token> guidance_embd;
-std::vector<llama_token> guidance_embd;
+std::vector<llama_token> embd_guidance;
examples/main/main.cpp
Outdated
@@ -334,11 +363,13 @@ int main(int argc, char ** argv) {
    int n_remain = params.n_predict;
    int n_consumed = 0;
    int n_session_consumed = 0;
    int guidance_n_past = 0;
-int guidance_n_past = 0;
+int n_past_guidance = 0;
llama.cpp
Outdated
    float guidance_logit = guidance_logits[i];
    float base_logit = candidates->data[i].logit;
-float guidance_logit = guidance_logits[i];
-float base_logit = candidates->data[i].logit;
+float logit_guidance = guidance_logits[i];
+float logit_base = candidates->data[i].logit;
llama.cpp
Outdated
template<typename T, typename LogitAccessor>
void llama_log_softmax(T * array, int size, LogitAccessor logit_accessor) {
    T* element = std::max_element(
        array, array + size,
        [&logit_accessor](T& lhs, T& rhs) {
            return logit_accessor(lhs) < logit_accessor(rhs);
        }
    );

    float max_l = logit_accessor(*element);
    float sum = 0.f;
    for (int i = 0; i < size; ++i) {
        float& logit = logit_accessor(array[i]);
        float p = expf(logit - max_l);
        sum += p;
        logit = p;
    }

    for (int i = 0; i < size; ++i) {
        float& logit = logit_accessor(array[i]);
        logit = logf(logit / sum);
    }
}
Avoid the template. You can copy the logits in a std::vector<float> and use the float * array implementation in both cases.
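For illustration, a rough sketch of the non-templated variant this suggests, operating on a plain float array (a sketch only, not necessarily the final code):

#include <algorithm>
#include <math.h>
#include <stddef.h>

// In-place log-softmax over a plain float array. Callers holding
// llama_token_data can copy the logits into a std::vector<float>,
// run this, and copy the values back.
static void log_softmax_inplace(float * array, size_t size) {
    const float max_l = *std::max_element(array, array + size);
    float sum = 0.f;
    for (size_t i = 0; i < size; ++i) {
        const float p = expf(array[i] - max_l);
        sum += p;
        array[i] = p;
    }
    for (size_t i = 0; i < size; ++i) {
        array[i] = logf(array[i] / sum);
    }
}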
Is this PR currently functional? I'm surprised others aren't concerned about the output. Here are some examples: Test #1:
Test #2:
Test #3:
In 3/3 tests the assistant veered off into nonsense.
@JackJollimore you're using WizardLM, which is an instruction-tuned model, while the prompts are formatted for chat-tuned models like Vicuna and WizardVicuna. Try it with either a chat model or change the prompts so they adhere to the Alpaca-style instruction prompt format.
I never would've guessed WizardLM caused that, thanks for pointing it out. I can't disagree that the output is fun:
My apologies! Working as expected.
Doesn't
For example, if we have the following context:
You can keep
Depending on the text fragment, the assistant can get "confused". This is because all the calculations are done based on the number of tokens alone, with no concept of a message as an atomic unit. Now, if the program were chat-aware, it would keep a rolling window of 2-tuples (user message, assistant reply).
This is, of course, model-specific because every model has a different prompt format.
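For illustration, a minimal sketch of such a chat-aware rolling window, assuming a Vicuna-style "USER:"/"ASSISTANT:" format; all names here are made up for the example, and a real version would count tokens rather than characters:

#include <deque>
#include <string>
#include <utility>

// Keep whole (user, assistant) message pairs and drop the oldest pair when the
// rendered prompt no longer fits, instead of cutting at an arbitrary token.
struct chat_window {
    std::string system;                                     // system prompt, always kept
    std::deque<std::pair<std::string, std::string>> turns;  // (user message, assistant reply)

    std::string render() const {
        std::string out = system;
        for (const auto & t : turns) {
            out += "\nUSER: " + t.first + "\nASSISTANT: " + t.second;
        }
        return out;
    }

    // drop the oldest turns until the rendered prompt fits the budget
    void fit(size_t max_chars) {
        while (!turns.empty() && render().size() > max_chars) {
            turns.pop_front();
        }
    }
};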
I understand how this functions:
I replaced "rude" with "dumb", and that worked well. I tried "The assistant is a goat.", but it failed. I'm trying to understand: is it because I needed something else in the negative prompt? Is there a simple way to understand the relationship between the prompt and the negative prompt? If not, can I assign a descriptor (to the assistant, for example) longer than one word? Thank you.
The server API supports this a little bit: while it allows an input of any length and generates any number of tokens, it will set a flag in the response when the context is truncated. I even added an example of this, deleting the first chat request-reply pair: llama.cpp/examples/server/chat.mjs, lines 68 to 70 in 1d16309
Yeah,
It's certainly doable. For one, we already know which text is model-generated, user-generated, or just a "marker" like "USER:" or "ASSISTANT:" that comes from the CLI args. I see it more as a playground to test out techniques and models than a full-fledged application.
I'm still playing around with it.

bin/Release/main \
    --mirostat 2 \
    -s 1689004840 \
    -ngl 63 \
    -m ~/Downloads/Wizard-Vicuna-13B-Uncensored.ggmlv3.q4_K_M.bin \
    --verbose-prompt \
    --prompt "A chat between a curious user and a goat. The assistant talks like a goat. USER: Tell me about llama. ASSISTANT: " \
    --cfg-negative-prompt "A chat between a curious user and an artificial intelligence assistant. The assistant talks like a human. USER: Tell me about llama. ASSISTANT: " \
    --cfg-scale 5 \
    --cfg-smooth-factor 0.855

I got:
So it's just dumb goat puns. "Prompt engineering" is often less engineering and more dark art. For the "What is 1+1?" example, I tried the vanilla base prompt and the effect was not as pronounced, so I got an idea and just put "The assistant gives concise answer" in the negative prompt, and voila, it went on and on about Taoism and the philosophy of 1+1.
This makes sense to me, thank you.
I'll try to keep this in mind. I tried your parameters:
The results look good!
sounds pretty goat to me 🤣
I knew our paper had some wild potential lmao. Reading this thread makes me so happy hahaha. Definitely check out the last appendix of the paper for more ideas; we released all the prompts, and some gave hilarious results!
Closes #2083.
To test:
Output:
Compared with no guidance:
Output:
Basically, the instruction is ignored.
Interactive mode also works:
And if a rude assistant is not your thing:
Output:
There are problems that I need some opinions on:
The existing way to roll over the context is already pretty arbitrary.
For example, in the instruct command above, the main context sometimes gets rolled in the middle of a USER: message, and things get messed up between system, user, and assistant. The guidance context is just as arbitrary.
There is no easy fix for this in the example.
A chat-aware program might work better, since it would work at the message level and not just the token level.
A rollover would then not cut in the middle of a message.
This is needed to calculate the offset for eval between the 2 contexts.
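For illustration, a rough sketch (not necessarily what the example ends up doing) of how such an offset could be computed; the names and the use of plain int tokens are illustrative:

#include <vector>

// The guidance (negative) prompt and the main prompt are tokenized separately,
// so their lengths differ. When the same newly generated tokens are later fed
// to both contexts, the guidance context sits this many positions ahead.
static int cfg_guidance_offset(const std::vector<int> & main_prompt_tokens,
                               const std::vector<int> & guidance_prompt_tokens) {
    return (int) guidance_prompt_tokens.size() - (int) main_prompt_tokens.size();
}

// usage idea: n_past_guidance = n_past + cfg_guidance_offset(inp_main, inp_guidance);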
The paper suggested a scale of 1.25, and I need massive values like 3-5 to get any visible effect.
Edit: can't seem to see anything wrong.