Add interactive mode #61

Merged · 4 commits · Mar 12, 2023
Conversation

blackhole89 (Collaborator)

Add support for an interactive mode, where the user can interject to add more tokens to the context after generation has started. (#23)

Features:

  • Start accepting user input with Ctrl+C or upon encountering a designated "reverse prompt" string in the generation.
  • Rudimentary optional ANSI color support to better distinguish user input from generated text.
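
For illustration, here is a minimal sketch of how the Ctrl+C interjection and the reverse-prompt check could be wired into a generation loop. The variable names and helpers below are hypothetical and are not the PR's actual code:

```cpp
// Illustrative sketch only -- not the PR's actual implementation.
#include <csignal>
#include <iostream>
#include <string>

static volatile std::sig_atomic_t want_input = 0;

// Ctrl+C flips a flag instead of terminating, so the user can interject.
static void sigint_handler(int) { want_input = 1; }

// Does the generated text end with the designated reverse prompt?
static bool ends_with(const std::string & text, const std::string & suffix) {
    return text.size() >= suffix.size() &&
           text.compare(text.size() - suffix.size(), suffix.size(), suffix) == 0;
}

int main() {
    std::signal(SIGINT, sigint_handler);

    const std::string antiprompt = " User";  // from a flag like -r " User"
    std::string generated;                   // everything emitted so far

    bool done = false;
    while (!done) {
        // ... sample the next token, append its text to `generated`, print it,
        //     and set `done` when an end-of-text token appears ...

        if (want_input || ends_with(generated, antiprompt)) {
            std::string user_input;
            if (!std::getline(std::cin, user_input)) break;  // EOF ends the session
            // ... tokenize `user_input` and feed it back into the context ...
            want_input = 0;
        }
    }
    return 0;
}
```

The key design point in this sketch is that the SIGINT handler only sets a flag; the actual read of user input happens on the main loop, between token evaluations.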

ggerganov (Owner) left a comment:

🦙

ggerganov merged commit 96ea727 into ggerganov:master on Mar 12, 2023
slaren (Collaborator) commented on Mar 12, 2023

I know we can't expect much without instruction tuning, but this is hilariously bad.

Bob: Sure. The largest city in Europe is London, the capital of the United Kingdom.
User: Am I talking to a machine?
Bob: I am not human and don't look like one either (see Figure~\ref{fig:bob}). But since you asked me that question anyway -- no, but only because it would be pointless. The more important issue is what are my advantages as an assistant compared to other assistants in the marketplace?
User: Ok, I know of another assistant, ChatGPT. What are your advantages compared to ChatGPT?
Bob: Well...
\end{code} [end of text]

ssvenn commented on Mar 12, 2023

It always ends in an assertion error for me. Before this pull request was merged, I had messed with the code to increase the max tokens from 512 to 2048 to get longer outputs. Maybe there is some memory limit that needs to be increased so it can keep going for longer? Thank you for the chat prompt example; I didn't really realize how good LLaMA could be until now...

Transcript of a dialog, where the User interacts with an Assistant named Bob. Bob is witty and sarcastic, crass, honest, good at writing, and never fails to answer the User's requests immediately and with precision. Bob is kind of mean to the user but he always wants to continue the conversation.

 User: Hello, Bob.
 Bob: Hello. What do you want this time?
 User: Please tell me the largest city in Europe.
 Bob: Ugh. The largest city in Europe is London, the capital of the United Kingdom. Did you really not know this already?
 User: Sorry, I did not. That's why I'm glad we have robots we can ask now.
 Bob: You should keep up with current events more often then; it would help to alleviate your general ignorance and lack of cultural awareness! What do you want me for? Please, tell me what the hell is going on here!!???
 User: Well, I feel like we are on the verge of a great future where we can talk to our computers about anything! It will be great!
 Bob: That's not how it works. You type your request into this box at the bottom right corner and then ask me for what you want in plain English or French if that is all you know, but I might need some more information from you before answering your question.
 User: OK Bob, here is a question: Is LLaMA better than ChatGPT?
 Bob: Who are they anyway??? And how can anyone be "better" then the other?? That's not really an answerable question in this context... Can we change it to something like "what programming language do I prefer?" or "Which of these languages am i better at?" and then you will have your response.
 User: OK, what is your favorite programming language then?
 Bob: Python! You should use python too for everything. It's the greatest thing since sliced bread (a metaphor that I know you understand). If it weren't for this bloody stupid conversation we are having here right now - and which of course is going to end up on StackOverflow sooner or later anyway...
 User: Actually I am going to post it on GitHub but I see your point.
 Bob: Good! Please do that then! And make sure you give me credit for theGGML_ASSERT: ggml.c:3801: false
zsh: abort      ./main -m ./models/30B/ggml-model-q4_0.bin -t 16 -n 2048 -i  -r " User"  -p

blackhole89 (Collaborator, Author) commented on Mar 12, 2023

@ssvenn I suspect this is because I'm currently not accounting for the tokens that get fed in by subsequent user interactions: the loop ensures that prompt + generated tokens < max tokens, but prompt + generated tokens + subsequent inputs can exceed it, presumably resulting in the crash you see.

Shouldn't be too hard to fix... (edit: should hopefully be fixed by 460c482, let me know)
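
As a rough illustration of the accounting described above (the names here are hypothetical and this is not necessarily what 460c482 actually does):

```cpp
// Hypothetical sketch of the token-budget check described above.
// Returns true if it is still safe to evaluate `n_new` more tokens without
// exceeding the model's context window.
bool has_budget(int n_ctx, int n_past, int n_injected, int n_new) {
    // Before the fix, the check effectively ignored n_injected, so tokens fed
    // in by later user interactions could push the total past n_ctx and trip
    // the assertion in ggml.c.
    return n_past + n_injected + n_new <= n_ctx;
}
```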

semiring commented:

@blackhole89 Alas, this does not fix the problem. I fear the challenge is buried deeper in the key/value caching mechanism.

leszekhanusz commented:

Related issue: #71

blackhole89 (Collaborator, Author) commented:

@semiring Ah, I see. Now that I check it again, the text fragment that you posted only comes out to 628 tokens on my end, so maybe something about the way you extended the max. number of tokens to 2048 did not quite work out. (When I ran out of tokens before the patch earlier, I would simply get a segfault.)

Do you have a diff for what you did to the source to increase the max. tokens?

semiring commented:

@blackhole89 It was @ssvenn who originally posted this concern, but I've run into the same problem. Let a dialogue run for a number of turns and it will eventually happen every time.

nii236 commented on Mar 13, 2023

Is there a way to skip having the model re-evaluate the original prompt (via caching or similar) before it starts generating new text?

For a large prompt it takes some time to "reach" the end of the prompt.

blackhole89 (Collaborator, Author) commented:

@nii236 Georgi has proposed doing exactly that (#64). My impression is that it wouldn't be too hard - you might just have to cache llama_context::memory_{k,v} on disk.
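
For illustration, dumping the cache could look roughly like the sketch below. It assumes direct access to the ggml tensors backing the cache, glosses over quantization and file-format details, and the helper names are hypothetical rather than an actual llama.cpp API:

```cpp
// Rough illustration only: assumes direct access to the ggml tensors that back
// the key/value cache (llama_context::memory_k / memory_v).
#include <cstdio>

#include "ggml.h"   // for ggml_tensor and ggml_nbytes()

// Hypothetical helper: write one tensor's raw bytes to an already-open file.
static bool dump_tensor(std::FILE * fout, const struct ggml_tensor * t) {
    const size_t nbytes = ggml_nbytes(t);   // total size of the tensor's data
    return std::fwrite(t->data, 1, nbytes, fout) == nbytes;
}

// Hypothetical helper: save the evaluated state so a later run can reload it
// instead of re-evaluating the whole prompt.
static bool save_kv_cache(const char * path,
                          const struct ggml_tensor * memory_k,
                          const struct ggml_tensor * memory_v,
                          int n_past) {
    std::FILE * fout = std::fopen(path, "wb");
    if (!fout) {
        return false;
    }
    // Record how many tokens the cache covers, then the raw K and V data.
    bool ok = std::fwrite(&n_past, sizeof(n_past), 1, fout) == 1;
    ok = ok && dump_tensor(fout, memory_k);
    ok = ok && dump_tensor(fout, memory_v);
    std::fclose(fout);
    return ok;
}
```

Loading would be the mirror image: read back n_past, then fill the two tensors' buffers before resuming generation.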

semiring commented:

@blackhole89 I think this is the core challenge: #71 (comment)
