No memory past first conversation for local models? #631

Closed
BarfingLemurs opened this issue Feb 21, 2024 · 12 comments

Comments

@BarfingLemurs

Describe the bug

None of the chat windows support continued conversations for local models. I'm not sure whether this is a bug or simply hasn't been implemented yet. Example:
(screenshot)

When using local model APIs like https://github.com/theroyallab/tabbyAPI, I was unable to continue a conversation; the model only receives my input as if it were the first message.

To Reproduce
Steps to reproduce the behavior:
Using Firefox, enter the local URL:
(screenshot: Screenshot_2024-02-21_14-07-24)

Please complete the following information:

  • OS: Ubuntu
  • Browser: Firefox
  • Extension Version: 2.4.9
    Last Updated: February 6, 2024


@josStorer
Owner

I think there may be an issue with the local model service you're using. I'm confident that ChatGPTBox passes history messages through the API. Perhaps you can check whether there are any log files.
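To rule the extension in or out, you can also send a multi-turn request directly to the backend and see whether it uses the earlier turns. A rough sketch, assuming an OpenAI-compatible endpoint (the URL, port, and message contents are placeholders for your setup):

curl http://localhost:5001/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "", "stream": false, "messages": [
        {"role": "user", "content": "my favorite sport is soccer."},
        {"role": "assistant", "content": "Your favorite sport is soccer."},
        {"role": "user", "content": "repeat that."}]}'

If the answer reflects the earlier turns, the backend handles history correctly and the question becomes what the extension actually sends.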

@BarfingLemurs
Author

I have been trying some other local APIs to see whether the problem is with a specific backend.

With the OpenAI-compatible API of this project, koboldcpp, I can see that the previous messages aren't being sent.

kobold.cpp

Embedded Kobold Lite loaded.
Starting Kobold API on port 5001 at http://localhost:5001/api/
Starting OpenAI Compatible API on port 5001 at http://localhost:5001/v1/
======
Please connect to custom endpoint at http://localhost:5001

Input: {"messages": [{"role": "user", "content": "hi"}], "model": "", "stream": true, "max_tokens": 40000, "temperature": 1}

Processing Prompt (1 / 1 tokens)
Generating (40 / 2047 tokens)
(EOS token triggered!)
ContextLimit: 41/2048, Processing:0.83s (832.0ms/T), Generation:30.47s (761.8ms/T), Total:31.30s (782.5ms/T = 1.28T/s)
Output:  Question: What is 2704537848 to the power of 1/2, to the nearest integer?
Answer: 51931

Input: {"messages": [{"role": "user", "content": "repeat that."}], "model": "", "stream": true, "max_tokens": 40000, "temperature": 1}

Processing Prompt (0 / 0 tokens)
Generating (1 / 2047 tokens)
(EOS token triggered!)
ContextLimit: 2/2048, Processing:0.00s (0.0ms/T), Generation:0.00s (0.0ms/T), Total:0.00s (0.0ms/T = infT/s)
Output:

(screenshot)
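For comparison, if the history were being forwarded, the second request would be expected to carry the earlier turns as well, roughly like this (assistant content abbreviated):

Input: {"messages": [{"role": "user", "content": "hi"}, {"role": "assistant", "content": "Question: What is 2704537848 to the power of 1/2, ... Answer: 51931"}, {"role": "user", "content": "repeat that."}], "model": "", "stream": true, "max_tokens": 40000, "temperature": 1}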

Here's a normal log of what should happen:

Input: {"n": 1, "max_context_length": 1600, "max_length": 120, "rep_pen": 1.1, "temperature": 0.7, "top_p": 0.92, "top_k": 100, "top_a": 0, "typical": 1, "tfs": 1, "rep_pen_range": 320, "rep_pen_slope": 0.7, "sampler_order": [6, 0, 1, 3, 4, 2, 5], "memory": "", "genkey": "KCPP6614", "min_p": 0, "dynatemp_range": 0, "dynatemp_exponent": 1, "presence_penalty": 0, "logit_bias": {}, "prompt": "\n### Instruction:\nmy favorite sport is soccer. repeat that.\n### Response:\n", "quiet": true, "stop_sequence": ["### Instruction:", "### Response:"], "use_default_badwordsids": false}

Processing Prompt (22 / 22 tokens)
Generating (7 / 120 tokens)
(EOS token triggered!)
ContextLimit: 29/1600, Processing:16.62s (755.5ms/T), Generation:4.75s (679.3ms/T), Total:21.38s (3053.6ms/T = 0.33T/s)
Output: Your favorite sport is soccer.

Input: {"n": 1, "max_context_length": 1600, "max_length": 120, "rep_pen": 1.1, "temperature": 0.7, "top_p": 0.92, "top_k": 100, "top_a": 0, "typical": 1, "tfs": 1, "rep_pen_range": 320, "rep_pen_slope": 0.7, "sampler_order": [6, 0, 1, 3, 4, 2, 5], "memory": "", "genkey": "KCPP5896", "min_p": 0, "dynatemp_range": 0, "dynatemp_exponent": 1, "presence_penalty": 0, "logit_bias": {}, "prompt": "\n### Instruction:\nmy favorite sport is soccer. repeat that.\n### Response:\nYour favorite sport is soccer.\n### Instruction:\nrepeat that.\n### Response:\n", "quiet": true, "stop_sequence": ["### Instruction:", "### Response:"], "use_default_badwordsids": false}

Processing Prompt (14 / 14 tokens)
Generating (7 / 120 tokens)
(EOS token triggered!)
ContextLimit: 49/1600, Processing:10.54s (752.9ms/T), Generation:4.80s (686.3ms/T), Total:15.35s (2192.1ms/T = 0.46T/s)
Output: Your favorite sport is soccer.

Input: {"n": 1, "max_context_length": 1600, "max_length": 120, "rep_pen": 1.1, "temperature": 0.7, "top_p": 0.92, "top_k": 100, "top_a": 0, "typical": 1, "tfs": 1, "rep_pen_range": 320, "rep_pen_slope": 0.7, "sampler_order": [6, 0, 1, 3, 4, 2, 5], "memory": "", "genkey": "KCPP9206", "min_p": 0, "dynatemp_range": 0, "dynatemp_exponent": 1, "presence_penalty": 0, "logit_bias": {}, "prompt": "\n### Instruction:\nmy favorite sport is soccer. repeat that.\n### Response:\nYour favorite sport is soccer.\n### Instruction:\nrepeat that.\n### Response:\nYour favorite sport is soccer.\n### Instruction:\ngood. say it once more.\n### Response:\n", "quiet": true, "stop_sequence": ["### Instruction:", "### Response:"], "use_default_badwordsids": false}

Processing Prompt (18 / 18 tokens)
Generating (7 / 120 tokens)
(EOS token triggered!)
ContextLimit: 73/1600, Processing:13.57s (754.2ms/T), Generation:4.79s (684.4ms/T), Total:18.37s (2623.7ms/T = 0.38T/s)
Output: Your favorite sport is soccer.

(screenshot)

Here are my logs with the llama.cpp server binary:

llama.cpp server

 = 3.86 GiB (4.57 BPW) 
llm_load_print_meta: general.name     = mistralai_mistral-7b-instruct-v0.2
llm_load_print_meta: BOS token        = 1 '<s>'
llm_load_print_meta: EOS token        = 2 '</s>'
llm_load_print_meta: UNK token        = 0 '<unk>'
llm_load_print_meta: PAD token        = 0 '<unk>'
llm_load_print_meta: LF token         = 13 '<0x0A>'
llm_load_tensors: ggml ctx size =    0.11 MiB
llm_load_tensors:        CPU buffer size =  3947.87 MiB
..................................................................................................
llama_new_context_with_model: n_ctx      = 512
llama_new_context_with_model: freq_base  = 1000000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init:        CPU KV buffer size =    64.00 MiB
llama_new_context_with_model: KV self size  =   64.00 MiB, K (f16):   32.00 MiB, V (f16):   32.00 MiB
llama_new_context_with_model:        CPU input buffer size   =    10.01 MiB
llama_new_context_with_model:        CPU compute buffer size =    72.00 MiB
llama_new_context_with_model: graph splits (measure): 1
Available slots:
 -> Slot 0 - max context: 512
{"timestamp":1709619932,"level":"INFO","function":"main","line":2713,"message":"model loaded"}
all slots are idle and system prompt is empty, clear the KV cache
slot 0 is processing [task id: 0]
slot 0 : kv cache rm - [0, end)

print_timings: prompt eval time =    7938.96 ms /    14 tokens (  567.07 ms per token,     1.76 tokens per second)
print_timings:        eval time =   33946.96 ms /    56 runs   (  606.20 ms per token,     1.65 tokens per second)
print_timings:       total time =   41885.92 ms
slot 0 released (70 tokens in cache)
{"timestamp":1709620055,"level":"INFO","function":"log_server_request","line":2469,"message":"request","remote_addr":"127.0.0.1","remote_port":36862,"status":200,"method":"POST","path":"/v1/chat/completions","params":{}}
slot 0 is processing [task id: 59]
slot 0 : kv cache rm - [0, end)

print_timings: prompt eval time =    6230.06 ms /    11 tokens (  566.37 ms per token,     1.77 tokens per second)
print_timings:        eval time =   30168.38 ms /    48 runs   (  628.51 ms per token,     1.59 tokens per second)
print_timings:       total time =   36398.44 ms
slot 0 released (59 tokens in cache)
{"timestamp":1709620103,"level":"INFO","function":"log_server_request","line":2469,"message":"request","remote_addr":"127.0.0.1","remote_port":54752,"status":200,"method":"POST","path":"/v1/chat/completions","params":{}}
slot 0 is processing [task id: 110]
slot 0 : kv cache rm - [0, end)

print_timings: prompt eval time =    9683.77 ms /    17 tokens (  569.63 ms per token,     1.76 tokens per second)
print_timings:        eval time =    3806.89 ms /     7 runs   (  543.84 ms per token,     1.84 tokens per second)
print_timings:       total time =   13490.66 ms
slot 0 released (24 tokens in cache)
{"timestamp":1709620128,"level":"INFO","function":"log_server_request","line":2469,"message":"request","remote_addr":"127.0.0.1","remote_port":60440,"status":200,"method":"POST","path":"/v1/chat/completions","params":{}}
slot 0 is processing [task id: 120]
slot 0 : kv cache rm - [0, end)

print_timings: prompt eval time =    6225.77 ms /    11 tokens (  565.98 ms per token,     1.77 tokens per second)
print_timings:        eval time =   41423.46 ms /    63 runs   (  657.52 ms per token,     1.52 tokens per second)
print_timings:       total time =   47649.23 ms
slot 0 released (74 tokens in cache)
{"timestamp":1709620188,"level":"INFO","function":"log_server_request","line":2469,"message":"request","remote_addr":"127.0.0.1","remote_port":53026,"status":200,"method":"POST","path":"/v1/chat/completions","params":{}}

(screenshot)

Are you able to reproduce it with any of the llama.cpp / ollama backends? Am I using the wrong API URL?

@josStorer
Owner

Did you change your settings?
It may be that your Max Conversation Length is set to zero.

(screenshot)

@BarfingLemurs
Author

My settings seem OK; here is video footage of the issue:
https://github.com/josStorer/chatGPTBox/assets/128182951/82639627-48e1-4e48-a064-a98eba96a3a0

Is it a Chrome issue or some other operating system issue? I was actually able to use the extension on an Android phone with the Firefox browser; the automatic queries it makes with searches work great.

@josStorer
Owner

If you refresh the conversation page, do the history messages still exist?
Press F12 and open the Network section, then select the completion request, click on Payload, and give me a screenshot.

@BarfingLemurs
Author

If you refresh the conversation page, do the history messages still exist?

No, newly created sessions do not persist after refreshing.

then select the completion request, click on Payload

I'm not sure how to do this part; let me know what to do next.

(screenshot)

@josStorer
Owner

No, newly created sessions do not persist after refreshing.

This is not normal. If an answer completes normally, the conversation page saves it correctly, and when you continue the conversation it is sent as a history message.

If it disappears after refreshing, that means the answer was not considered complete. ChatGPTBox does not store or send failed or interrupted answers as history messages, which is the situation you encountered.

For me, using ollama, answers complete and are stored correctly.

@BarfingLemurs
Author

BarfingLemurs commented Mar 5, 2024

Thank you, I hadn't tested ollama, but this works properly.
(screenshot)

Filling in the model name (e.g. "gemma:2b") is a requirement for ollama to work, along with export OLLAMA_ORIGINS=* on Linux.
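A minimal Linux setup along those lines, as a sketch (the model name is just the one from my test; quoting the wildcard avoids shell globbing):

# allow the extension's origin to call the local ollama server (listens on port 11434 by default)
export OLLAMA_ORIGINS="*"
ollama serve &

# fetch the model whose name is filled into the extension's model field
ollama pull gemma:2b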

I will check those other APIs again.

@BarfingLemurs
Author

Some notable things with tabbyAPI (and other backends):

  • the ending "</" token is not displayed, as shown 36 seconds into the video
  • the chat UI doesn't create a border after the output response, which I think normally happens with the ollama backend
(video: chatgptbox_tabbyAPI.mp4)

@josStorer
Owner

"</" token is actually "</>", and it's rendered as a html element, so not displayed

josStorer added a commit that referenced this issue Mar 23, 2024
…can be customized, thus requiring more condition checks, now the api has better support for ollama and LM Studio) (#631, #648)
@josStorer
Owner

v2.5.1

@BarfingLemurs
Author

Thank you, the conversation is now stored properly and works with the local APIs mentioned, such as tabbyAPI!
