No memory past first conversation for local models? #631

Closed
BarfingLemurs opened this issue Feb 21, 2024 · 12 comments

Comments

@BarfingLemurs

Describe the bug

None of the chat windows support continued conversations for local models. I'm not sure whether this is a bug or simply hasn't been implemented yet. Example:
(screenshot)

When using local model APIs like https://github.com/theroyallab/tabbyAPI, I was unable to continue a conversation; the model only receives my input as if it were the first message.

To Reproduce
Steps to reproduce the behavior:
Using Firefox, enter the local URL:
(screenshot: Screenshot_2024-02-21_14-07-24)

Please complete the following information:

  • OS: Ubuntu
  • Browser: Firefox
  • Extension Version: 2.4.9
    Last Updated: February 6, 2024


@josStorer
Owner

I think there may be an issue with the local model service you're using. I'm confident that ChatGPTBox passes history messages through the API. Perhaps you can check whether there are any log files.
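To rule the extension in or out, you can also send a multi-turn request directly to the backend and see whether it uses the earlier turns. A rough sketch, assuming an OpenAI-compatible endpoint (the URL, port, and message contents are placeholders for your setup):

curl http://localhost:5001/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "", "stream": false, "messages": [
        {"role": "user", "content": "my favorite sport is soccer."},
        {"role": "assistant", "content": "Your favorite sport is soccer."},
        {"role": "user", "content": "repeat that."}]}'

If the answer reflects the earlier turns, the backend handles history correctly and the question becomes what the extension actually sends.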

@BarfingLemurs
Author

I have been trying some other local APIs to see whether the problem is with a specific backend.

With the OpenAI-compatible API of this project, koboldcpp, I can see that the previous messages aren't being sent.

kobold.cpp

Embedded Kobold Lite loaded.
Starting Kobold API on port 5001 at http://localhost:5001/api/
Starting OpenAI Compatible API on port 5001 at http://localhost:5001/v1/
======
Please connect to custom endpoint at http://localhost:5001

Input: {"messages": [{"role": "user", "content": "hi"}], "model": "", "stream": true, "max_tokens": 40000, "temperature": 1}

Processing Prompt (1 / 1 tokens)
Generating (40 / 2047 tokens)
(EOS token triggered!)
ContextLimit: 41/2048, Processing:0.83s (832.0ms/T), Generation:30.47s (761.8ms/T), Total:31.30s (782.5ms/T = 1.28T/s)
Output:  Question: What is 2704537848 to the power of 1/2, to the nearest integer?
Answer: 51931

Input: {"messages": [{"role": "user", "content": "repeat that."}], "model": "", "stream": true, "max_tokens": 40000, "temperature": 1}

Processing Prompt (0 / 0 tokens)
Generating (1 / 2047 tokens)
(EOS token triggered!)
ContextLimit: 2/2048, Processing:0.00s (0.0ms/T), Generation:0.00s (0.0ms/T), Total:0.00s (0.0ms/T = infT/s)
Output:

(screenshot)
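For comparison, if the history were being forwarded, the second request would be expected to carry the earlier turns as well, roughly like this (assistant content abbreviated):

Input: {"messages": [{"role": "user", "content": "hi"}, {"role": "assistant", "content": "Question: What is 2704537848 to the power of 1/2, ... Answer: 51931"}, {"role": "user", "content": "repeat that."}], "model": "", "stream": true, "max_tokens": 40000, "temperature": 1}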

Here's a normal log of what should happen:

Input: {"n": 1, "max_context_length": 1600, "max_length": 120, "rep_pen": 1.1, "temperature": 0.7, "top_p": 0.92, "top_k": 100, "top_a": 0, "typical": 1, "tfs": 1, "rep_pen_range": 320, "rep_pen_slope": 0.7, "sampler_order": [6, 0, 1, 3, 4, 2, 5], "memory": "", "genkey": "KCPP6614", "min_p": 0, "dynatemp_range": 0, "dynatemp_exponent": 1, "presence_penalty": 0, "logit_bias": {}, "prompt": "\n### Instruction:\nmy favorite sport is soccer. repeat that.\n### Response:\n", "quiet": true, "stop_sequence": ["### Instruction:", "### Response:"], "use_default_badwordsids": false}

Processing Prompt (22 / 22 tokens)
Generating (7 / 120 tokens)
(EOS token triggered!)
ContextLimit: 29/1600, Processing:16.62s (755.5ms/T), Generation:4.75s (679.3ms/T), Total:21.38s (3053.6ms/T = 0.33T/s)
Output: Your favorite sport is soccer.

Input: {"n": 1, "max_context_length": 1600, "max_length": 120, "rep_pen": 1.1, "temperature": 0.7, "top_p": 0.92, "top_k": 100, "top_a": 0, "typical": 1, "tfs": 1, "rep_pen_range": 320, "rep_pen_slope": 0.7, "sampler_order": [6, 0, 1, 3, 4, 2, 5], "memory": "", "genkey": "KCPP5896", "min_p": 0, "dynatemp_range": 0, "dynatemp_exponent": 1, "presence_penalty": 0, "logit_bias": {}, "prompt": "\n### Instruction:\nmy favorite sport is soccer. repeat that.\n### Response:\nYour favorite sport is soccer.\n### Instruction:\nrepeat that.\n### Response:\n", "quiet": true, "stop_sequence": ["### Instruction:", "### Response:"], "use_default_badwordsids": false}

Processing Prompt (14 / 14 tokens)
Generating (7 / 120 tokens)
(EOS token triggered!)
ContextLimit: 49/1600, Processing:10.54s (752.9ms/T), Generation:4.80s (686.3ms/T), Total:15.35s (2192.1ms/T = 0.46T/s)
Output: Your favorite sport is soccer.

Input: {"n": 1, "max_context_length": 1600, "max_length": 120, "rep_pen": 1.1, "temperature": 0.7, "top_p": 0.92, "top_k": 100, "top_a": 0, "typical": 1, "tfs": 1, "rep_pen_range": 320, "rep_pen_slope": 0.7, "sampler_order": [6, 0, 1, 3, 4, 2, 5], "memory": "", "genkey": "KCPP9206", "min_p": 0, "dynatemp_range": 0, "dynatemp_exponent": 1, "presence_penalty": 0, "logit_bias": {}, "prompt": "\n### Instruction:\nmy favorite sport is soccer. repeat that.\n### Response:\nYour favorite sport is soccer.\n### Instruction:\nrepeat that.\n### Response:\nYour favorite sport is soccer.\n### Instruction:\ngood. say it once more.\n### Response:\n", "quiet": true, "stop_sequence": ["### Instruction:", "### Response:"], "use_default_badwordsids": false}

Processing Prompt (18 / 18 tokens)
Generating (7 / 120 tokens)
(EOS token triggered!)
ContextLimit: 73/1600, Processing:13.57s (754.2ms/T), Generation:4.79s (684.4ms/T), Total:18.37s (2623.7ms/T = 0.38T/s)
Output: Your favorite sport is soccer.

(screenshot)

Here are my logs with the llama.cpp server binary:

llama.cpp server

 = 3.86 GiB (4.57 BPW) 
llm_load_print_meta: general.name     = mistralai_mistral-7b-instruct-v0.2
llm_load_print_meta: BOS token        = 1 '<s>'
llm_load_print_meta: EOS token        = 2 '</s>'
llm_load_print_meta: UNK token        = 0 '<unk>'
llm_load_print_meta: PAD token        = 0 '<unk>'
llm_load_print_meta: LF token         = 13 '<0x0A>'
llm_load_tensors: ggml ctx size =    0.11 MiB
llm_load_tensors:        CPU buffer size =  3947.87 MiB
..................................................................................................
llama_new_context_with_model: n_ctx      = 512
llama_new_context_with_model: freq_base  = 1000000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init:        CPU KV buffer size =    64.00 MiB
llama_new_context_with_model: KV self size  =   64.00 MiB, K (f16):   32.00 MiB, V (f16):   32.00 MiB
llama_new_context_with_model:        CPU input buffer size   =    10.01 MiB
llama_new_context_with_model:        CPU compute buffer size =    72.00 MiB
llama_new_context_with_model: graph splits (measure): 1
Available slots:
 -> Slot 0 - max context: 512
{"timestamp":1709619932,"level":"INFO","function":"main","line":2713,"message":"model loaded"}
all slots are idle and system prompt is empty, clear the KV cache
slot 0 is processing [task id: 0]
slot 0 : kv cache rm - [0, end)

print_timings: prompt eval time =    7938.96 ms /    14 tokens (  567.07 ms per token,     1.76 tokens per second)
print_timings:        eval time =   33946.96 ms /    56 runs   (  606.20 ms per token,     1.65 tokens per second)
print_timings:       total time =   41885.92 ms
slot 0 released (70 tokens in cache)
{"timestamp":1709620055,"level":"INFO","function":"log_server_request","line":2469,"message":"request","remote_addr":"127.0.0.1","remote_port":36862,"status":200,"method":"POST","path":"/v1/chat/completions","params":{}}
slot 0 is processing [task id: 59]
slot 0 : kv cache rm - [0, end)

print_timings: prompt eval time =    6230.06 ms /    11 tokens (  566.37 ms per token,     1.77 tokens per second)
print_timings:        eval time =   30168.38 ms /    48 runs   (  628.51 ms per token,     1.59 tokens per second)
print_timings:       total time =   36398.44 ms
slot 0 released (59 tokens in cache)
{"timestamp":1709620103,"level":"INFO","function":"log_server_request","line":2469,"message":"request","remote_addr":"127.0.0.1","remote_port":54752,"status":200,"method":"POST","path":"/v1/chat/completions","params":{}}
slot 0 is processing [task id: 110]
slot 0 : kv cache rm - [0, end)

print_timings: prompt eval time =    9683.77 ms /    17 tokens (  569.63 ms per token,     1.76 tokens per second)
print_timings:        eval time =    3806.89 ms /     7 runs   (  543.84 ms per token,     1.84 tokens per second)
print_timings:       total time =   13490.66 ms
slot 0 released (24 tokens in cache)
{"timestamp":1709620128,"level":"INFO","function":"log_server_request","line":2469,"message":"request","remote_addr":"127.0.0.1","remote_port":60440,"status":200,"method":"POST","path":"/v1/chat/completions","params":{}}
slot 0 is processing [task id: 120]
slot 0 : kv cache rm - [0, end)

print_timings: prompt eval time =    6225.77 ms /    11 tokens (  565.98 ms per token,     1.77 tokens per second)
print_timings:        eval time =   41423.46 ms /    63 runs   (  657.52 ms per token,     1.52 tokens per second)
print_timings:       total time =   47649.23 ms
slot 0 released (74 tokens in cache)
{"timestamp":1709620188,"level":"INFO","function":"log_server_request","line":2469,"message":"request","remote_addr":"127.0.0.1","remote_port":53026,"status":200,"method":"POST","path":"/v1/chat/completions","params":{}}

(screenshot)

Are you able to reproduce it with any of the llama.cpp / ollama backends? Am I using the wrong API URL?

@josStorer
Owner

Did you change your settings?
It may be that your Max Conversation Length is set to zero.

(screenshot)

@BarfingLemurs
Author

My settings seem OK; here is video footage of the issue:
https://github.com/josStorer/chatGPTBox/assets/128182951/82639627-48e1-4e48-a064-a98eba96a3a0

Is it a Chrome issue or some other operating system issue? I was actually able to use the extension on an Android phone with the Firefox browser; the automatic queries it makes with searches work great.

@josStorer
Owner

If you refresh the conversation page, do the history messages still exist?
Press F12 and open the Network section, then select the completion request, click on Payload, and give me a screenshot.

@BarfingLemurs
Author

If you refresh the conversation page, do the history messages still exist?

No, newly created sessions do not persist after refreshing.

then select the completion request, click on Payload

I'm not sure how to do this part; let me know what to do next.

(screenshot)

@josStorer
Owner

No, newly created sessions do not persist after refreshing.

This is not normal. If an answer completes normally, the conversation page saves it correctly, and when you continue the conversation it is sent as a history message.

If it disappears after refreshing, that means the answer was not considered complete. ChatGPTBox does not store or send failed or interrupted answers as history messages, which is the situation you encountered.

For me, using ollama, answers complete and are stored correctly.

@BarfingLemurs
Author

BarfingLemurs commented Mar 5, 2024

Thank you, I hadn't tested ollama, but this works properly.
(screenshot)

Filling in the model name (e.g. "gemma:2b") is a requirement for ollama to work, along with export OLLAMA_ORIGINS=* on Linux.
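A minimal Linux setup along those lines, as a sketch (the model name is just the one from my test; quoting the wildcard avoids shell globbing):

# allow the extension's origin to call the local ollama server (listens on port 11434 by default)
export OLLAMA_ORIGINS="*"
ollama serve &

# fetch the model whose name is filled into the extension's model field
ollama pull gemma:2b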

I will check those other APIs again.

@BarfingLemurs
Author

Some notable things with tabbyAPI (and other backends):

  • the ending "</" token is not displayed, as shown 36 seconds into the video
  • the chat UI doesn't create a border after the output response, which I think normally happens with the ollama backend
(video: chatgptbox_tabbyAPI.mp4)

@josStorer
Owner

"</" token is actually "</>", and it's rendered as a html element, so not displayed

josStorer added a commit that referenced this issue Mar 23, 2024
…can be customized, thus requiring more condition checks, now the api has better support for ollama and LM Studio) (#631, #648)
@josStorer
Owner

v2.5.1

@BarfingLemurs
Author

Thank you, the conversation is now stored properly and works with the local APIs mentioned, such as tabbyAPI!
