
[api, openai] 8k fixes, lora api updates, and bug fixes#2942

Closed
matatonic wants to merge 27 commits into oobabooga:main from matatonic:8k_loras_fixes

Conversation

@matatonic
Contributor

@matatonic matatonic commented Jun 30, 2023

LoRA fixes and updates for the model API, plus fixes for compatibility between GUI LoRA loading and the APIs (e.g., shared.settings is synchronized so truncation_length can be updated properly between the GUI, the command line, and the APIs).

Test results across 100 models and 10 different API calls pass as expected.

@matatonic matatonic marked this pull request as draft June 30, 2023 09:02
@matatonic matatonic marked this pull request as ready for review June 30, 2023 16:00
@matatonic matatonic changed the title [WIP] 8k fixes + lora api [api, openai] 8k fixes, lora api updates, and bug fixes Jun 30, 2023
@matatonic
Contributor Author

@tensiondriven, @chigkim, @g588928812 - if you have time to test this PR, please let me know how it goes. Thanks!

@chigkim

chigkim commented Jun 30, 2023

For some reason, it doesn't load the model when launching the server.
There should be a message about loading the model.

2023-06-30 17:58:01 WARNING:The gradio "share link" feature uses a proprietary executable to create a reverse tunnel. Use it with care.
2023-06-30 17:58:03.432297: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT
bin /usr/local/lib/python3.10/dist-packages/bitsandbytes/libbitsandbytes_cuda118.so
2023-06-30 17:58:06 INFO:Loading the extension "openai"...
2023-06-30 17:58:06 INFO:Loading the extension "superbooga"...
2023-06-30 17:58:06 INFO:Intercepting all calls to posthog :)

Loaded embedding model: all-mpnet-base-v2, max sequence length: 384
2023-06-30 17:58:10 WARNING:Using embedded DuckDB without persistence: data will be transient
2023-06-30 17:58:11 WARNING:Using embedded DuckDB without persistence: data will be transient
2023-06-30 17:58:11 INFO:Loading the extension "gallery"...
Running on local URL:  http://127.0.0.1:7860
Starting OpenAI compatible api at
OPENAI_API_BASE=https://...
Running on public URL: https://...

This share link expires in 72 hours. For free permanent hosting and GPU upgrades, run `gradio deploy` from Terminal to deploy to Spaces (https://huggingface.co/spaces)

@chigkim

chigkim commented Jun 30, 2023

I was able to load the model from the web gui.

  1. It still discards the message if I send a prompt around 7K via the OpenAI API:
    "Warning: too many messages for context size, dropping 1 oldest message(s)."
  2. However, if I send the same prompt in the web GUI, it takes it without an error.
  3. If I send a prompt around 5K via the OpenAI API, it throws an out-of-memory error:
Traceback (most recent call last):
  File "/content/text-generation-webui/modules/callbacks.py", line 55, in gentask
    ret = self.mfunc(callback=_callback, *args, **self.kwargs)
  File "/content/text-generation-webui/modules/text_generation.py", line 289, in generate_with_callback
    shared.model.generate(**kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/transformers/generation/utils.py", line 1572, in generate
    return self.sample(
  File "/usr/local/lib/python3.10/dist-packages/transformers/generation/utils.py", line 2619, in sample
    outputs = self(
  File "/content/text-generation-webui/modules/exllama_hf.py", line 57, in __call__
    self.ex_model.forward(torch.tensor([seq[:-1]], dtype=torch.long), cache, preprocess_only=True, lora=self.lora)
  File "/usr/local/lib/python3.10/dist-packages/exllama/model.py", line 844, in forward
    r = self._forward(input_ids[:, chunk_begin : chunk_end],
  File "/usr/local/lib/python3.10/dist-packages/exllama/model.py", line 925, in _forward
    hidden_states = decoder_layer.forward(hidden_states, cache, buffers[device], lora)
  File "/usr/local/lib/python3.10/dist-packages/exllama/model.py", line 468, in forward
    hidden_states = self.self_attn.forward(hidden_states, cache, buffer, lora)
  File "/usr/local/lib/python3.10/dist-packages/exllama/model.py", line 423, in forward
    attn_output = F.scaled_dot_product_attention(query_states, key_states, value_states, attn_mask = buffer.attn_mask, is_causal = False)
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 320.00 MiB (GPU 0; 14.75 GiB total capacity; 13.34 GiB already allocated; 226.81 MiB
free; 13.44 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  See documentation
for Memory Management and PYTORCH_CUDA_ALLOC_CONF
Output generated in 3.93 seconds (0.00 tokens/s, 0 tokens, context 4840, seed 351642880)

@matatonic
Contributor Author

matatonic commented Jun 30, 2023

@chigkim

  • --model loading fixed, sorry I didn't test that, thanks for the report
  • can you share the other parameters in the prompt? Which max_tokens and API endpoint are you using (Chat/Completions/Edit)? Also, which model are you loading? Please share the model loading message and parameters. If you're using a LoRA to get 8k context, please share which one. Please also share your GPU VRAM and other settings. Thank you!

Ideally please enable OPENEDAI_DEBUG=1 and share the complete log with me (perhaps via pastebin), thank you.

Another thing that is immensely helpful is if you can run this Python snippet after the model is loaded (update the IP/port if needed):

import requests
print(requests.post('http://0.0.0.0:5000/api/v1/model', json={'action': 'info'}).json())

@chigkim

chigkim commented Jun 30, 2023

I left everything as default out of the box except the ones specified below, and I'm using Colab T4. I'm not using lora.

python download-model.py TheBloke/Wizard-Vicuna-13B-Uncensored-SuperHOT-8K-GPTQ --output models
python server.py --verbose --notebook --share --chat --model-dir models --model Wizard-Vicuna-13B-Uncensored-SuperHOT-8K-GPTQ --model_type llama --wbits 4 --groupsize 128 --loader exllama_hf --max_seq_len 8192 --compress_pos_emb 4 --xformers --no-stream --extensions openai

2023-06-30 22:47:48 WARNING:The gradio "share link" feature uses a proprietary executable to create a reverse tunnel. Use it with care.
2023-06-30 22:47:51.749475: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT
bin /usr/local/lib/python3.10/dist-packages/bitsandbytes/libbitsandbytes_cuda118.so
Loading Wizard-Vicuna-13B-Uncensored-SuperHOT-8K-GPTQ...
2023-06-30 22:47:55 INFO:Loading Wizard-Vicuna-13B-Uncensored-SuperHOT-8K-GPTQ...
2023-06-30 22:49:50 INFO:Replaced attention with xformers_attention
2023-06-30 22:49:50 INFO:Loaded the model in 114.68 seconds.

Successfully loaded Wizard-Vicuna-13B-Uncensored-SuperHOT-8K-GPTQ
2023-06-30 22:49:50 INFO:Loading the extension "openai"...
2023-06-30 22:49:51 INFO:Loading the extension "gallery"...
Downloading (…)a8e1d/.gitattributes: 100% 1.18k/1.18k [00:00<00:00, 5.55MB/s]
Downloading (…)_Pooling/config.json: 100% 190/190 [00:00<00:00, 995kB/s]
Downloading (…)b20bca8e1d/README.md: 100% 10.6k/10.6k [00:00<00:00, 43.6MB/s]
Downloading (…)0bca8e1d/config.json: 100% 571/571 [00:00<00:00, 3.74MB/s]
Downloading (…)ce_transformers.json: 100% 116/116 [00:00<00:00, 665kB/s]
Downloading (…)e1d/data_config.json: 100% 39.3k/39.3k [00:00<00:00, 21.4MB/s]
Downloading pytorch_model.bin:  22% 94.4M/438M [00:01<00:05, 67.4MB/s]
Downloading pytorch_model.bin: 100% 438M/438M [00:08<00:00, 52.9MB/s]
Downloading (…)nce_bert_config.json: 100% 53.0/53.0 [00:00<00:00, 284kB/s]
Downloading (…)cial_tokens_map.json: 100% 239/239 [00:00<00:00, 1.47MB/s]
Downloading (…)a8e1d/tokenizer.json: 100% 466k/466k [00:00<00:00, 14.4MB/s]
Downloading (…)okenizer_config.json: 100% 363/363 [00:00<00:00, 2.11MB/s]
Downloading (…)8e1d/train_script.py: 100% 13.1k/13.1k [00:00<00:00, 49.5MB/s]
Downloading (…)b20bca8e1d/vocab.txt: 100% 232k/232k [00:00<00:00, 101MB/s]
Downloading (…)bca8e1d/modules.json: 100% 349/349 [00:00<00:00, 2.15MB/s]
Running on local URL:  http://127.0.0.1:7860
Running on public URL: https://62f2646077001b3abb.gradio.live

This share link expires in 72 hours. For free permanent hosting and GPU upgrades, run `gradio deploy` from Terminal to deploy to Spaces (https://huggingface.co/spaces)

Loaded embedding model: all-mpnet-base-v2, max sequence length: 384
Starting OpenAI compatible api at
OPENAI_API_BASE=https://makers-asylum-vid-instrumental.trycloudflare.com/v1
127.0.0.1 - - [30/Jun/2023 22:51:04] "POST /v1/chat/completions HTTP/1.1" 200 -

@matatonic
Contributor Author

@chigkim
I'm testing with that model now, and I also tried xformers just to be sure, but it's working for me. I didn't get any out of memory errors (I have 24G) or other problems. This is what the latest code will show when you hit the limits:

Warning: too many messages for context size, dropping 2 oldest message(s).
truncation_length: 8192, system_prompt: 32, remaining_tokens: 8144, new size would be 8161 tokens.
Warning: Ignoring max_new_tokens (8192), too large for the remaining context. Remaining tokens: 231
Warning: Set max_new_tokens = 231
...
Output generated in 4.87 seconds (22.80 tokens/s, 111 tokens, context 8081, seed 609841173)

TheBloke/Wizard-Vicuna-13B-Uncensored-SuperHOT-8K-GPTQ uses about 16G at peak usage (8192 length, 4x compression).

With the more recent update there is a bit more information printed when messages are dropped. Can you confirm you're still having problems?

@chigkim

chigkim commented Jul 1, 2023

I experimented with various conditions, changing VRAM size as well as context size, and this is what happened.

  1. OpenAI API, V100 GPU, 10K prompt:
127.0.0.1 - - [01/Jul/2023 11:36:43] "POST /v1/chat/completions HTTP/1.1" 200 -
Warning: too many messages for context size, dropping 1 oldest message(s).
truncation_length: 8192, system_prompt: 51, remaining_tokens: 8125, new size would be 11957 tokens.
Output generated in 8.91 seconds (2.25 tokens/s, 20 tokens, context 55, seed 1322609618)
  2. WebUI, V100 GPU, 10K prompt:
Output generated in 10.48 seconds (13.65 tokens/s, 143 tokens, context 7992, seed 1068782772)

It looks like the WebUI truncates part of the prompt and just generates, whereas the OpenAI API completely removes the last message.

Then I tried with T4 GPU which has less VRAM.

  3. OpenAI API, T4 GPU, 10K prompt:
127.0.0.1 - - [01/Jul/2023 12:05:44] "POST /v1/chat/completions HTTP/1.1" 200 -
Warning: too many messages for context size, dropping 1 oldest message(s).
truncation_length: 8192, system_prompt: 51, remaining_tokens: 8125, new size would be 11957 tokens.
Output generated in 12.40 seconds (1.61 tokens/s, 20 tokens, context 55, seed 802978709)
  4. WebUI, T4 GPU, 10K prompt:
Traceback (most recent call last):
  File "/content/text-generation-webui/modules/callbacks.py", line 55, in gentask
    ret = self.mfunc(callback=_callback, *args, **self.kwargs)
  File "/content/text-generation-webui/modules/text_generation.py", line 289, in generate_with_callback
    shared.model.generate(**kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/transformers/generation/utils.py", line 1572, in generate
    return self.sample(
  File "/usr/local/lib/python3.10/dist-packages/transformers/generation/utils.py", line 2619, in sample
    outputs = self(
  File "/content/text-generation-webui/modules/exllama_hf.py", line 57, in __call__
    self.ex_model.forward(torch.tensor([seq[:-1]], dtype=torch.long), cache, preprocess_only=True, lora=self.lora)
  File "/usr/local/lib/python3.10/dist-packages/exllama/model.py", line 844, in forward
    r = self._forward(input_ids[:, chunk_begin : chunk_end],
  File "/usr/local/lib/python3.10/dist-packages/exllama/model.py", line 925, in _forward
    hidden_states = decoder_layer.forward(hidden_states, cache, buffers[device], lora)
  File "/usr/local/lib/python3.10/dist-packages/exllama/model.py", line 468, in forward
    hidden_states = self.self_attn.forward(hidden_states, cache, buffer, lora)
  File "/usr/local/lib/python3.10/dist-packages/exllama/model.py", line 423, in forward
    attn_output = F.scaled_dot_product_attention(query_states, key_states, value_states, attn_mask = buffer.attn_mask, is_causal = False)
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 320.00 MiB (GPU 0; 14.75 GiB total capacity; 13.34 GiB already allocated; 226.81 MiB free; 13.44 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
Output generated in 6.57 seconds (0.00 tokens/s, 0 tokens, context 7992, seed 476360153)

The OpenAI API truncates the last message completely, and the WebUI throws an error.

Then I tried with 4K prompt.

  5. OpenAI API, T4 GPU, 4K prompt:
127.0.0.1 - - [01/Jul/2023 12:09:03] "POST /v1/chat/completions HTTP/1.1" 200 -
Traceback (most recent call last):
  File "/content/text-generation-webui/modules/callbacks.py", line 55, in gentask
    ret = self.mfunc(callback=_callback, *args, **self.kwargs)
  File "/content/text-generation-webui/modules/text_generation.py", line 289, in generate_with_callback
    shared.model.generate(**kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/transformers/generation/utils.py", line 1572, in generate
    return self.sample(
  File "/usr/local/lib/python3.10/dist-packages/transformers/generation/utils.py", line 2619, in sample
    outputs = self(
  File "/content/text-generation-webui/modules/exllama_hf.py", line 57, in __call__
    self.ex_model.forward(torch.tensor([seq[:-1]], dtype=torch.long), cache, preprocess_only=True, lora=self.lora)
  File "/usr/local/lib/python3.10/dist-packages/exllama/model.py", line 844, in forward
    r = self._forward(input_ids[:, chunk_begin : chunk_end],
  File "/usr/local/lib/python3.10/dist-packages/exllama/model.py", line 925, in _forward
    hidden_states = decoder_layer.forward(hidden_states, cache, buffers[device], lora)
  File "/usr/local/lib/python3.10/dist-packages/exllama/model.py", line 468, in forward
    hidden_states = self.self_attn.forward(hidden_states, cache, buffer, lora)
  File "/usr/local/lib/python3.10/dist-packages/exllama/model.py", line 423, in forward
    attn_output = F.scaled_dot_product_attention(query_states, key_states, value_states, attn_mask = buffer.attn_mask, is_causal = False)
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 320.00 MiB (GPU 0; 14.75 GiB total capacity; 13.34 GiB already allocated; 226.81 MiB free; 13.44 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
Output generated in 5.59 seconds (0.00 tokens/s, 0 tokens, context 4840, seed 475630363)
  6. WebUI, T4 GPU, 4K prompt:
Traceback (most recent call last):
  File "/content/text-generation-webui/modules/callbacks.py", line 55, in gentask
    ret = self.mfunc(callback=_callback, *args, **self.kwargs)
  File "/content/text-generation-webui/modules/text_generation.py", line 289, in generate_with_callback
    shared.model.generate(**kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/transformers/generation/utils.py", line 1572, in generate
    return self.sample(
  File "/usr/local/lib/python3.10/dist-packages/transformers/generation/utils.py", line 2619, in sample
    outputs = self(
  File "/content/text-generation-webui/modules/exllama_hf.py", line 57, in __call__
    self.ex_model.forward(torch.tensor([seq[:-1]], dtype=torch.long), cache, preprocess_only=True, lora=self.lora)
  File "/usr/local/lib/python3.10/dist-packages/exllama/model.py", line 844, in forward
    r = self._forward(input_ids[:, chunk_begin : chunk_end],
  File "/usr/local/lib/python3.10/dist-packages/exllama/model.py", line 925, in _forward
    hidden_states = decoder_layer.forward(hidden_states, cache, buffers[device], lora)
  File "/usr/local/lib/python3.10/dist-packages/exllama/model.py", line 468, in forward
    hidden_states = self.self_attn.forward(hidden_states, cache, buffer, lora)
  File "/usr/local/lib/python3.10/dist-packages/exllama/model.py", line 423, in forward
    attn_output = F.scaled_dot_product_attention(query_states, key_states, value_states, attn_mask = buffer.attn_mask, is_causal = False)
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 320.00 MiB (GPU 0; 14.75 GiB total capacity; 13.34 GiB already allocated; 226.81 MiB free; 13.44 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
Output generated in 3.59 seconds (0.00 tokens/s, 0 tokens, context 4760, seed 238691795)

Both throw an error.

Lastly, if I feed a 2K prompt on the T4 GPU, it works as expected from both the OpenAI API and the WebUI.

Hope this helps...

@matatonic
Contributor Author

matatonic commented Jul 1, 2023

Thanks so much for the reports, they're really helpful. For the T4 systems I don't think there's anything we can do about the OOM; you need to use a smaller context or a larger GPU for that.

This is an implementation choice:

Current: drop the oldest messages until there is enough context room to generate new tokens.
Simple and clean, but it ignores max_new_tokens and will proceed as long as 16 tokens are free (I picked 16 as the minimum). This has the very unfortunate effect of sometimes dropping huge, detailed messages to make room. Example: in your 10k prompt scenario, it will just ignore a 10k prompt in an 8k context.

Possible: truncation of an old message.
In this case we cut the context to allow generation of <number> tokens. The last message that partly fits in context gets cut mid-sentence, with something like '...' stuck in front of it. This leaves room to generate <number> tokens before the context is exhausted. In some implementations <number> would be max_new_tokens, but there is no data field in the OpenAI API for min_tokens, so the only options I have are to hard-code a min_tokens (currently 16) or to use max_new_tokens.

Which do you prefer? Maybe a vote: thumbs up for Current, thumbs down to opt for "Truncation of an old message". If you vote thumbs down, please include a comment about what <number> should be.

Any other suggestions for how to handle this?

TL;DR
The context won't fit and we need to make room to generate new tokens, so which tokens are dropped, and how many?
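To make the two options concrete, here is a rough sketch of both strategies. This is not the actual extension code: count_tokens() is a crude stand-in for the real tokenizer, and the function names are hypothetical.

```python
def count_tokens(text):
    # Crude stand-in for a real tokenizer: one token per whitespace word.
    return len(text.split())

def drop_oldest(messages, budget, min_free=16):
    # "Current" strategy: drop whole messages, oldest first, until at
    # least min_free tokens remain for generation.
    messages = list(messages)
    while messages and sum(count_tokens(m) for m in messages) > budget - min_free:
        messages.pop(0)
    return messages

def truncate_oldest(messages, budget, reserve=16):
    # "Possible" strategy: keep newest messages whole; cut the oldest
    # message that partly fits, keeping its tail prefixed with '...',
    # so `reserve` tokens stay free for generation.
    kept, used = [], 0
    for m in reversed(messages):
        n = count_tokens(m)
        if used + n <= budget - reserve:
            kept.insert(0, m)
            used += n
        else:
            room = budget - reserve - used
            if room > 0:
                kept.insert(0, '...' + ' '.join(m.split()[-room:]))
            break
    return kept
```

With a budget of 8 tokens and 2 reserved, drop_oldest discards the oldest message entirely, while truncate_oldest keeps its tail.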

Update: scrap this, it's just going to be an error for compatibility with openai.

@chigkim

chigkim commented Jul 1, 2023

Actually I do not have a strong opinion on one over the other.
However, I would like consistency between the OpenAI API and the Web UI regardless of which behavior ends up getting implemented. They should behave the same way.
Right now, cases 1 vs 2 and 3 vs 4 yield different results.

@matatonic
Contributor Author

Actually, I'm sorry for even asking; I should have just looked up what OpenAI does. It simply returns an error saying the prompt is too large, and you have to decide for yourself which messages to drop; this factors in whatever you have chosen for max_tokens. So the max prompt size is: truncation_length - max_tokens. The default max_tokens is unlimited ('whatever is left'), so I'm going to reproduce this behaviour and update this PR.
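That check can be sketched roughly as follows. The function name is hypothetical and token counting is taken as given; a real implementation would count with the model's tokenizer, but the arithmetic (reject when prompt + max_tokens exceeds truncation_length) is the point.

```python
# Sketch of OpenAI-style length handling: reject instead of silently
# dropping messages. Illustrative only, not the extension's code.
def check_context(prompt_tokens, max_tokens, truncation_length=2048):
    # max_tokens=None means "whatever is left" after the prompt,
    # so only the prompt itself has to fit.
    requested = prompt_tokens + (max_tokens or 0)
    if requested > truncation_length:
        raise ValueError(
            f"This model's maximum context length is {truncation_length} "
            f"tokens, however you requested {requested} tokens "
            f"({prompt_tokens} in your prompt; {max_tokens} for the "
            f"completion). Please reduce your prompt or completion length."
        )

check_context(prompt_tokens=1000, max_tokens=500)  # fits, no error
```

A 10K-token prompt against an 8192 truncation_length would raise instead of dropping the message, matching OpenAI's behaviour.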

@matatonic
Contributor Author

I'm choosing to behave like OpenAI for compatibility, not to work like the WebUI, which has its own API already.

@chigkim

chigkim commented Jul 1, 2023

That makes sense.
Could you also look into why case 3 (OpenAI, T4, 10K prompt) vs case 5 (OpenAI, T4, 4K prompt) have different behavior?
Neither fits in VRAM, yet they produce different results.

@matatonic
Contributor Author

matatonic commented Jul 1, 2023

That makes sense. Could you also look into why case 3 (OpenAI, T4, 10K prompt) vs case 5 (OpenAI, T4, 4K prompt) have different behavior? Neither fits in VRAM, yet they produce different results.

The 10K prompt was completely dropped, Output generated in 12.40 seconds (1.61 tokens/s, 20 tokens, context 55, seed 802978709) -- context 55 is probably just the default system prompt + extras.

The 4K prompt was loaded, but caused the OOM error.
Output generated in 5.59 seconds (0.00 tokens/s, 0 tokens, context 4840, seed 475630363) -- context 4840.

I've updated a lot to handle length properly, and also overhauled errors and parameter handling in general. The behaviour is now much more similar to how OpenAI handles length problems.

I did find one remaining bug not fixed here: truncation_length is not updated if the model doesn't set it. This update also adds API support for model settings, so you can set truncation_length with an API call like so:

import requests, sys
t_len = int(sys.argv[1]) if len(sys.argv) > 1 else 8192
print(requests.post('http://0.0.0.0:5000/api/v1/model', json={'action': 'settings', 'settings': {'truncation_length': t_len}}).json()['result']['shared.settings']['truncation_length'])

Or from the command line like so (Example):

python -c "import requests; print(requests.post('http://0.0.0.0:5000/api/v1/model', json={'action': 'settings', 'settings': {'truncation_length': 8192}}).json()['result']['shared.settings']['truncation_length'])"

Right now this is the only way to set an 8192 truncation_length on a 2048 model + 8K LoRA; the WebUI does not update the server setting, so setting it there is not effective. You can, however, set it via models/config-user.yaml.

@FartyPants
Contributor

See my #2951.
lora_names should go in unload_model, as this is an issue for the entire UI, not just the API.

@stoplime

stoplime commented Jul 5, 2023

@matatonic Thanks for fixing the max_token being stuck at 200 tokens. This PR fixes that!

@matatonic
Contributor Author

New PR coming, a bunch more changes but I will split out openai & other changes.

@tensiondriven
Contributor

Very exciting! I saw from the commit messages that you addressed stopping_strings; I was having an issue with stopping_strings not being respected when using exllama, but apparently it works with exllama_hf and a specific branch of exllama. I am also running into issues with no_repeat_ngram_size, just wanted to mention that in case it was something you're addressing. (no_repeat_ngram_size is increasingly important for getting good output at longer context lengths.)

@atisharma

Is there anything holding up this merge? I can test if that's what's needed.

@matatonic
Contributor Author

I've moved all the openai updates and a bunch of new openai additions to another PR; I will update this PR and remove them when I get some more time.

@matatonic
Contributor Author

@tensiondriven I don't know much about no_repeat_ngram_size, but it's fixed to 0 with the openai api, do you have another suggestion?

@tensiondriven
Contributor

@tensiondriven I don't know much about no_repeat_ngram_size, but it's fixed to 0 with the openai api, do you have another suggestion?

I was thinking mainly about the built-in api; if the OpenAI API doesn't support it, then I suppose it doesn't make sense to allow it to be specified per-request, unless we're already allowing for "extra params".

A possible solution would be to allow it to be specified on the command line, so it would apply to all generations, but that sort of blurs the line between inference-time params and model-load-time params.

I don't feel too strongly on this with regard to the OpenAI api; my main concern is being able to get decent control of generations when using the builtin api.

If it's possible to emit a warning on ignored/unexpected/invalid params, I think that would improve ergonomics for people using the API(s), but that may not be practical.

@oobabooga
Owner

@matatonic is this PR ready? Could you please merge the main branch?

@matatonic
Contributor Author

@matatonic is this PR ready? Could you please merge the main branch?

No, sorry, this is pretty stale. I've been on holiday the last couple of weeks but will be able to put in some more time near the end of September.

@oobabooga
Owner

Awesome, thank you :)

@teddybear082

teddybear082 commented Oct 26, 2023

@matatonic @oobabooga, quick question: do you think adding the following after line 53 in https://github.com/oobabooga/text-generation-webui/blob/main/modules/models_settings.py:

model_settings['truncation_length'] = metadata['llama.context_length']
model_settings['max_sequence_length'] = metadata['llama.context_length']

Similar to what is done for transformers models, would fix the bug in the openai API extension where truncation length is always stuck at 2048? (The only consistent workaround I've found for people using GGUF right now is hard-coding a new truncation length at line 76 of completions.py in the OpenAI extension.) I think matatonic said the shared settings are not updated before the extension uses them, so maybe this doesn't help, but I figured I would ask. (It may basically be an order-of-operations problem: textgen is loaded, shared settings are loaded, openai takes these and makes them its settings, then the model is loaded with its settings, but by that point openai has already set its variables, so this change would also not help. Not sure.)

EDIT: on quick testing, this indeed appears to fix the issue for me. I still have to upgrade to the latest text-generation-webui and test there, but, for example, using a Synthia 7B GGUF model with no special .yaml and no change to completions.py, I was able to get a context over 2048:

Output generated in 18.87 seconds (4.08 tokens/s, 77 tokens, context 3311, seed 1310051720)
data: {"id": "chatcmpl-1698319737724307456", "object": "chat.completions.chunk", "created": 1698319737, "model": "Synthia-7B-v1.3.q4_k_s.gguf", "choices": [{"index": 0, "finish_reason": "stop", "message": {"role": "assistant", "content": ""}, "delta": {"role": "assistant", "content": ""}}], "usage": {"prompt_tokens": 3311, "completion_tokens": 74, "total_tokens": 3385}}

. . .

Output generated in 12.99 seconds (3.77 tokens/s, 49 tokens, context 3415, seed 1451077598)
data: {"id": "chatcmpl-1698319778796408320", "object": "chat.completions.chunk", "created": 1698319778, "model": "Synthia-7B-v1.3.q4_k_s.gguf", "choices": [{"index": 0, "finish_reason": "stop", "message": {"role": "assistant", "content": ""}, "delta": {"role": "assistant", "content": ""}}], "usage": {"prompt_tokens": 3415, "completion_tokens": 50, "total_tokens": 3465}}

@oobabooga
Owner

I am working on a big revision to the OpenAI extension in #4430 that may fix the bugs here. Any feedback on that PR is welcome.

A result so far is that I converted the API to FastAPI, so opening 127.0.0.1:5000/docs now shows the documentation.

@oobabooga
Owner

I think that most of the issues here should be fixed after #4430, and the code in this PR is now outdated after the changes in that PR.

If any of the issues are still present, a PR would be very much welcome.

@oobabooga oobabooga closed this Nov 6, 2023
@matatonic matatonic deleted the 8k_loras_fixes branch January 21, 2024 21:03