
[api, openai] 8k fixes, lora api updates, and bug fixes#2942

Closed
matatonic wants to merge 27 commits into oobabooga:main from matatonic:8k_loras_fixes

Conversation

@matatonic
Contributor

@matatonic matatonic commented Jun 30, 2023

LoRA fixes and updates for the model API, plus fixes for compatibility between GUI LoRA loading and the APIs (e.g., shared.settings is synchronized so truncation_length can be updated properly between the GUI, the command line, and the APIs).

Test results across 100 models and 10 different API calls pass as expected.

@matatonic matatonic marked this pull request as draft June 30, 2023 09:02
@matatonic matatonic marked this pull request as ready for review June 30, 2023 16:00
@matatonic matatonic changed the title [WIP] 8k fixes + lora api [api, openai] 8k fixes, lora api updates, and bug fixes Jun 30, 2023
@matatonic
Contributor Author

@tensiondriven, @chigkim, @g588928812 - if you have time to test this PR, please let me know how it goes. Thanks!

@chigkim

chigkim commented Jun 30, 2023

For some reason, it doesn't load the model when launching the server.
There should be a message about loading the model.

2023-06-30 17:58:01 WARNING:The gradio "share link" feature uses a proprietary executable to create a reverse tunnel. Use it with care.
2023-06-30 17:58:03.432297: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT
bin /usr/local/lib/python3.10/dist-packages/bitsandbytes/libbitsandbytes_cuda118.so
2023-06-30 17:58:06 INFO:Loading the extension "openai"...
2023-06-30 17:58:06 INFO:Loading the extension "superbooga"...
2023-06-30 17:58:06 INFO:Intercepting all calls to posthog :)

Loaded embedding model: all-mpnet-base-v2, max sequence length: 384
2023-06-30 17:58:10 WARNING:Using embedded DuckDB without persistence: data will be transient
2023-06-30 17:58:11 WARNING:Using embedded DuckDB without persistence: data will be transient
2023-06-30 17:58:11 INFO:Loading the extension "gallery"...
Running on local URL:  http://127.0.0.1:7860
Starting OpenAI compatible api at
OPENAI_API_BASE=https://...
Running on public URL: https://...

This share link expires in 72 hours. For free permanent hosting and GPU upgrades, run `gradio deploy` from Terminal to deploy to Spaces (https://huggingface.co/spaces)

@chigkim

chigkim commented Jun 30, 2023

I was able to load the model from the web gui.

  1. It still discards the message if I send a prompt around 7K via the OpenAI API:
    "Warning: too many messages for context size, dropping 1 oldest message(s)."
  2. However, if I send the same prompt in the web GUI, it takes it without an error.
  3. If I send a prompt around 5K via the OpenAI API, it throws an out-of-memory error:
Traceback (most recent call last):
  File "/content/text-generation-webui/modules/callbacks.py", line 55, in gentask
    ret = self.mfunc(callback=_callback, *args, **self.kwargs)
  File "/content/text-generation-webui/modules/text_generation.py", line 289, in generate_with_callback
    shared.model.generate(**kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/transformers/generation/utils.py", line 1572, in generate
    return self.sample(
  File "/usr/local/lib/python3.10/dist-packages/transformers/generation/utils.py", line 2619, in sample
    outputs = self(
  File "/content/text-generation-webui/modules/exllama_hf.py", line 57, in __call__
    self.ex_model.forward(torch.tensor([seq[:-1]], dtype=torch.long), cache, preprocess_only=True, lora=self.lora)
  File "/usr/local/lib/python3.10/dist-packages/exllama/model.py", line 844, in forward
    r = self._forward(input_ids[:, chunk_begin : chunk_end],
  File "/usr/local/lib/python3.10/dist-packages/exllama/model.py", line 925, in _forward
    hidden_states = decoder_layer.forward(hidden_states, cache, buffers[device], lora)
  File "/usr/local/lib/python3.10/dist-packages/exllama/model.py", line 468, in forward
    hidden_states = self.self_attn.forward(hidden_states, cache, buffer, lora)
  File "/usr/local/lib/python3.10/dist-packages/exllama/model.py", line 423, in forward
    attn_output = F.scaled_dot_product_attention(query_states, key_states, value_states, attn_mask = buffer.attn_mask, is_causal = False)
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 320.00 MiB (GPU 0; 14.75 GiB total capacity; 13.34 GiB already allocated; 226.81 MiB
free; 13.44 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  See documentation
for Memory Management and PYTORCH_CUDA_ALLOC_CONF
Output generated in 3.93 seconds (0.00 tokens/s, 0 tokens, context 4840, seed 351642880)

@matatonic
Contributor Author

matatonic commented Jun 30, 2023

@chigkim

  • --model loading fixed, sorry I didn't test that, thanks for the report
  • can you share the other parameters in the prompt? Which max_tokens and API endpoint are you using (Chat/Completions/Edit)? Also, which model are you loading? Please share the model loading message and parameters. If you're using a LoRA to get 8k context, please share which one. Please also share your GPU VRAM and other settings. Thank you!

Ideally please enable OPENEDAI_DEBUG=1 and share the complete log with me (perhaps via pastebin), thank you.

Another thing that is immensely helpful is if you can run this Python snippet after the model is loaded (update the IP/port if needed):

import requests
print(requests.post('http://0.0.0.0:5000/api/v1/model', json={'action': 'info'}).json())

@chigkim

chigkim commented Jun 30, 2023

I left everything as default out of the box except the ones specified below, and I'm using Colab T4. I'm not using lora.

python download-model.py TheBloke/Wizard-Vicuna-13B-Uncensored-SuperHOT-8K-GPTQ --output models
python server.py --verbose --notebook --share --chat --model-dir models --model Wizard-Vicuna-13B-Uncensored-SuperHOT-8K-GPTQ --model_type llama --wbits 4 --groupsize 128 --loader exllama_hf --max_seq_len 8192 --compress_pos_emb 4 --xformers --no-stream --extensions openai

2023-06-30 22:47:48 WARNING:The gradio "share link" feature uses a proprietary executable to create a reverse tunnel. Use it with care.
2023-06-30 22:47:51.749475: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT
bin /usr/local/lib/python3.10/dist-packages/bitsandbytes/libbitsandbytes_cuda118.so
Loading Wizard-Vicuna-13B-Uncensored-SuperHOT-8K-GPTQ...
2023-06-30 22:47:55 INFO:Loading Wizard-Vicuna-13B-Uncensored-SuperHOT-8K-GPTQ...
2023-06-30 22:49:50 INFO:Replaced attention with xformers_attention
2023-06-30 22:49:50 INFO:Loaded the model in 114.68 seconds.

Successfully loaded Wizard-Vicuna-13B-Uncensored-SuperHOT-8K-GPTQ
2023-06-30 22:49:50 INFO:Loading the extension "openai"...
2023-06-30 22:49:51 INFO:Loading the extension "gallery"...
Downloading (…)a8e1d/.gitattributes: 100% 1.18k/1.18k [00:00<00:00, 5.55MB/s]
Downloading (…)_Pooling/config.json: 100% 190/190 [00:00<00:00, 995kB/s]
Downloading (…)b20bca8e1d/README.md: 100% 10.6k/10.6k [00:00<00:00, 43.6MB/s]
Downloading (…)0bca8e1d/config.json: 100% 571/571 [00:00<00:00, 3.74MB/s]
Downloading (…)ce_transformers.json: 100% 116/116 [00:00<00:00, 665kB/s]
Downloading (…)e1d/data_config.json: 100% 39.3k/39.3k [00:00<00:00, 21.4MB/s]
Downloading pytorch_model.bin:  22% 94.4M/438M [00:01<00:05, 67.4MB/s]
Downloading pytorch_model.bin: 100% 438M/438M [00:08<00:00, 52.9MB/s]
Downloading (…)nce_bert_config.json: 100% 53.0/53.0 [00:00<00:00, 284kB/s]
Downloading (…)cial_tokens_map.json: 100% 239/239 [00:00<00:00, 1.47MB/s]
Downloading (…)a8e1d/tokenizer.json: 100% 466k/466k [00:00<00:00, 14.4MB/s]
Downloading (…)okenizer_config.json: 100% 363/363 [00:00<00:00, 2.11MB/s]
Downloading (…)8e1d/train_script.py: 100% 13.1k/13.1k [00:00<00:00, 49.5MB/s]
Downloading (…)b20bca8e1d/vocab.txt: 100% 232k/232k [00:00<00:00, 101MB/s]
Downloading (…)bca8e1d/modules.json: 100% 349/349 [00:00<00:00, 2.15MB/s]
Running on local URL:  http://127.0.0.1:7860
Running on public URL: https://62f2646077001b3abb.gradio.live

This share link expires in 72 hours. For free permanent hosting and GPU upgrades, run `gradio deploy` from Terminal to deploy to Spaces (https://huggingface.co/spaces)

Loaded embedding model: all-mpnet-base-v2, max sequence length: 384
Starting OpenAI compatible api at
OPENAI_API_BASE=https://makers-asylum-vid-instrumental.trycloudflare.com/v1
127.0.0.1 - - [30/Jun/2023 22:51:04] "POST /v1/chat/completions HTTP/1.1" 200 -

@matatonic
Contributor Author

@chigkim
I'm testing with that model now, and I also tried xformers just to be sure, but it's working for me. I didn't get any out of memory errors (I have 24G) or other problems. This is what the latest code will show when you hit the limits:

Warning: too many messages for context size, dropping 2 oldest message(s).
truncation_length: 8192, system_prompt: 32, remaining_tokens: 8144, new size would be 8161 tokens.
Warning: Ignoring max_new_tokens (8192), too large for the remaining context. Remaining tokens: 231
Warning: Set max_new_tokens = 231
...
Output generated in 4.87 seconds (22.80 tokens/s, 111 tokens, context 8081, seed 609841173)

TheBloke/Wizard-Vicuna-13B-Uncensored-SuperHOT-8K-GPTQ uses about 16G at peak usage (8192 length, 4x compression).

With the more recent update there is a bit more information printed when messages are dropped. Can you confirm you're still having problems?

@chigkim

chigkim commented Jul 1, 2023

I experimented with various conditions, changing VRAM size as well as context size, and this is what happened.

  1. OpenAI API, V100 GPU, 10K prompt:
127.0.0.1 - - [01/Jul/2023 11:36:43] "POST /v1/chat/completions HTTP/1.1" 200 -
Warning: too many messages for context size, dropping 1 oldest message(s).
truncation_length: 8192, system_prompt: 51, remaining_tokens: 8125, new size would be 11957 tokens.
Output generated in 8.91 seconds (2.25 tokens/s, 20 tokens, context 55, seed 1322609618)
  2. WebUI, V100 GPU, 10K prompt:
Output generated in 10.48 seconds (13.65 tokens/s, 143 tokens, context 7992, seed 1068782772)

It looks like the WebUI truncates part of the prompt and just generates, whereas the OpenAI API completely removes the last message.

Then I tried with T4 GPU which has less VRAM.

  3. OpenAI API, T4 GPU, 10K prompt:
127.0.0.1 - - [01/Jul/2023 12:05:44] "POST /v1/chat/completions HTTP/1.1" 200 -
Warning: too many messages for context size, dropping 1 oldest message(s).
truncation_length: 8192, system_prompt: 51, remaining_tokens: 8125, new size would be 11957 tokens.
Output generated in 12.40 seconds (1.61 tokens/s, 20 tokens, context 55, seed 802978709)
  4. WebUI, T4 GPU, 10K prompt:
Traceback (most recent call last):
  File "/content/text-generation-webui/modules/callbacks.py", line 55, in gentask
    ret = self.mfunc(callback=_callback, *args, **self.kwargs)
  File "/content/text-generation-webui/modules/text_generation.py", line 289, in generate_with_callback
    shared.model.generate(**kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/transformers/generation/utils.py", line 1572, in generate
    return self.sample(
  File "/usr/local/lib/python3.10/dist-packages/transformers/generation/utils.py", line 2619, in sample
    outputs = self(
  File "/content/text-generation-webui/modules/exllama_hf.py", line 57, in __call__
    self.ex_model.forward(torch.tensor([seq[:-1]], dtype=torch.long), cache, preprocess_only=True, lora=self.lora)
  File "/usr/local/lib/python3.10/dist-packages/exllama/model.py", line 844, in forward
    r = self._forward(input_ids[:, chunk_begin : chunk_end],
  File "/usr/local/lib/python3.10/dist-packages/exllama/model.py", line 925, in _forward
    hidden_states = decoder_layer.forward(hidden_states, cache, buffers[device], lora)
  File "/usr/local/lib/python3.10/dist-packages/exllama/model.py", line 468, in forward
    hidden_states = self.self_attn.forward(hidden_states, cache, buffer, lora)
  File "/usr/local/lib/python3.10/dist-packages/exllama/model.py", line 423, in forward
    attn_output = F.scaled_dot_product_attention(query_states, key_states, value_states, attn_mask = buffer.attn_mask, is_causal = False)
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 320.00 MiB (GPU 0; 14.75 GiB total capacity; 13.34 GiB already allocated; 226.81 MiB free; 13.44 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
Output generated in 6.57 seconds (0.00 tokens/s, 0 tokens, context 7992, seed 476360153)

The OpenAI API truncates the last message completely, and the WebUI throws an error.

Then I tried with 4K prompt.

  5. OpenAI API, T4 GPU, 4K prompt:
127.0.0.1 - - [01/Jul/2023 12:09:03] "POST /v1/chat/completions HTTP/1.1" 200 -
Traceback (most recent call last):
  File "/content/text-generation-webui/modules/callbacks.py", line 55, in gentask
    ret = self.mfunc(callback=_callback, *args, **self.kwargs)
  File "/content/text-generation-webui/modules/text_generation.py", line 289, in generate_with_callback
    shared.model.generate(**kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/transformers/generation/utils.py", line 1572, in generate
    return self.sample(
  File "/usr/local/lib/python3.10/dist-packages/transformers/generation/utils.py", line 2619, in sample
    outputs = self(
  File "/content/text-generation-webui/modules/exllama_hf.py", line 57, in __call__
    self.ex_model.forward(torch.tensor([seq[:-1]], dtype=torch.long), cache, preprocess_only=True, lora=self.lora)
  File "/usr/local/lib/python3.10/dist-packages/exllama/model.py", line 844, in forward
    r = self._forward(input_ids[:, chunk_begin : chunk_end],
  File "/usr/local/lib/python3.10/dist-packages/exllama/model.py", line 925, in _forward
    hidden_states = decoder_layer.forward(hidden_states, cache, buffers[device], lora)
  File "/usr/local/lib/python3.10/dist-packages/exllama/model.py", line 468, in forward
    hidden_states = self.self_attn.forward(hidden_states, cache, buffer, lora)
  File "/usr/local/lib/python3.10/dist-packages/exllama/model.py", line 423, in forward
    attn_output = F.scaled_dot_product_attention(query_states, key_states, value_states, attn_mask = buffer.attn_mask, is_causal = False)
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 320.00 MiB (GPU 0; 14.75 GiB total capacity; 13.34 GiB already allocated; 226.81 MiB free; 13.44 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
Output generated in 5.59 seconds (0.00 tokens/s, 0 tokens, context 4840, seed 475630363)
  6. WebUI, T4 GPU, 4K prompt:
Traceback (most recent call last):
  File "/content/text-generation-webui/modules/callbacks.py", line 55, in gentask
    ret = self.mfunc(callback=_callback, *args, **self.kwargs)
  File "/content/text-generation-webui/modules/text_generation.py", line 289, in generate_with_callback
    shared.model.generate(**kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/transformers/generation/utils.py", line 1572, in generate
    return self.sample(
  File "/usr/local/lib/python3.10/dist-packages/transformers/generation/utils.py", line 2619, in sample
    outputs = self(
  File "/content/text-generation-webui/modules/exllama_hf.py", line 57, in __call__
    self.ex_model.forward(torch.tensor([seq[:-1]], dtype=torch.long), cache, preprocess_only=True, lora=self.lora)
  File "/usr/local/lib/python3.10/dist-packages/exllama/model.py", line 844, in forward
    r = self._forward(input_ids[:, chunk_begin : chunk_end],
  File "/usr/local/lib/python3.10/dist-packages/exllama/model.py", line 925, in _forward
    hidden_states = decoder_layer.forward(hidden_states, cache, buffers[device], lora)
  File "/usr/local/lib/python3.10/dist-packages/exllama/model.py", line 468, in forward
    hidden_states = self.self_attn.forward(hidden_states, cache, buffer, lora)
  File "/usr/local/lib/python3.10/dist-packages/exllama/model.py", line 423, in forward
    attn_output = F.scaled_dot_product_attention(query_states, key_states, value_states, attn_mask = buffer.attn_mask, is_causal = False)
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 320.00 MiB (GPU 0; 14.75 GiB total capacity; 13.34 GiB already allocated; 226.81 MiB free; 13.44 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
Output generated in 3.59 seconds (0.00 tokens/s, 0 tokens, context 4760, seed 238691795)

Both throw an error.

Lastly, if I feed a 2K prompt on the T4 GPU, it works as expected from both the OpenAI API and the WebUI.

Hope this helps...

@matatonic
Contributor Author

matatonic commented Jul 1, 2023

Thanks so much for the reports, they're really helpful. For the T4 systems I don't think there's anything we can do about the OOM; you need to use a smaller context or a larger GPU for that.

This is an implementation choice:

Current: drop the oldest messages until there is enough context room to generate new tokens.
Simple and clean, but it ignores max_new_tokens and will proceed as long as 16 tokens are free (I picked 16 as the minimum). This has the very unfortunate effect of sometimes dropping huge, detailed messages to make room. Example: in your 10k prompt scenario, it will just ignore a 10k prompt in an 8k context.

Possible: truncation of an old message.
In this case we cut the context to allow generation of <number> tokens. The last message that partly fits in context gets cut mid-sentence, with something like '...' stuck in front of it. This leaves room to generate <number> tokens before the context is exhausted. In some implementations <number> would be max_new_tokens, but there is no data field in the OpenAI API for min_tokens, so the only options I have are to hard-code a min_tokens (currently 16) or to use max_new_tokens.

Which do you prefer? Maybe a vote: thumbs up for Current, thumbs down to opt for "Truncation of an old message". If you vote thumbs down, please include a comment about what <number> should be.

Any other suggestions for how to handle this?

TL;DR
The context won't fit and we need to make room to generate new tokens, so which tokens are dropped, and how many?
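To make the two options concrete, here is a rough sketch of both strategies. This is not the actual extension code: count_tokens() is a crude stand-in for the real tokenizer, and the function names are hypothetical.

```python
def count_tokens(text):
    # Crude stand-in for a real tokenizer: one token per whitespace word.
    return len(text.split())

def drop_oldest(messages, budget, min_free=16):
    # "Current" strategy: drop whole messages, oldest first, until at
    # least min_free tokens remain for generation.
    messages = list(messages)
    while messages and sum(count_tokens(m) for m in messages) > budget - min_free:
        messages.pop(0)
    return messages

def truncate_oldest(messages, budget, reserve=16):
    # "Possible" strategy: keep newest messages whole; cut the oldest
    # message that partly fits, keeping its tail prefixed with '...',
    # so `reserve` tokens stay free for generation.
    kept, used = [], 0
    for m in reversed(messages):
        n = count_tokens(m)
        if used + n <= budget - reserve:
            kept.insert(0, m)
            used += n
        else:
            room = budget - reserve - used
            if room > 0:
                kept.insert(0, '...' + ' '.join(m.split()[-room:]))
            break
    return kept
```

With a budget of 8 tokens and 2 reserved, drop_oldest discards the oldest message entirely, while truncate_oldest keeps its tail.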

Update: scrap this, it's just going to be an error for compatibility with openai.

@chigkim

chigkim commented Jul 1, 2023

Actually I do not have a strong opinion on one over the other.
However, I would like consistency between the OpenAI API and the Web UI regardless of which behavior ends up getting implemented. They should behave the same way.
Right now, cases 1 vs 2 and 3 vs 4 yield different results.

@matatonic
Contributor Author

Actually, I'm sorry for even asking; I should have just looked up what OpenAI does. It simply returns an error saying the prompt is too large, and you have to decide for yourself which messages to drop; this factors in whatever you have chosen for max_tokens. So the max prompt size is: truncation_length - max_tokens. The default max_tokens is unlimited ('whatever is left'), so I'm going to reproduce this behaviour and update this PR.
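That check can be sketched roughly as follows. The function name is hypothetical and token counting is taken as given; a real implementation would count with the model's tokenizer, but the arithmetic (reject when prompt + max_tokens exceeds truncation_length) is the point.

```python
# Sketch of OpenAI-style length handling: reject instead of silently
# dropping messages. Illustrative only, not the extension's code.
def check_context(prompt_tokens, max_tokens, truncation_length=2048):
    # max_tokens=None means "whatever is left" after the prompt,
    # so only the prompt itself has to fit.
    requested = prompt_tokens + (max_tokens or 0)
    if requested > truncation_length:
        raise ValueError(
            f"This model's maximum context length is {truncation_length} "
            f"tokens, however you requested {requested} tokens "
            f"({prompt_tokens} in your prompt; {max_tokens} for the "
            f"completion). Please reduce your prompt or completion length."
        )

check_context(prompt_tokens=1000, max_tokens=500)  # fits, no error
```

A 10K-token prompt against an 8192 truncation_length would raise instead of dropping the message, matching OpenAI's behaviour.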

@matatonic
Contributor Author

I'm choosing to behave like OpenAI for compatibility, not to work like the WebUI, which has its own API already.

@chigkim

chigkim commented Jul 1, 2023

That makes sense.
Could you also look into why case 3 (OpenAI, T4, 10K prompt) vs case 5 (OpenAI, T4, 4K prompt) have different behavior?
Neither fits in VRAM, yet they produce different results.

@matatonic
Contributor Author

matatonic commented Jul 1, 2023

That makes sense. Could you also look into why case 3 (OpenAI, T4, 10K prompt) vs case 5 (OpenAI, T4, 4K prompt) have different behavior? Neither fits in VRAM, yet they produce different results.

The 10K prompt was completely dropped, Output generated in 12.40 seconds (1.61 tokens/s, 20 tokens, context 55, seed 802978709) -- context 55 is probably just the default system prompt + extras.

The 4K prompt was loaded, but caused the OOM error.
Output generated in 5.59 seconds (0.00 tokens/s, 0 tokens, context 4840, seed 475630363) -- context 4840.

I've updated a lot to handle length properly, and also overhauled errors and parameter handling in general. The behaviour is now much more similar to how OpenAI handles length problems.

I did find one remaining bug not fixed here: truncation_length is not updated if the model doesn't set it. This update also adds API support for model settings, so you can set truncation_length with an API call like so:

import requests, sys
t_len = int(sys.argv[1]) if len(sys.argv) > 1 else 8192
print(requests.post('http://0.0.0.0:5000/api/v1/model', json={'action': 'settings', 'settings': {'truncation_length': t_len}}).json()['result']['shared.settings']['truncation_length'])

Or from the command line like so (Example):

python -c "import requests; print(requests.post('http://0.0.0.0:5000/api/v1/model', json={'action': 'settings', 'settings': {'truncation_length': 8192}}).json()['result']['shared.settings']['truncation_length'])"

Right now this is the only way to set an 8192 truncation_length on a 2048 model + 8K LoRA; the WebUI does not update the server setting, so setting it there is not effective. You can, however, set it via models/config-user.yaml.

@FartyPants
Contributor

See my #2951.
lora_names should go in unload_model, as this is an issue for the entire UI, not just the API.

@stoplime

stoplime commented Jul 5, 2023

@matatonic Thanks for fixing the max_token being stuck at 200 tokens. This PR fixes that!

@matatonic
Contributor Author

New PR coming, a bunch more changes but I will split out openai & other changes.

@tensiondriven
Contributor

Very exciting! I saw from the commit messages that you addressed stopping_strings; I was having an issue with stopping_strings not being respected when using exllama, but apparently it works with exllama_hf and a specific branch of exllama. I am also running into issues with no_repeat_ngram_size, just wanted to mention that in case it was something you're addressing. (no_repeat_ngram_size is increasingly important for getting good output at longer context lengths.)

@atisharma

Is there anything holding up this merge? I can test if that's what's needed.

@matatonic
Contributor Author

I've moved all the openai updates and a bunch of new openai additions to another PR; I will update this PR and remove them when I get some more time.

@matatonic
Contributor Author

@tensiondriven I don't know much about no_repeat_ngram_size, but it's fixed to 0 with the openai api, do you have another suggestion?

@tensiondriven
Contributor

@tensiondriven I don't know much about no_repeat_ngram_size, but it's fixed to 0 with the openai api, do you have another suggestion?

I was thinking mainly about the built-in api; if the OpenAI API doesn't support it, then I suppose it doesn't make sense to allow it to be specified per-request, unless we're already allowing for "extra params".

A possible solution would be to allow it to be specified on the command line, so it would apply to all generations, but that sort of blurs the line between inference-time params and model-load-time params.

I don't feel too strongly on this with regard to the OpenAI api; my main concern is being able to get decent control of generations when using the builtin api.

If it's possible to emit a warning on ignored/unexpected/invalid params, I think that would improve ergonomics for people using the API(s), but that may not be practical.

@oobabooga
Owner

@matatonic is this PR ready? Could you please merge the main branch?

@matatonic
Contributor Author

@matatonic is this PR ready? Could you please merge the main branch?

No, sorry, this is pretty stale. I've been on holiday the last couple of weeks but will be able to put in some more time near the end of September.

@oobabooga
Owner

Awesome, thank you :)

@teddybear082

teddybear082 commented Oct 26, 2023

@matatonic @oobabooga, quick question: do you think adding the following after line 53 in https://github.com/oobabooga/text-generation-webui/blob/main/modules/models_settings.py:

model_settings['truncation_length'] = metadata['llama.context_length']
model_settings['max_sequence_length'] = metadata['llama.context_length']

Similar to what is done for transformers models, would fix the bug in the openai API extension where truncation length is always stuck at 2048? (The only consistent workaround I've found for people using GGUF right now is hard-coding a new truncation length at line 76 of completions.py in the OpenAI extension.) I think matatonic said the shared settings are not updated before the extension uses them, so maybe this doesn't help, but I figured I would ask. (It may basically be an order-of-operations problem: textgen is loaded, shared settings are loaded, openai takes these and makes them its settings, then the model is loaded with its settings, but by that point openai has already set its variables, so this change would also not help. Not sure.)

EDIT: on quick testing, this indeed appears to fix the issue for me. I still have to upgrade to the latest text-generation-webui and test there, but, for example, using a Synthia 7B GGUF model with no special .yaml and no change to completions.py, I was able to get a context over 2048:

Output generated in 18.87 seconds (4.08 tokens/s, 77 tokens, context 3311, seed 1310051720)
data: {"id": "chatcmpl-1698319737724307456", "object": "chat.completions.chunk", "created": 1698319737, "model": "Synthia-7B-v1.3.q4_k_s.gguf", "choices": [{"index": 0, "finish_reason": "stop", "message": {"role": "assistant", "content": ""}, "delta": {"role": "assistant", "content": ""}}], "usage": {"prompt_tokens": 3311, "completion_tokens": 74, "total_tokens": 3385}}

. . .

Output generated in 12.99 seconds (3.77 tokens/s, 49 tokens, context 3415, seed 1451077598)
data: {"id": "chatcmpl-1698319778796408320", "object": "chat.completions.chunk", "created": 1698319778, "model": "Synthia-7B-v1.3.q4_k_s.gguf", "choices": [{"index": 0, "finish_reason": "stop", "message": {"role": "assistant", "content": ""}, "delta": {"role": "assistant", "content": ""}}], "usage": {"prompt_tokens": 3415, "completion_tokens": 50, "total_tokens": 3465}}

@oobabooga
Owner

I am working on a big revision to the OpenAI extension in #4430 that may fix the bugs here. Any feedback on that PR is welcome.

A result so far is that I converted the API to FastAPI, so opening 127.0.0.1:5000/docs now shows the documentation.

@oobabooga
Owner

I think that most of the issues here should be fixed after #4430, and the code in this PR is now outdated after the changes in that PR.

If any of the issues are still present, a PR would be very much welcome.

@oobabooga oobabooga closed this Nov 6, 2023
@matatonic matatonic deleted the 8k_loras_fixes branch January 21, 2024 21:03