[api, openai] 8k fixes, lora api updates, and bug fixes #2942

matatonic wants to merge 27 commits into oobabooga:main from matatonic:8k_loras_fixes
Conversation

@tensiondriven, @chigkim, @g588928812 - if you have time to test this PR, please let me know how it goes. Thanks!

For some reason, it doesn't load the model when launching the server.

I was able to load the model from the web gui.

Ideally please enable OPENEDAI_DEBUG=1 and share the complete log with me (perhaps via pastebin), thank you. Another thing that is immensely helpful is if you can run this Python after the model is loaded (update the IP/port if needed):

```python
import requests
print(requests.post('http://0.0.0.0:5000/api/v1/model', json={'action': 'info'}).json())
```

I left everything as default out of the box except the ones specified below, and I'm using Colab T4. I'm not using lora.

@chigkim `Warning: too many messages for context size, dropping 2 oldest message(s).` TheBloke/Wizard-Vicuna-13B-Uncensored-SuperHOT-8K-GPTQ uses about 16G at peak usage (8192 length, 4x compression). With the more recent update, a bit more information is printed when messages are dropped. Can you confirm you're still having problems?

I experimented with various conditions, changing VRAM size as well as context size, and this is what happened.
It looks like the WebUI truncates part of the prompt and just generates, whereas the OpenAI API completely removes the last message. Then I tried with a T4 GPU, which has less VRAM.
The OpenAI API truncates the last message completely, and the WebUI throws an error. Then I tried with a 4K prompt.
Both throw an error. Lastly, if I feed a 2K context prompt on the T4 GPU, it works both from the OpenAI API and the WebUI as expected. Hope this helps...

Thanks so much for the reports, it's really helpful. For the T4 systems I don't think there's anything we can do about the OOM; you need to use a smaller context or a larger GPU for that. This is an implementation choice:

- Current: drop the oldest messages until there is enough context room to generate new tokens.
- Possible: truncate an old message.

Which do you prefer - maybe a vote? Let's say thumbs up for Current, and thumbs down to opt for "truncate an old message". If you vote thumbs down, please include a comment about what you'd prefer. Any other suggestions for how to handle this?

TL;DR update: scrap this, it's just going to be an error, for compatibility with openai.
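The "drop the oldest messages" option above can be sketched roughly as follows. This is a minimal illustration only, not the extension's actual code: the token counter is passed in as a callable, and the message format just mirrors the chat-style `{'role': ..., 'content': ...}` dicts used by the OpenAI API.

```python
def drop_oldest(messages, count_tokens, truncation_length, max_new_tokens):
    """Drop the oldest non-system messages until the prompt fits.

    messages: list of {'role': ..., 'content': ...} dicts.
    count_tokens: callable estimating the token count of a message list.
    """
    messages = list(messages)  # don't mutate the caller's list
    dropped = 0
    # Keep removing the oldest user/assistant message while the prompt
    # plus the room reserved for generation exceeds the context window.
    while count_tokens(messages) + max_new_tokens > truncation_length:
        for i, m in enumerate(messages):
            if m['role'] != 'system':
                del messages[i]
                dropped += 1
                break
        else:
            break  # only system messages left, nothing more to drop
    if dropped:
        print(f"Warning: too many messages for context size, "
              f"dropping {dropped} oldest message(s).")
    return messages
```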

Actually I do not have a strong opinion on one over the other.

Actually, I'm sorry for even asking, I should have just looked up what openai does. It will just return an error saying the prompt is too large, and you have to decide for yourself which messages to drop; this factors in whatever you have chosen for max_tokens. So the max prompt size is: truncation_length - max_tokens. The default max_tokens is unlimited, i.e. 'whatever is left', so I'm going to reproduce this behaviour and update this PR.
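The openai-style check described here amounts to a simple bound on `prompt_tokens + max_tokens`. A sketch of that logic (names and the error wording are illustrative, not the extension's actual implementation):

```python
def check_prompt_fits(prompt_tokens, truncation_length, max_tokens=None):
    """Mimic openai's behaviour: error out instead of silently dropping.

    max_tokens=None means 'unlimited', i.e. generation gets whatever
    context is left after the prompt, so nothing is reserved up front.
    """
    reserved = max_tokens or 0
    if prompt_tokens + reserved > truncation_length:
        raise ValueError(
            f"This model's maximum context length is {truncation_length} "
            f"tokens, however you requested {prompt_tokens + reserved} tokens "
            f"({prompt_tokens} in your prompt; {reserved} for the completion)."
        )
    # Max prompt size is truncation_length - max_tokens when max_tokens is set.
    return truncation_length - prompt_tokens  # room left for generation
```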

I'm choosing to behave like openai for compatibility, not to work like the webui, which has its own API already.

That makes sense.
The 10K prompt was completely dropped: `Output generated in 12.40 seconds (1.61 tokens/s, 20 tokens, context 55, seed 802978709)` - context 55 is probably just the default system prompt + extras. The 4K prompt was loaded, but caused the OOM error.

I've updated a lot to handle length properly, and also overhauled errors and parameter handling in general. The behaviour is now much more similar to how openai handles length problems. I did find one remaining bug not fixed here: truncation_length is not updated if the model doesn't set it.

This update also adds API support for model settings, so you can set the truncation_length with an API call like so:

```python
import requests, sys

t_len = int(sys.argv[1]) if len(sys.argv) > 1 else 8192
print(requests.post('http://0.0.0.0:5000/api/v1/model',
                    json={'action': 'settings',
                          'settings': {'truncation_length': t_len}}
                    ).json()['result']['shared.settings']['truncation_length'])
```

Or from the command line like so (Example):

Right now this is the only way to set an 8192 truncation_length on a 2048 model + 8K LoRA; the WebUI does not update the server setting, so setting it there is not effective. You can set it via a models/config-user.yaml however.

see my #2951

@matatonic Thanks for fixing max_tokens being stuck at 200 tokens. This PR fixes that!

New PR coming, a bunch more changes, but I will split out openai & other changes.

Very exciting! I saw from the commit messages that you addressed

Is there anything holding up this merge? I can test if that's what's needed.

I've moved all the openai updates and a bunch more new openai stuff to another PR; I will update this PR and remove them when I get some more time.

@tensiondriven I don't know much about no_repeat_ngram_size, but it's fixed to 0 with the openai api - do you have another suggestion?

I was thinking mainly about the built-in api; if the OpenAI API doesn't support it, then I suppose it doesn't make sense to allow it to be specified per-request, unless we're already allowing for "extra params". A possible solution would be to allow it to be specified on the command line, so it would apply to all generations, but that sort of breaks the line between inference-time params and model-load-time params. I don't feel too strongly about this with regard to the OpenAI api; my main concern is being able to get decent control of generations when using the built-in api. If it's possible to emit a warning on ignored/unexpected/invalid params, I think that would improve ergonomics for people using the API(s), but that may not be practical.
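Warning on ignored/unexpected params, as suggested above, could look something like this sketch. The accepted-parameter set here is a placeholder for illustration, not the extension's real list:

```python
import warnings

# Hypothetical set of request parameters the endpoint understands.
ACCEPTED_PARAMS = {'model', 'prompt', 'max_tokens', 'temperature',
                   'top_p', 'stop', 'stream'}

def filter_params(request_body):
    """Split a request body into accepted params and ignored ones,
    warning once per ignored key so API users can spot typos and
    unsupported options (e.g. no_repeat_ngram_size)."""
    accepted, ignored = {}, {}
    for key, value in request_body.items():
        if key in ACCEPTED_PARAMS:
            accepted[key] = value
        else:
            ignored[key] = value
            warnings.warn(f"Ignoring unsupported parameter: {key!r}")
    return accepted, ignored
```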

@matatonic is this PR ready? Could you please merge the main branch?

No, sorry, this is pretty stale. I've been on holiday the last couple of weeks but will be able to put in some more time near the end of Sept.

Awesome, thank you :)

@matatonic @oobabooga Quick question: do you think adding, after line 53 in https://github.com/oobabooga/text-generation-webui/blob/main/modules/models_settings.py:

`model_settings['truncation_length'] = metadata['llama.context_length']`

similar to what is done for transformers models, would fix the bug in the openai API extension where truncation length is always stuck at 2048? (The only way I've found consistently around that for people right now using gguf is hard-coding a new truncation length at line 76 of completions.py in the OpenAI extension.)

I think matatonic said the shared settings are not being updated before using the extension, so maybe this doesn't help, but I figured I would ask. (Maybe it's basically the order of operations: textgen is loaded, shared settings are loaded, openai takes these and makes them its settings, then the model is loaded with its settings - but by that point openai has already set its variables, so this change would also not help. Not sure.)

EDIT: on quick testing this indeed appears to fix the issue for me. I have to upgrade to the latest textgenwebui and test there, but for example, using a Synthia 7b gguf model with no special .yaml and no change to completions.py, I was able to get a context over 2048:

`Output generated in 18.87 seconds (4.08 tokens/s, 77 tokens, context 3311, seed 1310051720)`
...
`Output generated in 12.99 seconds (3.77 tokens/s, 49 tokens, context 3415, seed 1451077598)`
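The proposed one-line change amounts to copying the model's context length out of the GGUF metadata into the settings dict. A minimal sketch of that logic, with the key name taken from the snippet above (this is not the repository's actual code, and real GGUF metadata would come from the loaded model file):

```python
def apply_metadata(model_settings, metadata):
    """Copy the context length from GGUF metadata into the model settings,
    mirroring what is already done for transformers models, so the API
    extension no longer sticks at the 2048 default."""
    if 'llama.context_length' in metadata:
        model_settings['truncation_length'] = metadata['llama.context_length']
    return model_settings
```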

I am working on a big revision to the OpenAI extension in #4430 that may fix the bugs here. Any feedback on that PR is welcome. A result so far is that I converted the API to FastAPI, so now opening

I think that most of the issues here should be fixed after #4430, and the code in this PR is now outdated after the changes in that PR. If any of the issues are still present, a PR would be very much welcome.

LoRA fixes and updates for the model API, plus fixes for compatibility with GUI LoRA loading and the APIs (e.g., synchronize shared.settings so truncation_length can be updated properly between the GUI, command line and APIs).
Test results over 100 models and 10 different API calls are passing as expected.