
Model wishlist #49

Closed
7 of 14 tasks
EricLBuehler opened this issue Mar 30, 2024 · 23 comments
Labels
models Additions to model or architectures

Comments

@EricLBuehler (Owner) commented Mar 30, 2024

Please let us know what model architectures you would like to be added!

  • mistralai/Mistral-7B-Instruct-v0.1
  • mistralai/Mistral-7B-Instruct-v0.2
  • mistralai/Mixtral-8x7B-Instruct-v0.1
  • meta-llama/Llama-2-13b-hf
  • google/gemma-7b-it
  • microsoft/phi-2
  • stabilityai/stablelm-2-1_6b
  • 01-ai/Yi-6B
  • RWKV/rwkv-6-world-1b6 and RWKV/rwkv-5-world-1b5
  • yzsydlc/qwen2
  • adept/persimmon-8b-chat
  • mosaicml/mpt-7b

Quantized architectures:

  • llama
  • phi
@EricLBuehler EricLBuehler pinned this issue Mar 30, 2024
@EricLBuehler EricLBuehler added the models Additions to model or architectures label Apr 3, 2024
@lucasavila00 (Contributor) commented Apr 6, 2024

Mistral v0.2

It currently doesn't work with the mistral-gguf setup.

$ ./target/release/mistralrs-server --port 1234 --log output.txt mistral-gguf -t mistralai/Mistral-7B-Instruct-v0.2 -m TheBloke/Mistral-7B-Instruct-v0.2-GGUF -f mistral-7b-instruct-v0.2.Q4_K_M.gguf
2024-04-06T04:23:47.390913Z  INFO mistralrs_server: avx: true, neon: false, simd128: false, f16c: true
2024-04-06T04:23:47.390932Z  INFO mistralrs_server: Sampling method: penalties -> temperature -> topk -> topp -> multinomial
2024-04-06T04:23:47.390935Z  INFO mistralrs_server: Loading model `mistralai/Mistral-7B-Instruct-v0.2` on Cuda(CudaDevice(DeviceId(1)))...
Error: invalid type: null, expected usize at line 19 column 24

I think this is related to an error in config.json: https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.2/blob/main/config.json#L19

Indeed, v0.2 doesn't use sliding-window attention: https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.2
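
For reference, a minimal sketch of the deserialization problem, using a hypothetical trimmed-down Config struct (not the actual mistral.rs code): typing sliding_window as usize rejects the null in v0.2's config.json, while Option<usize> accepts it.

use serde::Deserialize;

// Hypothetical config struct for illustration only.
#[derive(Deserialize)]
struct Config {
    // A plain `usize` fails on v0.2's `"sliding_window": null` with
    // "invalid type: null, expected usize"; `Option<usize>` maps null to None.
    sliding_window: Option<usize>,
}

fn main() -> Result<(), serde_json::Error> {
    let cfg: Config = serde_json::from_str(r#"{ "sliding_window": null }"#)?;
    assert!(cfg.sliding_window.is_none());
    Ok(())
}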

@EricLBuehler (Owner, Author)

Thank you for raising this. I have fixed the issue, and mistral v0.2 should work now.

@hugoabonizio (Contributor)

Is it currently possible to use models that are not tuned for instructions? It seems that only chat models are supported.

$ target/release/mistralrs-server --port 1234 mistral --model-id mistralai/Mistral-7B-v0.1
(...)
No specified chat template, loading default chat template at `./default.json`.

@EricLBuehler (Owner, Author)

@hugoabonizio, I just merged support for models with no chat template, so now models that are not tuned for instructions are supported.

@hugoabonizio (Contributor) commented Apr 11, 2024

@EricLBuehler, thank you for the quick reply! I'm trying to run lm-eval against mistral.rs to compare it with the Python implementation, but I'm having some issues since it calls the completions endpoint (/completions) instead of chat completions (/chat/completions), which has a different output format. I'll try to hack that in over the weekend.

@EricLBuehler (Owner, Author)

@hugoabonizio, #107 just added the /completions endpoint. Hopefully that is helpful!

@hugoabonizio (Contributor)

@EricLBuehler wow, that's faster than I can think! 😆

I'm getting an OOM error using non-quantized Mistral on an A100 80GB. Do you have any clue why?

$ target/release/mistralrs-server --port 1234 mistral --model-id mistralai/Mistral-7B-v0.1
$ curl http://localhost:1234/v1/completions -H "Content-Type: application/json" -H "Authorization: Bearer EMPTY" -d '{
"model": "",
"prompt": "What is Rust?"
}'
{"message":"DriverError(CUDA_ERROR_OUT_OF_MEMORY, \"out of memory\")","partial_response":{"id":"0","choices":[{"finish_reason":"error","index":0,"text":"\n\nRust is a malicious programming language that frustrates and irritates users. It’s activated by the RustM.exe file hiding on your device, disguised as the Windows Adobe Flash Player update “admin123.exe” file.\n\nUpon installing, the virus blackmails the victim to send bitcoin or pay money to remove Rust.\n\nRecently, the virus has been often distributed via an infection email titled: “Adobe Flash Player”‘.\n\nTherefore, it’s crucial to carefully read the document, since any presented data is commonly false.\n\nYou should never open anything from a stranger on the internet, as the Rust infection can catch you off guard and take control of your device.\n\nFurthermore, you shouldn’t click files that end on ‘.doc’ and ‘.exe’. As you know, these could be easily disguised as something innocent, but rather be something dangerous.\n\n## How to Get Rust Virus and How to Remove Rust From Your Device\n\nAs we already mentioned, it’s very easy to get Rust on your computer. The Rust virus distributor sends an email like you’re just any other bad driver.\n\nAs links and file documents with different extensions can be disguised as a game, email with some new data, or an update that easily infects your computer.\n\nMalicious malware viruses like the Rust virus blackmail the victim with pictures of their computer video feed. Through camera pictures and what seems like a screenshot, the Rust virus feels dangerous and frustrating.\n\nEspecially when the presentation is complete with good spellings and grammar.\n\nThis is a way of gaining the victim’s trust. Meaning that some believe what they are being told.\n\nThis in turn can cause fear, as the virus has installed completely and the device now belongs to the virus.\n\nTake into account, that everything seems innocent and easy to use, but in reality, the victim needs a solution as fast as possible.\n\nThe virus doesn’t give the victim options when to pay or send the amount of money they request for the “ransom”.\n\nRemembering that there’s no guarantee that he’ll really send the files back or not.\n\nSo that’s why VirusPro can help you solve the problem by offering you the options to always be safe and protected when it comes to Viruses like Rust.\n\n## VirusPro\n\nVirusPro is an antivirus that works best with Mac.\n\nThey offer tips and tutorials for utilizing information provided by them, such as installation and data/spam scanning. Also","logprobs":null}],"created":1712927025,"model":"mistralai/Mistral-7B-v0.1","system_fingerprint":"local","object":"text_completion","usage":{"completion_tokens":579,"prompt_tokens":5,"total_tokens":584,"avg_tok_per_sec":30.635262,"avg_prompt_tok_per_sec":32.467533,"avg_compl_tok_per_sec":30.62034,"avg_sample_tok_per_sec":77.887436,"total_time_sec":19.063,"total_prompt_time_sec":0.154,"total_completion_time_sec":18.909,"total_sampling_time_sec":7.498}}}

The process starts at ~14 GB of memory and grows without bound until it OOMs. With fewer max_tokens, e.g. 500, it peaks at ~64 GB and returns successfully.

@EricLBuehler (Owner, Author)

One thing I noticed is that the error only happens on non-quantized models; quantized models do not seem to have that problem. I've tested the llama-index integration, which works with a large number of output tokens (easily more than 500) on an A10 with 24GB. #44 seems to be similar, and I'm not really sure why this is happening. I'll take another look.

@hugoabonizio (Contributor)

@EricLBuehler yeah, it seems like it's overgrowing the KV cache or something like that. It doesn't happen with Candle's original implementation, BTW.
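
For context, the usual candle-style KV-cache append looks roughly like the sketch below (an illustration of where the growth comes from, not the exact mistral.rs code). Each decoding step concatenates the new K/V onto the cache, allocating a fresh, larger buffer, so any bug that keeps old buffers alive compounds every step.

use candle_core::{Result, Tensor};

// Sketch: K/V have shape (batch, kv_heads, seq_len, head_dim); dim 2 is the
// sequence dimension that grows by one each decoding step.
fn append_kv(
    cache: &mut Option<(Tensor, Tensor)>,
    k_new: &Tensor,
    v_new: &Tensor,
) -> Result<(Tensor, Tensor)> {
    let (k, v) = match cache.as_ref() {
        None => (k_new.clone(), v_new.clone()),
        // Tensor::cat allocates a new buffer holding the whole history, so
        // a leaked reference to a previous step's buffer adds up quickly.
        Some((k_prev, v_prev)) => (
            Tensor::cat(&[k_prev, k_new], 2)?,
            Tensor::cat(&[v_prev, v_new], 2)?,
        ),
    };
    *cache = Some((k.clone(), v.clone()));
    Ok((k, v))
}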

@EricLBuehler (Owner, Author)

@lucasavila00, do you think you can take a look at this? I've been trying to find what is wrong but have made no progress; maybe a second pair of eyes would help. I have the nonquant_oom_branch where I have been experimenting, in case you want to try that out.

Interestingly, during debugging, I discovered that even after disabling the mistral.rs KV cache mechanism and reverting to the Candle official implementation, the problem persists. Additionally, this is not a problem for the quantized models.

@lucasavila00 (Contributor)

@EricLBuehler I'll give it a shot. I can't run it locally on CUDA though (too little VRAM), which makes it harder to debug (I only know how to use the visual profiler locally, etc.). But I'll try CPU, or a VM if that doesn't work.

@EricLBuehler (Owner, Author)

Thanks! Let me know if you find anything.

@lucasavila00 (Contributor) commented Apr 13, 2024

I can reproduce it on CPU.

To generate this amount of text:

Once upon a time, in a land far, far away, there was a kingdom ruled by a wise and just king. The kingdom was prosperous and peaceful, but there was one problem: the code written by the programmers in the kingdom was prone to errors.

The king knew that errors in code could lead to disastrous consequences, so he called upon his most trusted advisors to find a solution. After much deliberation, they came up with a plan: they would implement error handling in the code.

It used 6 GB of RAM unquantized and 100 MB quantized, both running on CPU.

As far as I know, quantization should only affect the weights, right? The activations and KV cache are still full precision on quantized models, so the RAM usage should grow by the same amount whether or not the model is quantized... 🤔
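
For scale, here's a back-of-envelope estimate of what the KV cache alone should cost, assuming Mistral-7B's usual config values (32 layers, 8 KV heads, head_dim 128) and f32 activations:

fn main() {
    let (layers, kv_heads, head_dim) = (32u64, 8u64, 128u64);
    let bytes_per_elem = 4u64; // f32
    // K and V per generated token, summed over all layers:
    let per_token = 2 * layers * kv_heads * head_dim * bytes_per_elem;
    println!("{} KiB per token", per_token / 1024); // 256 KiB
    println!("{} MiB for 600 tokens", per_token * 600 / (1024 * 1024)); // 150 MiB
}

So even a few hundred tokens should only account for on the order of 150 MiB of cache, nowhere near 6 GB, which points at buffers that are never freed rather than the cache itself.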

@lucasavila00 (Contributor)

@EricLBuehler I did a heap dump of it, and, weirdly, 5 GB were attributed to repeat_kv, plus another 2 GB to kvconcat:

[heaptrack screenshot: unquantized run, allocations dominated by repeat_kv and kvconcat]

Compared to quantized, where no memory leaked:

[heaptrack screenshot: quantized run, no leaked memory]

@EricLBuehler (Owner, Author)

Thanks, that is very useful. Were you running an X-LoRA model? It looks like xlora_models' repeat_kv shows up in the heaptrack trace, and that should not be happening.

@lucasavila00 (Contributor) commented Apr 13, 2024

I'm running:

heaptrack ./target/release/mistralrs-server --prompt "Tell me 3 jokes." mistral-gguf

and

heaptrack ./target/release/mistralrs-server --prompt "Tell me 3 jokes." mistral

I'm looking at the dumps, and the GGUF version leaks no memory at all (only the model itself stays allocated).

The regular version leaks.

I opened the biggest leaks here:

[heaptrack screenshot: largest leak backtraces expanded]

// 1.5gb
__GI___clone3 in libc.so.6
start_thread in libc.so.6
std::sys::pal::unix::thread::Thread::new::thread_start::h40e6fd3f8ce15a14 in mistralrs-server
core::ops::function::FnOnce::call_once$u7b$$u7b$vtable.shim$u7d$$u7d$::hea2f92b31a4a3afa in mistralrs-server
std::sys_common::backtrace::__rust_begin_short_backtrace::hc791cc3c3af3a6b3 in mistralrs-server
mistralrs_core::engine::Engine::run::hf69ab0a23e9700c2 in mistralrs-server
_$LT$mistralrs_core..pipeline..mistral..MistralPipeline$u20$as$u20$mistralrs_core..pipeline..Pipeline$GT$::forward::ha7a7d564e9679585 in mistralrs-server
mistralrs_core::models::mistral::Model::forward::h65f13a2e19e8d611 in mistralrs-server
mistralrs_core::models::mistral::DecoderLayer::forward::hff6dd2034ec6664b in mistralrs-server
mistralrs_core::xlora_models::mistral::Attention::repeat_kv::h44480ea00c7561e8 in mistralrs-server
candle_core::tensor::Tensor::reshape::h8992546c4121d565 in mistralrs-server
candle_core::device::Device::alloc_uninit::h8ba026683b968a15 in mistralrs-server
_$LT$candle_core..cpu_backend..CpuDevice$u20$as$u20$candle_core..backend..BackendDevice$GT$::alloc_uninit::he0f5b8080cf5aeb7 in mistralrs-server

// 0.75gb
__GI___clone3 in libc.so.6
start_thread in libc.so.6
std::sys::pal::unix::thread::Thread::new::thread_start::h40e6fd3f8ce15a14 in mistralrs-server
core::ops::function::FnOnce::call_once$u7b$$u7b$vtable.shim$u7d$$u7d$::hea2f92b31a4a3afa in mistralrs-server
std::sys_common::backtrace::__rust_begin_short_backtrace::hc791cc3c3af3a6b3 in mistralrs-server
mistralrs_core::engine::Engine::run::hf69ab0a23e9700c2 in mistralrs-server
_$LT$mistralrs_core..pipeline..mistral..MistralPipeline$u20$as$u20$mistralrs_core..pipeline..Pipeline$GT$::forward::ha7a7d564e9679585 in mistralrs-server
mistralrs_core::models::mistral::Model::forward::h65f13a2e19e8d611 in mistralrs-server
mistralrs_core::models::mistral::DecoderLayer::forward::hff6dd2034ec6664b in mistralrs-server
candle_nn::ops::kvconcat::h1329a0a6eafa7765 in mistralrs-server
candle_core::tensor_cat::_$LT$impl$u20$candle_core..tensor..Tensor$GT$::cat::h7c43ea827a13b08e in mistralrs-server
candle_core::tensor_cat::_$LT$impl$u20$candle_core..tensor..Tensor$GT$::cat_contiguous::hd9b8c02cb3a6334d in mistralrs-server
candle_core::device::Device::alloc_uninit::h8ba026683b968a15 in mistralrs-server
_$LT$candle_core..cpu_backend..CpuDevice$u20$as$u20$candle_core..backend..BackendDevice$GT$::alloc_uninit::he0f5b8080cf5aeb7 in mistralrs-server
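
For context, repeat_kv in candle-based GQA models is typically written along these lines (a sketch; the mistral.rs version may differ), which explains why it shows up allocating through reshape -> alloc_uninit: the expand is a free broadcast view, but reshaping the resulting non-contiguous tensor has to materialize a full copy.

use candle_core::{Result, Tensor};

// Typical grouped-query-attention repeat_kv (sketch). The reshape copies the
// expanded tensor into a fresh contiguous buffer; that copy is the
// alloc_uninit call visible in the backtraces above.
fn repeat_kv(x: &Tensor, n_rep: usize) -> Result<Tensor> {
    if n_rep == 1 {
        return Ok(x.clone());
    }
    let (b, kv_heads, seq_len, head_dim) = x.dims4()?;
    x.unsqueeze(2)?
        .expand((b, kv_heads, n_rep, seq_len, head_dim))?
        .reshape((b, kv_heads * n_rep, seq_len, head_dim))
}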

@EricLBuehler (Owner, Author)

It is very strange that mistralrs_core::xlora_models::mistral::Attention::repeat_kv is reported as being called when you are not running an X-LoRA model: when I added a panic to that function on my machine, it never ran (perhaps release builds merge the identical monomorphized functions, so the profiler attributes frames to the wrong symbol). I'll assume the profiler means the regular mistral repeat_kv implementation. Can you please run this on the Candle mistral so we can get some idea of how they compare?

@lucasavila00 (Contributor) commented Apr 13, 2024

Yeah, it's weird. Once I added the panic too, it reported the correct function.

I also enabled debug symbols with the profiling build profile:

__GI___clone3 in libc.so.6
start_thread in libc.so.6
std::sys::pal::unix::thread::Thread::new::thread_start::h40e6fd3f8ce15a14 in mistralrs-server
_$LT$alloc..boxed..Box$LT$F$C$A$GT$$u20$as$u20$core..ops..function..FnOnce$LT$Args$GT$$GT$::call_once::hd05b2dc112b7a972 in mistralrs-server
_$LT$alloc..boxed..Box$LT$F$C$A$GT$$u20$as$u20$core..ops..function..FnOnce$LT$Args$GT$$GT$::call_once::h32ae492e80523c39 in mistralrs-server
core::ops::function::FnOnce::call_once$u7b$$u7b$vtable.shim$u7d$$u7d$::he7bc82b7859d27ea in mistralrs-server
std::thread::Builder::spawn_unchecked_::_$u7b$$u7b$closure$u7d$$u7d$::he395973212fd5227 in mistralrs-server
std::panic::catch_unwind::h8125d48c000381b5 in mistralrs-server
std::panicking::try::h836c71900e4592ac in mistralrs-server
std::panicking::try::do_call::h5884dae2f150ed49 in mistralrs-server
_$LT$core..panic..unwind_safe..AssertUnwindSafe$LT$F$GT$$u20$as$u20$core..ops..function..FnOnce$LT$$LP$$RP$$GT$$GT$::call_once::h51ad0e79282fb9d5 in mistralrs-server
std::thread::Builder::spawn_unchecked_::_$u7b$$u7b$closure$u7d$$u7d$::_$u7b$$u7b$closure$u7d$$u7d$::h3efed42a53913f56 in mistralrs-server
std::sys_common::backtrace::__rust_begin_short_backtrace::h863a1d42df1f38d4 in mistralrs-server
mistralrs_core::MistralRs::new::_$u7b$$u7b$closure$u7d$$u7d$::h249bbc6d80598a29 in mistralrs-server
mistralrs_core::engine::Engine::run::h3480a4c2af3a7266 in mistralrs-server
_$LT$mistralrs_core..pipeline..mistral..MistralPipeline$u20$as$u20$mistralrs_core..pipeline..Pipeline$GT$::forward::h287e96e269b84bac in mistralrs-server
mistralrs_core::models::mistral::Model::forward::h05a3b38fcb45cb34 in mistralrs-server
mistralrs_core::models::mistral::DecoderLayer::forward::hf23bc1b1ec28acc3 in mistralrs-server
mistralrs_core::models::mistral::Attention::forward::hcc85fe9de1682d7a in mistralrs-server
mistralrs_core::models::mistral::Attention::repeat_kv::hc26597baed00693a in mistralrs-server
candle_core::tensor::Tensor::reshape::hf1f9f0a4912f0217 in mistralrs-server
candle_core::device::Device::alloc_uninit::h56d57331a20cfef6 in mistralrs-server
_$LT$candle_core..cpu_backend..CpuDevice$u20$as$u20$candle_core..backend..BackendDevice$GT$::alloc_uninit::hd74b7fc8acf5858a in mistralrs-server

I'm running the candle example.

@lucasavila00 (Contributor) commented Apr 13, 2024

cargo build --release --example mistral
heaptrack ./target/release/examples/mistral --prompt "Tell me a joke." --sample-len 50

[heaptrack screenshot: Candle mistral example, no leaks]

No leaks. It even freed the model.

I also watched the system resource monitor, and it never went above the 29 GB it takes to load the model initially.

Ah, this was on hf/candle; let me re-run it on the fork being used. -- I ran it on the fork: no leaks there either. Same picture as above.

@lucasavila00 (Contributor)

It seems the memory leak is confined to DecoderLayer and below: no layer above it leaks, while every layer under it does.

[heaptrack screenshot: leaks isolated to DecoderLayer and the layers beneath it]

@EricLBuehler (Owner, Author)

Thank you for getting those traces! I wish we had something like miri for CUDA; that would be very helpful here. After my testing on the branch I mentioned above, the only remaining difference is that we use a custom Candle branch. Its only changes are for CUDA, though, so I'm a bit confused as to why this is happening. I'll take a deeper look.

@lucasavila00 (Contributor)

Disabling the KV cache fixes it on my end.

I ran the same heap profile for the quantized version, and it doesn't leak anything, with or without the KV cache.

heaptrack ./target/profiling/mistralrs-server --no-kv-cache --prompt "Hello!" mistral

[heaptrack screenshot: run with --no-kv-cache, no leaks]

@EricLBuehler (Owner, Author)

Moved to #156.

@EricLBuehler EricLBuehler unpinned this issue Apr 16, 2024