
Model wishlist #49

Closed
7 of 14 tasks
EricLBuehler opened this issue Mar 30, 2024 · 23 comments
Labels
models Additions to model or architectures

Comments

@EricLBuehler (Owner) commented Mar 30, 2024

Please let us know what model architectures you would like to be added!

  • mistralai/Mistral-7B-Instruct-v0.1
  • mistralai/Mistral-7B-Instruct-v0.2
  • mistralai/Mixtral-8x7B-Instruct-v0.1
  • meta-llama/Llama-2-13b-hf
  • google/gemma-7b-it
  • microsoft/phi-2
  • stabilityai/stablelm-2-1_6b
  • 01-ai/Yi-6B
  • RWKV/rwkv-6-world-1b6 and RWKV/rwkv-5-world-1b5
  • yzsydlc/qwen2
  • adept/persimmon-8b-chat
  • mosaicml/mpt-7b

Quantized architectures:

  • llama
  • phi
@EricLBuehler EricLBuehler pinned this issue Mar 30, 2024
@EricLBuehler EricLBuehler added the models Additions to model or architectures label Apr 3, 2024
@lucasavila00 (Contributor) commented Apr 6, 2024

Mistral v0.2

It currently doesn't work with the mistral-gguf setup.

$ ./target/release/mistralrs-server --port 1234 --log output.txt mistral-gguf -t mistralai/Mistral-7B-Instruct-v0.2 -m TheBloke/Mistral-7B-Instruct-v0.2-GGUF -f mistral-7b-instruct-v0.2.Q4_K_M.gguf
2024-04-06T04:23:47.390913Z  INFO mistralrs_server: avx: true, neon: false, simd128: false, f16c: true
2024-04-06T04:23:47.390932Z  INFO mistralrs_server: Sampling method: penalties -> temperature -> topk -> topp -> multinomial
2024-04-06T04:23:47.390935Z  INFO mistralrs_server: Loading model `mistralai/Mistral-7B-Instruct-v0.2` on Cuda(CudaDevice(DeviceId(1)))...
Error: invalid type: null, expected usize at line 19 column 24

I think this is related to an error in config.json: https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.2/blob/main/config.json#L19

Indeed, v0.2 doesn't use sliding-window attention: https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.2
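
For reference, a minimal sketch of the deserialization problem, using a hypothetical trimmed-down Config struct (not the actual mistral.rs code): typing sliding_window as usize rejects the null in v0.2's config.json, while Option<usize> accepts it.

use serde::Deserialize;

// Hypothetical config struct for illustration only.
#[derive(Deserialize)]
struct Config {
    // A plain `usize` fails on v0.2's `"sliding_window": null` with
    // "invalid type: null, expected usize"; `Option<usize>` maps null to None.
    sliding_window: Option<usize>,
}

fn main() -> Result<(), serde_json::Error> {
    let cfg: Config = serde_json::from_str(r#"{ "sliding_window": null }"#)?;
    assert!(cfg.sliding_window.is_none());
    Ok(())
}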

@EricLBuehler (Owner, Author)

Thank you for raising this. I have fixed the issue, and mistral v0.2 should work now.

@hugoabonizio (Contributor)

Is it currently possible to use models that are not tuned for instructions? It seems that only chat models are supported.

$ target/release/mistralrs-server --port 1234 mistral --model-id mistralai/Mistral-7B-v0.1
(...)
No specified chat template, loading default chat template at `./default.json`.

@EricLBuehler (Owner, Author)

@hugoabonizio, I just merged support for models with no chat template, so now models that are not tuned for instructions are supported.

@hugoabonizio (Contributor) commented Apr 11, 2024

@EricLBuehler, thank you for the quick reply! I'm trying to run lm-eval against mistral.rs to compare it with the Python implementation, but I'm having some issues since it calls the completions endpoint (/completions) instead of chat completions (/chat/completions), which has a different output format. I'll try to hack that in over the weekend.

@EricLBuehler (Owner, Author)

@hugoabonizio, #107 just added the /completions endpoint. Hopefully that is helpful!

@hugoabonizio (Contributor)

@EricLBuehler wow, that's faster than I can think! 😆

I'm getting an OOM error using non-quantized Mistral on an A100 80GB. Do you have any clue why?

$ target/release/mistralrs-server --port 1234 mistral --model-id mistralai/Mistral-7B-v0.1
$ curl http://localhost:1234/v1/completions -H "Content-Type: application/json" -H "Authorization: Bearer EMPTY" -d '{
"model": "",
"prompt": "What is Rust?"
}'
{"message":"DriverError(CUDA_ERROR_OUT_OF_MEMORY, \"out of memory\")","partial_response":{"id":"0","choices":[{"finish_reason":"error","index":0,"text":"\n\nRust is a malicious programming language that frustrates and irritates users. It’s activated by the RustM.exe file hiding on your device, disguised as the Windows Adobe Flash Player update “admin123.exe” file.\n\nUpon installing, the virus blackmails the victim to send bitcoin or pay money to remove Rust.\n\nRecently, the virus has been often distributed via an infection email titled: “Adobe Flash Player”‘.\n\nTherefore, it’s crucial to carefully read the document, since any presented data is commonly false.\n\nYou should never open anything from a stranger on the internet, as the Rust infection can catch you off guard and take control of your device.\n\nFurthermore, you shouldn’t click files that end on ‘.doc’ and ‘.exe’. As you know, these could be easily disguised as something innocent, but rather be something dangerous.\n\n## How to Get Rust Virus and How to Remove Rust From Your Device\n\nAs we already mentioned, it’s very easy to get Rust on your computer. The Rust virus distributor sends an email like you’re just any other bad driver.\n\nAs links and file documents with different extensions can be disguised as a game, email with some new data, or an update that easily infects your computer.\n\nMalicious malware viruses like the Rust virus blackmail the victim with pictures of their computer video feed. Through camera pictures and what seems like a screenshot, the Rust virus feels dangerous and frustrating.\n\nEspecially when the presentation is complete with good spellings and grammar.\n\nThis is a way of gaining the victim’s trust. Meaning that some believe what they are being told.\n\nThis in turn can cause fear, as the virus has installed completely and the device now belongs to the virus.\n\nTake into account, that everything seems innocent and easy to use, but in reality, the victim needs a solution as fast as possible.\n\nThe virus doesn’t give the victim options when to pay or send the amount of money they request for the “ransom”.\n\nRemembering that there’s no guarantee that he’ll really send the files back or not.\n\nSo that’s why VirusPro can help you solve the problem by offering you the options to always be safe and protected when it comes to Viruses like Rust.\n\n## VirusPro\n\nVirusPro is an antivirus that works best with Mac.\n\nThey offer tips and tutorials for utilizing information provided by them, such as installation and data/spam scanning. Also","logprobs":null}],"created":1712927025,"model":"mistralai/Mistral-7B-v0.1","system_fingerprint":"local","object":"text_completion","usage":{"completion_tokens":579,"prompt_tokens":5,"total_tokens":584,"avg_tok_per_sec":30.635262,"avg_prompt_tok_per_sec":32.467533,"avg_compl_tok_per_sec":30.62034,"avg_sample_tok_per_sec":77.887436,"total_time_sec":19.063,"total_prompt_time_sec":0.154,"total_completion_time_sec":18.909,"total_sampling_time_sec":7.498}}}

The process starts at ~14 GB of memory and grows without bound until it OOMs. With fewer max_tokens, e.g. 500, it peaks at ~64 GB and returns successfully.

@EricLBuehler (Owner, Author)

One thing I noticed is that the error only happens on non-quantized models; quantized models do not seem to have that problem. I've tested the llama-index integration, which works with a large number of output tokens (easily more than 500) on an A10 with 24GB. #44 seems to be similar, and I'm not really sure why this is happening. I'll take another look.

@hugoabonizio (Contributor)

@EricLBuehler yeah, it seems like it's overgrowing the KV cache or something like that. It doesn't happen with Candle's original implementation, BTW.
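
For context, the usual candle-style KV-cache append looks roughly like the sketch below (an illustration of where the growth comes from, not the exact mistral.rs code). Each decoding step concatenates the new K/V onto the cache, allocating a fresh, larger buffer, so any bug that keeps old buffers alive compounds every step.

use candle_core::{Result, Tensor};

// Sketch: K/V have shape (batch, kv_heads, seq_len, head_dim); dim 2 is the
// sequence dimension that grows by one each decoding step.
fn append_kv(
    cache: &mut Option<(Tensor, Tensor)>,
    k_new: &Tensor,
    v_new: &Tensor,
) -> Result<(Tensor, Tensor)> {
    let (k, v) = match cache.as_ref() {
        None => (k_new.clone(), v_new.clone()),
        // Tensor::cat allocates a new buffer holding the whole history, so
        // a leaked reference to a previous step's buffer adds up quickly.
        Some((k_prev, v_prev)) => (
            Tensor::cat(&[k_prev, k_new], 2)?,
            Tensor::cat(&[v_prev, v_new], 2)?,
        ),
    };
    *cache = Some((k.clone(), v.clone()));
    Ok((k, v))
}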

@EricLBuehler (Owner, Author)

@lucasavila00, do you think you can take a look at this? I've been trying to find what is wrong but have made no progress; maybe a second pair of eyes would help. I have the nonquant_oom_branch where I have been experimenting, in case you want to try that out.

Interestingly, during debugging, I discovered that even after disabling the mistral.rs KV cache mechanism and reverting to the Candle official implementation, the problem persists. Additionally, this is not a problem for the quantized models.

@lucasavila00 (Contributor)

@EricLBuehler I'll give it a shot. I can't run it locally on CUDA though (too little VRAM), which makes it harder to debug (I only know how to use the visual profiler locally, etc.). But I'll try CPU, or a VM if that doesn't work.

@EricLBuehler (Owner, Author)

Thanks! Let me know if you find anything.

@lucasavila00 (Contributor) commented Apr 13, 2024

I can reproduce it on CPU.

To generate this amount of text:

Once upon a time, in a land far, far away, there was a kingdom ruled by a wise and just king. The kingdom was prosperous and peaceful, but there was one problem: the code written by the programmers in the kingdom was prone to errors.

The king knew that errors in code could lead to disastrous consequences, so he called upon his most trusted advisors to find a solution. After much deliberation, they came up with a plan: they would implement error handling in the code.

It used 6 GB of RAM unquantized and 100 MB quantized, both running on CPU.

As far as I know, quantization should only affect the weights, right? The activations and KV cache are still full precision on quantized models, so the RAM usage should grow by the same amount whether or not the model is quantized... 🤔
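
For scale, here's a back-of-envelope estimate of what the KV cache alone should cost, assuming Mistral-7B's usual config values (32 layers, 8 KV heads, head_dim 128) and f32 activations:

fn main() {
    let (layers, kv_heads, head_dim) = (32u64, 8u64, 128u64);
    let bytes_per_elem = 4u64; // f32
    // K and V per generated token, summed over all layers:
    let per_token = 2 * layers * kv_heads * head_dim * bytes_per_elem;
    println!("{} KiB per token", per_token / 1024); // 256 KiB
    println!("{} MiB for 600 tokens", per_token * 600 / (1024 * 1024)); // 150 MiB
}

So even a few hundred tokens should only account for on the order of 150 MiB of cache, nowhere near 6 GB, which points at buffers that are never freed rather than the cache itself.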

@lucasavila00 (Contributor)

@EricLBuehler I did a heap dump of it, and, weirdly, 5 GB were attributed to repeat_kv, plus another 2 GB to kvconcat:

[heaptrack screenshot: unquantized run, allocations dominated by repeat_kv and kvconcat]

Compared to quantized, where no memory leaked:

[heaptrack screenshot: quantized run, no leaked memory]

@EricLBuehler (Owner, Author)

Thanks, that is very useful. Were you running an X-LoRA model? It looks like xlora_models' repeat_kv shows up in the heaptrack trace, and that should not be happening.

@lucasavila00 (Contributor) commented Apr 13, 2024

I'm running:

heaptrack ./target/release/mistralrs-server --prompt "Tell me 3 jokes." mistral-gguf

and

heaptrack ./target/release/mistralrs-server --prompt "Tell me 3 jokes." mistral

I'm looking at the dumps, and the GGUF version leaks no memory at all (only the model itself stays allocated).

The regular version leaks.

I opened the biggest leaks here:

[heaptrack screenshot: largest leak backtraces expanded]

// 1.5gb
__GI___clone3 in libc.so.6
start_thread in libc.so.6
std::sys::pal::unix::thread::Thread::new::thread_start::h40e6fd3f8ce15a14 in mistralrs-server
core::ops::function::FnOnce::call_once$u7b$$u7b$vtable.shim$u7d$$u7d$::hea2f92b31a4a3afa in mistralrs-server
std::sys_common::backtrace::__rust_begin_short_backtrace::hc791cc3c3af3a6b3 in mistralrs-server
mistralrs_core::engine::Engine::run::hf69ab0a23e9700c2 in mistralrs-server
_$LT$mistralrs_core..pipeline..mistral..MistralPipeline$u20$as$u20$mistralrs_core..pipeline..Pipeline$GT$::forward::ha7a7d564e9679585 in mistralrs-server
mistralrs_core::models::mistral::Model::forward::h65f13a2e19e8d611 in mistralrs-server
mistralrs_core::models::mistral::DecoderLayer::forward::hff6dd2034ec6664b in mistralrs-server
mistralrs_core::xlora_models::mistral::Attention::repeat_kv::h44480ea00c7561e8 in mistralrs-server
candle_core::tensor::Tensor::reshape::h8992546c4121d565 in mistralrs-server
candle_core::device::Device::alloc_uninit::h8ba026683b968a15 in mistralrs-server
_$LT$candle_core..cpu_backend..CpuDevice$u20$as$u20$candle_core..backend..BackendDevice$GT$::alloc_uninit::he0f5b8080cf5aeb7 in mistralrs-server

// 0.75gb
__GI___clone3 in libc.so.6
start_thread in libc.so.6
std::sys::pal::unix::thread::Thread::new::thread_start::h40e6fd3f8ce15a14 in mistralrs-server
core::ops::function::FnOnce::call_once$u7b$$u7b$vtable.shim$u7d$$u7d$::hea2f92b31a4a3afa in mistralrs-server
std::sys_common::backtrace::__rust_begin_short_backtrace::hc791cc3c3af3a6b3 in mistralrs-server
mistralrs_core::engine::Engine::run::hf69ab0a23e9700c2 in mistralrs-server
_$LT$mistralrs_core..pipeline..mistral..MistralPipeline$u20$as$u20$mistralrs_core..pipeline..Pipeline$GT$::forward::ha7a7d564e9679585 in mistralrs-server
mistralrs_core::models::mistral::Model::forward::h65f13a2e19e8d611 in mistralrs-server
mistralrs_core::models::mistral::DecoderLayer::forward::hff6dd2034ec6664b in mistralrs-server
candle_nn::ops::kvconcat::h1329a0a6eafa7765 in mistralrs-server
candle_core::tensor_cat::_$LT$impl$u20$candle_core..tensor..Tensor$GT$::cat::h7c43ea827a13b08e in mistralrs-server
candle_core::tensor_cat::_$LT$impl$u20$candle_core..tensor..Tensor$GT$::cat_contiguous::hd9b8c02cb3a6334d in mistralrs-server
candle_core::device::Device::alloc_uninit::h8ba026683b968a15 in mistralrs-server
_$LT$candle_core..cpu_backend..CpuDevice$u20$as$u20$candle_core..backend..BackendDevice$GT$::alloc_uninit::he0f5b8080cf5aeb7 in mistralrs-server
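
For context, repeat_kv in candle-based GQA models is typically written along these lines (a sketch; the mistral.rs version may differ), which explains why it shows up allocating through reshape -> alloc_uninit: the expand is a free broadcast view, but reshaping the resulting non-contiguous tensor has to materialize a full copy.

use candle_core::{Result, Tensor};

// Typical grouped-query-attention repeat_kv (sketch). The reshape copies the
// expanded tensor into a fresh contiguous buffer; that copy is the
// alloc_uninit call visible in the backtraces above.
fn repeat_kv(x: &Tensor, n_rep: usize) -> Result<Tensor> {
    if n_rep == 1 {
        return Ok(x.clone());
    }
    let (b, kv_heads, seq_len, head_dim) = x.dims4()?;
    x.unsqueeze(2)?
        .expand((b, kv_heads, n_rep, seq_len, head_dim))?
        .reshape((b, kv_heads * n_rep, seq_len, head_dim))
}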

@EricLBuehler (Owner, Author)

It is very strange that mistralrs_core::xlora_models::mistral::Attention::repeat_kv is reported as being called when you are not running an X-LoRA model: when I added a panic to that function on my machine, it never ran (perhaps release builds merge the identical monomorphized functions, so the profiler attributes frames to the wrong symbol). I'll assume the profiler means the regular mistral repeat_kv implementation. Can you please run this on the Candle mistral so we can get some idea of how they compare?

@lucasavila00 (Contributor) commented Apr 13, 2024

Yeah, it's weird. Once I added the panic too, it reported the correct function.

I also enabled debug symbols with the profiling build profile:

__GI___clone3 in libc.so.6
start_thread in libc.so.6
std::sys::pal::unix::thread::Thread::new::thread_start::h40e6fd3f8ce15a14 in mistralrs-server
_$LT$alloc..boxed..Box$LT$F$C$A$GT$$u20$as$u20$core..ops..function..FnOnce$LT$Args$GT$$GT$::call_once::hd05b2dc112b7a972 in mistralrs-server
_$LT$alloc..boxed..Box$LT$F$C$A$GT$$u20$as$u20$core..ops..function..FnOnce$LT$Args$GT$$GT$::call_once::h32ae492e80523c39 in mistralrs-server
core::ops::function::FnOnce::call_once$u7b$$u7b$vtable.shim$u7d$$u7d$::he7bc82b7859d27ea in mistralrs-server
std::thread::Builder::spawn_unchecked_::_$u7b$$u7b$closure$u7d$$u7d$::he395973212fd5227 in mistralrs-server
std::panic::catch_unwind::h8125d48c000381b5 in mistralrs-server
std::panicking::try::h836c71900e4592ac in mistralrs-server
std::panicking::try::do_call::h5884dae2f150ed49 in mistralrs-server
_$LT$core..panic..unwind_safe..AssertUnwindSafe$LT$F$GT$$u20$as$u20$core..ops..function..FnOnce$LT$$LP$$RP$$GT$$GT$::call_once::h51ad0e79282fb9d5 in mistralrs-server
std::thread::Builder::spawn_unchecked_::_$u7b$$u7b$closure$u7d$$u7d$::_$u7b$$u7b$closure$u7d$$u7d$::h3efed42a53913f56 in mistralrs-server
std::sys_common::backtrace::__rust_begin_short_backtrace::h863a1d42df1f38d4 in mistralrs-server
mistralrs_core::MistralRs::new::_$u7b$$u7b$closure$u7d$$u7d$::h249bbc6d80598a29 in mistralrs-server
mistralrs_core::engine::Engine::run::h3480a4c2af3a7266 in mistralrs-server
_$LT$mistralrs_core..pipeline..mistral..MistralPipeline$u20$as$u20$mistralrs_core..pipeline..Pipeline$GT$::forward::h287e96e269b84bac in mistralrs-server
mistralrs_core::models::mistral::Model::forward::h05a3b38fcb45cb34 in mistralrs-server
mistralrs_core::models::mistral::DecoderLayer::forward::hf23bc1b1ec28acc3 in mistralrs-server
mistralrs_core::models::mistral::Attention::forward::hcc85fe9de1682d7a in mistralrs-server
mistralrs_core::models::mistral::Attention::repeat_kv::hc26597baed00693a in mistralrs-server
candle_core::tensor::Tensor::reshape::hf1f9f0a4912f0217 in mistralrs-server
candle_core::device::Device::alloc_uninit::h56d57331a20cfef6 in mistralrs-server
_$LT$candle_core..cpu_backend..CpuDevice$u20$as$u20$candle_core..backend..BackendDevice$GT$::alloc_uninit::hd74b7fc8acf5858a in mistralrs-server

I'm running the candle example.

@lucasavila00 (Contributor) commented Apr 13, 2024

cargo build --release --example mistral
heaptrack ./target/release/examples/mistral --prompt "Tell me a joke." --sample-len 50

[heaptrack screenshot: Candle mistral example, no leaks]

No leaks. It even freed the model.

I also watched the system resource monitor, and it never went above the 29 GB it takes to load the model initially.

Ah, this was on hf/candle; let me re-run it on the fork being used. -- I ran it on the fork: no leaks there either. Same picture as above.

@lucasavila00 (Contributor)

It seems the memory leak is confined to DecoderLayer and below: no layer above it leaks, while every layer under it does.

[heaptrack screenshot: leaks isolated to DecoderLayer and the layers beneath it]

@EricLBuehler (Owner, Author)

Thank you for getting those traces! I wish we had something like miri for CUDA; that would be very helpful here. After my testing on the branch I mentioned above, the only remaining difference is that we use a custom Candle branch. Its only changes are for CUDA, though, so I'm a bit confused as to why this is happening. I'll take a deeper look.

@lucasavila00 (Contributor)

Disabling the KV cache fixes it on my end.

I ran the same heap profile for the quantized version, and it doesn't leak anything, with or without the KV cache.

heaptrack ./target/profiling/mistralrs-server --no-kv-cache --prompt "Hello!" mistral

[heaptrack screenshot: run with --no-kv-cache, no leaks]

@EricLBuehler (Owner, Author)

Moved to #156.

@EricLBuehler EricLBuehler unpinned this issue Apr 16, 2024