Describe the bug
Using tcals/code-llama-7b-query0809-200w-completion-2048-400step as a plain model with in-situ quantization to Q4K.
```
2025-06-20T21:13:40.720909Z INFO mistralrs_core::pipeline::isq: Applied in-situ quantization into Some(Q4K) to 225 tensors out of 225 total tensors. Took 15.47s
mistralrs-svc2 | 2025-06-20T21:13:40.721021Z INFO mistralrs_core::paged_attention: Allocating 16384 MB for PagedAttention KV cache per GPU
mistralrs-svc2 | 2025-06-20T21:13:40.721029Z INFO mistralrs_core::paged_attention: Using PagedAttention with block size 32 and 1024 GPU blocks: available context length is 32768 tokens
mistralrs-svc2 | 2025-06-20T21:13:40.879084Z INFO mistralrs_core::pipeline::chat_template: bos_toks = "<s>", eos_toks = "</s>", unk_tok = <unk>
mistralrs-svc2 | 2025-06-20T21:13:40.881196Z INFO mistralrs_server_core::mistralrs_for_server_builder: Model loaded.
mistralrs-svc2 | 2025-06-20T21:13:40.881405Z INFO mistralrs_core: Pipeline input modalities are [📝 Text]
mistralrs-svc2 | 2025-06-20T21:13:40.881411Z INFO mistralrs_core: Pipeline output modalities are [📝 Text]
mistralrs-svc2 | 2025-06-20T21:13:40.881471Z INFO mistralrs_core: Beginning dummy run.
mistralrs-svc2 | 2025-06-20T21:13:40.883572Z INFO mistralrs_core::prefix_cacher: PrefixCacherV2 is enabled. Expect higher multi-turn throughput for both text and multimodal.
mistralrs-svc2 | 2025-06-20T21:13:52.204346Z ERROR mistralrs_core::engine: step - Model failed with error: WithBacktrace { inner: Cuda(MatMulNonContiguous { lhs_stride: Layout { shape: [1, 32, 2, 2], stride: [128, 4, 2, 1], start_offset: 0 }, rhs_stride: Layout { shape: [1, 32, 2, 128], stride: [8192, 128, 4096, 1], start_offset: 0 }, mnk: (2, 128, 2) }), backtrace: Backtrace [{ fn: "candle_core::error::Error::bt" }, { fn: "candle_core::cuda_backend::gemm_config" }, { fn: "<candle_core::cuda_backend::CudaStorage as candle_core::backend::BackendStorage>::matmul_with_alpha" }, { fn: "candle_core::tensor::Tensor::matmul" }, { fn: "mistralrs_core::attention::backends::naive::naive_sdpa" }, { fn: "mistralrs_core::attention::Sdpa::run_attention" }, { fn: "mistralrs_core::paged_attention::layers::paged_attention::PagedAttention::forward" }, { fn: "mistralrs_core::models::llama::Llama::forward_embeds" }, { fn: "<mistralrs_core::models::llama::Llama as mistralrs_core::pipeline::loaders::normal_loaders::NormalModel>::forward" }, { fn: "<mistralrs_core::pipeline::normal::NormalPipeline as mistralrs_core::pipeline::Pipeline>::forward_inputs" }, { fn: "mistralrs_core::pipeline::Pipeline::step::{{closure}}" }, { fn: "mistralrs_core::engine::Engine::run::{{closure}}.61251" }, { fn: "tokio::runtime::runtime::Runtime::block_on" }, { fn: "std::sys::backtrace::__rust_begin_short_backtrace" }, { fn: "core::ops::function::FnOnce::call_once{{vtable.shim}}" }, { fn: "std::sys::pal::unix::thread::Thread::new::thread_start" }, { fn: "clone" }] }
mistralrs-svc2 | 2025-06-20T21:13:52.207236Z INFO mistralrs_core: Dummy run completed in 11.325758642s.
mistralrs-svc2 | 2025-06-20T21:13:52.207262Z INFO mistralrs_server: MCP server listening on http://0.0.0.0:6652/mcp.
mistralrs-svc2 | 2025-06-20T21:13:52.207266Z INFO mistralrs_server: MCP protocol version is 2025-03-26.
mistralrs-svc2 | 2025-06-20T21:13:52.208001Z INFO mistralrs_server: OpenAI-compatible server listening on http://0.0.0.0:7652.
mistralrs-svc2 | 2025-06-20T21:14:00.883935Z INFO mistralrs_core::engine::logger: Throughput (T/s) 0.40, Prefix cache hitrate 0.00%, 0 running, 0 waiting
```

The model presents as an API-callable target, but calls to it return nothing.
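For reference, a minimal sketch of the kind of request that comes back empty, assuming the OpenAI-compatible server from the log above (port 7652) and the standard /v1/chat/completions route; the model name below is a placeholder, not the exact request the reporter sent:

```python
# Hypothetical reproduction of the failing call. Assumes the server logged above
# is reachable on port 7652 and serves the OpenAI-compatible chat completions
# endpoint; "default" is a placeholder model name.
import requests

resp = requests.post(
    "http://0.0.0.0:7652/v1/chat/completions",
    json={
        "model": "default",
        "messages": [{"role": "user", "content": "Write a hello world in Python."}],
        "max_tokens": 64,
    },
    timeout=120,
)

print(resp.status_code)
print(resp.text)  # expected: a chat completion; observed: nothing comes back
```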
Latest commit or version
Which commit or version you ran with.