Using metal and n_gpu_layers produces no tokens #30
Comments
Is llama.cpp actually using Metal? I tried this and noticed (only after enabling some debug logging) that in fact the file …
I copied over the necessary Metal files; otherwise I would get an error. After copying the files I encountered the no-generated-tokens issue.
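For anyone else checking whether Metal is actually initialized: a minimal sketch, assuming the backend is llama-cpp-python and using a placeholder model path. With verbose=True, llama.cpp's backend messages go to stderr, and a Metal-enabled build prints ggml_metal_init lines when layers are offloaded.

```python
# Minimal sketch, assuming llama-cpp-python; the model path is a placeholder.
from llama_cpp import Llama

llm = Llama(
    model_path="models/llama-2-7b-chat.Q4_K_M.gguf",  # placeholder path
    n_gpu_layers=1,   # ask for GPU offload
    verbose=True,     # llama.cpp prints backend logs (e.g. ggml_metal_init) to stderr
)

out = llm("Q: Name the planets in the solar system. A:", max_tokens=32)
print(out["choices"][0]["text"])
```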
AFAIK it does: https://github.com/ggerganov/llama.cpp?tab=readme-ov-file#metal-build
llama-cpp-python requires the user to specify … Do users need to do something similar during …?
Reading through here, it seems like llama.cpp needs to be built with specific flags in order for Metal support to work: ggerganov/llama.cpp#1642
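As a cross-check: the llama-cpp-python README describes reinstalling the wheel with the Metal flag enabled (roughly CMAKE_ARGS="-DLLAMA_METAL=on" pip install --force-reinstall --no-cache-dir llama-cpp-python, though flag names have changed across versions), and a sketch like the one below can confirm whether the installed wheel supports GPU offload at all. This assumes a llama-cpp-python version that exposes llama_supports_gpu_offload; older releases may not have it.

```python
# Sketch: check whether this llama-cpp-python build can offload layers at all.
# Assumes a version that exposes llama_supports_gpu_offload(); older wheels may not.
import llama_cpp

if llama_cpp.llama_supports_gpu_offload():
    print("GPU offload available: n_gpu_layers should take effect (Metal/CUDA/...).")
else:
    print("CPU-only build: n_gpu_layers will have no effect; rebuild with Metal enabled.")
```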
I'm running the example script with a few different models.

When not using metal (not using n_gpu_layers) the models generate tokens, ex: …

When I use n_gpu_layers it does not generate tokens, ex: …

Is this a known behavior?