
Inference is exceptionally slow on the L20 GPU #12440

Open
joey9503 opened this issue Nov 25, 2024 · 1 comment
joey9503 commented Nov 25, 2024

Generation speed is 0.08 tokens/sec and GPU utilization is extremely low. (Two screenshots attached in the original issue, both taken 2024-11-25: the inference output and the GPU usage readout.)

System info:
GPU: L20
CUDA: 12.2
PyTorch: 2.5.1
Graphics card driver version: 535.161.08
vLLM version: 0.6.4.post1
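
As a quick sanity check of the setup above, standard PyTorch calls can confirm what the runtime actually sees (a minimal sketch, not part of the original report):

import torch

# Confirm the runtime sees the L20 and matches the versions reported above.
print(torch.__version__)                  # expect 2.5.1
print(torch.version.cuda)                 # CUDA version PyTorch was built with
print(torch.cuda.is_available())          # should be True on a working setup
if torch.cuda.is_available():
    print(torch.cuda.get_device_name(0))  # expect an NVIDIA L20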

Inference script:

from transformers import AutoTokenizer
from vllm import LLM, SamplingParams

# Initialize the tokenizer
tokenizer = AutoTokenizer.from_pretrained("../Qwen2-Math-7B-Instruct")

# Use the default decoding hyperparameters for Qwen2-Math-7B-Instruct.
# max_tokens caps the maximum generation length.
sampling_params = SamplingParams(temperature=0.7, top_p=0.8, repetition_penalty=1.05, max_tokens=512)

# Pass the model name or path. GPTQ and AWQ checkpoints also work here.
llm = LLM(model="../Qwen2-Math-7B-Instruct", enforce_eager=True)
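# Side note: enforce_eager=True disables CUDA graph capture in vLLM, which
# can reduce decode throughput somewhat, though not to 0.08 tokens/sec.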

# Prepare your prompts
prompt = "Tell me something about large language models."
messages = [
    {"role": "system", "content": "You are Qwen, created by Alibaba Cloud. You are a helpful assistant."},
    {"role": "user", "content": prompt}
]
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True
)

# Generate outputs.
outputs = llm.generate([text], sampling_params)

# Print the outputs.
for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
qiuxin2012 (Contributor) commented

Thanks for your question. We don't support NVIDIA GPUs.
