
Inference is exceptionally slow on the L20 GPU #12440

Open
joey9503 opened this issue Nov 25, 2024 · 1 comment
joey9503 commented Nov 25, 2024

Generation speed is 0.08 tokens/sec and GPU utilization is extremely low. (Two screenshots attached in the original issue, both taken 2024-11-25: the inference output and the GPU usage readout.)

System info:
GPU: L20
CUDA: 12.2
PyTorch: 2.5.1
Graphics card driver version: 535.161.08
vLLM version: 0.6.4.post1
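
As a quick sanity check of the setup above, standard PyTorch calls can confirm what the runtime actually sees (a minimal sketch, not part of the original report):

import torch

# Confirm the runtime sees the L20 and matches the versions reported above.
print(torch.__version__)                  # expect 2.5.1
print(torch.version.cuda)                 # CUDA version PyTorch was built with
print(torch.cuda.is_available())          # should be True on a working setup
if torch.cuda.is_available():
    print(torch.cuda.get_device_name(0))  # expect an NVIDIA L20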

Inference script:

from transformers import AutoTokenizer
from vllm import LLM, SamplingParams

# Initialize the tokenizer
tokenizer = AutoTokenizer.from_pretrained("../Qwen2-Math-7B-Instruct")

# Use the default decoding hyperparameters for Qwen2-Math-7B-Instruct.
# max_tokens caps the maximum generation length.
sampling_params = SamplingParams(temperature=0.7, top_p=0.8, repetition_penalty=1.05, max_tokens=512)

# Pass the model name or path. GPTQ and AWQ checkpoints also work here.
llm = LLM(model="../Qwen2-Math-7B-Instruct", enforce_eager=True)
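# Side note: enforce_eager=True disables CUDA graph capture in vLLM, which
# can reduce decode throughput somewhat, though not to 0.08 tokens/sec.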

# Prepare your prompts
prompt = "Tell me something about large language models."
messages = [
    {"role": "system", "content": "You are Qwen, created by Alibaba Cloud. You are a helpful assistant."},
    {"role": "user", "content": prompt}
]
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True
)

# Generate outputs.
outputs = llm.generate([text], sampling_params)

# Print the outputs.
for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
qiuxin2012 (Contributor) commented

Thanks for your question. We don't support NVIDIA GPUs.
