LMDeploy is a toolkit for compressing, deploying, and serving LLMs, developed by the MMRazor and MMDeploy teams. It has the following core features:
- Efficient Inference: LMDeploy delivers up to 1.8x higher request throughput than vLLM by introducing key features such as persistent batch (a.k.a. continuous batching), blocked KV cache, dynamic split&fuse, tensor parallelism, and high-performance CUDA kernels.
- Effective Quantization: LMDeploy supports weight-only and k/v quantization, and the 4-bit inference performance is 2.4x higher than FP16. The quantization quality has been confirmed via OpenCompass evaluation.
- Effortless Distribution Server: Leveraging the request distribution service, LMDeploy facilitates easy and efficient deployment of multi-model services across multiple machines and cards.
- Interactive Inference Mode: By caching the k/v of attention during multi-round dialogues, the engine remembers the dialogue history and avoids reprocessing historical sessions.
- An ACK Pro cluster that contains GPU-accelerated nodes is created. The Kubernetes version of the cluster is 1.22 or later. Each GPU-accelerated node provides at least 16 GB of GPU memory. For more information, see Create an ACK managed cluster.
In this example, the Qwen1.5-4B-Chat model is used to describe how to download a Qwen model, upload the model to Object Storage Service (OSS), and create a persistent volume (PV) and persistent volume claim (PVC) in an ACK cluster.
- Download the model file.
yum install git git-lfs
GIT_LFS_SKIP_SMUDGE=1 git clone https://www.modelscope.cn/qwen/Qwen1.5-4B-Chat.git
cd Qwen1.5-4B-Chat
git lfs pull
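- Upload the Qwen1.5-4B-Chat model files to OSS. Run the following commands from the parent directory of the cloned repository, because the source path ./Qwen1.5-4B-Chat is relative.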
ossutil mkdir oss://<Your-Bucket-Name>/Qwen1.5-4B-Chat
ossutil cp -r ./Qwen1.5-4B-Chat oss://<Your-Bucket-Name>/Qwen1.5-4B-Chat
- Configure PVs and PVCs in the destination cluster.
Replace the variables in ./yamls/dataset.yaml with your actual values before you apply the file.
kubectl apply -f ./yamls/dataset.yaml
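Because the clone in the download step uses GIT_LFS_SKIP_SMUDGE=1, the weight files exist only as small Git LFS pointer stubs until git lfs pull completes. The following is a minimal sketch of a check you could run before uploading to OSS. The function names are hypothetical, and the *.safetensors and *.bin patterns are an assumption about how the weights are stored in the repository.

```shell
# Succeed (exit 0) if the file is still a Git LFS pointer stub rather than real content.
# Pointer stubs are small text files whose first line names the LFS spec.
is_lfs_pointer() {
  # Read only the first bytes so that multi-gigabyte weight files are not scanned.
  head -c 64 "$1" 2>/dev/null |
    LC_ALL=C grep -q '^version https://git-lfs\.github\.com/spec/v1'
}

# Print every weight file under a directory that `git lfs pull` has not materialized.
# The file name patterns are an assumption, adjust them to the model you use.
find_unpulled_weights() {
  find "$1" -type f \( -name '*.safetensors' -o -name '*.bin' \) |
    while IFS= read -r f; do
      if is_lfs_pointer "$f"; then
        printf '%s\n' "$f"
      fi
    done
}

# Example: an empty result suggests that all weight files were downloaded.
# find_unpulled_weights ./Qwen1.5-4B-Chat
```

If the function prints any paths, rerunning git lfs pull inside the repository before the upload is likely the fix.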
- Run the following commands to deploy the Qwen1.5-4B-Chat model as an inference service by using LMDeploy:
kubectl apply -f ./yamls/deploy.yaml
kubectl apply -f ./yamls/service.yaml
- Run the following command to view the details of the inference service:
kubectl get po | grep lmdeploy
Expected output:
lmdeploy-6f54847c94-k54kt 1/1 Running 0 27s
The output indicates that the inference service is running as expected and is ready to provide services.
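Loading the model can take a while, so the pod may show 0/1 in the READY column for some time before it reaches the state shown above. As a rough sketch, the readiness check can be scripted by parsing the same kubectl get po output. The helper name pods_ready is hypothetical, and the column layout (NAME, READY, STATUS) is assumed to match the default output format.

```shell
# Read `kubectl get po` rows on stdin. Succeed only if at least one row is seen
# and every row has all containers ready (READY is n/n) with STATUS Running.
pods_ready() {
  awk '
    $1 == "NAME" { next }            # skip the header row if it is present
    {
      seen = 1
      split($2, ready, "/")          # READY column, for example "1/1"
      if (ready[1] != ready[2] || $3 != "Running") bad = 1
    }
    END { if (!seen || bad) exit 1 }
  '
}

# Example (requires access to the cluster):
# kubectl get po | grep lmdeploy | pods_ready && echo "inference service is ready"
```

The function fails on empty input, so a missing pod is not mistaken for a ready one.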
- Run the following command to create a port forwarding rule between the inference service and the local environment:
kubectl port-forward svc/lmdeploy-service 8000:8000
Expected output:
Forwarding from 127.0.0.1:8000 -> 8000
Forwarding from [::1]:8000 -> 8000
- Run the following command to send an inference request:
curl http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{"model": "qwen", "messages": [{"role": "user", "content": "测试一下"}], "max_tokens": 10, "temperature": 0.7, "top_p": 0.9, "seed": 10}'
Expected output:
{"id":"1","object":"chat.completion","created":1720145825,"model":"qwen","choices":[{"index":0,"message":{"role":"assistant","content":"好的,有什么我可以帮助你的吗?"},"logprobs":null,"finish_reason":"stop"}],"usage":{"prompt_tokens":21,"total_tokens":29,"completion_tokens":8}}
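The request above caps the reply at 10 tokens with max_tokens, so it is worth checking whether the model finished on its own or was cut off. The sketch below pulls the finish_reason field out of the response with grep and cut. It is a quick text match that assumes the compact JSON layout shown in the expected output, not a real JSON parser, and the function name is hypothetical. A value of "stop" is what the sample response shows; a value of "length" would, by the OpenAI-style convention that this endpoint appears to follow, indicate that the token limit was reached.

```shell
# Print the finish_reason of the first choice in a chat completion response on stdin.
# This is a plain text match that relies on the compact JSON layout of the sample
# response. For anything beyond a quick check, use a JSON parser such as jq.
finish_reason() {
  grep -o '"finish_reason": *"[a-z_]*"' | head -n 1 | cut -d'"' -f4
}

# Example (requires the port forwarding rule to be active):
# curl -s http://localhost:8000/v1/chat/completions \
#   -H "Content-Type: application/json" \
#   -d '{"model": "qwen", "messages": [{"role": "user", "content": "测试一下"}], "max_tokens": 10}' \
#   | finish_reason
```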
If you no longer need the resources, delete them promptly.
- Run the following command to delete the inference service and the other resources created from the files in ./yamls, including the PV and PVC:
kubectl delete -f ./yamls
| Tag | Date | Release |
|---|---|---|
| v0.4.2 | 2024-07 | init |