# Deploy Qwen3-0.6B in 10 Minutes

Before deployment, ensure your environment meets the following requirements:

- GPU Driver ≥ 535
- CUDA ≥ 12.3
- cuDNN ≥ 9.5
- Linux x86_64
- Python ≥ 3.10

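A quick way to confirm these prerequisites is to check the local toolchain. The commands below are a minimal sketch and assume `nvidia-smi`, `nvcc`, and `python3` are already on your PATH:

```shell
# Report the installed GPU driver version (should be >= 535)
nvidia-smi --query-gpu=driver_version --format=csv,noheader

# Report the CUDA toolkit version (should be >= 12.3)
nvcc --version

# Report the Python version (should be >= 3.10)
python3 --version
```
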
This guide uses the lightweight Qwen3-0.6B model for demonstration; it can be deployed on most hardware configurations. Docker deployment is recommended.

For more information about how to install FastDeploy, refer to the [installation document](installation/README.md).

## 1. Launch Service

After installing FastDeploy, execute the following command in the terminal to start the service. For how to configure the startup command, refer to the [Parameter Description](../parameters.md).

> ⚠️ **Note:**
> When using HuggingFace models (torch format), you need to enable `--load_choices "default_v1"`.

```shell
# Enable the v1 KV cache scheduler before launching the service
export ENABLE_V1_KVCACHE_SCHEDULER=1
python -m fastdeploy.entrypoints.openai.api_server \
    --model Qwen/Qwen3-0.6B \
    --port 8180 \
    --metrics-port 8181 \
    --engine-worker-queue-port 8182 \
    --max-model-len 32768 \
    --max-num-seqs 32 \
    --load_choices "default_v1"
```

> 💡 Note: If the path given to `--model` does not exist as a subdirectory of the current directory, FastDeploy checks whether AIStudio provides a preset model with that name (such as `Qwen/Qwen3-0.6B`) and, if so, downloads it automatically. The default download path is `~/xx`. For instructions and configuration of automatic model download, see [Model Download](../supported_models.md). A sketch of pre-downloading the weights manually is shown after the parameter notes below.

- `--max-model-len` sets the maximum number of tokens the deployed service supports per request.
- `--max-num-seqs` sets the maximum number of sequences the deployed service processes concurrently.
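
If you prefer to fetch the weights yourself instead of relying on automatic download, a minimal sketch (assuming the `huggingface_hub` CLI is installed and the machine can reach Hugging Face) is to download the torch-format weights to a local directory and point `--model` at that path:

```shell
# Install the Hugging Face CLI and download the weights to a local directory
pip install -U "huggingface_hub[cli]"
huggingface-cli download Qwen/Qwen3-0.6B --local-dir ./Qwen3-0.6B

# Launch against the local path; torch-format models still need --load_choices "default_v1"
export ENABLE_V1_KVCACHE_SCHEDULER=1
python -m fastdeploy.entrypoints.openai.api_server \
    --model ./Qwen3-0.6B \
    --port 8180 \
    --metrics-port 8181 \
    --engine-worker-queue-port 8182 \
    --max-model-len 32768 \
    --max-num-seqs 32 \
    --load_choices "default_v1"
```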

**Related Documents**
- [Service Deployment](../online_serving/README.md)
- [Service Monitoring](../online_serving/metrics.md)

## 2. Request the Service

After starting the service, the following output indicates successful initialization:

```shell
api_server.py[line:91] Launching metrics service at http://0.0.0.0:8181/metrics
api_server.py[line:94] Launching chat completion service at http://0.0.0.0:8180/v1/chat/completions
api_server.py[line:97] Launching completion service at http://0.0.0.0:8180/v1/completions
INFO: Started server process [13909]
INFO: Waiting for application startup.
INFO: Application startup complete.
INFO: Uvicorn running on http://0.0.0.0:8180 (Press CTRL+C to quit)
```

### Health Check

Verify service status (HTTP 200 indicates success):

```shell
curl -i http://0.0.0.0:8180/health
```
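
On first launch, model download and engine initialization can take a while, so the health endpoint may not respond immediately. A simple wait loop (a minimal sketch against the address used above) polls it until it returns 200:

```shell
# Poll the health endpoint every 5 seconds until the service reports ready
until [ "$(curl -s -o /dev/null -w '%{http_code}' http://0.0.0.0:8180/health)" = "200" ]; do
    echo "Waiting for the service to become ready..."
    sleep 5
done
echo "Service is ready."
```

The metrics service started on port 8181 (see the startup log above) can be checked the same way with `curl http://0.0.0.0:8181/metrics`.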

### cURL Request

Send requests to the service with the following command:

```shell
curl -X POST "http://0.0.0.0:8180/v1/chat/completions" \
-H "Content-Type: application/json" \
-d '{
  "messages": [
    {"role": "user", "content": "Write me a poem about large language model."}
  ],
  "stream": true
}'
```
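
The request above streams the response back chunk by chunk. For a quick one-shot check, you can also disable streaming and pretty-print the JSON response (a sketch; `python3 -m json.tool` is used only for formatting):

```shell
# Non-streaming request; the full reply is returned as a single JSON object
curl -s -X POST "http://0.0.0.0:8180/v1/chat/completions" \
-H "Content-Type: application/json" \
-d '{
  "messages": [
    {"role": "user", "content": "Write me a poem about large language model."}
  ],
  "stream": false
}' | python3 -m json.tool
```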

### Python Client (OpenAI-compatible API)

FastDeploy's API is OpenAI-compatible, so you can also send requests with the official Python client:

```python
import openai

host = "0.0.0.0"
port = "8180"
# Point the OpenAI-compatible client at the local FastDeploy service.
client = openai.Client(base_url=f"http://{host}:{port}/v1", api_key="null")

# Stream the chat completion and print content deltas as they arrive.
response = client.chat.completions.create(
    model="null",
    messages=[
        {"role": "system", "content": "I'm a helpful AI assistant."},
        {"role": "user", "content": "Write me a poem about large language model."},
    ],
    stream=True,
)
for chunk in response:
    if chunk.choices[0].delta:
        print(chunk.choices[0].delta.content, end='')
print('\n')
```
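
Besides chat completions, the startup log above also shows a plain completion service at `/v1/completions`; it can be exercised the same way (a sketch following the OpenAI completions request schema):

```shell
# Text completion request against the non-chat endpoint
curl -s -X POST "http://0.0.0.0:8180/v1/completions" \
-H "Content-Type: application/json" \
-d '{
  "prompt": "Large language models are",
  "max_tokens": 64
}' | python3 -m json.tool
```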