Hi, I'm doing a load test on a vLLM server. Below are the steps to reproduce:
instance: 1xRTX 3090
load test tool: k6
server command:
python -m vllm.entrypoints.api_server --model mistralai/Mistral-7B-v0.1 --disable-log-requests --port 9009 --max-num-seqs 500
then run k6 with 100 VUs, using these options:
export const options = {
  vus: 100,        // simulate 100 virtual users
  duration: '60s', // run the test for 60 seconds
};
I tried adjusting --max-num-seqs and --max-num-batched-tokens, but the server still can't keep up with 100 VUs. Is there a recommended config for the server?
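For context, here is a rough back-of-envelope of the KV-cache budget on this card. It is only a sketch: it assumes fp16 weights, vLLM's default gpu_memory_utilization of 0.9, a ~7.24B parameter count, and Mistral-7B-v0.1's published config (32 layers, 8 KV heads via grouped-query attention, head dim 128).

```python
# Back-of-envelope KV-cache budget for Mistral-7B-v0.1 on a 24 GB RTX 3090.
# Assumptions: fp16 weights, vLLM's default gpu_memory_utilization of 0.9,
# and the model config published for Mistral-7B-v0.1.

GIB = 1024 ** 3

gpu_mem     = 24 * GIB       # RTX 3090 VRAM
utilization = 0.9            # vLLM default gpu_memory_utilization
weights     = 7.24e9 * 2     # ~7.24B params at 2 bytes each (fp16)

num_layers   = 32
num_kv_heads = 8             # grouped-query attention
head_dim     = 128
dtype_bytes  = 2             # fp16
# K and V are each cached per layer, per KV head, per token:
kv_bytes_per_token = num_layers * 2 * num_kv_heads * head_dim * dtype_bytes
# -> 131072 bytes (128 KiB) per token

cache_budget = gpu_mem * utilization - weights
total_tokens = int(cache_budget // kv_bytes_per_token)

print(f"KV bytes per token: {kv_bytes_per_token}")
print(f"KV-cache budget:    {cache_budget / GIB:.1f} GiB")
print(f"Total token slots:  {total_tokens}")
print(f"Tokens per seq at 100 concurrent seqs: {total_tokens // 100}")
```

Under these assumptions only a few GiB are left for KV cache after the weights, so at 100 concurrent sequences each one gets well under 1K tokens of cache before vLLM has to preempt or queue requests, which would explain the stall regardless of --max-num-seqs.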
Any help is appreciated, thank you.