nvidia-smi output:
Sat Jan 11 07:53:23 2025
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.216.03 Driver Version: 535.216.03 CUDA Version: 12.2 |
|-----------------------------------------+----------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+======================+======================|
| 0 NVIDIA RTX 6000 Ada Gene... Off | 00000000:21:00.0 Off | Off |
| 30% 56C P8 31W / 300W | 45914MiB / 49140MiB | 0% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
| 1 NVIDIA RTX 6000 Ada Gene... Off | 00000000:22:00.0 Off | Off |
| 30% 57C P8 37W / 300W | 3MiB / 49140MiB | 0% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
| 2 NVIDIA RTX 6000 Ada Gene... Off | 00000000:41:00.0 Off | Off |
| 30% 40C P8 7W / 300W | 3180MiB / 49140MiB | 0% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
| 3 NVIDIA RTX 6000 Ada Gene... Off | 00000000:42:00.0 Off | Off |
| 30% 48C P8 8W / 300W | 3MiB / 49140MiB | 0% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
+---------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=======================================================================================|
+---------------------------------------------------------------------------------------+
The model used is Qwen/Qwen2.5-0.5B-Instruct, but the issue is not specific to any particular model.
Deployed via Docker, official image version 3.0.1.
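The exact docker invocation isn't given in the report; the following is a sketch of a typical launch of the 3.0.1 image (port mapping, volume, and GPU flags are assumptions to adjust for your setup):

```shell
# Run the official TGI 3.0.1 image, exposing the server on localhost:8080
docker run --gpus all --shm-size 1g -p 8080:80 \
  -v "$PWD/data:/data" \
  ghcr.io/huggingface/text-generation-inference:3.0.1 \
  --model-id Qwen/Qwen2.5-0.5B-Instruct
```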
Information
- Docker
- The CLI directly

Tasks
- An officially supported command
- My own modifications
Reproduction
1. Run with any model, e.g. text-generation-launcher --model-id Qwen/Qwen2.5-0.5B-Instruct
2. Call the API with a prompt and no grammar constraints to establish a baseline; with a small model this shouldn't take more than a second. Prompt: Repeat the following JSON, don't write anything other than a valid JSON, not even any backticks like (```): {"firstName": "Jane", "lastName": "Doe"}
3. Call the API with the same prompt, but now with grammar constraints. Below is a simple response format with a JSON schema of 8 properties, which takes 8 seconds on my machine; 9 keys take 22 seconds. More keys seem to take exponentially more time, all for the same output. Example response_format:
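The report's exact schema isn't reproduced here; the following is a hypothetical reconstruction matching the description (8 string properties, none listed in required — the slow case). The "json_object"/"value" envelope follows TGI's grammar guidance docs and may differ between versions:

```json
{
  "type": "json_object",
  "value": {
    "type": "object",
    "properties": {
      "p1": {"type": "string"},
      "p2": {"type": "string"},
      "p3": {"type": "string"},
      "p4": {"type": "string"},
      "p5": {"type": "string"},
      "p6": {"type": "string"},
      "p7": {"type": "string"},
      "p8": {"type": "string"}
    }
  }
}
```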
As a workaround, marking all properties as required in the JSON schema makes it fast again. So I assume it has to do with permutations.
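A back-of-envelope count supports the permutation hypothesis: if none of the n properties are required, a matching object may contain any subset of the keys in any order, so the number of distinct key orderings is the sum over subset sizes k of C(n, k) · k!. This only illustrates the growth rate; it is not a claim about how the grammar compiler actually enumerates states:

```python
from math import comb, factorial

def optional_key_orderings(n):
    # Any subset of the n optional keys can appear, in any order:
    # sum over subset sizes k of C(n, k) subsets times k! orderings.
    return sum(comb(n, k) * factorial(k) for k in range(n + 1))

print(optional_key_orderings(8))  # 109601
print(optional_key_orderings(9))  # 986410
```

Going from 8 to 9 optional keys multiplies this count by roughly 9, so any processing step that scales with the number of orderings would blow up quickly, while marking every property required pins the keys down and avoids the explosion.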
I tried separating Outlines into a separate project to see whether that library causes the slowdowns, but it transforms the JSON schema into regex almost instantly.
While waiting for the response, the num-shard worker threads sit at 100% CPU while the GPU is idle.
For simplicity of reproduction, here's some Python code to do the comparison:
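The original comparison script isn't included in this copy of the report; below is a minimal sketch, assuming a TGI server reachable at http://localhost:8080 (adjust to your port mapping) exposing the OpenAI-compatible /v1/chat/completions endpoint. The response_format envelope ("json_object" with a "value" schema) follows TGI's guidance docs but is an assumption to check against your TGI version:

```python
import json
import time
import urllib.request

TGI_URL = "http://localhost:8080/v1/chat/completions"  # assumed port mapping
PROMPT = ("Repeat the following JSON, don't write anything other than a valid JSON: "
          '{"firstName": "Jane", "lastName": "Doe"}')

def make_schema(n):
    # n optional string properties; nothing in "required" (the slow case).
    return {"type": "object",
            "properties": {f"p{i}": {"type": "string"} for i in range(n)}}

def build_payload(prompt, schema=None):
    payload = {"model": "tgi",
               "messages": [{"role": "user", "content": prompt}],
               "max_tokens": 100}
    if schema is not None:
        # Envelope per TGI guidance docs (assumption; verify for your version).
        payload["response_format"] = {"type": "json_object", "value": schema}
    return payload

def timed_call(payload):
    # POST the request and measure wall-clock latency.
    req = urllib.request.Request(
        TGI_URL, data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"})
    start = time.perf_counter()
    with urllib.request.urlopen(req) as resp:
        body = json.load(resp)
    return time.perf_counter() - start, body

def compare():
    # Requires a running TGI server.
    for schema in (None, make_schema(8), make_schema(9)):
        elapsed, _ = timed_call(build_payload(PROMPT, schema))
        label = "no grammar" if schema is None else f"{len(schema['properties'])} keys"
        print(f"{label}: {elapsed:.2f}s")

# compare()  # uncomment with a TGI server running
```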
System Info
Runtime environment:
Target: x86_64-unknown-linux-gnu
Cargo version: 1.80.1
Commit sha: bb9095a
Docker label: sha-bb9095a
nvidia-smi: see the output at the top of this report.
Expected behavior
I expect running with a grammar constraint to take not much longer than running without one. Especially with larger models, I expect inference time to dominate.