Slow when using response format with JSON schemas with 8+ optional properties #2902

Open

TwirreM opened this issue Jan 11, 2025 · 0 comments

System Info

Runtime environment:
Target: x86_64-unknown-linux-gnu
Cargo version: 1.80.1
Commit sha: bb9095a
Docker label: sha-bb9095a
nvidia-smi:

Sat Jan 11 07:53:23 2025       
   +---------------------------------------------------------------------------------------+
   | NVIDIA-SMI 535.216.03             Driver Version: 535.216.03   CUDA Version: 12.2     |
   |-----------------------------------------+----------------------+----------------------+
   | GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
   | Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
   |                                         |                      |               MIG M. |
   |=========================================+======================+======================|
   |   0  NVIDIA RTX 6000 Ada Gene...    Off | 00000000:21:00.0 Off |                  Off |
   | 30%   56C    P8              31W / 300W |  45914MiB / 49140MiB |      0%      Default |
   |                                         |                      |                  N/A |
   +-----------------------------------------+----------------------+----------------------+
   |   1  NVIDIA RTX 6000 Ada Gene...    Off | 00000000:22:00.0 Off |                  Off |
   | 30%   57C    P8              37W / 300W |      3MiB / 49140MiB |      0%      Default |
   |                                         |                      |                  N/A |
   +-----------------------------------------+----------------------+----------------------+
   |   2  NVIDIA RTX 6000 Ada Gene...    Off | 00000000:41:00.0 Off |                  Off |
   | 30%   40C    P8               7W / 300W |   3180MiB / 49140MiB |      0%      Default |
   |                                         |                      |                  N/A |
   +-----------------------------------------+----------------------+----------------------+
   |   3  NVIDIA RTX 6000 Ada Gene...    Off | 00000000:42:00.0 Off |                  Off |
   | 30%   48C    P8               8W / 300W |      3MiB / 49140MiB |      0%      Default |
   |                                         |                      |                  N/A |
   +-----------------------------------------+----------------------+----------------------+
                                                                                            
   +---------------------------------------------------------------------------------------+
   | Processes:                                                                            |
   |  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
   |        ID   ID                                                             Usage      |
   |=======================================================================================|
   +---------------------------------------------------------------------------------------+

The model used is Qwen/Qwen2.5-0.5B-Instruct, but the issue is not specific to any particular model.

Docker deployment, official image version 3.0.1.

Information

  • Docker
  • The CLI directly

Tasks

  • An officially supported command
  • My own modifications

Reproduction

  1. Run with any model, e.g. text-generation-launcher --model-id Qwen/Qwen2.5-0.5B-Instruct
  2. Call the API with a prompt without any grammar constraints, to measure how long it takes without them (it shouldn't take more than a second with a small model):
    Repeat the following JSON, don't write anything other than a valid JSON, not even any backticks like (```): {"firstName": "Jane", "lastName": "Doe"}
  3. Call the API with the same prompt, but now with grammar constraints. Below is a simple response format whose JSON schema has 8 optional properties; it takes 8 seconds to run on my machine, and 9 keys take 22 seconds. Each additional key appears to take exponentially more time, all for the same output. Example response_format:
{
  "type": "json_object",
  "value": {
    "$schema": "http://json-schema.org/draft-07/schema#",
    "type": "object",
    "properties": {
      "id": {"type": "string"},
      "firstName": {"type": "string"},
      "lastName": {"type": "string"},
      "email": {"type": "string"},
      "phoneNumber": {"type": "string"},
      "hireDate": {"type": "string"},
      "position": {"type": "string"},
      "department": {"type": "string"}
    }
  }
}

As a workaround, marking all properties as required in the JSON schema makes it fast again, so I assume the slowdown is related to the combinations of optional keys.
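For reference, the workaround looks like this (an illustrative sketch building the schema in Python: same 8 properties as the example above, but with every key listed under "required"):

```python
import json

# The same 8 properties as the schema above, but with every key
# marked required, which restores normal speed in my testing.
properties = {
    "id": {"type": "string"},
    "firstName": {"type": "string"},
    "lastName": {"type": "string"},
    "email": {"type": "string"},
    "phoneNumber": {"type": "string"},
    "hireDate": {"type": "string"},
    "position": {"type": "string"},
    "department": {"type": "string"},
}

fast_schema = {
    "$schema": "http://json-schema.org/draft-07/schema#",
    "type": "object",
    "properties": properties,
    "required": list(properties),  # all keys required -> fast again
}

print(json.dumps(fast_schema, indent=2))
```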

I tried running Outlines in a separate project to check whether that library causes the slowdown, but it transforms the JSON schema into a regex almost instantly.

While waiting, the num-shard CPU threads sit at 100% while the GPU is idle.
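A possible explanation for the CPU-bound compile time (this is an assumption about the cause, not a profile of TGI's internals): each optional property can independently be present or absent, so the grammar must admit every subset of keys, and that count doubles with each property added:

```python
# Back-of-the-envelope intuition (assumed, not measured inside TGI):
# with n optional keys, a grammar has to accept 2**n key subsets,
# and potentially more states still if key order may vary.
for n in range(8, 12):
    print(f"{n} optional keys -> {2 ** n} possible key subsets")
```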

For simplicity of reproduction, here's some Python code to do the comparison:

import requests
from time import time

URL = "http://localhost:8080"

def do_request(constrained: bool):
    """Time one chat completion, with or without the JSON-schema grammar."""
    begin = time()
    response = requests.post(
        f"{URL}/v1/chat/completions",
        headers={
            "Content-Type": "application/json",
        },
        json={
            "model": "tgi",
            "messages": [
                {
                    "role": "user",
                    "content": """Repeat the following JSON, don't write anything other than a valid JSON, not even any backticks like (```): {"firstName": "Jane", "lastName": "Doe"}""",
                }
            ],
            "stream": False,
            # Same 8-property schema as above; omitted for the baseline run.
            "response_format": {
                "type": "json_object",
                "value": {
                    "$schema": "http://json-schema.org/draft-07/schema#",
                    "type": "object",
                    "properties": {
                        "id": {"type": "string"},
                        "firstName": {"type": "string"},
                        "lastName": {"type": "string"},
                        "email": {"type": "string"},
                        "phoneNumber": {"type": "string"},
                        "hireDate": {"type": "string"},
                        "position": {"type": "string"},
                        "department": {"type": "string"},
                    }
                }
            } if constrained else None
        }
    )
    response.raise_for_status()  # fail loudly instead of timing an error response
    print(f"Took {time() - begin:.1f} seconds {'with' if constrained else 'without'} grammar constraints.")

do_request(constrained=False)
do_request(constrained=True)

Expected behavior

I expect running with grammar to not take much longer than without. Especially with larger models, I expect inference time to dominate.
