Slow when using response format with JSON schemas with 8+ optional properties #2902

Open

TwirreM opened this issue Jan 11, 2025 · 0 comments

System Info

Runtime environment:
Target: x86_64-unknown-linux-gnu
Cargo version: 1.80.1
Commit sha: bb9095a
Docker label: sha-bb9095a
nvidia-smi:

Sat Jan 11 07:53:23 2025       
   +---------------------------------------------------------------------------------------+
   | NVIDIA-SMI 535.216.03             Driver Version: 535.216.03   CUDA Version: 12.2     |
   |-----------------------------------------+----------------------+----------------------+
   | GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
   | Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
   |                                         |                      |               MIG M. |
   |=========================================+======================+======================|
   |   0  NVIDIA RTX 6000 Ada Gene...    Off | 00000000:21:00.0 Off |                  Off |
   | 30%   56C    P8              31W / 300W |  45914MiB / 49140MiB |      0%      Default |
   |                                         |                      |                  N/A |
   +-----------------------------------------+----------------------+----------------------+
   |   1  NVIDIA RTX 6000 Ada Gene...    Off | 00000000:22:00.0 Off |                  Off |
   | 30%   57C    P8              37W / 300W |      3MiB / 49140MiB |      0%      Default |
   |                                         |                      |                  N/A |
   +-----------------------------------------+----------------------+----------------------+
   |   2  NVIDIA RTX 6000 Ada Gene...    Off | 00000000:41:00.0 Off |                  Off |
   | 30%   40C    P8               7W / 300W |   3180MiB / 49140MiB |      0%      Default |
   |                                         |                      |                  N/A |
   +-----------------------------------------+----------------------+----------------------+
   |   3  NVIDIA RTX 6000 Ada Gene...    Off | 00000000:42:00.0 Off |                  Off |
   | 30%   48C    P8               8W / 300W |      3MiB / 49140MiB |      0%      Default |
   |                                         |                      |                  N/A |
   +-----------------------------------------+----------------------+----------------------+
                                                                                            
   +---------------------------------------------------------------------------------------+
   | Processes:                                                                            |
   |  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
   |        ID   ID                                                             Usage      |
   |=======================================================================================|
   +---------------------------------------------------------------------------------------+

The model used is Qwen/Qwen2.5-0.5B-Instruct, but the issue is not specific to any particular model.

Docker deployment, official image version 3.0.1.

Information

  • Docker
  • The CLI directly

Tasks

  • An officially supported command
  • My own modifications

Reproduction

  1. Run with any model, e.g. text-generation-launcher --model-id Qwen/Qwen2.5-0.5B-Instruct
  2. Call the API with a prompt without any grammar constraints, to measure how long it takes without them (it shouldn't take more than a second with a small model):
    Repeat the following JSON, don't write anything other than a valid JSON, not even any backticks like (```): {"firstName": "Jane", "lastName": "Doe"}
  3. Call the API with the same prompt, but now with grammar constraints. Below is a simple response format whose JSON schema has 8 optional properties; it takes 8 seconds to run on my machine, and 9 keys take 22 seconds. Each additional key appears to take exponentially more time, all for the same output. Example response_format:
{
  "type": "json_object",
  "value": {
    "$schema": "http://json-schema.org/draft-07/schema#",
    "type": "object",
    "properties": {
      "id": {"type": "string"},
      "firstName": {"type": "string"},
      "lastName": {"type": "string"},
      "email": {"type": "string"},
      "phoneNumber": {"type": "string"},
      "hireDate": {"type": "string"},
      "position": {"type": "string"},
      "department": {"type": "string"}
    }
  }
}

As a workaround, marking all properties as required in the JSON schema makes it fast again, so I assume the slowdown is related to the combinations of optional keys.
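For reference, the workaround looks like this (an illustrative sketch building the schema in Python: same 8 properties as the example above, but with every key listed under "required"):

```python
import json

# The same 8 properties as the schema above, but with every key
# marked required, which restores normal speed in my testing.
properties = {
    "id": {"type": "string"},
    "firstName": {"type": "string"},
    "lastName": {"type": "string"},
    "email": {"type": "string"},
    "phoneNumber": {"type": "string"},
    "hireDate": {"type": "string"},
    "position": {"type": "string"},
    "department": {"type": "string"},
}

fast_schema = {
    "$schema": "http://json-schema.org/draft-07/schema#",
    "type": "object",
    "properties": properties,
    "required": list(properties),  # all keys required -> fast again
}

print(json.dumps(fast_schema, indent=2))
```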

I tried running Outlines in a separate project to check whether that library causes the slowdown, but it transforms the JSON schema into a regex almost instantly.

While waiting, the num-shard CPU threads sit at 100% while the GPU is idle.
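A possible explanation for the CPU-bound compile time (this is an assumption about the cause, not a profile of TGI's internals): each optional property can independently be present or absent, so the grammar must admit every subset of keys, and that count doubles with each property added:

```python
# Back-of-the-envelope intuition (assumed, not measured inside TGI):
# with n optional keys, a grammar has to accept 2**n key subsets,
# and potentially more states still if key order may vary.
for n in range(8, 12):
    print(f"{n} optional keys -> {2 ** n} possible key subsets")
```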

For simplicity of reproduction, here's some Python code to do the comparison:

import requests
from time import time

URL = "http://localhost:8080"

def do_request(constrained: bool):
    """Time one chat completion, with or without the JSON-schema grammar."""
    begin = time()
    response = requests.post(
        f"{URL}/v1/chat/completions",
        headers={
            "Content-Type": "application/json",
        },
        json={
            "model": "tgi",
            "messages": [
                {
                    "role": "user",
                    "content": """Repeat the following JSON, don't write anything other than a valid JSON, not even any backticks like (```): {"firstName": "Jane", "lastName": "Doe"}""",
                }
            ],
            "stream": False,
            # Same 8-property schema as above; omitted for the baseline run.
            "response_format": {
                "type": "json_object",
                "value": {
                    "$schema": "http://json-schema.org/draft-07/schema#",
                    "type": "object",
                    "properties": {
                        "id": {"type": "string"},
                        "firstName": {"type": "string"},
                        "lastName": {"type": "string"},
                        "email": {"type": "string"},
                        "phoneNumber": {"type": "string"},
                        "hireDate": {"type": "string"},
                        "position": {"type": "string"},
                        "department": {"type": "string"},
                    }
                }
            } if constrained else None
        }
    )
    response.raise_for_status()  # fail loudly instead of timing an error response
    print(f"Took {time() - begin:.1f} seconds {'with' if constrained else 'without'} grammar constraints.")

do_request(constrained=False)
do_request(constrained=True)

Expected behavior

I expect running with grammar to not take much longer than without. Especially with larger models, I expect inference time to dominate.
