
feat: Add structured output decoding support for vLLM-spyre #903

Open
R3hankhan123 wants to merge 1 commit into torch-spyre:main from R3hankhan123:vllm-structured-output

Conversation

Collaborator

@R3hankhan123 R3hankhan123 commented Apr 7, 2026

Description

Add structured output decoding support for vLLM-Spyre.

Test Plan

Run the vLLM server with the structured output backend set to xgrammar, then to outlines.

Test output

  1. With xgrammar
[root@b314lp81 ~]# curl http://localhost:8000/v1/chat/completions \
 -H "Content-Type: application/json" \
 -X POST \
 -d '{
   "model": "ibm-granite/granite-3.3-8b-instruct",
   "messages": [
     {"role": "user", "content": "Classify sentiment: I love this product"}
   ],
   "response_format": {
     "type": "json_schema",
     "json_schema": {
       "name": "enum",
       "schema": {
         "type": "object",
         "properties": {
           "sentiment": {
             "type": "string",
             "enum": ["positive", "negative", "neutral"]
           }
         },
         "required": ["sentiment"]
       }
     }
   }
 }' | jq -r '.choices[0].message.content' | jq
{
 "sentiment": "positive"
}
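Because the backend masks logits against the schema during decoding, a response like the one above is guaranteed to parse as JSON and to satisfy the schema. A minimal stdlib-only check for the enum schema used in this request (a hand-rolled sketch for illustration, not the validator vLLM itself uses):

```python
import json

# Hand-rolled check against the "sentiment" enum schema from the request
# above; stdlib only, no jsonschema dependency.
def check_sentiment(payload: str) -> bool:
    obj = json.loads(payload)
    return (
        isinstance(obj, dict)
        and isinstance(obj.get("sentiment"), str)
        and obj["sentiment"] in {"positive", "negative", "neutral"}
    )

print(check_sentiment('{"sentiment": "positive"}'))  # True
print(check_sentiment('{"mood": "happy"}'))          # False
```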
[root@b314lp81 ~]# curl http://localhost:8000/v1/chat/completions \
 -H "Content-Type: application/json" \
 -X POST \
 -d '{
   "model": "ibm-granite/granite-3.3-8b-instruct",
   "messages": [
     {"role": "user", "content": "Score this answer: The result is correct"}
   ],
   "response_format": {
     "type": "json_schema",
     "json_schema": {
       "name": "score",
       "schema": {
         "type": "object",
         "properties": {
           "score": { "type": "number" },
           "passed": { "type": "boolean" }
         },
         "required": ["score", "passed"]
       }
     }
   }
 }' | jq -r '.choices[0].message.content' | jq
{
 "score": 1,
 "passed": true
}
[root@b314lp81 ~]# curl http://localhost:8000/v1/chat/completions \
 -H "Content-Type: application/json" \
 -X POST \
 -d '{
   "model": "ibm-granite/granite-3.3-8b-instruct",
   "messages": [
     {"role": "user", "content": "Extract name and optional email: John (no email provided)"}
   ],
   "response_format": {
     "type": "json_schema",
     "json_schema": {
       "name": "optional",
       "schema": {
         "type": "object",
         "properties": {
           "name": { "type": "string" },
           "email": { "type": "string" }
         },
         "required": ["name"]
       }
     }
   }
 }' | jq -r '.choices[0].message.content' | jq
{
 "name": "John",
 "email": "not provided"
}
[root@b314lp81 ~]# curl http://localhost:8000/v1/chat/completions \
 -H "Content-Type: application/json" \
 -X POST \
 -d '{
   "model": "ibm-granite/granite-3.3-8b-instruct",
   "messages": [
     {"role": "user", "content": "Extract order: Order 123 has 2 apples and 3 bananas"}
   ],
   "response_format": {
     "type": "json_schema",
     "json_schema": {
       "name": "order",
       "schema": {
         "type": "object",
         "properties": {
           "order": {
             "type": "object",
             "properties": {
               "id": { "type": "string" },
               "items": {
                 "type": "array",
                 "items": {
                   "type": "object",
                   "properties": {
                     "product": { "type": "string" },
                     "quantity": { "type": "integer" }
                   },
                   "required": ["product", "quantity"]
                 }
               }
             },
             "required": ["id", "items"]
           }
         },
         "required": ["order"]
        }
      }
    }
  }' | jq -r '.choices[0].message.content' | jq
{
 "order": {
   "id": "123",
   "items": [
     {
       "product": "apples",
       "quantity": 2
     },
     {
       "product": "bananas",
       "quantity": 3
     }
   ]
 }
}
[root@b314lp81 ~]# curl http://localhost:8000/v1/chat/completions \
 -H "Content-Type: application/json" \
 -X POST \
 -d '{
   "model": "ibm-granite/granite-3.3-8b-instruct",
   "messages": [
     {"role": "user", "content": "What is the capital of India?"}
   ],
   "response_format": {
     "type": "json_schema",
     "json_schema": {
       "name": "qa",
       "schema": {
         "type": "object",
         "properties": {
           "answer": { "type": "string" },
           "confidence": { "type": "number" }
         },
         "required": ["answer"]
       }
     }
   }
 }' | jq -r '.choices[0].message.content' | jq
{
 "answer": "New Delhi",
 "confidence": 1
}

  2. With outlines

 [root@b314lp81 ~]# curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -X POST \
  -d '{
    "model": "ibm-granite/granite-3.3-8b-instruct",
    "messages": [
      {"role": "user", "content": "Extract name and age from: John is 28 and lives in Bangalore"}
    ],
    "response_format": {
      "type": "json_schema",
      "json_schema": {
        "name": "nested",
        "schema": {
          "type": "object",
          "properties": {
            "person": {
              "type": "object",
              "properties": {
                "name": { "type": "string" },
                "age": { "type": "integer" }
              },
              "required": ["name", "age"]
            },
            "city": { "type": "string" }
          },
          "required": ["person", "city"]
        }
      }
    }
  }' | jq -r '.choices[0].message.content' | jq
{
  "person": {
    "name": "John",
    "age": 28
  },
  "city": "Bangalore"
}
[root@b314lp81 ~]# curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -X POST \
  -d '{
    "model": "ibm-granite/granite-3.3-8b-instruct",
    "messages": [
      {"role": "user", "content": "Extract people: John is 28, Alice is 30"}
    ],
    "response_format": {
      "type": "json_schema",
      "json_schema": {
        "name": "array",
        "schema": {
          "type": "object",
          "properties": {
            "people": {
              "type": "array",
              "items": {
                "type": "object",
                "properties": {
                  "name": { "type": "string" },
                  "age": { "type": "integer" }
                },
                "required": ["name", "age"]
              }
            }
          },
          "required": ["people"]
        }
      }
    }
  }' | jq -r '.choices[0].message.content' | jq
{
  "people": [
    {
      "name": "John",
      "age": 28
    },
    {
      "name": "Alice",
      "age": 30
    }
  ]
}

Checklist

  • I have read the contributing guidelines
  • My code follows the project's code style (run bash format.sh)
  • I have added tests for my changes (if applicable)
  • I have updated the documentation (if applicable)
  • My commits include a Signed-off-by: line (DCO compliance)


github-actions Bot commented Apr 7, 2026

👋 Hi! Thank you for contributing to vLLM support on Spyre.
Just a reminder: Make sure that your code passes all the linting checks, otherwise your PR won't be able to be merged. To do so, run ./format.sh.
Now you are good to go 🚀.

We also recommend installing prek and configuring it to check your code before every local commit.

@R3hankhan123 R3hankhan123 force-pushed the vllm-structured-output branch 4 times, most recently from 6f52711 to e8ba677 Compare April 9, 2026 09:20
@R3hankhan123 R3hankhan123 force-pushed the vllm-structured-output branch from e8ba677 to c75c86d Compare April 9, 2026 14:10
# llguidance is not supported on s390x due to endianness issues.
if platform.machine() == "s390x":
    backend = self.vllm_config.structured_outputs_config.backend
    if backend == "guidance":
Collaborator


Should this return an error to the user instead?

Before this PR, we always silently ignored structured output requests since we didn't support it at all and didn't want to break tool calling integrations that sometimes requested it. But now that we do support structured output, it might be better to return errors where it's misconfigured so that users know to go fix their deployments. Otherwise we'll be in a mixed state where sometimes structured output works, and sometimes it's not applied.
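The per-request rejection described here could look roughly like the following. This is an illustrative sketch only: the real hook is `SpyrePlatform.validate_request` (its signature is assumed here), and the machine string is taken as a parameter so the logic can be exercised off an s390x host.

```python
# Illustrative sketch, not the actual vllm-spyre code. The real hook is
# SpyrePlatform.validate_request; machine is a parameter here for testing.
def validate_structured_output_backend(backend: str, machine: str) -> None:
    """Reject per-request backends that cannot work on this platform."""
    if machine == "s390x" and backend == "guidance":
        # Raising here fails only this request instead of crashing the server.
        raise ValueError(
            "structured output backend 'guidance' is not supported on s390x")

validate_structured_output_backend("xgrammar", "s390x")  # no error
```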

Collaborator Author


Now raising a RuntimeError instead of a warning

Collaborator


Oh, @R3hankhan123, I'm not sure this can be done here in the model runner, though. Any assertions that happen here generally crash the server. We would need to return an error for this request only, and the easiest way to do that is to validate the request before it even hits the engine, which we can do in SpyrePlatform.validate_request

@R3hankhan123 R3hankhan123 force-pushed the vllm-structured-output branch from c75c86d to 8fb0923 Compare April 9, 2026 14:39
@joerunde
Collaborator

bot:test

Collaborator

@joerunde joerunde left a comment


A few things we need to clean up before merging:

  • We should probably check for the existence of llguidance rather than doing a platform check on s390x, so that we can be forward-compatible if a version of guidance is released that works on Z
  • We should validate the structured outputs config up-front when we validate the request in platform.py to avoid setting the guidance backend when llguidance is not installed
  • We should validate that the model is not deployed with guidance as the default structured output backend if llguidance is not installed
  • We should add tests for these validation cases, as well as one small test that calls a model with a structured output option so we can ensure it actually works
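The first item above (feature-detect llguidance rather than platform-check s390x) can be sketched with `importlib.util.find_spec`; the helper name is hypothetical, not code from this PR:

```python
import importlib.util

# Hypothetical helper: detect the optional dependency instead of hard-coding
# a platform check, so a future s390x-compatible llguidance release works
# without any code change.
def module_available(name: str) -> bool:
    return importlib.util.find_spec(name) is not None

GUIDANCE_AVAILABLE = module_available("llguidance")
```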

@R3hankhan123 R3hankhan123 force-pushed the vllm-structured-output branch from 8fb0923 to ed4f2b2 Compare April 21, 2026 04:57
@R3hankhan123 R3hankhan123 requested a review from joerunde April 21, 2026 04:58
@R3hankhan123
Collaborator Author

@joerunde llguidance version 1.7.3 contains the fix, and I have added it to the overrides in pyproject.toml. Can you take a look? Thanks!

@R3hankhan123 R3hankhan123 force-pushed the vllm-structured-output branch 2 times, most recently from 4b0cf02 to 31a7742 Compare April 21, 2026 05:05
):
    logger.debug("Scheduled tokens in this step: %s",
                 outputs.num_scheduled_tokens)

    outputs._spyre_grammar_output = self.get_grammar_bitmask(outputs)  # type: ignore[attr-defined]
Collaborator


It looks to me like this is here because we combine the token sampling in model_executor.execute_model, and if we were to implement sample_tokens in the model runner then we wouldn't have to do this.

The engine core runs this:

        scheduler_output = self.scheduler.schedule()
        future = self.model_executor.execute_model(scheduler_output, non_block=True)
        grammar_output = self.scheduler.get_grammar_bitmask(scheduler_output)

        model_output = future.result()
        if model_output is None:
            model_output = self.model_executor.sample_tokens(grammar_output)

The important thing that I see here is that this collects the grammar bitmask asynchronously while the model is running, which seems like something that we should be doing.

Assuming the above is correct, then:

  1. At a minimum we need to put some comments here explaining this is what we're doing
  2. I have a heavy preference for actually implementing sample_tokens() so that we can take advantage of the performance benefit

Given the time constraints of releasing this feature for sendnn_inference==2.0.0, we could just add some comments here and open an issue to follow up
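The overlap described above (the grammar bitmask being computed while the forward pass runs) can be sketched with a thread pool. The two functions below are stand-ins with simulated latency, not the real `model_executor.execute_model` / `scheduler.get_grammar_bitmask` calls:

```python
from concurrent.futures import ThreadPoolExecutor
import time

# Stand-ins for model_executor.execute_model and
# scheduler.get_grammar_bitmask; the sleeps simulate their latency.
def execute_model():
    time.sleep(0.05)
    return "hidden_states"

def get_grammar_bitmask():
    time.sleep(0.02)
    return "bitmask"

with ThreadPoolExecutor(max_workers=1) as pool:
    future = pool.submit(execute_model)  # forward pass runs in background
    bitmask = get_grammar_bitmask()      # bitmask computed concurrently
    hidden = future.result()             # join before sampling tokens

print(hidden, bitmask)
```

The point is simply that the bitmask cost is hidden behind the model forward pass instead of being paid serially after it.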

Collaborator Author


Added the comment and TODO

@joerunde
Collaborator

@R3hankhan123 let's update the PR description to remove the bit about llguidance not supported on s390x

Also before merging we still need to encode the examples from the description into unit tests. We should probably flex all the different backends, and be sure that a prompt in the batch that doesn't request structured outputs does not accidentally have structured outputs applied. See test_sampling_params.py for examples of testing other sampling params, that should be a good fit for this test.

@R3hankhan123 R3hankhan123 force-pushed the vllm-structured-output branch 2 times, most recently from 4349ec8 to 271b010 Compare April 22, 2026 05:21
Signed-off-by: Rehan Khan <Rehan.Khan7@ibm.com>
@R3hankhan123 R3hankhan123 force-pushed the vllm-structured-output branch from 271b010 to f0bc67d Compare April 22, 2026 06:06