
feat: Add structured output decoding support for vLLM-spyre #903

Open
R3hankhan123 wants to merge 1 commit into torch-spyre:main from R3hankhan123:vllm-structured-output

Conversation

Collaborator

@R3hankhan123 R3hankhan123 commented Apr 7, 2026

Description

Add structured output decoding support for vLLM-Spyre.

Test Plan

Run the vLLM server with the structured output backend set to xgrammar, then to outlines.

Test output

  1. With xgrammar
[root@b314lp81 ~]# curl http://localhost:8000/v1/chat/completions \
 -H "Content-Type: application/json" \
 -X POST \
 -d '{
   "model": "ibm-granite/granite-3.3-8b-instruct",
   "messages": [
     {"role": "user", "content": "Classify sentiment: I love this product"}
   ],
   "response_format": {
     "type": "json_schema",
     "json_schema": {
       "name": "enum",
       "schema": {
         "type": "object",
         "properties": {
           "sentiment": {
             "type": "string",
             "enum": ["positive", "negative", "neutral"]
           }
         },
         "required": ["sentiment"]
       }
     }
   }
 }' | jq -r '.choices[0].message.content' | jq
{
 "sentiment": "positive"
}
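Because the backend masks logits against the schema during decoding, a response like the one above is guaranteed to parse as JSON and to satisfy the schema. A minimal stdlib-only check for the enum schema used in this request (a hand-rolled sketch for illustration, not the validator vLLM itself uses):

```python
import json

# Hand-rolled check against the "sentiment" enum schema from the request
# above; stdlib only, no jsonschema dependency.
def check_sentiment(payload: str) -> bool:
    obj = json.loads(payload)
    return (
        isinstance(obj, dict)
        and isinstance(obj.get("sentiment"), str)
        and obj["sentiment"] in {"positive", "negative", "neutral"}
    )

print(check_sentiment('{"sentiment": "positive"}'))  # True
print(check_sentiment('{"mood": "happy"}'))          # False
```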
[root@b314lp81 ~]# curl http://localhost:8000/v1/chat/completions \
 -H "Content-Type: application/json" \
 -X POST \
 -d '{
   "model": "ibm-granite/granite-3.3-8b-instruct",
   "messages": [
     {"role": "user", "content": "Score this answer: The result is correct"}
   ],
   "response_format": {
     "type": "json_schema",
     "json_schema": {
       "name": "score",
       "schema": {
         "type": "object",
         "properties": {
           "score": { "type": "number" },
           "passed": { "type": "boolean" }
         },
         "required": ["score", "passed"]
       }
     }
   }
 }' | jq -r '.choices[0].message.content' | jq
{
 "score": 1,
 "passed": true
}
[root@b314lp81 ~]# curl http://localhost:8000/v1/chat/completions \
 -H "Content-Type: application/json" \
 -X POST \
 -d '{
   "model": "ibm-granite/granite-3.3-8b-instruct",
   "messages": [
     {"role": "user", "content": "Extract name and optional email: John (no email provided)"}
   ],
   "response_format": {
     "type": "json_schema",
     "json_schema": {
       "name": "optional",
       "schema": {
         "type": "object",
         "properties": {
           "name": { "type": "string" },
           "email": { "type": "string" }
         },
         "required": ["name"]
       }
     }
   }
 }' | jq -r '.choices[0].message.content' | jq
{
 "name": "John",
 "email": "not provided"
}
[root@b314lp81 ~]# curl http://localhost:8000/v1/chat/completions \
 -H "Content-Type: application/json" \
 -X POST \
 -d '{
   "model": "ibm-granite/granite-3.3-8b-instruct",
   "messages": [
     {"role": "user", "content": "Extract order: Order 123 has 2 apples and 3 bananas"}
   ],
   "response_format": {
     "type": "json_schema",
     "json_schema": {
       "name": "order",
       "schema": {
         "type": "object",
         "properties": {
           "order": {
             "type": "object",
             "properties": {
               "id": { "type": "string" },
               "items": {
                 "type": "array",
                 "items": {
                   "type": "object",
                   "properties": {
                     "product": { "type": "string" },
                     "quantity": { "type": "integer" }
                   },
                   "required": ["product", "quantity"]
                 }
               }
             },
             "required": ["id", "items"]
           }
         },
         "required": ["order"]
        }
      }
    }
  }' | jq -r '.choices[0].message.content' | jq
{
 "order": {
   "id": "123",
   "items": [
     {
       "product": "apples",
       "quantity": 2
     },
     {
       "product": "bananas",
       "quantity": 3
     }
   ]
 }
}
[root@b314lp81 ~]# curl http://localhost:8000/v1/chat/completions \
 -H "Content-Type: application/json" \
 -X POST \
 -d '{
   "model": "ibm-granite/granite-3.3-8b-instruct",
   "messages": [
     {"role": "user", "content": "What is the capital of India?"}
   ],
   "response_format": {
     "type": "json_schema",
     "json_schema": {
       "name": "qa",
       "schema": {
         "type": "object",
         "properties": {
           "answer": { "type": "string" },
           "confidence": { "type": "number" }
         },
         "required": ["answer"]
       }
     }
   }
 }' | jq -r '.choices[0].message.content' | jq
{
 "answer": "New Delhi",
 "confidence": 1
}

  2. With outlines

 [root@b314lp81 ~]# curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -X POST \
  -d '{
    "model": "ibm-granite/granite-3.3-8b-instruct",
    "messages": [
      {"role": "user", "content": "Extract name and age from: John is 28 and lives in Bangalore"}
    ],
    "response_format": {
      "type": "json_schema",
      "json_schema": {
        "name": "nested",
        "schema": {
          "type": "object",
          "properties": {
            "person": {
              "type": "object",
              "properties": {
                "name": { "type": "string" },
                "age": { "type": "integer" }
              },
              "required": ["name", "age"]
            },
            "city": { "type": "string" }
          },
          "required": ["person", "city"]
        }
      }
    }
  }' | jq -r '.choices[0].message.content' | jq
{
  "person": {
    "name": "John",
    "age": 28
  },
  "city": "Bangalore"
}
[root@b314lp81 ~]# curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -X POST \
  -d '{
    "model": "ibm-granite/granite-3.3-8b-instruct",
    "messages": [
      {"role": "user", "content": "Extract people: John is 28, Alice is 30"}
    ],
    "response_format": {
      "type": "json_schema",
      "json_schema": {
        "name": "array",
        "schema": {
          "type": "object",
          "properties": {
            "people": {
              "type": "array",
              "items": {
                "type": "object",
                "properties": {
                  "name": { "type": "string" },
                  "age": { "type": "integer" }
                },
                "required": ["name", "age"]
              }
            }
          },
          "required": ["people"]
        }
      }
    }
  }' | jq -r '.choices[0].message.content' | jq
{
  "people": [
    {
      "name": "John",
      "age": 28
    },
    {
      "name": "Alice",
      "age": 30
    }
  ]
}

Checklist

  • I have read the contributing guidelines
  • My code follows the project's code style (run bash format.sh)
  • I have added tests for my changes (if applicable)
  • I have updated the documentation (if applicable)
  • My commits include a Signed-off-by: line (DCO compliance)


github-actions Bot commented Apr 7, 2026

👋 Hi! Thank you for contributing to vLLM support on Spyre.
Just a reminder: Make sure that your code passes all the linting checks, otherwise your PR won't be able to be merged. To do so, run ./format.sh.
Now you are good to go 🚀.

We also recommend installing prek and configuring it to check your code before every local commit.

@R3hankhan123 R3hankhan123 force-pushed the vllm-structured-output branch 4 times, most recently from 6f52711 to e8ba677 Compare April 9, 2026 09:20
@R3hankhan123 R3hankhan123 force-pushed the vllm-structured-output branch from e8ba677 to c75c86d Compare April 9, 2026 14:10
# llguidance is not supported on s390x due to endianness issues.
if platform.machine() == "s390x":
    backend = self.vllm_config.structured_outputs_config.backend
    if backend == "guidance":
Collaborator


Should this return an error to the user instead?

Before this PR, we always silently ignored structured output requests since we didn't support it at all and didn't want to break tool calling integrations that sometimes requested it. But now that we do support structured output, it might be better to return errors where it's misconfigured so that users know to go fix their deployments. Otherwise we'll be in a mixed state where sometimes structured output works, and sometimes it's not applied.
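The per-request rejection described here could look roughly like the following. This is an illustrative sketch only: the real hook is `SpyrePlatform.validate_request` (its signature is assumed here), and the machine string is taken as a parameter so the logic can be exercised off an s390x host.

```python
# Illustrative sketch, not the actual vllm-spyre code. The real hook is
# SpyrePlatform.validate_request; machine is a parameter here for testing.
def validate_structured_output_backend(backend: str, machine: str) -> None:
    """Reject per-request backends that cannot work on this platform."""
    if machine == "s390x" and backend == "guidance":
        # Raising here fails only this request instead of crashing the server.
        raise ValueError(
            "structured output backend 'guidance' is not supported on s390x")

validate_structured_output_backend("xgrammar", "s390x")  # no error
```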

Collaborator Author


Now raising a RuntimeError instead of a warning

Collaborator


Oh, @R3hankhan123, I'm not sure this can be done here in the model runner, though. Any assertions that happen here generally crash the server. We would need to return an error for this request only, and the easiest way to do that is to validate the request before it even hits the engine, which we can do in SpyrePlatform.validate_request

@R3hankhan123 R3hankhan123 force-pushed the vllm-structured-output branch from c75c86d to 8fb0923 Compare April 9, 2026 14:39
@joerunde
Collaborator

bot:test

Collaborator

@joerunde joerunde left a comment


A few things we need to clean up before merging:

  • We should probably check for the existence of llguidance rather than doing a platform check on s390x, so that we can be forward-compatible if a version of guidance is released that works on Z
  • We should validate the structured outputs config up-front when we validate the request in platform.py to avoid setting the guidance backend when llguidance is not installed
  • We should validate that the model is not deployed with guidance as the default structured output backend if llguidance is not installed
  • We should add tests for these validation cases, as well as one small test that calls a model with a structured output option so we can ensure it actually works
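The first item above (feature-detect llguidance rather than platform-check s390x) can be sketched with `importlib.util.find_spec`; the helper name is hypothetical, not code from this PR:

```python
import importlib.util

# Hypothetical helper: detect the optional dependency instead of hard-coding
# a platform check, so a future s390x-compatible llguidance release works
# without any code change.
def module_available(name: str) -> bool:
    return importlib.util.find_spec(name) is not None

GUIDANCE_AVAILABLE = module_available("llguidance")
```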

@R3hankhan123 R3hankhan123 force-pushed the vllm-structured-output branch from 8fb0923 to ed4f2b2 Compare April 21, 2026 04:57
@R3hankhan123 R3hankhan123 requested a review from joerunde April 21, 2026 04:58
@R3hankhan123
Collaborator Author

@joerunde llguidance version 1.7.3 contains the fix, and I have added it to the overrides in pyproject.toml. Can you take a look? Thanks!

@R3hankhan123 R3hankhan123 force-pushed the vllm-structured-output branch 2 times, most recently from 4b0cf02 to 31a7742 Compare April 21, 2026 05:05
):
    logger.debug("Scheduled tokens in this step: %s",
                 outputs.num_scheduled_tokens)

    outputs._spyre_grammar_output = self.get_grammar_bitmask(outputs)  # type: ignore[attr-defined]
Collaborator


It looks to me like this is here because we combine the token sampling in model_executor.execute_model, and if we were to implement sample_tokens in the model runner then we wouldn't have to do this.

The engine core runs this:

        scheduler_output = self.scheduler.schedule()
        future = self.model_executor.execute_model(scheduler_output, non_block=True)
        grammar_output = self.scheduler.get_grammar_bitmask(scheduler_output)

        model_output = future.result()
        if model_output is None:
            model_output = self.model_executor.sample_tokens(grammar_output)

The important thing that I see here is that this collects the grammar bitmask asynchronously while the model is running, which seems like something that we should be doing.

Assuming the above is correct, then:

  1. At a minimum we need to put some comments here explaining this is what we're doing
  2. I have a heavy preference for actually implementing sample_tokens() so that we can take advantage of the performance benefit

Given the time constraints of releasing this feature for sendnn_inference==2.0.0, we could just add some comments here and open an issue to follow up
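The overlap described above (the grammar bitmask being computed while the forward pass runs) can be sketched with a thread pool. The two functions below are stand-ins with simulated latency, not the real `model_executor.execute_model` / `scheduler.get_grammar_bitmask` calls:

```python
from concurrent.futures import ThreadPoolExecutor
import time

# Stand-ins for model_executor.execute_model and
# scheduler.get_grammar_bitmask; the sleeps simulate their latency.
def execute_model():
    time.sleep(0.05)
    return "hidden_states"

def get_grammar_bitmask():
    time.sleep(0.02)
    return "bitmask"

with ThreadPoolExecutor(max_workers=1) as pool:
    future = pool.submit(execute_model)  # forward pass runs in background
    bitmask = get_grammar_bitmask()      # bitmask computed concurrently
    hidden = future.result()             # join before sampling tokens

print(hidden, bitmask)
```

The point is simply that the bitmask cost is hidden behind the model forward pass instead of being paid serially after it.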

Collaborator Author


Added the comment and TODO

@joerunde
Collaborator

@R3hankhan123 let's update the PR description to remove the bit about llguidance not supported on s390x

Also before merging we still need to encode the examples from the description into unit tests. We should probably flex all the different backends, and be sure that a prompt in the batch that doesn't request structured outputs does not accidentally have structured outputs applied. See test_sampling_params.py for examples of testing other sampling params, that should be a good fit for this test.

@R3hankhan123 R3hankhan123 force-pushed the vllm-structured-output branch 2 times, most recently from 4349ec8 to 271b010 Compare April 22, 2026 05:21
Signed-off-by: Rehan Khan <Rehan.Khan7@ibm.com>
@R3hankhan123 R3hankhan123 force-pushed the vllm-structured-output branch from 271b010 to f0bc67d Compare April 22, 2026 06:06