feat: Add structured output decoding support for vLLM-Spyre #903

R3hankhan123 wants to merge 1 commit into torch-spyre:main
Conversation
👋 Hi! Thank you for contributing to vLLM support on Spyre. We also recommend installing prek and configuring it to check your code before every local commit.
```python
# llguidance is not supported on s390x due to endianness issues.
if platform.machine() == "s390x":
    backend = self.vllm_config.structured_outputs_config.backend
    if backend == "guidance":
```
Should this return an error to the user instead?
Before this PR, we always silently ignored structured output requests since we didn't support it at all and didn't want to break tool calling integrations that sometimes requested it. But now that we do support structured output, it might be better to return errors where it's misconfigured so that users know to go fix their deployments. Otherwise we'll be in a mixed state where sometimes structured output works, and sometimes it's not applied.
Now raising a RuntimeError instead of a warning.
Oh, @R3hankhan123, I'm not sure this can be done here in the model runner, though. Any assertions that happen here generally crash the server. We would need to return an error for this request only, and the easiest way to do that is to validate the request before it even hits the engine, which we can do in SpyrePlatform.validate_request.
bot:test
joerunde left a comment
A few things we need to clean up before merging:
- We should probably check for the existence of llguidance rather than doing a platform check on s390x, so that we can be forward-compatible if a version of guidance is released that works on Z
- We should validate the structured outputs config up-front when we validate the request in platform.py to avoid setting the guidance backend when llguidance is not installed
- We should validate that the model is not deployed with guidance as the default structured output backend if llguidance is not installed
- We should add tests for these validation cases, as well as one small test that calls a model with a structured output option so we can ensure it actually works
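The first suggestion could look something like this (a sketch under assumed names, not the PR's actual code): probe for the import instead of checking `platform.machine()`:

```python
import importlib.util


def llguidance_installed() -> bool:
    # Checking for the package itself keeps this forward-compatible:
    # if a future llguidance release works on s390x (Z), no code
    # change is needed here.
    return importlib.util.find_spec("llguidance") is not None
```

This check can then back both the request-time validation and the startup-time check on the default backend.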
@joerunde llguidance version 1.7.3 contains the fix and I have added it in the overrides of pyproject; can you take a look? Thanks
```python
logger.debug("Scheduled tokens in this step: %s", outputs.num_scheduled_tokens)

outputs._spyre_grammar_output = self.get_grammar_bitmask(outputs)  # type: ignore[attr-defined]
```
It looks to me like this is here because we combine the token sampling in model_executor.execute_model, and if we were to implement sample_tokens in the model runner then we wouldn't have to do this.
The engine core runs this:
```python
scheduler_output = self.scheduler.schedule()
future = self.model_executor.execute_model(scheduler_output, non_block=True)
grammar_output = self.scheduler.get_grammar_bitmask(scheduler_output)
model_output = future.result()
if model_output is None:
    model_output = self.model_executor.sample_tokens(grammar_output)
```
The important thing that I see here is that this collects the grammar bitmask asynchronously while the model is running, which seems like something that we should be doing.
Assuming the above is correct, then:
- At a minimum we need to put some comments here explaining this is what we're doing
- I have a heavy preference for actually implementing `sample_tokens()` so that we can take advantage of the performance benefit
Given the time constraints of releasing this feature for sendnn_inference==2.0.0, we could just add some comments here and open an issue to follow up
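To make the performance point concrete, here is a toy illustration of the overlap the engine-core loop achieves (stand-in sleeps, not vLLM code): the grammar bitmask is built while the forward pass is still running, so the total latency is roughly the max of the two, not their sum.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def forward_pass():
    time.sleep(0.05)   # stand-in for model execution on device
    return None        # None signals that sampling is deferred


def build_grammar_bitmask():
    time.sleep(0.03)   # stand-in for CPU-side bitmask construction
    return "bitmask"


with ThreadPoolExecutor() as pool:
    start = time.monotonic()
    future = pool.submit(forward_pass)   # execute_model(..., non_block=True)
    bitmask = build_grammar_bitmask()    # overlaps with the forward pass
    model_output = future.result()
    elapsed = time.monotonic() - start

# Overlapped total is ~max(0.05, 0.03) rather than 0.05 + 0.03.
```

Collecting the bitmask in the model runner after the forward pass, as the current workaround does, serializes these two costs.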
Added the comment and TODO
@R3hankhan123 let's update the PR description to remove the bit about llguidance not being supported on s390x. Also, before merging we still need to encode the examples from the description into unit tests. We should probably flex all the different backends, and be sure that a prompt in the batch that doesn't request structured outputs does not accidentally have structured outputs applied. See
Signed-off-by: Rehan Khan <Rehan.Khan7@ibm.com>
Description
Add structured output decoding support for vLLM-Spyre.
Test Plan
Run the vLLM server and set the structured output backend to xgrammar and outlines.
Test output
With outlines
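One way to turn this manual check into an assertion, along the lines of the review's request for tests: parse the model's output and verify it matches the requested schema. This is a sketch with a hard-coded sample response standing in for the actual server call:

```python
import json

# Required keys of the hypothetical JSON schema sent with the request.
SCHEMA_REQUIRED = {"name", "age"}


def conforms(text: str) -> bool:
    """A structured-output response must at least be valid JSON
    containing the required keys from the requested schema."""
    try:
        obj = json.loads(text)
    except json.JSONDecodeError:
        return False
    return isinstance(obj, dict) and SCHEMA_REQUIRED <= obj.keys()


# Stand-in for what the server returns with structured outputs enabled:
sample = '{"name": "Ada", "age": 36}'
assert conforms(sample)
# Free-form text, as an unconstrained model might emit, must fail:
assert not conforms("Sure! Here is a person: Ada, 36.")
```

The same predicate can also guard the negative case the review asks for: a prompt in the batch that did not request structured outputs should be allowed to produce free-form text.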
Checklist
- [ ] Run `bash format.sh`
- [ ] Commit includes a `Signed-off-by:` line (DCO compliance)