Conversation
Signed-off-by: Prashant Gupta <prashantgupta@us.ibm.com>
We were previously reusing the GPU SamplingMetadata class, but there have been incompatible changes upstream (PR vllm-project/vllm#16728). Since it's not clear for now whether we want to, should, or can reuse the LogitsProcessor implementation as-is, I'm making a copy of the old version of the class for the Spyre backend. This won't affect any features for now, since the vLLM change was an internal refactoring with no UX impact.

Signed-off-by: Max de Bayser <mbayser@br.ibm.com>
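The shape of this workaround can be sketched as pinning a local, backend-owned copy of the dataclass so upstream refactors can't break it. The class and field names below are illustrative assumptions, not the actual vLLM `SamplingMetadata` definition:

```python
# Hypothetical sketch: a frozen, backend-local copy of sampling metadata.
# Freezing it makes accidental divergence from the pinned snapshot loud.
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class SpyreSamplingMetadata:
    """Local copy pinned for the Spyre backend (illustrative fields only)."""
    temperature: Optional[list[float]] = None
    top_p: Optional[list[float]] = None
    top_k: Optional[list[int]] = None
    all_greedy: bool = True


meta = SpyreSamplingMetadata(temperature=[0.7], all_greedy=False)
```

Because the copy lives in the backend's own tree, upstream renames or field removals surface as ordinary merge decisions rather than runtime breakage.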
👋 Hi! Thank you for contributing to vLLM support on Spyre. Now you are good to go 🚀
Quick question before reviewing this one thoroughly:
I think the general consensus is that we want to pin to
@yannicks1 Yeah, we talked very briefly about it last Thursday. Our assertion here is that our current delivery mechanism is images that bundle vllm-spyre with vllm, so we're the ones who have to worry about building images with compatible versions. The small risk is that there could be a regression in vllm 0.9.2 with some model or models that we then can't easily roll back, but we haven't matured this stack to GA support anywhere yet, so there's currently no chance of that causing a product regression.
# Description

Fixes a bug introduced in #283 where the decode pass was moved out of the warmup context. AKA: @prashantgupta24's editor likes to unindent things, and @joerunde likes to view diffs with whitespace changes ignored, so here we are 😅

---------

Signed-off-by: Joe Runde <joe@joerun.de>
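The bug class here can be sketched with a toy context manager: both the prefill and the decode warmup passes must run inside the same `with` block, so an editor that unindents the decode call silently moves it outside the context. `warmup_mode` and the log strings below are hypothetical stand-ins, not the actual vllm-spyre warmup API:

```python
# Toy illustration (assumed names): anything unindented out of the `with`
# block no longer runs between "enter" and "exit".
from contextlib import contextmanager


@contextmanager
def warmup_mode(log: list[str]):
    log.append("enter")
    try:
        yield
    finally:
        log.append("exit")


log: list[str] = []
with warmup_mode(log):
    log.append("prefill")
    log.append("decode")  # must stay indented inside the context
```

Viewing the diff with whitespace changes ignored hides exactly this kind of regression, which is why it slipped through review.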
# Description

Fixes a bug introduced in #283 where the non-driver workers did not cache the output tokens for the next decode iteration. This also allows TP tests with TP=2 to run on CPU, so that we can catch these bugs on GHA runs.

---------

Signed-off-by: Joe Runde <joe@joerun.de>
Signed-off-by: Prashant Gupta <prashantgupta@us.ibm.com>
Co-authored-by: Prashant Gupta <prashantgupta@us.ibm.com>
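A minimal sketch of this bug class (not the actual vLLM worker code, and without real `torch.distributed` plumbing): in a tensor-parallel run, every rank, not just the driver, must append the sampled token id to its local cache, or the next decode iteration runs with divergent inputs across ranks:

```python
# Hypothetical simplified worker: the driver samples and broadcasts the
# token id; each rank must cache it locally for the next decode step.
class Worker:
    def __init__(self, is_driver: bool):
        self.is_driver = is_driver
        self.cached_tokens: list[int] = []

    def step(self, broadcast_token: int) -> None:
        # The buggy version appended only when self.is_driver was True,
        # leaving non-driver ranks with stale caches.
        self.cached_tokens.append(broadcast_token)


workers = [Worker(is_driver=(rank == 0)) for rank in range(2)]
for w in workers:
    w.step(42)
```

Running the TP=2 tests on CPU makes this class of divergence reproducible in CI without needing accelerator hardware.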
# Description

This branch has a fix for:

- Caching sampled token ids in `execute_model` instead of `update_states`. This is because of [Optimization] Cache sampled token ids in model runner (vllm#20291).
- Using shared `CachedRequestData` as on vllm:main (✨ Use shared CachedRequestData as vllm:main #273).

## Related Issues

Fix for #271
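The first fix can be sketched as moving the cache write to the point of sampling. All names below (`ModelRunner`, `req_output_tokens`, `_sample`) are hypothetical illustrations, not the actual vllm-spyre model runner:

```python
# Sketch of the change described above: cache the sampled token id right
# after sampling inside execute_model, instead of waiting for the
# scheduler to echo it back into update_states.
class ModelRunner:
    def __init__(self):
        self.req_output_tokens: dict[str, list[int]] = {}

    def execute_model(self, req_id: str) -> int:
        token = self._sample(req_id)
        # Cache immediately: per vllm#20291 the scheduler no longer
        # resends sampled ids, so update_states can't do this anymore.
        self.req_output_tokens.setdefault(req_id, []).append(token)
        return token

    def _sample(self, req_id: str) -> int:
        return 7  # stand-in for the real sampler


runner = ModelRunner()
runner.execute_model("req-0")
```

The design point is that after vllm#20291 the model runner owns the sampled ids, so caching them anywhere downstream of sampling would depend on data the scheduler no longer provides.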