[Spyre-Next] Wrapped Embedding layer for spyre#836
bohnstingl merged 14 commits into torch-spyre:main
Conversation
👋 Hi! Thank you for contributing to vLLM support on Spyre. We also recommend installing prek and configuring it to check your code before every local commit.

bot:next-test

@coderfornow looks like something is busted here, tests are failing with:
All tests passing for SpyreVocabParallelEmbedding: `SKIP_UPSTREAM_TESTS=1 python -m pytest tests/test_vocab_parallel_embedding.py -v`
bohnstingl
left a comment
@coderfornow Thank you for the PR. I took a first look at it and made some comments. In addition, I have two general questions:

- Are there tests in vLLM upstream for the `VocabParallelEmbedding` layer? If so, is there a good reason to have separate tests, or can we reuse them from upstream? cc @joerunde
- Did you run the simple E2E test (https://github.com/vllm-project/vllm-spyre/blob/main/vllm_spyre_next/tests/test_vllm_spyre_next.py), and do you get the same outputs with your wrapper as without it? It should be really simple here, as there is not much happening.
bohnstingl
left a comment
I left some minor comments for code quality and functionality reuse
```python
if input_.device.type != "cpu":
    raise NotImplementedError(
        f"Expected input on CPU, got device={input_.device}. "
        "Spyre has no native embedding kernel; input must stay on CPU."
    )
```
I think this check is not necessary. `F.embedding` would work with an input tensor on CPU, and the dispatch machinery in torch-spyre will take care of doing the correct thing.
However, we should bring the tensor to Spyre at this point using the convert function, similar to the rms_norm example.
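A minimal sketch of what the reviewer is suggesting, under stated assumptions: the device guard is dropped, the input is moved to the target device before the lookup, and the dispatcher picks the right kernel (or fallback) per device. The helper name `wrapped_embedding` and the `device` parameter are illustrative, not the actual vllm-spyre code; on a CPU-only host the sketch runs with `device="cpu"` so it is self-contained.

```python
import torch
import torch.nn.functional as F

def wrapped_embedding(input_: torch.Tensor,
                      weight: torch.Tensor,
                      device: str = "cpu") -> torch.Tensor:
    # Bring the indices (and weight, for the sketch) to the target device;
    # with torch-spyre installed this would be the "spyre" device and the
    # dispatcher would route aten.embedding.default appropriately.
    input_ = input_.to(device)
    weight = weight.to(device)
    return F.embedding(input_, weight)

# Demo: a 4-row embedding table; rows are gathered by index.
weight = torch.arange(12, dtype=torch.float32).reshape(4, 3)
out = wrapped_embedding(torch.tensor([0, 2]), weight)
assert out.shape == (2, 3)
```

The point of the sketch is that no explicit CPU check is needed: `F.embedding` works wherever the tensors live, so the wrapper only has to place the tensors.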
```python
"""Embedding execution on CPU.

F.embedding runs on CPU via torch-spyre's spyre__embedding fallback.
torch-spyre has no native embedding kernel; aten.embedding.default falls
back to CPU (spyre__embedding). Moving tensors to Spyre would cause a
DtException (unsupported data format), so no device transfer is performed.

No TP masking or all_reduce is performed (tp_size > 1 is not supported).

Args:
    input_: Token index tensor [num_tokens] on CPU (int64)

Returns:
    Embedding output [num_tokens, embedding_dim] in weight dtype on CPU

Raises:
    NotImplementedError: If input tensor is not on CPU.
"""
```
I feel that we should update this description a bit. From the vLLM perspective, the embedding is running on Spyre, i.e., it should bring the input tensor to Spyre, and then torch-spyre will perform the operation. We can leave a note that torch-spyre currently has a CPU fallback for the embedding, and we could also link the corresponding issue: torch-spyre/torch-spyre#420
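One possible shape for the revised docstring, per this comment — a sketch only, with the wording and the function name `forward_native` assumed rather than taken from the PR:

```python
import torch
import torch.nn.functional as F

def forward_native(input_: torch.Tensor, weight: torch.Tensor) -> torch.Tensor:
    """Embedding execution on Spyre.

    From the vLLM perspective the embedding runs on Spyre: the input tensor
    is brought to the Spyre device and torch-spyre performs the operation.
    Note: torch-spyre currently falls back to CPU for aten.embedding.default
    (spyre__embedding); see torch-spyre/torch-spyre#420.

    No TP masking or all_reduce is performed (tp_size > 1 is not supported).
    """
    # On a CPU-only host this sketch exercises the fallback path directly;
    # the real wrapper would convert the tensors to the Spyre device first.
    return F.embedding(input_, weight)

# Demo: gather two rows of an identity table.
table = torch.eye(6)
out = forward_native(torch.tensor([4, 2]), table)
assert out.shape == (2, 6)
```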
bohnstingl
left a comment
LGTM! Can you confirm that the end-to-end example with a Granite3.3-8B model and only the VocabParallelEmbedding layer wrapper for spyre works and produces meaningful tokens?
@joerunde, is the readthedocs CI fail blocking?

No, not blocking. It has been fixed on main: #860. Joe is OOO, but I can help get this merged once we have the confirmation with Granite 3.3 8b. Please also update the PR description to match the current content of the PR.
I've tested with Granite 3.3 8b; here's the output:
INFO 03-27 10:12:42 [loggers.py:259] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.1 tokens/s, Running: 0 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.0%, Prefix cache hit rate: 0.0%
The PR looks good to me in general. It has been observed, though, that the current way of wrapping for torch-spyre interferes with the enablement of upstream vLLM tests, see #863. To address this, I've opened a PR (#872) that reworks the forward call chain a bit. @coderfornow, what do you think?
@coderfornow #872 has landed. Could you please adopt the modified forward call structure? I will then push for a quick merge.
bohnstingl
left a comment
LGTM, thanks for merging the new call-chain in.

Description
Adds SpyreVocabParallelEmbedding, a Spyre-optimized out-of-tree (OOT) replacement for vLLM's VocabParallelEmbedding, following the custom op pattern from #842 (same as SpyreRMSNorm and SpyreSiluAndMul).
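The wrapping pattern described above can be sketched roughly as follows. This is not the actual vllm-spyre implementation or the vLLM custom-op registration API — the class name suffix, initialization, and the CPU-resident lookup are illustrative assumptions based on the discussion in this thread:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SpyreVocabParallelEmbeddingSketch(nn.Module):
    """Illustrative stand-in for an out-of-tree VocabParallelEmbedding
    replacement: same lookup interface, Spyre-aware execution."""

    def __init__(self, num_embeddings: int, embedding_dim: int,
                 dtype: torch.dtype = torch.float32):
        super().__init__()
        # tp_size > 1 is not supported, so the full vocab lives on one rank
        # and no TP masking or all_reduce is performed.
        self.weight = nn.Parameter(
            torch.empty(num_embeddings, embedding_dim, dtype=dtype))
        nn.init.normal_(self.weight)

    def forward(self, input_: torch.Tensor) -> torch.Tensor:
        # The lookup dispatches through aten.embedding.default; with
        # torch-spyre installed this currently falls back to CPU
        # (spyre__embedding, torch-spyre/torch-spyre#420).
        return F.embedding(input_, self.weight)

# Demo usage with a small vocabulary.
layer = SpyreVocabParallelEmbeddingSketch(100, 8)
out = layer(torch.tensor([1, 2, 3]))
assert out.shape == (3, 8)
```

In the actual PR the class is registered through vLLM's custom-op mechanism (the same pattern as SpyreRMSNorm and SpyreSiluAndMul from #842) so that vLLM instantiates it in place of the upstream layer.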
Related Issues
Test Plan
Checklist
- [ ] Code formatted (`bash format.sh`)
- [ ] Commits include a `Signed-off-by:` line (DCO compliance)