[Spyre-Next] Simplify SiluAndMul by removing custom op barrier #906

GOavi101 wants to merge 1 commit into torch-spyre:main from
Conversation
👋 Hi! Thank you for contributing to vLLM support on Spyre. We also recommend installing prek and configuring it to check your code before every local commit.
bohnstingl left a comment:
Looks good to me in principle.
    self.maybe_compiled_forward_spyre = self.maybe_compile(self.forward_spyre)
So we don't need this anymore, and compilation will instead be triggered by the upstream implementation through enforce_eager, right? Can you check the difference between running with enforce_eager=True and without?

Sure, I will run a test comparing the two modes.
I tested both modes to verify the behavior:
With enforce_eager=True (eager execution):
- Compilation mode: NONE
- Custom ops: 'all'
- Direct eager execution through forward_oot()
With enforce_eager=False (compilation enabled):
- Compilation mode: DYNAMO_TRACE_ONCE
- Custom ops: 'none'
- torch.compile wraps the entire model including SpyreSiluAndMul
- Longer warmup (130s vs 68s) due to compilation overhead
- ~2x faster inference (92s vs 180s)
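The eager-vs-compiled dispatch measured above can be sketched roughly as follows. This is a minimal illustrative stand-in, not the actual vLLM code: `maybe_compile` and `silu_mul_stub` are hypothetical names, and the real compiled path would use torch.compile with the DYNAMO_TRACE_ONCE mode rather than a plain wrapper.

```python
# Hedged sketch: with enforce_eager=True the raw function is returned and
# called directly; otherwise it is wrapped (torch.compile in the real
# implementation, modelled here with a tagged pass-through wrapper).
def maybe_compile(fn, enforce_eager):
    if enforce_eager:
        return fn  # eager path: direct execution, shorter warmup

    def compiled(*args, **kwargs):
        # real code would execute a graph compiled once (DYNAMO_TRACE_ONCE)
        return fn(*args, **kwargs)

    compiled.is_compiled = True
    return compiled

def silu_mul_stub(x):
    return x  # placeholder for the actual SiluAndMul forward

eager = maybe_compile(silu_mul_stub, enforce_eager=True)
graph = maybe_compile(silu_mul_stub, enforce_eager=False)
```

This mirrors the observed trade-off: the compiled path pays a one-time warmup cost in exchange for faster steady-state inference.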
    logger.debug_once(
        "SpyreSiluAndMul: Dispatch: enabled=%s, Forward method=%s, Compiled=%s",
        self.enabled(),
        self._forward_method.__name__,
        self.maybe_compiled_forward_spyre is not self.forward_spyre,
    )
Would logging like this not make sense anymore? The CustomOp can still be enabled or not, right?
Good point! The old logging referenced attributes that no longer exist (_forward_method, maybe_compiled_forward_spyre). In the simplified implementation, we could add simpler logging like:

    logger.debug_once(
        "SpyreSiluAndMul: enabled=%s",
        self.enabled(),
    )

Something like this? It won't show compilation status, since that's now handled by the outer graph, but it would still indicate whether the CustomOp is active.
    this method is called (see _forward_spyre_impl).
    The convert() utility handles device/dtype transfers efficiently,
    skipping data movement when the tensor is already on the correct
    device/dtype (vllm-spyre#863 optimization).
so we have a dependency on #863? Or can we merge it independently?
The convert() utility function already exists and works correctly. PR #863 added an optimization to convert() that skips redundant data movement when a tensor is already on the correct device/dtype, but it's not a hard dependency.
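The skip-when-already-correct behavior described above can be sketched in isolation. This is a hedged stand-in, not the real vllm-spyre code: the `Tensor` class and string device/dtype tags here are hypothetical simplifications of torch tensors.

```python
# Hedged sketch: convert() returns its input untouched when device and
# dtype already match, so there is no hard dependency on PR #863; that PR
# only added this skip path.
class Tensor:
    def __init__(self, data, device="cpu", dtype="float32"):
        self.data, self.device, self.dtype = data, device, dtype

def convert(t, device, dtype):
    if t.device == device and t.dtype == dtype:
        return t  # already correct: skip data movement entirely
    # mismatch: "move" the data (a real implementation would do a
    # device transfer and/or dtype cast here)
    return Tensor(list(t.data), device, dtype)
```

Because the matching case returns the same object, callers can apply convert() unconditionally without paying for redundant transfers.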
ok, makes sense. Can we then just remove the comment that links it to PR #863?
    # Call .contiguous() to ensure zero storage offset (Spyre's Flex backend
    # rejects non-zero offsets). convert() then transfers to Spyre device/dtype,
    # skipping transfer if already correct (vllm-spyre#863 optimization).
    x1 = convert(x[..., :d].contiguous(), self._target_device, self._target_dtype)
This looks like an elegant solution. But out of curiosity, do we know how this slicing is handled by torch-spyre?
The slicing happens on the CPU before transfer to Spyre.
The split cannot happen on Spyre inside a compiled graph, since Spyre's Flex backend rejects non-zero storage offsets (aten.slice.Tensor is unsupported; see torch-spyre/torch-spyre#1333).
ok, but if the preceding op is also executed on Spyre, how should this work? (but maybe out of scope for this PR)
Spyre tensor → CPU (convert detects device mismatch) → slice on CPU → contiguous → back to Spyre.
It isn't optimal, but it's the current workaround for Spyre's aten.slice.Tensor limitation.
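The op itself is simple enough to sketch in pure Python. This is an illustrative reference implementation of the split-then-multiply described in the thread, not the actual Spyre kernel: in the real code the halves are made contiguous and transferred via convert() so that the multiply runs on the Spyre device, whereas here everything runs on plain lists.

```python
import math

def silu(v):
    # SiLU (a.k.a. swish): v * sigmoid(v)
    return v / (1.0 + math.exp(-v))

def silu_and_mul(x):
    # Split the input in half; this slicing is what happens on the CPU,
    # since Spyre's Flex backend rejects non-zero storage offsets.
    d = len(x) // 2
    x1, x2 = x[:d], x[d:]
    # Elementwise silu(x1) * x2; on Spyre this product runs on-device.
    return [silu(a) * b for a, b in zip(x1, x2)]
```

Since silu(0) = 0, an input whose first half is zero yields an all-zero output regardless of the second half.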
- Remove opaque custom-op boundary in SpyreSiluAndMul
- Simplify forward_oot() to compute directly without layer registry
- Add .contiguous() calls to ensure zero storage offset for Spyre

Signed-off-by: Avishek Goswami <avishek.goswami@ibm.com>
Description

This PR simplifies the SpyreSiluAndMul implementation by removing the custom op barrier pattern.

Changes
Technical Details

Background:
The split cannot happen on Spyre inside a compiled graph, since Spyre's Flex backend rejects non-zero storage offsets (aten.slice.Tensor unsupported), confirmed with a standalone repro script (torch-spyre/torch-spyre#1333).

Implementation:
- Call .contiguous() to ensure zero storage offset
- Compute silu(x1) * x2 on the Spyre device

Related Issues
Related to #887
Test Plan

Tested on ibm-granite/granite-3.3-8b-instruct.

Checklist

- Code formatted (bash format.sh)
- Commit includes a Signed-off-by: line (DCO compliance)