
[Engine] Refactor output processing for multimodal capabilities in vLLM-omni #20

Merged
hsliuustc0106 merged 2 commits into vllm-project:main from tzhouam:feat/multimodal-output-processor
Oct 24, 2025
Conversation

@tzhouam
Collaborator

@tzhouam tzhouam commented Oct 22, 2025

Purpose

This PR implements the Phase 2 features of https://github.com/hsliuustc0106/vllm-omni/issues/10: refactor output processing for multimodal capabilities in vLLM-omni.

  • Introduced OmniRequestState to manage multimodal request states.
  • Enhanced MultimodalOutputProcessor to handle various output types including images, text, and latents.
  • Implemented methods for accumulating multimodal tensors and processing outputs (see the sketch after this list).
  • Updated output handling to ensure compatibility with vLLM's base processor while allowing custom modality handlers.
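
To make the accumulation flow concrete, here is a minimal sketch of how a per-request state object could collect multimodal tensors before the final output is assembled; the field and method names are illustrative and may differ from the merged code:

import torch
from typing import Optional


class OmniRequestState:
    """Sketch of per-request multimodal state; names are indicative only."""

    def __init__(self, request_id: str, mm_type: Optional[str] = None):
        self.request_id = request_id
        # Modality tag for the accumulated tensor, e.g. "image", "audio", "latents".
        self.mm_type = mm_type.lower() if mm_type else None
        # Running concatenation of per-step multimodal tensors, kept on CPU.
        self.mm_accumulated: Optional[torch.Tensor] = None

    def add_multimodal_tensor(self, tensor: torch.Tensor) -> None:
        # Detach and move to CPU so accumulation does not hold GPU memory.
        t = tensor.detach().to("cpu")
        if self.mm_accumulated is None:
            self.mm_accumulated = t
        else:
            self.mm_accumulated = torch.cat([self.mm_accumulated, t], dim=0)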

Test Plan

Test Result


Essential Elements of an Effective PR Description Checklist
  • The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
  • The test plan, such as providing test command.
  • The test results, such as pasting the results comparison before and after, or e2e results
  • (Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model.
  • (Optional) Release notes update. If your change is user facing, please update the release notes draft.

- Introduced OmniRequestState to manage multimodal request states.
- Enhanced MultimodalOutputProcessor to handle various output types including images, text, and latents.
- Implemented methods for accumulating multimodal tensors and processing outputs.
- Updated output handling to ensure compatibility with vLLM's base processor while allowing custom modality handlers.
@tzhouam tzhouam changed the title from "Refactor output processing for multimodal capabilities in vLLM-omni" to "[Engine] Refactor output processing for multimodal capabilities in vLLM-omni" Oct 22, 2025
@tzhouam tzhouam marked this pull request as ready for review October 22, 2025 08:22
Contributor

Copilot AI left a comment


Pull Request Overview

This PR refactors the multimodal output processing system in vLLM-omni by replacing the original standalone processor with a vLLM-compatible architecture that extends vLLM's base OutputProcessor. The changes introduce better state management for multimodal requests and normalize different output types before delegating to vLLM's processing pipeline.

Key changes:

  • Introduced OmniRequestState to track multimodal tensor accumulation across request lifetime
  • Replaced the original MultimodalOutputProcessor with a vLLM-compatible version that extends VLLMOutputProcessor
  • Implemented modality-specific routing and normalization methods to handle images, text, latents, and audio outputs (see the sketch below)
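
For illustration, a minimal sketch of the routing idea, using hypothetical names rather than the actual class or vLLM signatures:

from typing import Any, Callable, Dict


class OutputRouter:
    """Hypothetical sketch: detect a modality, normalize it, then delegate onward."""

    def __init__(self, handlers: Dict[str, Callable[[Any], Any]]):
        # e.g. {"image": ..., "latents": ..., "pooling": ..., "text": ...}
        self._handlers = handlers

    def detect(self, output: Any) -> str:
        # Attribute probing mirrors the heuristics visible in the PR diff.
        if hasattr(output, "latents") or hasattr(output, "latent_representation"):
            return "latents"
        if getattr(output, "pooler_output", None) is not None:
            return "pooling"
        return "text"

    def route(self, output: Any) -> Any:
        handler = self._handlers.get(self.detect(output), self._handlers["text"])
        return handler(output)

In the PR itself, this kind of logic lives in MultimodalOutputProcessor, which extends vLLM's base output processor and delegates normalized outputs to it rather than standing alone.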


return
try:
if mm_type:
self.mm_type = (mm_type or "").lower()
Copilot AI Oct 22, 2025

Redundant check: if mm_type is truthy, the or "" fallback in the .lower() call is unnecessary since mm_type is already confirmed to be non-empty.

Suggested change
self.mm_type = (mm_type or "").lower()
self.mm_type = mm_type.lower()

Comment thread vllm_omni/engine/output_processor.py Outdated
Comment on lines +93 to +100
except Exception:
pass
if self.mm_accumulated is None:
self.mm_accumulated = t
else:
self.mm_accumulated = torch.cat([self.mm_accumulated, t], dim=0)
except Exception:
pass
Copilot AI Oct 22, 2025

Bare exception handler silently swallows all errors. Consider logging the exception or handling specific exception types to aid debugging.

Suggested change
except Exception:
pass
if self.mm_accumulated is None:
self.mm_accumulated = t
else:
self.mm_accumulated = torch.cat([self.mm_accumulated, t], dim=0)
except Exception:
pass
except Exception as e:
logger.exception("Failed to move tensor to CPU in add_multimodal_tensor.")
if self.mm_accumulated is None:
self.mm_accumulated = t
else:
self.mm_accumulated = torch.cat([self.mm_accumulated, t], dim=0)
except Exception as e:
logger.exception("Exception occurred in add_multimodal_tensor.")

Comment thread vllm_omni/engine/output_processor.py Outdated
Comment on lines +99 to +100
except Exception:
pass
Copilot AI Oct 22, 2025

Bare exception handler silently swallows all errors. Consider logging the exception or handling specific exception types to aid debugging.

Suggested change
except Exception:
pass
except Exception as e:
logger.exception("Error in add_multimodal_tensor")

if num_cached_tokens is not None:
# Keep num_cached_tokens in RequestOutput for compatibility
try:
self.num_cached_tokens = num_cached_tokens # type: ignore[attr-defined]
Copilot AI Oct 22, 2025

The type: ignore[attr-defined] comment suggests this attribute is being set dynamically on a class that doesn't define it. Consider documenting why this dynamic attribute is necessary or adding it to the class definition.
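
For example, one way to avoid the dynamic assignment would be to declare the attribute up front (illustrative only; the actual class layout is defined in the PR):

from typing import Optional


class OutputState:  # hypothetical stand-in for the class that assigns the attribute
    # Declaring the field here documents it and removes the need for the
    # "type: ignore[attr-defined]" suppression when it is assigned later.
    num_cached_tokens: Optional[int] = None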

Comment thread vllm_omni/engine/output_processor.py Outdated
setattr(base_output, "multimodal_output", {})
setattr(base_output, "multimodal_output", {self.mm_type: tensor})
except Exception as e:
logger.warning("Error in _new_completion_output", e)
Copilot AI Oct 22, 2025

The logger.warning() call passes the exception without a matching format placeholder. It should be logger.warning("Error in _new_completion_output: %s", e) to properly format the exception message.

Suggested change
logger.warning("Error in _new_completion_output", e)
logger.warning("Error in _new_completion_output: %s", e)

Comment on lines +253 to +260
assert req_state.detokenizer is not None
assert req_state.logprobs_processor is not None
stop_string = req_state.detokenizer.update(
new_token_ids, finish_reason == FinishReason.STOP)
if stop_string:
finish_reason = FinishReason.STOP
stop_reason = stop_string
req_state.logprobs_processor.update_from_output(eco)
Copilot AI Oct 22, 2025

These assertions will fail for pooling-only requests where detokenizer and logprobs_processor are set to None (lines 60-61). The assertions should be conditional or removed to support non-text output modes.

Suggested change
assert req_state.detokenizer is not None
assert req_state.logprobs_processor is not None
stop_string = req_state.detokenizer.update(
new_token_ids, finish_reason == FinishReason.STOP)
if stop_string:
finish_reason = FinishReason.STOP
stop_reason = stop_string
req_state.logprobs_processor.update_from_output(eco)
if getattr(req_state, "output_kind", None) == RequestOutputKind.TEXT:
assert req_state.detokenizer is not None
assert req_state.logprobs_processor is not None
stop_string = req_state.detokenizer.update(
new_token_ids, finish_reason == FinishReason.STOP)
if stop_string:
finish_reason = FinishReason.STOP
stop_reason = stop_string
req_state.logprobs_processor.update_from_output(eco)

Comment thread vllm_omni/engine/output_processor.py Outdated
Comment on lines +267 to +268
except Exception:
pass
Copilot AI Oct 22, 2025

Bare exception handler silently swallows all errors. Consider logging the exception or handling specific exception types to aid debugging.

Suggested change
except Exception:
pass
except Exception as e:
logger.warning("Error accumulating multimodal tensor: %s", e)

Comment thread vllm_omni/engine/output_processor.py Outdated
setattr(ro, "multimodal_output", {})
ro.multimodal_output[mm_key] = req_state.mm_accumulated
except Exception as e:
logger.warning("Error in process_outputs", e)
Copilot AI Oct 22, 2025

The logger.warning() call passes the exception without a matching format placeholder. It should be logger.warning("Error in process_outputs: %s", e) to properly format the exception message.

Suggested change
logger.warning("Error in process_outputs", e)
logger.warning("Error in process_outputs: %s", e)

Comment thread vllm_omni/engine/output_processor.py Outdated
Comment on lines +325 to +326
except Exception:
pass
Copilot AI Oct 22, 2025

Bare exception handler silently swallows all errors. Consider logging the exception or handling specific exception types to aid debugging.

Suggested change
except Exception:
pass
except Exception as e:
logger.exception(f"Exception in output handler for type '{output_type}': {e}")

Comment on lines +394 to +395
except Exception:
pass
Copilot AI Oct 22, 2025

Bare exception handler silently swallows all errors. Consider logging the exception or handling specific exception types to aid debugging.

Suggested change
except Exception:
pass
except Exception as e:
logger.warning(
"Failed to convert pooling_output to tensor: %r. Exception: %s",
eco.pooling_output, e)

Collaborator

@hsliuustc0106 hsliuustc0106 left a comment


Does this file work for different output modalities, including text, image, wav, etc.?

@hsliuustc0106 hsliuustc0106 linked an issue Oct 23, 2025 that may be closed by this pull request
@tzhouam
Collaborator Author

tzhouam commented Oct 23, 2025

Does this file work for different output modalities, including text, image, wav, etc.?

Yes, we have left the interface in place for text, image, image+text, audio, hidden states, etc., but the actual implementation and post-processing need further discussion depending on the user-interface design.

- Added logging for exceptions during tensor movement to CPU in OmniRequestState and MultimodalOutputProcessor.
- Improved robustness by ensuring the output pipeline continues without crashing on errors.
- Updated comments for clarity on error handling behavior.
@hsliuustc0106
Collaborator

lgtm
approve

@hsliuustc0106 hsliuustc0106 merged commit 4a4c3c1 into vllm-project:main Oct 24, 2025
@tzhouam tzhouam deleted the feat/multimodal-output-processor branch November 30, 2025 04:57
princepride pushed a commit to princepride/vllm-omni that referenced this pull request Jan 10, 2026
…t-processor

[Engine]Refactor output processing for multimodal capabilities in vLLM-omni
lishunyang12 referenced this pull request in lishunyang12/vllm-omni Mar 17, 2026
fix: use legacy config loading path instead of StageConfigFactory