Merged
7 changes: 4 additions & 3 deletions vllm/model_executor/models/granite_speech.py
@@ -64,14 +64,15 @@ class GraniteSpeechAudioInputs(TensorSchema):

Dimensions:
- b: Batch size
-   - nf: Number of audio features (variable length)
+   - fi: Number of input features from the Mel spectrogram.
+   - fo: Number of output features, i.e. the embedding size.
Contributor

high

The term "embedding size" is ambiguous. It is commonly used to refer to the dimensionality of an embedding vector (e.g., the hidden size), but here fo represents the number of output feature vectors (i.e., the sequence length of the embeddings). This could cause confusion for future developers.

To improve clarity, I suggest rephrasing this to avoid ambiguity.

Suggested change
-   - fo: Number of output features, i.e. the embedding size.
+   - fo: Number of output features (i.e., number of embeddings).
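
To make the distinction concrete, here is a minimal sketch in plain Python. The sizes and the names `num_embeddings`, `hidden_size`, and `shape_2d` are hypothetical and chosen only for illustration; they do not come from the model code.

```python
# Hypothetical sizes, chosen only for illustration.
num_embeddings = 5  # what "fo" counts: how many output feature vectors there are
hidden_size = 8     # what "embedding size" usually means: the length of each vector

# One zero-filled vector per output feature.
embeddings = [[0.0] * hidden_size for _ in range(num_embeddings)]

# A mask over output features has one entry per vector, so its length
# follows num_embeddings, not hidden_size.
mask = [True] * num_embeddings


def shape_2d(rows):
    """Return (number of rows, length of each row) for a rectangular nested list."""
    return (len(rows), len(rows[0]))


print(shape_2d(embeddings), len(mask))
```

Under these assumptions, a dimension that sizes the mask corresponds to the first axis (the count of embeddings), which is why "embedding size" can mislead.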

- 160: Fixed feature dimension for Mel spectrogram features
"""

-   input_features: Annotated[torch.Tensor, TensorShape("b", "nf", 160)]
+   input_features: Annotated[torch.Tensor, TensorShape("b", "fi", 160)]
"""Audio input features."""

-   input_features_mask: Annotated[torch.Tensor, TensorShape("b", "nf")]
+   input_features_mask: Annotated[torch.Tensor, TensorShape("b", "fo")]
"""Mask for variable length audio features."""

audio_embed_sizes: Annotated[list[int], TensorShape("b")]
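
The rename from a single `nf` to separate `fi` and `fo` matters if symbolic dimension names are expected to resolve to one size across fields. The sketch below is not vLLM's `TensorSchema` implementation; `check_shapes` is a hypothetical helper, and the concrete sizes (100 input frames, 25 output features) are made up for illustration.

```python
def check_shapes(shapes, specs):
    """Bind each symbolic dim name to one size; raise if a name sees two sizes.

    shapes: field name -> concrete shape tuple
    specs:  field name -> tuple of symbolic names (str) or fixed sizes (int)
    """
    bound = {}
    for field, spec in specs.items():
        shape = shapes[field]
        if len(shape) != len(spec):
            raise ValueError(f"{field}: rank {len(shape)} != {len(spec)}")
        for dim, size in zip(spec, shape):
            if isinstance(dim, int):
                # Fixed dimension: must match exactly.
                if dim != size:
                    raise ValueError(f"{field}: expected {dim}, got {size}")
            elif bound.setdefault(dim, size) != size:
                # Symbolic dimension already bound to a different size.
                raise ValueError(f"{field}: '{dim}' is {bound[dim]}, got {size}")
    return bound


# Hypothetical batch: 100 Mel frames in, 25 output features for the mask.
shapes = {"input_features": (2, 100, 160), "input_features_mask": (2, 25)}

old_specs = {"input_features": ("b", "nf", 160), "input_features_mask": ("b", "nf")}
new_specs = {"input_features": ("b", "fi", 160), "input_features_mask": ("b", "fo")}

print(check_shapes(shapes, new_specs))
```

With these made-up sizes, `new_specs` binds `fi` and `fo` independently, while `old_specs` would raise because `nf` would need to equal both 100 and 25.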