Skip to content
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
Original file line number Diff line number Diff line change
Expand Up @@ -9,6 +9,11 @@ The execution duration of each stage (including pre/post-processing, model forwa
* Use the non-blocking API `ProfileExecuteDuration().capture_async` to set observation points asynchronously when you need to observe the execution duration.
* Use the blocking API `ProfileExecuteDuration().pop_captured_sync` at an appropriate time to get and print the execution durations of all observed stages.

**We have instrumented the key inference stages (including pre-processing, model forward pass, etc.) for execute duration profiling. Execute the script as follows:**
```
VLLM_ASCEND_MODEL_EXECUTE_TIME_OBSERVE=1 python3 vllm-ascend/examples/offline_inference_npu.py
```

## Example Output

```
Expand Down
4 changes: 2 additions & 2 deletions vllm_ascend/worker/model_runner_v1.py
Original file line number Diff line number Diff line change
Expand Up @@ -1298,8 +1298,8 @@ def execute_model(
for tag, duration in durations.items()
]
captured_name = "Decode" if self.attn_state == AscendAttentionState.DecodeOnly else "Prefill"
print(f"Profile execute duration [{captured_name}]:",
" ".join(dr_str))
logger.info("Profile execute duration [%s]:%s", captured_name,
" ".join(dr_str))

return model_runner_output

Expand Down