[Core][Feature] Input metadata dump on crash #13407
njhill merged 29 commits into vllm-project:main from
Conversation
Signed-off-by: Wallas Santos <wallashss@ibm.com>
👋 Hi! Thank you for contributing to the vLLM project. 💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels. Just a reminder: PRs would not trigger full CI run by default. Instead, it would only run Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging. To run CI, PR reviewers can either: Add 🚀 |
@wallashss Thanks for writing up this PR. I think it will be useful to have details for debugging printed to the logs at crash! When I try out these changes in my dev environment running online mode with and sending a request with a large prompt and requesting I see the The above seems to happen only when the first request to the server crashes it. If I send a shortened request first (e.g. prompt from
Thanks for trying it out @tjohnson31415, I'll take a look at your examples and address these issues.
Hey @tjohnson31415, I think I addressed your issues:
I reproduced your setup and saw the logs; the dump appears before the stack trace. The server was indeed hanging; the issue was the pickling of my custom error, which I fixed.
Fixed that too, thanks for reporting it.
…rash Signed-off-by: Wallas Santos <wallashss@ibm.com>
Hey @comaniac, do you think this feature, implemented this way, makes sense for vLLM? I'd love to hear your feedback. Thanks!
comaniac
left a comment
The idea looks good. Let's polish the code and then I'm OK to give it a green light. The most important point I'd like to highlight again is that I feel we only need to support this feature in v1, for simplicity.
vllm/error_report.py (outdated)

        return False

    def prepare_object_to_dump(obj):
This function is basically the root cause of the previous input dump being removed: it has many cases to handle and is affected whenever any of them changes. Specifically, primitive types and torch.Tensor are fine, but I'm a bit worried about SequenceData and NewRequestData.
Yes, I share the same concern.
The custom handler for these classes is there to obfuscate the prompts. But I cannot guarantee that we will always have the right implementation as these classes change in the future. I guess we could add more hardcoded logs, comments, asserts, and tests to warn other developers about this feature, at the cost of increasing its maintenance burden. I am not sure about this, but I would like to hear more feedback or ideas from your side.
That seems to create a burden for developers, which is the reason the input dump was removed in the first place. Ideally we would have an approach that recursively traverses an input object and serializes it with tensor values ignored. Another direction is providing these methods in the custom data structures themselves (e.g., SequenceData.dump()) so that they are all in the same place, which eases maintenance.
Also cc @youkaichao
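The recursive-traversal idea could be sketched roughly like this — a minimal illustration, not vLLM's actual implementation. Tensors are duck-typed (anything with `shape` and `dtype`) so the sketch runs without torch installed:

```python
import dataclasses

def prepare_object_to_dump(obj):
    """Recursively build a debug-friendly view of `obj`, replacing
    tensor-like values with a shape/dtype summary instead of their data."""
    # Duck-type tensors so this sketch does not require torch.
    if hasattr(obj, "shape") and hasattr(obj, "dtype"):
        return f"Tensor(shape={tuple(obj.shape)}, dtype={obj.dtype})"
    if isinstance(obj, (int, float, str, bool, type(None))):
        return obj
    if isinstance(obj, dict):
        return {k: prepare_object_to_dump(v) for k, v in obj.items()}
    if isinstance(obj, (list, tuple)):
        return type(obj)(prepare_object_to_dump(v) for v in obj)
    if dataclasses.is_dataclass(obj):
        return {f.name: prepare_object_to_dump(getattr(obj, f.name))
                for f in dataclasses.fields(obj)}
    return repr(obj)  # fallback: best-effort string representation
```

The appeal of this direction is that new fields on request objects are handled automatically, at the cost of having no per-class control over what gets obfuscated.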
I added a method called anon_repr to those classes, similar to the __repr__ implementation. The two are kept close together, and I added a comment there to make other contributors aware of that. prepare_object_to_dump has only indirect awareness of this method: during serialization it checks whether the object provides it and, if so, uses it.
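The anon_repr hook could look roughly like this — the class and field names here are placeholders, not vLLM's real request classes:

```python
class AnonRequestData:
    """Illustrative stand-in for a request class; real vLLM classes differ."""

    def __init__(self, req_id, prompt_token_ids):
        self.req_id = req_id
        self.prompt_token_ids = prompt_token_ids

    def anon_repr(self):
        # Mirrors the __repr__ format, but reports only the token count,
        # never the token values, so prompts are not leaked into logs.
        return (f"AnonRequestData(req_id={self.req_id!r}, "
                f"prompt_token_ids=<{len(self.prompt_token_ids)} tokens>)")

def dump_repr(obj):
    """Prefer an anonymized repr when the object provides one."""
    anon = getattr(obj, "anon_repr", None)
    return anon() if callable(anon) else repr(obj)
```

The getattr check keeps the dumper decoupled from the request classes: any object without the hook simply falls back to its ordinary repr.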
BTW: I changed the dump format to be more like how __repr__ outputs the string representation of objects, instead of JSON. I think it is more standardized and consistent with what we have already been using via __repr__.
vllm/error_report.py (outdated)

        execute_model_req: Union[ExecuteModelRequest,
                                 None] = None):

        assert engine_version == 0 or engine_version == 1
I feel we don't need to support v0. Reasons:
- The code could be much cleaner.
- v0 is going to be deprecated soon.
I see, but I guess there are still a lot of deployments running right now that are based on V0 (at least on our side); that's why we are interested in supporting both engines.
Do you think you could reconsider?
Since we're planning to freeze v0, I still don't feel we should support it. However, if you really need it, I'd suggest that we separate the v0/v1 logic completely into different functions (e.g. xxx_v0), so that in the future, when we want to deprecate v0, we can easily locate the logic and remove it.
I split the functions to make the version-specific logic easy to identify.
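The suggested split could be sketched like this — function names follow the proposed xxx_v0 convention but are hypothetical, not the merged code:

```python
def dump_engine_exception_v0(execute_model_req):
    """V0 path, isolated so it can be deleted wholesale once v0 is retired."""
    return f"V0 engine crash; input dump: {execute_model_req!r}"

def dump_engine_exception_v1(scheduler_output):
    """V1 path, mirroring the v0 variant but for v1 data structures."""
    return f"V1 engine crash; input dump: {scheduler_output!r}"
```

Keeping the two paths in separate functions means deprecating v0 later is a deletion, not a refactor.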
This pull request has merge conflicts that must be resolved before it can be merged.
This pull request has merge conflicts that must be resolved before it can be merged.
…rash Signed-off-by: Wallas Santos <wallashss@ibm.com>
added logs for scheduler stats; minor fixes Signed-off-by: Wallas Santos <wallashss@ibm.com>
Finally all green! This PR has been open for a while, and by now I am convinced about removing the V0 support. I also made some minor updates. If any of you could take a second look and merge, I'd appreciate it. Thanks!
njhill
left a comment
Thanks @wallashss, and sorry, I have a couple more comments!
vllm/v1/engine/core.py (outdated)

    class ModelExecutionError(RuntimeError):
        """Custom RuntimeError carrying the input data for model execution.

        In a nutshell, this object is useful for custom handling of exceptions
        when the engine raises an error. For instance, it is used to log the
        input metadata that helps with debugging engine crashes.

        Args:
            scheduler_output: SchedulerOutput object that contains the input
                data for model execution
        """
        scheduler_output: SchedulerOutput
        scheduler_stats: SchedulerStats

        def __init__(self, *args, scheduler_output=None, scheduler_stats=None):
            super().__init__(*args)
            self.scheduler_output = scheduler_output
            self.scheduler_stats = scheduler_stats

        def __reduce__(self):
            # To avoid pickle errors when this object is exchanged between
            # processes: scheduler_output can contain objects that only make
            # sense in their own context/process, so we drop them from the
            # serialization and only send the summary of the error as a
            # regular RuntimeError.
            return (self.__class__, (self.args[0], ))
Move this class to dump_input.py to keep core.py cleaner?
After removing v0 and reviewing it, I think we can remove this class entirely and end up with cleaner code.
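The __reduce__ trick in the class above is a general pattern for making exceptions with process-local, possibly unpicklable payloads safe to send across process boundaries. A minimal standalone sketch, with illustrative names:

```python
class ModelExecutionError(RuntimeError):
    """Carries rich, process-local context but serializes to a plain message."""

    def __init__(self, message, context=None):
        super().__init__(message)
        # `context` may hold objects that only make sense inside the
        # raising process (and may not even be picklable).
        self.context = context

    def __reduce__(self):
        # On pickling, keep only the message: the receiving process
        # reconstructs an ordinary, context-free exception.
        return (self.__class__, (self.args[0],))
```

pickle calls __reduce__ to obtain a (callable, args) pair, so the unpicklable context is never serialized; the receiver sees the error summary only.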
…rash Signed-off-by: Wallas Santos <wallashss@ibm.com>
njhill
left a comment
Thanks @wallashss, sorry for forgetting to come back to this. It looks good to me; I just have one more minor comment.
Thanks @njhill! Could you point me to the comment? Is it a new one, or did I miss it?
njhill
left a comment
@wallashss apologies, I must have failed to press the button to include the comment in the review.
njhill
left a comment
@wallashss sorry for the extra nits; I just want to keep the core engine loop as clean as possible. I will merge as soon as these are in!
Co-authored-by: Nick Hill <nhill@redhat.com> Signed-off-by: Wallas Santos <wallashss@ibm.com>
104a8b8 to 51596e4 (force-push)
Thanks a lot @njhill!
Signed-off-by: Wallas Santos <wallashss@ibm.com> Signed-off-by: 汪志鹏 <wangzhipeng628@gmail.com>
Signed-off-by: Wallas Santos <wallashss@ibm.com> Signed-off-by: Mu Huai <tianbowen.tbw@antgroup.com>
Signed-off-by: Wallas Santos <wallashss@ibm.com>
Signed-off-by: Wallas Santos <wallashss@ibm.com> Signed-off-by: Yuqi Zhang <yuqizhang@google.com>
This PR adds a feature to dump input metadata when the vLLM engine crashes. In essence, this change is the spiritual successor to #8305, which was recently removed in #12582. However, I tried to solve it differently, since this feature can give us more hints to help debug crashes in production environments. So, I would like to propose it again to the community and give it a second chance.
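The overall mechanism — catch the crash, log an obfuscated summary of the in-flight batch, then re-raise — can be sketched like this; the function names and the batch shape are illustrative, not vLLM's actual API:

```python
import logging

logger = logging.getLogger("engine")

def summarize_batch(batch):
    # Obfuscated metadata only: counts and lengths, never prompt contents.
    return {"num_requests": len(batch),
            "prompt_lens": [len(tokens) for tokens in batch]}

def run_model(batch):
    # Placeholder forward pass that simulates an engine crash.
    raise RuntimeError("model execution failed (simulated)")

def execute_model_step(batch):
    try:
        return run_model(batch)
    except Exception:
        # Dump the anonymized input metadata before re-raising, so the
        # crash log tells us what the engine was working on.
        logger.error("Engine crashed; input metadata: %s",
                     summarize_batch(batch))
        raise
```

Because the original exception is re-raised unchanged, existing error handling and the stack trace are unaffected; the dump is purely additive.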
Summary:
V0 dump sample
V1 dump sample