[Feature] Add InstantTensor weight loader#36139
Documentation preview: https://vllm--36139.org.readthedocs.build/en/36139/
Code Review
This pull request introduces support for InstantTensor to accelerate model weight loading. The changes span dependency management, configuration, model loading logic, and documentation. My review focuses on ensuring the CUDA-specific nature of this new feature is made clear to users. I've provided suggestions to add an explicit check for CUDA availability in the code and to update the associated docstrings and documentation to mention this requirement. These changes will help prevent runtime errors and improve the user experience for those on non-CUDA platforms.
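The explicit CUDA check suggested in the review could look roughly like the sketch below. This is not the code from the PR: the function name `check_instanttensor_supported` and the error message are hypothetical, and the only real library call is `torch.cuda.is_available()`. The `cuda_available` parameter is an illustrative hook so the check can be exercised without a GPU.

```python
from typing import Optional


def check_instanttensor_supported(cuda_available: Optional[bool] = None) -> None:
    """Fail early, with a clear message, when CUDA is not available.

    Hypothetical helper for illustration; not part of vLLM or InstantTensor.
    """
    if cuda_available is None:
        try:
            import torch  # imported lazily so the helper works without torch

            cuda_available = torch.cuda.is_available()
        except ImportError:
            # Without torch there is no CUDA runtime to use.
            cuda_available = False

    if not cuda_available:
        raise RuntimeError(
            "The InstantTensor weight loader requires CUDA, "
            "but no CUDA device is available on this platform."
        )
```

Calling a guard like this before the loader is constructed turns a late runtime failure on non-CUDA platforms into an immediate, readable error.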
Is it possible to make this the default? What considerations should we have for its usage?
@robertgshaw2-redhat, thanks. We’ve responded in the RFC.
mgoin left a comment
LGTM for the initial integration, thank you!
Hi @arlo-aisys, the pre-commit checks have failed. Please run:
uv pip install pre-commit
pre-commit install
pre-commit run --all-files
Then commit the changes and push to your branch.
@mgoin
You can post in the PR reviews channel on the vLLM Slack.
Signed-off-by: arlo <264998716+arlo-aisys@users.noreply.github.com>
This pull request has merge conflicts that must be resolved before it can be |
Signed-off-by: arlo <arlo@scitix.ai>
Signed-off-by: Athrael Soju <athrael.soju@gmail.com>
Signed-off-by: wendyliu235 <wenjun.liu@intel.com>
PR for this RFC.
Purpose
Speed up model loading and fully utilize the bandwidth of high-speed storage (e.g., 400 Gbps networked storage).
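To put the 400 Gbps figure in perspective, the sketch below computes a lower bound on load time when the storage link is fully saturated. The helper name `min_load_seconds` and the 140 GB checkpoint size are illustrative assumptions, not measurements from this PR; real load times also depend on the file system, host memory, and PCIe bandwidth.

```python
def min_load_seconds(model_bytes: float, link_gbps: float) -> float:
    """Lower bound on load time if the link is fully saturated.

    Illustrative helper; uses decimal units (1 Gbps = 1e9 bits per second).
    """
    bytes_per_second = link_gbps * 1e9 / 8  # gigabits/s -> bytes/s
    return model_bytes / bytes_per_second


# 400 Gbps corresponds to 50 GB/s, so a hypothetical 140 GB checkpoint
# cannot load in less than about 2.8 seconds over such a link.
estimate = min_load_seconds(140e9, 400)
print(f"{estimate:.1f} s")
```

A loader that reaches only a fraction of the link rate scales this bound up proportionally, which is the gap the PR aims to close.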
Test Plan
Load any model with any parallelism setting, on H20 141G for example:
Test Result
See README of our repo.