[Loader] Refactor PT model loading #4532
base: develop
Conversation
Thanks for your contribution!
Pull Request Overview
This PR introduces comprehensive support for loading PyTorch model weights in FastDeploy, with a focus on handling weight transposition between PyTorch and Paddle formats and optimizing the model loading pipeline. Key improvements include migrating to safetensors 0.7.0rc0 for direct GPU tensor loading, refactoring weight processing logic, and introducing a new post-loading processing phase.
- Migrated safetensors dependency to version 0.7.0rc0 with direct framework integration
- Implemented process_final_after_loading for post-loading weight transformations
- Refactored weight transpose logic into centralized process_weight_transpose and h2d_copy functions (sketched below)
- Updated quantization methods to handle PyTorch vs Paddle weight format differences
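As a rough illustration only, utilities of this kind might look like the sketch below; the real implementations live in fastdeploy/model_executor/utils.py in this PR, and the bodies here are assumptions rather than the actual FastDeploy code:

```python
import paddle


def h2d_copy(param, loaded_weight):
    """Sketch: copy a checkpoint tensor into a device parameter in one step.
    Assumption: loaded_weight may still live on the host, so .cuda() moves it
    once and copy_ then performs a single contiguous device-side copy."""
    param.copy_(loaded_weight.cuda(), False)


def process_weight_transpose(layer, weight_name):
    """Sketch: after loading, convert a weight stored in PyTorch's
    (out_features, in_features) layout into Paddle's (in_features, out_features)
    layout and replace the parameter on the layer."""
    weight = getattr(layer, weight_name)
    transposed = paddle.transpose(weight, perm=[1, 0])
    new_param = layer.create_parameter(shape=transposed.shape, dtype=transposed.dtype)
    new_param.copy_(transposed, False)
    setattr(layer, weight_name, new_param)
```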
Reviewed Changes
Copilot reviewed 31 out of 32 changed files in this pull request and generated 8 comments.
| File | Description |
|---|---|
| requirements.txt | Added safetensors 0.7.0rc0 for improved tensor loading |
| fastdeploy/model_executor/utils.py | Added weight transpose utilities, h2d_copy, and multi-config context manager |
| fastdeploy/model_executor/load_weight_utils.py | Updated safetensors loader to use Paddle framework, modified cache logic |
| fastdeploy/model_executor/layers/quantization/*.py | Refactored quantization methods to handle format-specific weight shapes and transpose logic |
| fastdeploy/model_executor/layers/moe/*.py | Updated MoE layers with format-aware weight handling and transpose operations |
| fastdeploy/model_executor/layers/linear.py | Added transpose processing to linear layers for PyTorch format compatibility |
| fastdeploy/model_executor/layers/lm_head.py | Implemented weight transpose in lm_head for tied embeddings |
| fastdeploy/model_executor/models/*.py | Updated all model load_weights methods to call process_final_after_loading |
| fastdeploy/engine/*.py | Set OMP_NUM_THREADS environment variable to 3 |
weight_loader(param, loaded_weight)
model_sublayer_name = re.sub(r"\.(weight)$", "", model_param_name)
process_weights_after_loading_fn(model_sublayer_name, param)
process_final_after_loading(self, self.fd_config)
Copilot AI, Nov 5, 2025
The call to process_final_after_loading is placed inside the loop, causing it to be executed for every weight loaded. This should be moved outside the loop (after line 215) to run only once after all weights are loaded, matching the pattern used in other model files like qwen3.py and qwen2.py.
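A minimal sketch of the placement being requested, with the surrounding load_weights structure reduced to pass-through parameters (the real method in this PR has more branching than shown):

```python
import re


def load_weights(model, fd_config, weights_iterator, params_dict,
                 weight_loader, process_weights_after_loading_fn,
                 process_final_after_loading):
    """Sketch only: names mirror the diff above, the signature is simplified."""
    for model_param_name, loaded_weight in weights_iterator:
        param = params_dict[model_param_name]
        weight_loader(param, loaded_weight)
        model_sublayer_name = re.sub(r"\.(weight)$", "", model_param_name)
        process_weights_after_loading_fn(model_sublayer_name, param)

    # Outside the loop: run the final pass exactly once, after all weights are loaded.
    process_final_after_loading(model, fd_config)
```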
process_final_after_loading(self, self.fd_config)
Copilot AI, Nov 5, 2025
The call to process_final_after_loading is placed inside the loop, causing it to be executed for every weight loaded. This should be moved outside the loop (after line 312) to run only once after all weights are loaded, matching the pattern used in other model files like qwen3.py and qwen2.py.
process_final_after_loading(self, self.fd_config)
Copilot AI, Nov 5, 2025
The call to process_final_after_loading is placed inside the loop, causing it to be executed for every weight loaded. This should be moved outside the loop (after line 430) to run only once after all weights are loaded, matching the pattern used in other model files like qwen3.py and qwen2.py.
weight_cache_dir = None
enable_cache = False
- if envs.FD_ENABLE_MODEL_LOAD_CACHE:
+ if envs.FD_ENABLE_MODEL_LOAD_CACHE and fd_config.quant_config is not None:
Copilot AI, Nov 5, 2025
Adding the condition fd_config.quant_config is not None will disable caching for non-quantized models. If this is an intentional behavior change, it should be documented. If caching should work for all models, this condition should be removed or modified.
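To make the intent explicit either way, the guard could be written along one of the following lines (a sketch reusing the names from the diff; which option is correct depends on whether caching non-quantized weights is still wanted):

```python
def should_enable_load_cache(envs, fd_config) -> bool:
    """Sketch of the two readings of the new condition (names follow the diff above)."""
    # Option A (assumed intent of this PR): cache only quantized models, whose
    # post-processing is expensive enough to be worth persisting.
    return bool(envs.FD_ENABLE_MODEL_LOAD_CACHE and fd_config.quant_config is not None)
    # Option B: keep caching for every model; drop the quant_config check.
    # return bool(envs.FD_ENABLE_MODEL_LOAD_CACHE)
```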
self.quant_method: Optional[QuantMethodBase] = UnquantizedLinearMethod()
self.bias = None
if self.with_bias:
Copilot AI, Nov 5, 2025
The bias dtype was changed from self._dtype to self.weight_dtype. Ensure that self.weight_dtype is always defined when with_bias=True, as this could cause AttributeError if weight_dtype is not set during initialization.
Suggested change:
if self.with_bias:
    # Ensure self.weight_dtype is set before using it
    if not hasattr(self, "weight_dtype") or self.weight_dtype is None:
        self.weight_dtype = self._dtype
fastdeploy/engine/engine.py (outdated)
| "PROTOCOL_BUFFERS_PYTHON_IMPLEMENTATION": "python", | ||
| "NCCL_ALGO": "Ring", | ||
| "FLAGS_max_partition_size": int(os.getenv("FLAGS_max_partition_size", 1024)), | ||
| "OMP_NUM_THREADS": 3, |
Copilot AI, Nov 5, 2025
[nitpick] The hardcoded value of 3 for OMP_NUM_THREADS may not be optimal for all deployment scenarios. Consider making this configurable or documenting why this specific value was chosen.
| "OMP_NUM_THREADS": 3, | |
| "OMP_NUM_THREADS": int(os.getenv("OMP_NUM_THREADS", 3)), |
fastdeploy/engine/async_llm.py (outdated)
| "FLAGS_use_append_attn": 1, | ||
| "NCCL_ALGO": "Ring", | ||
| "FLAGS_max_partition_size": int(os.getenv("FLAGS_max_partition_size", 1024)), | ||
| "OMP_NUM_THREADS": 3, |
Copilot AI, Nov 5, 2025
[nitpick] The hardcoded value of 3 for OMP_NUM_THREADS may not be optimal for all deployment scenarios. Consider making this configurable or documenting why this specific value was chosen.
| "OMP_NUM_THREADS": 3, | |
| "OMP_NUM_THREADS": int(os.getenv("OMP_NUM_THREADS", 3)), |
else:
    # v0 loader or torch model format
    weight_shape = layer.weight_shape
    weight_scale_inv_shape = weight_scale_inv_shape
Copilot AI, Nov 5, 2025
This assignment assigns a variable to itself.
Suggested change: delete this line.
if self.nranks > 0:
    if self.with_bias:
        # col parallel
        _set_var_distributed(self.bias, split_axis=0)
Why was this removed here?
This bias is row-split, so it does not need to be partitioned.
    _process_quantize()
else:
    _process_quantize()
It looks like these three lines can be simplified: drop the else and just call _process_quantize() once outside the branch.
done
What is the reason for adjusting the PT weight loading logic here, and which step helps the performance improvement?
Motivation
The original PT weight loading logic caused a several-fold degradation in H2D copy performance, so the PT loading logic is changed to improve model loading performance.
Modifications
Overview of changes
For all models except ViT / Resampler, the PT weight loading logic is adjusted as follows:
Old logic: load weight -> transpose -> param.copy_(weight)
New logic: create parameters aligned with the checkpoint layout -> param.copy_(weight) -> after_loading_fn handles the transpose (see the sketch below)
Depends on a Paddle framework PR.
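Written out as a rough sketch, the ordering change looks like this; helper names other than param.copy_ are placeholders for illustration, not the actual FastDeploy APIs:

```python
import paddle


def load_pt_weight_old(param, loaded_weight):
    """Old flow: transpose each PT weight first, then copy it into the parameter.
    The per-weight transpose before the copy is what degraded H2D performance."""
    param.copy_(paddle.transpose(loaded_weight, perm=[1, 0]), False)


def load_pt_weight_new(param, loaded_weight):
    """New flow, step 1: the parameter was created with the checkpoint-aligned
    (PyTorch) shape, so the raw tensor is copied straight to the device."""
    param.copy_(loaded_weight, False)


def after_loading_fn(layer, weight_name="weight"):
    """New flow, step 2: after all weights are loaded, transpose each weight once
    back into the (in_features, out_features) layout Paddle kernels expect."""
    weight = getattr(layer, weight_name)
    transposed = paddle.transpose(weight, perm=[1, 0])
    new_param = layer.create_parameter(shape=transposed.shape, dtype=transposed.dtype)
    new_param.copy_(transposed, False)
    setattr(layer, weight_name, new_param)
```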
Change details
Changed how PT models from Hugging Face are loaded.
Refactored so far:
1. bf16
2. weight only
3. DeepGEMM FP8 online quantization
4. Triton backend: fp8 / Wfp8Afp8MoEMethod / triton weight only
Usage or Command
Accuracy Tests
ci/ce
Checklist
- Use at least one of the PR title tags: [FDConfig], [APIServer], [Engine], [Scheduler], [PD Disaggregation], [Executor], [Graph Optimization], [Speculative Decoding], [RL], [Models], [Quantization], [Loader], [OP], [KVCache], [DataProcessor], [BugFix], [Docs], [CI], [Optimization], [Feature], [Benchmark], [Others], [XPU], [HPU], [GCU], [DCU], [Iluvatar], [Metax]
- Run pre-commit before commit.
- If the PR targets the release branch, make sure it has been submitted to the develop branch first, then cherry-pick it to the release branch with the [Cherry-Pick] PR tag.