Merge branch 'main' into refactor_mllm_data_collator

modelscope · Dec 4, 2024 · ff6ee4a
2 parents a8940b1 + d8caab1
commit ff6ee4a
Showing 22 changed files with 1,694 additions and 1,167 deletions.
diff --git a/README.md b/README.md
@@ -55,6 +55,7 @@ You can contact us and communicate with us by adding our group:
 <img src="asset/discord_qr.jpg" width="200" height="200">  |  <img src="asset/wechat.png" width="200" height="200">
 
 ## 🎉 News
+- 🎁2024.12.04: We bump the version to SWIFT3.0. Please check [ReleaseNote and BreakChange](./docs/source/Instruction/ReleaseNote3.0.md) for details.
 - 2024.11.29: Support for glm-edge and glm-edge-v series models. Use `swift infer --model_type glm-edge-v-2b` for the experience.
 - 2024.11.28: Supports the model qwq-32b-preview, marco-o1, and the dataset open-o1. Use `swift infer --model_type qwq-32b-preview` for the experience.
 - 2024.11.12: Supports training and deployment of the qwen2.5-coder series models: 0.5b, 3b, 14b, and 32b. Use `swift infer --model_type qwen2_5-coder-3b-instruct` to experience it.

diff --git a/README_CN.md b/README_CN.md
@@ -56,6 +56,7 @@ SWIFT has rich and comprehensive documentation; please see our documentation website:
 
 
 ## 🎉 News
+- 🎁2024.12.04: Major version update to SWIFT3.0. Please see the [ReleaseNote and BreakChange](./docs/source/Instruction/ReleaseNote3.0.md).
 - 2024.11.29: Support for the glm-edge and glm-edge-v series models. Use `swift infer --model_type glm-edge-v-2b` to try them.
 - 2024.11.28: Support for the models qwq-32b-preview and marco-o1, and the dataset open-o1. Use `swift infer --model_type qwq-32b-preview` to try it.
 - 2024.11.12: Support for training through deployment of the qwen2.5-coder series models: 0.5b, 3b, 14b, and 32b. Use `swift infer --model_type qwen2_5-coder-3b-instruct` to try it.

diff --git a/docs/Makefile b/docs/Makefile
@@ -5,7 +5,7 @@
 # from the environment for the first two.
 SPHINXOPTS    ?=
 SPHINXBUILD   ?= sphinx-build
-SOURCEDIR     = source_en
+SOURCEDIR     = source
 BUILDDIR      = build
 
 # Put it first so that "make" without argument is like "make help".

diff --git a/docs/source/Customization/新增数据集.md b/docs/source/Customization/新增数据集.md
@@ -53,9 +53,9 @@
 
 ```jsonl
 # Object detection
-{"messages": [{"role": "system", "content": "你是个有用无害的助手"}, {"role": "user", "content": "<image>识别<bbox>"}, {"role": "assistant", "content": "<ref-object>"}], "images": ["/coco2014/train2014/COCO_train2014_000000001507.jpg"], "objects": [{"caption": "guy in red", "bbox": [138, 136, 235, 359], "bbox_type": "real", "image": 0}] }
+{"messages": [{"role": "system", "content": "你是个有用无害的助手"}, {"role": "user", "content": "<image>识别<bbox>"}, {"role": "assistant", "content": "<ref-object>"}], "images": ["/coco2014/train2014/COCO_train2014_000000001507.jpg"], "objects": "[{\"caption\": \"guy in red\", \"bbox\": [138, 136, 235, 359], \"bbox_type\": \"real\", \"image\": 0}]" }
 # Grounding to multiple bboxes
-{"messages": [{"role": "system", "content": "你是个有用无害的助手"}, {"role": "user", "content": "<image>找到<ref-object>"}, {"role": "assistant", "content": "<bbox>"}], "images": ["/coco2014/train2014/COCO_train2014_000000001507.jpg"], "objects": [{"caption": "guy in red", "bbox": [[138, 136, 235, 359], [1,2,3,4]], "bbox_type": "real", "image": 0}] }
+{"messages": [{"role": "system", "content": "你是个有用无害的助手"}, {"role": "user", "content": "<image>找到<ref-object>"}, {"role": "assistant", "content": "<bbox>"}], "images": ["/coco2014/train2014/COCO_train2014_000000001507.jpg"], "objects": "[{\"caption\": \"guy in red\", \"bbox\": [[138, 136, 235, 359], [1,2,3,4]], \"bbox_type\": \"real\", \"image\": 0}]" }
 ```
 
 Compared with the general format, this format adds an objects field, which contains the following fields:
@@ -72,7 +72,7 @@
 
 ### Agent format
 
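-The Agent format is more complex; please refer to the [Agent documentation](../Instruction/智能体.md).
+The Agent format is more complex; please refer to the [Agent documentation](../Instruction/智能体的支持.md).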
 
 ## Registering a hub dataset
 

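The hunk above changes the `objects` field from a nested JSON array to a JSON-encoded string. A minimal sketch of converting an existing row to the new shape, assuming the string form shown in the `+` lines is what the loader expects; `encode_objects` is a hypothetical helper written for illustration, not part of SWIFT:

```python
import json


def encode_objects(record: dict) -> dict:
    """Return a copy of `record` whose "objects" field is a JSON string.

    Hypothetical helper: rows that already store a string are returned unchanged.
    """
    out = dict(record)
    objects = out.get("objects")
    if objects is not None and not isinstance(objects, str):
        # ensure_ascii=False keeps non-ASCII captions readable in the file
        out["objects"] = json.dumps(objects, ensure_ascii=False)
    return out


# Old-style row, abbreviated from the "-" line in the hunk
old_row = {
    "messages": [
        {"role": "user", "content": "<image>识别<bbox>"},
        {"role": "assistant", "content": "<ref-object>"},
    ],
    "images": ["/coco2014/train2014/COCO_train2014_000000001507.jpg"],
    "objects": [
        {"caption": "guy in red", "bbox": [138, 136, 235, 359],
         "bbox_type": "real", "image": 0}
    ],
}

new_row = encode_objects(old_row)
# One JSONL line in the new shape
jsonl_line = json.dumps(new_row, ensure_ascii=False)
```

Decoding `new_row["objects"]` with `json.loads` recovers the original list, so the conversion loses no information.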
diff --git a/docs/source/Instruction/ReleaseNote3.0.md b/docs/source/Instruction/ReleaseNote3.0.md
@@ -0,0 +1,86 @@
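+# ReleaseNote
+
+> If you run into any problems using version 3.x, please submit an issue to us. If something works in 2.x but not in 3.x, please use 2.x for now while we complete the fix.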
+
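+## New features
+
+1. Dataset module refactored. Dataset loading is 2-20x faster, encoding is 2-4x faster, and streaming mode is supported
+    - Removed the dataset_name mechanism; datasets are now specified via dataset_id, dataset_dir, or dataset_path
+    - Use `--dataset_num_proc` for multi-process accelerated processing, and `--load_from_cache_file true` to use the data preprocessing cache
+    - Use `--streaming` to stream hub and local datasets
+    - Support the `--packing` option for more stable training efficiency
+    - Specify `--dataset <dataset_dir>` to load open-source datasets locally
+2. Model module refactored:
+    - Removed the model_type mechanism; use `--model <model_id>/<model_path>` for training and inference
+    - For a new model, use `--model <model_id>/<model_path> --template xxx --model_type xxx` directly, with no need to write a Python script to register the model
+3. Template module refactored:
+    - Use `--template_backend jinja` to run inference in jinja mode
+    - Adopted the messages format as the input interface
+4. Added a plugin mechanism for customizing the training process. Currently supported plugins:
+    - callback: customize training callback methods
+    - custom_trainer: customize the trainer
+    - loss: customize the loss method
+    - loss_scale: customize the weight of each token
+    - metric: customize cross-validation metrics
+    - optimizer: customize the optimizer and lr_scheduler used in training
+    - tools: customize the system format for agent training
+    - tuner: customize new tuners
+5. Training module refactored:
+    - Support launching multi-node training with a single command; see [here](https://github.com/modelscope/ms-swift/tree/main/examples/train/multi-node/deepspeed/README.md) for details
+    - Support PreTrain for all multimodal LLMs
+    - predict_with_generate during training now uses the infer module, supporting multimodal LLMs and multiple GPUs
+    - The KTO human alignment algorithm supports multimodal LLMs
+6. Inference and deployment modules refactored:
+    - Support batch inference under the pt backend, and multi-GPU inference
+    - Inference and deployment modules uniformly use the OpenAI-format interface
+    - Added an asynchronous inference interface
+7. app-ui merged into web-ui; app-ui supports multimodal inference
+8. Support All-to-All models, i.e. training and deployment of text-to-image or omni-modal models such as Emu3-Gen and Janus
+9. Improved the examples; they now comprehensively reflect SWIFT's capabilities and are easier to use
+10. Use `--use_hf true/false` to switch between the HuggingFace and ModelScope communities for downloading and uploading datasets and models
+11. Better support for training and inference via code, with a clearer code structure and many added code comments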
+
+
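+## BreakChange
+
+This document lists the BreakChanges between versions 3.x and 2.x. Developers should be aware of these differences.
+
+### Parameter differences
+
+- The meaning of model_type has changed. Version 3.0 requires --model or --ckpt_dir; model_type needs to be specified additionally only when the model is not supported by SWIFT
+- sft_type renamed to train_type
+- model_id_or_path renamed to model
+- template_type renamed to template
+- quantization_bit renamed to quant_bits
+- check_model_is_latest renamed to check_model
+- batch_size renamed to per_device_train_batch_size, following the transformers naming convention
+- eval_batch_size renamed to per_device_eval_batch_size, following the transformers naming convention
+- The swift option was removed from tuner_backend
+- use_flash_attn renamed to attn_impl
+- bnb_4bit_comp_dtype renamed to bnb_4bit_compute_dtype
+- Removed train_dataset_sample and val_dataset_sample
+- dtype renamed to torch_dtype; the option values also changed from bf16 to the standard bfloat16, from fp16 to float16, and from fp32 to float32
+- Removed the eval_human option
+- The dataset option no longer supports the HF:: usage; use the new --use_hf to control downloads and uploads
+- Removed the do_sample option; use temperature to control sampling
+- add_output_dir_suffix renamed to add_version
+- Removed eval_token; api_key is supported instead
+- ALL in target_modules (lora_target_modules) changed to all-linear, with the same meaning
+
+The parameters marked as compatible in 2.0 have been removed entirely.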
+
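+### Features
+
+1. For pre-training, use the swift pt command. This command uses the generation template by default, whereas the swift sft command uses the template preset by model_type by default
+2. The 2.x examples directory was removed entirely, and new examples organized by feature type were added
+3. Dataset formats are now fully compatible with the messages format; the query/response/history format is no longer supported
+4. The merge_lora output directory can now be specified via `--output_dir`; merge_lora and quantization cannot be run in a single command and require at least two commands
+5. Removed the app-ui interface, replaced by `swift web-ui --model xxx`, which also supports multimodal interface deployment
+6. Removed the AIGC dependencies and the corresponding examples and training code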
+
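+## To do
+
+1. RM/PPO is not yet supported in version 3.0; please use version 2.6.1
+2. Custom dataset evaluation is not yet supported in version 3.0; please use version 2.6.1
+3. Megatron pre-training is not yet supported in version 3.0; please use version 2.6.1
+4. The documentation and README, especially the English parts, are not yet fully updated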
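The parameter renames and removals listed in the BreakChange section of the release note above can be captured in a small lookup table. The sketch below is only an illustration built from that list: `migrate_args` and the table names are hypothetical and are not part of SWIFT, and parameters whose values or semantics changed without a stated one-to-one mapping (`use_flash_attn`, `model_type`, `tuner_backend`) are reported for manual review rather than converted.

```python
# 2.x name -> 3.x name, taken from the rename entries in the release note
RENAMED = {
    "sft_type": "train_type",
    "model_id_or_path": "model",
    "template_type": "template",
    "quantization_bit": "quant_bits",
    "check_model_is_latest": "check_model",
    "batch_size": "per_device_train_batch_size",
    "eval_batch_size": "per_device_eval_batch_size",
    "bnb_4bit_comp_dtype": "bnb_4bit_compute_dtype",
    "dtype": "torch_dtype",
    "add_output_dir_suffix": "add_version",
}
# Options the release note lists as removed
REMOVED = {"train_dataset_sample", "val_dataset_sample", "eval_human",
           "do_sample", "eval_token"}
# Options whose meaning changed; the note gives no value mapping for these
MANUAL = {"use_flash_attn", "model_type", "tuner_backend"}
DTYPE_VALUES = {"bf16": "bfloat16", "fp16": "float16", "fp32": "float32"}


def migrate_args(old: dict) -> tuple[dict, list[str]]:
    """Translate a dict of 2.x arguments to 3.x names (hypothetical helper)."""
    new, notes = {}, []
    for key, value in old.items():
        if key in REMOVED:
            notes.append(f"{key}: removed in 3.x, dropped")
            continue
        if key in MANUAL:
            notes.append(f"{key}: changed in 3.x, review manually")
            continue
        new_key = RENAMED.get(key, key)
        if new_key == "torch_dtype":
            value = DTYPE_VALUES.get(value, value)
        if new_key in ("target_modules", "lora_target_modules") and value == "ALL":
            value = "all-linear"
        new[new_key] = value
    return new, notes


old_args = {"sft_type": "lora", "dtype": "bf16", "batch_size": 4,
            "do_sample": True, "use_flash_attn": True,
            "target_modules": "ALL", "max_length": 2048}
new_args, notes = migrate_args(old_args)
```

Unknown keys such as `max_length` pass through unchanged, so the helper can be run over a full argument set without listing every option.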
diff --git a/docs/source/Instruction/人类对齐.md b/docs/source/Instruction/人类对齐.md
@@ -2,16 +2,6 @@
 
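 This document provides training scripts for various human preference alignment algorithms. For more detailed information on the algorithms and how to choose among them, please refer to the [documentation](https://github.com/modelscope/modelscope-classroom/blob/main/LLM-tutorial/M.%E4%BA%BA%E7%B1%BB%E5%81%8F%E5%A5%BD%E5%AF%B9%E9%BD%90%E8%AE%AD%E7%BB%83.md)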
 
-## Table of contents
-- [Dataset](#数据集)
-- [DPO](#dpo)
-- [RM](#rm)
-- [PPO](#ppo)
-- [KTO](#kto)
-- [CPO](#cpo)
-- [ORPO](#orpo)
-- [SimPO](#simpo)
-
 
 ## Dataset
 

diff --git a/docs/source/Instruction/使用tuners.md b/docs/source/Instruction/使用tuners.md
@@ -1,4 +1,4 @@
-# Basic usage
+# Using tuners
 
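 A tuner is an additional structure attached to a model, used to reduce the number of trainable parameters or to improve training accuracy. SWIFT currently supports the following tuners: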
 

diff --git a/docs/source/Instruction/命令行参数.md b/docs/source/Instruction/命令行参数.md
@@ -39,8 +39,8 @@
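 - max_length: Maximum number of tokens for a single sample
 - truncation_strategy: How to handle over-length samples. Supports `delete` and `left`, meaning delete the sample or truncate from the left; defaults to left
 - max_pixels: Maximum number of pixels in image preprocessing for multimodal models; no scaling by default
-- tools_prompt: Format used to convert the tool list into the system prompt during agent training; see [Agent training](./智能体.md)
-- loss_scale: How to weight the token losses during training. Defaults to `default`, meaning all responses are counted with weight 1 in the cross-entropy loss. For details, see [Plugins](../Customization/插件.md) and [Agent training](./智能体.md)
+- tools_prompt: Format used to convert the tool list into the system prompt during agent training; see [Agent training](./智能体的支持.md)
+- loss_scale: How to weight the token losses during training. Defaults to `default`, meaning all responses are counted with weight 1 in the cross-entropy loss. For details, see [Plugins](../Customization/插件.md) and [Agent training](./智能体的支持.md)
 - sequence_parallel_size: Sequence parallelism size. See the [example](https://github.com/modelscope/ms-swift/tree/main/examples/train/sequence_parallel/train.sh)
 - use_chat_template: Whether to use the chat template or the generation template; defaults to `True`
 - template_backend: Use swift or jinja for inference. With jinja, transformers' `apply_chat_template` is used. Defaults to swift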

diff --git a/docs/source/Instruction/推送模型.md b/docs/source/Instruction/推送模型.md
@@ -1,5 +1,5 @@
 
-# Parameters for pushing models
+# Pushing models to the community
 
 When using SWIFT, users can choose to push trained models to the community.