Commit 94349d3

Fix custom dataset (modelscope#736)
1 parent 5b02afe commit 94349d3

4 files changed, +11 -11 lines changed

.github/PULL_REQUEST_TEMPLATE.md

+1 -1
@@ -2,7 +2,7 @@
 - [ ] Bug Fix
 - [ ] New Feature
 - [ ] Document Updates
-- [ ] More Model or Dataset Support
+- [ ] More Models or Datasets Support
 
 # PR information
 
docs/source/LLM/命令行参数.md

+2 -2
@@ -186,8 +186,8 @@ dpo parameters inherit from sft parameters, with the following added parameters:
 - `--max_length`: Default is `-1`. See `sft.sh command line arguments` for parameter details.
 - `--truncation_strategy`: Default is `'delete'`. See `sft.sh command line arguments` for parameter details.
 - `--check_dataset_strategy`: Default is `'none'`, see `sft.sh command line arguments` for parameter details.
-- `--custom_train_dataset_path`: Default is `[]`. See the `Custom Datasets` module in README.md for details.
-- `--custom_val_dataset_path`: Default is `[]`. See the `Custom Datasets` module in README.md for details.
+- `--custom_train_dataset_path`: Default is `[]`. See [Customization and Extension](自定义与拓展.md) for details.
+- `--custom_val_dataset_path`: Default is `[]`. See [Customization and Extension](自定义与拓展.md) for details.
 - `--quantization_bit`: Default is 0. See `sft.sh command line arguments` for parameter details.
 - `--bnb_4bit_comp_dtype`: Default is `'AUTO'`. See `sft.sh command line arguments` for parameter details. If `quantization_bit` is set to 0, this parameter has no effect.
 - `--bnb_4bit_quant_type`: Default is `'nf4'`. See `sft.sh command line arguments` for parameter details. If `quantization_bit` is set to 0, this parameter has no effect.

docs/source_en/LLM/Command-line-parameters.md

+2 -2
@@ -186,8 +186,8 @@ dpo parameters inherit from sft parameters, with the following added parameters:
 - `--max_length`: Default is `-1`. See `sft.sh command line arguments` for parameter details.
 - `--truncation_strategy`: Default is `'delete'`. See `sft.sh command line arguments` for parameter details.
 - `--check_dataset_strategy`: Default is `'none'`, see `sft.sh command line arguments` for parameter details.
-- `--custom_train_dataset_path`: Default is `[]`. See README.md `Custom Datasets` module for details.
-- `--custom_val_dataset_path`: Default is `[]`. See README.md `Custom Datasets` module for details.
+- `--custom_train_dataset_path`: Default is `[]`. See [Customization](Customization.md) for details.
+- `--custom_val_dataset_path`: Default is `[]`. See [Customization](Customization.md) for details.
 - `--quantization_bit`: Default is 0. See `sft.sh command line arguments` for parameter details.
 - `--bnb_4bit_comp_dtype`: Default is `'AUTO'`. See `sft.sh command line arguments` for parameter details. If `quantization_bit` is set to 0, this parameter has no effect.
 - `--bnb_4bit_quant_type`: Default is `'nf4'`. See `sft.sh command line arguments` for parameter details. If `quantization_bit` is set to 0, this parameter has no effect.
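
Reviewer note: both custom dataset flags default to an empty list, so they accept zero or more paths. The sketch below shows how such a flag could be parsed. It uses the standard-library `argparse`, not swift's own argument handling (which this diff does not show); the parser layout and the file names `a.jsonl` and `b.jsonl` are assumptions for illustration only.

```python
import argparse
from typing import List, Optional


def build_parser() -> argparse.ArgumentParser:
    # Hypothetical parser: it reuses the documented flag names,
    # but it is NOT swift's real argument parser.
    parser = argparse.ArgumentParser()
    parser.add_argument('--custom_train_dataset_path', nargs='*', default=[])
    parser.add_argument('--custom_val_dataset_path', nargs='*', default=[])
    return parser


def parse(argv: Optional[List[str]] = None) -> argparse.Namespace:
    # argv=None makes argparse fall back to sys.argv[1:].
    return build_parser().parse_args(argv)


if __name__ == '__main__':
    args = parse(['--custom_train_dataset_path', 'a.jsonl', 'b.jsonl'])
    print(args.custom_train_dataset_path)  # ['a.jsonl', 'b.jsonl']
    print(args.custom_val_dataset_path)    # [] (flag omitted, default used)
```

With `nargs='*'`, omitting a flag leaves the documented default `[]` in place, while passing several paths collects them into one list.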

swift/llm/utils/dataset.py

+6 -6
@@ -1445,7 +1445,6 @@ def _preprocess_hc3(dataset: HfDataset) -> HfDataset:
     tags=['chat', 'medical', '🔥'],
     hf_dataset_id='Flmc/DISC-Med-SFT')
 
-# hf_dataset_id='ShengbinYue/DISC-Law-SFT'
 register_dataset(
     DatasetName.disc_law_sft_zh,
     'AI-ModelScope/DISC-Law-SFT', ['train'],
@@ -1455,7 +1454,8 @@ def _preprocess_hc3(dataset: HfDataset) -> HfDataset:
         'output': 'response'
     }),
     get_dataset_from_repo,
-    tags=['chat', 'law', '🔥'])
+    tags=['chat', 'law', '🔥'],
+    hf_dataset_id='ShengbinYue/DISC-Law-SFT')
 
 register_dataset(
     DatasetName.pileval,
@@ -1666,12 +1666,12 @@ def load_dataset_from_local(
     return concatenate_datasets(dataset_list)
 
 
-def get_custom_dataset(_: str, train_dataset_path_list: Union[str, List[str]],
-                       val_dataset_path_list: Optional[Union[str, List[str]]],
+def get_custom_dataset(_: str, train_subset_split_list: Union[str, List[str]],
+                       val_subset_split_list: Optional[Union[str, List[str]]],
                        preprocess_func: PreprocessFunc,
                        **kwargs) -> Tuple[HfDataset, Optional[HfDataset]]:
-    train_dataset = load_dataset_from_local(train_dataset_path_list,
+    train_dataset = load_dataset_from_local(train_subset_split_list,
                                             preprocess_func)
-    val_dataset = load_dataset_from_local(val_dataset_path_list,
+    val_dataset = load_dataset_from_local(val_subset_split_list,
                                           preprocess_func)
     return train_dataset, val_dataset
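
Reviewer note: the diff does not say why the `get_custom_dataset` parameters were renamed. One plausible reading, assumed here and not confirmed by the commit, is that the caller passes the split lists by keyword, so the parameter names must match the names used for the other dataset getters. The self-contained sketch below illustrates that failure mode. `dispatch`, `load_local`, `get_custom_old`, and `get_custom_new` are hypothetical stand-ins, not swift functions.

```python
from typing import Callable, List, Optional, Tuple

Loaded = Optional[List[str]]


def load_local(paths: Optional[List[str]]) -> Loaded:
    # Stand-in for load_dataset_from_local: echoes the paths it was given.
    return list(paths) if paths else None


def get_custom_old(_: str, train_dataset_path_list, val_dataset_path_list,
                   **kwargs) -> Tuple[Loaded, Loaded]:
    # Parameter names as they were before the commit.
    return load_local(train_dataset_path_list), load_local(val_dataset_path_list)


def get_custom_new(_: str, train_subset_split_list, val_subset_split_list,
                   **kwargs) -> Tuple[Loaded, Loaded]:
    # Parameter names as they are after the commit.
    return load_local(train_subset_split_list), load_local(val_subset_split_list)


def dispatch(get_function: Callable, dataset_id: str, train: List[str],
             val: Optional[List[str]]) -> Tuple[Loaded, Loaded]:
    # Hypothetical caller that passes the split lists by keyword.
    return get_function(dataset_id,
                        train_subset_split_list=train,
                        val_subset_split_list=val)


if __name__ == '__main__':
    print(dispatch(get_custom_new, '_custom', ['train.jsonl'], None))
    try:
        dispatch(get_custom_old, '_custom', ['train.jsonl'], None)
    except TypeError as exc:
        # The keywords are swallowed by **kwargs, so the two required
        # positional parameters are reported as missing.
        print('old signature fails:', exc)
```

Under this assumption, the old signature silently absorbs the keywords into `**kwargs` and then fails on the missing positional arguments, which would explain a rename with no other behavioral change.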
