Commit a90a7af

Delete useless codes and refactor process_untokenized_datasets (InternLM#379)
* delete useless codes
* refactor process_untokenized_datasets: add ftdp to dataset-format
* fix lint
1 parent 111cdfe commit a90a7af

3 files changed, +17 -18 lines changed


docs/zh_cn/user_guides/intern_repo_dataset.md

+3-4
@@ -398,11 +398,10 @@ python xtuner/tools/process_untokenized_datasets.py \
     --save-folder ./processed \
     --tokenizer-path pretrained_model_name_or_path \
     --prompt-template internlm2_chat \
-    --dataset-format openai \
-    --is-ftdp
+    --dataset-format ftdp
 ```
 
-Here `pretrained_model_name_or_path` is the same as the `pretrained_model_name_or_path` argument of the `from_pretrained` interface, and `--prompt-template` specifies the type of chat template; other available chat templates are listed in [templates](https://github.com/InternLM/xtuner/blob/main/docs/zh_cn/user_guides/prompt_template.md). Because datasets in the untokenized internlm repo format (also known as the ftdp format) conform to the `openai` data format, namely
+Here `pretrained_model_name_or_path` is the same as the `pretrained_model_name_or_path` argument of the `from_pretrained` interface, and `--prompt-template` specifies the type of chat template; other available chat templates are listed in [templates](https://github.com/InternLM/xtuner/blob/main/docs/zh_cn/user_guides/prompt_template.md). Datasets in the untokenized internlm repo format (also known as the ftdp format) follow the format below
 
 ```
 [
@@ -418,7 +417,7 @@ python xtuner/tools/process_untokenized_datasets.py \
 ]
 ```
 
-Therefore, the `--dataset-format` option in the command above is set to `openai`
+The `--dataset-format` option must be set to `ftdp`
 
 To train with the offline-processed dataset, you additionally need to modify the Config file from Step 2 and set the path where the offline-processed dataset is stored:
 
xtuner/dataset/utils.py

-2
@@ -121,8 +121,6 @@ def __init__(self,
                  chunk_size=2048,
                  use_varlen_attn=False,
                  drop_last=False):
-        use_varlen_attn = True
-        drop_last = True
         self.chunk_size = chunk_size
         self.residual = {'input_ids': [], 'labels': []}
         self.use_varlen_attn = use_varlen_attn
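
The two removed lines overwrote the constructor arguments unconditionally, so callers could not turn either option off. The sketch below illustrates the behavior after the change. `PackerSketch` is a hypothetical stand-in, not the actual class in `xtuner/dataset/utils.py`, and storing `drop_last` as an attribute is an assumption, since the diff does not show that line.

```python
class PackerSketch:
    """Hypothetical stand-in for the class patched in xtuner/dataset/utils.py."""

    def __init__(self, chunk_size=2048, use_varlen_attn=False, drop_last=False):
        # No hard-coded overrides: the stored values are whatever the caller passed.
        self.chunk_size = chunk_size
        self.residual = {'input_ids': [], 'labels': []}
        self.use_varlen_attn = use_varlen_attn
        # Assumption: the real class keeps drop_last in a similar way.
        self.drop_last = drop_last


default_packer = PackerSketch()
custom_packer = PackerSketch(use_varlen_attn=True, drop_last=True)
print(default_packer.use_varlen_attn, default_packer.drop_last)
print(custom_packer.use_varlen_attn, custom_packer.drop_last)
```

With the overrides gone, the defaults in the signature apply again and both options are opt-in.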

xtuner/tools/process_untokenized_datasets.py

+14-12
@@ -26,8 +26,7 @@
     --save-folder ./processed \
     --tokenizer-path pretrained_model_name_or_path \
     --prompt-template internlm2_chat \
-    --dataset-format openai \
-    --is-ftdp
+    --dataset-format ftdp
 
 normal json dataset:
 srun -p llm_razor --quotatype=auto --gres=gpu:1 --ntasks=1 \
@@ -48,10 +47,10 @@ def parse_args():
         '--tokenizer-path', help='The path to the hf tokenizer.')
     parser.add_argument(
         '--dataset-format',
-        choices=DATASET_FORMAT_MAPPING.keys(),
+        choices=list(DATASET_FORMAT_MAPPING.keys()) + ['ftdp'],
         default=None,
-        help='Which dataset format is this data. '
-        f'The available choices are {DATASET_FORMAT_MAPPING.keys()}')
+        help='Which dataset format is this data. The available choices are '
+        f"{list(DATASET_FORMAT_MAPPING.keys()) + ['ftdp']}. ")
     parser.add_argument(
         '--prompt-template',
         choices=PROMPT_TEMPLATE.keys(),
@@ -67,10 +66,6 @@ def parse_args():
         '--file-type',
         default='.json',
         help='We want to get the order of the file in this type.')
-    parser.add_argument(
-        '--is-ftdp',
-        action='store_true',
-        help='Whether it is in ftdp data format')
     parser.add_argument(
         '--data-order-path',
         default=None,
@@ -168,15 +163,22 @@ def process_untokenized_dataset(folder,
         pretrained_model_name_or_path=args.tokenizer_path,
         trust_remote_code=True,
         padding_side='right')
+
+    if args.dataset_format is None:
+        dataset_map_fn = None
+    elif args.dataset_format == 'ftdp':
+        dataset_map_fn = DATASET_FORMAT_MAPPING['openai']
+    else:
+        dataset_map_fn = DATASET_FORMAT_MAPPING[args.dataset_format]
+
     datasets_dict = process_untokenized_dataset(
         args.data_folder,
         tokenizer,
         args.max_length,
         args.pack_to_max_length,
-        DATASET_FORMAT_MAPPING[args.dataset_format]
-        if args.dataset_format is not None else None,
+        dataset_map_fn,
         PROMPT_TEMPLATE[args.prompt_template],
         data_order_path=args.data_order_path,
         file_type=args.file_type,
-        is_ftdp=args.is_ftdp)
+        is_ftdp=args.dataset_format == 'ftdp')
     datasets_dict.save_to_disk(args.save_folder)
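
The net effect of this file's changes is that `--dataset-format ftdp` replaces the old `--dataset-format openai --is-ftdp` pair: `ftdp` reuses the `openai` map function and also sets the `is_ftdp` flag. The sketch below isolates that dispatch so it can be run on its own. The `DATASET_FORMAT_MAPPING` here is a hypothetical stand-in with placeholder functions, not xtuner's real mapping, and `build_parser` and `resolve_dataset_map_fn` are illustrative helpers that do not exist in the commit.

```python
import argparse

# Hypothetical stand-in for xtuner's DATASET_FORMAT_MAPPING. The real mapping
# holds xtuner map functions; these placeholders only tag the example.
DATASET_FORMAT_MAPPING = {
    'openai': lambda example: {'format': 'openai', **example},
    'alpaca': lambda example: {'format': 'alpaca', **example},
}


def build_parser():
    """Build a parser whose --dataset-format also accepts 'ftdp'."""
    parser = argparse.ArgumentParser()
    formats = list(DATASET_FORMAT_MAPPING.keys()) + ['ftdp']
    parser.add_argument(
        '--dataset-format',
        choices=formats,
        default=None,
        help='Which dataset format is this data. '
        f'The available choices are {formats}.')
    return parser


def resolve_dataset_map_fn(dataset_format):
    """Return (map_fn, is_ftdp) for a --dataset-format value (illustrative helper)."""
    if dataset_format is None:
        return None, False
    if dataset_format == 'ftdp':
        # ftdp data follows the openai layout, so it reuses that map function.
        return DATASET_FORMAT_MAPPING['openai'], True
    return DATASET_FORMAT_MAPPING[dataset_format], False


args = build_parser().parse_args(['--dataset-format', 'ftdp'])
map_fn, is_ftdp = resolve_dataset_map_fn(args.dataset_format)
print(map_fn is DATASET_FORMAT_MAPPING['openai'], is_ftdp)
```

Wrapping the keys in `list(...)` before appending `'ftdp'` matters because a `dict_keys` view cannot be concatenated with a list.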
