nevesaynever1 · 2024-02-22T01:24:09Z

Is there an existing issue for this?

I have searched the existing issues

Current Behavior

The dataset is built in the following format:
{"content": "长城h3风扇不转。继电器好的。保险丝好的传感器新的风扇也新的这是为什么。就是继电器缺一个信号线", "summaty": "用电脑能读数据流吗？水温多少", "history": []}
{"content": "95", "summaty": "上下水管温差怎么样啊？空气是不是都排干净了呢？", "history": [["长城h3风扇不转。继电器好的。保险丝好的传感器新的风扇也新的这是为什么。就是继电器缺一个信号线", "用电脑能读数据流吗？水温多少"]]}
{"content": "是的。上下水管都好的", "summaty": "那就要检查线路了，一般风扇继电器是由电脑控制吸合的，如果电路存在断路，或者电脑坏了的话会出现继电器不吸合的情况！", "history": [["长城h3风扇不转。继电器好的。保险丝好的传感器新的风扇也新的这是为什么。就是继电器缺一个信号线", "用电脑能读数据流吗？水温多少"], ["95", "上下水管温差怎么样啊？空气是不是都排干净了呢？"]]}
{"content": "你好", "summaty": "你好", "history": []}
{"content": "请问你是谁？", "summaty": "我是chatGLm2-6B", "history": [["你好", "你好"]]}
{"content": "你是一个优秀的人工智能助手吗？", "summaty": "是的，我是。", "history": [["你好", "你好"], ["请问你是谁？","我是chatGLm2-6B"]]}
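The records above use a JSON Lines layout in which `history` holds the earlier turns as [prompt, response] pairs. A minimal sketch for checking that shape, assuming one JSON object per line; `check_history` is a hypothetical helper for illustration, not part of the ChatGLM2-6B repository:

```python
import json


def check_history(line):
    """Return True if the record's history is a list of [prompt, response] string pairs."""
    record = json.loads(line)
    history = record.get("history")
    if not isinstance(history, list):
        return False
    return all(
        isinstance(turn, list)
        and len(turn) == 2
        and all(isinstance(text, str) for text in turn)
        for turn in history
    )


# One of the records from the dataset above, copied as-is.
sample = '{"content": "请问你是谁？", "summaty": "我是chatGLm2-6B", "history": [["你好", "你好"]]}'
print(check_history(sample))  # True
```

The history field in the sample records passes this check, so the nesting of earlier turns itself looks well-formed.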

The train.sh script is:
PRE_SEQ_LEN=128
LR=1e-2

CUDA_VISIBLE_DEVICES=0 python main.py \
--do_train \
--train_file train.json \
--validation_file dev.json \
--preprocessing_num_workers 10 \
--prompt_column content \
--response_column summary \
--history_column history \
--overwrite_cache \
--model_name_or_path THUDM/chatglm2-6b \
--output_dir ../output \
--overwrite_output_dir \
--max_source_length 256 \
--max_target_length 256 \
--per_device_train_batch_size 1 \
--per_device_eval_batch_size 1 \
--gradient_accumulation_steps 16 \
--predict_with_generate \
--max_steps 3000 \
--logging_steps 10 \
--save_steps 1000 \
--learning_rate $LR \
--pre_seq_len $PRE_SEQ_LEN \

Running the script above produces the following error:
Traceback (most recent call last):
File "/mnt/ChatGLM2-6B/ptuning/main.py", line 411, in <module>
main()
File "/mnt/ChatGLM2-6B/ptuning/main.py", line 229, in main
train_dataset = train_dataset.map(
File "/root/anaconda3/envs/GLM2/lib/python3.9/site-packages/datasets/arrow_dataset.py", line 592, in wrapper
out: Union["Dataset", "DatasetDict"] = func(self, *args, **kwargs)
File "/root/anaconda3/envs/GLM2/lib/python3.9/site-packages/datasets/arrow_dataset.py", line 557, in wrapper
out: Union["Dataset", "DatasetDict"] = func(self, *args, **kwargs)
File "/root/anaconda3/envs/GLM2/lib/python3.9/site-packages/datasets/arrow_dataset.py", line 3180, in map
with Pool(len(kwargs_per_job)) as pool:
Is this caused by the data format? I checked, and it matches the format of the official dataset.
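One thing worth checking, offered as a guess rather than a confirmed diagnosis because the traceback above is cut off before the exception message: the sample records use the key `summaty`, while train.sh passes `--response_column summary`. A minimal sketch that compares the keys in each record against the column names given on the command line; `missing_columns` is a hypothetical helper, not part of the repository:

```python
import json


def missing_columns(jsonl_text, required):
    """Map each 1-based line number to the required keys absent from that record."""
    problems = {}
    for number, line in enumerate(jsonl_text.splitlines(), start=1):
        if not line.strip():
            continue  # skip blank lines
        record = json.loads(line)
        absent = [key for key in required if key not in record]
        if absent:
            problems[number] = absent
    return problems


# Two records copied from the dataset above.
data = "\n".join([
    '{"content": "你好", "summaty": "你好", "history": []}',
    '{"content": "请问你是谁？", "summaty": "我是chatGLm2-6B", "history": [["你好", "你好"]]}',
])

# Column names taken from --prompt_column, --response_column and --history_column.
print(missing_columns(data, ["content", "summary", "history"]))
# {1: ['summary'], 2: ['summary']}
```

If the real train.json shows the same mismatch, either renaming the key to `summary` in the data or passing `--response_column summaty` would make the two agree.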

Expected Behavior

No response

Steps To Reproduce

Run train.sh with the dataset shown under Current Behavior; the same traceback appears.

Environment

- OS: centos
- Python:3.9
- Transformers:4.30.2
- PyTorch:2.0.1
- CUDA Support (`python -c "import torch; print(torch.cuda.is_available())"`) : CU118

Anything else?

No response
