
Pretraining data format #83

Open
ScienGU opened this issue Jun 21, 2023 · 4 comments

Comments

@ScienGU

ScienGU commented Jun 21, 2023

I ran the pretrain_cpm_bee.sh script and modified the dataset argument to point to datasets.json:

[
    {
        "dataset_name": "pretrain",
        "task_name": "mlm",
        "weight": 1.0,
        "path": "/home/litao/ScienGU/CPM-Bee/sciengu/zhinan/bin_data",
        "transforms": [
            {
                "answer": "$answer",
                "document": "$source"
            },
            {
                "answer": "$answer",
                "query": "$source"
            },
            {
                "answer": "$answer",
                "input": "$source"
            }
        ]
    }
]

I changed the path inside it to point to my own data. I don't quite understand the transforms field; I'd appreciate an explanation.

Below is the data being referenced:

{"answer": "当前现代医学的主要治疗甲状腺药物", "input": "当前现代医学的主要治疗甲状腺药物"}

Below is the error message:

Traceback (most recent call last):
  File "/home/share/wuhkjdxue30509/home/wust30509/.conda/envs/bmtrain/lib/python3.8/multiprocessing/process.py", line 315, in _bootstrap
    self.run()
  File "/home/share/wuhkjdxue30509/home/wust30509/.conda/envs/bmtrain/lib/python3.8/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/home/share/wuhkjdxue30509/home/wust30509/CPM-Bee/src/cpm_live/training_tasks/bee/pretrain.py", line 932, in _mixed_dataset_process
    batch = packer.add_data(config[ds_id])
  File "/home/share/wuhkjdxue30509/home/wust30509/CPM-Bee/src/cpm_live/training_tasks/bee/pretrain.py", line 638, in add_data
    ) = self.build_instance(config)
  File "/home/share/wuhkjdxue30509/home/wust30509/CPM-Bee/src/cpm_live/training_tasks/bee/pretrain.py", line 439, in build_instance
    inp = ds.read()
  File "/home/share/wuhkjdxue30509/home/wust30509/CPM-Bee/src/cpm_live/dataset/distributed_dataset.py", line 554, in read
    next_block_id = self._get_next_block()
  File "/home/share/wuhkjdxue30509/home/wust30509/CPM-Bee/src/cpm_live/dataset/distributed_dataset.py", line 394, in _get_next_block
    raise RuntimeError("Empty dataset {}".format(self._path))
RuntimeError: Empty dataset /home/share/wuhkjdxue30509/home/wust30509/CPM-Bee/sciengu/zhinan/bin_data
Process Process-1:
Traceback (most recent call last):
  File "/home/share/wuhkjdxue30509/home/wust30509/.conda/envs/bmtrain/lib/python3.8/multiprocessing/process.py", line 315, in _bootstrap
    self.run()
  File "/home/share/wuhkjdxue30509/home/wust30509/.conda/envs/bmtrain/lib/python3.8/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/home/share/wuhkjdxue30509/home/wust30509/CPM-Bee/src/cpm_live/training_tasks/bee/pretrain.py", line 932, in _mixed_dataset_process
    batch = packer.add_data(config[ds_id])
  File "/home/share/wuhkjdxue30509/home/wust30509/CPM-Bee/src/cpm_live/training_tasks/bee/pretrain.py", line 638, in add_data
    ) = self.build_instance(config)
  File "/home/share/wuhkjdxue30509/home/wust30509/CPM-Bee/src/cpm_live/training_tasks/bee/pretrain.py", line 440, in build_instance
    inp = self.apply_transform(inp, transform)
  File "/home/share/wuhkjdxue30509/home/wust30509/CPM-Bee/src/cpm_live/training_tasks/bee/pretrain.py", line 344, in apply_transform
    _expand_mapping(data, [], src[1:].split("."), tgt.split("."))
  File "/home/share/wuhkjdxue30509/home/wust30509/CPM-Bee/src/cpm_live/training_tasks/bee/pretrain.py", line 338, in _expand_mapping
    _expand_mapping(data[path[0]], stars, path[1:], target)
KeyError: 'source'
@fengcai24

Hi, did you manage to get it running?

@ScienGU
Author

ScienGU commented Jun 26, 2023

No, and nobody has replied yet.

@gongbaitao
Collaborator

When running preprocess_dataset.py, you need to set block_size in build_dataset and shuffle_dataset to a smaller value, or enlarge your dataset.
transforms is used to transform the data: {"document": "$source"} means the "source" field of the raw data is mapped into a "document" field.
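A minimal sketch of how such a field mapping behaves (illustrative only, not the actual CPM-Bee implementation — function and variable names here are made up):

```python
def apply_transform(record, transform):
    """Map raw-data fields into new fields.

    A value like "$source" means: copy the value under the raw record's
    "source" key into the target key. Other values are kept literally.
    (Illustrative sketch, not the real cpm_live code.)
    """
    out = {}
    for target, src in transform.items():
        if isinstance(src, str) and src.startswith("$"):
            # A missing key here is exactly the "KeyError: 'source'"
            # shown in the traceback above: the raw record has no
            # "source" field for "$source" to read from.
            out[target] = record[src[1:]]
        else:
            out[target] = src
    return out

# The raw record in this issue only has "answer" and "input" keys,
# while the transforms reference "$source":
record = {"answer": "...", "input": "..."}
transform = {"answer": "$answer", "document": "$source"}
# apply_transform(record, transform) would raise KeyError: 'source'
```

Under this reading, the fix is to either name the field "source" in the raw data, or change "$source" in the transforms to the field the data actually contains (here, "$input").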

@nasame

nasame commented Sep 28, 2023

When running preprocess_dataset.py, you need to set block_size in build_dataset and shuffle_dataset to a smaller value, or enlarge your dataset. transforms is used to transform the data: {"document": "$source"} means the "source" field of the raw data is mapped into a "document" field.

What the collaborator said is correct; I verified it myself. Change DEFAULT_BLOCK_SIZE to 16<<10 in cpm_live/dataset/distributed_dataset.py.
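For reference, 16<<10 is 16 KiB. The intuition behind the fix, sketched below, is that a dataset smaller than one block never yields a complete block, which would explain the "Empty dataset" error above (the helper and the larger comparison block size are illustrative assumptions, not values from the repository):

```python
# 16 << 10 shifts 16 left by 10 bits, i.e. multiplies by 1024.
SMALL_BLOCK_SIZE = 16 << 10   # 16384 bytes = 16 KiB
print(SMALL_BLOCK_SIZE)       # 16384

def has_full_block(dataset_bytes, block_size):
    """Return True if at least one complete block fits in the dataset.

    Hypothetical helper: if no full block fits, a block-based reader
    like _get_next_block() has nothing to return.
    """
    return dataset_bytes // block_size >= 1

# A ~100 KB dataset fills several 16 KiB blocks ...
print(has_full_block(100_000, 16 << 10))  # True
# ... but not a single block at a hypothetical multi-megabyte size.
print(has_full_block(100_000, 16 << 20))  # False
```

So either shrinking the block size or enlarging the dataset, as suggested above, makes at least one full block available.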
