Pre-training data format #83
Comments
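Hi, did you manage to get this running?
No, nobody has replied at all.
When you run preprocess_dataset.py, you need to set block_size to a smaller value in build_dataset and shuffle_dataset, or else enlarge your dataset.
The advice above is correct; I tested it myself and it works. Change the setting in cpm_live/dataset/distributed_dataset.py to DEFAULT_BLOCK_SIZE=16<<10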
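For anyone unsure what that value means: 16<<10 is a bit shift that evaluates to 16384 bytes (16 KiB). The sketch below only illustrates the arithmetic. The num_blocks helper, the 5 MiB dataset size, and the 16<<20 comparison value are assumptions made for illustration and are not taken from cpm_live.

```python
# 16 << 10 shifts 16 left by 10 bits, i.e. 16 * 1024 bytes.
DEFAULT_BLOCK_SIZE = 16 << 10


def num_blocks(dataset_bytes: int, block_size: int) -> int:
    """Hypothetical helper: number of blocks needed to hold the data."""
    return -(-dataset_bytes // block_size)  # ceiling division


if __name__ == "__main__":
    small_dataset = 5 * 1024 * 1024  # assumed 5 MiB example dataset
    print(DEFAULT_BLOCK_SIZE)
    # With a small block size the data spans many blocks.
    print(num_blocks(small_dataset, DEFAULT_BLOCK_SIZE))
    # With a much larger (assumed) block size it fits in a single block.
    print(num_blocks(small_dataset, 16 << 20))
```

Under these assumptions, a small dataset spans many blocks only when the block size is small, which fits the advice to either shrink block_size or use more data.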
I ran the pretrain_cpm_bee.sh script.
I changed the dataset argument to point to datasets.json
and edited the path inside it so that it processes my own data.
I don't really understand the transforms field; I hope someone can explain it.
The data I used is quoted below.
The error message is below.