tiktoken doesn't support multi-threaded tokenization? #36
Hello, could you provide more detailed code so that we can reproduce this?
@geekinglcq Thanks for the reply. The training framework is https://github.com/hiyouga/LLaMA-Efficient-Tuning; setting preprocessing_num_workers to anything greater than 1 raises this error. My script looks like this:
When calling it from text-generation-webui, it also only uses one CPU thread and inference is painfully slow. I opened an issue there, but no one has responded…
This may be an issue with the HuggingFace datasets version used by LLaMA-Efficient-Tuning; please try upgrading datasets to the latest version. Below is an MWE that runs fine on a recent datasets release:

```python
from transformers import AutoTokenizer
from datasets import load_dataset

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen-7B", trust_remote_code=True)

def process(example):
    ids = tokenizer.encode(example['text'])
    out = {'ids': ids, 'len': len(ids)}
    return out

dataset = load_dataset("stas/openwebtext-10k")  # just an example
tokenized = dataset.map(
    process,
    remove_columns=['text'],
    desc="tokenizing the OWT splits",
    num_proc=3,
)
```

See also:
We have no control over the multiprocessing logic inside datasets. Generally speaking, for multi-process tokenization it is best to initialize the tokenizer inside each worker process; passing tokenizer objects between processes may trigger unexpected problems.
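The advice above (initialize the tokenizer inside each worker, never ship it across process boundaries) can be sketched generically. Note the whitespace "tokenizer" below is a hypothetical stand-in marking where the real `AutoTokenizer.from_pretrained(...)` call would go; everything else is plain stdlib:

```python
import multiprocessing as mp

# Per-worker cache, populated lazily inside each worker process so the
# expensive object is never pickled and sent between processes.
_worker_cache = {}

def _get_tokenizer():
    # Stand-in for AutoTokenizer.from_pretrained("Qwen/Qwen-7B", ...):
    # a trivial whitespace splitter, since the point is only to show
    # where the real initialization would happen.
    if "tok" not in _worker_cache:
        _worker_cache["tok"] = lambda text: text.split()
    return _worker_cache["tok"]

def process(text):
    tok = _get_tokenizer()  # created on first use in this worker
    return len(tok(text))

def tokenize_all(texts, workers=3):
    # Each worker initializes its own tokenizer on first call to process().
    with mp.Pool(workers) as pool:
        return pool.map(process, texts)

if __name__ == "__main__":
    print(tokenize_all(["a b", "c", "d e f"]))  # [2, 1, 3]
```

The same lazy-init pattern is what the per-process-and-thread dictionary in the snippet further down implements for the real Qwen tokenizer.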
@skepsun Have you solved this? Apart from falling back to a single thread, is there another workaround? I upgraded to the latest datasets version and the problem persists.
Hello, would something like the following work? (Each process/thread lazily creates and caches its own tokenizer; a guard is added so a total download failure raises instead of leaving `tokenizer` unbound.)

```python
import os
import threading

from transformers import AutoTokenizer

tokenizer_dict = {}

def process(example):
    # One tokenizer per process/thread, keyed by pid + thread id.
    k = str(os.getpid()) + str(threading.get_ident())
    if k not in tokenizer_dict:
        for _ in range(100):  # try multiple times when the network is unreliable
            try:
                tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen-7B", trust_remote_code=True)
                break
            except Exception:
                pass
        else:
            raise RuntimeError("failed to load the tokenizer after 100 attempts")
        tokenizer_dict[k] = tokenizer
    else:
        tokenizer = tokenizer_dict[k]
    ids = tokenizer.encode(example["text"])
    out = {"ids": ids, "len": len(ids)}
    return out
```
As an alternative, GPT2Tokenizer supports multi-threading: https://huggingface.co/vonjack/Qwen-LLaMAfied-HFTok-7B-Chat/tree/main
The error: