
tiktoken doesn't support multi-threaded tokenization? #36

Closed
skepsun opened this issue Aug 4, 2023 · 8 comments

Comments

@skepsun

skepsun commented Aug 4, 2023

Error:

TypeError: cannot pickle 'builtins.CoreBPE' object
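
A minimal sketch of what presumably happens under the hood, assuming the error comes from datasets pickling the tokenizer when it spawns worker processes (the Qwen/Qwen-7B model name is taken from the script below; not verified here):

import pickle
from transformers import AutoTokenizer

# Assumption: Qwen's tokenizer wraps a Rust-backed tiktoken CoreBPE object
# that cannot be pickled, so any attempt to send it to a worker process fails.
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen-7B", trust_remote_code=True)
pickle.dumps(tokenizer)  # expected to raise: TypeError: cannot pickle 'builtins.CoreBPE' object
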
@geekinglcq

Hi, could you please share more detailed code so that we can reproduce this?

@skepsun
Author

skepsun commented Aug 4, 2023

@geekinglcq Thanks for the reply. The training framework I'm using is https://github.com/hiyouga/LLaMA-Efficient-Tuning. The error appears as soon as preprocessing_num_workers is set above 1. My script is as follows:

CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6 accelerate launch --num_processes=7 src/train_bash.py \
    --stage sft \
    --deepspeed configs/ds_zero2.json \
    --lora_target q_proj,k_proj,v_proj,o_proj,gate_proj,up_proj,down_proj \
    --template vicuna \
    --model_name_or_path ../Qwen-7B \
    --do_train \
    --dataset alpaca_gpt4_zh \
    --finetuning_type full \
    --warmup_ratio 0.03 \
    --output_dir outputs/qwen-7b-sft \
    --per_device_train_batch_size 8 \
    --gradient_accumulation_steps 8 \
    --preprocessing_num_workers 12 \
    --lr_scheduler_type cosine \
    --evaluation_strategy steps \
    --eval_steps 100 \
    --logging_steps 1 \
    --save_steps 100 \
    --save_total_limit 3 \
    --learning_rate 2e-5 \
    --dev_ratio 0.001 \
    --num_train_epochs 3 \
    --resume_lora_training True \
    --plot_loss \
    --report_to wandb \
    --fp16 \
    --tf32 True

@nobodybut

When I call it from text-generation-webui, it also only uses 1 CPU thread and inference is painfully slow. I opened an issue about it there, but nobody has responded...

@jklj077
Contributor

jklj077 commented Aug 4, 2023

This may be an issue with the version of HuggingFace datasets that LLaMA-Efficient-Tuning uses. Could you try upgrading datasets to the latest version?

Below is an MWE; it runs fine on the latest datasets release:

from transformers import AutoTokenizer
from datasets import load_dataset

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen-7B", trust_remote_code=True)
def process(example):
    ids = tokenizer.encode(example['text'])
    out = {'ids': ids, 'len': len(ids)}
    return out

dataset = load_dataset("stas/openwebtext-10k") # just an example
tokenized = dataset.map(
    process,
    remove_columns=['text'],
    desc="tokenizing the OWT splits",
    num_proc=3,
)

See:
datasets commit: huggingface/datasets#5552
datasets issue: huggingface/datasets#5769
LLaMA-Efficient-Tuning issue: hiyouga/LLaMA-Factory#328

@jklj077
Contributor

jklj077 commented Aug 4, 2023

We have no control over the multiprocessing logic inside datasets. In general, for multi-process tokenization it is best to initialize the tokenizer inside each worker process; passing the tokenizer object between processes can trigger unexpected problems.

@zhaochs1995

@skepsun Have you solved this? Apart from falling back to a single worker, is there any other workaround? Upgrading to the latest datasets release did not help; the problem persists.

@JianxinMa
Contributor

Hello, could you check whether writing it like the following works?

import os
import threading
from transformers import AutoTokenizer

# Cache one tokenizer per (process, thread), so the tokenizer object never
# has to be pickled and sent across process boundaries.
tokenizer_dict = {}

def process(example):
    k = str(os.getpid()) + str(threading.get_ident())
    if k not in tokenizer_dict:
        for _ in range(100):  # try multiple times when the network is unreliable
            try:
                tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen-7B", trust_remote_code=True)
                break
            except Exception:
                pass
        tokenizer_dict[k] = tokenizer
    else:
        tokenizer = tokenizer_dict[k]
    ids = tokenizer.encode(example["text"])
    out = {"ids": ids, "len": len(ids)}
    return out
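
A possible usage sketch, assuming the same stas/openwebtext-10k dataset as in the MWE above: since each worker builds its own tokenizer inside process(), nothing needs to be pickled across processes.

from datasets import load_dataset

dataset = load_dataset("stas/openwebtext-10k")  # just an example, as above
tokenized = dataset.map(
    process,
    remove_columns=["text"],
    num_proc=3,
)
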

@songkq

songkq commented Aug 12, 2023

As a workaround, this GPT2Tokenizer-based replacement supports multiprocessing: https://huggingface.co/vonjack/Qwen-LLaMAfied-HFTok-7B-Chat/tree/main
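
A hedged loading sketch, assuming that repository exposes a standard, picklable Hugging Face tokenizer (the repo name is taken from the link above and is not verified here):

from transformers import AutoTokenizer

# Assumption: the LLaMAfied tokenizer is a plain GPT2-style tokenizer, so it
# can be pickled and used with datasets.map(..., num_proc > 1).
tokenizer = AutoTokenizer.from_pretrained("vonjack/Qwen-LLaMAfied-HFTok-7B-Chat")
print(tokenizer.encode("multi-process tokenization test"))
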

jklj077 closed this as completed Aug 28, 2023