
tiktoken doesn't support multi-threaded tokenization? #36

Closed
skepsun opened this issue Aug 4, 2023 · 8 comments

Comments

@skepsun

skepsun commented Aug 4, 2023

Error:

TypeError: cannot pickle 'builtins.CoreBPE' object
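
A minimal sketch of what presumably happens under the hood, assuming the error comes from datasets pickling the tokenizer when it spawns worker processes (the Qwen/Qwen-7B model name is taken from the script below; not verified here):

import pickle
from transformers import AutoTokenizer

# Assumption: Qwen's tokenizer wraps a Rust-backed tiktoken CoreBPE object
# that cannot be pickled, so any attempt to send it to a worker process fails.
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen-7B", trust_remote_code=True)
pickle.dumps(tokenizer)  # expected to raise: TypeError: cannot pickle 'builtins.CoreBPE' object
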
@geekinglcq

Hi, could you please share more detailed code so that we can reproduce this?

@skepsun
Author

skepsun commented Aug 4, 2023

@geekinglcq Thanks for the reply. The training framework I'm using is https://github.com/hiyouga/LLaMA-Efficient-Tuning. The error appears as soon as preprocessing_num_workers is set above 1. My script is as follows:

CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6 accelerate launch --num_processes=7 src/train_bash.py \
    --stage sft \
    --deepspeed configs/ds_zero2.json \
    --lora_target q_proj,k_proj,v_proj,o_proj,gate_proj,up_proj,down_proj \
    --template vicuna \
    --model_name_or_path ../Qwen-7B \
    --do_train \
    --dataset alpaca_gpt4_zh \
    --finetuning_type full \
    --warmup_ratio 0.03 \
    --output_dir outputs/qwen-7b-sft \
    --per_device_train_batch_size 8 \
    --gradient_accumulation_steps 8 \
    --preprocessing_num_workers 12 \
    --lr_scheduler_type cosine \
    --evaluation_strategy steps \
    --eval_steps 100 \
    --logging_steps 1 \
    --save_steps 100 \
    --save_total_limit 3 \
    --learning_rate 2e-5 \
    --dev_ratio 0.001 \
    --num_train_epochs 3 \
    --resume_lora_training True \
    --plot_loss \
    --report_to wandb \
    --fp16 \
    --tf32 True

@nobodybut

When I call it from text-generation-webui, it also only uses 1 CPU thread and inference is painfully slow. I opened an issue about it there, but nobody has responded...

@jklj077
Contributor

jklj077 commented Aug 4, 2023

This may be an issue with the version of HuggingFace datasets that LLaMA-Efficient-Tuning uses. Could you try upgrading datasets to the latest version?

Below is an MWE; it runs fine on the latest datasets release:

from transformers import AutoTokenizer
from datasets import load_dataset

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen-7B", trust_remote_code=True)
def process(example):
    ids = tokenizer.encode(example['text'])
    out = {'ids': ids, 'len': len(ids)}
    return out

dataset = load_dataset("stas/openwebtext-10k") # just an example
tokenized = dataset.map(
    process,
    remove_columns=['text'],
    desc="tokenizing the OWT splits",
    num_proc=3,
)

See:
datasets commit: huggingface/datasets#5552
datasets issue: huggingface/datasets#5769
LLaMA-Efficient-Tuning issue: hiyouga/LLaMA-Factory#328

@jklj077
Contributor

jklj077 commented Aug 4, 2023

We have no control over the multiprocessing logic inside datasets. In general, for multi-process tokenization it is best to initialize the tokenizer inside each worker process; passing the tokenizer object between processes can trigger unexpected problems.

@zhaochs1995

@skepsun Have you solved this? Apart from falling back to a single worker, is there any other workaround? Upgrading to the latest datasets release did not help; the problem persists.

@JianxinMa
Contributor

Hello, could you check whether writing it like the following works?

import os
import threading
from transformers import AutoTokenizer

# Cache one tokenizer per (process, thread), so the tokenizer object never
# has to be pickled and sent across process boundaries.
tokenizer_dict = {}

def process(example):
    k = str(os.getpid()) + str(threading.get_ident())
    if k not in tokenizer_dict:
        for _ in range(100):  # try multiple times when the network is unreliable
            try:
                tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen-7B", trust_remote_code=True)
                break
            except Exception:
                pass
        tokenizer_dict[k] = tokenizer
    else:
        tokenizer = tokenizer_dict[k]
    ids = tokenizer.encode(example["text"])
    out = {"ids": ids, "len": len(ids)}
    return out
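
A possible usage sketch, assuming the same stas/openwebtext-10k dataset as in the MWE above: since each worker builds its own tokenizer inside process(), nothing needs to be pickled across processes.

from datasets import load_dataset

dataset = load_dataset("stas/openwebtext-10k")  # just an example, as above
tokenized = dataset.map(
    process,
    remove_columns=["text"],
    num_proc=3,
)
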

@songkq

songkq commented Aug 12, 2023

As a workaround, this GPT2Tokenizer-based replacement supports multiprocessing: https://huggingface.co/vonjack/Qwen-LLaMAfied-HFTok-7B-Chat/tree/main
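
A hedged loading sketch, assuming that repository exposes a standard, picklable Hugging Face tokenizer (the repo name is taken from the link above and is not verified here):

from transformers import AutoTokenizer

# Assumption: the LLaMAfied tokenizer is a plain GPT2-style tokenizer, so it
# can be pickled and used with datasets.map(..., num_proc > 1).
tokenizer = AutoTokenizer.from_pretrained("vonjack/Qwen-LLaMAfied-HFTok-7B-Chat")
print(tokenizer.encode("multi-process tokenization test"))
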

jklj077 closed this as completed Aug 28, 2023