Error loading a custom dataset when using ms-swift 3.1 #3090

Open
corkiyao opened this issue Feb 13, 2025 · 0 comments
Describe the bug
What the bug is, and how to reproduce, better with screenshots
I am using the fine-tuning code from https://github.com/modelscope/ms-swift/blob/main/examples/notebook/qwen2_5-self-cognition/self-cognition-sft.ipynb and it fails when loading the dataset from a local path.

import os
os.environ['CUDA_VISIBLE_DEVICES'] = '0'

from swift.llm import (
    get_model_tokenizer, load_dataset, get_template, EncodePreprocessor, get_model_arch,
    get_multimodal_target_regex, LazyLLMDataset
)
from swift.utils import get_logger, get_model_parameter_info, plot_images, seed_everything
from swift.tuners import Swift, LoraConfig
from swift.trainers import Seq2SeqTrainer, Seq2SeqTrainingArguments
from functools import partial

logger = get_logger()
seed_everything(42)

# model
model_id_or_path = './ms-swift/Internvl25_1B'
model_type = 'internvl2_5'
system = None  # use the default system defined in the template
output_dir = 'output/InternVL2_5-1B'

# dataset
dataset = ['./ms-swift/Datasets/Jsonfile/train__swift.jsonl']  # dataset_id or dataset_path
data_seed = 42
max_length = 8192
split_dataset_ratio = 0.01  # fraction of the data split off as the validation set
num_proc = 4  # number of processes used for data preprocessing
strict = False

# lora
lora_rank = 8
lora_alpha = 8
freeze_llm = False
freeze_vit = True
freeze_aligner = True
.................
.................

Your hardware and system info
Write your system info like CUDA version/system/GPU/torch version here

ms-swift version 3.1

Additional context
Add any other context about the problem here
bug

  File "./ms-swift/internvl_251B.py", line 88, in <module>
    train_dataset, val_dataset = load_dataset(dataset, split_dataset_ratio=split_dataset_ratio, num_proc=num_proc,
  File "./ms-swift/swift/llm/dataset/loader.py", line 468, in load_dataset
    train_dataset = load_function(dataset_syntax, dataset_meta, **load_kwargs)
  File "./ms-swift/swift/llm/dataset/loader.py", line 363, in load
    dataset = DatasetLoader._load_dataset_path(
  File "./ms-swift/swift/llm/dataset/loader.py", line 197, in _load_dataset_path
    dataset = hf_load_dataset(file_type, data_files=dataset_path, **kwargs)
  File "./anaconda3/envs/swift/lib/python3.10/site-packages/datasets/load.py", line 2151, in load_dataset
    builder_instance.download_and_prepare(
  File "./anaconda3/envs/swift/lib/python3.10/site-packages/datasets/builder.py", line 924, in download_and_prepare
    self._download_and_prepare(
  File "./anaconda3/envs/swift/lib/python3.10/site-packages/datasets/builder.py", line 1000, in _download_and_prepare
    self._prepare_split(split_generator, **prepare_split_kwargs)
  File "./anaconda3/envs/swift/lib/python3.10/site-packages/datasets/builder.py", line 1741, in _prepare_split
    for job_id, done, content in self._prepare_split_single(
  File "./anaconda3/envs/swift/lib/python3.10/site-packages/datasets/builder.py", line 1897, in _prepare_split_single
    raise DatasetGenerationError("An error occurred while generating the dataset") from e
datasets.exceptions.DatasetGenerationError: An error occurred while generating the dataset
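
The last frame of the traceback raises `DatasetGenerationError` with `from e`, so the underlying exception is attached to it but does not appear in the excerpt above. A minimal sketch for printing the whole chain follows. The helper names `describe_exception_chain` and `run_and_report` are hypothetical and not part of ms-swift or `datasets`; they rely only on the standard `__cause__` and `__context__` attributes.

```python
def describe_exception_chain(exc):
    """Return one line per exception in the chain, outermost first."""
    lines = []
    seen = set()
    while exc is not None and id(exc) not in seen:
        seen.add(id(exc))
        lines.append(f"{type(exc).__name__}: {exc}")
        # Prefer the explicit cause ("raise ... from e"), else the implicit context.
        exc = exc.__cause__ or exc.__context__
    return lines


def run_and_report(load_fn):
    """Call load_fn(); on failure, print the full exception chain and re-raise."""
    try:
        return load_fn()
    except Exception as exc:
        for depth, line in enumerate(describe_exception_chain(exc)):
            print("  " * depth + line)
        raise
```

In the script above, the failing call could be wrapped as `run_and_report(lambda: load_dataset(dataset, ...))`, keeping the same arguments as the original call on line 88 of the script.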

That is the error I get. The content of my jsonl file is:

{"messages": [{"role": "system", "content": "You are a helpful assistant."}, {"role": "user", "content": "<image>检测出图像中的<ref-object>,并提供每个物体的目标框坐标。"}, {"role": "assistant", "content": "<bbox></bbox>"}], "images": ["./ms-swift/Datasets/Traffic/15/1555.jpg"], "objects": {"ref": ["某物体"], "bbox": [[371, 648, 450, 758]]}}
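
One thing that may be worth ruling out is a row that is not valid JSON, or rows whose fields have different shapes (for example `bbox` nested as a list of lists in one row and as a flat list in another), since the JSON loader in `datasets` can fail on either. Whether this is the cause here is an assumption; the underlying exception would confirm it. A minimal sketch of such a check, using only the standard library (`type_signature` and `check_jsonl` are hypothetical helper names):

```python
import json
from collections import Counter


def type_signature(value):
    """Describe the nested type of a JSON value, e.g. 'list[list[int]]'."""
    if isinstance(value, dict):
        inner = ", ".join(f"{k}: {type_signature(v)}" for k, v in sorted(value.items()))
        return "{" + inner + "}"
    if isinstance(value, list):
        inner = sorted({type_signature(v) for v in value})
        return "list[" + "|".join(inner) + "]"
    return type(value).__name__


def check_jsonl(path):
    """Return (bad_lines, signatures): unparsable lines and a count of row shapes."""
    bad_lines = []
    signatures = Counter()
    with open(path, encoding="utf-8") as f:
        for lineno, line in enumerate(f, start=1):
            if not line.strip():
                continue  # ignore blank lines
            try:
                row = json.loads(line)
            except json.JSONDecodeError as err:
                bad_lines.append((lineno, str(err)))
                continue
            signatures[type_signature(row)] += 1
    return bad_lines, signatures
```

Running `check_jsonl('./ms-swift/Datasets/Jsonfile/train__swift.jsonl')` and finding more than one entry in `signatures`, or any entry in `bad_lines`, would point at the rows to inspect.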