
Error during multi-GPU RFT sampling #3116

Closed
DogeWatch opened this issue Feb 14, 2025 · 2 comments
Comments

@DogeWatch

Describe the bug
What the bug is and how to reproduce it, preferably with screenshots.

(Screenshot of the error attached.) I'm just using rft.py from the examples, tested on 2 A100s.

```python
def do_sample(model: str, model_type: str, dataset: List[str], iter: int, args):
    device_count = torch.cuda.device_count()
    handlers = []
    datasets = []
    # Sampling cache, to avoid lmdeploy & PRM run at the same time
    # Why lmdeploy not vllm? we found that the responses generated by lmdeploy are more similar than ones of vllm.
    for device in range(device_count):
        sample_cmd = (f'{conda_prefix} CUDA_VISIBLE_DEVICES={device} swift sample '
                      f'--model {model} --model_type {model_type} '
                      f'--dataset {" ".join(dataset)} '
                      f'--data_range {device} {device_count} '
                      # f'--max_length 50000 '
                      f'--truncation_strategy delete '
                      f'--system "{args.system_prompt}" '
                      f'--load_args false '
                      f'--sampler_engine {args.sampler_engine} '
                      f'--max_new_tokens {args.max_new_tokens} '
                      f'--override_exist_file true '
                      f'--num_sampling_per_gpu_batch_size 16 '
                      f'--num_return_sequences 16 '
                      f'--cache_files {args.sample_output_dir}/iter_{iter}_proc_{device}_cache.jsonl '
                      f'--output_dir {args.sample_output_dir} '
                      f'--output_file iter_{iter}_proc_{device}_cache.jsonl '
                      f'--top_p 1.0 '
                      f'--temperature 1.0 ')
        print(f'Sampling caches of iter {iter}, part {device}.', flush=True)
        env = os.environ.copy()
        env['CUDA_VISIBLE_DEVICES'] = str(device)
        handler = subprocess.Popen(
            f'{sample_cmd}' + f' > {args.log_dir}/sample_iter_{iter}_proc_{device}_cache.log 2>&1',
            env=env,
            shell=True,
            executable='/bin/bash')
        handlers.append(handler)

    for proc, handler in enumerate(handlers):
        handler.wait()
        assert os.path.exists(os.path.join(args.sample_output_dir, f'iter_{iter}_proc_{proc}_cache.jsonl'))
    print("Sampling cache finished....")
```
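For context on the launch loop above: each per-GPU process receives `--data_range {device} {device_count}`, i.e. it works on shard `device` of `device_count` disjoint shards of the dataset. One possible sharding scheme (a conceptual sketch only, not necessarily how ms-swift implements `--data_range`):

```python
def shard(dataset, index, total):
    # Worker `index` of `total` takes every `total`-th sample, so the
    # union of all shards covers the dataset exactly once with no overlap.
    return dataset[index::total]
```

With 2 GPUs, worker 0 gets the even-indexed samples and worker 1 the odd-indexed ones.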

**Your hardware and system info**
ms-swift=v3.1.0
2 * A100
@tastelikefeet
Collaborator

(Screenshot attached.)

Could you debug this and post the `sample_cmd` that actually gets executed?
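One way to do what the maintainer asks: print the assembled command string before `Popen` runs it. Below is a hypothetical debug helper (not part of ms-swift); `shlex.split` shows how the shell will tokenize the command, which makes quoting problems, e.g. in `--system "{args.system_prompt}"`, easy to spot.

```python
import shlex

# Hypothetical helper: dump the raw command and its parsed argv so every
# flag value (paths, system prompt quoting, etc.) can be inspected.
def debug_print_cmd(sample_cmd: str) -> None:
    print('raw command:', sample_cmd)
    for i, tok in enumerate(shlex.split(sample_cmd)):
        print(f'  argv[{i}]: {tok}')
```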

@DogeWatch
Author

It was an environment variable problem; solved now.
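The author doesn't say which variable was at fault, but one plausible source of confusion in the quoted script is that the GPU assignment is set twice: once as an inline `CUDA_VISIBLE_DEVICES={device}` prefix in the shell command and again in the `env` dict passed to `Popen`. A sketch of a cleanup that keeps a single source of truth (an assumption, not the confirmed fix):

```python
import os
import subprocess

# Sketch: pass CUDA_VISIBLE_DEVICES only through the env dict given to
# Popen, instead of also prefixing the shell command, so the two
# settings cannot disagree.
def launch_on_gpu(cmd: str, device: int, log_path: str) -> subprocess.Popen:
    env = os.environ.copy()
    env['CUDA_VISIBLE_DEVICES'] = str(device)  # single source of truth
    return subprocess.Popen(
        f'{cmd} > {log_path} 2>&1',
        env=env,
        shell=True,
        executable='/bin/bash')
```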
