
Error during multi-GPU RFT sampling #3116

Closed
DogeWatch opened this issue Feb 14, 2025 · 2 comments
Comments

@DogeWatch

Describe the bug
What the bug is and how to reproduce it, preferably with screenshots.

(Screenshot of the error attached.) I'm just using rft.py from the examples, tested on 2 A100s.

```python
def do_sample(model: str, model_type: str, dataset: List[str], iter: int, args):
    device_count = torch.cuda.device_count()
    handlers = []
    datasets = []
    # Sampling cache, to avoid lmdeploy & PRM run at the same time
    # Why lmdeploy not vllm? we found that the responses generated by lmdeploy are more similar than ones of vllm.
    for device in range(device_count):
        sample_cmd = (f'{conda_prefix} CUDA_VISIBLE_DEVICES={device} swift sample '
                      f'--model {model} --model_type {model_type} '
                      f'--dataset {" ".join(dataset)} '
                      f'--data_range {device} {device_count} '
                      # f'--max_length 50000 '
                      f'--truncation_strategy delete '
                      f'--system "{args.system_prompt}" '
                      f'--load_args false '
                      f'--sampler_engine {args.sampler_engine} '
                      f'--max_new_tokens {args.max_new_tokens} '
                      f'--override_exist_file true '
                      f'--num_sampling_per_gpu_batch_size 16 '
                      f'--num_return_sequences 16 '
                      f'--cache_files {args.sample_output_dir}/iter_{iter}_proc_{device}_cache.jsonl '
                      f'--output_dir {args.sample_output_dir} '
                      f'--output_file iter_{iter}_proc_{device}_cache.jsonl '
                      f'--top_p 1.0 '
                      f'--temperature 1.0 ')
        print(f'Sampling caches of iter {iter}, part {device}.', flush=True)
        env = os.environ.copy()
        env['CUDA_VISIBLE_DEVICES'] = str(device)
        handler = subprocess.Popen(
            f'{sample_cmd}' + f' > {args.log_dir}/sample_iter_{iter}_proc_{device}_cache.log 2>&1',
            env=env,
            shell=True,
            executable='/bin/bash')
        handlers.append(handler)

    for proc, handler in enumerate(handlers):
        handler.wait()
        assert os.path.exists(os.path.join(args.sample_output_dir, f'iter_{iter}_proc_{proc}_cache.jsonl'))
    print("Sampling cache finished....")
```
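For context on the launch loop above: each per-GPU process receives `--data_range {device} {device_count}`, i.e. it works on shard `device` of `device_count` disjoint shards of the dataset. One possible sharding scheme (a conceptual sketch only, not necessarily how ms-swift implements `--data_range`):

```python
def shard(dataset, index, total):
    # Worker `index` of `total` takes every `total`-th sample, so the
    # union of all shards covers the dataset exactly once with no overlap.
    return dataset[index::total]
```

With 2 GPUs, worker 0 gets the even-indexed samples and worker 1 the odd-indexed ones.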

**Your hardware and system info**
ms-swift=v3.1.0
2 * A100
@tastelikefeet
Collaborator

(Screenshot attached.)

Could you debug this and post the `sample_cmd` that actually gets executed?
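One way to do what the maintainer asks: print the assembled command string before `Popen` runs it. Below is a hypothetical debug helper (not part of ms-swift); `shlex.split` shows how the shell will tokenize the command, which makes quoting problems, e.g. in `--system "{args.system_prompt}"`, easy to spot.

```python
import shlex

# Hypothetical helper: dump the raw command and its parsed argv so every
# flag value (paths, system prompt quoting, etc.) can be inspected.
def debug_print_cmd(sample_cmd: str) -> None:
    print('raw command:', sample_cmd)
    for i, tok in enumerate(shlex.split(sample_cmd)):
        print(f'  argv[{i}]: {tok}')
```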

@DogeWatch
Author

It was an environment variable problem; solved now.
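The author doesn't say which variable was at fault, but one plausible source of confusion in the quoted script is that the GPU assignment is set twice: once as an inline `CUDA_VISIBLE_DEVICES={device}` prefix in the shell command and again in the `env` dict passed to `Popen`. A sketch of a cleanup that keeps a single source of truth (an assumption, not the confirmed fix):

```python
import os
import subprocess

# Sketch: pass CUDA_VISIBLE_DEVICES only through the env dict given to
# Popen, instead of also prefixing the shell command, so the two
# settings cannot disagree.
def launch_on_gpu(cmd: str, device: int, log_path: str) -> subprocess.Popen:
    env = os.environ.copy()
    env['CUDA_VISIBLE_DEVICES'] = str(device)  # single source of truth
    return subprocess.Popen(
        f'{cmd} > {log_path} 2>&1',
        env=env,
        shell=True,
        executable='/bin/bash')
```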
