Kd vllm generation#5351
Conversation
|
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update. |
There was a problem hiding this comment.
Cursor Bugbot has reviewed your changes and found 1 potential issue.
Bugbot Autofix is OFF. To automatically fix reported issues with cloud agents, enable autofix in the Cursor dashboard.
| prompts=prompt_ids_list, | ||
| images=None, | ||
| num_generations=self.num_generations, | ||
| ) |
There was a problem hiding this comment.
Incompatible prompt handling for multi-generation vLLM usage
Medium Severity
VLLMGeneration.generate assumes prompts are pre-duplicated num_generations times (as GRPO does via its RepeatSampler), but GOLD passes each prompt exactly once. In server mode, the method does all_prompts[::num_generations] to "deduplicate," which skips prompts when num_generations > 1. In colocate mode, n=1 is hardcoded, so extra generations are silently ignored. The old code had custom _generate_vllm_server_global / _generate_vllm_colocate methods that correctly handled n=num_generations without requiring pre-duplication. This is a regression when num_generations > 1 (default is 1, so typical usage is unaffected).


What does this PR do?
Addresses the comment from #5137 to use
trl.generation.VLLMGenerationinstead of the separate vLLM logic.Before submitting
Pull Request section?
to it if that's the case.
Who can review?
Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.
Note
Medium Risk
Moderate risk because it rewires GOLD on-policy generation, prompt/attention masking, and weight-sync behavior for vLLM, which can affect training correctness and stability in distributed setups.
Overview
GOLD’s vLLM path is refactored to use
trl.generation.VLLMGenerationinstead of trainer-local server/colocate logic, removing the custom callback/weight-update code and centralizing sync + generation behindsync_weights()/generate().On-policy buffering is updated to pass prompt token IDs directly (respecting
prompt_attention_mask), build concatenatedattention_maskexplicitly, and generate labels only on completion tokens that are not padded.GOLDConfiggains new vLLM knobs (vllm_server_base_url,vllm_group_port,vllm_max_model_length,vllm_model_impl), and tests are updated/added to lock in these behaviors (prompt masking for vLLM prompts, special-token reconstruction, and defaultingvllm_max_model_lengthtomax_length).Written by Cursor Bugbot for commit dcfce59. This will update automatically on new commits. Configure here.