
[harbor][step-wise] Make Harbor use step-wise training#1542

Merged

CharlieFRuan merged 8 commits into main from harbor-stepwise on Apr 23, 2026

Conversation

@CharlieFRuan
Member

@CharlieFRuan CharlieFRuan commented Apr 21, 2026

Before this PR, when training with Harbor, we relied on Harbor returning an all_messages field containing the chat history as a string. We then re-tokenized it in SkyRL, computed loss masks, and fed the result to the trainer via GeneratorOutput.

This causes re-tokenization issues and prevents us from doing fully async training, which requires logprobs for algorithmic correction; the logprobs will not match once re-tokenization drift occurs.
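To make the drift concrete, here is a minimal toy sketch (a hypothetical greedy longest-match tokenizer, not Harbor's or SkyRL's actual one) showing how detokenizing and re-encoding can produce different token ids than the model originally generated:

```python
# Toy greedy tokenizer: decode(encode(text)) round-trips the *text*,
# but re-encoding a detokenized string need not reproduce the *token ids*.
VOCAB = {"ab": 0, "a": 1, "b": 2}
INV = {v: k for k, v in VOCAB.items()}

def encode(text):
    # Greedy longest-match tokenization.
    ids, i = [], 0
    while i < len(text):
        for piece in sorted(VOCAB, key=len, reverse=True):
            if text.startswith(piece, i):
                ids.append(VOCAB[piece])
                i += len(piece)
                break
    return ids

def decode(ids):
    return "".join(INV[t] for t in ids)

generated = [1, 2]        # the model emitted "a" then "b" as separate tokens
text = decode(generated)  # "ab"
retok = encode(text)      # greedy match picks "ab" -> [0]
print(generated, retok)   # [1, 2] vs [0]: per-token logprobs no longer align
```

Since the logprobs were recorded per generated token id, any mismatch between `generated` and `retok` makes them unusable after re-tokenization.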

This PR makes HarborGenerator perform step-wise training.

We rely on setting collect_rollout_details: true (already done in harbor_trial_config/default.yaml). Harbor then does per-turn bookkeeping: for each LLM invocation (i.e. turn), it records prompt_token_ids, completion_token_ids, and logprobs.
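A rough sketch of the shape of those per-turn details (field names per the PR description; the container type and invariant check are assumptions for illustration, not Harbor's actual data model):

```python
# Hypothetical per-turn rollout record, assuming the three fields named in
# the PR description; the dataclass itself is illustrative.
from dataclasses import dataclass

@dataclass
class TurnRolloutDetails:
    prompt_token_ids: list[int]      # full prompt fed to the LLM this turn
    completion_token_ids: list[int]  # tokens the LLM generated this turn
    logprobs: list[float]            # one logprob per completion token

def check(turn: TurnRolloutDetails) -> None:
    # Each completion token should carry exactly one sampled logprob.
    assert len(turn.completion_token_ids) == len(turn.logprobs)

turn = TurnRolloutDetails([1, 2, 3], [4, 5], [-0.1, -0.7])
check(turn)
```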

Then, by setting the following configs in SkyRL, we can perform step-wise training while merging when possible:

  generator.step_wise_trajectories=true \
  generator.merge_stepwise_output=true \

We also add a fully async script: examples/train_integrations/harbor/run_codecontest_fully_async.sh

Curve comparison

https://wandb.ai/sky-posttraining-uc-berkeley/harbor/reports/PR1542-Harbor-step-wise-training-in-SkyRL--VmlldzoxNjYzNDU0Mg

  • Blue: this PR's sync training (with token_mean)
  • Pink: before this PR (with seq_mean_token_sum_norm, everything else the same)
  • Red: this PR fully async (with train_batch_size=mini_batch_size=16)

Besides, with merge_stepwise_output, we can shrink ~1000 sequences down to ~300, improving training efficiency. For the sync run, the fully merged count is 256 sequences (batch size 32 * 8); without merging, the count is roughly 256 * avg_num_turns. The fully async run behaves similarly, except with 128 sequences (batch size 16 * 8). Related PR: #1538. Note that merging works much better here than in PR #1538's test on Qwen2.5 for search-r1, which suffered heavily from retokenization drift on a format Qwen2.5 is not familiar with. For the cases where merging fails, see this report: https://gist.github.com/CharlieFRuan/b91cecfe891f9458c455b6f5e2f6af1d
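The merging idea above can be sketched as follows: fold a later turn into the previous one whenever its prompt extends the previous prompt + completion token-for-token (i.e. no drift). All names are illustrative, not SkyRL's actual API, and loss-mask bookkeeping is omitted for brevity:

```python
# Hedged sketch of merging consecutive step-wise turns into one training
# sequence when token ids are exactly consistent across turns.

def try_merge(turns):
    """turns: list of (prompt_token_ids, completion_token_ids) per turn."""
    merged = []
    for prompt, completion in turns:
        if merged:
            prev_prompt, prev_completion = merged[-1]
            prefix = prev_prompt + prev_completion
            if prompt[: len(prefix)] == prefix:
                # The later prompt contains the earlier sequence verbatim
                # (plus e.g. tool-output tokens), so keep a single sequence.
                merged[-1] = (prompt, completion)
                continue
        merged.append((prompt, completion))
    return merged

turns = [
    ([1, 2], [3, 4]),        # turn 1
    ([1, 2, 3, 4, 5], [6]),  # turn 2: prompt = turn-1 sequence + new tokens
    ([9, 9], [8]),           # a drifted turn that cannot be merged
]
print(len(try_merge(turns)))  # 3 turns collapse to 2 sequences
```

This is why the sequence count shrinks toward one sequence per trajectory when merging succeeds, and grows toward one sequence per turn when it fails.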


gemini-code-assist[bot], devin-ai-integration[bot]: earlier bot comments were marked as resolved.

@CharlieFRuan
Member Author

/gemini review

Contributor

@gemini-code-assist gemini-code-assist Bot left a comment


Code Review

This pull request implements step-wise training for Harbor integrations, transitioning from chat history extraction to per-turn rollout detail collection. It introduces a fully asynchronous training entry point, updates the HarborGenerator to handle step-wise outputs, and refines metric aggregation to preserve custom generator statistics. Review feedback highlights the need for more robust metric aggregation logic that avoids string-based heuristics and suggests handling multi-segment rollouts more flexibly to prevent potential crashes from hard assertions.

Comment thread skyrl/train/generators/utils.py
Comment thread examples/train_integrations/harbor/harbor_generator.py
@CharlieFRuan CharlieFRuan merged commit a90cd1d into main Apr 23, 2026
5 of 6 checks passed
@CharlieFRuan CharlieFRuan deleted the harbor-stepwise branch April 23, 2026 01:07