Merged
220 commits
cc12b4e
init async training pipline
ArronHZG Jul 24, 2025
eb79903
init async training pipline
ArronHZG Jul 24, 2025
0459298
update code
ArronHZG Jul 25, 2025
5c9dd6d
test message queue
ArronHZG Jul 25, 2025
3fd7020
main
ArronHZG Jul 30, 2025
2df1811
cpu mq
ArronHZG Jul 30, 2025
48e91a3
one_step_off_policy
ArronHZG Jul 30, 2025
07f2e62
md
ArronHZG Jul 30, 2025
502de26
rollouter
ArronHZG Jul 30, 2025
dbdfdbf
yaml
ArronHZG Jul 31, 2025
08c1ba1
trainer
ArronHZG Jul 31, 2025
289a4a5
message_queue
ArronHZG Jul 31, 2025
a89991c
train
ArronHZG Jul 31, 2025
a5ee455
train
ArronHZG Jul 31, 2025
33ed01f
refactor init worker
ArronHZG Aug 1, 2025
9e8b596
init worker
ArronHZG Aug 1, 2025
8d8b99d
add rollouter thread
ArronHZG Aug 4, 2025
ba8f1ce
lock
ArronHZG Aug 4, 2025
8e5edeb
test
ArronHZG Aug 4, 2025
941c3de
init models
ArronHZG Aug 5, 2025
274883a
gen data
ArronHZG Aug 5, 2025
f653a8e
gen data to queue
ArronHZG Aug 5, 2025
352066c
gen data to queue
ArronHZG Aug 5, 2025
5fac1d8
train get data
ArronHZG Aug 5, 2025
459aa71
put data to queue
ArronHZG Aug 6, 2025
c65b627
merge data proto item
ArronHZG Aug 6, 2025
bc6aedd
train one step
ArronHZG Aug 6, 2025
a8691b0
train mutil step
ArronHZG Aug 6, 2025
ee8914c
param_sync
ArronHZG Aug 8, 2025
75fe2af
ParameterSynchronizer
ArronHZG Aug 8, 2025
c819fe1
ParameterSynchronizer
ArronHZG Aug 8, 2025
50cb8df
stop train
ArronHZG Aug 8, 2025
d59b734
readme docs
ArronHZG Aug 8, 2025
6e5da71
refactor code
ArronHZG Aug 8, 2025
1cfebfe
english notes
ArronHZG Aug 12, 2025
5d108bf
english notes
ArronHZG Aug 12, 2025
796880e
update print
ArronHZG Aug 12, 2025
444c3d1
update message
ArronHZG Aug 12, 2025
bd75207
sync weight time
ArronHZG Aug 12, 2025
57b93b7
total batch to mini batch
ArronHZG Aug 13, 2025
aeb4056
StreamRL batch
ArronHZG Aug 13, 2025
6c9d615
stream rollout
ArronHZG Aug 13, 2025
0d7233f
async mq
ArronHZG Aug 14, 2025
a59b84f
fix ray train bug
ArronHZG Aug 14, 2025
191605b
async server
ArronHZG Aug 14, 2025
6ddb460
update shell
ArronHZG Aug 14, 2025
12edb90
stream rollout
ArronHZG Aug 14, 2025
efa6640
RolloutSample
ArronHZG Aug 14, 2025
966f58d
RolloutSample
ArronHZG Aug 15, 2025
28809b5
success rollout
ArronHZG Aug 15, 2025
1c06296
staleness_samples
ArronHZG Aug 15, 2025
0412861
assemble_batch_from_rollout_samples
ArronHZG Aug 15, 2025
936a672
assemble_batch_from_rollout_samples
ArronHZG Aug 15, 2025
7763c68
train success
ArronHZG Aug 15, 2025
d8212d9
refactor log
ArronHZG Aug 18, 2025
25740b2
stop system run
ArronHZG Aug 18, 2025
737a8ce
system run suceess trigger_parameter_sync_step
ArronHZG Aug 18, 2025
defd61f
system run suceess trigger_parameter_sync_step
ArronHZG Aug 18, 2025
e86625b
All active tasks completed
ArronHZG Aug 18, 2025
ed4d572
pause submit task
ArronHZG Aug 18, 2025
bd99e16
steam rollout
ArronHZG Aug 18, 2025
c59055c
steam rollout
ArronHZG Aug 18, 2025
5f1302e
fully log
ArronHZG Aug 18, 2025
42789e8
fully async log
ArronHZG Aug 18, 2025
a1c0f5c
ruff format
ArronHZG Aug 18, 2025
41cc0f3
Merge pull request #3 from meituan-search/recipe/async_policy_server
ArronHZG Aug 18, 2025
26b55d9
update log
ArronHZG Aug 19, 2025
749d4df
partial rollout
ArronHZG Aug 20, 2025
f5364be
partial rollout cancel
ArronHZG Aug 20, 2025
f547a22
partial rollout cancel debug
ArronHZG Aug 20, 2025
a3e11f9
partial rollout cancel success
ArronHZG Aug 20, 2025
e991d67
Merge pull request #4 from meituan-search/recipe/async_policy_partial
ArronHZG Aug 21, 2025
1fdd90d
partial rollout cancel debug
ArronHZG Aug 21, 2025
ea00205
partial rollout banchmark time
ArronHZG Aug 21, 2025
eb67390
eval code
ArronHZG Aug 25, 2025
43883ae
fix FullyAsyncRollouter
ArronHZG Aug 25, 2025
b228265
group batch
ArronHZG Aug 25, 2025
167c58c
Merge pull request #6 from meituan-search/recipe/async_policy_group
meituan-search Aug 25, 2025
7a41993
merge main
ArronHZG Aug 25, 2025
7d65054
fix oom
ArronHZG Aug 27, 2025
57076bc
fix validation bug
sl-1314 Aug 28, 2025
74f569d
Merge pull request #7 from meituan-search/recipe/async_policy_refacto…
ArronHZG Aug 28, 2025
e59cca4
Merge pull request #5 from meituan-search/recipe/async_policy_refactor
ArronHZG Aug 28, 2025
a7133c9
fsdp2 8 8
ArronHZG Aug 28, 2025
fb7e65d
Merge branch 'recipe/async_policy_refactor_dev' into recipe/async_policy
ArronHZG Aug 28, 2025
d3216d2
fsdp2_8_8
ArronHZG Aug 28, 2025
c33e40e
megatron colocate
ArronHZG Aug 28, 2025
d391a06
rollout log probs
ArronHZG Aug 29, 2025
f27b916
tensorboard
ArronHZG Aug 29, 2025
2f89713
update metrics
ArronHZG Aug 31, 2025
1c3b32b
update metrics
ArronHZG Aug 31, 2025
1bea47c
rollout log probs
ArronHZG Aug 29, 2025
fdd8af0
batch.meta_info.items()
ArronHZG Aug 31, 2025
9444d19
total wait time
ArronHZG Aug 31, 2025
455ff15
Merge remote-tracking branch 'refs/remotes/origin/recipe/async_policy…
ArronHZG Aug 31, 2025
0065542
Merge pull request #9 from meituan-search/recipe/async_policy_log_prob
ArronHZG Sep 1, 2025
69c2427
from .detach_sharding_manager import DetachShardingManager
ArronHZG Sep 1, 2025
237d766
fix validate frequent bug & add final validate
sl-1314 Sep 1, 2025
66cc990
remove unnecessary code, fix validate logic
sl-1314 Sep 2, 2025
b405c66
8_8
Sep 2, 2025
5919f05
fix trainer and rollouter validation asynchrony
sl-1314 Sep 2, 2025
cfa3249
TENSORBOARD_DIR
Sep 2, 2025
5444819
simple implementation of Metrics Aggregator
sl-1314 Sep 2, 2025
09b0e13
Merge branch 'recipe/async_policy' into recipe/fully_async_fix_0
sl-1314 Sep 3, 2025
d393d5c
fix final param_sync wait
sl-1314 Sep 3, 2025
362c3f9
free kv cache by calling sleep&wake_up
sl-1314 Sep 3, 2025
5f69816
Merge pull request #11 from meituan-search/recipe/fully_async_fix_0
meituan-search Sep 4, 2025
53bfad2
reset one step
Sep 5, 2025
aa57cd4
fix some metrics aggregate
sl-1314 Sep 5, 2025
570eb3b
temporarily fix log_prob
sl-1314 Sep 5, 2025
5a85685
add exp folder
sl-1314 Sep 5, 2025
9199f56
exp shell files qwen3-32B_32 megatron colocate
sl-1314 Sep 5, 2025
363d12d
exp shell file colocate done
sl-1314 Sep 5, 2025
b058605
megatron fix
ArronHZG Sep 5, 2025
0be5500
rm DocQA
ArronHZG Sep 5, 2025
7f837ac
update 7b 128
ArronHZG Sep 6, 2025
3c239be
fix typo in use_rollout_log_probs
sl-1314 Sep 7, 2025
9cbce52
remove unused code
sl-1314 Sep 7, 2025
5539e03
add exp fully_async 32, 64
sl-1314 Sep 8, 2025
07ae4a0
add empty_cache after sync_rollout_weights
sl-1314 Sep 8, 2025
34cf9e7
add exp fully_async 128 64-64
sl-1314 Sep 8, 2025
a7c0655
fix max_concurrent_samples, fix progress_bar
sl-1314 Sep 8, 2025
0e1f2d7
change max_concurrent_samples num & change some exp
sl-1314 Sep 8, 2025
6c557d6
remove unused code, add stale 0.1 exp
sl-1314 Sep 9, 2025
e0c14cc
Merge pull request #14 from meituan-search/recipe/fully_async_fix_1
sl-1314 Sep 9, 2025
9fcc63e
Merge branch 'recipe/async_policy' into recipe/async_policy_megatron
Sep 9, 2025
15b53c8
reset one step
Sep 9, 2025
5249bcd
unchange protobuf
Sep 9, 2025
174e762
move shell
ArronHZG Sep 9, 2025
085f367
rm agent_loop
ArronHZG Sep 9, 2025
fa9e103
refactor agent_loop
ArronHZG Sep 9, 2025
7bd4859
refactor vllm async
ArronHZG Sep 9, 2025
ec3f0c5
refactor logs
ArronHZG Sep 9, 2025
547d68f
qwen3 A3B
ArronHZG Sep 10, 2025
1fc52bb
staleness_threshold=0.1
ArronHZG Sep 10, 2025
9280c11
staleness_threshold=0.1
ArronHZG Sep 10, 2025
d471890
fix last_valid bug, fix staleness_samples reset
sl-1314 Sep 10, 2025
840cc73
fix wait_last
sl-1314 Sep 10, 2025
5fb806d
Merge pull request #15 from meituan-search/recipe/fully_async_fix_2
ArronHZG Sep 10, 2025
fd48a9a
qwen2.5 32B
ArronHZG Sep 11, 2025
8f445b0
fsdp2_fully-async_16-16
ArronHZG Sep 11, 2025
f85c138
update note in config, param_sync, mq
sl-1314 Sep 12, 2025
56f853b
qwen2-32B
ArronHZG Sep 13, 2025
9f53cd7
update 32 workers
ArronHZG Sep 15, 2025
d0a5142
extract modified files in verl/
sl-1314 Sep 15, 2025
c206660
restore modified files in verl folder
sl-1314 Sep 15, 2025
b093270
Merge branch 'recipe/async_policy' into recipe/fully_async_refactor
sl-1314 Sep 15, 2025
6cf1da1
ruff format
sl-1314 Sep 15, 2025
aa370b4
add anomaly detection and exit
sl-1314 Sep 15, 2025
0a2763d
qwen3-32b-96-32
ArronHZG Sep 16, 2025
12eb273
Merge pull request #16 from meituan-search/recipe/fully_async_refactor
sl-1314 Sep 16, 2025
dd534e0
add rollouter&trainer idle time
sl-1314 Sep 16, 2025
67de99f
refactor code rm megatron code
ArronHZG Sep 16, 2025
a66c4cf
set required_samples=ppo_mini_bs & set max_concurrent_samples=rollout…
sl-1314 Sep 16, 2025
0ae200e
rm code
ArronHZG Sep 16, 2025
cf58f10
Merge branch 'main' into recipe/async_policy
ArronHZG Sep 16, 2025
9cfacc2
refactor 1
ArronHZG Sep 16, 2025
073e40f
cleaned up the fully_async metric, fix processing_time, add partial m…
sl-1314 Sep 17, 2025
7ce1a76
Merge branch 'recipe/async_policy' into recipe/fully_async_refactor_1
sl-1314 Sep 17, 2025
f029e30
refactor 2
ArronHZG Sep 17, 2025
94d681d
qwen3-32b-64-64
ArronHZG Sep 17, 2025
a382f9a
add param_sync time log
sl-1314 Sep 17, 2025
e6d51d3
fix typo
sl-1314 Sep 17, 2025
d759cfe
fix typo
sl-1314 Sep 17, 2025
ce944d8
Merge pull request #19 from meituan-search/recipe/fully_async_refactor_1
ArronHZG Sep 17, 2025
c8db507
refactor 3
ArronHZG Sep 17, 2025
e49f6b6
Merge branch 'recipe/async_policy' of https://github.com/meituan-sear…
ArronHZG Sep 17, 2025
3898c5f
translate
sl-1314 Sep 17, 2025
106f5eb
fix typo
sl-1314 Sep 17, 2025
611df39
refactor 4
ArronHZG Sep 17, 2025
c2219e0
qwen3-32B-sta0
ArronHZG Sep 18, 2025
91d199c
refactor 5
ArronHZG Sep 18, 2025
0e88084
refactor 6
ArronHZG Sep 18, 2025
a48ec88
refactor 7
ArronHZG Sep 18, 2025
e6819cd
refactor 8
ArronHZG Sep 18, 2025
8f62a94
refactor 8
ArronHZG Sep 18, 2025
2684943
refactor 10
ArronHZG Sep 18, 2025
5942a4b
Merge branch 'recipe/async_policy' of https://github.com/meituan-sear…
ArronHZG Sep 18, 2025
ecba194
refactor 11
ArronHZG Sep 18, 2025
cf570c8
Merge pull request #18 from meituan-search/async_policy_merge_main
ArronHZG Sep 18, 2025
5adde90
update shel
ArronHZG Sep 18, 2025
6638100
Merge branch 'recipe/fully_async_refactor_2' into recipe/fully_async_…
sl-1314 Sep 19, 2025
3155b44
fix notation
sl-1314 Sep 19, 2025
96e3136
Merge pull request #22 from meituan-search/recipe/fully_async_refactor_3
ArronHZG Sep 19, 2025
c39f283
rm print
ArronHZG Sep 19, 2025
e511694
fix log prob in hybird&streaming mode
sl-1314 Sep 28, 2025
41cea0f
fix stale_samples_processed and stale_trajectory_processed metrics
sl-1314 Sep 29, 2025
97615b4
add require_batches config param
sl-1314 Oct 11, 2025
106c933
Merge pull request #24 from meituan-search/recipe/fully_async_fix_3
sl-1314 Oct 11, 2025
211a441
fix staleness_samples reset bug
sl-1314 Oct 13, 2025
f7a8e96
del debug code
sl-1314 Oct 14, 2025
3de3ed0
add README_zh.md
Oct 14, 2025
f658643
update README_zh.md
ArronHZG Oct 14, 2025
1a3759e
update README_zh.md
ArronHZG Oct 14, 2025
94ed108
Merge pull request #26 from meituan-search/recipe/fully_async_fix_4
sl-1314 Oct 14, 2025
ed73079
update README
ArronHZG Oct 14, 2025
b0d21f2
Merge branch 'recipe/async_policy' of https://github.com/meituan-sear…
ArronHZG Oct 14, 2025
b528c41
merge main
ArronHZG Oct 14, 2025
dbf19e3
Merge pull request #27 from meituan-search/async_policy_merge_main_v8
ArronHZG Oct 14, 2025
ad595f7
add ci
ArronHZG Oct 14, 2025
1f51b0d
update readme
ArronHZG Oct 14, 2025
ead757a
fix ci
Oct 16, 2025
2383a15
fix some ci
ArronHZG Oct 16, 2025
7298b65
fix e2e_fully_async_policy_fsdp2
ArronHZG Oct 16, 2025
0730b75
update readme exp
sl-1314 Oct 16, 2025
89485fe
Merge branch 'recipe/async_policy' of https://github.com/meituan-sear…
sl-1314 Oct 16, 2025
c4a0633
update readme
ArronHZG Oct 16, 2025
de05510
update readme
ArronHZG Oct 16, 2025
8153ad2
Merge branch 'main' into async_policy_merge_main_v9
ArronHZG Oct 16, 2025
a60ef90
Merge pull request #28 from meituan-search/async_policy_merge_main_v9
ArronHZG Oct 16, 2025
4e122bf
update shell script
sl-1314 Oct 17, 2025
90d76f2
Merge branch 'main' into async_policy_merge_main_v10
ArronHZG Oct 17, 2025
26feea0
Merge pull request #29 from meituan-search/async_policy_merge_main_v10
ArronHZG Oct 17, 2025
7cae5d5
update readme
ArronHZG Oct 17, 2025
62fb0d0
trigger ci
ArronHZG Oct 17, 2025
fbae66a
trigger ci
ArronHZG Oct 17, 2025
0565a55
trigger ci
ArronHZG Oct 17, 2025
dda6c5d
trigger ci
ArronHZG Oct 17, 2025
423e14c
Merge branch 'main' into async_policy_merge_main_v11
ArronHZG Oct 17, 2025
17b9e5b
Merge pull request #30 from meituan-search/async_policy_merge_main_v11
ArronHZG Oct 17, 2025
149 changes: 149 additions & 0 deletions .github/workflows/e2e_fully_async_policy.yml
@@ -0,0 +1,149 @@
# # Tests layout

# Each folder under tests/ corresponds to a test category for a sub-namespace in verl. For instance:
# - `tests/trainer` for testing functionality related to `verl/trainer`
# - `tests/models` for testing functionality related to `verl/models`
# - ...

# There are a few folders with `special_` prefix, created for special purposes:
# - `special_distributed`: unit tests that must run with multiple GPUs
# - `special_e2e`: end-to-end tests with training/generation scripts
# - `special_npu`: tests for NPUs
# - `special_sanity`: a suite of quick sanity tests
# - `special_standalone`: a set of tests that are designed to run in dedicated environments

# Accelerators for tests
# - By default, tests run with a GPU available, except for those under `special_npu` and any test script whose name ends with `on_cpu.py`.
# - Test scripts with the `on_cpu.py` name suffix are run on CPU resources in a Linux environment.

# # Workflow layout

# All CI tests are configured by yaml files in `.github/workflows/`. Here's an overview of all test configs:
# 1. A list of always triggered CPU sanity tests: `check-pr-title.yml`, `secrets_scan.yml`, `pre-commit.yml`, `doc.yml`
# 2. Some heavy multi-GPU unit tests, such as `model.yml`, `vllm.yml`, `sgl.yml`
# 3. End-to-end tests: `e2e_*.yml`
# 4. Unit tests
# - `cpu_unit_tests.yml`, run pytest on all scripts with file name pattern `tests/**/test_*_on_cpu.py`
# - `gpu_unit_tests.yml`, run pytest on all test scripts whose file name lacks the `on_cpu.py` suffix.
# - Since the cpu/gpu unit tests by default run all tests under `tests`, please make sure tests are manually excluded from them when
#   - a new workflow yaml is added to `.github/workflows`
#   - new tests are added to a workflow mentioned in 2.


name: e2e_fully_async_policy

on:
  # Trigger the workflow on push or pull request,
  # but only for the main and v0.* branches.
  # For push, for now only anti-patterns are specified so it is more conservative
  # and achieves higher coverage.
  push:
    branches:
      - main
      - v0.*
    paths:
      - "**/*.py"
      - "!**/*.md"
      - "!**/*.sh"
      # Other entrypoints
      - "!examples/*trainer*"
      - "!tests/**"
      - "!verl/trainer/main_*.py"
      - "!verl/trainer/fsdp_sft_trainer.py"
      - "!recipe/**"
      - "recipe/fully_async_policy"
  pull_request:
    branches:
      - main
      - v0.*
    paths:
      - "**/*.py"
      - "!**/*.md"
      - "!**/*.sh"
      # Other entrypoints
      - "!examples/**"
      - "!tests/**"
      - "!verl/trainer/main_*.py"
      - "!verl/trainer/fsdp_sft_trainer.py"
      # Other recipes
      - "!recipe/**"
      # Home
      - "recipe/fully_async_policy"
      # Entrypoints
      - ".github/workflows/e2e_fully_async_policy.yml"
      - "examples/data_preprocess/gsm8k.py"
      - "tests/special_e2e/run_fully_async_policy.sh"

# Cancel jobs on the same ref if a new one is triggered
concurrency:
  group: ${{ github.workflow }}-${{ github.ref }}
  cancel-in-progress: ${{ github.ref != 'refs/heads/main' }}

# Declare read-only permission for repository contents.
permissions:
  contents: read

env:
  IMAGE: "verl-ci-cn-beijing.cr.volces.com/verlai/verl:app-verl0.5-transformers4.55.4-vllm0.10.0-mcore0.13.0-te2.2"
  DYNAMIC_RUNNER_ENDPOINT: "https://sd10g3clalm04ug7alq90.apigateway-cn-beijing.volceapi.com/runner"
  TRANSFORMERS_VERSION: "4.56.2"

jobs:
  setup:
    if: github.repository_owner == 'volcengine'
    runs-on: ubuntu-latest
    outputs:
      runner-label: ${{ steps.create-runner.outputs.runner-label }}
      mlp-task-id: ${{ steps.create-runner.outputs.mlp-task-id }}
    steps:
      - uses: actions/checkout@v4
      - id: create-runner
        uses: volcengine/vemlp-github-runner@v1
        with:
          mode: "create"
          faas-url: "${{ env.DYNAMIC_RUNNER_ENDPOINT }}"
          mlp-image: "${{ env.IMAGE }}"

  # Test FSDP2 strategy
  e2e_fully_async_policy_fsdp2:
    needs: setup
    runs-on: [ "${{ needs.setup.outputs.runner-label || 'L20x8' }}" ]
    timeout-minutes: 10 # Increase timeout for async training
    env:
      HTTP_PROXY: ${{ secrets.PROXY_HTTP }}
      HTTPS_PROXY: ${{ secrets.PROXY_HTTPS }}
      NO_PROXY: "localhost,127.0.0.1,hf-mirror.com"
      HF_ENDPOINT: "https://hf-mirror.com"
      HF_HUB_ENABLE_HF_TRANSFER: "0" # This is more stable
      ACTOR_STRATEGY: "fsdp2"
    steps:
      - uses: actions/checkout@11bd71901bbe5b1630ceea73d27597364c9af683 # v4.2.2
        with:
          fetch-depth: 0
      - name: Install the current repository
        run: |
          pip3 install --no-deps -e .[test,gpu]
          pip3 install transformers==$TRANSFORMERS_VERSION
      - name: Prepare GSM8K dataset
        run: |
          python3 examples/data_preprocess/gsm8k.py --local_dataset_path ${HOME}/models/hf_data/gsm8k
      - name: Running the E2E test with fully_async_policy algorithm (FSDP2)
        run: |
          ray stop --force
          bash tests/special_e2e/run_fully_async_policy.sh

  cleanup:
    runs-on: ubuntu-latest
    needs:
      [
        setup,
        e2e_fully_async_policy_fsdp2
      ]
    if: always()
    steps:
      - id: destroy-runner
        uses: volcengine/vemlp-github-runner@v1
        with:
          mode: "destroy"
          faas-url: "${{ env.DYNAMIC_RUNNER_ENDPOINT }}"
          mlp-task-id: "${{ needs.setup.outputs.mlp-task-id }}"
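
Review note on the `paths` filters in this workflow: they mix positive and negated (`!`) patterns, and the order matters because a later matching pattern overrides an earlier one. The Python sketch below is only an approximation for reasoning about which changed files would trigger the `pull_request` run. It is not GitHub's matcher: `glob_to_regex`, `path_included`, and `workflow_triggered` are hypothetical helpers, and the glob translation assumes only `*` (no `/`) and `**` (any characters) need handling, so edge cases may behave differently on GitHub.

```python
import re

# Patterns copied from the pull_request.paths list of this workflow.
PULL_REQUEST_PATHS = [
    "**/*.py",
    "!**/*.md",
    "!**/*.sh",
    "!examples/**",
    "!tests/**",
    "!verl/trainer/main_*.py",
    "!verl/trainer/fsdp_sft_trainer.py",
    "!recipe/**",
    "recipe/fully_async_policy",
    ".github/workflows/e2e_fully_async_policy.yml",
    "examples/data_preprocess/gsm8k.py",
    "tests/special_e2e/run_fully_async_policy.sh",
]


def glob_to_regex(pattern: str) -> re.Pattern:
    """Translate a simplified glob into a regex (assumption: only * and **)."""
    parts = []
    i = 0
    while i < len(pattern):
        if pattern.startswith("**/", i):
            parts.append("(?:.*/)?")  # zero or more leading directories
            i += 3
        elif pattern.startswith("**", i):
            parts.append(".*")  # any characters, including "/"
            i += 2
        elif pattern[i] == "*":
            parts.append("[^/]*")  # any characters except "/"
            i += 1
        else:
            parts.append(re.escape(pattern[i]))
            i += 1
    return re.compile("".join(parts))


def path_included(path: str, patterns: list[str]) -> bool:
    """Walk the patterns in order; the last one that matches decides."""
    included = False
    for pattern in patterns:
        negated = pattern.startswith("!")
        glob = pattern[1:] if negated else pattern
        if glob_to_regex(glob).fullmatch(path):
            included = not negated
    return included


def workflow_triggered(changed_files: list[str], patterns: list[str]) -> bool:
    """The workflow runs if at least one changed file survives the filter."""
    return any(path_included(path, patterns) for path in changed_files)


if __name__ == "__main__":
    for changed in (
        "verl/workers/fsdp_workers.py",
        "verl/trainer/main_ppo.py",
        "tests/special_e2e/run_fully_async_policy.sh",
        "README.md",
    ):
        print(changed, path_included(changed, PULL_REQUEST_PATHS))
```

Under these assumptions, the shell script in `tests/special_e2e/` is excluded by `!**/*.sh` and `!tests/**` and then re-included by its explicit entry at the end of the list, which is why the entrypoint patterns have to come after the negations.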
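
Review note on the `concurrency` block of `e2e_fully_async_policy.yml`: the group name is built from the workflow name and the ref, and in-progress runs are cancelled for every ref except `refs/heads/main`. A minimal Python sketch of those two expressions, assuming a hypothetical helper `concurrency_settings` and example ref values chosen for illustration:

```python
def concurrency_settings(workflow: str, ref: str) -> dict:
    """Mirror the two expressions in the workflow's `concurrency` block."""
    return {
        # group: ${{ github.workflow }}-${{ github.ref }}
        "group": f"{workflow}-{ref}",
        # cancel-in-progress: ${{ github.ref != 'refs/heads/main' }}
        "cancel_in_progress": ref != "refs/heads/main",
    }


if __name__ == "__main__":
    # Example refs: a pull request merge ref and the main branch.
    for ref in ("refs/pull/123/merge", "refs/heads/main"):
        print(concurrency_settings("e2e_fully_async_policy", ref))
```

The effect is that a new push to a pull request cancels the older run for that same pull request, while runs on `main` are left to finish.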