Open
309 commits
33ca684
fix: expand ray port range from 54001 ~ 54257 to 54001 ~ 54513 (#950)
yuki-97 Aug 21, 2025
f8fdb5c
fix: fix async vllm nccl fail on dsv3 tp16pp2 and non-colocated on si…
yuki-97 Aug 21, 2025
bc24887
feat: fp8 block scaling (#543)
jiemingz Aug 21, 2025
ac7469f
test: Add Megatron tests (#713)
ashors1 Aug 22, 2025
add4efb
feat: GSPO (#859)
pjin-nvidia Aug 23, 2025
c86bccb
revert: "feat: GSPO" (#973)
terrykong Aug 24, 2025
e861b94
feat: Create DTensorPolicyWorkerV2 to integrate nemo-automodel apis (…
ffrujeri Aug 24, 2025
200b4e8
ci: Clean-up docker system before test (#974)
chtruong814 Aug 25, 2025
071ebfc
build: Use no-build-isolation to install deep_gemm to fix arm install…
chtruong814 Aug 25, 2025
faad021
fix: Automodel integration - remove nvfsdp from uv lock. (#980)
ffrujeri Aug 25, 2025
8d8f365
fix: Update Automodel integration check logic and message. (#981)
ffrujeri Aug 26, 2025
65a7965
chore: ray-sub - improve robustness (#968)
skirdey-inflection Aug 26, 2025
989f177
fix: memory optimizations for Nemotron12B 12k seqlen DPO training (#926)
ybgao-nvidia Aug 26, 2025
2aea5ad
fix: fix temperature-related issues (#935)
zhandaz Aug 26, 2025
97b6c31
feat: GSPO (w/ CI fixes) (#976)
pjin-nvidia Aug 27, 2025
8e0daa5
test: introduce "run_first" marker to fail on config changes early (#…
terrykong Aug 27, 2025
d168de3
fix: ulimit set in ray.sub (#989)
bogdansalyp Aug 27, 2025
a84f3b4
test: add non-colocated nightly test (#960)
yuki-97 Aug 28, 2025
4923403
feat: add vllm enable_expert_parallel (#997)
yuki-97 Aug 28, 2025
e7a09b8
ci: Update community bot to add issues to shared project (#931)
chtruong814 Aug 28, 2025
57e43bb
fix: [mcore] only take optimizer steps when in train mode (#1012)
ashors1 Aug 28, 2025
157b254
fix: remove unused fp8 training args in config (#1018)
ashors1 Aug 29, 2025
9301d36
fix: ulimit in ray.sub respect hard limit (#1011)
terrykong Aug 29, 2025
c4fd5d3
feat: Migration from NeMo Tron to Megatron Bridge (#905)
yaoyu-33 Aug 30, 2025
cbd4b93
feat: preference datasets (#673)
jveronvialard Aug 30, 2025
200568e
feat: Fix nsight profiling file sync for multi-node jobs (#1001)
guyueh1 Aug 31, 2025
0358a86
feat: Overlong filtering for GRPO (#724)
jubick1337 Sep 1, 2025
7c5efd8
chore: flush to stdout when print logging during GRPO (#1021)
pjin-nvidia Sep 2, 2025
4821ef8
ci: Add healthcheck for Github runners to run on a schedule (#1030)
chtruong814 Sep 2, 2025
f182460
fix: make layernorm_epsilon configurable in with megatron backend (#1…
ashors1 Sep 3, 2025
13b4ca5
ci: Only run build-test-publish-wheel workflow if env var set (#1047)
chtruong814 Sep 3, 2025
acabc79
fix: ray.sub will exit early if any srun fails to launch (#1022)
terrykong Sep 3, 2025
2b55598
fix: address double bos in eval task (#962)
ZhiyuLi-Nvidia Sep 3, 2025
f17f331
feat: add testmon support to detect when tests need to be rerun (#1056)
terrykong Sep 4, 2025
76874a8
feat: Integrate vlm changes between DTensorPolicyWorker V1 and V2. (#…
ffrujeri Sep 5, 2025
191a160
fix: Correct strict loading megatron bridge config (#1055)
yfw Sep 5, 2025
ae89e12
fix: Reset parallelism configs to default after initial import (#1078)
yfw Sep 5, 2025
7675ae4
feat: Support Multi-epoch training in GRPO (#776)
ahmadki Sep 5, 2025
dc865ef
feat: support drop_last=False during validation (#1029)
ashors1 Sep 5, 2025
46f5edd
fix: optional clear cache between microbatch iterations (#1074)
ybgao-nvidia Sep 7, 2025
3da8221
fix: nightly CI tests (#1090)
terrykong Sep 7, 2025
5d2fd87
ci: Add checks for docs broken links (#1048)
chtruong814 Sep 7, 2025
266e718
fix: fix scheduler decay steps with megatron backend (#939)
ashors1 Sep 7, 2025
f0588dc
fix: Make `prepare_for_generation` metric names compatible with MLFlo…
nathan-az Sep 8, 2025
1c85276
fix: convergence issue by adding use_inductor=False in vllm compilati…
ZhiyuLi-Nvidia Sep 8, 2025
62112f6
fix: report the correct number of workers during FLOPs calculation (#…
ybgao-nvidia Sep 9, 2025
b060d1d
docs: update `grpo.md` (#1106)
xxman-google Sep 9, 2025
9397ef8
fix: `clear_cache_every_n_steps` variable name (#1109)
bxyu-nvidia Sep 10, 2025
915c79c
chore: add DeepEP dependencies (#1045)
yuki-97 Sep 10, 2025
16d9128
feat: Deepseek migration to Megatron-Bridge + CP support (#1059)
yfw Sep 11, 2025
46e2beb
fix: restore qwen3 support for FLOPs accounting (#1117)
ybgao-nvidia Sep 12, 2025
94a3d49
fix: stop jobs after timeout and add warning for validation (#1069)
wedu-nvidia Sep 12, 2025
52f1d6a
fix: fix eval config (#1123)
yuki-97 Sep 15, 2025
1cb5e0d
ci: Fix automodel and submodule check comments from a fork (#1028)
chtruong814 Sep 16, 2025
5a9f7ac
feat: Expose async vLLM engine as HTTP server (#1110)
bxyu-nvidia Sep 16, 2025
95b7326
ci: Remove test comment from automodel integration check (#1148)
chtruong814 Sep 18, 2025
eef47b8
chore: add coderabbit configuration and coding guidelines for coderab…
terrykong Sep 18, 2025
61e235c
docs: End-to-end timeline view with nsys (#1114)
youngeunkwon0405 Sep 18, 2025
affef0e
chore: introduce codeowners (#1133)
terrykong Sep 18, 2025
ee8f5aa
ci: Add merge queue retry if CI_TIMEOUT (#1111)
chtruong814 Sep 18, 2025
594e700
feat: support DP inside vLLM for EP (#1081)
yuki-97 Sep 18, 2025
c6be532
feat: Implement safetensors checkpointing format support using nemo-a…
ffrujeri Sep 18, 2025
e8106df
fix: crash when sequence packing is enabled for gemma 1b. (#809)
joyang-nv Sep 18, 2025
652ed92
refactor: refactor dataset module (#977)
yuki-97 Sep 18, 2025
7e6b786
fix: Convert relative path to a file in Mardown to its URL on GitHub.…
wangshangsam Sep 19, 2025
e433d40
chore: add deepep install instruction (#1136)
yuki-97 Sep 19, 2025
3a1ca3f
perf: Remove empty_cache for performance optimization (#1071)
katec846 Sep 19, 2025
c6e6f70
feat: add support for COMMAND= in ray.sub *-attach.sh scripts (#1167)
terrykong Sep 19, 2025
5355a3d
fix: cleaned up the instructions around installing cuDNN (#1105)
ahmadki Sep 20, 2025
cfaf5a8
feat: Support Reward Model based Environments (#1026)
RayenTian Sep 20, 2025
a4c30f9
ci: Ensure mcore and automodel are installed before checking if tests…
chtruong814 Sep 20, 2025
5faaea8
ci: Add check for PR branch being up to date (#1171)
chtruong814 Sep 20, 2025
ef60b33
ci: Run nightly Github tests (#1172)
chtruong814 Sep 22, 2025
2d3c43c
ci: Set HF_HUB_OFFLINE=1 during tests when PR is from a fork (#1174)
chtruong814 Sep 22, 2025
42aa41b
feat: add async RL support (#1098)
parthchadha Sep 22, 2025
cde2acd
perf: Add a field in SFT data config to modify num_workers for loadin…
katec846 Sep 22, 2025
64ee0d0
feat: support chat_template_kwargs in tokenizer config (#1165)
yuki-97 Sep 23, 2025
051c2f7
fix: Add check for world size and parallelism enabled (#1190)
parthchadha Sep 23, 2025
66099f5
fix: A fix in megatron YARN module for memory leak (#1163)
guyueh1 Sep 23, 2025
a9ff45c
chore: Delete .github/ISSUE_TEMPLATE directory (#1194)
pablo-garay Sep 23, 2025
63439ac
docs: guide for sliding puzzle example (#961)
slikhite-1 Sep 23, 2025
e22a340
docs: Restructure README with backend-specific quick start and setup …
euronymous-aithal Sep 24, 2025
38f0543
fix: Run crash on get_latest_checkpoint (#1168)
bogdansalyp Sep 24, 2025
a579137
fix: can't find transformers_modules error for moonlight (#1124)
joyang-nv Sep 25, 2025
7aa7071
chore: patch KL loss to prevent nans (#876)
rohitrango Sep 25, 2025
56a6225
feat: add support for nemotron-nas with custom plan. (#1180)
joyang-nv Sep 25, 2025
32faafa
feat: add config_cli.py and refactor configs + config pre-commit (#1024)
terrykong Sep 25, 2025
e60c4d9
fix: minimize llama-super grpo config (#1206)
terrykong Sep 25, 2025
79b7a87
feat: support swanlab logger (#923)
terrykong Sep 25, 2025
c01f9d7
ci: Add status badge and prevent merging if no tests ran (#1192)
chtruong814 Sep 25, 2025
6fe56b0
feat: Support passing in tool calls with OpenAI chat format when doin…
HeyyyyyyG Sep 26, 2025
9cc8c9f
chore: remove deprecated --dashboard-grpc-port from ray.sub (#1209)
terrykong Sep 26, 2025
f521459
feat: Update mbridge with cache support (#1187)
ZhiyuLi-Nvidia Sep 26, 2025
16e08cd
feat: FP8 Training in Megatron Path (#971)
guyueh1 Sep 26, 2025
2570489
test: add bisect-script.sh to help bisect CI tests (#1215)
terrykong Sep 27, 2025
1b96b45
fix: Reduce memory usage of gradient norm computation (#1138)
jseppanen Sep 27, 2025
4528931
fix: Handle missing prompts in math HF data processor and add regress…
zpqiu Sep 28, 2025
5166d74
fix: invalid time for fp8 grpo test 300 -> 240 minutes (#1220)
terrykong Sep 28, 2025
0dca729
fix: dpo mistral nightly needs more time (#1225)
terrykong Sep 29, 2025
ebfa9e2
fix: nightlies using v1 can't use model_save_format=safetensors (#1226)
terrykong Sep 29, 2025
629a82b
chore: Update cherry-pick workflow to use v0.63.0 (#1218)
pablo-garay Sep 29, 2025
b445a3a
fix: loosen sft-llama3.2-1b-1n8g-fsdp2tp1.v3.sh step time/loss check …
terrykong Sep 29, 2025
17ea9ab
feat: add on policy distillation algorithm (#1006)
zpqiu Sep 29, 2025
c2b36f2
fix: grpo-llama3.1-8b-instruct-1n8g-megatron-fp8-rollouts runs 40 ste…
terrykong Sep 30, 2025
bc1a027
feat: Adding perf metrics (#1183)
youngeunkwon0405 Sep 30, 2025
8003918
docs: async doc update for importance sampling correction (#1222)
parthchadha Sep 30, 2025
b50bfca
feat: VLM support via megatron backend (#1115)
yfw Oct 1, 2025
cc8a93e
fix: Fix OOM in validation during colocated training (#1159)
jseppanen Oct 1, 2025
f7645f3
feat: Update Theoretical TFLOPS (#1236)
youngeunkwon0405 Oct 1, 2025
d82ca75
fix: fix checkpointing when `val_period` does not divide `save_period…
ashors1 Oct 1, 2025
0ad4722
fix: lower steps in smolvlm nightly test (#1239)
terrykong Oct 2, 2025
2df9ea5
fix: Fix gradient clipping of non-float32 params (#1158)
jseppanen Oct 2, 2025
43928aa
fix: Release gradient memory after policy training (#1147)
jseppanen Oct 2, 2025
d653437
fix: gitignore only the top level datasets directory (#1252)
terrykong Oct 2, 2025
376e625
fix: fp8 rollout nightly fix check from step 100 to 40 (#1233)
terrykong Oct 2, 2025
557b7ec
fix: moonlight CI test mem regression (increase cache flush) (#1257)
terrykong Oct 2, 2025
f1bfeb6
docs: add missing async_grpo.enabled flag to configuration (#1237)
youngeunkwon0405 Oct 3, 2025
9a909cc
docs: Update v0.3.0 announcement link (#1269)
chtruong814 Oct 3, 2025
d0e203c
feat: add valid_tokens_per_sec metric and total_valid_tokens to save …
terrykong Oct 3, 2025
38125c2
fix: remove noisy qwen2 vl nightly test loss check (#1272)
terrykong Oct 5, 2025
c68b4c2
fix: make sft dynamic batch step time check more stable (#1265)
terrykong Oct 5, 2025
bf5445c
fix: qwen30 config had typo in metric check (#1266)
terrykong Oct 5, 2025
7aabb81
fix: colocated.resources.gpus_per_node is now required for colocated …
terrykong Oct 5, 2025
7536be3
feat: more numerically stable qwen custom plan (#1235)
terrykong Oct 6, 2025
f3f2743
chore: 0.4.0.rc0 -> 0.5.0.rc0 (#1284)
terrykong Oct 6, 2025
119b2ff
chore: 0.4.0.rc0 -> 0.4.0 (#1285)
terrykong Oct 6, 2025
5efbe4f
chore: Revert "chore: 0.4.0.rc0 -> 0.4.0 (#1285)" (#1296)
chtruong814 Oct 7, 2025
1f979e0
build: Fix ngc pytorch build with deep-ep (#1234)
chtruong814 Oct 7, 2025
00cb570
fix: parallel state initialization error in Megatron to HF model conv…
skirdey-inflection Oct 7, 2025
806e285
fix: deepscaler-24k test reduce to 10 steps to safely finish in 4 hr …
terrykong Oct 7, 2025
9a1e3df
fix: qwen32 nightly metric check more stable (#1271)
terrykong Oct 8, 2025
3fccb63
feat: Add deepseek flops tracker (original #1250) (#1305)
guyueh1 Oct 8, 2025
57046a4
fix: enhancing non-colocated refit performance by having inclusive co…
youngeunkwon0405 Oct 8, 2025
d726c38
fix: Reinitialize model parallel after import (#1317)
ybgao-nvidia Oct 9, 2025
52cd68d
feat: Using mcore cpu optimizer (#1242)
guyueh1 Oct 9, 2025
1ed45b7
docs: Hardcode docs github url (#1328)
chtruong814 Oct 10, 2025
d975e39
perf: Add a field in megatron_cfg to enable bias_activation_fusion (#…
katec846 Oct 10, 2025
7c574d0
fix: fix github to myst-parser admonition conversion (#1224)
terrykong Oct 10, 2025
9d44598
fix: Async GRPO loop crash logging (#1321)
bogdansalyp Oct 10, 2025
f29fa2a
feat: Add Penguin stub (#1325)
bxyu-nvidia Oct 10, 2025
b19d669
docs: Add news items for FP8 Quantization, MoE optimization, and NeMo…
snowmanwwg Oct 10, 2025
a777f2a
feat: tensor packing and batching for non-colocated refit performance…
youngeunkwon0405 Oct 12, 2025
53129d4
test: disable dpo mistral nightly until transformers upgrades past 4.…
terrykong Oct 12, 2025
6d1d711
ci: add a descriptive error message for the no-test state (#1318)
terrykong Oct 12, 2025
eb5bb0f
fix: Fix checkpoint conversion error for qwen 30b-a3b (#1335)
yfw Oct 13, 2025
4db1704
feat: Add Penguin env (#1327)
bxyu-nvidia Oct 13, 2025
355aa98
perf: Add num_workers in DPO, GRPO and SFT for loading data (#1314)
katec846 Oct 14, 2025
15a0343
chore: add chat_template_kwargs in default train configs (#1353)
yuki-97 Oct 15, 2025
5c67023
fix: Replace decode-based prefix matching with EOS-boundary splicing …
parthchadha Oct 15, 2025
0a769cc
fix: grpo early exit edge case (#1361)
terrykong Oct 15, 2025
8f6e00e
fix: Megatron worker to have locked dependencies (#1315)
bogdansalyp Oct 16, 2025
96656c3
fix: Fix the logger error in non-colocated sync-grpo code path (#1355)
youngeunkwon0405 Oct 16, 2025
638bc52
test: Update on-policy distillation release tests (#1363)
zpqiu Oct 16, 2025
9da0317
fix: update the custom vllm instructions (#1116)
terrykong Oct 16, 2025
dee3fd9
fix: Fix non-colocated refit when vLLM model parallel size is larger …
youngeunkwon0405 Oct 16, 2025
7bd853a
feat: Support DAPO dynamic sampling and reward shaping (#602)
peri044 Oct 17, 2025
905a224
fix: fix mcore train_iters in grpo (#1383)
yuki-97 Oct 17, 2025
85eeb8d
fix: more robust fp8 rollout metric check (#1307)
terrykong Oct 17, 2025
1b3c12d
docs: update latest news list (#1390)
euronymous-aithal Oct 18, 2025
9a22e2c
docs: Update README to include NVIDIA NeMo Framework link (#1392)
snowmanwwg Oct 20, 2025
73c8725
feat: add Megatron support for on-policy distillation (#1324)
zpqiu Oct 21, 2025
f286857
feat: Add debugger flag that can be turned on via RAY_DEBUG=legacy (#…
guyueh1 Oct 21, 2025
f2de476
feat: support truncated importance sampling (#1348)
yuki-97 Oct 21, 2025
3a69c21
feat: refit refactoring with zmq and overlapping (#1267)
ZhiyuLi-Nvidia Oct 22, 2025
d843f02
fix: Fix policy worker placement when using unified placement group (…
guyueh1 Oct 23, 2025
73e0c09
feat: Overlap param iteration and broadcast in non-colocated refit (#…
youngeunkwon0405 Oct 23, 2025
e762237
docs: Add repo overview diagram (#1403)
snowmanwwg Oct 23, 2025
e9bd6fd
docs: Update README.md to say NeMo RL (#1424)
Sylendran95 Oct 24, 2025
79269af
fix: append to hf_overrides rather than overwriting (#1413)
ashors1 Oct 25, 2025
3f36d14
chore: major version bump (torch 2.8, vllm 0.11, ray 2.49) & SP fixes…
terrykong Oct 26, 2025
9475e7b
fix: Fix grad norm metric in mcore path (#1426)
yfw Oct 27, 2025
b3aac89
fix: Adding mean total tokens per sample to the output log (#1406)
youngeunkwon0405 Oct 28, 2025
4db0db2
feat: additional kl metrics (#1420)
ZhiyuLi-Nvidia Oct 28, 2025
ceb25fc
chore: use pydantic for yaml test validation (#1382)
terrykong Oct 29, 2025
7b32363
fix: support arbitrary values for `checkpointing.metric_name` (#1291)
ashors1 Oct 30, 2025
bd2e645
fix: fix log step (#1447)
yuki-97 Oct 30, 2025
90fb0a8
fix: Fixes to make Megatron backend match dtensor (#1389)
ashors1 Oct 31, 2025
855151b
chore: improve ray.sub generalization across clusters (#1451)
terrykong Oct 31, 2025
938761a
feat: Add DAPO dataset and Deepseek-v3 config (#1281)
yfw Nov 1, 2025
1f69bb0
feat: add capability to update weights inflight during generation (#1…
parthchadha Nov 3, 2025
1982f6e
fix: nsys multi-report view image from docs.nvidia.com (#1466)
youngeunkwon0405 Nov 3, 2025
2e2c2b3
chore: Update RL to use megatron-bridge tot (#1358)
yaoyu-33 Nov 4, 2025
19f68c8
fix: stabilize uv lock after experimental changes (#1461)
terrykong Nov 4, 2025
e6adc77
feat: add kl penalty k1, k2 (#1349)
yuki-97 Nov 4, 2025
8762f57
docs: On policy KD readme update (#1425)
sharathts Nov 4, 2025
6984ba7
feat: Integrate Penguin env logic (#1450)
bxyu-nvidia Nov 6, 2025
8615ee9
fix: Make the optimizer offloading optional (#1404)
youngeunkwon0405 Nov 7, 2025
82bf15a
build: Bump python to 3.12.12 and mlflow to 3.5.1 (#1482)
chtruong814 Nov 7, 2025
40de222
feat: Add Penguin run (#1481)
bxyu-nvidia Nov 7, 2025
2951ce3
fix: improve ZMQ error handling and messages in colocated refit (#1477)
ZhiyuLi-Nvidia Nov 7, 2025
ba68386
build: Remove torch extra build dependency for NGC Pytorch build (#1490)
chtruong814 Nov 10, 2025
3350ba2
feat: Onboard perf recipes in tests (#1322)
guyueh1 Nov 10, 2025
6a035bc
fix: Fix process_weights_after_loading for fp8 dense (#1432)
guyueh1 Nov 10, 2025
6a40247
revert: "chore: improve ray.sub generalization across clusters" (#1505)
terrykong Nov 11, 2025
779f775
fix: patch python path to include transformers_modules in __init__ (#…
hemildesai Nov 12, 2025
7124e44
feat: enhance advantages tracking and normalization stability in GRPO…
ffrujeri Nov 13, 2025
b3a7892
feat: improve non-colocated startup by starting policy and vllm in pa…
terrykong Nov 14, 2025
6fc917f
fix: fixing the sequence parallel related issue in mcore path (#1487)
youngeunkwon0405 Nov 14, 2025
45f5ce6
fix: improve local eval config and doc (#1528)
yuki-97 Nov 17, 2025
74b9b17
docs: Refactor Home Page and New About Section (#1338)
jgerh Nov 17, 2025
775fc34
fix: Incompatible configuration between reward normalization and the …
ffrujeri Nov 18, 2025
c32778d
feat: Support for nano-v2 (#1514)
yfw Nov 18, 2025
55dc433
fix: Update Penguin tests to use renamed resource server (#1540)
shashank3959 Nov 19, 2025
7257c30
fix: honor mlflow server artifact_location (#1536) (#1538)
clumsy Nov 19, 2025
08534fe
build: Update docker file to include OSS NOTICES.txt (#1544)
chtruong814 Nov 19, 2025
1c371a9
perf: perf script change for qwen30b-a3b (#1526)
youngeunkwon0405 Nov 20, 2025
ca9f8f0
move
terrykong Nov 20, 2025
af18c36
wip
terrykong Nov 21, 2025
0f1354e
version fix
terrykong Nov 21, 2025
951c47b
token
terrykong Nov 21, 2025
465adca
robust renovate_app_id
terrykong Nov 21, 2025
b7d282d
successful, but trying to create pr
terrykong Nov 21, 2025
0ce4d98
ignore fact that repo can be fork and also simplify into one pr
terrykong Nov 21, 2025
166f995
try to fix fork
terrykong Nov 21, 2025
0c6901e
[Renovate]: migrate config .github/renovate.json
renovate-bot Nov 21, 2025
6ec02d2
Merge pull request #3 from terrykong/renovate/migrate-config
terrykong Nov 21, 2025
c284119
attempt to group but also fix issues
terrykong Nov 21, 2025
e2f2fa5
try consolidating the logs
terrykong Nov 21, 2025
274767b
consolidated pr
terrykong Nov 21, 2025
2f11198
try to fix
terrykong Nov 22, 2025
97b6f99
try restricting dependency list
terrykong Nov 22, 2025
e538481
try to fix
terrykong Nov 24, 2025
6a14f68
pre-commit
terrykong Nov 24, 2025
355987b
add uv
terrykong Nov 24, 2025
ccef319
pre-clone repo with unshallowed submodules for Renovate
terrykong Nov 24, 2025
73e77d4
fix: add safe.directory for pre-cloned repo
terrykong Nov 24, 2025
69c9d97
fix: use 777 permissions and move gitconfig to /tmp
terrykong Nov 24, 2025
4eb360a
fix(renovate): use docker-user root and pass GIT_CONFIG_GLOBAL env var
terrykong Nov 24, 2025
5ab4f01
fix(renovate): use docker-cmd-file to set git safe.directory
terrykong Nov 24, 2025
d2629d4
fix(renovate): disable uv manager to prevent artifact update race con…
terrykong Nov 24, 2025
7366674
fix(renovate): disable uv manager via packageRules instead of top-lev…
terrykong Nov 24, 2025
debbc06
fix(sync): use bracket counting to find list end instead of regex
terrykong Nov 24, 2025
4cd8141
fix(renovate): remove incorrect workspace check that skipped uv lock
terrykong Nov 25, 2025
8bceb6f
fix(deps): update Automodel to main, remove torch/ray from Renovate a…
terrykong Nov 25, 2025
8654ac4
fix(renovate): change group name to force new branch after closed PRs
terrykong Nov 25, 2025
c0bbd39
fix(renovate): add recreateClosed to allow recreating PRs after closes
terrykong Nov 25, 2025
8b7d59a
fix(renovate): use recreateWhen instead of deprecated recreateClosed
terrykong Nov 25, 2025
67f21bc
fix(renovate): remove schedule restriction to allow anytime updates
terrykong Nov 25, 2025
eaa98ef
fix(renovate): remove transformers from allowlist (conflicts with nem…
terrykong Nov 25, 2025
88b258d
Unpin torch version to let vllm dictate
terrykong Nov 25, 2025
84feb20
Temp hack: override timm==1.0.16 to work around submodule conflict
terrykong Nov 25, 2025
e1ed02a
Disable pep621 auto lock file updates - let postUpgradeTasks handle it
terrykong Nov 25, 2025
a435420
Add NRL_AUTO_SYNC_DEPS env var for Renovate auto-sync
terrykong Nov 25, 2025
4ad700e
Set NRL_AUTO_SYNC_DEPS in renovate_cmd.sh wrapper
terrykong Nov 25, 2025
d9abfa6
Add customEnvVariables for NRL_AUTO_SYNC_DEPS in renovate.json
terrykong Nov 25, 2025
2b654d7
Add workflow_dispatch inputs for Docker updates and force run
terrykong Nov 25, 2025
4029d21
Use jq instead of sed for JSON modification
terrykong Nov 25, 2025
c043992
Remove unused dry_run input
terrykong Nov 25, 2025
758c06a
Add dockerfile disable rule to renovate.json
terrykong Nov 25, 2025
bbfaa6b
[Renovate] Update dependency updates
renovate-bot Mar 25, 2026
101 changes: 101 additions & 0 deletions .coderabbit.yaml
@@ -0,0 +1,101 @@
# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

# yaml-language-server: $schema=https://coderabbit.ai/integrations/schema.v2.json
# https://docs.coderabbit.ai/getting-started/configure-coderabbit/
# Validator https://docs.coderabbit.ai/configuration/yaml-validator#yaml-validator
# In PR, comment "@coderabbitai configuration" to get the full config including defaults
# Set the language for reviews by using the corresponding ISO language code.
# Default: "en-US"
language: "en-US"
# Settings related to reviews.
# Default: {}
reviews:
  # Set the profile for reviews. Assertive profile yields more feedback, that may be considered nitpicky.
  # Options: chill, assertive
  # Default: "chill"
  profile: chill
  # Add this keyword in the PR/MR title to auto-generate the title.
  # Default: "@coderabbitai"
  auto_title_placeholder: '@coderabbitai title'
  # Auto Title Instructions - Custom instructions for auto-generating the PR/MR title.
  # Default: ""
  auto_title_instructions: 'Format: "<category>: <title>". Category must be one of: feat, fix, docs, style, refactor, perf, test, build, ci, chore, revert, cp. The category must be followed by a colon. Title should be concise (<= 80 chars). Example: "feat: Add logit_bias support".' # current: ''
  # Set the commit status to 'pending' when the review is in progress and 'success' when it is complete.
  # Default: true
  commit_status: false
  # Generate walkthrough in a markdown collapsible section.
  # Default: false
  collapse_walkthrough: true
  # Generate an assessment of how well the changes address the linked issues in the walkthrough.
  # Default: true
  assess_linked_issues: true
  # Include possibly related issues in the walkthrough.
  # Default: true
  related_issues: true
  # Related PRs - Include possibly related pull requests in the walkthrough.
  # Default: true
  related_prs: true
  # Suggest labels based on the changes in the pull request in the walkthrough.
  # Default: true
  suggested_labels: true
  # Suggest reviewers based on the changes in the pull request in the walkthrough.
  # Default: true
  suggested_reviewers: true
  # Generate a poem in the walkthrough comment.
  # Default: true
  poem: false # current: true
  # Post review details on each review. Additionally, post a review status when a review is skipped in certain cases.
  # Default: true
  review_status: false # current: true
  # Configuration for pre merge checks
  # Default: {}
  pre_merge_checks:
    # Custom Pre-merge Checks - Add unique checks to enforce your team's standards before merging a pull request. Each check must have a unique name (up to 50 characters) and clear instructions (up to 10000 characters). Use these to automatically verify coding, security, documentation, or business rules and maintain code quality.
    # Default: []
    custom_checks:
      - name: "Test Results for Major Changes"
        mode: "warning" # or "error" to block merges
        instructions: |
          If this PR contains major changes (such as new features, breaking changes, or significant refactoring), verify that the PR description includes test results or testing information.
          If a change could affect numerics or convergence, the PR description should include information demonstrating that there is no regression.
          If a change could affect performance, the PR description should include before-and-after performance numbers, as well as the configuration and context in which they apply.
          Pass if test results are documented or if the changes are minor.
  auto_review:
    # Configuration for auto review
    # Default: {}
    # Automatic Incremental Review - Automatic incremental code review on each push
    # Default: true
    auto_incremental_review: false # current: true
    # Review draft PRs/MRs.
    # Default: false
    drafts: false
    # Base branches (other than the default branch) to review. Accepts regex patterns. Use '.*' to match all branches.
    # Default: []
    base_branches: ["main", "r[0-9].*"] # current: []
# Configuration for knowledge base
# Default: {}
knowledge_base:
  code_guidelines:
    # CodeRabbit will analyse and learn from your organization's code guidelines, which you can mention in the file patterns section. These guidelines will then be used to conduct thorough code reviews.
    # Default: {}
    enabled: true
    # Enabled - Enable CodeRabbit to enforce your organization's coding standards during reviews.
    # Default: true
    filePatterns: # current: []
      # File Patterns - Specify files for your coding guideline documents in this section. CodeRabbit will scan these files to understand your team's standards and apply them during code reviews. Multiple files supported. File names are case-sensitive. Common files like: (**/.cursorrules, .github/copilot-instructions.md, .github/instructions/*.instructions.md, **/CLAUDE.md, **/GEMINI.md, **/.cursor/rules/*, **/.windsurfrules, **/.clinerules/*, **/.rules/*, **/AGENT.md, **/AGENTS.md) are included by default.
      # Default: []
      - "**/CODING_GUIDELINES.md"
      - "**/.cursor/rules/*"
4 changes: 3 additions & 1 deletion .dockerignore
@@ -1,6 +1,8 @@
# Adding to .gitignore helps reduce the size of your working_dir

.git
# Note: removing .git from .dockerignore since it is valuable to have the git history to
# know where this container was built
# .git
*.out
*.log
*.tar
42 changes: 0 additions & 42 deletions .github/ISSUE_TEMPLATE/bug_report.md

This file was deleted.

25 changes: 0 additions & 25 deletions .github/ISSUE_TEMPLATE/feature_request.md

This file was deleted.

180 changes: 180 additions & 0 deletions .github/RENOVATE_SETUP.md
@@ -0,0 +1,180 @@
# Renovate Setup Documentation

This repository uses [Renovate](https://docs.renovatebot.com/) to automatically update dependencies, including git submodules and Python packages managed in `pyproject.toml`.

## What Renovate Does

Renovate automatically:
1. **Updates git submodules** by tracking the configured branches
2. **Updates a small allowlist of Python dependencies** in `pyproject.toml`:
   - `vllm`, `torch`, and `ray` for the core training stack
   - `transformer-engine` and `flash-attn` for xformers compatibility
   - `transformers` so we can track upstream releases
   - _Everything else is frozen unless explicitly requested._
3. **Syncs `3rdparty/*/setup.py` files** with their corresponding submodule dependencies
4. **Regenerates `uv.lock`** after dependency updates
5. **Pre-clones git submodules with full history** so Renovate can check out new commits (works around `shallow=true` in `.gitmodules`)
6. **Creates a single PR** that automatically triggers the full CI pipeline (`cicd-main.yml`)

## Setup Requirements

You need to set up authentication for Renovate. Choose one of the following options:

### Option 1: Personal Access Token (PAT) - Quick Start

**This is the easiest way to get started:**

1. Create a GitHub Personal Access Token (PAT):
   - Go to GitHub Settings → Developer settings → Personal access tokens → Tokens (classic)
   - Click "Generate new token (classic)"
   - Give it a descriptive name (e.g., "Renovate Bot")
   - Select scopes:
     - ✅ `repo` (Full control of private repositories)
     - ✅ `workflow` (Update GitHub Action workflows - required for github-actions manager)
   - Click "Generate token" and copy it

2. Add the token as a repository secret:
   - Go to your repository → Settings → Secrets and variables → Actions
   - Click "New repository secret"
   - Name: `RENOVATE_TOKEN`
   - Value: Paste your PAT
   - Click "Add secret"

3. You're done! The workflow will use the PAT automatically.

### Option 2: GitHub App (Recommended for Organizations)

**Better for rate limits and security, but requires more setup:**

1. Create a GitHub App:
   - Go to Organization Settings → Developer settings → GitHub Apps → New GitHub App
   - Or use an existing Renovate GitHub App

2. Configure the app with these permissions:
   - Repository permissions:
     - Contents: Read & Write
     - Pull requests: Read & Write
     - Workflows: Read & Write (if using github-actions manager)
     - Metadata: Read-only

3. Install the app on your repository

4. Add these secrets to your repository:
   - `RENOVATE_APP_ID`: The app ID (found on the app's settings page)
   - `RENOVATE_APP_PRIVATE_KEY`: The app's private key (PEM format)

5. The workflow will automatically detect and use the GitHub App token

### Grant Workflow Permissions

Ensure the Renovate workflow has permission to:
- Create and update pull requests
- Read and write to the repository
- Access secrets

This can be configured in: `Settings` → `Actions` → `General` → `Workflow permissions`

## Configuration Files

### `.github/renovate.json`
Main configuration file that defines:
- Update schedule (daily during business hours PST)
- Package grouping rules
- Branch naming conventions
- PR labels (`dependencies`, `CI:L2`)

### `.github/workflows/renovate.yml`
GitHub Actions workflow that:
- Runs daily at 9 AM UTC (1 AM PST / 2 AM PDT)
- Can be manually triggered with `workflow_dispatch`
- Sets up the environment (Python, uv)
- Executes Renovate with proper credentials
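
The UTC-to-Pacific conversion quoted above can be checked with a few lines of Python. This is only an arithmetic sketch: it assumes fixed UTC offsets (PST = UTC-8, PDT = UTC-7) instead of a time zone database, and the calendar date is arbitrary. The helper name `local_hour` is made up for illustration.

```python
from datetime import datetime, timedelta, timezone

# Fixed offsets for illustration only: PST is UTC-8, PDT is UTC-7.
PST = timezone(timedelta(hours=-8), "PST")
PDT = timezone(timedelta(hours=-7), "PDT")


def local_hour(utc_hour: int, tz: timezone) -> int:
    """Return the local hour for a run that starts at `utc_hour` UTC."""
    run = datetime(2025, 1, 1, utc_hour, tzinfo=timezone.utc)  # arbitrary date
    return run.astimezone(tz).hour


print("PST hour:", local_hour(9, PST))
print("PDT hour:", local_hour(9, PDT))
```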

### `.github/scripts/sync_submodule_dependencies.py`
Python script that:
- Reads dependencies from `3rdparty/*/pyproject.toml` files in submodules
- Updates `CACHED_DEPENDENCIES` in corresponding `setup.py` files
- Ensures consistency between submodule requirements and wrapper packages

### `.github/scripts/renovate_post_update.sh`
Bash script that runs after Renovate updates dependencies:
1. Syncs submodule dependencies to setup.py files
2. Runs `uv lock` to regenerate the lock file
3. Stages changes for commit
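
The three steps above could be sketched in Python as follows. The real implementation is a Bash script, so this is only an outline of the ordering; the exact commands, the interpreter used to run the sync script, and the staged paths are assumptions rather than a copy of `renovate_post_update.sh`.

```python
import subprocess

# Assumed commands; the actual script may invoke these differently.
POST_UPDATE_STEPS = [
    ["python", ".github/scripts/sync_submodule_dependencies.py"],  # 1. sync setup.py files
    ["uv", "lock"],                                                # 2. regenerate uv.lock
    ["git", "add", "uv.lock", "3rdparty/*/setup.py"],              # 3. stage the results
]


def run_post_update(dry_run: bool = True) -> list[str]:
    """Run the post-update steps in order, or only list them when dry_run is set."""
    executed = []
    for cmd in POST_UPDATE_STEPS:
        executed.append(" ".join(cmd))
        if not dry_run:
            # check=True aborts at the first failing step.
            subprocess.run(cmd, check=True)
    return executed


if __name__ == "__main__":
    for line in run_post_update(dry_run=True):
        print(line)
```

Stopping at the first failure matters here: staging files after a failed `uv lock` would commit a lock file that does not match the updated dependencies.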

## Manual Workflow Trigger

You can manually trigger Renovate at any time:

1. Go to `Actions` → `Renovate` in GitHub
2. Click `Run workflow`
3. Optional parameters:
- **Log level**: Set to `debug` for verbose output
- **Dry run**: Enable to preview changes without creating PRs

## Update Strategy

Renovate now produces **one consolidated PR at a time**:

| Branch prefix | Contents | Notes |
|---------------|----------|-------|
| `renovate/allowlist-…` | Git submodules, Docker/GitHub Action updates, and the allowlisted Python packages above | Runs on the configured weekday schedule; no other dependencies are touched until explicitly re-enabled. Renovate's built-in vulnerability PRs are disabled so everything funnels through this branch. |

## Debug vs. Production Settings

- `prHourlyLimit` is currently `0` **only while debugging** so Renovate can recreate PRs immediately. Set it back to `1` once we're satisfied with the configuration to avoid noisy PR bursts.
- `prConcurrentLimit` stays at `1` to preserve the "one PR at a time" contract; raise it temporarily if you ever need parallel testing.
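
One way to avoid shipping the debug values by accident is a small check over the parsed `renovate.json`. The sketch below is hypothetical (the function name and the idea of running such a check are not part of this PR) and assumes both options are top-level keys; the expected values are the ones stated above.

```python
import json

# Steady-state values described in the text above.
PRODUCTION_LIMITS = {"prHourlyLimit": 1, "prConcurrentLimit": 1}


def find_debug_overrides(config: dict) -> dict:
    """Return {option: (actual, expected)} for limits that differ from production."""
    return {
        key: (config.get(key), expected)
        for key, expected in PRODUCTION_LIMITS.items()
        if config.get(key) != expected
    }


# Inline sample standing in for the parsed contents of .github/renovate.json.
sample = json.loads('{"prHourlyLimit": 0, "prConcurrentLimit": 1}')
print(find_debug_overrides(sample))
```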

## CI Integration

When Renovate creates a PR:
1. The PR is automatically labeled with `CI:L2` to trigger full CI testing
2. `cicd-main.yml` runs the complete test suite
3. All L2 tests must pass before the PR can be merged
4. The lock file and setup.py changes are included in the PR

## Troubleshooting

### Renovate workflow fails
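- Check that the secrets for your chosen authentication option are set: `RENOVATE_TOKEN` (Option 1), or `RENOVATE_APP_ID` and `RENOVATE_APP_PRIVATE_KEY` (Option 2)
- If you use a GitHub App, verify that it is installed on the repository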
- Check workflow logs for specific error messages

### Dependencies not syncing
- Ensure submodules are properly initialized
- Check `.github/scripts/sync_submodule_dependencies.py` logs
- Verify that submodule `pyproject.toml` files exist and are valid

### uv lock fails
- Ensure `uv` version in workflow matches project requirements
- Check for dependency conflicts in the update
- Review the post-update script logs

### PRs not triggering CI
- Verify PR has the `CI:L2` label
- Check `cicd-main.yml` configuration
- Ensure PR is targeting the `main` branch

## Customization

To modify Renovate behavior:
1. Edit `.github/renovate.json` for scheduling, grouping, or update rules
2. Update `.github/workflows/renovate.yml` for workflow settings
3. Modify `.github/scripts/renovate_post_update.sh` for custom post-update logic

## Testing Changes

Before committing Renovate config changes:
1. Use the workflow's dry-run mode to test
2. Check the Renovate logs for validation errors
3. Test the post-update script locally:
```bash
.github/scripts/renovate_post_update.sh
```

## References

- [Renovate Documentation](https://docs.renovatebot.com/)
- [Renovate Configuration Options](https://docs.renovatebot.com/configuration-options/)
- [GitHub Action for Renovate](https://github.com/renovatebot/github-action)
