Skip to content

feat: add config_cli.py and refactor configs + config pre-commit#1024

Merged
terrykong merged 15 commits intomainfrom
tk/config-fold
Sep 25, 2025
Merged

feat: add config_cli.py and refactor configs + config pre-commit#1024
terrykong merged 15 commits intomainfrom
tk/config-fold

Conversation

@terrykong
Copy link
Collaborator

@terrykong terrykong commented Aug 29, 2025

Convergence/Perf check

pr1024-convergence.log

Overview

Given the recent friction of merging can sometimes come down to config conflicts, this PR introduces a tool to merge configs (and pre-commit hook) so you can commit the minimal config necessary for a recipe. There is now a new risk of config changes in the base config propagating to the recipes, especially if the config was not defined and the new default changes the behavior, but this has the positive side-effect of bisecting being more helpful whereas before we were shielded from regressions until the recipe config was updated explicitly.

We need to now keep in mind that if a flag like "enable_eager" was false and no other config overrode it, turning it on would turn it on for all recipes.

Example of how to use the tool:

Utilities for working with YAML configs in this repo.

Subcommands:
  - expand: Resolve a config with OmegaConf interpolation and inheritance.
  - minimize: Given a base config and a config, remove keys in the config that
    are equal to the base, and ensure a defaults entry pointing to the base
    exists. The defaults path in the resulting config is written relative to
    the base config file.
  - minimize-check: Same args as `minimize` but only checks if minimization
    would change the file; exits non-zero if changes are needed.

Both commands support printing to stdout or in-place editing of the config file.

Example:
  # Expand a config with a root level "defaults" key to see the full config; print to stdout
  tools/config_cli.py expand examples/configs/recipes/llm/dpo-llama3.1-8b-instruct-4n8g-fsdp2tp2-quick.v2.yaml

  # Expand a config with a root level "defaults" key to see the full config; edit the config in place
  tools/config_cli.py expand examples/configs/recipes/llm/dpo-llama3.1-8b-instruct-4n8g-fsdp2tp2-quick.v2.yaml --in-place

  # Minimize a config and remove all keys that are present in the base config; print to stdout
  # tools/config_cli.py minimize <base_config> <config>
  tools/config_cli.py minimize examples/configs/dpo.yaml examples/configs/recipes/llm/dpo-llama3.1-8b-instruct-4n8g-fsdp2tp2-quick.v2.yaml

  # Minimize a config and remove all keys that are present in the base config; edit the config in place
  # tools/config_cli.py minimize <base_config> <config>
  tools/config_cli.py minimize examples/configs/dpo.yaml examples/configs/recipes/llm/dpo-llama3.1-8b-instruct-4n8g-fsdp2tp2-quick.v2.yaml --in-place

  # Minimize all llm the configs:
  for algo in grpo dpo sft; do
    base_config=examples/configs/${algo}.yaml
    if [[ ${algo} == grpo ]]; then
      base_config=examples/configs/grpo_math_1B.yaml
    fi
    for recipe in examples/configs/recipes/llm/${algo}-*.yaml; do
      tools/config_cli.py minimize $base_config $recipe --in-place
    done
  done

  # Minimize vlm configs:
  for recipe in examples/configs/recipes/vlm/vlm_grpo-*.yaml; do
    tools/config_cli.py minimize examples/configs/vlm_grpo_3B.yaml $recipe --in-place
  done

  # Compare two configs
  tools/config_cli.py compare examples/configs/grpo_math_1B.yaml examples/configs/grpo_math_8B.yaml

  # Minimize a config and compare it to not minimzing (should be the same)
  tools/config_cli.py minimize examples/configs/dpo.yaml examples/configs/recipes/llm/dpo-llama3.1-8b-instruct-4n8g-fsdp2tp2-quick.v2.yaml >examples/configs/recipes/llm/dpo-llama3.1-8b-instruct-4n8g-fsdp2tp2-quick.v2.yaml.minimized
  tools/config_cli.py compare \
    examples/configs/recipes/llm/dpo-llama3.1-8b-instruct-4n8g-fsdp2tp2-quick.v2.yaml \
    examples/configs/recipes/llm/dpo-llama3.1-8b-instruct-4n8g-fsdp2tp2-quick.v2.yaml.minimized

Related to #927

Notable issues that arose from merging configs:

1. all dtensor recipes defaulted to v2 and that uncovered some perf regressions

grpo-gemma3-27b-it-8n8g-fsdp2tp8-actckpt-long

this recipe is much worse on v2 than on v1 (cand=v1, base=v2). Due to this, i kept this recipe using v1
image

Issue tracking gemma perf regression #1097

2. SFT recipes defaulted to the default chat template recipe

Summary by CodeRabbit

  • New Features

    • Introduced a configuration CLI with commands to expand, minimize, validate, and compare YAML configs.
    • Added default chat templates to several SFT tokenizer settings.
  • Chores

    • Added pre-commit hooks to enforce minimized configuration files with strict error handling.
  • Refactor

    • Simplified and standardized numerous LLM/VLM recipe configs by delegating to shared defaults, trimming deprecated or redundant options, and restructuring schedulers.
  • Style

    • Normalized YAML formatting (unquoted scalars, boolean/number formats), streamlined logger sections, and clarified dataset fields across configs.

@terrykong terrykong force-pushed the tk/config-fold branch 4 times, most recently from 4534008 to 1c0290c Compare September 6, 2025 06:41
@github-actions github-actions bot added the CI Relating to CI label Sep 6, 2025
@coderabbitai
Copy link
Contributor

coderabbitai bot commented Sep 10, 2025

📝 Walkthrough

Walkthrough

Adds a config minimization/validation CLI (tools/config_cli.py) and pre-commit hooks to enforce minimized YAML recipes. Broadly rewrites multiple LLM/VLM recipe YAMLs to rely on shared defaults, remove redundant fields, normalize formatting, and simplify scheduler/optimizer/logging sections.

Changes

Cohort / File(s) Summary
Tooling: Config CLI and CI hooks
./tools/config_cli.py, .pre-commit-config.yaml
New CLI with expand/minimize/compare/minimize-check commands; vendored loader resolves defaults and prunes keys. Pre-commit adds local hooks to run minimize-check over specified base/recipe pairs; bash entrypoints with strict error handling.
LLM DPO recipes normalization
examples/configs/recipes/llm/dpo-llama3.1-8b-instruct-4n8g-fsdp2tp2-quick.v2.yaml, .../dpo-llama3.1-8b-instruct-4n8g-fsdp2tp4.yaml, .../dpo-llama3.1-8b-instruct-4n8g-megatron.v2.yaml, .../dpo-llama3.1-8b-instruct-4n8g-megatrontp2pp2-quick.yaml, .../dpo-llama3.1-8b-tulu3-1n8g-fsdp2tp1.yaml, .../dpo-llama3.2-1b-instruct-1n8g-fsdp2tp1.v2.yaml, .../dpo-mistral-nemo-instruct-2407-1n8g-fsdp2tp8-actckpt-long.yaml
Add shared defaults; remove many training/validation fields; simplify dtensor/optimizer/scheduler/data/logger; unquote scalars; restructure scheduler lists.
LLM GRPO recipes normalization
examples/configs/recipes/llm/grpo-deepscaler-1.5b-{8K,16K,24K}.yaml, .../grpo-gemma3-{1b-it-1n8g-fsdp2tp1,27b-it-8n8g-fsdp2tp8-actckpt-long}.yaml, .../grpo-gspo-deepscaler-1.5b-8K.yaml, .../grpo-llama3.1-8b-instruct-{1n8g-megatron-fp8,2n8g-fsdp2tp1-noncolocated,4n8g-fsdp2tp1-long.v3}.yaml, .../grpo-llama3.2-1b-instruct-{1n8g-fsdp2tp1.v3,1n8g-megatron}.yaml, .../grpo-math-qwen3-30ba3b-megatron-tp4-32k.yaml, .../grpo-moonlight-16ba3b-4n8g-megatron.yaml, .../grpo-qwen2.5-{32b-32n8g-fsdp2tp8sp-actckpt{.v3,.long.v3},7b-instruct-4n8g-{fsdp2tp4sp.v3,megatron}}.yaml, .../grpo-qwen3-30ba3b-8n8g-megatron.yaml, .../grpo-qwen2.5-math-1.5b-instruct-1n8g-fsdp2tp1.v3.yaml
Shift to defaults-driven configs; remove rollout/validation/baselines/async and many optimizer/scheduler/generation/vLLM fields; add/set dtensor _v2 where applicable; normalize numeric/boolean formatting; restructure schedulers; prune logger/cluster/env blocks.
LLM SFT recipes normalization
examples/configs/recipes/llm/sft-llama3.1-70b-8n8g-tp4pp2-long-megatron.yaml, .../sft-llama3.1-8b-1n8g-{fsdp2tp1-dynamicbatch,fsdp2tp1-long,fsdp2tp2sp}.yaml, .../sft-llama3.1-8b-1n8g-megatron{,-seqpack}.yaml, .../sft-llama3.2-1b-1n8g-fsdp2tp1.v3.yaml, .../sft-qwen2.5-32b-4n8g-fsdp2tp8sp-actckpt.v3.yaml
Add tokenizer chat_template; remove epochs/val flags and dtensor/dynamic/sequence packing details; simplify optimizer (lr/eps) and data/logging; unquote values; prune gpu monitoring and num_nodes.
VLM GRPO recipes normalization
examples/configs/recipes/vlm/vlm_grpo-qwen2.5-vl-3b-instruct-clevr-1n2g-dtensor2tp1.v1.yaml, .../vlm_grpo-smolvlm2-2.2b-instruct-clevr-1n2g-dtensor2tp1.v1.yaml
Point to shared VLM defaults; collapse policy to essentials; simplify checkpointing; trim training/generation/data/env/logger/cluster blocks with minimal retained fields.

Sequence Diagram(s)

sequenceDiagram
  autonumber
  actor Dev
  participant CLI as config_cli.py
  participant FS as Filesystem
  participant YAML as YAML Loader

  Dev->>CLI: minimize --base BASE.yaml --child CHILD.yaml [-i]
  CLI->>YAML: load(BASE.yaml)
  CLI->>YAML: load(CHILD.yaml)
  CLI->>CLI: resolve defaults (expand)
  CLI->>CLI: compute minimized(CHILD - BASE)
  alt in-place (-i)
    CLI->>FS: write CHILD.yaml (defaults->relative, ordered)
  else stdout
    CLI-->>Dev: print minimized YAML
  end
Loading
sequenceDiagram
  autonumber
  actor Git as pre-commit
  participant Hook as minimize-check-*(bash)
  participant CLI as config_cli.py
  participant FS as Repo

  Git->>Hook: on commit
  loop for each (BASE, RECIPE)
    Hook->>CLI: minimize-check --base BASE --child RECIPE
    CLI->>FS: read files
    CLI->>CLI: compute minimized form
    alt differs
      CLI-->>Hook: exit 1 (non-zero)
      Hook-->>Git: fail hook
    else identical
      CLI-->>Hook: exit 0
    end
  end
Loading

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Suggested labels

CI:L1

Suggested reviewers

  • yuki-97

Pre-merge checks and finishing touches

❌ Failed checks (1 inconclusive)
Check name Status Explanation Resolution
Test Results For Major Changes ❓ Inconclusive This PR is a major change (new CLI plus large-scale config refactors) that can affect numerics and performance. The description notes testing on a specific commit and documents DTensor v2 performance regressions for certain recipes, with mitigation (staying on v1) and mentions that screenshots are included. However, the PR description as provided here does not contain explicit before/after metrics, convergence evidence, or the exact context/config/hardware used, so I cannot verify that performance and numerics were adequately validated across impacted recipes. Given the absence of concrete numbers and artifacts in the text available, the check outcome is inconclusive. Please augment the PR description with a concise Testing and Performance section: include before/after numbers (e.g., tokens/s, step time, reward/accuracy) for the affected recipes, the exact configs and commands used, hardware/software context (GPU model, nodes, CUDA/driver), and brief convergence evidence (loss/reward curves or summary stats) showing no regressions post-refactor. For the new CLI, add minimal test artifacts (e.g., sample inputs/outputs or unit test summary) demonstrating expand/minimize/minimize-check/compare behavior. Once these details are added, this check can be reassessed and should pass.
✅ Passed checks (3 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title Check ✅ Passed The title clearly states the primary feature addition (config_cli.py) and the accompanying updates to configuration files and pre-commit hooks, succinctly capturing the main purpose of the PR in a single sentence without unnecessary detail. It uses a conventional commit style prefix and avoids file lists or vague terms. Scanning the title gives a reviewer immediate insight into the key changes.
Docstring Coverage ✅ Passed No functions found in the changes. Docstring coverage check skipped.
✨ Finishing touches
  • 📝 Generate Docstrings
🧪 Generate unit tests
  • Create PR with unit tests
  • Post copyable unit tests in a comment
  • Commit unit tests in branch tk/config-fold

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@terrykong terrykong force-pushed the tk/config-fold branch 2 times, most recently from 260f022 to 1cd4fd9 Compare September 10, 2025 06:30
@github-actions github-actions bot removed the CI Relating to CI label Sep 10, 2025
@terrykong terrykong changed the title feat: add config_tools.py and refactor configs feat: add config_cli.py and refactor configs Sep 16, 2025
@terrykong
Copy link
Collaborator Author

tested on b171629

@terrykong terrykong force-pushed the tk/config-fold branch 2 times, most recently from 04cbee1 to 968afbd Compare September 24, 2025 16:39
Signed-off-by: Terry Kong <terryk@nvidia.com>

compare command

Signed-off-by: Terry Kong <terryk@nvidia.com>

config changes

Signed-off-by: Terry Kong <terryk@nvidia.com>

Revert "config changes"

This reverts commit 25b87e2.

Signed-off-by: Terry Kong <terryk@nvidia.com>

cleanup

Signed-off-by: Terry Kong <terryk@nvidia.com>

vlm example

Signed-off-by: Terry Kong <terryk@nvidia.com>

minimize configs

Signed-off-by: Terry Kong <terryk@nvidia.com>

Revert "minimize configs"

This reverts commit 1375480.

Signed-off-by: Terry Kong <terryk@nvidia.com>

minimize configs

Signed-off-by: Terry Kong <terryk@nvidia.com>

Revert "minimize configs"

This reverts commit a4cd8a4.

Signed-off-by: Terry Kong <terryk@nvidia.com>

minimize configs

Signed-off-by: Terry Kong <terryk@nvidia.com>

force sft configs to use default chat template to match last releases
behavior

Signed-off-by: Terry Kong <terryk@nvidia.com>

reverting select configs to v1 to address

Signed-off-by: Terry Kong <terryk@nvidia.com>

add pre-commit and add a minimize-check func

Signed-off-by: Terry Kong <terryk@nvidia.com>

Revert "reverting select configs to v1 to address"

This reverts commit d81f806.

Signed-off-by: Terry Kong <terryk@nvidia.com>

Revert "force sft configs to use default chat template to match last releases"

This reverts commit be01df7.

Signed-off-by: Terry Kong <terryk@nvidia.com>

Revert "minimize configs"

This reverts commit e54f144.

Signed-off-by: Terry Kong <terryk@nvidia.com>
Signed-off-by: Terry Kong <terryk@nvidia.com>
Signed-off-by: Terry Kong <terryk@nvidia.com>
Signed-off-by: Terry Kong <terryk@nvidia.com>
Signed-off-by: Terry Kong <terryk@nvidia.com>
Signed-off-by: Terry Kong <terryk@nvidia.com>
Signed-off-by: Terry Kong <terryk@nvidia.com>
@terrykong terrykong requested a review from yfw September 24, 2025 18:11
@terrykong terrykong assigned parthchadha and unassigned parthchadha Sep 24, 2025
@terrykong terrykong marked this pull request as ready for review September 24, 2025 18:11
@terrykong terrykong requested review from a team as code owners September 24, 2025 18:11
Co-authored-by: Yi-Fu Wu <yifu.wu@gmail.com>
Signed-off-by: Terry Kong <terrycurtiskong@gmail.com>
@parthchadha
Copy link
Contributor

@terrykong thanks for this change, do you want to add a few unit tests for the new tool?

@terrykong
Copy link
Collaborator Author

Some of our configs in example/configs also depend on each other. E.g. grpo_math_8B.yaml uses the defaults from grpo_math_1B.yaml. Do we also want to change these so we only keep the minimal config diffs?

Yea, that's a good point. I created an issue for those: #1201. I only planned on minimizing the recipes since those have tests

Co-authored-by: Terry Kong <terrycurtiskong@gmail.com>
Signed-off-by: Yi-Fu Wu <yifu.wu@gmail.com>
yfw
yfw previously approved these changes Sep 24, 2025
Signed-off-by: Terry Kong <terryk@nvidia.com>
Signed-off-by: Terry Kong <terryk@nvidia.com>
@terrykong
Copy link
Collaborator Author

@parthchadha added tests in 044385c

@terrykong terrykong added CI:L1 Run doctests, unit tests, and functional tests and removed CI:L1 Run doctests, unit tests, and functional tests labels Sep 25, 2025
@terrykong terrykong enabled auto-merge (squash) September 25, 2025 06:52
@terrykong terrykong merged commit 32faafa into main Sep 25, 2025
42 checks passed
@terrykong terrykong deleted the tk/config-fold branch September 25, 2025 16:33
PrinsYin pushed a commit to PrinsYin/RL that referenced this pull request Nov 30, 2025
…DIA-NeMo#1024)

Signed-off-by: Terry Kong <terryk@nvidia.com>
Signed-off-by: Terry Kong <terrycurtiskong@gmail.com>
Signed-off-by: Yi-Fu Wu <yifu.wu@gmail.com>
Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>
Co-authored-by: Yi-Fu Wu <yifu.wu@gmail.com>
This was referenced Dec 19, 2025
@coderabbitai coderabbitai bot mentioned this pull request Feb 20, 2026
4 tasks
yuanhangsu1986 pushed a commit to yuanhangsu1986/RL-Nemontron-Edge-Omni that referenced this pull request Feb 21, 2026
…DIA-NeMo#1024)

Signed-off-by: Terry Kong <terryk@nvidia.com>
Signed-off-by: Terry Kong <terrycurtiskong@gmail.com>
Signed-off-by: Yi-Fu Wu <yifu.wu@gmail.com>
Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>
Co-authored-by: Yi-Fu Wu <yifu.wu@gmail.com>
Signed-off-by: yuanhangs <yuanhangs@nvidia.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

CI:L1 Run doctests, unit tests, and functional tests

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants