scripts: ship deterministic comment / docstring-only diff verifier#641
Conversation
scripts/verify_comment_only_diff.py compares a list of changed files
between two git refs and reports whether each diff is strictly comments
or docstrings.
* .py: parse both revs into AST, strip module / class / function
docstrings, then compare ast.unparse output. Pure Python comments
are discarded by ast.parse by construction, so any post-strip diff
is real code.
* .yml / .yaml: yaml.safe_load both sides and compare the parsed
Python object; if scalar values differ, also strip shell comments
inside any multi-line scalar (i.e. `run: |` script bodies) before
comparing.
Exit code is 0 if every file is comment-only, 1 otherwise. The script
also prints a tight diff snippet for any FAIL line so a reviewer can
spot the real code change at a glance.
This is what I used to gate the trim PRs #637 / #640 (zoo) and #5416 /
#5418 / #5421 (unsloth). Shipping it under scripts/ so any contributor
can deterministically prove a comment / docstring refactor is truly
comment-only, without manually eyeballing every line of a 4000-line
diff.
Usage:
python scripts/verify_comment_only_diff.py [--base REF] [--head REF] path ...
Defaults: --base origin/main, --head HEAD. Paths are repo-relative.
Smoke test against the squash-merged PR #640 (an 18-file pure trim):
git diff --name-only 02875d0~1..02875d0 \
| xargs python scripts/verify_comment_only_diff.py --base 02875d0~1 --head 02875d0
reports OK for all 18 files.
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: d935cdeb60
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| except subprocess.CalledProcessError as exc: | ||
| print(f"SKIP {path}: {exc}") | ||
| continue |
There was a problem hiding this comment.
Fail on paths missing from either ref
When a path exists only on one side of the comparison, such as an added, deleted, or renamed .py/.yml file from git diff --name-only, _git_show raises and this branch prints SKIP but leaves rc at 0. That lets a PR adding a real code file pass the verifier as comment-only if all missing-side paths are skipped, so missing revisions should be treated as failures unless the caller explicitly requested skipping them.
Useful? React with 👍 / 👎.
| if isinstance(obj, str) and "\n" in obj: | ||
| return _strip_shell_comments(obj) |
There was a problem hiding this comment.
Do not strip data from every multiline YAML scalar
This normalizes every multiline string, not just GitHub Actions run: bodies, so YAML data can be silently changed and still report OK. For example, an action input or env value written as message: | with lines beginning with # has those lines as scalar content, not YAML or shell comments; changing or removing them will be hidden by _strip_shell_comments. If only shell scripts are intended, the walker needs to track mapping keys and apply this only to run values.
Useful? React with 👍 / 👎.
There was a problem hiding this comment.
Code Review
This pull request introduces scripts/verify_comment_only_diff.py, a utility designed to verify that changes in a PR are strictly limited to comments or docstrings for Python and YAML files. The script uses AST parsing and structural YAML comparison to detect code drift. Review feedback suggested that the script should fail when files are added or deleted rather than skipping them, as these constitute structural changes. Additionally, it was recommended to add a guarded import for the PyYAML dependency to provide a more helpful error message if the library is not installed.
| except subprocess.CalledProcessError as exc: | ||
| print(f"SKIP {path}: {exc}") | ||
| continue |
There was a problem hiding this comment.
The script currently skips files that cannot be retrieved from one of the git refs (e.g., when a file is added or deleted) and continues with a success status. For a tool intended to prove that a PR has zero real code changes, adding or deleting a file is a structural change that should be reported as a failure. Skipping these files could lead to false positives where a PR with new files is incorrectly verified as "comment-only".
| except subprocess.CalledProcessError as exc: | |
| print(f"SKIP {path}: {exc}") | |
| continue | |
| except subprocess.CalledProcessError as exc: | |
| print(f"FAIL {path}: could not fetch file from both refs (added/deleted?) -- {exc}") | |
| rc = 1 | |
| continue |
| import sys | ||
| from typing import Any | ||
|
|
||
| import yaml |
There was a problem hiding this comment.
The PyYAML library is not listed as a required dependency in pyproject.toml. For consistency with other scripts in this repository (e.g., scripts/lint_workflow_triggers.py), consider adding a guarded import with a helpful error message to guide users who might be missing the library.
| import yaml | |
| try: | |
| import yaml | |
| except ImportError: | |
| print( | |
| "ERROR: PyYAML is required. Install with 'pip install pyyaml'", file = sys.stderr | |
| ) | |
| sys.exit(2) |
References
- Avoid using try...except ImportError for libraries that are required dependencies of the project, as the check is redundant.
Summary
Adds
scripts/verify_comment_only_diff.py, a small standalone tool for proving a "comment trim" or "docstring refactor" PR has zero real code changes. This is what I used to gate the trim PRs #640 (zoo) and #5418 / #5421 (unsloth) and I want it in the tree so any contributor can run the same check.What it does
Given a list of changed files and two git refs, the script reports whether each file's diff is strictly comments or docstrings:
.pyfiles: parse both revs into AST, strip module / class / function docstrings, then compareast.unparseoutput. Pure Python comments are discarded byast.parseby construction, so any post-strip diff is real code..yml/.yamlfiles:yaml.safe_loadboth sides and compare the parsed Python object. If scalar values differ, also strip shell comments inside any multi-line scalar (i.e.run: |script bodies in workflows) before comparing.Exit code 0 if every file is comment-only, 1 otherwise. Failing files print a tight diff snippet so a reviewer can spot the real change at a glance.
Usage
Defaults:
--base origin/main,--head HEAD. Paths are repo-relative.Typical invocation against the current branch:
Smoke test
Against the squash-merged PR #640 (a real 18-file pure-trim PR on this repo):
reports OK for all 18 files (5 YAML workflows + 12 .py tests + unsloth_zoo/init.py + unsloth_zoo/patching_utils.py).
Test plan
python scripts/verify_comment_only_diff.py --helpparses.