Skip to content

scripts: ship deterministic comment / docstring-only diff verifier#641

Merged
danielhanchen merged 1 commit into
mainfrom
chore/ship-comment-only-verifier
May 14, 2026
Merged

scripts: ship deterministic comment / docstring-only diff verifier#641
danielhanchen merged 1 commit into
mainfrom
chore/ship-comment-only-verifier

Conversation

@danielhanchen
Copy link
Copy Markdown
Member

Summary

Adds scripts/verify_comment_only_diff.py, a small standalone tool for proving a "comment trim" or "docstring refactor" PR has zero real code changes. This is what I used to gate the trim PRs #640 (zoo) and #5418 / #5421 (unsloth) and I want it in the tree so any contributor can run the same check.

What it does

Given a list of changed files and two git refs, the script reports whether each file's diff is strictly comments or docstrings:

  • .py files: parse both revs into AST, strip module / class / function docstrings, then compare ast.unparse output. Pure Python comments are discarded by ast.parse by construction, so any post-strip diff is real code.
  • .yml / .yaml files: yaml.safe_load both sides and compare the parsed Python object. If scalar values differ, also strip shell comments inside any multi-line scalar (i.e. run: | script bodies in workflows) before comparing.

Exit code 0 if every file is comment-only, 1 otherwise. Failing files print a tight diff snippet so a reviewer can spot the real change at a glance.

Usage

python scripts/verify_comment_only_diff.py [--base REF] [--head REF] path ...

Defaults: --base origin/main, --head HEAD. Paths are repo-relative.

Typical invocation against the current branch:

git diff --name-only origin/main..HEAD \
  | xargs python scripts/verify_comment_only_diff.py --base origin/main

Smoke test

Against the squash-merged PR #640 (a real 18-file pure-trim PR on this repo):

git diff --name-only 02875d0~1..02875d0 \
  | xargs python scripts/verify_comment_only_diff.py --base 02875d0~1 --head 02875d0

reports OK for all 18 files (5 YAML workflows + 12 .py tests + unsloth_zoo/init.py + unsloth_zoo/patching_utils.py).

Test plan

scripts/verify_comment_only_diff.py compares a list of changed files
between two git refs and reports whether each diff is strictly comments
or docstrings.

  * .py: parse both revs into AST, strip module / class / function
    docstrings, then compare ast.unparse output. Pure Python comments
    are discarded by ast.parse by construction, so any post-strip diff
    is real code.
  * .yml / .yaml: yaml.safe_load both sides and compare the parsed
    Python object; if scalar values differ, also strip shell comments
    inside any multi-line scalar (i.e. `run: |` script bodies) before
    comparing.

Exit code is 0 if every file is comment-only, 1 otherwise. The script
also prints a tight diff snippet for any FAIL line so a reviewer can
spot the real code change at a glance.

This is what I used to gate the trim PRs #637 / #640 (zoo) and #5416 /
#5418 / #5421 (unsloth). Shipping it under scripts/ so any contributor
can deterministically prove a comment / docstring refactor is truly
comment-only, without manually eyeballing every line of a 4000-line
diff.

Usage:

    python scripts/verify_comment_only_diff.py [--base REF] [--head REF] path ...

Defaults: --base origin/main, --head HEAD. Paths are repo-relative.

Smoke test against the squash-merged PR #640 (an 18-file pure trim):

    git diff --name-only 02875d0~1..02875d0 \
      | xargs python scripts/verify_comment_only_diff.py --base 02875d0~1 --head 02875d0

reports OK for all 18 files.
Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: d935cdeb60

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment on lines +228 to +230
except subprocess.CalledProcessError as exc:
print(f"SKIP {path}: {exc}")
continue
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Fail on paths missing from either ref

When a path exists only on one side of the comparison, such as an added, deleted, or renamed .py/.yml file from git diff --name-only, _git_show raises and this branch prints SKIP but leaves rc at 0. That lets a PR adding a real code file pass the verifier as comment-only if all missing-side paths are skipped, so missing revisions should be treated as failures unless the caller explicitly requested skipping them.

Useful? React with 👍 / 👎.

Comment on lines +123 to +124
if isinstance(obj, str) and "\n" in obj:
return _strip_shell_comments(obj)
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Do not strip data from every multiline YAML scalar

This normalizes every multiline string, not just GitHub Actions run: bodies, so YAML data can be silently changed and still report OK. For example, an action input or env value written as message: | with lines beginning with # has those lines as scalar content, not YAML or shell comments; changing or removing them will be hidden by _strip_shell_comments. If only shell scripts are intended, the walker needs to track mapping keys and apply this only to run values.

Useful? React with 👍 / 👎.

@danielhanchen danielhanchen merged commit f37d510 into main May 14, 2026
10 of 13 checks passed
@danielhanchen danielhanchen deleted the chore/ship-comment-only-verifier branch May 14, 2026 12:02
Copy link
Copy Markdown
Contributor

@gemini-code-assist gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces scripts/verify_comment_only_diff.py, a utility designed to verify that changes in a PR are strictly limited to comments or docstrings for Python and YAML files. The script uses AST parsing and structural YAML comparison to detect code drift. Review feedback suggested that the script should fail when files are added or deleted rather than skipping them, as these constitute structural changes. Additionally, it was recommended to add a guarded import for the PyYAML dependency to provide a more helpful error message if the library is not installed.

Comment on lines +228 to +230
except subprocess.CalledProcessError as exc:
print(f"SKIP {path}: {exc}")
continue
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

The script currently skips files that cannot be retrieved from one of the git refs (e.g., when a file is added or deleted) and continues with a success status. For a tool intended to prove that a PR has zero real code changes, adding or deleting a file is a structural change that should be reported as a failure. Skipping these files could lead to false positives where a PR with new files is incorrectly verified as "comment-only".

Suggested change
except subprocess.CalledProcessError as exc:
print(f"SKIP {path}: {exc}")
continue
except subprocess.CalledProcessError as exc:
print(f"FAIL {path}: could not fetch file from both refs (added/deleted?) -- {exc}")
rc = 1
continue

import sys
from typing import Any

import yaml
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

The PyYAML library is not listed as a required dependency in pyproject.toml. For consistency with other scripts in this repository (e.g., scripts/lint_workflow_triggers.py), consider adding a guarded import with a helpful error message to guide users who might be missing the library.

Suggested change
import yaml
try:
import yaml
except ImportError:
print(
"ERROR: PyYAML is required. Install with 'pip install pyyaml'", file = sys.stderr
)
sys.exit(2)
References
  1. Avoid using try...except ImportError for libraries that are required dependencies of the project, as the check is redundant.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant