scripts: ship deterministic comment / docstring-only diff verifier #5422

Merged
danielhanchen merged 1 commit into main from chore/ship-comment-only-verifier
May 14, 2026

@danielhanchen
Member

Summary

Adds scripts/verify_comment_only_diff.py, a small standalone tool for proving a "comment trim" or "docstring refactor" PR has zero real code changes. This is what I used to gate the trim PRs #5418 / #5421 (this repo) and #640 (unsloth-zoo) and I want it in the tree so any contributor can run the same check.

What it does

Given a list of changed files and two git refs, the script reports whether each file's diff is strictly comments or docstrings:

  • .py files: parse both revs into AST, strip module / class / function docstrings, then compare ast.unparse output. Pure Python comments are discarded by ast.parse by construction, so any post-strip diff is real code.
  • .yml / .yaml files: yaml.safe_load both sides and compare the parsed Python object. If scalar values differ, also strip shell comments inside any multi-line scalar (i.e. run: | script bodies in workflows) before comparing.

Exit code 0 if every file is comment-only, 1 otherwise. Failing files print a tight diff snippet so a reviewer can spot the real change at a glance.
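
The .py check described above can be sketched roughly as follows. This is a minimal illustration of the technique and not the script's actual code: the helper names `strip_docstrings` and `is_comment_only_py` are hypothetical, and `ast.unparse` requires Python 3.9 or newer.

```python
import ast

# Node types whose first statement may be a docstring.
_DOC_OWNERS = (ast.Module, ast.ClassDef, ast.FunctionDef, ast.AsyncFunctionDef)


def strip_docstrings(tree: ast.AST) -> ast.AST:
    """Remove module / class / function docstrings in place (hypothetical helper)."""
    for node in ast.walk(tree):
        if not isinstance(node, _DOC_OWNERS):
            continue
        body = node.body
        if (
            body
            and isinstance(body[0], ast.Expr)
            and isinstance(body[0].value, ast.Constant)
            and isinstance(body[0].value.value, str)
        ):
            rest = body[1:]
            # A class or function needs a non-empty body to unparse to valid code;
            # a module may legitimately be empty.
            if not rest and not isinstance(node, ast.Module):
                rest = [ast.Pass()]
            node.body = rest
    return tree


def is_comment_only_py(before: str, after: str) -> bool:
    """True if the two sources differ only in comments and docstrings."""
    def normalize(src: str) -> str:
        return ast.unparse(strip_docstrings(ast.parse(src)))
    return normalize(before) == normalize(after)
```

Comments never reach the AST, so only docstrings need explicit handling. One side effect of this sketch is that a function whose body is only a docstring compares equal to one whose body is only `pass`.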

Usage

python scripts/verify_comment_only_diff.py [--base REF] [--head REF] path ...

Defaults: --base origin/main, --head HEAD. Paths are repo-relative.
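
For reference, a command line with these defaults could be declared with `argparse` along the following lines. This is only a sketch: the attribute names `base`, `head`, and `paths` match the `args.base` / `args.head` / `args.paths` seen in the review excerpts, but the help strings and `nargs="+"` are assumptions, and the real script's parser may differ.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser mirroring the documented usage line."""
    parser = argparse.ArgumentParser(
        description="Check that a diff touches only comments or docstrings.",
    )
    parser.add_argument("--base", default="origin/main", help="git ref to compare from")
    parser.add_argument("--head", default="HEAD", help="git ref to compare to")
    # Assumption: at least one repo-relative path is required.
    parser.add_argument("paths", nargs="+", help="repo-relative file paths")
    return parser


args = build_parser().parse_args(["--base", "v1", "a.py", "b.yml"])
print(args.base, args.head, args.paths)
```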

Typical invocation against the current branch:

git diff --name-only origin/main..HEAD \
  | xargs python scripts/verify_comment_only_diff.py --base origin/main

Smoke test

Against the squash-merged PR #5418 (a real 3-file pure-trim PR on this repo):

git diff --name-only 6994d07f~1..6994d07f \
  | xargs python scripts/verify_comment_only_diff.py --base 6994d07f~1 --head 6994d07f

reports OK for all 3 files (tests/conftest.py, unsloth/_gpu_init.py, unsloth/import_fixes.py).

Test plan

scripts/verify_comment_only_diff.py compares a list of changed files
between two git refs and reports whether each diff is strictly comments
or docstrings.

  * .py: parse both revs into AST, strip module / class / function
    docstrings, then compare ast.unparse output. Pure Python comments
    are discarded by ast.parse by construction, so any post-strip diff
    is real code.
  * .yml / .yaml: yaml.safe_load both sides and compare the parsed
    Python object; if scalar values differ, also strip shell comments
    inside any multi-line scalar (i.e. `run: |` script bodies) before
    comparing.

Exit code is 0 if every file is comment-only, 1 otherwise. The script
also prints a tight diff snippet for any FAIL line so a reviewer can
spot the real code change at a glance.

This is what I used to gate the trim PRs #5418 (this repo) and #640
(unsloth-zoo). Shipping it under scripts/ so any contributor can
deterministically prove a comment / docstring refactor is truly
comment-only, without manually eyeballing every line of a 4000-line
diff.

Usage:

    python scripts/verify_comment_only_diff.py [--base REF] [--head REF] path ...

Defaults: --base origin/main, --head HEAD. Paths are repo-relative.

Smoke test against the squash-merged PR #5418 (a real 3-file pure trim):

    git diff --name-only 6994d07~1..6994d07 \
      | xargs python scripts/verify_comment_only_diff.py --base 6994d07~1 --head 6994d07

reports OK for all 3 files.
@danielhanchen danielhanchen merged commit 0c8eb10 into main May 14, 2026
5 of 6 checks passed
@danielhanchen danielhanchen deleted the chore/ship-comment-only-verifier branch May 14, 2026 12:02

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 0461a2c7e7

Comment on lines +229 to +230
        print(f"SKIP {path}: {exc}")
        continue

P2: Treat missing ref paths as failures

When the path exists in only one ref, git diff --name-only still passes it here, but a failed git show is reported as SKIP and leaves rc unchanged. In a comment-only gate, adding a new .py/.yaml file with real code or deleting one will therefore exit 0 as long as the remaining files are OK, incorrectly certifying a non-comment change as safe; these missing-side cases should fail or be handled explicitly as additions/deletions.
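
One way to handle the missing-side case explicitly, as this comment asks, is sketched below. Everything here is hypothetical: `gate`, `show`, and `same_code` are illustrative names, with `show` standing in for a `git show` wrapper that returns `None` when the path is absent at a ref, and `same_code` standing in for the comment-only comparison.

```python
from typing import Callable, Iterable, Optional


def gate(
    paths: Iterable[str],
    show: Callable[[str, str], Optional[str]],  # hypothetical: None if path is absent at ref
    same_code: Callable[[str, str], bool],      # hypothetical: comment-only comparison
    base: str = "origin/main",
    head: str = "HEAD",
) -> int:
    """Return 0 only if every path exists on both sides and compares equal."""
    rc = 0
    for path in paths:
        before, after = show(base, path), show(head, path)
        missing = [ref for ref, text in ((base, before), (head, after)) if text is None]
        if missing:
            # An added or deleted file is never a comment-only change.
            print(f"FAIL {path}: not present at {', '.join(missing)}")
            rc = 1
            continue
        if same_code(before, after):
            print(f"OK   {path}")
        else:
            print(f"FAIL {path}")
            rc = 1
    return rc
```

Injecting `show` keeps the sketch testable without a git repository; in the script it would wrap the `git show` call.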

Contributor

@gemini-code-assist gemini-code-assist Bot left a comment

Code Review

This pull request introduces a script to verify that changes between git revisions are strictly limited to comments or docstrings. The review feedback highlights several necessary improvements: correctly handling new or deleted files by returning empty strings in _git_show rather than skipping them, making the shell comment stripping heuristic more robust and efficient, and ensuring that structural differences in YAML lists are fully reported when lengths differ. The reviewer also provided specific suggestions for better exception logging and code efficiency.

Comment on lines +224 to +232
    for path in args.paths:
        try:
            before = _git_show(args.base, path)
            after = _git_show(args.head, path)
        except subprocess.CalledProcessError as exc:
            print(f"SKIP {path}: {exc}")
            continue

        if path.endswith(".py"):

critical

The current loop skips files that do not exist in one of the revisions (e.g., new files). This is a critical flaw: a PR adding a new file with functional code would be reported as OK (or skipped with exit code 0), bypassing the verifier's purpose. By updating _git_show to return an empty string for missing files, you can remove this try...except block and properly verify that new or deleted files are indeed comment-only.

Suggested change
-    for path in args.paths:
-        try:
-            before = _git_show(args.base, path)
-            after = _git_show(args.head, path)
-        except subprocess.CalledProcessError as exc:
-            print(f"SKIP {path}: {exc}")
-            continue
-        if path.endswith(".py"):
+    for path in args.paths:
+        before = _git_show(args.base, path)
+        after = _git_show(args.head, path)
+        if path.endswith(".py"):

Comment on lines +50 to +53
def _git_show(rev: str, path: str) -> str:
    return subprocess.check_output(
        ["git", "show", f"{rev}:{path}"], text = True, stderr = subprocess.DEVNULL,
    )

high

To correctly handle newly added or deleted files, _git_show should return an empty string instead of raising an exception when a file does not exist in a specific revision. This allows the verifier to compare the empty state against the new/old content. In accordance with repository rules, the exception is caught specifically and logged at a debug level to aid in future debugging.

def _git_show(rev: str, path: str) -> str:
    try:
        return subprocess.check_output(
            ["git", "show", f"{rev}:{path}"], text = True, stderr = subprocess.DEVNULL,
        )
    except subprocess.CalledProcessError as e:
        import logging
        logging.getLogger(__name__).debug(f"File {path} not found in revision {rev}: {e}")
        return ""
References
  1. When handling exceptions, avoid broad except Exception: pass clauses. Instead, catch specific exceptions and log them (at least at a debug level) to aid in troubleshooting. If a failure is expected, log the specific exception type and its details.
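
A short check of why an empty string works as the missing side for .py files: an empty source and a comment-only source both unparse to the same empty module, while any real statement does not. The helper name `code_only` is hypothetical, and docstring stripping is left out to keep the sketch small.

```python
import ast


def code_only(src: str) -> str:
    """Unparse the AST; comments vanish because ast.parse never records them."""
    return ast.unparse(ast.parse(src))


missing_side = ""  # what the suggested _git_show returns for an absent file

# A newly added file containing only comments still counts as comment-only...
assert code_only(missing_side) == code_only("# only a comment\n")
# ...while a newly added file with real code is detected as a change.
assert code_only(missing_side) != code_only("import os\n")
```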

Comment on lines +86 to +112
def _strip_shell_comments(s: str) -> str:
    """Strip pure-comment lines and inline trailing comments from a shell
    snippet, then collapse runs of blank lines. Heuristic only: leaves a
    line untouched if it has an odd quote count (open string)."""
    out = []
    for line in s.splitlines():
        stripped = line.lstrip()
        if stripped.startswith("#"):
            continue
        has_single = line.count("'") % 2 == 0
        has_double = line.count('"') % 2 == 0
        if has_single and has_double:
            idx = line.find(" #")
            if idx >= 0:
                line = line[:idx].rstrip()
        out.append(line)
    norm = []
    prev_blank = False
    for line in out:
        if line.strip() == "":
            if prev_blank:
                continue
            prev_blank = True
        else:
            prev_blank = False
        norm.append(line)
    return "\n".join(norm).strip()

high

The heuristic for stripping inline shell comments is fragile and can lead to false negatives. It is safer to only strip whole-line comments. To improve efficiency and adhere to repository rules, the comment stripping and blank line collapsing have been combined into a single loop to avoid redundant data iterations.

Suggested change
-def _strip_shell_comments(s: str) -> str:
-    """Strip pure-comment lines and inline trailing comments from a shell
-    snippet, then collapse runs of blank lines. Heuristic only: leaves a
-    line untouched if it has an odd quote count (open string)."""
-    out = []
-    for line in s.splitlines():
-        stripped = line.lstrip()
-        if stripped.startswith("#"):
-            continue
-        has_single = line.count("'") % 2 == 0
-        has_double = line.count('"') % 2 == 0
-        if has_single and has_double:
-            idx = line.find(" #")
-            if idx >= 0:
-                line = line[:idx].rstrip()
-        out.append(line)
-    norm = []
-    prev_blank = False
-    for line in out:
-        if line.strip() == "":
-            if prev_blank:
-                continue
-            prev_blank = True
-        else:
-            prev_blank = False
-        norm.append(line)
-    return "\n".join(norm).strip()
+def _strip_shell_comments(s: str) -> str:
+    """Strip pure-comment lines from a shell snippet, then collapse runs of
+    blank lines."""
+    norm = []
+    prev_blank = False
+    for line in s.splitlines():
+        if line.lstrip().startswith("#"):
+            continue
+        if line.strip() == "":
+            if prev_blank:
+                continue
+            prev_blank = True
+        else:
+            prev_blank = False
+        norm.append(line)
+    return "\n".join(norm).strip()
References
  1. To improve efficiency, avoid redundant data iterations. Combine checks and transformations into a single loop and return computed values for callers to reuse.
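
The fragility of the inline-comment heuristic can be reproduced with a quoted string that contains ` #`. In the sketch below, `strip_inline` is a hypothetical name for the inline step lifted from the original `_strip_shell_comments`; because the quote count on the line is even, the text after ` #` is truncated even though it sits inside a string.

```python
def strip_inline(line: str) -> str:
    """Inline-comment step as written in the original _strip_shell_comments."""
    if line.count("'") % 2 == 0 and line.count('"') % 2 == 0:
        idx = line.find(" #")
        if idx >= 0:
            line = line[:idx].rstrip()
    return line


before = 'echo "closes issue #12"'
after = 'echo "closes issue #13"'

# The two commands really differ...
assert before != after
# ...but both are truncated to the same prefix, so the change goes unnoticed.
assert strip_inline(before) == strip_inline(after) == 'echo "closes issue'
```

Stripping only whole-line comments, as the suggestion does, avoids this class of miss. The trade-off is that edits to trailing inline comments then show up as real differences.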

Comment on lines +146 to +152
        if len(b) != len(a):
            print(
                f" list len at {prefix or '/'}: "
                f"{len(b)} -> {len(a)}",
            )
        for i, (bi, ai) in enumerate(zip(b, a)):
            _walk_yaml_diff(bi, ai, f"{prefix}[{i}]")

medium

When comparing lists of different lengths, zip(b, a) only iterates over the elements present in both. This means additions or removals at the end of a list might not be reported with specific indices, only as a general length difference. Iterating over the maximum length provides a more detailed report of structural changes.

Suggested change
-        if len(b) != len(a):
-            print(
-                f" list len at {prefix or '/'}: "
-                f"{len(b)} -> {len(a)}",
-            )
-        for i, (bi, ai) in enumerate(zip(b, a)):
-            _walk_yaml_diff(bi, ai, f"{prefix}[{i}]")
+    elif isinstance(b, list):
+        if len(b) != len(a):
+            print(f" list len at {prefix or '/'}: {len(b)} -> {len(a)}")
+        for i in range(max(len(b), len(a))):
+            if i >= len(b):
+                print(f" added element at {prefix}[{i}]")
+            elif i >= len(a):
+                print(f" removed element at {prefix}[{i}]")
+            else:
+                _walk_yaml_diff(b[i], a[i], f"{prefix}[{i}]")
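
The same idea can also be written with `itertools.zip_longest`, which pads the shorter list instead of truncating. The sketch below is illustrative only: `list_diff` is a hypothetical helper that returns messages instead of printing, and it reports a plain "changed element" where the real code would recurse via `_walk_yaml_diff`.

```python
from itertools import zip_longest

_MISSING = object()  # sentinel, so that None stays a legitimate list element


def list_diff(b: list, a: list, prefix: str = "") -> list[str]:
    """Report per-index differences between two lists, including the tail."""
    out = []
    for i, (bi, ai) in enumerate(zip_longest(b, a, fillvalue=_MISSING)):
        where = f"{prefix}[{i}]"
        if bi is _MISSING:
            out.append(f"added element at {where}")
        elif ai is _MISSING:
            out.append(f"removed element at {where}")
        elif bi != ai:
            out.append(f"changed element at {where}")
    return out


print(list_diff(["lint", "test"], ["lint", "test", "deploy"], "jobs"))
```

Plain `zip` would visit only the first two pairs here, so the appended third element would surface only through the length message.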
