scripts: ship deterministic comment / docstring-only diff verifier by danielhanchen · Pull Request #5422 · unslothai/unsloth

danielhanchen · 2026-05-14T12:02:25Z

Summary

Adds scripts/verify_comment_only_diff.py, a small standalone tool for proving a "comment trim" or "docstring refactor" PR has zero real code changes. This is what I used to gate the trim PRs #5418 / #5421 (this repo) and #640 (unsloth-zoo) and I want it in the tree so any contributor can run the same check.

What it does

Given a list of changed files and two git refs, the script reports whether each file's diff is strictly comments or docstrings:

.py files: parse both revs into AST, strip module / class / function docstrings, then compare ast.unparse output. Pure Python comments are discarded by ast.parse by construction, so any post-strip diff is real code.
.yml / .yaml files: yaml.safe_load both sides and compare the parsed Python object. If scalar values differ, also strip shell comments inside any multi-line scalar (i.e. run: | script bodies in workflows) before comparing.

Exit code 0 if every file is comment-only, 1 otherwise. Failing files print a tight diff snippet so a reviewer can spot the real change at a glance.

Usage

python scripts/verify_comment_only_diff.py [--base REF] [--head REF] path ...

Defaults: --base origin/main, --head HEAD. Paths are repo-relative.

Typical invocation against the current branch:

git diff --name-only origin/main..HEAD \
  | xargs python scripts/verify_comment_only_diff.py --base origin/main

Smoke test

Against the squash-merged PR #5418 (a real 3-file pure-trim PR on this repo):

git diff --name-only 6994d07f~1..6994d07f \
  | xargs python scripts/verify_comment_only_diff.py --base 6994d07f~1 --head 6994d07f

reports OK for all 3 files (tests/conftest.py, unsloth/_gpu_init.py, unsloth/import_fixes.py).

Test plan

python scripts/verify_comment_only_diff.py --help parses.
Reports OK on the 3 files of PR chore: trim verbose comments added in PR #5416 (commit 12295c1f) #5418's squash-merge.
CI passes.

scripts/verify_comment_only_diff.py compares a list of changed files between two git refs and reports whether each diff is strictly comments or docstrings. * .py: parse both revs into AST, strip module / class / function docstrings, then compare ast.unparse output. Pure Python comments are discarded by ast.parse by construction, so any post-strip diff is real code. * .yml / .yaml: yaml.safe_load both sides and compare the parsed Python object; if scalar values differ, also strip shell comments inside any multi-line scalar (i.e. `run: |` script bodies) before comparing. Exit code is 0 if every file is comment-only, 1 otherwise. The script also prints a tight diff snippet for any FAIL line so a reviewer can spot the real code change at a glance. This is what I used to gate the trim PRs #5418 (this repo) and #640 (unsloth-zoo). Shipping it under scripts/ so any contributor can deterministically prove a comment / docstring refactor is truly comment-only, without manually eyeballing every line of a 4000-line diff. Usage: python scripts/verify_comment_only_diff.py [--base REF] [--head REF] path ... Defaults: --base origin/main, --head HEAD. Paths are repo-relative. Smoke test against the squash-merged PR #5418 (a real 3-file pure trim): git diff --name-only 6994d07~1..6994d07 \ | xargs python scripts/verify_comment_only_diff.py --base 6994d07~1 --head 6994d07 reports OK for all 3 files.

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 0461a2c7e7

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

chatgpt-codex-connector · 2026-05-14T12:03:45Z

+            print(f"SKIP {path}: {exc}")
+            continue


Treat missing ref paths as failures

When the path exists in only one ref, git diff --name-only still passes it here, but a failed git show is reported as SKIP and leaves rc unchanged. In a comment-only gate, adding a new .py/.yaml file with real code or deleting one will therefore exit 0 as long as the remaining files are OK, incorrectly certifying a non-comment change as safe; these missing-side cases should fail or be handled explicitly as additions/deletions.

Useful? React with 👍 / 👎.

gemini-code-assist

Code Review

This pull request introduces a script to verify that changes between git revisions are strictly limited to comments or docstrings. The review feedback highlights several necessary improvements: correctly handling new or deleted files by returning empty strings in _git_show rather than skipping them, making the shell comment stripping heuristic more robust and efficient, and ensuring that structural differences in YAML lists are fully reported when lengths differ. The reviewer also provided specific suggestions for better exception logging and code efficiency.

gemini-code-assist · 2026-05-14T12:09:26Z

+    for path in args.paths:
+        try:
+            before = _git_show(args.base, path)
+            after = _git_show(args.head, path)
+        except subprocess.CalledProcessError as exc:
+            print(f"SKIP {path}: {exc}")
+            continue
+
+        if path.endswith(".py"):


The current loop skips files that do not exist in one of the revisions (e.g., new files). This is a critical flaw: a PR adding a new file with functional code would be reported as OK (or skipped with exit code 0), bypassing the verifier's purpose. By updating _git_show to return an empty string for missing files, you can remove this try...except block and properly verify that new or deleted files are indeed comment-only.

Suggested change

for path in args.paths:

try:

before = _git_show(args.base, path)

after = _git_show(args.head, path)

except subprocess.CalledProcessError as exc:

print(f"SKIP {path}: {exc}")

continue

if path.endswith(".py"):

for path in args.paths:

before = _git_show(args.base, path)

after = _git_show(args.head, path)

if path.endswith(".py"):

gemini-code-assist · 2026-05-14T12:09:26Z

+def _git_show(rev: str, path: str) -> str:
+    return subprocess.check_output(
+        ["git", "show", f"{rev}:{path}"], text = True, stderr = subprocess.DEVNULL,
+    )


To correctly handle newly added or deleted files, _git_show should return an empty string instead of raising an exception when a file does not exist in a specific revision. This allows the verifier to compare the empty state against the new/old content. In accordance with repository rules, the exception is caught specifically and logged at a debug level to aid in future debugging.

def _git_show(rev: str, path: str) -> str: try: return subprocess.check_output( ["git", "show", f"{rev}:{path}"], text = True, stderr = subprocess.DEVNULL, ) except subprocess.CalledProcessError as e: import logging logging.getLogger(__name__).debug(f"File {path} not found in revision {rev}: {e}") return ""

References

When handling exceptions, avoid broad except Exception: pass clauses. Instead, catch specific exceptions and log them (at least at a debug level) to aid in troubleshooting. If a failure is expected, log the specific exception type and its details.

gemini-code-assist · 2026-05-14T12:09:26Z

+def _strip_shell_comments(s: str) -> str:
+    """Strip pure-comment lines and inline trailing comments from a shell
+    snippet, then collapse runs of blank lines. Heuristic only: leaves a
+    line untouched if it has an odd quote count (open string)."""
+    out = []
+    for line in s.splitlines():
+        stripped = line.lstrip()
+        if stripped.startswith("#"):
+            continue
+        has_single = line.count("'") % 2 == 0
+        has_double = line.count('"') % 2 == 0
+        if has_single and has_double:
+            idx = line.find(" #")
+            if idx >= 0:
+                line = line[:idx].rstrip()
+        out.append(line)
+    norm = []
+    prev_blank = False
+    for line in out:
+        if line.strip() == "":
+            if prev_blank:
+                continue
+            prev_blank = True
+        else:
+            prev_blank = False
+        norm.append(line)
+    return "\n".join(norm).strip()


The heuristic for stripping inline shell comments is fragile and can lead to false negatives. It is safer to only strip whole-line comments. To improve efficiency and adhere to repository rules, the comment stripping and blank line collapsing have been combined into a single loop to avoid redundant data iterations.

Suggested change

def _strip_shell_comments(s: str) -> str:

"""Strip pure-comment lines and inline trailing comments from a shell

snippet, then collapse runs of blank lines. Heuristic only: leaves a

line untouched if it has an odd quote count (open string)."""

out = []

for line in s.splitlines():

stripped = line.lstrip()

if stripped.startswith("#"):

continue

has_single = line.count("'") % 2 == 0

has_double = line.count('"') % 2 == 0

if has_single and has_double:

idx = line.find(" #")

if idx >= 0:

line = line[:idx].rstrip()

out.append(line)

norm = []

prev_blank = False

for line in out:

if line.strip() == "":

if prev_blank:

continue

prev_blank = True

else:

prev_blank = False

norm.append(line)

return "\n".join(norm).strip()

def _strip_shell_comments(s: str) -> str:

"""Strip pure-comment lines from a shell snippet, then collapse runs of

blank lines."""

norm = []

prev_blank = False

for line in s.splitlines():

if line.lstrip().startswith("#"):

continue

if line.strip() == "":

if prev_blank:

continue

prev_blank = True

else:

prev_blank = False

norm.append(line)

return "\n".join(norm).strip()

References

To improve efficiency, avoid redundant data iterations. Combine checks and transformations into a single loop and return computed values for callers to reuse.

gemini-code-assist · 2026-05-14T12:09:26Z

+        if len(b) != len(a):
+            print(
+                f"     list len at {prefix or '/'}: "
+                f"{len(b)} -> {len(a)}",
+            )
+        for i, (bi, ai) in enumerate(zip(b, a)):
+            _walk_yaml_diff(bi, ai, f"{prefix}[{i}]")


When comparing lists of different lengths, zip(b, a) only iterates over the elements present in both. This means additions or removals at the end of a list might not be reported with specific indices, only as a general length difference. Iterating over the maximum length provides a more detailed report of structural changes.

Suggested change

if len(b) != len(a):

print(

f" list len at {prefix or '/'}: "

f"{len(b)} -> {len(a)}",

)

for i, (bi, ai) in enumerate(zip(b, a)):

_walk_yaml_diff(bi, ai, f"{prefix}[{i}]")

elif isinstance(b, list):

if len(b) != len(a):

print(f" list len at {prefix or '/'}: {len(b)} -> {len(a)}")

for i in range(max(len(b), len(a))):

if i >= len(b):

print(f" added element at {prefix}[{i}]")

elif i >= len(a):

print(f" removed element at {prefix}[{i}]")

else:

_walk_yaml_diff(b[i], a[i], f"{prefix}[{i}]")

danielhanchen requested a review from rolandtannous as a code owner May 14, 2026 12:02

danielhanchen merged commit 0c8eb10 into main May 14, 2026
5 of 6 checks passed

danielhanchen deleted the chore/ship-comment-only-verifier branch May 14, 2026 12:02

chatgpt-codex-connector Bot reviewed May 14, 2026

View reviewed changes

gemini-code-assist Bot reviewed May 14, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

scripts: ship deterministic comment / docstring-only diff verifier#5422

scripts: ship deterministic comment / docstring-only diff verifier#5422
danielhanchen merged 1 commit into
mainfrom
chore/ship-comment-only-verifier

danielhanchen commented May 14, 2026

Uh oh!

Uh oh!

chatgpt-codex-connector Bot left a comment

Uh oh!

chatgpt-codex-connector Bot May 14, 2026

Uh oh!

gemini-code-assist Bot left a comment

Uh oh!

gemini-code-assist Bot May 14, 2026

Uh oh!

gemini-code-assist Bot May 14, 2026

Uh oh!

gemini-code-assist Bot May 14, 2026

Uh oh!

gemini-code-assist Bot May 14, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

-        if len(b) != len(a):
-            print(
-                f"     list len at {prefix or '/'}: "
-                f"{len(b)} -> {len(a)}",
-            )
-        for i, (bi, ai) in enumerate(zip(b, a)):
-            _walk_yaml_diff(bi, ai, f"{prefix}[{i}]")
+    elif isinstance(b, list):
+        if len(b) != len(a):
+            print(f"     list len at {prefix or '/'}: {len(b)} -> {len(a)}")
+        for i in range(max(len(b), len(a))):
+            if i >= len(b):
+                print(f"     added element at {prefix}[{i}]")
+            elif i >= len(a):
+                print(f"     removed element at {prefix}[{i}]")
+            else:
+                _walk_yaml_diff(b[i], a[i], f"{prefix}[{i}]")

Uh oh!

Conversation

danielhanchen commented May 14, 2026

Summary

What it does

Usage

Smoke test

Test plan

Uh oh!

Uh oh!

chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

💡 Codex Review

Uh oh!

chatgpt-codex-connector Bot May 14, 2026

Choose a reason for hiding this comment

Uh oh!

gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

Code Review

Uh oh!

gemini-code-assist Bot May 14, 2026

Choose a reason for hiding this comment

Uh oh!

gemini-code-assist Bot May 14, 2026

Choose a reason for hiding this comment

Uh oh!

gemini-code-assist Bot May 14, 2026

Choose a reason for hiding this comment

Uh oh!

gemini-code-assist Bot May 14, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant