fix(tests/sh): accept pinned tokenizers line after #5359 by danielhanchen · Pull Request #5361 · unslothai/unsloth

danielhanchen · 2026-05-11T09:45:16Z

Follow-up to #5359. That PR pinned the tokenizers line in studio/backend/requirements/no-torch-runtime.txt from bare tokenizers to tokenizers<=0.23.0 to stop pip from resolving to 0.23.1+ (which transformers rejects at import time).

tests/sh/test_torch_constraint.sh was still asserting the literal ^tokenizers$ regex, so it now fails on main. Surfaced first on #5312's Backend CI Repo tests (CPU) step (https://github.com/unslothai/unsloth/actions/runs/25662435376/job/75326242237):

=== Structural: tokenizers in no-torch-runtime.txt ===
  FAIL: tokenizers present as standalone line (expected '1', got '0')
  FAIL: tokenizers before transformers (expected 'yes', got 'no')

Relax the regex to ^tokenizers([<>=!,~ ]|$) so it matches both bare and version-constrained forms. Preserves the original intent (verify the tokenizers package is listed in the file, before transformers).

Test plan

bash tests/sh/test_torch_constraint.sh locally: 24 PASS, 0 FAIL on current main (which contains fix: unblock 4 tests deselected/skipped in #5312 (real bugs) #5359's pin).
The two relaxed regexes still match a hypothetical bare tokenizers line as well (so the contract holds in either direction).
CI green on this PR.

#5359 pinned the tokenizers line in studio/backend/requirements/no-torch-runtime.txt from bare `tokenizers` to `tokenizers<=0.23.0` to stop pip from resolving to 0.23.1+ (which transformers rejects at import time). The shell test in tests/sh/test_torch_constraint.sh was still asserting the literal `^tokenizers$` regex, which fails on the pinned form. Surfaced as a hard fail on PR #5312's Backend CI Repo tests (CPU) step: === Structural: tokenizers in no-torch-runtime.txt === FAIL: tokenizers present as standalone line (expected '1', got '0') FAIL: tokenizers before transformers (expected 'yes', got 'no') Relax the regex to `^tokenizers([<>=!,~ ]|$)` so it matches both bare and version-constrained forms, which preserves the original intent (verify tokenizers is present in the file, before transformers). Verified locally: 24 PASS, 0 FAIL.

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 0357c2d9ee

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

chatgpt-codex-connector · 2026-05-11T09:46:56Z

+# Accept bare `tokenizers` or version-pinned forms (e.g. `tokenizers<=0.23.0`).
+# Per #5359 the line is pinned because unpinned resolves to 0.23.1+ which
+# transformers rejects at import time.
+_has_tokenizers=$(grep -cE '^tokenizers([<>=!,~ ]|$)' "$NO_TORCH_RT" || true)


Assert the safe tokenizers bound

Because this check now accepts any version operator after tokenizers, the shell installer test still passes for lines like tokenizers>=0.23.1 or tokenizers!=0.23.0, even though the adjacent comment says 0.23.1+ is what breaks transformers imports. Since .github/workflows/studio-backend-ci.yml runs this script in CI, this weakens the regression guard for the pinned no-torch requirement; constrain the accepted line to the known-safe bound (or explicitly reject 0.23.1+) rather than any pinned form.

Useful? React with 👍 / 👎.

gemini-code-assist

Code Review

This pull request updates the test_torch_constraint.sh script to allow for version-pinned tokenizers entries in the requirements file. The review feedback suggests refining the regular expressions used in grep to more robustly identify the package name by accounting for various PEP 508 separators like extras, environment markers, or URLs.

gemini-code-assist · 2026-05-11T09:47:29Z

+# Accept bare `tokenizers` or version-pinned forms (e.g. `tokenizers<=0.23.0`).
+# Per #5359 the line is pinned because unpinned resolves to 0.23.1+ which
+# transformers rejects at import time.
+_has_tokenizers=$(grep -cE '^tokenizers([<>=!,~ ]|$)' "$NO_TORCH_RT" || true)


The regex ^tokenizers([<>=!,~ ]|$) is an improvement, but it is still somewhat brittle as it doesn't account for all valid PEP 508 separators that can follow a package name in a requirements file, such as extras ([), environment markers (;), or URLs (@). A more robust approach is to match the package name followed by any character that is not valid in a package name (i.e., not alphanumeric, ., _, or -), or the end of the line.

Suggested change

_has_tokenizers=$(grep -cE '^tokenizers([<>=!,~ ]|$)' "$NO_TORCH_RT" || true)

_has_tokenizers=$(grep -cE '^tokenizers([^a-zA-Z0-9._-]|$) ' "$NO_TORCH_RT" || true)

gemini-code-assist · 2026-05-11T09:47:29Z


 # tokenizers before transformers
-_tok_line=$(grep -n '^tokenizers$' "$NO_TORCH_RT" | head -1 | cut -d: -f1)
+_tok_line=$(grep -nE '^tokenizers([<>=!,~ ]|$)' "$NO_TORCH_RT" | head -1 | cut -d: -f1)


Applying the same robustness improvement here to ensure consistency in how the tokenizers line is identified.

Suggested change

_tok_line=$(grep -nE '^tokenizers([<>=!,~ ]|$)' "$NO_TORCH_RT" | head -1 | cut -d: -f1)

_tok_line=$(grep -nE '^tokenizers([^a-zA-Z0-9._-]|$) ' "$NO_TORCH_RT" | head -1 | cut -d: -f1)

Address bot review feedback on #5361: * Codex (P2): the previous relaxed regex `^tokenizers([<>=!,~ ]|$)` accepted any version operator, so `tokenizers>=0.23.1` would still pass the test even though that line would re-introduce the import failure #5359 fixed. * Gemini (medium): the boundary char class did not cover all PEP 508 separators (`[`, `;`, `@`). Replace the single check with two: 1. Loose: `^tokenizers([^a-zA-Z0-9._-]|$)` confirms the package is listed (covers extras, env markers, URLs, bare line). 2. Tight regression guard: pipe those lines through a second grep that requires `<=0.23.0` or the functionally equivalent `<0.23.1`. Rejects bare `tokenizers`, `>=0.22.0` (no upper bound), `>=0.23.1`, `!=0.23.0`, `<=0.24.0`, etc. Verified locally: - Current main (tokenizers<=0.23.0): 25 PASS, 0 FAIL. - Spot-check with the bug reverted (bare `tokenizers`): the new "tokenizers pinned with upper bound excluding 0.23.1+" check FAILS as intended; the original "listed" check still passes.

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 42342521df

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

chatgpt-codex-connector · 2026-05-11T09:51:38Z

+# Two-stage grep: pick lines that start with the tokenizers package
+# name (PEP 508 name boundary), then require a safe upper bound.
+_has_safe_pin=$(grep -E "$_TOK_RE" "$NO_TORCH_RT" \
+    | grep -cE '(<=[[:space:]]*0\.23\.0|<[[:space:]]*0\.23\.1)' \


Require a delimiter after the tokenizers cap

Fresh evidence in the updated guard is that the safe-pin regex now matches only a prefix of the version, so lines such as tokenizers<0.23.10 or tokenizers<=0.23.01 still pass this test even though they allow or normalize to an unsafe 0.23.1+ release. In the no-torch runtime regression check, that means the test can give a false pass while the resolver may still install a tokenizer version that breaks the documented transformers import; add a version terminator/package-specifier boundary after 0.23.0/0.23.1.

Useful? React with 👍 / 👎.

danielhanchen requested a review from rolandtannous as a code owner May 11, 2026 09:45

chatgpt-codex-connector Bot reviewed May 11, 2026

View reviewed changes

gemini-code-assist Bot reviewed May 11, 2026

View reviewed changes

chatgpt-codex-connector Bot reviewed May 11, 2026

View reviewed changes

danielhanchen merged commit 2fd21a7 into main May 11, 2026
12 checks passed

danielhanchen deleted the fix/tokenizers-shell-test-regex branch May 11, 2026 09:58

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

fix(tests/sh): accept pinned tokenizers line after #5359#5361

fix(tests/sh): accept pinned tokenizers line after #5359#5361
danielhanchen merged 2 commits into
mainfrom
fix/tokenizers-shell-test-regex

danielhanchen commented May 11, 2026

Uh oh!

chatgpt-codex-connector Bot left a comment

Uh oh!

chatgpt-codex-connector Bot May 11, 2026

Uh oh!

gemini-code-assist Bot left a comment

Uh oh!

gemini-code-assist Bot May 11, 2026

Uh oh!

gemini-code-assist Bot May 11, 2026

Uh oh!

chatgpt-codex-connector Bot left a comment

Uh oh!

chatgpt-codex-connector Bot May 11, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

	_has_tokenizers=$(grep -cE '^tokenizers([<>=!,~ ]\|$)' "$NO_TORCH_RT" \|\| true)
	_has_tokenizers=$(grep -cE '^tokenizers([^a-zA-Z0-9._-]\|$) ' "$NO_TORCH_RT" \|\| true)

	_tok_line=$(grep -nE '^tokenizers([<>=!,~ ]\|$)' "$NO_TORCH_RT" \| head -1 \| cut -d: -f1)
	_tok_line=$(grep -nE '^tokenizers([^a-zA-Z0-9._-]\|$) ' "$NO_TORCH_RT" \| head -1 \| cut -d: -f1)

Uh oh!

Conversation

danielhanchen commented May 11, 2026

Test plan

Uh oh!

chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

💡 Codex Review

Uh oh!

chatgpt-codex-connector Bot May 11, 2026

Choose a reason for hiding this comment

Uh oh!

gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

Code Review

Uh oh!

gemini-code-assist Bot May 11, 2026

Choose a reason for hiding this comment

Uh oh!

gemini-code-assist Bot May 11, 2026

Choose a reason for hiding this comment

Uh oh!

chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

💡 Codex Review

Uh oh!

chatgpt-codex-connector Bot May 11, 2026

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant