fix(tests/sh): accept pinned tokenizers line after #5359#5361
Conversation
#5359 pinned the tokenizers line in studio/backend/requirements/no-torch-runtime.txt from bare `tokenizers` to `tokenizers<=0.23.0` to stop pip from resolving to 0.23.1+ (which transformers rejects at import time). The shell test in tests/sh/test_torch_constraint.sh was still asserting the literal `^tokenizers$` regex, which fails on the pinned form. Surfaced as a hard fail on PR #5312's Backend CI Repo tests (CPU) step: === Structural: tokenizers in no-torch-runtime.txt === FAIL: tokenizers present as standalone line (expected '1', got '0') FAIL: tokenizers before transformers (expected 'yes', got 'no') Relax the regex to `^tokenizers([<>=!,~ ]|$)` so it matches both bare and version-constrained forms, which preserves the original intent (verify tokenizers is present in the file, before transformers). Verified locally: 24 PASS, 0 FAIL.
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 0357c2d9ee
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| # Accept bare `tokenizers` or version-pinned forms (e.g. `tokenizers<=0.23.0`). | ||
| # Per #5359 the line is pinned because unpinned resolves to 0.23.1+ which | ||
| # transformers rejects at import time. | ||
| _has_tokenizers=$(grep -cE '^tokenizers([<>=!,~ ]|$)' "$NO_TORCH_RT" || true) |
There was a problem hiding this comment.
Assert the safe tokenizers bound
Because this check now accepts any version operator after tokenizers, the shell installer test still passes for lines like tokenizers>=0.23.1 or tokenizers!=0.23.0, even though the adjacent comment says 0.23.1+ is what breaks transformers imports. Since .github/workflows/studio-backend-ci.yml runs this script in CI, this weakens the regression guard for the pinned no-torch requirement; constrain the accepted line to the known-safe bound (or explicitly reject 0.23.1+) rather than any pinned form.
Useful? React with 👍 / 👎.
There was a problem hiding this comment.
Code Review
This pull request updates the test_torch_constraint.sh script to allow for version-pinned tokenizers entries in the requirements file. The review feedback suggests refining the regular expressions used in grep to more robustly identify the package name by accounting for various PEP 508 separators like extras, environment markers, or URLs.
| # Accept bare `tokenizers` or version-pinned forms (e.g. `tokenizers<=0.23.0`). | ||
| # Per #5359 the line is pinned because unpinned resolves to 0.23.1+ which | ||
| # transformers rejects at import time. | ||
| _has_tokenizers=$(grep -cE '^tokenizers([<>=!,~ ]|$)' "$NO_TORCH_RT" || true) |
There was a problem hiding this comment.
The regex ^tokenizers([<>=!,~ ]|$) is an improvement, but it is still somewhat brittle as it doesn't account for all valid PEP 508 separators that can follow a package name in a requirements file, such as extras ([), environment markers (;), or URLs (@). A more robust approach is to match the package name followed by any character that is not valid in a package name (i.e., not alphanumeric, ., _, or -), or the end of the line.
| _has_tokenizers=$(grep -cE '^tokenizers([<>=!,~ ]|$)' "$NO_TORCH_RT" || true) | |
| _has_tokenizers=$(grep -cE '^tokenizers([^a-zA-Z0-9._-]|$) ' "$NO_TORCH_RT" || true) |
|
|
||
| # tokenizers before transformers | ||
| _tok_line=$(grep -n '^tokenizers$' "$NO_TORCH_RT" | head -1 | cut -d: -f1) | ||
| _tok_line=$(grep -nE '^tokenizers([<>=!,~ ]|$)' "$NO_TORCH_RT" | head -1 | cut -d: -f1) |
There was a problem hiding this comment.
Applying the same robustness improvement here to ensure consistency in how the tokenizers line is identified.
| _tok_line=$(grep -nE '^tokenizers([<>=!,~ ]|$)' "$NO_TORCH_RT" | head -1 | cut -d: -f1) | |
| _tok_line=$(grep -nE '^tokenizers([^a-zA-Z0-9._-]|$) ' "$NO_TORCH_RT" | head -1 | cut -d: -f1) |
Address bot review feedback on #5361: * Codex (P2): the previous relaxed regex `^tokenizers([<>=!,~ ]|$)` accepted any version operator, so `tokenizers>=0.23.1` would still pass the test even though that line would re-introduce the import failure #5359 fixed. * Gemini (medium): the boundary char class did not cover all PEP 508 separators (`[`, `;`, `@`). Replace the single check with two: 1. Loose: `^tokenizers([^a-zA-Z0-9._-]|$)` confirms the package is listed (covers extras, env markers, URLs, bare line). 2. Tight regression guard: pipe those lines through a second grep that requires `<=0.23.0` or the functionally equivalent `<0.23.1`. Rejects bare `tokenizers`, `>=0.22.0` (no upper bound), `>=0.23.1`, `!=0.23.0`, `<=0.24.0`, etc. Verified locally: - Current main (tokenizers<=0.23.0): 25 PASS, 0 FAIL. - Spot-check with the bug reverted (bare `tokenizers`): the new "tokenizers pinned with upper bound excluding 0.23.1+" check FAILS as intended; the original "listed" check still passes.
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 42342521df
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| # Two-stage grep: pick lines that start with the tokenizers package | ||
| # name (PEP 508 name boundary), then require a safe upper bound. | ||
| _has_safe_pin=$(grep -E "$_TOK_RE" "$NO_TORCH_RT" \ | ||
| | grep -cE '(<=[[:space:]]*0\.23\.0|<[[:space:]]*0\.23\.1)' \ |
There was a problem hiding this comment.
Require a delimiter after the tokenizers cap
Fresh evidence in the updated guard is that the safe-pin regex now matches only a prefix of the version, so lines such as tokenizers<0.23.10 or tokenizers<=0.23.01 still pass this test even though they allow or normalize to an unsafe 0.23.1+ release. In the no-torch runtime regression check, that means the test can give a false pass while the resolver may still install a tokenizer version that breaks the documented transformers import; add a version terminator/package-specifier boundary after 0.23.0/0.23.1.
Useful? React with 👍 / 👎.
Follow-up to #5359. That PR pinned the tokenizers line in
studio/backend/requirements/no-torch-runtime.txtfrom baretokenizerstotokenizers<=0.23.0to stop pip from resolving to 0.23.1+ (which transformers rejects at import time).tests/sh/test_torch_constraint.shwas still asserting the literal^tokenizers$regex, so it now fails on main. Surfaced first on #5312's Backend CI Repo tests (CPU) step (https://github.com/unslothai/unsloth/actions/runs/25662435376/job/75326242237):Relax the regex to
^tokenizers([<>=!,~ ]|$)so it matches both bare and version-constrained forms. Preserves the original intent (verify the tokenizers package is listed in the file, before transformers).Test plan
bash tests/sh/test_torch_constraint.shlocally: 24 PASS, 0 FAIL on currentmain(which contains fix: unblock 4 tests deselected/skipped in #5312 (real bugs) #5359's pin).tokenizersline as well (so the contract holds in either direction).