Skip to content

fix(tests/sh): accept pinned tokenizers line after #5359#5361

Merged
danielhanchen merged 2 commits into
mainfrom
fix/tokenizers-shell-test-regex
May 11, 2026
Merged

fix(tests/sh): accept pinned tokenizers line after #5359#5361
danielhanchen merged 2 commits into
mainfrom
fix/tokenizers-shell-test-regex

Conversation

@danielhanchen
Copy link
Copy Markdown
Member

Follow-up to #5359. That PR pinned the tokenizers line in studio/backend/requirements/no-torch-runtime.txt from bare tokenizers to tokenizers<=0.23.0 to stop pip from resolving to 0.23.1+ (which transformers rejects at import time).

tests/sh/test_torch_constraint.sh was still asserting the literal ^tokenizers$ regex, so it now fails on main. Surfaced first on #5312's Backend CI Repo tests (CPU) step (https://github.com/unslothai/unsloth/actions/runs/25662435376/job/75326242237):

=== Structural: tokenizers in no-torch-runtime.txt ===
  FAIL: tokenizers present as standalone line (expected '1', got '0')
  FAIL: tokenizers before transformers (expected 'yes', got 'no')

Relax the regex to ^tokenizers([<>=!,~ ]|$) so it matches both bare and version-constrained forms. Preserves the original intent (verify the tokenizers package is listed in the file, before transformers).

Test plan

#5359 pinned the tokenizers line in
studio/backend/requirements/no-torch-runtime.txt from bare
`tokenizers` to `tokenizers<=0.23.0` to stop pip from resolving to
0.23.1+ (which transformers rejects at import time). The shell test
in tests/sh/test_torch_constraint.sh was still asserting the literal
`^tokenizers$` regex, which fails on the pinned form. Surfaced as
a hard fail on PR #5312's Backend CI Repo tests (CPU) step:

  === Structural: tokenizers in no-torch-runtime.txt ===
    FAIL: tokenizers present as standalone line (expected '1', got '0')
    FAIL: tokenizers before transformers (expected 'yes', got 'no')

Relax the regex to `^tokenizers([<>=!,~ ]|$)` so it matches both bare
and version-constrained forms, which preserves the original intent
(verify tokenizers is present in the file, before transformers).
Verified locally: 24 PASS, 0 FAIL.
Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 0357c2d9ee

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread tests/sh/test_torch_constraint.sh Outdated
# Accept bare `tokenizers` or version-pinned forms (e.g. `tokenizers<=0.23.0`).
# Per #5359 the line is pinned because unpinned resolves to 0.23.1+ which
# transformers rejects at import time.
_has_tokenizers=$(grep -cE '^tokenizers([<>=!,~ ]|$)' "$NO_TORCH_RT" || true)
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Assert the safe tokenizers bound

Because this check now accepts any version operator after tokenizers, the shell installer test still passes for lines like tokenizers>=0.23.1 or tokenizers!=0.23.0, even though the adjacent comment says 0.23.1+ is what breaks transformers imports. Since .github/workflows/studio-backend-ci.yml runs this script in CI, this weakens the regression guard for the pinned no-torch requirement; constrain the accepted line to the known-safe bound (or explicitly reject 0.23.1+) rather than any pinned form.

Useful? React with 👍 / 👎.

Copy link
Copy Markdown
Contributor

@gemini-code-assist gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request updates the test_torch_constraint.sh script to allow for version-pinned tokenizers entries in the requirements file. The review feedback suggests refining the regular expressions used in grep to more robustly identify the package name by accounting for various PEP 508 separators like extras, environment markers, or URLs.

Comment thread tests/sh/test_torch_constraint.sh Outdated
# Accept bare `tokenizers` or version-pinned forms (e.g. `tokenizers<=0.23.0`).
# Per #5359 the line is pinned because unpinned resolves to 0.23.1+ which
# transformers rejects at import time.
_has_tokenizers=$(grep -cE '^tokenizers([<>=!,~ ]|$)' "$NO_TORCH_RT" || true)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

The regex ^tokenizers([<>=!,~ ]|$) is an improvement, but it is still somewhat brittle as it doesn't account for all valid PEP 508 separators that can follow a package name in a requirements file, such as extras ([), environment markers (;), or URLs (@). A more robust approach is to match the package name followed by any character that is not valid in a package name (i.e., not alphanumeric, ., _, or -), or the end of the line.

Suggested change
_has_tokenizers=$(grep -cE '^tokenizers([<>=!,~ ]|$)' "$NO_TORCH_RT" || true)
_has_tokenizers=$(grep -cE '^tokenizers([^a-zA-Z0-9._-]|$) ' "$NO_TORCH_RT" || true)

Comment thread tests/sh/test_torch_constraint.sh Outdated

# tokenizers before transformers
_tok_line=$(grep -n '^tokenizers$' "$NO_TORCH_RT" | head -1 | cut -d: -f1)
_tok_line=$(grep -nE '^tokenizers([<>=!,~ ]|$)' "$NO_TORCH_RT" | head -1 | cut -d: -f1)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

Applying the same robustness improvement here to ensure consistency in how the tokenizers line is identified.

Suggested change
_tok_line=$(grep -nE '^tokenizers([<>=!,~ ]|$)' "$NO_TORCH_RT" | head -1 | cut -d: -f1)
_tok_line=$(grep -nE '^tokenizers([^a-zA-Z0-9._-]|$) ' "$NO_TORCH_RT" | head -1 | cut -d: -f1)

Address bot review feedback on #5361:

* Codex (P2): the previous relaxed regex `^tokenizers([<>=!,~ ]|$)`
  accepted any version operator, so `tokenizers>=0.23.1` would still
  pass the test even though that line would re-introduce the import
  failure #5359 fixed.

* Gemini (medium): the boundary char class did not cover all PEP 508
  separators (`[`, `;`, `@`).

Replace the single check with two:

1. Loose: `^tokenizers([^a-zA-Z0-9._-]|$)` confirms the package is
   listed (covers extras, env markers, URLs, bare line).

2. Tight regression guard: pipe those lines through a second grep
   that requires `<=0.23.0` or the functionally equivalent
   `<0.23.1`. Rejects bare `tokenizers`, `>=0.22.0` (no upper
   bound), `>=0.23.1`, `!=0.23.0`, `<=0.24.0`, etc.

Verified locally:
- Current main (tokenizers<=0.23.0): 25 PASS, 0 FAIL.
- Spot-check with the bug reverted (bare `tokenizers`): the new
  "tokenizers pinned with upper bound excluding 0.23.1+" check
  FAILS as intended; the original "listed" check still passes.
Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 42342521df

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

# Two-stage grep: pick lines that start with the tokenizers package
# name (PEP 508 name boundary), then require a safe upper bound.
_has_safe_pin=$(grep -E "$_TOK_RE" "$NO_TORCH_RT" \
| grep -cE '(<=[[:space:]]*0\.23\.0|<[[:space:]]*0\.23\.1)' \
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Require a delimiter after the tokenizers cap

Fresh evidence in the updated guard is that the safe-pin regex now matches only a prefix of the version, so lines such as tokenizers<0.23.10 or tokenizers<=0.23.01 still pass this test even though they allow or normalize to an unsafe 0.23.1+ release. In the no-torch runtime regression check, that means the test can give a false pass while the resolver may still install a tokenizer version that breaks the documented transformers import; add a version terminator/package-specifier boundary after 0.23.0/0.23.1.

Useful? React with 👍 / 👎.

@danielhanchen danielhanchen merged commit 2fd21a7 into main May 11, 2026
12 checks passed
@danielhanchen danielhanchen deleted the fix/tokenizers-shell-test-regex branch May 11, 2026 09:58
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant