
docs(core): Replace tiktoken references with gpt-tokenizer#1413

Merged
yamadashy merged 2 commits into main from docs/remove-tiktoken-references
Apr 6, 2026

Conversation

@yamadashy
Owner

@yamadashy yamadashy commented Apr 6, 2026

Update all documentation, build configuration, and source comments to reflect the completed migration from tiktoken to gpt-tokenizer.

Changes

Documentation (README + 15 language variants):

  • Update tokenCount.encoding description: replaced tiktoken-specific wording with "OpenAI-compatible tokenization" and linked to gpt-tokenizer
  • Remove tiktoken from bundling external dependencies list — gpt-tokenizer is pure JS, no longer needs to be external

Build infrastructure:

  • Remove tiktoken COPY line from website/server/Dockerfile
  • Remove tiktoken from external array in website/server/scripts/bundle.mjs
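
To illustrate why the external entry could go, here is a minimal TypeScript sketch of how an externals list decides what stays out of a bundle. The `isExternal` helper and both arrays are hypothetical: the matcher is a simplified stand-in rather than Rolldown's actual resolution logic, and the real array in bundle.mjs may hold other entries.

```typescript
// Simplified stand-in for a bundler's "external" check (assumption, not Rolldown's real logic).
// A module is external if it matches a listed package name or one of its subpaths.
function isExternal(moduleId: string, external: readonly string[]): boolean {
  return external.some((name) => moduleId === name || moduleId.startsWith(`${name}/`));
}

// Hypothetical lists for illustration only.
const externalBefore: readonly string[] = ['tinypool', 'tiktoken'];
const externalAfter: readonly string[] = ['tinypool'];

// With the shorter list, a pure-JS tokenizer import is inlined into the bundle,
// so nothing extra has to be copied into the runtime image for it.
const tokenizerIsBundled = !isExternal('gpt-tokenizer/encoding/o200k_base', externalAfter);
```

Anything still listed as external must be present in node_modules at runtime, which is why an external entry and a Dockerfile COPY line tend to be added or removed together.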

Source code:

  • Simplify comments in TokenCounter.ts — remove "old tiktoken behavior" migration artifacts

Lock files:

  • Regenerate scripts/memory/package-lock.json to remove stale tiktoken reference
  • website/server/package-lock.json still references tiktoken via the GitHub-linked repomix dependency — will resolve automatically after this merges and npm install is re-run

Checklist

  • Run npm run test
  • Run npm run lint

🤖 Generated with Claude Code



@coderabbitai
Contributor

coderabbitai bot commented Apr 6, 2026

Important

Review skipped

Auto incremental reviews are disabled on this repository.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 67124ae2-3868-4f5a-85ee-c031b76b4d40

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.

Walkthrough

This pull request replaces all references to OpenAI's tiktoken library with gpt-tokenizer across documentation and bundling configuration. It updates configuration guides in multiple languages, removes tiktoken from external dependency lists, updates source code comments, and modifies the Dockerfile and bundle script so that tiktoken is no longer copied into the runtime image or listed as an external dependency.

Changes

| Cohort / File(s) | Summary |
|---|---|
| Documentation - Configuration Guides: README.md, website/client/src/*/guide/configuration.md | Updated tokenCount.encoding descriptions to reference OpenAI-compatible tokenization and gpt-tokenizer instead of OpenAI's tiktoken. Removed tiktoken links and references across English, German, Spanish, French, Hindi, Indonesian, Italian, Japanese, Korean, Portuguese (BR), Russian, Turkish, Vietnamese, Simplified Chinese, and Traditional Chinese documentation. |
| Documentation - Bundling Guides: website/client/src/*/guide/development/using-repomix-as-a-library.md | Removed tiktoken from external dependencies lists across all 15 language versions, leaving only tinypool as an unbundleable external dependency. |
| Source Code Comments: src/core/metrics/TokenCounter.ts | Updated inline documentation to reference OpenAI-compatible encodings rather than tiktoken-specific behavior; no logic or control flow changes. |
| Bundling & Runtime Configuration: website/server/Dockerfile, website/server/scripts/bundle.mjs | Removed the tiktoken copy instruction from the Docker runtime image build and removed tiktoken from Rolldown's external dependencies list, since its replacement, gpt-tokenizer, is pure JS and can be bundled. |

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~15 minutes

Possibly related PRs

  • PR #1065: Addresses the opposite concern—handling how tiktoken maintains external dependency status and WASM provisioning at runtime.
  • PR #1350: Contains related gpt-tokenizer implementation changes that complement this documentation and bundling migration.
  • PR #1075: Originally introduced tiktoken as an external dependency in bundling documentation and configuration; this PR reverses that decision by removing it.
🚥 Pre-merge checks | ✅ 3
✅ Passed checks (3 passed)
| Check name | Status | Explanation |
|---|---|---|
| Title check | ✅ Passed | The title accurately summarizes the main change: replacing tiktoken references with gpt-tokenizer across documentation and configuration files. |
| Description check | ✅ Passed | The description covers all required changes comprehensively, includes a well-organized checklist of modifications across documentation, build infrastructure, and source code, and confirms both npm test and lint were run. |
| Docstring Coverage | ✅ Passed | Docstring coverage is 100.00%, which is sufficient. The required threshold is 80.00%. |

✏️ Tip: You can configure your own custom pre-merge checks in the settings.


@cloudflare-workers-and-pages

cloudflare-workers-and-pages bot commented Apr 6, 2026

Deploying repomix with Cloudflare Pages

Latest commit: 8d8aefc
Status: ⚡️ Build in progress...


Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request completes the documentation and build cleanup that follows the migration from tiktoken to gpt-tokenizer for token counting. Key changes include regenerating package-lock.json, simplifying the comments in TokenCounter.ts (no logic changes), and updating documentation across all supported languages. Additionally, tiktoken has been removed from the build's external dependencies and Docker configuration since it is no longer required at runtime. I have no feedback to provide.

@github-actions
Contributor

github-actions bot commented Apr 6, 2026

⚡ Performance Benchmark

Latest commit: 8d8aefc chore(server): Update website server package-lock.json
Status: ✅ Benchmark complete!
Ubuntu: 1.50s (±0.04s) → 1.50s (±0.05s) · +0.00s (+0.0%)
macOS: 1.21s (±0.36s) → 1.22s (±0.28s) · +0.01s (+1.0%)
Windows: 1.80s (±0.13s) → 1.83s (±0.12s) · +0.03s (+1.8%)
Details
  • Packing the repomix repository with node bin/repomix.cjs
  • Warmup: 2 runs (discarded), interleaved execution
  • Measurement: 20 runs / 30 on macOS (median ± IQR)
  • Workflow run
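
For readers unfamiliar with the "median ± IQR" notation used in these rows, the following TypeScript sketch shows one way such a summary could be computed from a list of run times. The `quantile` and `summarize` names are hypothetical, and the linear-interpolation quantile is an assumption; the benchmark workflow may use a different quartile convention.

```typescript
// Quantile by linear interpolation between the two nearest ranks (one of several common conventions).
function quantile(sorted: readonly number[], q: number): number {
  const pos = (sorted.length - 1) * q;
  const lo = Math.floor(pos);
  const hi = Math.ceil(pos);
  return sorted[lo] + (sorted[hi] - sorted[lo]) * (pos - lo);
}

// Summarize run times (in seconds) as a median and an interquartile range (Q3 - Q1).
function summarize(samples: readonly number[]): { median: number; iqr: number } {
  if (samples.length === 0) {
    throw new Error('summarize requires at least one sample');
  }
  // Copy before sorting so the caller's array is left untouched.
  const sorted = [...samples].sort((a, b) => a - b);
  return {
    median: quantile(sorted, 0.5),
    iqr: quantile(sorted, 0.75) - quantile(sorted, 0.25),
  };
}
```

Both statistics are robust to a few unusually slow runs, which makes them a reasonable fit for noisy CI machines.
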
History

54597b1 chore(server): Update website server package-lock.json

Ubuntu: 1.55s (±0.04s) → 1.54s (±0.03s) · -0.01s (-0.7%)
macOS: 0.97s (±0.07s) → 0.98s (±0.13s) · +0.01s (+1.2%)
Windows: 1.80s (±0.05s) → 1.81s (±0.04s) · +0.00s (+0.1%)

493035c docs(core): Replace tiktoken references with gpt-tokenizer

Ubuntu: 1.50s (±0.03s) → 1.50s (±0.03s) · +0.00s (+0.2%)
macOS: 1.04s (±0.13s) → 1.05s (±0.16s) · +0.01s (+0.5%)
Windows: 1.86s (±0.16s) → 1.82s (±0.12s) · -0.05s (-2.5%)

@codecov

codecov bot commented Apr 6, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 87.26%. Comparing base (ffe6770) to head (8d8aefc).
⚠️ Report is 3 commits behind head on main.

Additional details and impacted files
@@           Coverage Diff           @@
##             main    #1413   +/-   ##
=======================================
  Coverage   87.26%   87.26%           
=======================================
  Files         117      117           
  Lines        4420     4420           
  Branches     1021     1021           
=======================================
  Hits         3857     3857           
  Misses        563      563           


Contributor

@devin-ai-integration devin-ai-integration bot left a comment


✅ Devin Review: No Issues Found

Devin Review analyzed this PR and found no potential bugs to report.

View in Devin Review to see 4 additional findings.


yamadashy and others added 2 commits April 6, 2026 15:37
Update all documentation, build configuration, and source comments
to reflect the completed migration from tiktoken to gpt-tokenizer.

- Update tokenCount.encoding description to say "OpenAI-compatible
  tokenization" with gpt-tokenizer link (README + 15 language docs)
- Remove tiktoken from bundling external deps list since
  gpt-tokenizer is pure JS (README + 15 language docs)
- Remove tiktoken COPY from Dockerfile and external from bundle.mjs
- Simplify source code comments removing migration artifacts
- Regenerate scripts/memory/package-lock.json

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@yamadashy yamadashy force-pushed the docs/remove-tiktoken-references branch from 54597b1 to 8d8aefc Compare April 6, 2026 06:37
@yamadashy yamadashy merged commit 01f5c1a into main Apr 6, 2026
54 of 55 checks passed
@yamadashy yamadashy deleted the docs/remove-tiktoken-references branch April 6, 2026 06:37
@claude
Contributor

claude bot commented Apr 6, 2026

Code Review — Claude

Overall: Approve — This is a clean, thorough documentation and config cleanup PR. No logic changes, no regressions, safe to merge.

Findings

Verify gpt-tokenizer GitHub URL (all 16 doc files)

All documentation links point to https://github.com/nicolo-ribaudo/gpt-tokenizer. The canonical npm package gpt-tokenizer (v3.4.0) is historically maintained at https://github.com/niieani/gpt-tokenizer by niieani (Bazyli Brzóska). nicolo-ribaudo is primarily known as a Babel/TC39 contributor. Please verify this URL is correct — if it's a recent repo transfer, it's fine; if not, all 16 doc files need updating.

Detailed review notes

Code Quality — The TokenCounter.ts comment cleanup is well done. Removing the "old tiktoken behavior" migration artifacts is appropriate now that the migration is complete. The free() no-op comment and PLAIN_TEXT_OPTIONS explanation remain clear and accurate.

Security — No concerns. Removing tiktoken (native/WASM addon) from the Docker image and bundle externals actually reduces the supply-chain attack surface. The dynamic import path in TokenCounter.ts is constrained to the TOKEN_ENCODINGS const tuple — no path traversal risk.
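
The const-tuple constraint mentioned above can be sketched roughly as follows. This is an illustration, not the contents of TokenCounter.ts: the encoding names in the tuple, the `isTokenEncoding` and `encodingModuleSpecifier` helpers, and the `gpt-tokenizer/encoding/<name>` subpath are assumptions made for the example.

```typescript
// Assumed list of supported encodings; the real TOKEN_ENCODINGS tuple may differ.
const TOKEN_ENCODINGS = ['o200k_base', 'cl100k_base', 'p50k_base', 'p50k_edit', 'r50k_base'] as const;
type TokenEncoding = (typeof TOKEN_ENCODINGS)[number];

// Type guard: narrows an arbitrary string to one of the known encoding names.
function isTokenEncoding(value: string): value is TokenEncoding {
  return (TOKEN_ENCODINGS as readonly string[]).includes(value);
}

// Build the module specifier only after validation, so user input such as
// "../../secret" can never reach the dynamic import.
function encodingModuleSpecifier(value: string): string {
  if (!isTokenEncoding(value)) {
    throw new Error(`Unsupported token encoding: ${value}`);
  }
  return `gpt-tokenizer/encoding/${value}`;
}
```

A caller would pass the returned string to a dynamic `import()`, which also keeps the encoding tables lazily loaded until an encoding is first requested.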

Performance — The benchmark results confirm no regression. Bundling gpt-tokenizer inline (pure JS) instead of keeping it external is correct and should slightly reduce Docker image size vs. copying tiktoken's WASM binaries. Lazy loading via dynamic import is preserved.

Test Coverage — No new tests needed. Only comments and config changed; Codecov confirms all modified coverable lines are covered (87.26% project coverage).

Conventions — Commit messages follow Conventional Commits correctly. Documentation wording is consistent across all 16 doc files (the README plus 15 language variants).

Completeness — All tiktoken references in source, docs, and build config have been removed. The only remaining tiktoken mention is in historical release notes (.github/releases/), which is correct.

🤖 Generated with Claude Code

@claude
Contributor

claude bot commented Apr 6, 2026

Review — Claude

Verdict: Looks good

Clean documentation and build config follow-up to the tiktoken → gpt-tokenizer migration (#1350). All 36 changed files are consistent and well-scoped.

Highlights

  • No logic changes — only comments, docs, and build config touched
  • All 15 language variants updated consistently
  • Build pipeline correctly simplified — tiktoken WASM no longer needs special handling in Dockerfile or bundle externals
  • Benchmarks show no regression across all platforms
  • Conventions fully followed (commit messages, PR checklist)
  • No stale tiktoken references remain — verified only historical release notes (v0.1.39) still mention it, which is correct

Detailed review notes

Code Quality: Comment simplifications in TokenCounter.ts are accurate. Removing "old tiktoken behavior" migration artifacts is appropriate cleanup.

Security: No concerns. Removing the tiktoken WASM COPY from Dockerfile reduces attack surface marginally.

Performance: Bundling gpt-tokenizer (pure JS) instead of externalizing tiktoken (WASM) is neutral to positive. CI benchmarks confirm.

Test Coverage: No new logic paths — 87.26% coverage maintained, all modified coverable lines covered.

Conventions: Both commits follow type(scope): Description format. PR body is well-organized with checked test/lint boxes.

🤖 Generated with Claude Code

