Conversation

@merrymercy (Contributor)

No description provided.

@gemini-code-assist (bot) left a comment

Summary of Changes

Hello @merrymercy, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request introduces significant enhancements to the Grok1 model and integrates a new tiktoken tokenizer. The core improvements focus on refining the model's attention mechanisms with new logit capping and XAI temperature scaling, and enabling YaRN-like scaling for Rotary Positional Embeddings to better handle extended contexts. A notable architectural change is the introduction of a residual Mixture-of-Experts (MoE) setup, which allows for a more robust combination of standard MLP and MoE outputs. Additionally, the PR includes memory optimization techniques and updates to the weight loading process for increased flexibility and efficiency.

Highlights

  • Tiktoken Tokenizer Integration: This PR introduces a new TiktokenTokenizer that allows the system to load and use tiktoken models from JSON files. The new tokenizer is integrated with the xgrammar backend, enabling constrained generation with tiktoken models (a loading sketch follows this list).
  • Advanced Attention Control: The attention mechanism gains new logit capping methods and XAI temperature scaling. These additions provide finer-grained control over attention scores, allowing for more precise model behavior and potentially improving generation quality.
  • YaRN-like RoPE Scaling: Support for YaRN-like scaling (YaRN: Yet another RoPE extensioN) has been added for Rotary Positional Embeddings (RoPE). This is a significant improvement for handling longer sequence lengths, as it allows the model to generalize beyond its original training context length.
  • Residual Mixture-of-Experts (MoE) Architecture: A new Grok1MLP class and changes to Grok1DecoderLayer add support for a residual Mixture-of-Experts (MoE) architecture, in which a standard MLP path runs in parallel with the MoE path and their outputs are combined, for potentially improved model performance and stability (a combination sketch follows this list).
  • Optimized Memory Management: Explicit dispose_tensor calls have been strategically added throughout the attention and decoder layers. These calls help in managing GPU memory more efficiently by releasing tensors that are no longer needed, reducing the overall memory footprint during forward passes.
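
To make the tiktoken integration concrete, below is a minimal sketch of building a tiktoken Encoding from a JSON file. The assumed file layout ({"pat_str": ..., "ranks": {base64_token: rank}, "special_tokens": {...}}) and the helper name are illustrative only; the PR's actual TiktokenTokenizer and on-disk format may differ.

    import base64
    import json

    import tiktoken

    def load_tiktoken_from_json(path: str) -> tiktoken.Encoding:
        """Build a tiktoken Encoding from a JSON spec (hypothetical file layout)."""
        with open(path) as f:
            spec = json.load(f)
        # Token bytes are assumed to be stored base64-encoded, mapped to their ranks.
        mergeable_ranks = {
            base64.b64decode(token): rank for token, rank in spec["ranks"].items()
        }
        return tiktoken.Encoding(
            name="grok-1-tokenizer",
            pat_str=spec["pat_str"],
            mergeable_ranks=mergeable_ranks,
            special_tokens=spec.get("special_tokens", {}),
        )

And here is a minimal sketch of the residual MoE pattern: a dense MLP branch running in parallel with an MoE branch, with the two outputs combined. Class and attribute names (ResidualMoEBlock, self.mlp, self.moe) and the simple additive combination are assumptions; the actual Grok1DecoderLayer wiring may weight or normalize the branches differently.

    import torch
    from torch import nn

    class ResidualMoEBlock(nn.Module):
        """Dense MLP and MoE branches evaluated in parallel, outputs summed (sketch)."""

        def __init__(self, hidden_size: int, intermediate_size: int, moe: nn.Module):
            super().__init__()
            # Dense, always-active path (stand-in for a Grok1MLP-style feed-forward).
            self.mlp = nn.Sequential(
                nn.Linear(hidden_size, intermediate_size, bias=False),
                nn.GELU(),
                nn.Linear(intermediate_size, hidden_size, bias=False),
            )
            # Sparse path: any module mapping (tokens, hidden) -> (tokens, hidden).
            self.moe = moe

        def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
            dense_out = self.mlp(hidden_states)
            moe_out = self.moe(hidden_states)
            # Combine both branches; the real model may scale or gate this sum.
            return dense_out + moe_out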

@gemini-code-assist (bot) left a comment

Code Review

This pull request introduces support for the Grok model and a tiktoken-based tokenizer, along with several enhancements to the attention mechanisms, including YaRN scaling and logit capping. The changes are extensive and well-structured. My review focuses on a few potential issues related to correctness and code quality. I've identified a likely bug in the YaRN scaling implementation where a scaling factor is calculated but not applied, a duplicated code block, and an empty ValueError that should be more descriptive.

Comment on lines +328 to +331
# cos = freqs.cos() * self.mscale
# sin = freqs.sin() * self.mscale
cos = freqs.cos()
sin = freqs.sin()

Severity: high

The mscale attribute is calculated in the __init__ method but appears to be unused here, as the lines applying it are commented out. The YaRN scaling method typically includes this magnitude scaling factor. If this is intentional, a comment explaining why would be helpful. Otherwise, this seems like a bug and the scaling should be applied.

Suggested change
-# cos = freqs.cos() * self.mscale
-# sin = freqs.sin() * self.mscale
-cos = freqs.cos()
-sin = freqs.sin()
+cos = freqs.cos() * self.mscale
+sin = freqs.sin() * self.mscale
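
For context, the magnitude factor referenced above is usually derived from the context-extension ratio, as described in the YaRN paper, and applied to both the cos and sin caches. A minimal sketch follows; the helper name and the assumption that the scaling factor is available as a plain float are illustrative, and the PR's __init__ may compute self.mscale differently.

    import math

    import torch

    def yarn_mscale(scaling_factor: float, mscale_coeff: float = 0.1) -> float:
        # Standard YaRN attention-magnitude adjustment; no scaling when the
        # context is not extended.
        if scaling_factor <= 1.0:
            return 1.0
        return mscale_coeff * math.log(scaling_factor) + 1.0

    # Applying it to a rotary cache, mirroring the suggested change above.
    positions = torch.arange(8, dtype=torch.float32)
    inv_freq = 1.0 / (10000.0 ** (torch.arange(0, 8, 2, dtype=torch.float32) / 8))
    freqs = torch.outer(positions, inv_freq)
    mscale = yarn_mscale(scaling_factor=4.0)
    cos = freqs.cos() * mscale
    sin = freqs.sin() * mscale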

if logit_capping_method == "tanh":
    return logit_cap
else:
    raise ValueError()

Severity: medium

Raising a ValueError without a descriptive message makes debugging difficult. It's better to include information about why the error is being raised, such as the unsupported method that was passed.

Suggested change
-raise ValueError()
+raise ValueError(f"Unsupported logit_capping_method: {logit_capping_method}")
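
For readers unfamiliar with logit capping: once a cap value has been selected (as in the snippet above), tanh soft-capping is commonly applied to the raw attention scores before softmax, squashing them into (-logit_cap, logit_cap). A minimal sketch, with an illustrative function name rather than the PR's actual code path:

    import torch

    def apply_tanh_logit_cap(scores: torch.Tensor, logit_cap: float) -> torch.Tensor:
        """Soft-cap attention logits into (-logit_cap, logit_cap) via tanh."""
        if logit_cap <= 0:
            return scores  # capping disabled
        return logit_cap * torch.tanh(scores / logit_cap)

    # Example: raw attention scores for one head over an 8-token sequence.
    scores = torch.randn(8, 8) * 50.0
    capped = apply_tanh_logit_cap(scores, logit_cap=30.0)
    assert capped.abs().max() < 30.0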

Comment on lines +837 to +841
default_replicate_lm_head = False
self.replicate_lm_head = getattr(
    config, "replicate_lm_head", default_replicate_lm_head
)


Severity: medium

This block of code, which initializes self.replicate_lm_head, is duplicated. The same logic appears again at lines 845-848. You can remove this block to avoid redundancy.
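
After removing the duplicate, a single attribute read suffices. A runnable mini-demo of the getattr-with-default pattern, using a stand-in config object rather than the model's real config class:

    from types import SimpleNamespace

    config = SimpleNamespace()  # stand-in for a config that omits the flag
    # Read the flag once, falling back to False when it is not defined.
    replicate_lm_head = getattr(config, "replicate_lm_head", False)
    assert replicate_lm_head is False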

@merrymercy merged commit 86d10d2 into main on Aug 23, 2025.
58 of 68 checks passed.
@merrymercy deleted the lianmin/sync-code branch on August 23, 2025 at 12:40.
MahmoudAshraf97 pushed a commit to MahmoudAshraf97/sglang that referenced this pull request on Sep 8, 2025.