Remove incorrect tokenizer info test #33565
Conversation
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
Code Review
This pull request removes an incorrect test for the /tokenizer_info endpoint. The test was asserting an incorrect structure for added_tokens_decoder. While removing the test is a valid fix, I've suggested replacing it with a corrected test that asserts the actual API behavior. This maintains test coverage for this feature.
tests/entrypoints/openai/test_tokenization.py (304-321)
You've correctly identified that this test is incorrect given the serialization logic. While removing it fixes the immediate issue of a failing test, it also removes test coverage for the added_tokens_decoder field.
A better approach would be to correct the test to reflect the actual behavior of the API. Based on your description, the token_info is serialized to just the string content. The test should be updated to assert this.
This ensures we still have a test for this part of the API response, and it correctly documents the expected behavior.
@pytest.mark.asyncio
async def test_tokenizer_info_added_tokens_structure(
    server: RemoteOpenAIServer,
):
    """Test added_tokens_decoder structure if present."""
    response = requests.get(server.url_for("tokenizer_info"))
    response.raise_for_status()
    result = response.json()
    added_tokens = result.get("added_tokens_decoder")
    if added_tokens:
        for token_id, token_info in added_tokens.items():
            assert isinstance(token_id, str), "Token IDs should be strings"
            assert isinstance(token_info, str), "Token info should be a string"
I'd ignore Gemini's suggestion. There is little value in testing that a JSON value is a string.
DarkLight1337
left a comment
Thanks for the detailed explanation!
#20575 added what is now called the /tokenizer_info endpoint. It added the following serialisation method, which is used before the tokenizer info is sent:
vllm/vllm/entrypoints/serve/tokenize/serving.py
Lines 186 to 195 in d95b4be
In this you can see that if an obj has the attribute content, it will only return the content from that object. The values of the added_tokens_decoder field are AddedToken objects, which do have the attribute content. This means that, by design, the tokenizer info endpoint was only meant to return the string value of each token. Unfortunately, this is directly contradicted by the test that was added in the same PR:
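The serialisation rule described here can be sketched as follows. Note that `AddedToken` below is a hypothetical stand-in for the real tokenizers class, and `serialize` is an illustrative function, not the actual vLLM method:

```python
from dataclasses import dataclass


# Hypothetical stand-in for the real AddedToken class; only the
# `content` attribute matters for this sketch.
@dataclass
class AddedToken:
    content: str
    special: bool = True


def serialize(obj):
    """Sketch of the rule described above: any object that carries a
    `content` attribute collapses to that string; dicts recurse."""
    if hasattr(obj, "content"):
        return obj.content
    if isinstance(obj, dict):
        return {k: serialize(v) for k, v in obj.items()}
    return obj


info = {"added_tokens_decoder": {"0": AddedToken("<pad>")}}
print(serialize(info))  # {'added_tokens_decoder': {'0': '<pad>'}}
```

This is why the endpoint returns plain strings for added tokens rather than the full AddedToken structure.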
vllm/tests/entrypoints/openai/test_tokenization.py
Lines 304 to 321 in d95b4be
This was not noticed because the model used in the test does not have an added_tokens_decoder for Transformers v4. This PR opts to simply remove the test so that:
- models where added_tokens_decoder is present no longer cause the test to fail