
Add option in OpenAI API to disable special tokens when tokenizing#2263

Closed
liuyanyi wants to merge 1 commit into vllm-project:main from liuyanyi:main

Conversation

@liuyanyi
Contributor

liuyanyi commented Dec 26, 2023

Following the instructions in the Hugging Face docs, I use a Llama-style chat template for my own model:

template = (
    "{% if messages[0]['role'] == 'system' %}"
    "{% set loop_messages = messages[1:] %}"  # Extract system message if it's present
    "{% set system_message = messages[0]['content'] %}"
    "{% elif USE_DEFAULT_PROMPT == true and not '<<SYS>>' in messages[0]['content'] %}"
    "{% set loop_messages = messages %}"  # Or use the default system message if the flag is set
    "{% set system_message = 'DEFAULT_SYSTEM_MESSAGE' %}"
    "{% else %}"
    "{% set loop_messages = messages %}"
    "{% set system_message = false %}"
    "{% endif %}"
    "{% for message in loop_messages %}"  # Loop over all non-system messages
    "{% if (message['role'] == 'user') != (loop.index0 % 2 == 0) %}"
    "{{ raise_exception('Conversation roles must alternate user/assistant/user/assistant/...') }}"
    "{% endif %}"
    "{% if loop.index0 == 0 and system_message != false %}"  # Embed system message in first message
    "{% set content = '<<SYS>>\\n' + system_message + '\\n<</SYS>>\\n\\n' + message['content'] %}"
    "{% else %}"
    "{% set content = message['content'] %}"
    "{% endif %}"
    "{% if message['role'] == 'user' %}"  # After all of that, handle messages/roles in a fairly normal way
    "{{ bos_token + '[INST] ' + content.strip() + ' [/INST]' }}"
    "{% elif message['role'] == 'system' %}"
    "{{ '<<SYS>>\\n' + content.strip() + '\\n<</SYS>>\\n\\n' }}"
    "{% elif message['role'] == 'assistant' %}"
    "{{ ' '  + content.strip() + ' ' + eos_token }}"
    "{% endif %}"
    "{% endfor %}"
)

The template already emits bos_token, so after tokenization in the OpenAI API server there are two BOS tokens at the start of the prompt.
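A minimal sketch of the duplication (using a toy stand-in tokenizer, not vLLM or Hugging Face code): the template renders bos_token into the prompt text, and the tokenizer then prepends BOS again because add_special_tokens defaults to True.

```python
BOS = "<s>"

def render_template(user_msg: str) -> str:
    # Mimics the Llama-style template above: it emits bos_token itself.
    return f"{BOS}[INST] {user_msg} [/INST]"

def tokenize(text: str, add_special_tokens: bool = True) -> list:
    # Toy stand-in for a real tokenizer: it optionally prepends BOS,
    # as Hugging Face tokenizers do when add_special_tokens=True.
    tokens = text.replace(BOS, f" {BOS} ").split()
    if add_special_tokens:
        tokens = [BOS] + tokens
    return tokens

prompt = render_template("Hello")
print(tokenize(prompt))                            # two BOS at the start
print(tokenize(prompt, add_special_tokens=False))  # single BOS, from the template
```

With the default behavior the prompt starts with two `<s>` tokens; disabling special tokens leaves only the one the template wrote.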

So I added an option to the OpenAI API server to control whether the tokenizer adds special tokens.
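The shape of the change can be sketched as threading a per-request flag down to the tokenizer call (the field name and request class here are illustrative assumptions, not the exact PR diff; Hugging Face tokenizers do accept add_special_tokens in encode):

```python
from dataclasses import dataclass

@dataclass
class CompletionRequest:
    # Hypothetical request model: the add_special_tokens field is the
    # proposed opt-out for chat templates that already emit bos_token.
    prompt: str
    add_special_tokens: bool = True

def encode_prompt(request: CompletionRequest, tokenizer) -> list:
    # tokenizer stands in for a Hugging Face tokenizer, whose encode()
    # supports the add_special_tokens keyword.
    return tokenizer.encode(
        request.prompt, add_special_tokens=request.add_special_tokens
    )
```

A client whose template already contains bos_token would set add_special_tokens=False on the request and get exactly one BOS.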

This may fix #2012

@DarkLight1337
Member

Closed as superseded by #4688



Development

Successfully merging this pull request may close these issues.

"/v1/chat/completions" tokenization issue
