
feat - add a new endpoint get_tokenizer_info to provide tokenizer/chat-template information #20575


Open · wants to merge 9 commits into main from feat_get_tokenizer_info_endpoint

Conversation

@m-misiura commented Jul 7, 2025

Purpose

Certain upstream packages that consume vLLM-deployed models require access to a model's chat template or to information about the tokenizer itself. Since an API connection to the model server is already required, being able to retrieve this information directly from the server would reduce duplication and prevent template/model mismatches. Consequently, a dedicated vLLM endpoint that exposes tokenizer and chat-template information would be useful.

Concrete examples of where this would be useful:

  1. Model explainability / interpretability: with PyTorch Captum, you need to fetch the tokenizer from HF again, even for a remote vLLM deployment; this appears suboptimal.
  2. lm-evaluation-harness: evaluations against a /completions endpoint use the chat template to format the inbound prompt for the model; having to download the tokenizer again from HF (for example) is inefficient and can be error-prone.

Quick illustration of what a new endpoint would return

  1. Start server locally
vllm serve Qwen/Qwen2.5-0.5B-Instruct
  2. Make a request
curl -X GET "http://localhost:8000/get_tokenizer_info" | jq

which should return:

{
  "tokenizer_class": "Qwen2TokenizerFast",
  "unk_token": null,
  "bos_token": null,
  "eos_token": "<|im_end|>",
  "pad_token": "<|endoftext|>",
  "add_bos_token": false,
  "add_prefix_space": false,
  "additional_special_tokens": [
    "<|im_start|>",
    "<|im_end|>",
    "<|object_ref_start|>",
    "<|object_ref_end|>",
    "<|box_start|>",
    "<|box_end|>",
    "<|quad_start|>",
    "<|quad_end|>",
    "<|vision_start|>",
    "<|vision_end|>",
    "<|vision_pad|>",
    "<|image_pad|>",
    "<|video_pad|>"
  ],
  "chat_template": "{%- if tools %}\n    {{- '<|im_start|>system\\n' }}\n    {%- if messages[0]['role'] == 'system' %}\n        {{- messages[0]['content'] }}\n    {%- else %}\n        {{- 'You are Qwen, created by Alibaba Cloud. You are a helpful assistant.' }}\n    {%- endif %}\n    {{- \"\\n\\n# Tools\\n\\nYou may call one or more functions to assist with the user query.\\n\\nYou are provided with function signatures within <tools></tools> XML tags:\\n<tools>\" }}\n    {%- for tool in tools %}\n        {{- \"\\n\" }}\n        {{- tool | tojson }}\n    {%- endfor %}\n    {{- \"\\n</tools>\\n\\nFor each function call, return a json object with function name and arguments within <tool_call></tool_call> XML tags:\\n<tool_call>\\n{\\\"name\\\": <function-name>, \\\"arguments\\\": <args-json-object>}\\n</tool_call><|im_end|>\\n\" }}\n{%- else %}\n    {%- if messages[0]['role'] == 'system' %}\n        {{- '<|im_start|>system\\n' + messages[0]['content'] + '<|im_end|>\\n' }}\n    {%- else %}\n        {{- '<|im_start|>system\\nYou are Qwen, created by Alibaba Cloud. You are a helpful assistant.<|im_end|>\\n' }}\n    {%- endif %}\n{%- endif %}\n{%- for message in messages %}\n    {%- if (message.role == \"user\") or (message.role == \"system\" and not loop.first) or (message.role == \"assistant\" and not message.tool_calls) %}\n        {{- '<|im_start|>' + message.role + '\\n' + message.content + '<|im_end|>' + '\\n' }}\n    {%- elif message.role == \"assistant\" %}\n        {{- '<|im_start|>' + message.role }}\n        {%- if message.content %}\n            {{- '\\n' + message.content }}\n        {%- endif %}\n        {%- for tool_call in message.tool_calls %}\n            {%- if tool_call.function is defined %}\n                {%- set tool_call = tool_call.function %}\n            {%- endif %}\n            {{- '\\n<tool_call>\\n{\"name\": \"' }}\n            {{- tool_call.name }}\n            {{- '\", \"arguments\": ' }}\n            {{- tool_call.arguments | tojson }}\n            {{- '}\\n</tool_call>' }}\n        {%- endfor %}\n        {{- '<|im_end|>\\n' }}\n    {%- elif message.role == \"tool\" %}\n        {%- if (loop.index0 == 0) or (messages[loop.index0 - 1].role != \"tool\") %}\n            {{- '<|im_start|>user' }}\n        {%- endif %}\n        {{- '\\n<tool_response>\\n' }}\n        {{- message.content }}\n        {{- '\\n</tool_response>' }}\n        {%- if loop.last or (messages[loop.index0 + 1].role != \"tool\") %}\n            {{- '<|im_end|>\\n' }}\n        {%- endif %}\n    {%- endif %}\n{%- endfor %}\n{%- if add_generation_prompt %}\n    {{- '<|im_start|>assistant\\n' }}\n{%- endif %}\n",
  "clean_up_tokenization_spaces": false,
  "errors": "replace",
  "model_max_length": 131072,
  "split_special_tokens": false,
  "max_loras": 0,
  "truncation_side": "left",
  "name_or_path": "Qwen/Qwen2.5-0.5B-Instruct"
}
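
To illustrate how a client might consume this response (not part of the PR; a minimal sketch assuming the requests and jinja2 packages and the local server started above):

import requests
from jinja2 import Template

# Fetch tokenizer/chat-template info from the running vLLM server
# (endpoint path as originally proposed; it is renamed to /tokenizer_info
# later in this review thread).
info = requests.get("http://localhost:8000/get_tokenizer_info").json()

# Render a chat prompt locally with the server-provided template.
# Hugging Face chat templates are Jinja2, so a plain Template is a rough
# approximation; it omits HF-specific template helpers.
template = Template(info["chat_template"])
prompt = template.render(
    messages=[{"role": "user", "content": "Hello!"}],
    add_generation_prompt=True,
)
print(prompt)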

Test Plan

I added tests in tests/entrypoints/openai/test_tokenization.py

pytest tests/entrypoints/openai/test_tokenization.py

Test Result

Tests were passing locally on a CPU backend with a Zephyr model that requires sliding-window attention, but this is no longer the case since the latest changes on upstream/main; I suspect these tests are simply no longer compatible with my local test setup.

On another branch, I have tests using a model that does not require sliding-window attention, and those tests appear to pass:

VLLM_USE_V1=1 pytest tests/entrypoints/openai/test_tokenization.py

I am happy to perform further testing / contribute additional tests
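
For reference, a minimal sketch of the kind of test added here; the server fixture, its url_for helper, and the use of pytest-asyncio/aiohttp are assumptions for illustration, and the actual tests in tests/entrypoints/openai/test_tokenization.py may differ:

import aiohttp
import pytest


@pytest.mark.asyncio
async def test_tokenizer_info_basic(server):  # `server` fixture is assumed
    # The endpoint is opt-in and, after review, exposed as /tokenizer_info.
    async with aiohttp.ClientSession() as session:
        async with session.get(server.url_for("tokenizer_info")) as response:
            assert response.status == 200
            result = await response.json()

    # The response should expose at least the tokenizer class and,
    # when present, the chat template as a string.
    assert isinstance(result["tokenizer_class"], str)
    chat_template = result.get("chat_template")
    assert chat_template is None or isinstance(chat_template, str)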

github-actions bot commented Jul 7, 2025

👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs do not trigger a full CI run by default. Instead, only the fastcheck CI runs, covering a small, essential subset of CI tests to catch errors quickly. You can run additional CI tests on top of these by going to your fastcheck build in the Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to add you to our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can either add the ready label to the PR or enable auto-merge.

🚀

@gemini-code-assist bot (Contributor) left a comment

Summary of Changes

Hello @m-misiura, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request enhances the vLLM server's API by adding a dedicated endpoint to expose detailed tokenizer and chat template information. This capability is crucial for external applications that need to interact with vLLM deployed models, enabling them to correctly handle tokenization and prompt formatting without needing to re-download or infer tokenizer details, thereby improving interoperability and reducing potential mismatches.

Highlights

  • New API Endpoint: Introduced a new GET endpoint /get_tokenizer_info to the vLLM OpenAI-compatible API server. This endpoint provides comprehensive information about the loaded model's tokenizer and its associated chat template.
  • Tokenizer Information Exposure: The new endpoint returns a JSON object containing key tokenizer properties such as tokenizer_class, special tokens (unk_token, bos_token, eos_token, pad_token), chat_template, model_max_length, and additional_special_tokens. It also includes added_tokens_decoder if available, and ensures all data is JSON-serializable.
  • Enhanced Test Coverage: Added a suite of new asynchronous tests to test_tokenization.py to validate the functionality, schema, and consistency of the /get_tokenizer_info endpoint. These tests cover basic response structure, data types, the format of added tokens, and consistency with existing tokenization endpoints.

@mergify mergify bot added the frontend label Jul 7, 2025
@gemini-code-assist bot (Contributor) left a comment

Code Review

The pull request introduces a new /get_tokenizer_info endpoint, which is a valuable feature for tools that interact with vLLM. The implementation is straightforward and the accompanying tests cover various aspects of the new endpoint, including schema validation and consistency checks. My main suggestions focus on ensuring the model_max_length reported is the effective one used by the engine and improving error logging.

Comment on lines 364 to 365
    if info_max_len and tokenize_max_len:
        assert info_max_len >= tokenize_max_len, "Info max length should be >= tokenize max length"
Severity: high

The get_tokenizer_info endpoint should return the effective max_model_len that the vLLM engine is using. Currently, it returns the model_max_length from the tokenizer's configuration, which can be different if it was overridden during engine initialization. This can be misleading for clients of this API.

To fix this, please update _get_tokenizer_config in vllm/entrypoints/openai/serving_tokenization.py to report the correct value from self.model_config.

Suggested change:
-    if info_max_len and tokenize_max_len:
-        assert info_max_len >= tokenize_max_len, "Info max length should be >= tokenize max length"
+    if info_max_len and tokenize_max_len:
+        assert info_max_len == tokenize_max_len, "Info max length should be == tokenize max length"
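
For context, one way to address this (a rough sketch only, assuming model_config is available on the serving class; the actual change in the PR may differ) would be to override the tokenizer's value with the engine's effective limit:

def _get_tokenizer_config(self) -> dict:
    """Collect tokenizer settings, preferring engine-effective values."""
    config = dict(getattr(self.tokenizer, "init_kwargs", None) or {})

    # Report the max length the engine actually enforces, which may have
    # been overridden at startup (e.g. via --max-model-len), rather than
    # the value stored in the tokenizer's own configuration.
    config["model_max_length"] = self.model_config.max_model_len
    return config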

Comment on lines +172 to +174
        except Exception as e:
            return self.create_error_response(
                f"Failed to get tokenizer info: {str(e)}")
Severity: medium

Catching a broad Exception can hide bugs and swallow important exceptions like KeyboardInterrupt. It's better to log the full traceback to aid in debugging when an unexpected error occurs.

Suggested change:
-        except Exception as e:
-            return self.create_error_response(
-                f"Failed to get tokenizer info: {str(e)}")
+        except Exception as e:
+            logger.exception("Failed to get tokenizer info.")
+            return self.create_error_response(
+                f"Failed to get tokenizer info: {e!s}")

Comment on lines 204 to 213
    def _make_json_serializable(self, obj):
        """Convert any non-JSON-serializable objects to serializable format."""
        if hasattr(obj, 'content'):
            return obj.content
        elif isinstance(obj, dict):
            return {k: self._make_json_serializable(v) for k, v in obj.items()}
        elif isinstance(obj, list):
            return [self._make_json_serializable(item) for item in obj]
        else:
            return obj
Severity: medium

The _make_json_serializable method is a bit magical, especially the if hasattr(obj, 'content') check. Adding type hints and a comment explaining this special case (e.g., for handling AddedToken objects from the tokenizer) would improve readability and maintainability for future contributors.

Suggested change:
-    def _make_json_serializable(self, obj):
-        """Convert any non-JSON-serializable objects to serializable format."""
-        if hasattr(obj, 'content'):
-            return obj.content
-        elif isinstance(obj, dict):
-            return {k: self._make_json_serializable(v) for k, v in obj.items()}
-        elif isinstance(obj, list):
-            return [self._make_json_serializable(item) for item in obj]
-        else:
-            return obj
+    def _make_json_serializable(self, obj: Any) -> Any:
+        """Convert any non-JSON-serializable objects to serializable format."""
+        # This is a special case to handle AddedToken objects, which are not
+        # directly JSON-serializable but have a 'content' attribute.
+        if hasattr(obj, 'content'):
+            return obj.content
+        elif isinstance(obj, dict):
+            return {k: self._make_json_serializable(v) for k, v in obj.items()}
+        elif isinstance(obj, list):
+            return [self._make_json_serializable(item) for item in obj]
+        else:
+            return obj
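
As a quick illustration of the special case discussed above (purely illustrative; `serving` stands in for whatever serving instance owns this helper):

from transformers import AddedToken

# AddedToken instances are not directly JSON-serializable, but they expose
# their literal text via the `content` attribute, which the helper unwraps.
tokens = {"<|im_start|>": AddedToken("<|im_start|>")}
serializable = serving._make_json_serializable(tokens)
assert serializable == {"<|im_start|>": "<|im_start|>"}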

@DarkLight1337 DarkLight1337 requested a review from mgoin July 7, 2025 14:40
@simon-mo (Collaborator) commented Jul 7, 2025

Agree with Cyrus. Another point is that if you try this with Llama3/4, the payload will be large and a potential DDoS vector.

@m-misiura m-misiura force-pushed the feat_get_tokenizer_info_endpoint branch from f45e47f to ade7167 on July 8, 2025 08:42
@m-misiura (Author) replied:

> Agree with Cyrus. Another point is that if you try this with Llama3/4, the payload will be large and a potential DDoS vector.

Thanks for raising your concern @simon-mo; with the latest change, the get_tokenizer_info endpoint is disabled by default and must be explicitly enabled via the --enable-tokenizer-info-endpoint flag; this should mitigate accidental exposure.

If you think further safeguards are warranted, just let me know. I'm happy to discuss and implement them!
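
For illustration, the opt-in flow described above looks roughly like this (the path shown is the pre-rename one; it becomes /tokenizer_info later in the thread):

# Start the server with the endpoint explicitly enabled (off by default)
vllm serve Qwen/Qwen2.5-0.5B-Instruct --enable-tokenizer-info-endpoint

# Query the opt-in endpoint
curl -X GET "http://localhost:8000/get_tokenizer_info" | jq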

@DarkLight1337 (Member) commented:

Can you fix pre-commit?

def maybe_register_tokenizer_info_endpoint(args):
    """Conditionally register the tokenizer info endpoint if enabled."""
    if getattr(args, 'enable_tokenizer_info_endpoint', False):

        @router.get("/get_tokenizer_info")
@DarkLight1337 (Member) commented:
Suggested change:
-        @router.get("/get_tokenizer_info")
+        @router.get("/tokenizer_info")

get_ is redundant because this is a GET endpoint already

@m-misiura (Author) replied:
thanks @DarkLight1337 -- I've renamed this endpoint accordingly

@DarkLight1337 (Member) commented Jul 8, 2025

cc @aarnphm @noooop This is basically a subset of #19604

@noooop (Contributor) commented Jul 8, 2025

@m-misiura

Would you be willing to collaborate on #19604, and persuade others that this PR is valuable and should be merged?

@m-misiura m-misiura force-pushed the feat_get_tokenizer_info_endpoint branch from 2adc763 to 855b7bf on July 8, 2025 14:09
@m-misiura (Author) replied:

> Can you fix pre-commit?

Of course, this should now be fixed in commit 855b7bf.

@m-misiura (Author) replied:

> @m-misiura
>
> Would you be willing to collaborate on #19604, and persuade others that this PR is valuable and should be merged?

Hey @noooop, many thanks for reaching out :)

Let's see what the reviewers decide is the best way forward and take it from there.

From my perspective (and that of some of my team members), we have a use case around evaluation and explainability, but I can also see how your broader approach might be beneficial, e.g. for RAG.

@m-misiura m-misiura requested a review from DarkLight1337 July 9, 2025 12:54
@@ -800,18 +801,18 @@ def check_tool_usage(cls, data):
         # make sure that tool choice is either a named tool
         # OR that it's set to "auto" or "required"
         if data["tool_choice"] not in [
-                "auto", "required"
+                "auto",
Review comment (Member):
Can you get rid of the formatting changes that have nothing to do with this PR?

Comment on lines 184 to 186
        config = (dict(self.tokenizer.init_kwargs)
                  if hasattr(self.tokenizer, "init_kwargs")
                  and self.tokenizer.init_kwargs else {})
Review comment (Member):
Suggested change:
-        config = (dict(self.tokenizer.init_kwargs)
-                  if hasattr(self.tokenizer, "init_kwargs")
-                  and self.tokenizer.init_kwargs else {})
+        config = dict(getattr(self.tokenizer, "init_kwargs", None) or {})

@DarkLight1337 (Member) left a comment

LGTM now! @simon-mo can you double-check?

@DarkLight1337 DarkLight1337 added the ready label (ONLY add when PR is ready to merge / full CI is needed) on Jul 10, 2025
@DarkLight1337 (Member) commented:

I think you need to update the tests to enable the endpoint

…dpoint is optional; also reflected name change of the endpoint in the tests
@DarkLight1337 DarkLight1337 enabled auto-merge (squash) July 11, 2025 09:04
@m-misiura (Author) commented Jul 11, 2025

> I think you need to update the tests to enable the endpoint

Many thanks for yet another great spot -- the tests now reflect the endpoint renaming and the fact that the endpoint is opt-in.

Please also note that when running the tests locally on a CPU, I had to add "--disable-sliding-window" and change the precision and max-model-len (this is not reflected in commit a42e7e9, though).
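
For illustration, that local CPU tweak amounts to adjusting the test server arguments roughly as follows (the dtype and max-model-len values are placeholders, not the ones actually used):

# Illustrative server arguments for a CPU-only local run of the tests
server_args = [
    "--enable-tokenizer-info-endpoint",  # the endpoint is opt-in
    "--disable-sliding-window",          # avoid sliding-window attention on CPU
    "--dtype", "float32",                # change precision for CPU execution
    "--max-model-len", "2048",           # keep the context length small
]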

mergify bot commented Jul 15, 2025

This pull request has merge conflicts that must be resolved before it can be merged. Please rebase the PR, @m-misiura.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify bot added the needs-rebase label Jul 15, 2025
Labels
frontend, needs-rebase, ready (ONLY add when PR is ready to merge / full CI is needed)