feat - add a new endpoint get_tokenizer_info to provide tokenizer/chat-template information #20575
Conversation
…tion Signed-off-by: m-misiura <[email protected]>
👋 Hi! Thank you for contributing to the vLLM project. 💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels. Just a reminder: PRs do not trigger a full CI run by default; only a limited subset of checks runs automatically. Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging. 🚀
Summary of Changes
Hello @m-misiura, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!
This pull request enhances the vLLM server's API by adding a dedicated endpoint that exposes detailed tokenizer and chat template information. This capability is useful for external applications that interact with vLLM-deployed models, enabling them to handle tokenization and prompt formatting correctly without re-downloading or inferring tokenizer details, thereby improving interoperability and reducing potential mismatches.
Highlights
- New API Endpoint: Introduced a new GET endpoint `/get_tokenizer_info` to the vLLM OpenAI-compatible API server. This endpoint provides comprehensive information about the loaded model's tokenizer and its associated chat template.
- Tokenizer Information Exposure: The new endpoint returns a JSON object containing key tokenizer properties such as `tokenizer_class`, special tokens (`unk_token`, `bos_token`, `eos_token`, `pad_token`), `chat_template`, `model_max_length`, and `additional_special_tokens`. It also includes `added_tokens_decoder` if available, and ensures all data is JSON-serializable (a client-side sketch follows this list).
- Enhanced Test Coverage: Added a suite of new asynchronous tests to `test_tokenization.py` to validate the functionality, schema, and consistency of the `/get_tokenizer_info` endpoint. These tests cover basic response structure, data types, the format of added tokens, and consistency with existing tokenization endpoints.
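To make the interoperability point concrete, here is a small, hypothetical client-side sketch: fetch the tokenizer info and reuse the server's chat template with a locally loaded tokenizer so prompt formatting matches the deployment. The host, port, and model name are placeholder assumptions, and the path shown is the one originally proposed in this PR (it is renamed later in the review).

```python
# Hypothetical sketch -- server URL and model name are placeholders.
import requests
from transformers import AutoTokenizer

info = requests.get("http://localhost:8000/get_tokenizer_info", timeout=30).json()

tokenizer = AutoTokenizer.from_pretrained("HuggingFaceH4/zephyr-7b-beta")
if info.get("chat_template"):
    # Prefer the template reported by the server over the local default,
    # avoiding template-model mismatch.
    tokenizer.chat_template = info["chat_template"]

prompt = tokenizer.apply_chat_template(
    [{"role": "user", "content": "Hello!"}],
    tokenize=False,
    add_generation_prompt=True,
)
print(prompt)
```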
Code Review
The pull request introduces a new `/get_tokenizer_info` endpoint, which is a valuable feature for tools that interact with vLLM. The implementation is straightforward and the accompanying tests cover various aspects of the new endpoint, including schema validation and consistency checks. My main suggestions focus on ensuring the `model_max_length` reported is the effective one used by the engine, and on improving error logging.
```python
if info_max_len and tokenize_max_len:
    assert info_max_len >= tokenize_max_len, "Info max length should be >= tokenize max length"
```
The `get_tokenizer_info` endpoint should return the effective `max_model_len` that the vLLM engine is using. Currently, it returns the `model_max_length` from the tokenizer's configuration, which can be different if it was overridden during engine initialization. This can be misleading for clients of this API.
To fix this, please update `_get_tokenizer_config` in `vllm/entrypoints/openai/serving_tokenization.py` to report the correct value from `self.model_config`.
Suggested change:
```diff
 if info_max_len and tokenize_max_len:
-    assert info_max_len >= tokenize_max_len, "Info max length should be >= tokenize max length"
+    assert info_max_len == tokenize_max_len, "Info max length should be == tokenize max length"
```
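For illustration only, a rough sketch of the kind of change being suggested for `_get_tokenizer_config`; the surrounding class attributes and the exact `max_model_len` field name are assumptions, not code from this PR:

```python
def _get_tokenizer_config(self) -> dict:
    """Collect tokenizer config, preferring the engine's effective limits."""
    config = dict(getattr(self.tokenizer, "init_kwargs", None) or {})
    # Report the max length the engine actually enforces (which may have been
    # overridden at initialization) instead of the tokenizer's own default.
    if getattr(self, "model_config", None) is not None:
        config["model_max_length"] = self.model_config.max_model_len
    return config
```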
```python
except Exception as e:
    return self.create_error_response(
        f"Failed to get tokenizer info: {str(e)}")
```
Catching a broad `Exception` can hide bugs and swallow important exceptions like `KeyboardInterrupt`. It's better to log the full traceback to aid in debugging when an unexpected error occurs.
Suggested change:
```diff
 except Exception as e:
+    logger.exception("Failed to get tokenizer info.")
     return self.create_error_response(
-        f"Failed to get tokenizer info: {str(e)}")
+        f"Failed to get tokenizer info: {e!s}")
```
```python
def _make_json_serializable(self, obj):
    """Convert any non-JSON-serializable objects to serializable format."""
    if hasattr(obj, 'content'):
        return obj.content
    elif isinstance(obj, dict):
        return {k: self._make_json_serializable(v) for k, v in obj.items()}
    elif isinstance(obj, list):
        return [self._make_json_serializable(item) for item in obj]
    else:
        return obj
```
The `_make_json_serializable` method is a bit magical, especially the `if hasattr(obj, 'content')` check. Adding type hints and a comment explaining this special case (e.g., for handling `AddedToken` objects from the tokenizer) would improve readability and maintainability for future contributors.
Suggested change:
```diff
-def _make_json_serializable(self, obj):
+def _make_json_serializable(self, obj: Any) -> Any:
     """Convert any non-JSON-serializable objects to serializable format."""
+    # This is a special case to handle AddedToken objects, which are not
+    # directly JSON-serializable but have a 'content' attribute.
     if hasattr(obj, 'content'):
         return obj.content
     elif isinstance(obj, dict):
         return {k: self._make_json_serializable(v) for k, v in obj.items()}
     elif isinstance(obj, list):
         return [self._make_json_serializable(item) for item in obj]
     else:
         return obj
```
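As a concrete illustration of the special case being documented: values in a Hugging Face tokenizer's `added_tokens_decoder` are typically `tokenizers.AddedToken` objects, whose `content` attribute is the plain string form (a small sketch, assuming the `tokenizers` package is installed):

```python
from tokenizers import AddedToken

token = AddedToken("<|custom_token|>")
print(token.content)  # '<|custom_token|>' -- a plain, JSON-serializable string

# So a mapping like {100: AddedToken("<|custom_token|>")} collapses to
# {100: "<|custom_token|>"} after _make_json_serializable.
```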
Agree with Cyrus. Another point is that if you try this with Llama3/4, the payload will be large and a potential DDoS vector.
Force-pushed from f45e47f to ade7167
Thanks for raising your concern @simon-mo; with the latest change, the endpoint is opt-in and disabled by default. If you think further safeguards are warranted, just let me know. I'm happy to discuss and implement them!
Can you fix pre-commit?
```python
def maybe_register_tokenizer_info_endpoint(args):
    """Conditionally register the tokenizer info endpoint if enabled."""
    if getattr(args, 'enable_tokenizer_info_endpoint', False):

        @router.get("/get_tokenizer_info")
```
@router.get("/get_tokenizer_info") | |
@router.get("/tokenizer_info") |
`get_` is redundant because this is a GET endpoint already.
thanks @DarkLight1337 -- I've renamed this endpoint accordingly
Would you be willing to collaborate on #19604, and persuade others that this PR is valuable and should be merged?
Force-pushed from 2adc763 to 855b7bf
Of course, this should now be fixed in the latest commit.
Hey @noooop, many thanks for reaching out :) Let's see what the reviewers decide is the best way forward and take it from there. From my perspective (as well as that of some of my team members), we have a use case around eval and explainability, but I can also see how your broader approach might be beneficial, e.g. for RAG.
vllm/entrypoints/openai/protocol.py (outdated)
```diff
@@ -800,18 +801,18 @@ def check_tool_usage(cls, data):
         # make sure that tool choice is either a named tool
         # OR that it's set to "auto" or "required"
         if data["tool_choice"] not in [
-            "auto", "required"
+            "auto",
```
Can you get rid of the formatting changes that have nothing to do with this PR?
```python
config = (dict(self.tokenizer.init_kwargs)
          if hasattr(self.tokenizer, "init_kwargs")
          and self.tokenizer.init_kwargs else {})
```
Suggested change:
```diff
-config = (dict(self.tokenizer.init_kwargs)
-          if hasattr(self.tokenizer, "init_kwargs")
-          and self.tokenizer.init_kwargs else {})
+config = dict(getattr(self.tokenizer, "init_kwargs", None) or {})
```
LGTM now! @simon-mo can you double-check?
I think you need to update the tests to enable the endpoint.
…dpoint is optional; also reflected name change of the endpoint in the tests
Many thanks for yet another great spot -- the tests should now reflect the endpoint renaming and the fact that this endpoint is opt-in. Please note also that when running tests locally on a CPU, I had to add
This pull request has merge conflicts that must be resolved before it can be merged.
Purpose
Certain upstream packages that can leverage vLLM-deployed models require access to a model's chat template or other information about the tokenizer itself. Given that we already require an API connection to the model server, being able to get this info from the model server would reduce duplication and prevent template-model mismatches. Consequently, a dedicated vLLM endpoint that exposes tokenizer and chat-template information would be useful.
Concrete examples of where this would be useful:
Quick illustration of what the new endpoint would return.
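Below is a hedged sketch of querying the endpoint and the approximate shape of the response; the concrete token values are placeholders for a generic chat model, not output captured from this PR:

```python
# Hypothetical request/response sketch -- values below are placeholders.
import json
import requests

info = requests.get("http://localhost:8000/get_tokenizer_info", timeout=30).json()
print(json.dumps(info, indent=2))

# Approximate shape of the payload:
# {
#   "tokenizer_class": "PreTrainedTokenizerFast",
#   "bos_token": "<s>",
#   "eos_token": "</s>",
#   "pad_token": "</s>",
#   "unk_token": "<unk>",
#   "chat_template": "{% for message in messages %} ... {% endfor %}",
#   "model_max_length": 4096,
#   "additional_special_tokens": [],
#   "added_tokens_decoder": {"0": "<unk>", "1": "<s>", "2": "</s>"}
# }
```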
Test Plan
I added tests in `tests/entrypoints/openai/test_tokenization.py`.
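For a flavour of what such a test might look like, here is a minimal sketch; the `server` fixture, its `url_for` helper, and the final `/tokenizer_info` path are assumptions based on the discussion above, and the real tests in this PR are more thorough:

```python
import requests


def test_tokenizer_info_basic(server):
    """The opt-in endpoint should return a JSON object with core fields."""
    response = requests.get(server.url_for("tokenizer_info"))
    response.raise_for_status()
    info = response.json()

    assert isinstance(info, dict)
    assert isinstance(info.get("tokenizer_class"), str)
    # model_max_length may be absent or None depending on the tokenizer config.
    if info.get("model_max_length") is not None:
        assert info["model_max_length"] > 0
```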
Test Result
Tests were passing locally on a CPU backend with a Zephyr model that requires sliding-window attention, but this is no longer the case since the latest changes on upstream/main; I suspect these tests are simply no longer compatible with my local test setup. On another branch, I have tests using a model that does not require sliding-window attention, and those tests appear to pass. I am happy to perform further testing and contribute additional tests.