feat - add a new endpoint get_tokenizer_info to provide tokenizer/chat-template information #20575
Conversation
…tion Signed-off-by: m-misiura <[email protected]>
👋 Hi! Thank you for contributing to the vLLM project. 💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels. Just a reminder: PRs do not trigger a full CI run by default; only a limited subset of checks runs automatically. Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging. 🚀
Summary of Changes
Hello @m-misiura, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!
This pull request enhances the vLLM server's API by adding a dedicated endpoint that exposes detailed tokenizer and chat template information. This capability is useful for external applications that interact with vLLM-deployed models, enabling them to handle tokenization and prompt formatting correctly without re-downloading or inferring tokenizer details, thereby improving interoperability and reducing potential mismatches.
Highlights
- New API Endpoint: Introduced a new GET endpoint `/get_tokenizer_info` to the vLLM OpenAI-compatible API server. This endpoint provides comprehensive information about the loaded model's tokenizer and its associated chat template.
- Tokenizer Information Exposure: The new endpoint returns a JSON object containing key tokenizer properties such as `tokenizer_class`, special tokens (`unk_token`, `bos_token`, `eos_token`, `pad_token`), `chat_template`, `model_max_length`, and `additional_special_tokens`. It also includes `added_tokens_decoder` if available, and ensures all data is JSON-serializable (a client-side sketch follows this list).
- Enhanced Test Coverage: Added a suite of new asynchronous tests to `test_tokenization.py` to validate the functionality, schema, and consistency of the `/get_tokenizer_info` endpoint. These tests cover basic response structure, data types, the format of added tokens, and consistency with existing tokenization endpoints.
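To make the interoperability point concrete, here is a small, hypothetical client-side sketch: fetch the tokenizer info and reuse the server's chat template with a locally loaded tokenizer so prompt formatting matches the deployment. The host, port, and model name are placeholder assumptions, and the path shown is the one originally proposed in this PR (it is renamed later in the review).

```python
# Hypothetical sketch -- server URL and model name are placeholders.
import requests
from transformers import AutoTokenizer

info = requests.get("http://localhost:8000/get_tokenizer_info", timeout=30).json()

tokenizer = AutoTokenizer.from_pretrained("HuggingFaceH4/zephyr-7b-beta")
if info.get("chat_template"):
    # Prefer the template reported by the server over the local default,
    # avoiding template-model mismatch.
    tokenizer.chat_template = info["chat_template"]

prompt = tokenizer.apply_chat_template(
    [{"role": "user", "content": "Hello!"}],
    tokenize=False,
    add_generation_prompt=True,
)
print(prompt)
```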
Code Review
The pull request introduces a new `/get_tokenizer_info` endpoint, which is a valuable feature for tools that interact with vLLM. The implementation is straightforward and the accompanying tests cover various aspects of the new endpoint, including schema validation and consistency checks. My main suggestions focus on ensuring the `model_max_length` reported is the effective one used by the engine, and on improving error logging.
```python
if info_max_len and tokenize_max_len:
    assert info_max_len >= tokenize_max_len, "Info max length should be >= tokenize max length"
```
The `get_tokenizer_info` endpoint should return the effective `max_model_len` that the vLLM engine is using. Currently, it returns the `model_max_length` from the tokenizer's configuration, which can be different if it was overridden during engine initialization. This can be misleading for clients of this API.
To fix this, please update `_get_tokenizer_config` in `vllm/entrypoints/openai/serving_tokenization.py` to report the correct value from `self.model_config`.
Suggested change:
```diff
 if info_max_len and tokenize_max_len:
-    assert info_max_len >= tokenize_max_len, "Info max length should be >= tokenize max length"
+    assert info_max_len == tokenize_max_len, "Info max length should be == tokenize max length"
```
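For illustration only, a rough sketch of the kind of change being suggested for `_get_tokenizer_config`; the surrounding class attributes and the exact `max_model_len` field name are assumptions, not code from this PR:

```python
def _get_tokenizer_config(self) -> dict:
    """Collect tokenizer config, preferring the engine's effective limits."""
    config = dict(getattr(self.tokenizer, "init_kwargs", None) or {})
    # Report the max length the engine actually enforces (which may have been
    # overridden at initialization) instead of the tokenizer's own default.
    if getattr(self, "model_config", None) is not None:
        config["model_max_length"] = self.model_config.max_model_len
    return config
```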
```python
except Exception as e:
    return self.create_error_response(
        f"Failed to get tokenizer info: {str(e)}")
```
Catching a broad `Exception` can hide bugs and swallow important exceptions like `KeyboardInterrupt`. It's better to log the full traceback to aid in debugging when an unexpected error occurs.
Suggested change:
```diff
 except Exception as e:
+    logger.exception("Failed to get tokenizer info.")
     return self.create_error_response(
-        f"Failed to get tokenizer info: {str(e)}")
+        f"Failed to get tokenizer info: {e!s}")
```
```python
def _make_json_serializable(self, obj):
    """Convert any non-JSON-serializable objects to serializable format."""
    if hasattr(obj, 'content'):
        return obj.content
    elif isinstance(obj, dict):
        return {k: self._make_json_serializable(v) for k, v in obj.items()}
    elif isinstance(obj, list):
        return [self._make_json_serializable(item) for item in obj]
    else:
        return obj
```
The `_make_json_serializable` method is a bit magical, especially the `if hasattr(obj, 'content')` check. Adding type hints and a comment explaining this special case (e.g., for handling `AddedToken` objects from the tokenizer) would improve readability and maintainability for future contributors.
Suggested change:
```diff
-def _make_json_serializable(self, obj):
+def _make_json_serializable(self, obj: Any) -> Any:
     """Convert any non-JSON-serializable objects to serializable format."""
+    # This is a special case to handle AddedToken objects, which are not
+    # directly JSON-serializable but have a 'content' attribute.
     if hasattr(obj, 'content'):
         return obj.content
     elif isinstance(obj, dict):
         return {k: self._make_json_serializable(v) for k, v in obj.items()}
     elif isinstance(obj, list):
         return [self._make_json_serializable(item) for item in obj]
     else:
         return obj
```
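As a concrete illustration of the special case being documented: values in a Hugging Face tokenizer's `added_tokens_decoder` are typically `tokenizers.AddedToken` objects, whose `content` attribute is the plain string form (a small sketch, assuming the `tokenizers` package is installed):

```python
from tokenizers import AddedToken

token = AddedToken("<|custom_token|>")
print(token.content)  # '<|custom_token|>' -- a plain, JSON-serializable string

# So a mapping like {100: AddedToken("<|custom_token|>")} collapses to
# {100: "<|custom_token|>"} after _make_json_serializable.
```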
Agree with Cyrus. Another point is that if you try this with Llama3/4, the payload will be large and a potential DDoS vector.
Force-pushed from f45e47f to ade7167
Thanks for raising your concern @simon-mo; with the latest change, the endpoint is opt-in and disabled by default. If you think further safeguards are warranted, just let me know. I'm happy to discuss and implement them!
Can you fix pre-commit?
```python
def maybe_register_tokenizer_info_endpoint(args):
    """Conditionally register the tokenizer info endpoint if enabled."""
    if getattr(args, 'enable_tokenizer_info_endpoint', False):

        @router.get("/get_tokenizer_info")
```
@router.get("/get_tokenizer_info") | |
@router.get("/tokenizer_info") |
`get_` is redundant because this is a GET endpoint already.
thanks @DarkLight1337 -- I've renamed this endpoint accordingly
Would you be willing to collaborate on #19604, and persuade others that this PR is valuable and should be merged?
Force-pushed from 2adc763 to 855b7bf
Of course, this should now be fixed in the latest commit.
Hey @noooop, many thanks for reaching out :) Let's see what the reviewers decide is the best way forward and take it from there. From my perspective (as well as that of some of my team members), we have a use case around eval and explainability, but I can also see how your broader approach might be beneficial, e.g. for RAG.
vllm/entrypoints/openai/protocol.py (outdated)
```diff
@@ -800,18 +801,18 @@ def check_tool_usage(cls, data):
         # make sure that tool choice is either a named tool
         # OR that it's set to "auto" or "required"
         if data["tool_choice"] not in [
-            "auto", "required"
+            "auto",
```
Can you get rid of the formatting changes that have nothing to do with this PR?
```python
config = (dict(self.tokenizer.init_kwargs)
          if hasattr(self.tokenizer, "init_kwargs")
          and self.tokenizer.init_kwargs else {})
```
Suggested change:
```diff
-config = (dict(self.tokenizer.init_kwargs)
-          if hasattr(self.tokenizer, "init_kwargs")
-          and self.tokenizer.init_kwargs else {})
+config = dict(getattr(self.tokenizer, "init_kwargs", None) or {})
```
LGTM now! @simon-mo can you double-check?
I think you need to update the tests to enable the endpoint.
…dpoint is optional; also reflected name change of the endpoint in the tests
Many thanks for yet another great spot -- the tests should now reflect the endpoint renaming and the fact that this endpoint is opt-in. Please note also that when running tests locally on a CPU, I had to add
This pull request has merge conflicts that must be resolved before it can be merged.
Purpose
Certain upstream packages that can leverage vLLM-deployed models require access to a model's chat template or other information about the tokenizer itself. Given that we already require an API connection to the model server, being able to get this info from the model server would reduce duplication and prevent template-model mismatches. Consequently, a dedicated vLLM endpoint that exposes tokenizer and chat-template information would be useful.
Concrete examples of where this would be useful:
Quick illustration of what the new endpoint would return.
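Below is a hedged sketch of querying the endpoint and the approximate shape of the response; the concrete token values are placeholders for a generic chat model, not output captured from this PR:

```python
# Hypothetical request/response sketch -- values below are placeholders.
import json
import requests

info = requests.get("http://localhost:8000/get_tokenizer_info", timeout=30).json()
print(json.dumps(info, indent=2))

# Approximate shape of the payload:
# {
#   "tokenizer_class": "PreTrainedTokenizerFast",
#   "bos_token": "<s>",
#   "eos_token": "</s>",
#   "pad_token": "</s>",
#   "unk_token": "<unk>",
#   "chat_template": "{% for message in messages %} ... {% endfor %}",
#   "model_max_length": 4096,
#   "additional_special_tokens": [],
#   "added_tokens_decoder": {"0": "<unk>", "1": "<s>", "2": "</s>"}
# }
```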
Test Plan
I added tests in `tests/entrypoints/openai/test_tokenization.py`.
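For a flavour of what such a test might look like, here is a minimal sketch; the `server` fixture, its `url_for` helper, and the final `/tokenizer_info` path are assumptions based on the discussion above, and the real tests in this PR are more thorough:

```python
import requests


def test_tokenizer_info_basic(server):
    """The opt-in endpoint should return a JSON object with core fields."""
    response = requests.get(server.url_for("tokenizer_info"))
    response.raise_for_status()
    info = response.json()

    assert isinstance(info, dict)
    assert isinstance(info.get("tokenizer_class"), str)
    # model_max_length may be absent or None depending on the tokenizer config.
    if info.get("model_max_length") is not None:
        assert info["model_max_length"] > 0
```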
Test Result
Tests were passing locally on a CPU backend with a Zephyr model that requires sliding-window attention, but this is no longer the case since the latest changes on upstream/main; I suspect these tests are simply no longer compatible with my local test setup. On another branch, I have tests using a model that does not require sliding-window attention, and those tests appear to pass. I am happy to perform further testing and contribute additional tests.