
[BUG]: openai token count tracking: ValueError when the prompt contains <|endoftext|> #13397


Open
qwhex opened this issue May 13, 2025 · 0 comments
qwhex commented May 13, 2025

Tracer Version(s)

2.5.1

Python Version(s)

Python 3.11

Pip Version(s)

pip 24.0

Bug Report

ddtrace has a contrib module that counts tokens when tiktoken is installed; it lives in ddtrace/contrib/internal/openai/utils.py.

OpenAI usually doesn't encode special tokens found in the user's input, because doing so would interfere with generation. These special tokens are used as separators between user/assistant messages and to mark the end of a message. One exception is <|endoftext|>.

We ran into an issue when sending the literal text <|endoftext|>: the tiktoken call in ddtrace raises a ValueError, because that token is on the list of disallowed_special tokens.
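For reference, the failure can be reproduced with tiktoken alone (a minimal sketch, assuming tiktoken is installed; the model name is just an example):

```python
import tiktoken

enc = tiktoken.encoding_for_model("gpt-3.5-turbo")

# By default every special token is disallowed, so the literal string
# "<|endoftext|>" appearing in plain user text raises instead of being counted.
enc.encode("some user text <|endoftext|> more text")
# ValueError: Encountered text corresponding to disallowed special token '<|endoftext|>'.
```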

While this might indicate a bug on the caller side (the caller probably meant to send plain text rather than a special token), it shouldn't change how we calculate token counts. Encoding this text as a token for the purpose of token counting is the right behaviour, since that is what happens on the OpenAI backend.

I'd say this is the right behaviour for any LLM with its default tokenizer, and tiktoken loads the default tokenizer based on the model name.

I suggest passing disallowed_special=() to the encode call, so we don't get runtime exceptions when this happens.

OpenAI will interpret the literal string as the EOS token anyway, so this is correct.
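Concretely, the change would look roughly like this (a sketch only; the signature is taken from the traceback below, while the fallback branch and the characters-per-token heuristic are my own assumptions, not the current ddtrace code):

```python
import tiktoken


def _compute_prompt_token_count(prompt, model):
    """Sketch of the proposed behaviour; not the actual ddtrace implementation."""
    try:
        enc = tiktoken.encoding_for_model(model)
    except KeyError:
        # Unknown model name: fall back to a rough estimate and flag it as estimated.
        return True, len(prompt) // 4
    # disallowed_special=() disables the check, so the literal "<|endoftext|>"
    # is encoded and counted instead of raising a ValueError.
    return False, len(enc.encode(prompt, disallowed_special=()))
```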

Another issue with this contrib module is that there's no way to disable the exact token count calculation, or to fall back to a rough token count estimate when an exception occurs. Ideally, there'd be an env var or some other option to disable tiktoken.

We have tiktoken installed in our project for other reasons, not for ddtrace, so auto-enabling exact token counting whenever tiktoken is available is not desirable in our case.
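To illustrate, something along these lines would cover both the opt-out and the fallback (purely hypothetical; DD_OPENAI_DISABLE_TIKTOKEN is not an existing ddtrace option, and the helper name is made up to show the idea):

```python
import os

# Hypothetical opt-out flag; the variable name and its default are illustrative only.
_TIKTOKEN_DISABLED = os.environ.get("DD_OPENAI_DISABLE_TIKTOKEN", "").lower() in ("1", "true")


def _count_prompt_tokens(prompt, model):
    if not _TIKTOKEN_DISABLED:
        try:
            import tiktoken

            enc = tiktoken.encoding_for_model(model)
            return False, len(enc.encode(prompt, disallowed_special=()))
        except Exception:
            # Any tiktoken failure degrades to the rough estimate below.
            pass
    # Rough fallback: about 4 characters per token (an assumption, not a guarantee).
    return True, len(prompt) // 4
```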


I'm happy to open a PR once we've agreed on the right course of action.

Trace

 File "/usr/lib/python3.11/site-packages/ddtrace/contrib/openai/_endpoint_hooks.py", line 133, in _handle_streamed_response
    estimated, prompt_tokens = _compute_prompt_token_count(m.get("content", ""), kwargs.get("model"))
                               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.11/site-packages/ddtrace/contrib/openai/utils.py", line 37, in _compute_prompt_token_count
    num_prompt_tokens += len(enc.encode(prompt))
                             ^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.11/site-packages/tiktoken/core.py", line 117, in encode
    raise_disallowed_special_token(match.group())
  File "/usr/lib/python3.11/site-packages/tiktoken/core.py", line 400, in raise_disallowed_special_token
    raise ValueError(
ValueError: Encountered text corresponding to disallowed special token '<|endoftext|>'.
If you want this text to be encoded as a special token, pass it to `allowed_special`, e.g. `allowed_special={'<|endoftext|>', ...}`.
If you want this text to be encoded as normal text, disable the check for this token by passing `disallowed_special=(enc.special_tokens_set - {'<|endoftext|>'})`.
To disable this check for all special tokens, pass `disallowed_special=()`.

Reproduction Code

No response

Error Logs

No response

Libraries in Use

No response

Operating System

No response

qwhex added the bug label May 13, 2025