Bug Report
ddtrace has a contrib module that counts tokens when tiktoken is installed.
it's found in ddtrace/contrib/internal/openai/utils.py
openai usually doesn't encode special tokens found in the user's input, because doing so would mess with the generation. these special tokens are used as separators between the user/assistant messages and for marking the end of a message. one exception is <|endoftext|>.
we encountered an issue when sending the text <|endoftext|> - the tiktoken call in ddtrace raises an exception, because that token is on the list of disallowed_special tokens.
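a minimal standalone reproduction of the tiktoken behaviour (outside the ddtrace code path), assuming any recent tiktoken version:

```python
import tiktoken

# tiktoken's encode() defaults to disallowed_special="all", so any text that
# matches a special token raises instead of being encoded
enc = tiktoken.encoding_for_model("gpt-3.5-turbo")
enc.encode("user text containing <|endoftext|> somewhere in the middle")
# ValueError: Encountered text corresponding to disallowed special token '<|endoftext|>'.
```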
while this might indicate a bug on the caller side (since the caller might not want to send a special token, but rather just plain text), it shouldn't change how we calculate token counts. encoding this text as a token for token count calculation is the right behaviour, since this is what will happen on the openai backend.
I'd say this is the right behaviour for any LLM with its default tokenizer - and tiktoken loads the default tokenizer based on the model name.
I suggest setting disallowed_special=(), so we don't get runtime exceptions when this happens.
openai will interpret the literal string as the EOS token anyway, so this is correct.
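a sketch of the suggested change (not the current ddtrace code), assuming the utils helper keeps calling enc.encode() the same way:

```python
import tiktoken

enc = tiktoken.encoding_for_model("gpt-3.5-turbo")

# with disallowed_special=() the check is skipped entirely: the call no longer
# raises and the string is encoded as normal text, which is good enough for
# token count accounting
tokens = enc.encode("<|endoftext|>", disallowed_special=())
print(len(tokens))  # a small token count instead of a ValueError
```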
another issue with this contrib module is that there's no way to disable the exact token count calculation, or to fall back to rough token count estimation when an exception occurs. ideally, there'd be an env var or some other option to disable tiktoken.
we have tiktoken installed in our project for other reasons - not for ddtrace - so auto-enabling it when tiktoken is available is not desirable in our case.
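a rough sketch of what the opt-out plus fallback could look like (the env var name, the heuristic, and the return-value details are assumptions for illustration, not existing ddtrace behaviour):

```python
import os


def _compute_prompt_token_count(prompt, model):
    # hypothetical: gate tiktoken behind an env var so projects that have
    # tiktoken installed for other reasons can still opt out of exact counting
    if os.getenv("DD_OPENAI_DISABLE_TIKTOKEN", "").lower() not in ("1", "true"):
        try:
            import tiktoken

            enc = tiktoken.encoding_for_model(model)
            # exact count; disallowed_special=() avoids the <|endoftext|> crash
            return False, len(enc.encode(prompt, disallowed_special=()))
        except Exception:
            pass  # fall through to the rough estimate instead of raising
    # placeholder heuristic (~4 characters per token) standing in for a real
    # estimation function
    return True, max(1, len(prompt) // 4)
```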
I'm happy to open a PR once we've agreed on the right course of action.
Trace
File "/usr/lib/python3.11/site-packages/ddtrace/contrib/openai/_endpoint_hooks.py", line 133, in _handle_streamed_response
estimated, prompt_tokens = _compute_prompt_token_count(m.get("content", ""), kwargs.get("model"))
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/lib/python3.11/site-packages/ddtrace/contrib/openai/utils.py", line 37, in _compute_prompt_token_count
num_prompt_tokens += len(enc.encode(prompt))
^^^^^^^^^^^^^^^^^^
File "/usr/lib/python3.11/site-packages/tiktoken/core.py", line 117, in encode
raise_disallowed_special_token(match.group())
File "/usr/lib/python3.11/site-packages/tiktoken/core.py", line 400, in raise_disallowed_special_token
raise ValueError(
ValueError: Encountered text corresponding to disallowed special token '<|endoftext|>'.
If you want this text to be encoded as a special token, pass it to `allowed_special`, e.g. `allowed_special={'<|endoftext|>', ...}`.
If you want this text to be encoded as normal text, disable the check for this token by passing `disallowed_special=(enc.special_tokens_set - {'<|endoftext|>'})`.
To disable this check for all special tokens, pass `disallowed_special=()`.
Reproduction Code
No response
Error Logs
No response
Libraries in Use
No response
Operating System
No response
Tracer Version(s)
2.5.1
Python Version(s)
Python 3.11
Pip Version(s)
pip 24.0