proposal: support context cache for Improved conversation efficiency #1300

# EP: Support Context Cache for Improved Conversation Efficiency

## Background

In multi-turn or session-based Large Language Model (LLM) inference scenarios, the current practice is to send the entire conversation history with each new query. This forces redundant recomputation of the Key-Value (KV) Cache for past prompts, resulting in significant performance bottlenecks and increased computational cost, especially for long contexts. Efficiently reusing and managing the KV Cache for conversation history is therefore critical, and many leading LLM providers have already adopted context caching to mitigate these costs.

## Goal

Based on the capabilities of the already implemented KV Cache Sidecar, this proposal aims to build and integrate an **optional** context caching feature into the Aibrix system. This feature will allow users to efficiently reuse the model inference's KV Cache via a session ID, significantly reducing redundant computation and improving overall performance and resource consumption in multi-turn, conversational LLM interactions.

## Implementation

### Request Flow

We will introduce a new endpoint, `/v1/context`, to manage context caches. The following fields are used:

- `session_id`: A unique identifier for each context cache, created on the first request and supplied in subsequent requests.
- `ttl`: The time-to-live of the cache in seconds, after which it is automatically cleared.

#### Creating a Cache for a Session

```mermaid
sequenceDiagram
    participant C as Client
    participant E as Envoy
    participant G as Gateway Plugin
    participant R as Context Cache Manager
    participant IP as InferencePod
    participant V as vLLM Main Container
    participant S as KV Cache Sidecar

    C->>+E: POST /v1/context (prompt, model, ttl)
    E->>+G: Forward Request
    G->>+R: 1. Request Session ID & Metadata Creation
    R->>-G: Return Session ID
    G->>+V: 2. Submit Prompt for Initial Inference
    V->>V: 2.1. Compute KV Cache for Prompt
    V-->>V: 2.2. Generate Completion (if needed)
    V->>+S: 3. Export & Store KV Cache (via Sidecar API/IPC)
    S->>S: 3.1. Persist KV Cache Data
    S->>-V: Confirmation of Storage & Sidecar Cache ID
    V->>-G: 4. Return Initial Inference Response (incl. Sidecar Cache ID)
    G->>+R: 5. Register Session Metadata (session_id, Sidecar Cache ID, TTL)
    R->>-G: Confirmation of Registration
    G->>-E: Pipe back Response (with session_id, usage)
    E->>-C: Complete Response
    Note over R,S: Context Cache Manager manages metadata. Sidecar handles actual KV Cache data.
```

Before using context caching, a user first needs to create a context cache. Here we create one with a `ttl` of one hour (3600 seconds).

```shell
curl -X POST http://localhost:8000/v1/context \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer test-key-1234567890" \
  -d '{
    "model": "facebook-opt-125m",
    "prompt": "Say this is a test",
    "ttl": 3600
  }'
```

In the response, we can obtain the unique identifier of the created session, `session_id`.

```json
{
  "id": "cmpl-de1f99972bd34149968489cb100b2c88",
  "object": "text_completion",
  "created": 1752594611,
  "model": "facebook-opt-125m",
  "session_id": "session-01",
  ...
  "usage": {
    "prompt_tokens": 6,
    "total_tokens": 93,
    "completion_tokens": 87,
    "prompt_tokens_details": null
  }
}
```

#### Using Context Cache with `session_id`

We can use the context cache by including the obtained `session_id` in the request body.

```mermaid
sequenceDiagram
    participant C as Client
    participant E as Envoy
    participant G as Gateway Plugin
    participant R as Context Cache Manager
    participant IP as InferencePod
    participant V as vLLM Main Container
    participant S as KV Cache Sidecar

    C->>+E: POST /v1/completions (session_id="session-01", prompt="Next turn...")
    E->>+G: Forward Request
    G->>+R: 1. Lookup KV Cache Metadata (session_id="session-01")
    R->>R: 1.1. Check TTL & validity
    R->>-G: Return KV Cache Reference/ID (from Sidecar)
    G->>+S: 2. Load KV Cache Data (using reference/ID)
    S->>S: 2.1. Read KV Cache from persistent storage
    S->>-G: Return KV Cache Data
    G->>+V: 3. Submit Request with Loaded KV Cache & New Prompt
    V-->>V: Generate new completion tokens
    V->>-G: 4. Return Response (generated_text, adjusted usage_info)
    G->>-E: Pipe back Response
    E->>-C: Complete streaming
```

```shell
curl -X POST http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer test-key-1234567890" \
  -d '{
    "session_id": "session-01",
    "model": "facebook-opt-125m",
    "prompt": "Say this is a test"
  }'
```
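
Putting the two calls together, a client sketch might look like the following. Only the request bodies are built here; the `GATEWAY` address and the send step are illustrative assumptions based on the curl examples above, not part of the proposal.

```python
import json

# Assumed gateway address, taken from the curl examples above (illustrative only).
GATEWAY = "http://localhost:8000"


def create_context_body(model: str, prompt: str, ttl: int = 3600) -> str:
    """JSON body for POST /v1/context, using the fields proposed above."""
    return json.dumps({"model": model, "prompt": prompt, "ttl": ttl})


def completion_body(session_id: str, model: str, prompt: str) -> str:
    """JSON body for POST /v1/completions that reuses a cached session."""
    return json.dumps({"session_id": session_id, "model": model, "prompt": prompt})


# A client would POST create_context_body(...) to GATEWAY + "/v1/context",
# read "session_id" from the response, and pass it to completion_body(...)
# for every subsequent conversation turn.
```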

The expected effect is that with context caching, prompt token consumption in multi-turn conversations is reduced, since the cached prefix no longer needs to be resent and recomputed.

```json
{
  ...
  "usage": {
    "prompt_tokens": 1,
    "total_tokens": 50,
    "completion_tokens": 49,
    "prompt_tokens_details": null
  }
  ...
}
```

#### Clearing Context Cache

When the TTL expires, the cache is cleared automatically. An endpoint for manual early deletion is also provided.

```shell
curl -X DELETE http://localhost:8000/v1/context/$session_id \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer test-key-1234567890"
```

### Runtime Container Changes

In our context caching solution, the Runtime Container hosts the core **Context Cache Manager**. It is an independent logical unit whose primary responsibility is to act as the **registration center and lifecycle manager for context cache session metadata** within the system. Unlike the `KV Cache Sidecar`, which directly handles KV Cache data, the `Context Cache Manager` is not responsible for the physical storage, serialization, or injection of the KV Cache; it focuses solely on **logical-level management**.

#### Core Responsibilities

##### Session Metadata Management

- **Registration and Mapping:** When a client first creates a context cache, the Context Cache Manager generates a unique `session_id` and associates it with a **unique reference (`kv_cache_sidecar_ref`)** returned by the KV Cache Sidecar that points to the actual KV Cache data. This mapping is stored as session metadata together with the model ID, TTL, and other information.

- **Query and Validation:** In subsequent requests, the Gateway Plugin queries the Context Cache Manager to obtain the `kv_cache_sidecar_ref` for a given `session_id`. The Manager also validates the session, including checking for TTL expiration.

- **Deregistration and Deletion:** When a user manually requests cache deletion, or when a cache must be cleared due to TTL expiration, the Context Cache Manager removes the corresponding session metadata from its store.

##### Lifecycle Management (TTL)

The Context Cache Manager stores an expiration time (`expires_at`) for each session. It provides mechanisms (e.g., internal background tasks or external calls) to periodically check for and clean up expired session metadata, ensuring cached resources are released promptly.
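
Such a cleanup pass could look like the following sketch. The helper name `evict_expired` is hypothetical, and session metadata is modeled here as plain dicts carrying an `expires_at` Unix timestamp, matching the `CacheSessionMetadata` fields defined below.

```python
import time
from typing import Optional


def evict_expired(store: dict, now: Optional[int] = None) -> list[str]:
    """Remove expired sessions from a session_id -> metadata mapping.

    Each metadata entry is assumed to carry an `expires_at` Unix timestamp.
    Returns the evicted session ids so the caller (e.g. the Gateway Plugin)
    can ask the KV Cache Sidecar to drop the corresponding KV Cache data.
    """
    now = int(time.time()) if now is None else now
    expired = [sid for sid, meta in store.items() if meta["expires_at"] <= now]
    for sid in expired:
        del store[sid]
    return expired
```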

##### System Coordination Layer

The Context Cache Manager exposes clear API interfaces (e.g., `register_session_cache`, `get_session_metadata`, `unregister_session_cache`) to the Gateway Plugin, enabling it to complete the creation, usage, and deletion flows for context caches. It **does not directly interact with the `vLLM Main Container` or the `KV Cache Sidecar` for data transfer**; instead, by passing metadata, it guides the Gateway Plugin in coordinating with the KV Cache Sidecar to load and store KV Cache data.

```python
from typing import Union, Optional
from pydantic import BaseModel  # Pydantic models for request/response bodies


class CreateContextCacheRequest(BaseModel):
    model: str
    prompt: str
    ttl: int = 3600  # seconds


class CreateCacheResponse(BaseModel):
    id: str  # ID of the initial inference
    session_id: str
    model: str
    created: int
    usage: dict  # Contains prompt_tokens, total_tokens, etc.


class DeleteCacheRequest(BaseModel):
    session_id: str


class DeleteCacheResponse(BaseModel):
    session_id: str
    status: str = "success"


class ErrorResponse(BaseModel):
    detail: str
```

```python
class CacheSessionMetadata(BaseModel):
    """Session metadata stored in the ContextCacheManager."""
    session_id: str
    model_id: str  # The model this cache is for
    kv_cache_sidecar_ref: str  # Reference/ID used by the KV Cache Sidecar to identify the actual KV cache data
    expires_at: int  # Unix timestamp for TTL expiry


class ContextCacheManager:
    """
    Context Cache Manager, running in the Runtime Container.
    Main responsibilities:
    1. Manage context cache session metadata (session_id, TTL, KV Cache Sidecar reference).
    2. Provide an API for the Gateway Plugin to register, query, and delete session metadata.
    3. Handle TTL expiration checks and cleanup of sessions (potentially via background tasks).
    """

    def __init__(self):
        # In production, Redis, a distributed key-value store, or a database
        # would typically back this metadata store.
        # Note: the ContextCacheManager does not exchange KV Cache data with the
        # Sidecar directly; it only stores the reference the Sidecar provides.
        # Actual data interaction with the Sidecar is coordinated by the Gateway Plugin.
        self.session_metadata_store: dict[str, CacheSessionMetadata] = {}

    async def register_session_cache(
        self,
        request: CreateContextCacheRequest,  # Metadata from the original create request
        initial_inference_response: CreateCacheResponse,  # Response from the initial vLLM inference
        kv_cache_sidecar_ref: str,  # Unique reference to the KV Cache returned by the KV Cache Sidecar
    ) -> Union[ErrorResponse, CreateCacheResponse]:
        # Implementation details will go here.
        pass

    async def unregister_session_cache(
        self,
        session_id: str,
    ) -> Union[ErrorResponse, DeleteCacheResponse]:
        """
        Deletes the metadata for the specified context cache session.
        Note: this method only deletes metadata and **does not directly trigger**
        the KV Cache Sidecar to delete the actual data. Actual KV Cache data
        deletion is coordinated by the Gateway Plugin, or handled by the Sidecar
        itself via periodic TTL cleanup.
        """
        # Implementation details will go here.
        pass

    async def get_session_metadata(
        self,
        session_id: str,
    ) -> Optional[CacheSessionMetadata]:
        """
        Retrieves cache metadata for the specified session, for use by the
        Gateway Plugin in subsequent requests. Also performs a TTL check.
        """
        # Implementation details will go here.
        pass
```
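
Since the methods above are intentionally left as stubs, a minimal in-memory sketch of the intended behavior may help. This is a simplification, not the proposed implementation: the class name is hypothetical, plain dicts replace the pydantic models, signatures are synchronous, and persistence and Sidecar coordination are omitted.

```python
import time
import uuid
from typing import Optional


class InMemoryContextCacheManager:
    """Simplified sketch of the ContextCacheManager's metadata bookkeeping."""

    def __init__(self) -> None:
        # session_id -> metadata dict; production would use Redis or similar.
        self._store: dict[str, dict] = {}

    def register_session_cache(self, model_id: str, kv_cache_sidecar_ref: str,
                               ttl: int = 3600) -> str:
        """Create a session_id, record its metadata, and return the session_id."""
        session_id = f"session-{uuid.uuid4().hex[:8]}"
        self._store[session_id] = {
            "session_id": session_id,
            "model_id": model_id,
            "kv_cache_sidecar_ref": kv_cache_sidecar_ref,
            "expires_at": int(time.time()) + ttl,
        }
        return session_id

    def get_session_metadata(self, session_id: str) -> Optional[dict]:
        """Return metadata if the session exists and has not expired."""
        meta = self._store.get(session_id)
        if meta is None:
            return None
        if meta["expires_at"] <= int(time.time()):
            # Expired: drop the metadata. The Gateway Plugin coordinates
            # deletion of the actual KV Cache data with the Sidecar.
            del self._store[session_id]
            return None
        return meta

    def unregister_session_cache(self, session_id: str) -> bool:
        """Delete the session metadata; return True if it existed."""
        return self._store.pop(session_id, None) is not None
```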