proposal: support context cache for improved conversation efficiency #1300
base: main
Conversation
Signed-off-by: zhengkezhou1 <[email protected]>
Summary of Changes
Hello @zhengkezhou1, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!
This pull request introduces a new context caching feature to the Aibrix system, designed to significantly improve the efficiency and reduce computational costs of multi-turn LLM conversations by reusing Key-Value (KV) Caches. It outlines a new API for managing these caches and details the architecture of a new Context Cache Manager component responsible for metadata and lifecycle management.
Highlights
- Context Caching Feature: Introduces an optional context caching mechanism for LLM inference in multi-turn conversations to reuse Key-Value (KV) Caches and reduce redundant computation, aiming to improve efficiency and reduce costs.
- New API Endpoint for Cache Management: Proposes a new `/v1/context` API endpoint for creating and deleting context caches. Each cache is identified by a unique `session_id` and managed with a `ttl` (time-to-live).
- KV Cache Reuse in Completions: Details how existing `/v1/completions` requests can leverage a `session_id` to load and utilize previously computed KV Cache data from the KV Cache Sidecar, significantly reducing `prompt_tokens` in subsequent turns.
- New Context Cache Manager Component: Defines a new `Context Cache Manager` component, intended to run within the Runtime Container. This manager is responsible for handling session metadata (session_id, TTL, KV Cache Sidecar references) and their lifecycle, but explicitly not the physical KV Cache data itself.
- Architectural Request Flows: Provides detailed sequence diagrams illustrating the end-to-end request flows for both creating a new context cache and subsequently using an existing one, outlining interactions between the Client, Envoy, Gateway Plugin, Context Cache Manager, InferencePod, vLLM Main Container, and KV Cache Sidecar.
- Proposed Data Models and Interfaces: Includes Python `pydantic` models and a `ContextCacheManager` class structure, outlining the proposed API request/response formats and the internal interfaces for metadata management within the Context Cache Manager (see the sketch after this list).
Code Review
The design document proposes a context caching feature to improve conversation efficiency by reusing KV caches. The separation of concerns between the Context Cache Manager and the KV Cache Sidecar is a good design choice. The review focuses on ensuring the KV cache is updated after each turn, clarifying request flows, and refining lifecycle management for robustness and predictability.
Hi @Jeffwan, I've submitted this draft PR for the context caching feature proposal. As this is a preliminary design for a new feature, I'm looking forward to feedback on the overall architectural design, the API definitions, and the integration approach with the existing KV Cache Sidecar. If you have a moment, please take a look. Thank you!
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com> Signed-off-by: Zhengke Zhou <[email protected]>
@zhengkezhou1 thanks for driving this effort. @happyandslow did you get a chance to review the proposal? Not sure if #633 covers similar features?
@zhengkezhou1 BTW, I see this PR is still in progress. Is it ready for review? If the PR's scope is mainly the proposal, please change the PR title to avoid confusion.
Yes, this is a proposal. I've already changed the title, and I'd appreciate any feedback to ensure I'm on the right track.
-d '{
"model": "facebook-opt-125m",
"prompt": "Say this is a test",
"ttl": 3600,
In this case, it won't be OpenAI-compatible. Could we use a compatible approach for now, for example a customized header?
BTW, how do other solutions support such a case?
For compatibility, there are two options here:
- Put `ttl` in the `metadata` field of the request body, which is unique to OpenAI.
- Add `ttl` to the request header: `"x-session-ttl": "3600"`.

I believe using the header is the better choice. Even if we decide to adapt to other request standards in the future (Gemini, Anthropic), we won't need to make additional adaptations there.
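A minimal sketch of what the header-based approach could look like from the client side, assuming the OpenAI-compatible `/v1/completions` endpoint; the base URL and the `x-session-id` header are illustrative assumptions, only `x-session-ttl` comes from the discussion above.

```python
import requests

# Illustrative client call: the request body stays OpenAI-compatible,
# while cache-specific settings travel in custom headers.
resp = requests.post(
    "http://localhost:8000/v1/completions",  # assumed gateway address
    headers={
        "Content-Type": "application/json",
        "x-session-ttl": "3600",       # proposed TTL header
        "x-session-id": "session-01",  # hypothetical header to reuse an existing cache
    },
    json={
        "model": "facebook-opt-125m",
        "prompt": "Say this is a test",
    },
    timeout=30,
)
print(resp.status_code, resp.json())
```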
"object": "text_completion", | ||
"created": 1752594611, | ||
"model": "facebook-opt-125m", | ||
"session_id": "session-01" |
Same problem here. Do you expect the engine to return the `session_id`? As far as I know, that's not supported yet.
We can return the `session_id` in the response header, similar to what we discussed earlier.
Furthermore, I don't believe we need to modify vLLM. What we need to do is create/delete the corresponding prefix cache for each session by calling the existing KV cache APIs. So, we only need to add a new plugin in the data plane, much like the current routing algorithm strategies.
Therefore, we won't need to worry about how the KV cache context is created during the prefill and decoding stages after the request is sent to the vLLM container.
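To make this idea concrete, here is a rough sketch of such a data-plane plugin. The plugin interface, hook names, and the `x-session-id` header are all illustrative assumptions rather than an existing Aibrix API.

```python
import uuid


class ContextCachePlugin:
    """Hypothetical data-plane plugin sketch: cache-related metadata travels in
    headers, so the OpenAI-compatible request/response bodies stay untouched."""

    def __init__(self):
        self.sessions = {}  # session_id -> ttl; stands in for the Context Cache Manager

    def on_request(self, headers: dict) -> None:
        # Reuse the client-supplied session if present, otherwise start a new one.
        session_id = headers.get("x-session-id") or f"session-{uuid.uuid4().hex[:8]}"
        headers["x-session-id"] = session_id
        self.sessions[session_id] = int(headers.get("x-session-ttl", "3600"))

    def on_response(self, request_headers: dict, response_headers: dict) -> None:
        # Echo the session id back so the client can reuse the cache on the next turn.
        response_headers["x-session-id"] = request_headers.get("x-session-id", "")
```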
After reviewing the relevant content, I've found that the caching interface currently only supports retrieval (read) functionality. Therefore, if feasible, it needs to support write capabilities as well.
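The existing sidecar interface isn't quoted in this thread, so as a purely illustrative sketch of the read-plus-write shape being asked for (method names and types are assumptions):

```python
from abc import ABC, abstractmethod
from typing import Optional


class KVCacheStore(ABC):
    """Illustrative interface only: the current API is described as read-only,
    so a write operation like `put` would have to be added."""

    @abstractmethod
    def get(self, prefix_key: str) -> Optional[bytes]:
        """Fetch previously computed KV blocks for a prompt prefix, if present."""

    @abstractmethod
    def put(self, prefix_key: str, kv_blocks: bytes, ttl: int) -> None:
        """Store KV blocks produced after a turn so the next turn can reuse them."""

    @abstractmethod
    def delete(self, prefix_key: str) -> None:
        """Drop the cached blocks when the session expires or is deleted."""
```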
@zhengkezhou1 please take a look at my initial comments.
@zhengkezhou1 I am busy with the v0.4.0 release; I will check your reply later today.
Sorry for the late response. Could you check such usage? If we need to make some necessary changes on the engine side, that's OK.
Following https://aibrix.readthedocs.io/latest/development/development.html#development, I tested on macOS: the first inference took 03:25; sending the same request again took 0:00:34.
I'm looking for more prompt caching info, so I'll update this proposal soon, maybe later this week?
Pull Request Description
This PR aims to propose and discuss a solution for introducing an optional context caching feature into the Aibrix system. Its core goal is to address the performance bottlenecks and resource consumption issues caused by redundant KV Cache computation during LLM inference in multi-turn conversation scenarios, thereby significantly enhancing the efficiency and user experience of conversational AI.
Related Issues
Resolves: #1248