
proposal: support context cache for Improved conversation efficiency #1300


Draft · wants to merge 2 commits into main

Conversation

zhengkezhou1

Pull Request Description

This PR aims to propose and discuss a solution for introducing an optional context caching feature into the Aibrix system. Its core goal is to address the performance bottlenecks and resource consumption issues caused by redundant KV Cache computation during LLM inference in multi-turn conversation scenarios, thereby significantly enhancing the efficiency and user experience of conversational AI.

Related Issues

Resolves: #1248

Contributor

@gemini-code-assist bot left a comment


Summary of Changes

Hello @zhengkezhou1, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request introduces a new context caching feature to the Aibrix system, designed to significantly improve the efficiency and reduce computational costs of multi-turn LLM conversations by reusing Key-Value (KV) Caches. It outlines a new API for managing these caches and details the architecture of a new Context Cache Manager component responsible for metadata and lifecycle management.

Highlights

  • Context Caching Feature: Introduces an optional context caching mechanism for LLM inference in multi-turn conversations to reuse Key-Value (KV) Caches and reduce redundant computation, aiming to improve efficiency and reduce costs.
  • New API Endpoint for Cache Management: Proposes a new /v1/context API endpoint for creating and deleting context caches. Each cache is identified by a unique session_id and managed with a ttl (time-to-live).
  • KV Cache Reuse in Completions: Details how existing /v1/completions requests can leverage a session_id to load and utilize previously computed KV Cache data from the KV Cache Sidecar, significantly reducing prompt_tokens in subsequent turns.
  • New Context Cache Manager Component: Defines a new Context Cache Manager component, intended to run within the Runtime Container. This manager is responsible for handling session metadata (session_id, TTL, KV Cache Sidecar references) and their lifecycle, but explicitly not the physical KV Cache data itself.
  • Architectural Request Flows: Provides detailed sequence diagrams illustrating the end-to-end request flows for both creating a new context cache and subsequently using an existing one, outlining interactions between the Client, Envoy, Gateway Plugin, Context Cache Manager, InferencePod, vLLM Main Container, and KV Cache Sidecar.
  • Proposed Data Models and Interfaces: Includes Python pydantic models and a ContextCacheManager class structure, outlining the proposed API request/response formats and the internal interfaces for metadata management within the Context Cache Manager (an illustrative sketch of these shapes follows this list).
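
As a quick illustration of the shapes described above, here is a rough sketch of what the request/response models and the manager's metadata interface could look like. All names (CreateContextRequest, SessionMetadata, sidecar_ref, and so on) are assumptions based on this summary, not the actual definitions in the proposal.

from typing import Optional

from pydantic import BaseModel


class CreateContextRequest(BaseModel):
    # Fields inferred from the proposed /v1/context API; names are assumptions.
    model: str
    prompt: str
    ttl: int = 3600                      # seconds the cached context should live


class CreateContextResponse(BaseModel):
    session_id: str                      # handle that later /v1/completions calls can reuse
    expires_at: int                      # unix timestamp derived from ttl


class SessionMetadata(BaseModel):
    session_id: str
    ttl: int
    sidecar_ref: str                     # reference to the KV Cache Sidecar holding the data


class ContextCacheManager:
    """Tracks session metadata only; the physical KV cache stays in the sidecar."""

    def __init__(self) -> None:
        self._sessions: dict[str, SessionMetadata] = {}

    def create(self, meta: SessionMetadata) -> None:
        self._sessions[meta.session_id] = meta

    def get(self, session_id: str) -> Optional[SessionMetadata]:
        return self._sessions.get(session_id)

    def delete(self, session_id: str) -> None:
        self._sessions.pop(session_id, None)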

Contributor

@gemini-code-assist bot left a comment


Code Review

The design document proposes a context caching feature to improve conversation efficiency by reusing KV caches. The separation of concerns between the Context Cache Manager and the KV Cache Sidecar is a good design choice. The review focuses on ensuring the KV cache is updated after each turn, clarifying request flows, and refining lifecycle management for robustness and predictability.

@zhengkezhou1
Author

Hi @Jeffwan

I've submitted this draft PR for the context caching feature proposal.

As this is a preliminary design for a new feature, I'd welcome feedback on aspects such as the overall architectural design, the API definitions, and the integration approach with the existing KV Cache Sidecar.

If you have a moment, please take a look. Thank you!

Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Signed-off-by: Zhengke Zhou <[email protected]>
@Jeffwan
Collaborator

Jeffwan commented Jul 24, 2025

@zhengkezhou1 thanks for driving this effort. @happyandslow, did you get a chance to review the proposal? I'm not sure whether #633 covers similar features.

@Jeffwan
Collaborator

Jeffwan commented Jul 24, 2025

@zhengkezhou1 BTW, I see this PR is still in progress. Is it ready for review? If the scope of this PR is the proposal itself, please change the PR title to avoid confusion.

@zhengkezhou1 zhengkezhou1 changed the title Support Context Cache for Improved Conversation Efficiency proposal: support context cache for Improved conversation efficiency Jul 24, 2025
@zhengkezhou1
Author

Yes, this is a proposal. I've already changed the title, and I'd appreciate any feedback to ensure I'm on the right track.

-d '{
"model": "facebook-opt-125m",
"prompt": "Say this is a test",
"ttl": 3600,
Collaborator


In this case, it won't be OpenAI-compatible. Could we use a compatible approach for now? For example, a custom header.

BTW, how do other solutions support such a case?

Author


For compatibility, there are two options here:

  1. Put ttl in the metadata field of the request body, which is unique to OpenAI.
  2. Add ttl to the request header: "x-session-ttl": "3600".

I believe using the header is the better choice. Even if we decide to support other request formats in the future (Gemini, Anthropic), no further adaptation will be needed there.
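
To illustrate option 2, here is a hedged sketch of what a client call could look like with the TTL carried in the proposed x-session-ttl header. The endpoint, API key, and model are taken from the local dev examples in this thread; the header name is only the one suggested above, not an implemented API.

from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",       # Aibrix gateway (local dev placeholder)
    api_key="test-key-1234567890",             # placeholder API key from the dev docs
)

# The body stays fully OpenAI-compatible; the TTL travels in a custom header.
completion = client.completions.create(
    model="facebook-opt-125m",
    prompt="Say this is a test",
    extra_headers={"x-session-ttl": "3600"},   # proposed header carrying the cache TTL
)
print(completion.choices[0].text)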

"object": "text_completion",
"created": 1752594611,
"model": "facebook-opt-125m",
"session_id": "session-01"
Collaborator


Same problem here. Do you expect the engine to return the session_id? As far as I know, that's not supported yet.

Author


We can return the session_id in the response header, similar to what we discussed earlier.

Furthermore, I don't believe we need to modify vLLM. What we need to do is create/delete the corresponding prefix cache for each session by calling the existing KV cache APIs, so we only need to add a new plugin in the data plane, much like the existing routing strategies.

Therefore, we won't need to worry about how the KV cache context is built during the prefill and decode stages once the request reaches the vLLM container.
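
To make the idea concrete, here is a rough sketch of what such a data-plane plugin could look like. Every name here (ContextCachePlugin, KVCacheSidecarClient, the on_request/on_response hooks, the x-session-id and x-session-ttl headers) is hypothetical; the real KV Cache Sidecar APIs and the plugin interface would still need to be mapped in.

import time
import uuid


class KVCacheSidecarClient:
    """Hypothetical wrapper around the KV Cache Sidecar's existing APIs."""

    def load_prefix(self, session_id: str) -> bool:
        # Ask the sidecar to pre-load the cached prefix for this session, if any.
        return False

    def save_prefix(self, session_id: str) -> None:
        # Ask the sidecar to persist the prefix produced by this turn.
        pass

    def delete_prefix(self, session_id: str) -> None:
        # Drop the cached prefix once the session expires.
        pass


class ContextCachePlugin:
    """Hypothetical data-plane plugin, analogous to a routing strategy."""

    def __init__(self, sidecar: KVCacheSidecarClient) -> None:
        self.sidecar = sidecar
        self.expiry: dict[str, float] = {}  # session_id -> expiry timestamp

    def on_request(self, headers: dict) -> dict:
        # Reuse the caller's session or mint a new one; no vLLM changes involved.
        session_id = headers.get("x-session-id") or f"session-{uuid.uuid4().hex[:8]}"
        ttl = int(headers.get("x-session-ttl", "3600"))
        self.expiry[session_id] = time.time() + ttl
        self.sidecar.load_prefix(session_id)    # warm the prefix cache before inference
        headers["x-session-id"] = session_id
        return headers

    def on_response(self, session_id: str, response_headers: dict) -> dict:
        # Write the new prefix back and echo the session id to the client.
        self.sidecar.save_prefix(session_id)
        response_headers["x-session-id"] = session_id
        return response_headers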

Author


After reviewing the relevant code, I've found that the caching interface currently only supports retrieval (read). Therefore, if feasible, it will need to support writes as well.

@Jeffwan
Collaborator

Jeffwan commented Jul 24, 2025

@zhengkezhou1 please take a look at my initial comments.

@Jeffwan
Collaborator

Jeffwan commented Jul 28, 2025

@zhengkezhou1 I am busy with the v0.4.0 release; I will check your reply later today.

@Jeffwan
Collaborator

Jeffwan commented Aug 21, 2025

@zhengkezhou1

Sorry for the late response. Could you check usage like the following? If we need to make some necessary changes on the engine side, that's OK.

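# assuming `client` is an OpenAI-compatible SDK client pointed at the gateway, e.g.:
#   from openai import OpenAI
#   client = OpenAI(base_url="http://localhost:8000/v1", api_key="test-key-1234567890")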
response = client.responses.create(
    model="model_name",
    input=[
            {
             "role": "system", 
             "content": "you are a helpful assistant, please help us generate a python program to generate a random number"
            }
          ],
    extra_body={
        "caching": {"type": "enabled"}
    }
)

@zhengkezhou1
Author

Following https://aibrix.readthedocs.io/latest/development/development.html#development, testing on macOS.

The first inference takes 3:25:

curl -v http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-H "Authorization: Bearer test-key-1234567890" \
-d '{
    "model": "facebook-opt-125m",
    "messages": [
        {
            "role": "system",
            "content": "Say this is a test"
        }
    ],
    "caching": {
        "type": "enabled"
    }
}' | jq 
* Host localhost:8000 was resolved.
* IPv6: ::1
* IPv4: 127.0.0.1
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0*   Trying [::1]:8000...
* Connected to localhost (::1) port 8000
> POST /v1/chat/completions HTTP/1.1
> Host: localhost:8000
> User-Agent: curl/8.7.1
> Accept: */*
> Content-Type: application/json
> Authorization: Bearer test-key-1234567890
> Content-Length: 205
> 
} [205 bytes data]
* upload completely sent off: 205 bytes
100   205    0     0  100   205      0      1  0:03:25  0:01:54  0:01:31     0< HTTP/1.1 200 OK
< date: Fri, 22 Aug 2025 11:15:50 GMT
< server: uvicorn
< content-length: 6777
< content-type: application/json
< 
{ [6777 bytes data]
100  6982  100  6777  100   205     58      1  0:03:25  0:01:54  0:01:31  1531
* Connection #0 to host localhost left intact
{
  "id": "chatcmpl-9272ff327d73424ea9af101c95e45885",
  "object": "chat.completion",
  "created": 1755861350,
  "model": "facebook-opt-125m",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "reasoning_content": null,
        "content": "\nLINK THIS RESULTS IN COMMON ENLARGE\n\nYou can click here to Firefox Extension 20 or newer, or if you haven't tried it, they're free on the first step in accessing the web on OS X - like in the content catalog rather than elements of the web you click on. It also assumes that you have proper internet access (which I have on PC, if you wish to) when you click RIncubator.\n\nFor minor websites, I'll let you guys use the FSSG.net.net if you're early-generation linux ir, but if you have or want to be early-generation, you have to fully install the Muse version on macOS (don't expect a Montessori High School-period path, but hey, when the wheel runs back in time to girls, do it\") and in other software, you need to spin hundreds of unknown pieces of floppy disks that wrinkle randomly all through the day - but moving your numbers includes unlocking safe open windows and allowing your game programming apps to run them, to install them, and enable them. You could also run Bart: Arch.\n\nREPLACE\n\nAt the end of each window, a list of 64 characters to write to. Then click Apply. The window closes, you then shift and close them again by clicking OK. This is for iOS apps, and any media you've created to use the notification and navigation store.\n\nThen, if you want to repurchase some of the disk, your arbitrary values tap the bubble icon when you press enter and go to the item /home/science/sat430/bufStationers.\n\nThe BufStationers.net.net stuff is quite familiar. Committed by Nasir Hussain (lighthum.com) - http://nafsromanghazi.org- the official site of the normies.com network - and many others. There's also an in-depth profile on the BufStationers.net site. They have lots of vaulted pages, including a true me page on a Warped Tour in Cambodia click-through at https://archive.org/mylegalacidband-beat on WTF. I mean, it's conceivable that some people actually started \"taking that shell\" on TF2 before the launch, but it may be socrimorically plausible. Nasir and others commented on various aspects of \"running a virtual game archive\" the site provides including \"Having access to a floppy disk via USB as well as system resources on your system.\" Could anyone link that a full-observation - aology - documenting the archive (in details at reddit.com/r/circlejerk) with any extra notes?\n\nAnd, of course, make sure you go to the very first \"Support nafsromanghazi\" thread to ask about tracking down the couple researching \"token trade\" topics, which we'll draw two, and we'll see you at /home/the_glass_duck-drinking-smee-line-doodlywaffen_ussia.\n\nEven a few minutes\n\nUntil a quick time in the end when the artist turned his icons into websites for a Flickr Thread which they see in the sidebar:\n\nProbably education and entertainment stunts i.e,cancer research\n\nAnd, then there's this collection of probably storied websites, blogs, blogs and media sites. Remember, the goal is to provide something of value to and associated social engagements, as well as to provide avenues both personally associative media (like a photo of yourself) and in order to educate oneself and others, and accomplish something beyond just passive educational rituals (like creating a blog).\n\nUnfortunately, the preceding comments and submissions give me nothing I have to gain at all. 
I want music interviews, requests for creativity, freebies (card storage) and I am (finally) writing Fiction, even though it would mean going back and learning what I wrote offline in the past, and mostly only in the hopes of creating something that was user-friendly\n\nThis \\~blog \\~_for\\~ and also \\~_NET_CALL made me feel optimistic about this subject -- as mainstream music and journalists rapidly ramp up their activism, so is the Seymour Observer ? (unfortunately I don't speak Norwegian, but you get the picture, theIOR¹ translated as something kind ...)\n\nI also really liked : \"Return of Substance\" \\~ \\~ _with_ IHTS :\\)\n\nThank you. Siw The Lost...\n\nAh, that was quite happy to see you guys at /home/the_glass_duck-drinking-smee-line-doodlywaffen_ussia :\\)\n\nStill less hopeful about \\~ this creepy blob that is the legion opinion being peddled -- by the McGee staff:\n\nHis Last Words seem very difficult to run, composing a message in a really slow .bashrc canvas (rather unlike the \\~_blog \\~_coop project), with the first three most expected values 0-88. If truth be told, I believe that was what \\~_class\\~ planned was going to be, not \\~Siw@yornce4&#39;s original idea.\n\nThe problem is, I have no idea what \\~_class\\~ intended \\~_and~~ _-^* or \\~_class\\~_not ----. What \\~_class\\~ intended ...? You're right, the artifact aesthetics range from the less long pieces, with accepted values, to the more cryptic ones that give little/usually more complex results. The issue seems that \\~_store\\~\\~_feet\\~_stevedoresin à la globale est \\~_shited+body \\~_de_soire\\~_nicken. I'd conclude that \\~_class\\~ just intended \\~_synth\\~_lamp\\~_a~~birdkiller \\~_bittick. It's been a while since I've read it, and okay, you know. My Aulas sense of the context, to be expected, rather defines in <6 w/b&gt; is long time, though. Despite all of this, even is hidden by the journal Pride/Success, and I'll just have to look for the particular repair to an object trying to get it to fix small problems before I can try another read -- although if i don't complete it right into the original old title rather than \"reeeee-n.*\" ^^^ at once, I don't know if it'll cut any further.\n\nEdit: Got a little stronger, and it is still a precaution : leans out, possibly in the pipe route to angled back into it's eardrisive gay nodular shape.\n\nThat Let Us Run - edit:thanks, I saw the first issue, with quotes from workmate. Let Me See CoP shut off the flavor analysis software, and the first 3 lines it could have possibly installed because it could not run , the second one was even more clunky to use because of both .bashrc supported versions, & the third one was incompatible with the guidance attributes of first author, in otherwise upbeat651, main text control ....+/infocommune#vim\n\nThat basic graphics editor will never finish somestrike in my opinion. He doesn't have skillbuild , I don't know how he got there .\n\nNice Build\n\nAvailability seems to be the issue.\n\nWhat kind of design is this?",
        "tool_calls": []
      },
      "logprobs": null,
      "finish_reason": "stop",
      "stop_reason": null
    }
  ],
  "usage": {
    "prompt_tokens": 11,
    "total_tokens": 1593,
    "completion_tokens": 1582,
    "prompt_tokens_details": null
  },
  "prompt_logprobs": null
}

The same request again takes 0:34:

curl -v http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-H "Authorization: Bearer test-key-1234567890" \
-d '{
    "model": "facebook-opt-125m",
    "messages": [
        {
            "role": "system",
            "content": "Say this is a test"
        }
    ],
    "caching": {
        "type": "enabled"
    }
}' | jq
* Host localhost:8000 was resolved.
* IPv6: ::1
* IPv4: 127.0.0.1
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0*   Trying [::1]:8000...
* Connected to localhost (::1) port 8000
> POST /v1/chat/completions HTTP/1.1
> Host: localhost:8000
> User-Agent: curl/8.7.1
> Accept: */*
> Content-Type: application/json
> Authorization: Bearer test-key-1234567890
> Content-Length: 205
> 
} [205 bytes data]
* upload completely sent off: 205 bytes
100   205    0     0  100   205      0      6  0:00:34  0:00:29  0:00:05     0< HTTP/1.1 200 OK
< date: Fri, 22 Aug 2025 11:18:47 GMT
< server: uvicorn
< content-length: 2032
< content-type: application/json
< 
{ [2032 bytes data]
100  2237  100  2032  100   205     68      6  0:00:34  0:00:29  0:00:05   448
* Connection #0 to host localhost left intact
{
  "id": "chatcmpl-bda2677dbcad4a96b18a2c7c33ef7f82",
  "object": "chat.completion",
  "created": 1755861528,
  "model": "facebook-opt-125m",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "reasoning_content": null,
        "content": "Now, NASENDA has gone to Bet-a-Am! With its swirling power and fast speed impressed attention and looks like it is cinder-bulb-powered for Mega Juggled Jugglers, it may well be the field’s most exciting new energy destination. That’s because, when served properly, Mega Juggled Fury is full of creative feminine battle-ish flavors. It can be made to dance at night, block out the scary noise, and be funky for your DJing effort, but it can also give its handlers an assaultive tosser that turns everything the helled out in the open.\nWhen Mega Juggled Fury is setup, we’ve got the pan-Canadian porridge police team waiting in the parking lot under tanks. That solution: when Mega Juggled Fury is mixed into Formula One’s two buggy cans, it can produce 3,500 pure-fuel IV pellets. Our anaesthetic enterriser will help to arrest the bodying squirts and uproot them to their beerier glory. Certainly far superior than this run-of-the-mill Legwar catalyst/proper-house-cannon combo, where the energy-driving jerkbutt overwhelms the fighter.\nThis column originally appeared on The Canadian Press, and was edited by Penny Murray. Send questions to Meteza Steele at [email protected]. Follow her on Twitter @meteza_Jray.\nV 1080p - Crying Blood over the Spare Lives of Silver Knights\n(Video): The images of tiny silver knights are soft, sporty victory images for Kate Fleming. Want to dive in? Be sure to follow Cortana LaEngle (@Cortana) on Instagram and Facebook. Find Grave Watch highlights on Cranesharkan.\nLoading... Loading... Loading... Loading... Loading... Loading...\nMORE ON SPORTS:",
        "tool_calls": []
      },
      "logprobs": null,
      "finish_reason": "stop",
      "stop_reason": null
    }
  ],
  "usage": {
    "prompt_tokens": 11,
    "total_tokens": 399,
    "completion_tokens": 388,
    "prompt_tokens_details": null
  },
  "prompt_logprobs": null
}

The caching field doesn't seem to work, though:

llm-engine      | WARNING 08-22 11:18:48 protocol.py:69] The following fields were present in the request but ignored: {'caching'}
llm-engine      | INFO 08-22 11:18:48 logger.py:39] Received request chatcmpl-bda2677dbcad4a96b18a2c7c33ef7f82: prompt: '</s>Say this is a testASSISTANT:\n', params: SamplingParams(n=1, presence_penalty=0.0, frequency_penalty=0.0, repetition_penalty=1.0, temperature=1.0, top_p=1.0, top_k=-1, min_p=0.0, seed=None, stop=[], stop_token_ids=[], bad_words=[], include_stop_str_in_output=False, ignore_eos=False, max_tokens=2037, min_tokens=0, logprobs=None, prompt_logprobs=None, skip_special_tokens=True, spaces_between_special_tokens=True, truncate_prompt_tokens=None, guided_decoding=None), prompt_token_ids: None, lora_request: None, prompt_adapter_request: None.
llm-engine      | INFO 08-22 11:18:48 engine.py:275] Added request chatcmpl-bda2677dbcad4a96b18a2c7c33ef7f82.
aibrix-runtime  | INFO:     10.244.1.1:42420 - "GET /healthz HTTP/1.1" 200 OK
llm-engine      | INFO 08-22 11:18:50 metrics.py:455] Avg prompt throughput: 2.2 tokens/s, Avg generation throughput: 6.9 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%.
aibrix-runtime  | INFO:     10.244.1.1:42424 - "GET /healthz HTTP/1.1" 200 OK

@zhengkezhou1
Author

I'm looking into more prompt caching approaches, so I'll update this proposal soon, maybe later this week.
