
feat: add support for lora affinity routing strategy #1181


Open · wants to merge 2 commits into main

Conversation

@Xunzhuo (Member) commented Jun 5, 2025

feat: add support for lora affinity routing strategy

@gemini-code-assist (bot, Contributor) left a comment

Hello @Xunzhuo, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

Summary of Changes

Hello team, gemini-code-assist here to provide a summary of this pull request. This PR introduces a new routing algorithm specifically designed for LoRA (Low-Rank Adaptation) workloads. The primary goal is to leverage pod affinity for specific LoRA models to improve performance by routing requests for a model to a pod that already has that model loaded or waiting to be loaded. It also incorporates a probabilistic approach to allow some requests to be routed to other available pods, enabling load balancing and preventing hot-spotting, even when affinity exists. The implementation involves adding a new routing algorithm file and making minor adjustments to metric definitions.

Highlights

  • New Routing Algorithm: Adds a new routing algorithm named lora-affinity to the gateway's routing plugins.
  • LoRA Affinity Logic: Implements logic to evaluate pods based on whether they are currently running or waiting for the requested LoRA model, using metrics like RunningLoraAdapters and WaitingLoraAdapters.
  • Probabilistic Routing: Introduces a configurable probability threshold (AIBRIX_LORA_AFFINITY_THRESHOLD) to decide whether to route a request to a pod with affinity or to a pod with available capacity (for load balancing) when both options exist; a sketch of this selection follows this list.
  • Metric Integration: Utilizes existing LoRA-related metrics (RunningLoraAdapters, WaitingLoraAdapters, MaxLora) to determine pod affinity and capacity.
  • Metric Name Refactor: Refactors the metric names in pkg/metrics/metrics.go by introducing a constant LoraRequestsInfo for consistency.
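
Concretely, the selection described in the Probabilistic Routing highlight might look roughly like the sketch below. Only the constant value, the AIBRIX_LORA_AFFINITY_THRESHOLD variable, and the threshold semantics come from the PR; loadThreshold, pickGroup, and the string-slice pod representation are illustrative.

package main

import (
	"fmt"
	"math/rand"
	"os"
	"strconv"
)

// Default taken from the PR; the env var overrides it at startup.
const defaultLoraAffinityThreshold = 0.999

// loadThreshold reads AIBRIX_LORA_AFFINITY_THRESHOLD, falling back to the
// default when the variable is unset or unparseable.
func loadThreshold() float64 {
	if v, err := strconv.ParseFloat(os.Getenv("AIBRIX_LORA_AFFINITY_THRESHOLD"), 64); err == nil && v > 0 && v <= 1 {
		return v
	}
	return defaultLoraAffinityThreshold
}

// pickGroup chooses which candidate group to route into. When both groups
// are non-empty, affinity pods win with probability threshold; the rest of
// the traffic goes to pods with spare capacity, which spreads load.
func pickGroup(affinityPods, availablePods []string, threshold float64) []string {
	switch {
	case len(affinityPods) > 0 && len(availablePods) > 0:
		if rand.Float64() < threshold {
			return affinityPods
		}
		return availablePods
	case len(affinityPods) > 0:
		return affinityPods
	default:
		return availablePods
	}
}

func main() {
	group := pickGroup([]string{"pod-a"}, []string{"pod-b", "pod-c"}, loadThreshold())
	fmt.Println("routing into group:", group)
}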

Changelog

  • pkg/metrics/metrics.go
    • Added a new constant LoraRequestsInfo on line 49.
    • Updated the RawMetricName field for MaxLora, RunningLoraAdapters, and WaitingLoraAdapters to use the new LoraRequestsInfo constant instead of the hardcoded string "lora_requests_info" on lines 286, 295, and 304 respectively.
  • pkg/plugins/gateway/algorithms/lora_affinity.go
    • Added a new file implementing the lora-affinity routing algorithm.
    • Defined the algorithm name LoraAffinity on line 43.
    • Defined a defaultLoraAffinityThreshold (0.999) and loaded the configurable loraAffinityThreshold from the environment variable AIBRIX_LORA_AFFINITY_THRESHOLD on lines 39 and 45.
    • Registered the new router NewLoraAffinityRouter on line 49.
    • Implemented the Route method (lines 74-138) which filters pods based on LoRA affinity and capacity, uses the probability threshold for selection, and falls back to a LeastRequest selection within the chosen group or a random selection if no suitable pods are found.
    • Implemented the evaluatePodForLoraAffinity helper method (lines 143-171) to check a pod's affinity and capacity using LoRA metrics.
    • Implemented the checkModelAffinity helper method (lines 174-176) to check if a target model is in the running or waiting sets (both helpers are sketched after this changelog).
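
Based on these changelog entries, the two helpers might be shaped roughly as follows. This is a sketch, not the PR's code: the maxCount parameter, the return shape, and the zero guard are assumptions, while sets.String comes from k8s.io/apimachinery/pkg/util/sets, which the code quoted later in this review also uses.

package main

import (
	"fmt"

	"k8s.io/apimachinery/pkg/util/sets"
)

// checkModelAffinity reports whether the target model is already running
// on the pod or queued to be loaded.
func checkModelAffinity(targetModel string, runningModels, waitingModels sets.String) bool {
	return runningModels.Has(targetModel) || waitingModels.Has(targetModel)
}

// evaluatePodForLoraAffinity classifies a pod: does it have affinity for
// the requested model, and does it have spare adapter slots?
func evaluatePodForLoraAffinity(targetModel string, runningModels, waitingModels sets.String, maxCount int) (hasAffinity, hasCapacity bool) {
	hasAffinity = checkModelAffinity(targetModel, runningModels, waitingModels)
	totalActiveAdapters := runningModels.Len() + waitingModels.Len()
	hasCapacity = maxCount > 0 && totalActiveAdapters < maxCount
	return hasAffinity, hasCapacity
}

func main() {
	running := sets.NewString("sql-lora")
	waiting := sets.NewString("chat-lora")
	affinity, capacity := evaluatePodForLoraAffinity("sql-lora", running, waiting, 4)
	fmt.Printf("affinity=%v capacity=%v\n", affinity, capacity)
}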


A router's choice, a path to find,
For LoRA models, one of a kind.
Affinity calls, a favored place,
Or load balance, just in case.


@gemini-code-assist (bot, Contributor) left a comment

Code Review

The pull request introduces a new LoRA affinity routing strategy, which aims to prioritize pods with existing LoRA model affinity while allowing for load balancing. The implementation appears well-structured and includes relevant logging for debugging. However, there are a few areas that could be improved for clarity and efficiency.

Summary of Findings

  • Incorrect Metric Usage: The evaluatePodForLoraAffinity function retrieves maxAdapters using metrics.WaitingLoraAdapters instead of metrics.MaxLora, potentially leading to incorrect capacity evaluations.
  • Missing Error Context in Logging: The log message for failed LoRA metric retrieval lacks specific error details, hindering debugging efforts (a sketch of a possible fix follows this list).
  • Potential Division by Zero: The code does not explicitly check if maxCount is zero before performing capacity checks, which could lead to unexpected behavior.
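
For the logging finding, here is one possible shape of a fix, assuming a klog-style logger; logMetricFailure and the message text are hypothetical, and errors.Join requires Go 1.20+.

package main

import (
	"errors"

	"k8s.io/klog/v2"
)

// logMetricFailure attaches the underlying lookup errors to the log line so
// operators can see which metric lookup failed and why, instead of a bare
// "failed to get LoRA metrics" message.
func logMetricFailure(podName string, runningErr, waitingErr, maxErr error) {
	if err := errors.Join(runningErr, waitingErr, maxErr); err != nil {
		klog.ErrorS(err, "failed to get LoRA metrics for pod", "pod", podName)
	}
}

func main() {
	logMetricFailure("pod-a", errors.New("metric scrape timed out"), nil, nil)
}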

Merge Readiness

The pull request introduces a valuable new routing strategy. However, the high severity issue regarding the incorrect metric usage in evaluatePodForLoraAffinity must be addressed before merging. Additionally, the medium severity issues should be considered for improvement. I am unable to approve this pull request, and recommend that others review and approve this code before merging.

// Get LoRA metrics for the pod
runningAdapters, runningErr := r.cache.GetMetricValueByPod(pod.Name, pod.Namespace, metrics.RunningLoraAdapters)
waitingAdapters, waitingErr := r.cache.GetMetricValueByPod(pod.Name, pod.Namespace, metrics.WaitingLoraAdapters)
maxAdapters, maxErr := r.cache.GetMetricValueByPod(pod.Name, pod.Namespace, metrics.WaitingLoraAdapters)

Severity: high

The maxAdapters value is retrieved using metrics.WaitingLoraAdapters, but the description suggests it should be metrics.MaxLora. This discrepancy could lead to incorrect capacity evaluations. It should be metrics.MaxLora to align with the intended logic.

Suggested change
- maxAdapters, maxErr := r.cache.GetMetricValueByPod(pod.Name, pod.Namespace, metrics.WaitingLoraAdapters)
+ maxAdapters, maxErr := r.cache.GetMetricValueByPod(pod.Name, pod.Namespace, metrics.MaxLora)

Comment on lines +36 to +39
// Default threshold for LoRA affinity selection probability
// When both affinity and available pods exist, this probability determines
// how often we select from affinity pods vs available pods for load balancing
defaultLoraAffinityThreshold = 0.999

Severity: medium

Consider adding a comment explaining why defaultLoraAffinityThreshold is set to 0.999. What is the reasoning behind this specific value? Is it empirically derived, or based on some theoretical consideration?
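
One possible shape for such a comment; the rationale given below is illustrative rather than taken from the PR:

// defaultLoraAffinityThreshold is deliberately close to 1.0: routing to a
// pod that already has the adapter loaded (or queued) avoids an expensive
// adapter load/swap, so affinity should win almost always. The remaining
// ~0.1% of requests go to pods with spare capacity, which seeds additional
// replicas over time and keeps a single pod from becoming a hot spot.
defaultLoraAffinityThreshold = 0.999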

Comment on lines +125 to +126
} else {
targetPod, err = utils.SelectRandomPod(readyPods, rand.Intn)

Severity: medium

In the case where len(targetPods) == 0, the code falls back to utils.SelectRandomPod(readyPods, rand.Intn). Since readyPods has already been filtered down to routable pods, this fallback always selects a usable pod when one exists; alternatively, consider returning an explicit error when no routable pods are available rather than falling back silently.
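
A sketch of the error-returning alternative suggested here; selectFallbackPod and its error text are hypothetical, and pods are simplified to strings:

package main

import (
	"fmt"
	"math/rand"
)

// selectFallbackPod prefers the filtered target pods, falls back to any
// routable pod, and surfaces an explicit error when nothing is routable
// instead of silently selecting from an empty list.
func selectFallbackPod(targetPods, readyPods []string) (string, error) {
	if len(targetPods) > 0 {
		return targetPods[rand.Intn(len(targetPods))], nil
	}
	if len(readyPods) > 0 {
		return readyPods[rand.Intn(len(readyPods))], nil
	}
	return "", fmt.Errorf("no routable pods available")
}

func main() {
	pod, err := selectFallbackPod(nil, []string{"pod-a", "pod-b"})
	fmt.Println(pod, err)
}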

Comment on lines +163 to +165
// Check if the pod has available capacity
totalActiveAdapters := runningModels.Len() + waitingModels.Len()
hasCapacity := totalActiveAdapters < maxCount

Severity: medium

Consider adding a check to ensure that maxCount is not zero before proceeding with the capacity check. A zero value (for example, from a missing or unparsed metric) would silently mark every pod as having no capacity.
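
In the spirit of this comment, the guard could be as small as the fragment below, which mirrors the quoted lines; treating a non-positive maxCount as "no capacity" is one policy choice, and returning an error is another:

// Check if the pod has available capacity
totalActiveAdapters := runningModels.Len() + waitingModels.Len()
// Guard against maxCount == 0 (e.g. a missing or unparsed MaxLora metric)
// so the result is an explicit policy decision, not an accident of `<`.
hasCapacity := maxCount > 0 && totalActiveAdapters < maxCount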

}

// checkModelAffinity checks if the pod has the target model active or waiting
func (r loraAffinityRouter) checkModelAffinity(targetModel string, runningModels, waitingModels sets.String) bool {
A Collaborator left a comment

One tradeoff here: runningModels are already loaded, while waitingModels may not be yet; under that assumption, pods that only have the model in waitingModels are a suboptimal routing target.
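
One hypothetical way to encode that preference is a two-tier ranking that favors pods where the adapter is already loaded and only then considers pods where it is merely queued; podAdapters and rankByAffinity are illustrative, not part of the PR:

package main

import (
	"fmt"

	"k8s.io/apimachinery/pkg/util/sets"
)

type podAdapters struct {
	name    string
	running sets.String
	waiting sets.String
}

// rankByAffinity splits candidates into two tiers: pods with the adapter
// already loaded, then pods where it is only queued.
func rankByAffinity(targetModel string, pods []podAdapters) (loaded, queued []string) {
	for _, p := range pods {
		switch {
		case p.running.Has(targetModel):
			loaded = append(loaded, p.name)
		case p.waiting.Has(targetModel):
			queued = append(queued, p.name)
		}
	}
	return loaded, queued
}

func main() {
	pods := []podAdapters{
		{"pod-a", sets.NewString("sql-lora"), sets.NewString()},
		{"pod-b", sets.NewString(), sets.NewString("sql-lora")},
	}
	loaded, queued := rankByAffinity("sql-lora", pods)
	fmt.Println("loaded:", loaded, "queued:", queued)
}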

@Jeffwan Jeffwan self-requested a review June 19, 2025 00:05
@Jeffwan (Collaborator) commented Aug 21, 2025

@Xunzhuo could you help update this PR? LoRA improvements are planned for v0.5.0, so let's keep moving and improve the quality.

@Xunzhuo (Member, Author) commented Aug 21, 2025

@Jeffwan sure
