feat: add support for lora affinity routing strategy #1181
Conversation
Signed-off-by: bitliu <[email protected]>
Hello @Xunzhuo, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!
Summary of Changes
Hello team, gemini-code-assist here to provide a summary of this pull request. This PR introduces a new routing algorithm specifically designed for LoRA (Low-Rank Adaptation) workloads. The primary goal is to leverage pod affinity for specific LoRA models to improve performance by routing requests for a model to a pod that already has that model loaded or waiting to be loaded. It also incorporates a probabilistic approach to allow some requests to be routed to other available pods, enabling load balancing and preventing hot-spotting, even when affinity exists. The implementation involves adding a new routing algorithm file and making minor adjustments to metric definitions.
Highlights
- New Routing Algorithm: Adds a new routing algorithm named lora-affinity to the gateway's routing plugins.
- LoRA Affinity Logic: Implements logic to evaluate pods based on whether they are currently running or waiting for the requested LoRA model, using metrics like RunningLoraAdapters and WaitingLoraAdapters.
- Probabilistic Routing: Introduces a configurable probability threshold (AIBRIX_LORA_AFFINITY_THRESHOLD) to decide whether to route a request to a pod with affinity or to a pod with available capacity (for load balancing), when both options exist.
- Metric Integration: Utilizes existing LoRA-related metrics (RunningLoraAdapters, WaitingLoraAdapters, MaxLora) to determine pod affinity and capacity.
- Metric Name Refactor: Refactors the metric names in pkg/metrics/metrics.go by introducing a constant LoraRequestsInfo for consistency.
Changelog
- pkg/metrics/metrics.go
  - Added a new constant LoraRequestsInfo on line 49.
  - Updated the RawMetricName field for MaxLora, RunningLoraAdapters, and WaitingLoraAdapters to use the new LoraRequestsInfo constant instead of the hardcoded string "lora_requests_info" on lines 286, 295, and 304 respectively.
- pkg/plugins/gateway/algorithms/lora_affinity.go
  - Added a new file implementing the lora-affinity routing algorithm.
  - Defined the algorithm name LoraAffinity on line 43.
  - Defined a defaultLoraAffinityThreshold (0.999) and loaded the configurable loraAffinityThreshold from the environment variable AIBRIX_LORA_AFFINITY_THRESHOLD on lines 39 and 45.
  - Registered the new router NewLoraAffinityRouter on line 49.
  - Implemented the Route method (lines 74-138), which filters pods based on LoRA affinity and capacity, uses the probability threshold for selection, and falls back to a LeastRequest selection within the chosen group or a random selection if no suitable pods are found.
  - Implemented the evaluatePodForLoraAffinity helper method (lines 143-171) to check a pod's affinity and capacity using LoRA metrics.
  - Implemented the checkModelAffinity helper method (lines 174-176) to check if a target model is in the running or waiting sets.
Code Review
The pull request introduces a new LoRA affinity routing strategy, which aims to prioritize pods with existing LoRA model affinity while allowing for load balancing. The implementation appears well-structured and includes relevant logging for debugging. However, there are a few areas that could be improved for clarity and efficiency.
Summary of Findings
- Incorrect Metric Usage: The evaluatePodForLoraAffinity function retrieves maxAdapters using metrics.WaitingLoraAdapters instead of metrics.MaxLora, potentially leading to incorrect capacity evaluations.
- Missing Error Context in Logging: The log message for failed LoRA metric retrieval lacks specific error details, hindering debugging efforts.
- Potential Division by Zero: The code does not explicitly check if maxCount is zero before performing capacity checks, which could lead to unexpected behavior.
Merge Readiness
The pull request introduces a valuable new routing strategy. However, the high-severity issue regarding the incorrect metric usage in evaluatePodForLoraAffinity must be addressed before merging. Additionally, the medium-severity issues should be considered for improvement. I am unable to approve this pull request and recommend that others review and approve this code before merging.
// Get LoRA metrics for the pod
runningAdapters, runningErr := r.cache.GetMetricValueByPod(pod.Name, pod.Namespace, metrics.RunningLoraAdapters)
waitingAdapters, waitingErr := r.cache.GetMetricValueByPod(pod.Name, pod.Namespace, metrics.WaitingLoraAdapters)
maxAdapters, maxErr := r.cache.GetMetricValueByPod(pod.Name, pod.Namespace, metrics.WaitingLoraAdapters)
The maxAdapters value is retrieved using metrics.WaitingLoraAdapters, but the description suggests it should be metrics.MaxLora. This discrepancy could lead to incorrect capacity evaluations. It should be metrics.MaxLora to align with the intended logic.
- maxAdapters, maxErr := r.cache.GetMetricValueByPod(pod.Name, pod.Namespace, metrics.WaitingLoraAdapters)
+ maxAdapters, maxErr := r.cache.GetMetricValueByPod(pod.Name, pod.Namespace, metrics.MaxLora)
// Default threshold for LoRA affinity selection probability
// When both affinity and available pods exist, this probability determines
// how often we select from affinity pods vs available pods for load balancing
defaultLoraAffinityThreshold = 0.999
} else {
	targetPod, err = utils.SelectRandomPod(readyPods, rand.Intn)
In the case where len(targetPods) == 0, the code falls back to utils.SelectRandomPod(readyPods, rand.Intn). However, readyPods has already been filtered for routable pods. It might be more appropriate to use the original readyPods list to ensure a pod is always selected if available, or to return an error indicating no routable pods are available.
// Check if the pod has available capacity
totalActiveAdapters := runningModels.Len() + waitingModels.Len()
hasCapacity := totalActiveAdapters < maxCount
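A guarded variant of this capacity check, addressing the zero-maxCount concern raised in the findings. hasCapacity here is a free function over plain ints for illustration; the PR operates on set lengths and a parsed metric value.

```go
package main

import "fmt"

// hasCapacity reports whether a pod can accept another LoRA adapter. It
// guards against maxCount <= 0, which can happen when the max_lora metric is
// missing or unparsable: in that case the pod is treated as having no spare
// capacity rather than comparing against a meaningless bound.
func hasCapacity(runningCount, waitingCount, maxCount int) bool {
	if maxCount <= 0 {
		return false
	}
	return runningCount+waitingCount < maxCount
}

func main() {
	fmt.Println(hasCapacity(1, 1, 4)) // below max: has capacity
	fmt.Println(hasCapacity(2, 2, 4)) // at max: no capacity
	fmt.Println(hasCapacity(0, 0, 0)) // unknown/zero max: no capacity
}
```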
Signed-off-by: bitliu <[email protected]>
}

// checkModelAffinity checks if the pod has the target model active or waiting
func (r loraAffinityRouter) checkModelAffinity(targetModel string, runningModels, waitingModels sets.String) bool {
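A simplified, self-contained sketch of such an affinity check, using plain Go maps in place of the k8s sets.String type used in the PR:

```go
package main

import "fmt"

// checkModelAffinity reports whether a pod has affinity for targetModel: the
// adapter is either already loaded (running) or queued to load (waiting).
// Plain bool-valued maps stand in for sets.String here.
func checkModelAffinity(targetModel string, running, waiting map[string]bool) bool {
	return running[targetModel] || waiting[targetModel]
}

func main() {
	running := map[string]bool{"lora-a": true}
	waiting := map[string]bool{"lora-b": true}
	fmt.Println(checkModelAffinity("lora-a", running, waiting)) // loaded
	fmt.Println(checkModelAffinity("lora-b", running, waiting)) // queued
	fmt.Println(checkModelAffinity("lora-c", running, waiting)) // no affinity
}
```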
One tradeoff here: runningModels are already loaded, but waitingModels may not be yet. Under that assumption, pods that only have the model in the waiting set are suboptimal affinity targets.
@Xunzhuo could you help update this PR? LoRA improvement is planned for v0.5.0, so let's keep moving and improve the quality.
@Jeffwan sure