Refactor: Extract KV event management to break circular dependency #1401
Conversation
Summary of Changes
Hello @ae86zhizhi, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!
I've refactored the KV event management system to eliminate a circular dependency between the cache.Store and KVEventManager components. This was achieved by extracting the event management logic into a new, dedicated pkg/kvevent package. The core of this solution involves a dual adapter pattern, which allows for clean architectural boundaries and proper dependency inversion. This change significantly improves the testability, maintainability, and extensibility of the codebase by ensuring a clear, unidirectional flow of dependencies.
Highlights
- Dependency Inversion: Introduced a new pkg/kvevent package with focused interfaces (PodProvider, SyncIndexProvider, SyncIndexer) to abstract dependencies.
- Dual Adapter Pattern: Implemented storeProviderAdapter and syncIndexerAdapter in pkg/cache to bridge cache.Store with kvevent and syncprefixcacheindexer, acting as an Anti-Corruption Layer and handling type conversions.
- Circular Dependency Elimination: Successfully broke the circular reference between cache.Store and KVEventManager, leading to a cleaner, unidirectional dependency flow.
- Improved Testability: Components can now be tested in isolation with mocks.
- Enhanced Maintainability & Extensibility: Clear separation of concerns makes the codebase easier to maintain and extend.
- Configuration Validation Moved: Configuration validation for KV event sync was moved to ensure requirements are met.
- Comprehensive Testing: Added extensive unit and integration tests for all new components, achieving 82.5% coverage for the kvevent package.
Code Review
This is an excellent and well-documented refactoring that successfully breaks a circular dependency by introducing a new kvevent package and applying the Dependency Inversion Principle with a dual-adapter pattern. The changes significantly improve the architecture, making the system more modular, testable, and maintainable. The new tests are comprehensive and cover unit, integration, and lifecycle scenarios. My feedback focuses on improving robustness in configuration parsing, using more idiomatic Go for control flow, and highlighting a potential data loss scenario for future improvement.
@ae86zhizhi Seems there are lots of new changes? Are they necessary? I thought this was a refactor PR (I mean the added/removed lines should be equal or close).
I think #1387 is the proper scope for this change?
Explanation for Line Count Increase (+2665/-1208)
This commit is an intentional investment in our codebase's health and contains no new features. The line count increase is due to two main reasons: the comprehensive unit and integration tests added for the new package, and the interface and adapter scaffolding required for the dependency inversion.
@ae86zhizhi If that's ready, please mark it as ready. Please also address the DCO issue and sign off all commits.
Force-pushed from 41b0b2e to 42759d6
Code Review
This pull request is an excellent and well-executed refactoring to eliminate a circular dependency between cache.Store and KVEventManager. By extracting the event management logic into a new pkg/kvevent package and using interfaces with a dual adapter pattern, the architecture is now much cleaner, more testable, and easier to maintain. The changes are substantial but well-contained, and the addition of comprehensive unit and integration tests provides confidence in the new implementation. My review focuses on improving maintainability by replacing hardcoded environment variable names with constants.
Other parts LGTM, please help address existing comments, thanks.
Force-pushed from 3aea2d2 to 9d7c9b8
```go
// - Subscription state (status.Phase) changed, this applies to the same pod or different pods
if !isSamePod(oldPod, newPod) || oldSubscribable != newSubscribable {
	if oldSubscribable {
		m.unsubscribeFromPod(podKey)
```
Seems it is intentional to not delete from the sync indexer, but why?
@DwyaneShi What do you mean by "delete from sync indexer"? If oldSubscribable == false, it was not subscribed, so there is no need to unsubscribeFromPod. If the new pod is updated (added), it will go through onPodDelete instead of this function.
@DwyaneShi Let me provide a detailed explanation of the situation before and after KV-Sync was introduced, as well as our current design considerations.
Background: The "Ghost Cache" Problem Before KV-Sync
Before the KV-Sync feature was merged, AIBrix had a "ghost KV cache" problem. Specifically, when a pod crashed or disconnected, its KV cache entries would remain in the prefix indexer. These stale entries persisted until their TTL expired (20 minutes by default), and were only passively evicted. During this time, new requests could still be routed to these dead pods, leading to connection failures and 503 errors. We lacked an active cleanup mechanism at the time.
Current State: Improvements Brought by KV-Sync
The current implementation with KV-Sync has actually improved this situation. When a pod is deleted, OnPodDelete in kvevent/manager.go is called, which in turn triggers RemovePrefix to actively clean up all of the pod's prefix entries. This means we have introduced an active cleanup capability for pod deletions, which we did not have before.
Design Rationale for the OnPodUpdate Path
You are correct in observing that in the OnPodUpdate path, we do not immediately remove the entries from the sync indexer. This is intentional, and the reasoning is as follows:
- A pod update event does not necessarily mean the pod is terminated; it might just be in a temporarily unroutable state.
- We anticipate that the pod could become routable again after a brief interruption (e.g., a network hiccup).
- If the pod is ultimately deleted, the OnPodDelete logic will ensure the cleanup is performed.
Future Improvements
I agree with you that this logic could be further improved. Ideally, we should also clean up the indexer entries when a pod remains in an unroutable state for an extended period.
However, implementing this reliably requires careful consideration of the state machine and timing. It's a separate concern from the circular dependency issue that this PR aims to solve. A more appropriate approach would be to address this in a dedicated PR with thorough testing covering various pod lifecycle scenarios.
This current PR focuses on fixing the core dependency issue while maintaining backward compatibility. We can enhance the cleanup logic for pod state transitions in a follow-up PR.
What are your thoughts? Shall we create a new issue to track this enhancement?
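To make the two paths concrete, here is a minimal, self-contained sketch of the behavior described in this thread. OnPodUpdate, OnPodDelete, and RemovePrefix are named above; all types, fields, and signatures below are assumptions for illustration, not the PR's actual code.

```go
package kvevent

// prefixRemover is an assumed stand-in for the sync indexer's cleanup surface.
type prefixRemover interface{ RemovePrefix(podKey string) }

// Pod is a placeholder for the pod metadata the manager watches.
type Pod struct {
	Namespace, Name string
	Routable        bool // stands in for the status.Phase subscribability check
}

type Manager struct {
	subs    map[string]struct{} // active event subscriptions by pod key
	indexer prefixRemover
}

func NewManager(indexer prefixRemover) *Manager {
	return &Manager{subs: make(map[string]struct{}), indexer: indexer}
}

func podKey(p *Pod) string { return p.Namespace + "/" + p.Name }

func isSamePod(a, b *Pod) bool { return podKey(a) == podKey(b) }

// OnPodUpdate resubscribes when the pod identity or subscription state
// changes, but deliberately leaves sync-indexer entries alone: the pod
// may only be temporarily unroutable and could come back shortly.
func (m *Manager) OnPodUpdate(oldPod, newPod *Pod) {
	oldSub, newSub := oldPod.Routable, newPod.Routable
	if !isSamePod(oldPod, newPod) || oldSub != newSub {
		if oldSub {
			delete(m.subs, podKey(oldPod)) // drop the subscription only
		}
		if newSub {
			m.subs[podKey(newPod)] = struct{}{}
		}
	}
}

// OnPodDelete is the final cleanup path: unsubscribe and actively remove
// all of the pod's prefix entries from the sync indexer.
func (m *Manager) OnPodDelete(pod *Pod) {
	key := podKey(pod)
	delete(m.subs, key)
	m.indexer.RemovePrefix(key)
}
```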
Current and old behavior (before introducing KV event pub/sub)
If a pod is disconnected, whether because of deletion or update, its corresponding KV cache entries will not be proactively deleted from the sync indexer. If a request follows the KV cache and tries to access the pod, it will return an error, and the LLM engine needs to decide whether to compute it or try another pod (which is not available for now).
Basically, if a false positive is triggered in the KV cache, the engine needs to compute.
Obsolete cache entries will eventually be evicted based on LRU.
What the KV event manager can do
Since the event manager now has the information about pod disconnection, it can invoke a callback to the KV cache to evict obsolete entries. However, this requires scanning the KV cache, which means holding a lock for a long time. Thus it may not be a good idea to proactively evict all obsolete cache entries.
Potential workarounds
The problem of not removing obsolete cache entries is one of false positives. The cache itself can also check the pod state before returning cache entries. If it finds that the pod has been disconnected but cache entries are still alive, it can set a dirty/garbage flag on these entries and return a cache miss, or try cache entries from a different pod.
The overall design can be:
- Background GC: the KV event manager notifies the KV cache's background GC scheduler that all cache entries from this pod have become invalid, so these entries get higher priority for eviction in the background GC process.
- Lazy check of pod state: if the cache entries are not yet evicted, the KV cache should also check the pod state before returning them, or mark the entries invalid for eviction.
However, this still cannot fully resolve the false positives. It's possible that the pod is alive when the KV cache checks it, but the connection drops when the KV cache tries to access the pod.
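For illustration only, a rough sketch of the lazy pod-state check combined with priority eviction; every name here is hypothetical, since the thread describes the design but no code:

```go
package prefixcache

// Entry is a hypothetical prefix-cache entry tied to the pod that owns it.
type Entry struct {
	PodKey  string
	Invalid bool // set once the owning pod is found disconnected
}

// PodStates is an assumed lookup into live pod metadata.
type PodStates interface {
	IsAlive(podKey string) bool
}

type Cache struct {
	entries map[string]*Entry
	pods    PodStates
	gcQueue chan string // prefixes queued for priority eviction by background GC
}

// Lookup returns a miss for entries whose pod is gone instead of routing
// a request to a dead pod (the false positive discussed above).
func (c *Cache) Lookup(prefix string) (*Entry, bool) {
	e, ok := c.entries[prefix]
	if !ok || e.Invalid {
		return nil, false
	}
	if !c.pods.IsAlive(e.PodKey) {
		e.Invalid = true // mark dirty rather than scan-and-lock the whole cache
		select {
		case c.gcQueue <- prefix: // hint the background GC to evict this first
		default: // queue full; LRU eviction will catch it eventually
		}
		return nil, false
	}
	return e, true
}
```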
LGTM, let's have a follow-up PR to address the issue.
- Move KV event management from pkg/cache to new pkg/kvevent package
- Define interfaces (PodProvider, SyncIndexProvider, SyncIndexer) for dependency inversion
- Implement adapter pattern to bridge cache.Store with kvevent interfaces
- Maintain backward compatibility with existing KVEventManager API
- Improve testability with clear separation of concerns
- Add comprehensive test coverage for the new package structure

Signed-off-by: ZHENYU <[email protected]>
This commit addresses code review feedback to improve robustness:

1. Add proper error handling for strconv.ParseBool in validateKVEventConfiguration()
   - Return descriptive errors instead of silently ignoring parse failures
   - Help users identify configuration issues early
2. Add warning logs in validateConfiguration() for invalid boolean values
   - Log warnings when environment variables contain invalid boolean values
   - Default to false but inform users about the parsing issue
3. Replace goto with labeled break for better Go idioms
   - Use labeled break instead of goto for loop exit
   - Improves code readability and follows Go best practices

These changes make configuration errors more visible and easier to debug, preventing silent failures when users set invalid environment variable values.

Signed-off-by: ZHENYU <[email protected]>
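For reference, a small sketch of the labeled-break pattern mentioned in item 3; the loop and all names are hypothetical:

```go
package cache

// findPodWithCapacity exits nested loops with a labeled break instead of
// goto, which reads more idiomatically in Go.
func findPodWithCapacity(groups [][]string, hasCapacity func(string) bool) (found string, ok bool) {
search:
	for _, group := range groups {
		for _, pod := range group {
			if hasCapacity(pod) {
				found, ok = pod, true
				break search // previously a goto would have jumped past both loops
			}
		}
	}
	return found, ok
}
```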
…inability

- Add constants for remote tokenizer environment variables in pkg/constants/kv_event_sync.go:
  - EnvUseRemoteTokenizer for AIBRIX_USE_REMOTE_TOKENIZER
  - EnvPrefixCacheTokenizerType for AIBRIX_PREFIX_CACHE_TOKENIZER_TYPE
  - EnvRemoteTokenizerEndpoint for AIBRIX_REMOTE_TOKENIZER_ENDPOINT
- Replace all hardcoded environment variable strings with constants throughout:
  - pkg/kvevent/manager.go
  - pkg/cache/kv_event_manager_zmq.go
  - Test files for consistent usage

This change improves code maintainability by preventing typos and providing a single source of truth for environment variable names, following the established pattern in the AIBrix codebase.

Signed-off-by: ZHENYU <[email protected]>
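Based on the names in this commit message, the constants file plausibly looks like the following sketch (doc comments and layout are assumptions):

```go
// Sketch of pkg/constants/kv_event_sync.go; constant and variable names
// come from the commit message above.
package constants

const (
	// EnvUseRemoteTokenizer toggles remote tokenization for KV event sync.
	EnvUseRemoteTokenizer = "AIBRIX_USE_REMOTE_TOKENIZER"
	// EnvPrefixCacheTokenizerType selects the prefix-cache tokenizer type.
	EnvPrefixCacheTokenizerType = "AIBRIX_PREFIX_CACHE_TOKENIZER_TYPE"
	// EnvRemoteTokenizerEndpoint points at the remote tokenizer service.
	EnvRemoteTokenizerEndpoint = "AIBRIX_REMOTE_TOKENIZER_ENDPOINT"
)
```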
Signed-off-by: Qizhong Mao <[email protected]>
…context

Signed-off-by: ZHENYU <[email protected]>
Force-pushed from 9d7c9b8 to a916362
Refactor: Extract KV Event Management to Break Circular Dependency
Summary
This PR refactors the KV event management system to eliminate a circular dependency between cache.Store and KVEventManager. The solution extracts event management into a new pkg/kvevent package and uses a dual adapter pattern to maintain clean architectural boundaries.
Background
The original architecture had a circular dependency:
- cache.Store contained KVEventManager
- KVEventManager needed access to Store.metaPods and Store.syncPrefixIndexer

This circular dependency made the code difficult to test, maintain, and reason about.
Solution: Dependency Inversion with Dual Adapters
1. Interface Segregation
We defined three focused interfaces in the new pkg/kvevent package:
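A sketch of what these interfaces might look like. The three interface names and GetSyncIndexer appear in this PR; the remaining method sets and placeholder types are assumptions:

```go
package kvevent

import "context"

// PodProvider abstracts read access to pod metadata held by the store.
type PodProvider interface {
	GetPod(ctx context.Context, namespace, name string) (*PodInfo, bool)
}

// SyncIndexProvider hands out the sync indexer without exposing its
// concrete type to this package.
type SyncIndexProvider interface {
	GetSyncIndexer(ctx context.Context) (SyncIndexer, error)
}

// SyncIndexer is the minimal indexing surface the event manager needs.
type SyncIndexer interface {
	ProcessBlockStored(ctx context.Context, ev BlockStoredEvent) error
	RemovePrefix(ctx context.Context, modelName, podKey string) error
}

// Placeholder types so the sketch stands alone.
type PodInfo struct{ Namespace, Name string }

type BlockStoredEvent struct {
	ModelName   string
	BlockHashes []int64
}
```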
2. Dual Adapter Architecture
The key innovation is using two adapters that work together:
First Adapter: storeProviderAdapter
Located in pkg/cache/store_providers.go, this adapter:
- Implements the PodProvider and SyncIndexProvider interfaces
- Bridges cache.Store with the kvevent package
- Provides access to pod metadata in Store.metaPods
- Returns the sync indexer when GetSyncIndexer() is called

Second Adapter: syncIndexerAdapter
Also in pkg/cache/store_providers.go, this adapter:
- Implements the SyncIndexer interface
- Wraps syncprefixcacheindexer.SyncPrefixHashTable
- Converts between kvevent event types and syncindexer event types

3. Why Two Adapters?
The dual adapter pattern solves a complex architectural challenge:
Type Conversion: Go does not allow direct conversion between event structs defined in different packages once their field types diverge, so the syncIndexerAdapter handles conversion between kvevent.BlockStoredEvent and syncindexer.BlockStored.
Dependency Direction: Without the adapter, syncprefixcacheindexer would need to implement kvevent.SyncIndexer, creating a dependency on kvevent. With the adapter, all dependencies point one way: cache → kvevent and cache → syncprefixcacheindexer.
Clean Boundaries: Each package remains focused on its core responsibility without knowledge of the others' internals.
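A compact sketch of the conversion described under Type Conversion above. The names syncIndexerAdapter, BlockStoredEvent, BlockStored, and SyncPrefixHashTable come from this PR; the field names and stand-in definitions are assumptions:

```go
package cache

import "context"

// Stand-ins for types that really live in kvevent and syncprefixcacheindexer.
type BlockStoredEvent struct { // kvevent.BlockStoredEvent
	ModelName   string
	BlockHashes []int64
}

type BlockStored struct { // syncindexer.BlockStored
	ModelName   string
	BlockHashes []int64
}

type SyncPrefixHashTable struct{} // syncprefixcacheindexer.SyncPrefixHashTable

func (t *SyncPrefixHashTable) ProcessBlockStored(ctx context.Context, ev BlockStored) error {
	return nil // the real implementation updates the prefix hash table
}

// syncIndexerAdapter converts kvevent types into syncindexer types so the
// two packages never import each other; cache depends on both instead.
type syncIndexerAdapter struct{ indexer *SyncPrefixHashTable }

func (a *syncIndexerAdapter) ProcessBlockStored(ctx context.Context, ev BlockStoredEvent) error {
	return a.indexer.ProcessBlockStored(ctx, BlockStored{
		ModelName:   ev.ModelName,
		BlockHashes: ev.BlockHashes,
	})
}
```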
Architecture Diagram
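Rendered as text, the unidirectional dependency flow this PR establishes:

```
cache ──► kvevent                  (storeProviderAdapter implements the kvevent interfaces)
cache ──► syncprefixcacheindexer   (syncIndexerAdapter wraps SyncPrefixHashTable)
kvevent ─/─► cache                 (no reverse edge: the cycle is broken)
```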
Benefits
- No circular dependency: a clear, unidirectional flow from cache to kvevent and syncprefixcacheindexer
- Improved testability: components can be tested in isolation with mocks
- Clear separation of concerns, making the codebase easier to maintain and extend
Implementation Details
Backward Compatibility
The existing KVEventManager API is preserved through a wrapper:
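A sketch of what the wrapper could look like. KVEventManager and store_providers.go are named in this PR; the kvevent constructor, the adapter constructor, and the method names are assumptions, and Store and storeProviderAdapter are defined elsewhere in pkg/cache:

```go
package cache

import "github.com/vllm-project/aibrix/pkg/kvevent" // import path assumed

// KVEventManager keeps the old public API and delegates to the new
// kvevent manager through the adapters in store_providers.go.
type KVEventManager struct {
	inner *kvevent.Manager
}

func NewKVEventManager(store *Store) *KVEventManager {
	adapter := newStoreProviderAdapter(store) // hypothetical constructor
	// The same adapter serves as both PodProvider and SyncIndexProvider.
	return &KVEventManager{inner: kvevent.NewManager(adapter, adapter)}
}

func (m *KVEventManager) Start() error { return m.inner.Start() }
func (m *KVEventManager) Stop()        { m.inner.Stop() }
```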
Configuration Validation
Configuration validation was moved to ensure KV event sync requirements are met:
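A sketch of the kind of check this refers to, combining the ParseBool handling and the constants from the commit messages above; the function name validateKVEventConfiguration appears there too, but the body is illustrative:

```go
package kvevent

import (
	"fmt"
	"os"
	"strconv"

	"github.com/vllm-project/aibrix/pkg/constants" // import path assumed
)

// validateKVEventConfiguration surfaces bad configuration early instead of
// silently defaulting, per the review feedback addressed above.
func validateKVEventConfiguration() error {
	raw := os.Getenv(constants.EnvUseRemoteTokenizer)
	if raw == "" {
		return nil // remote tokenizer not requested; nothing to validate
	}
	useRemote, err := strconv.ParseBool(raw)
	if err != nil {
		return fmt.Errorf("invalid boolean %q for %s: %w",
			raw, constants.EnvUseRemoteTokenizer, err)
	}
	if useRemote && os.Getenv(constants.EnvRemoteTokenizerEndpoint) == "" {
		return fmt.Errorf("%s must be set when %s is true",
			constants.EnvRemoteTokenizerEndpoint, constants.EnvUseRemoteTokenizer)
	}
	return nil
}
```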
Testing
Comprehensive test coverage has been added: unit, integration, and lifecycle tests, reaching 82.5% coverage for the kvevent package.
Migration Guide
No changes required for existing code. The refactoring maintains full backward compatibility.
Future Improvements
- Clean up sync-indexer entries when a pod stays unroutable for an extended period (see the review discussion above); planned as a follow-up PR.
Review Checklist
- make test passes
- make lint passes
- Tested with both zmq and nozmq (no flag) build tags