Refactor: Extract KV event management to break circular dependency #1401

Merged
merged 5 commits into vllm-project:main from refactor/kv-event on Aug 18, 2025

Conversation

ae86zhizhi
Contributor

Refactor: Extract KV Event Management to Break Circular Dependency

Summary

This PR refactors the KV event management system to eliminate a circular dependency between cache.Store and KVEventManager. The solution extracts event management into a new pkg/kvevent package and uses a dual adapter pattern to maintain clean architectural boundaries.

Background

The original architecture had a circular dependency:

  • cache.Store contained KVEventManager
  • KVEventManager needed access to Store.metaPods and Store.syncPrefixIndexer
  • This created a circular reference: Store → KVEventManager → Store

This circular dependency made the code difficult to test, maintain, and reason about.

Solution: Dependency Inversion with Dual Adapters

1. Interface Segregation

We defined three focused interfaces in the new pkg/kvevent package:

// PodProvider provides access to pod information
type PodProvider interface {
    GetPod(ctx context.Context, podKey string) (*PodInfo, bool)
    RangePods(ctx context.Context, f func(key string, pod *PodInfo) bool) error
}

// SyncIndexProvider provides access to sync indexer
type SyncIndexProvider interface {
    GetSyncIndexer(ctx context.Context) (SyncIndexer, error)
}

// SyncIndexer handles cache event processing
type SyncIndexer interface {
    ProcessBlockStored(ctx context.Context, event BlockStoredEvent) error
    ProcessBlockRemoved(ctx context.Context, event BlockRemovedEvent) error
    RemovePrefix(ctx context.Context, modelName string, loraID int64, podKey string) error
}
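
For orientation, here is a minimal sketch of a manager built against these interfaces. NewManager matches the constructor shown later in this description; the field names and handleBlockStored are illustrative:

// pkg/kvevent/manager.go (sketch)
// Manager coordinates KV event handling using only the abstractions
// above, so it never needs to import pkg/cache.
type Manager struct {
    pods    PodProvider
    indexes SyncIndexProvider
}

// NewManager wires the manager to its providers; the concrete
// implementations stay hidden behind the interfaces.
func NewManager(pods PodProvider, indexes SyncIndexProvider) *Manager {
    return &Manager{pods: pods, indexes: indexes}
}

// handleBlockStored fetches the indexer through the provider and
// forwards the event.
func (m *Manager) handleBlockStored(ctx context.Context, ev BlockStoredEvent) error {
    indexer, err := m.indexes.GetSyncIndexer(ctx)
    if err != nil {
        return err
    }
    return indexer.ProcessBlockStored(ctx, ev)
}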

2. Dual Adapter Architecture

The key innovation is using two adapters that work together:

First Adapter: storeProviderAdapter

Located in pkg/cache/store_providers.go, this adapter (see the sketch after this list):

  • Implements both PodProvider and SyncIndexProvider interfaces
  • Bridges cache.Store with the kvevent package
  • Provides pod information from Store.metaPods
  • Returns the second adapter when GetSyncIndexer() is called
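
A minimal sketch of what this adapter could look like, assuming Store.metaPods is a sync.Map and that the store's internal pod records can be converted to kvevent.PodInfo (both assumptions; only the interface shapes come from this PR):

// pkg/cache/store_providers.go (sketch)
// storeProviderAdapter exposes Store data through the kvevent
// interfaces, so kvevent never has to import pkg/cache.
type storeProviderAdapter struct {
    store *Store
}

func (a *storeProviderAdapter) GetPod(ctx context.Context, podKey string) (*kvevent.PodInfo, bool) {
    v, ok := a.store.metaPods.Load(podKey) // metaPods assumed to be a sync.Map
    if !ok {
        return nil, false
    }
    return toPodInfo(v), true // toPodInfo is a hypothetical conversion helper
}

func (a *storeProviderAdapter) RangePods(ctx context.Context, f func(key string, pod *kvevent.PodInfo) bool) error {
    a.store.metaPods.Range(func(k, v any) bool {
        return f(k.(string), toPodInfo(v))
    })
    return nil
}

func (a *storeProviderAdapter) GetSyncIndexer(ctx context.Context) (kvevent.SyncIndexer, error) {
    // Hand back the second adapter, keeping type conversion out of this one.
    return &syncIndexerAdapter{indexer: a.store.syncPrefixIndexer}, nil
}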

Second Adapter: syncIndexerAdapter

Also in pkg/cache/store_providers.go, this adapter (sketched after this list):

  • Implements the SyncIndexer interface
  • Wraps syncprefixcacheindexer.SyncPrefixHashTable
  • Converts between kvevent event types and syncindexer event types
  • Acts as an Anti-Corruption Layer between the two domains
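
A sketch of that conversion work, with syncindexer referring to the syncprefixcacheindexer package. The field names on both event types and the method signatures on SyncPrefixHashTable are assumptions; the kvevent.SyncIndexer interface it satisfies is taken from this PR:

// pkg/cache/store_providers.go (sketch)
// syncIndexerAdapter wraps the concrete indexer and translates
// kvevent event types into syncindexer types.
type syncIndexerAdapter struct {
    indexer *syncindexer.SyncPrefixHashTable
}

func (a *syncIndexerAdapter) ProcessBlockStored(ctx context.Context, ev kvevent.BlockStoredEvent) error {
    return a.indexer.ProcessBlockStored(syncindexer.BlockStored{ // fields assumed
        BlockHashes: ev.BlockHashes,
        ModelName:   ev.ModelName,
        PodName:     ev.PodName,
    })
}

func (a *syncIndexerAdapter) ProcessBlockRemoved(ctx context.Context, ev kvevent.BlockRemovedEvent) error {
    return a.indexer.ProcessBlockRemoved(syncindexer.BlockRemoved{ // fields assumed
        BlockHashes: ev.BlockHashes,
        ModelName:   ev.ModelName,
        PodName:     ev.PodName,
    })
}

func (a *syncIndexerAdapter) RemovePrefix(ctx context.Context, modelName string, loraID int64, podKey string) error {
    a.indexer.RemovePrefix(modelName, loraID, podKey) // signature assumed
    return nil
}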

3. Why Two Adapters?

The dual adapter pattern solves a complex architectural challenge:

  1. Type Conversion: Go only allows direct conversion between struct types from different packages when their underlying types are identical, which is not guaranteed here. The syncIndexerAdapter therefore handles the explicit conversion between kvevent.BlockStoredEvent and syncindexer.BlockStored.

  2. Dependency Direction:

    • Without adapters: syncprefixcacheindexer would need to implement kvevent.SyncIndexer, creating a dependency on kvevent
    • With adapters: Dependencies flow correctly: cache → kvevent and cache → syncprefixcacheindexer
  3. Clean Boundaries: Each package remains focused on its core responsibility without knowledge of the others' internals.

Architecture Diagram

Before (Circular Dependency):
┌───────────────────┐
│cache.Store        │ ←────┐
│  └─KVEventManager │      │
└───────────────────┘      │
      ↑                    │
      └────────────────────┘

After (Clean Architecture):
┌─────────────┐     ┌──────────────┐     ┌──────────────────────┐
│cache.Store  │ ──→ │pkg/kvevent   │ ←── │syncprefixcacheindexer│
└─────────────┘     │  - Manager   │     └──────────────────────┘
      │             │  - Interfaces│              ↑
      │             └──────────────┘              │
      │                    ↑                      │
      │                    │                      │
      └──→ storeProviderAdapter ──→ syncIndexerAdapter
           (implements interfaces)   (wraps sync indexer)

Benefits

  1. No Circular Dependencies: Clean, unidirectional dependency flow
  2. Testability: Each component can be tested in isolation with mocks
  3. Maintainability: Clear separation of concerns
  4. Extensibility: New implementations can be added without modifying existing code
  5. Type Safety: Compile-time verification of interface implementations (see the assertion idiom below)
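
On point 5, the usual Go idiom for compile-time verification is a blank-identifier assertion; assuming the adapter names above, it would look like:

// These declarations fail to compile if an adapter ever stops
// satisfying its kvevent interface.
var (
    _ kvevent.PodProvider       = (*storeProviderAdapter)(nil)
    _ kvevent.SyncIndexProvider = (*storeProviderAdapter)(nil)
    _ kvevent.SyncIndexer       = (*syncIndexerAdapter)(nil)
)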

Implementation Details

Backward Compatibility

The existing KVEventManager API is preserved through a wrapper:

// pkg/cache/kv_event_manager_zmq.go
type KVEventManager struct {
    *kvevent.Manager
}

func NewKVEventManager(store *Store) *KVEventManager {
    podProvider, syncProvider := NewStoreProviderAdapter(store)
    manager := kvevent.NewManager(podProvider, syncProvider)
    return &KVEventManager{Manager: manager}
}

Configuration Validation

Configuration validation was moved to ensure KV event sync requirements are met (a sketch follows the list):

  • Remote tokenizer must be enabled
  • Tokenizer type must be "remote"
  • Remote tokenizer endpoint must be configured
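
A sketch of such a validator. validateKVEventConfiguration and the environment-variable constants are named in this PR's commits; the exact signature, placement, and error wording here are assumptions:

// pkg/cache/kv_event_manager_zmq.go (placement assumed)
func validateKVEventConfiguration() error {
    raw := os.Getenv(constants.EnvUseRemoteTokenizer) // AIBRIX_USE_REMOTE_TOKENIZER
    enabled, err := strconv.ParseBool(raw)
    if err != nil {
        // Surface parse failures instead of silently defaulting to false.
        return fmt.Errorf("invalid value %q for %s: %w", raw, constants.EnvUseRemoteTokenizer, err)
    }
    if !enabled {
        return fmt.Errorf("KV event sync requires %s=true", constants.EnvUseRemoteTokenizer)
    }
    if t := os.Getenv(constants.EnvPrefixCacheTokenizerType); t != "remote" {
        return fmt.Errorf("KV event sync requires tokenizer type %q, got %q", "remote", t)
    }
    if os.Getenv(constants.EnvRemoteTokenizerEndpoint) == "" {
        return fmt.Errorf("KV event sync requires %s to be set", constants.EnvRemoteTokenizerEndpoint)
    }
    return nil
}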

Testing

Comprehensive test coverage has been added:

  • Unit tests for all new components
  • Integration tests with real components
  • Validation tests for configuration
  • Mock implementations for isolated testing (see the example below)
  • Test coverage: 82.5% for the kvevent package
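
For illustration, a PodProvider mock for isolated tests can be as small as the following (not the PR's actual mock):

// pkg/kvevent test helper (illustrative)
// mockPodProvider is an in-memory PodProvider for unit tests.
type mockPodProvider struct {
    pods map[string]*kvevent.PodInfo
}

func (m *mockPodProvider) GetPod(ctx context.Context, podKey string) (*kvevent.PodInfo, bool) {
    p, ok := m.pods[podKey]
    return p, ok
}

func (m *mockPodProvider) RangePods(ctx context.Context, f func(key string, pod *kvevent.PodInfo) bool) error {
    for k, p := range m.pods {
        if !f(k, p) {
            break
        }
    }
    return nil
}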

Migration Guide

No changes required for existing code. The refactoring maintains full backward compatibility.

Future Improvements

  1. Consider moving more ZMQ-specific code into the kvevent package
  2. Add metrics and observability for event processing
  3. Implement retry logic with exponential backoff for transient failures (sketched below)
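
For item 3, the retry logic could take roughly this generic shape (a sketch, not part of this PR):

// retryWithBackoff retries op on failure, doubling the delay each
// attempt; jitter and error classification are omitted for brevity.
func retryWithBackoff(ctx context.Context, maxAttempts int, base time.Duration, op func() error) error {
    var err error
    delay := base
    for attempt := 0; attempt < maxAttempts; attempt++ {
        if err = op(); err == nil {
            return nil
        }
        select {
        case <-ctx.Done():
            return ctx.Err()
        case <-time.After(delay):
        }
        delay *= 2
    }
    return err
}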

Review Checklist

  • No breaking changes to public APIs
  • All tests pass with make test
  • Lint passes with make lint
  • Build succeeds with both zmq and nozmq (no flag) build tags
  • Documentation updated
  • Comprehensive test coverage added

Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Summary of Changes

Hello @ae86zhizhi, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

I've refactored the KV event management system to eliminate a circular dependency between the cache.Store and KVEventManager components. This was achieved by extracting the event management logic into a new, dedicated pkg/kvevent package. The core of this solution involves a dual adapter pattern, which allows for clean architectural boundaries and proper dependency inversion. This change significantly improves the testability, maintainability, and extensibility of the codebase by ensuring a clear, unidirectional flow of dependencies.

Highlights

  • Dependency Inversion: Introduced a new pkg/kvevent package with focused interfaces (PodProvider, SyncIndexProvider, SyncIndexer) to abstract dependencies.
  • Dual Adapter Pattern: Implemented storeProviderAdapter and syncIndexerAdapter in pkg/cache to bridge cache.Store with kvevent and syncprefixcacheindexer, acting as an Anti-Corruption Layer and handling type conversions.
  • Circular Dependency Elimination: Successfully broke the circular reference between cache.Store and KVEventManager, leading to a cleaner, unidirectional dependency flow.
  • Improved Testability: Components can now be tested in isolation with mocks.
  • Enhanced Maintainability & Extensibility: Clear separation of concerns makes the codebase easier to maintain and extend.
  • Configuration Validation Moved: Configuration validation for KV event sync was moved to ensure requirements are met.
  • Comprehensive Testing: Added extensive unit and integration tests for all new components, achieving 82.5% coverage for the kvevent package.

Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This is an excellent and well-documented refactoring that successfully breaks a circular dependency by introducing a new kvevent package and applying the Dependency Inversion Principle with a dual-adapter pattern. The changes significantly improve the architecture, making the system more modular, testable, and maintainable. The new tests are comprehensive and cover unit, integration, and lifecycle scenarios. My feedback focuses on improving robustness in configuration parsing, using more idiomatic Go for control flow, and highlighting a potential data loss scenario for future improvement.

@Jeffwan
Collaborator

Jeffwan commented Aug 6, 2025

@ae86zhizhi Seems there're lots of new changes? Are they necessary? I thought this was a refactor PR (I mean the added/removed lines should be equal or close).

@Jeffwan
Collaborator

Jeffwan commented Aug 6, 2025

I think #1387 is the proper scope for this change?

@ae86zhizhi
Contributor Author

I think #1387 is the proper scope for this change?

Yes, this change targets the same problem as #1387. The previous PR explored several potential fixes, but unfortunately none of them were successful.

@ae86zhizhi
Contributor Author

ae86zhizhi commented Aug 6, 2025

@ae86zhizhi Seems there're lots of new changes? Are they necessary? I thought this was a refactor PR (I mean the added/removed lines should be equal or close).

Explanation for Line Count Increase (+2665/-1208)

This commit is an intentional investment in our codebase's health and contains no new features. The line count increase is due to two main reasons:

  1. Architectural Refactoring
    The primary driver was to break a circular dependency between the cache and kvevent packages. This was achieved by introducing standard abstraction layers (interfaces and adapters), which accounts for the new production code.

  2. Enhanced Test Coverage
    I added approximately 1,400 lines of new code for comprehensive unit, integration, and validation tests. This raised test coverage from a minimal baseline to over 82%, drastically improving the code's reliability.

@Jeffwan
Collaborator

Jeffwan commented Aug 6, 2025

@ae86zhizhi if that's ready, please make it ready for review

please also address the DCO issue, sign off all commits

@ae86zhizhi ae86zhizhi marked this pull request as ready for review August 7, 2025 00:39
Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request is an excellent and well-executed refactoring to eliminate a circular dependency between cache.Store and KVEventManager. By extracting the event management logic into a new pkg/kvevent package and using interfaces with a dual adapter pattern, the architecture is now much cleaner, more testable, and easier to maintain. The changes are substantial but well-contained, and the addition of comprehensive unit and integration tests provides confidence in the new implementation. My review focuses on improving maintainability by replacing hardcoded environment variable names with constants.

@DwyaneShi
Collaborator

Other parts LGTM, please help address existing comments, thanks.

@autopear autopear force-pushed the refactor/kv-event branch 7 times, most recently from 3aea2d2 to 9d7c9b8 Compare August 13, 2025 22:54
// - Subscription state (status.Phase) changed, this applies to the same pod or different pods
if !isSamePod(oldPod, newPod) || oldSubscribable != newSubscribable {
    if oldSubscribable {
        m.unsubscribeFromPod(podKey)
Collaborator

Seems it is intentional to not delete from the sync indexer, but why?

Collaborator

@DwyaneShi What do you mean by "delete from sync indexer"? If oldSubscribable == false, it was not subscribed, so there is no need to unsubscribeFromPod. If the new pod is updated (added), it shall go to onPodDelete instead of this function.

Contributor Author

@DwyaneShi Let me provide a detailed explanation of the situation before and after KV-Sync was introduced, as well as our current design considerations.

Background: The "Ghost Cache" Problem Before KV-Sync

Before the KV-Sync feature was merged, AIBrix had a "ghost KV cache" problem. Specifically, when a pod crashed or disconnected, its KV cache entries would remain in the prefix indexer. These stale entries persisted until their TTL expired (20 minutes by default), and were only passively evicted. During this time, new requests could still be routed to these dead pods, leading to connection failures and 503 errors. We lacked an active cleanup mechanism at the time.

Current State: Improvements Brought by KV-Sync

The current implementation with KV-Sync has actually improved this situation. When a pod is deleted, OnPodDelete in kvevent/manager.go is called, which in turn triggers RemovePrefix to actively clean up all of the pod's prefix entries. This means we have introduced an active cleanup capability for pod deletions, which we did not have before.
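
A sketch of that delete path (OnPodDelete, unsubscribeFromPod, and RemovePrefix are from this PR; the parameters and the provider field are illustrative):

// pkg/kvevent/manager.go (sketch)
// onPodDelete tears down the pod's ZMQ subscription and proactively
// removes its prefix entries from the sync indexer.
func (m *Manager) onPodDelete(ctx context.Context, podKey, modelName string, loraID int64) {
    m.unsubscribeFromPod(podKey)

    indexer, err := m.indexes.GetSyncIndexer(ctx)
    if err != nil {
        return // error logging elided
    }
    // Active cleanup: without this, stale entries would linger until
    // TTL expiry (20 minutes by default) and could route requests to
    // a dead pod.
    _ = indexer.RemovePrefix(ctx, modelName, loraID, podKey)
}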

Design Rationale for the OnPodUpdate Path

You are correct in observing that in the OnPodUpdate path, we do not immediately remove the entries from the sync indexer. This is intentional, and the reasoning is as follows:

  1. A pod update event does not necessarily mean the pod is terminated; it might just be in a temporarily unroutable state.
  2. We anticipate that the pod could become routable again after a brief interruption (e.g., a network hiccup).
  3. If the pod is ultimately deleted, the OnPodDelete logic will ensure the cleanup is performed.

Future Improvements

I agree with you that this logic could be further improved. Ideally, we should also clean up the indexer entries when a pod remains in an unroutable state for an extended period.

However, implementing this reliably requires careful consideration of the state machine and timing. It's a separate concern from the circular dependency issue that this PR aims to solve. A more appropriate approach would be to address this in a dedicated PR with thorough testing covering various pod lifecycle scenarios.

This current PR focuses on fixing the core dependency issue while maintaining backward compatibility. We can enhance the cleanup logic for pod state transitions in a follow-up PR.

What are your thoughts? Shall we create a new issue to track this enhancement?

Collaborator

@DwyaneShi

Current and old behavior (before introducing KV event pub/sub)

If a pod is disconnected, either because of deletion or update, its corresponding KV cache entries will not be proactively deleted from the sync indexer. If a request follows the KV cache and tries to access the pod, it will return an error, and the LLM engine needs to decide whether to compute it or try another pod (the latter is not available for now).

Basically, if a false positive is triggered in the KV cache, the engine needs to compute.

Obsolete cache entries will eventually be evicted based on LRU.

What KV event manager can do

Since the event manager now has the information about pod disconnection, it can invoke a callback to the KV cache to evict obsolete entries. However, this requires scanning the KV cache, which means holding locks for a long time. Thus it may not be a good idea to proactively evict all obsolete cache entries.

Potential workarounds

The problem of not removing obsolete cache entries is the false positives they cause. The cache itself can also check the pod state before returning cache entries. If it finds the pod has been disconnected but its cache entries are still alive, it can set a dirty/garbage flag on these entries and return a cache miss, or try cache entries from a different pod.

Overall design can be:

  1. Background GC: kv event manager notifies the background GC scheduler of kv cache that all cache entries from this pod become invalid, so these cache entries have higher priorities to be evicted in the background GC process.
  2. Lazy check pod state: If the cache entries are not yet evicted, the kv cache shall also check the pod state before returning it, or mark the cache entries to be invalid for eviction.

However, this still cannot fully resolve the false positives. It's possible that the pod is alive when the kv cache checks it, but the connection drops when the kv cache tries to access the pod.

Collaborator

LGTM, let's have a followup PR to address the issue.

ae86zhizhi and others added 4 commits August 18, 2025 13:25
- Move KV event management from pkg/cache to new pkg/kvevent package
- Define interfaces (PodProvider, SyncIndexProvider, SyncIndexer) for dependency inversion
- Implement adapter pattern to bridge cache.Store with kvevent interfaces
- Maintain backward compatibility with existing KVEventManager API
- Improve testability with clear separation of concerns
- Add comprehensive test coverage for the new package structure

Signed-off-by: ZHENYU <[email protected]>
This commit addresses code review feedback to improve robustness:

1. Add proper error handling for strconv.ParseBool in validateKVEventConfiguration()
   - Return descriptive errors instead of silently ignoring parse failures
   - Help users identify configuration issues early

2. Add warning logs in validateConfiguration() for invalid boolean values
   - Log warnings when environment variables contain invalid boolean values
   - Default to false but inform users about the parsing issue

3. Replace goto with labeled break for better Go idioms
   - Use labeled break instead of goto for loop exit
   - Improves code readability and follows Go best practices

These changes make configuration errors more visible and easier to debug,
preventing silent failures when users set invalid environment variable values.

Signed-off-by: ZHENYU <[email protected]>
…inability

- Add constants for remote tokenizer environment variables in pkg/constants/kv_event_sync.go:
  - EnvUseRemoteTokenizer for AIBRIX_USE_REMOTE_TOKENIZER
  - EnvPrefixCacheTokenizerType for AIBRIX_PREFIX_CACHE_TOKENIZER_TYPE
  - EnvRemoteTokenizerEndpoint for AIBRIX_REMOTE_TOKENIZER_ENDPOINT

- Replace all hardcoded environment variable strings with constants throughout:
  - pkg/kvevent/manager.go
  - pkg/cache/kv_event_manager_zmq.go
  - Test files for consistent usage

This change improves code maintainability by preventing typos and providing
a single source of truth for environment variable names, following the
established pattern in the AIBrix codebase.

Signed-off-by: ZHENYU <[email protected]>
Signed-off-by: Qizhong Mao <[email protected]>
@autopear autopear merged commit 7df85c5 into vllm-project:main Aug 18, 2025
14 checks passed
@autopear autopear deleted the refactor/kv-event branch August 18, 2025 20:43