Add simple PUT retry logic when initial peer doesn't acknowledge #1659

Open
@sanity

Simple PUT Retry Logic

Problem

When a client initiates a PUT operation and the selected target peer doesn't respond with SuccessfulPut, the operation fails permanently without trying another peer. As a result:

  • River chat updates fail silently
  • Users get a poor experience whenever a single peer is unreachable
  • Operations fail unnecessarily even when alternative peers are available

Current Behavior

// In request_put() 
let target = op_manager
    .ring
    .closest_potentially_caching(&key, [&sender.peer].as_slice())
    .into_iter()
    .next()
    .ok_or(RingError::EmptyRing)?;

// Send RequestPut to target
// If no SuccessfulPut received → operation fails permanently

Proposed Solution

Add simple retry logic with alternative peers:

pub enum PutState {
    // ... existing variants ...
    AwaitingResponse {
        key: ContractKey,
        // ... other fields ...
        retry_count: usize,           // retries attempted so far
        tried_peers: HashSet<PeerId>, // peers that already failed to respond
    },
}

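// When a timeout occurs (no SuccessfulPut within ~500ms-2s):
if retry_count < MAX_RETRIES {
    // Add the unresponsive peer to tried_peers first,
    // so that it is excluded from the selection below

    // Get alternative peers, skipping everything already tried
    let candidates = op_manager
        .ring
        .k_closest_potentially_caching(&key, &tried_peers, 5);

    if let Some(next_peer) = candidates.first() {
        // Send RequestPut to next_peer
        // Increment retry_count
    } else {
        // No untried candidates left → operation fails
    }
} else {
    // Retries exhausted → operation fails
}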
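
The selection step above can be sketched in isolation. This is a minimal, self-contained model, assuming a peer can be represented as an id plus a location on a ring of circumference 1.0. `Peer`, `ring_distance`, and `k_closest_untried` are hypothetical names for illustration only; this is not the signature or implementation of the real `k_closest_potentially_caching`.

```rust
use std::collections::HashSet;

/// Hypothetical stand-in for a peer: an id plus a location in [0, 1).
#[derive(Debug, Clone, PartialEq)]
pub struct Peer {
    pub id: u32,
    pub location: f64,
}

/// Distance between two locations on a ring of circumference 1.0.
fn ring_distance(a: f64, b: f64) -> f64 {
    let d = (a - b).abs();
    d.min(1.0 - d)
}

/// Return up to `k` peers closest to `key_location`,
/// skipping every peer whose id is already in `tried`.
pub fn k_closest_untried(
    peers: &[Peer],
    key_location: f64,
    tried: &HashSet<u32>,
    k: usize,
) -> Vec<Peer> {
    let mut candidates: Vec<Peer> = peers
        .iter()
        .filter(|p| !tried.contains(&p.id))
        .cloned()
        .collect();
    // Closest first; total_cmp gives a total order over f64.
    candidates.sort_by(|a, b| {
        ring_distance(a.location, key_location)
            .total_cmp(&ring_distance(b.location, key_location))
    });
    candidates.truncate(k);
    candidates
}
```

Because the tried set is passed into the selection, a peer that has already timed out cannot be picked again, provided it is inserted into the set before the next call.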

Key Points

  1. Simple retry only: Only the initial PUT request (the one sent to the first target peer) is retried, each time against a different peer
  2. No propagation tracking: Once any peer responds with SuccessfulPut, that peer takes over responsibility for the PUT
  3. Fast timeout: 500ms-2s per attempt (not the 60-second operation TTL)
  4. Limited retries: ~5-10 attempts max to avoid infinite loops
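
To check that points 3 and 4 fit together, the worst-case wait can be computed and compared with the 60-second operation TTL. The sketch below is only an illustration: `RetryPolicy` and its fields are hypothetical names, and it assumes the limit counts total attempts (including the first one) and that every attempt runs to its full timeout.

```rust
use std::time::Duration;

/// Hypothetical retry settings; names are illustrative, not from the codebase.
#[derive(Debug, Clone, Copy)]
pub struct RetryPolicy {
    /// How long to wait for SuccessfulPut before trying another peer.
    pub attempt_timeout: Duration,
    /// Upper bound on attempts, assumed to include the initial one.
    pub max_attempts: u32,
}

impl RetryPolicy {
    /// Worst case: every attempt times out.
    pub fn worst_case_wait(&self) -> Duration {
        self.attempt_timeout * self.max_attempts
    }

    /// True if the worst case still fits inside the overall operation TTL.
    pub fn fits_within(&self, operation_ttl: Duration) -> bool {
        self.worst_case_wait() <= operation_ttl
    }
}
```

With the upper ends of both ranges (2s per attempt, 10 attempts) the worst case is 20 seconds, which stays below the 60-second TTL; with the lower ends (500ms, 5 attempts) it is 2.5 seconds.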

Implementation Approach

  1. Add retry fields to PutState::AwaitingResponse
  2. Add timeout detection in PUT operation processing
  3. On timeout, select next peer and retry
  4. On SuccessfulPut, complete operation normally
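
Steps 3 and 4 can be sketched as two state transitions. This is a simplified model under stated assumptions, not the actual operation code: `PeerId` is replaced by a `u32` alias, `MAX_RETRIES` is given an assumed value of 5, and the `current_peer` field plus the `Finished` and `Failed` variants are hypothetical additions used to make the example self-contained.

```rust
use std::collections::HashSet;

type PeerId = u32; // hypothetical stand-in for the real PeerId type

pub const MAX_RETRIES: usize = 5; // assumed value

#[derive(Debug, PartialEq)]
pub enum PutState {
    AwaitingResponse {
        current_peer: PeerId, // hypothetical: the peer we are waiting on
        retry_count: usize,
        tried_peers: HashSet<PeerId>,
    },
    Finished, // hypothetical terminal state: SuccessfulPut received
    Failed,   // hypothetical terminal state: retries or candidates exhausted
}

impl PutState {
    /// Step 3: on timeout, mark the current peer as tried and move on.
    /// `candidates` is assumed to be ordered by preference.
    pub fn on_timeout(self, candidates: &[PeerId]) -> PutState {
        match self {
            PutState::AwaitingResponse { current_peer, retry_count, mut tried_peers } => {
                tried_peers.insert(current_peer);
                if retry_count >= MAX_RETRIES {
                    return PutState::Failed;
                }
                // First candidate that has not been tried yet, if any.
                let next = candidates
                    .iter()
                    .copied()
                    .find(|p| !tried_peers.contains(p));
                match next {
                    Some(peer) => PutState::AwaitingResponse {
                        current_peer: peer,
                        retry_count: retry_count + 1,
                        tried_peers,
                    },
                    None => PutState::Failed,
                }
            }
            other => other, // terminal states ignore timeouts
        }
    }

    /// Step 4: on SuccessfulPut, complete the operation.
    pub fn on_successful_put(self) -> PutState {
        match self {
            PutState::AwaitingResponse { .. } => PutState::Finished,
            other => other,
        }
    }
}
```

Taking `self` by value keeps each transition explicit: a caller has to replace the old state with the returned one, so a stale `AwaitingResponse` cannot be reused by accident.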

Success Criteria

  • PUT operations succeed even when initial target peer is unreachable
  • No changes to PUT propagation logic
  • No protocol changes required
  • Simple, minimal code changes

Priority

High - This directly impacts River chat reliability and user experience
