Is your feature request related to a problem? Please describe
For a long-lived TCP connection, if the node that received the TCP SYN packet goes down while the connection is still open, traffic gets routed to a different node but is not forwarded to the backend pod. This is a problem because the pod is still around to serve the traffic, yet it cannot do so since the node that held the connection went down.
Describe the solution you'd like
The new node that handles the traffic should successfully route it to the backend pod.
Additional context
- There is a k8s cluster with 2 nodes
- The service has DSR and maglev enabled:
apiVersion: v1
kind: Service
metadata:
  annotations:
    kube-router.io/service.dsr: "tunnel"
    kube-router.io/service.scheduler: "mh"
    kube-router.io/service.schedflags: "flag-1,flag-2"
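For reference, the same annotations can also be applied to an existing Service with kubectl annotate (just an equivalent sketch; the service name is taken from the kubectl output below, and --overwrite is needed if the annotations are already set):
kubectl annotate service debian-server-lb \
  kube-router.io/service.dsr="tunnel" \
  kube-router.io/service.scheduler="mh" \
  kube-router.io/service.schedflags="flag-1,flag-2"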
- There are 3 pods behind this service. All the pods are running on eqx-sjc-kubenode1-staging.
root@gce-del-km-staging-anupam:~/anupam/manifests $ kubectl get svc,endpoints
NAME                       TYPE        CLUSTER-IP       EXTERNAL-IP     PORT(S)    AGE
service/debian-server-lb   ClusterIP   192.168.97.188   199.27.151.10   8099/TCP   5h34m

NAME                         ENDPOINTS                                          AGE
endpoints/debian-server-lb   10.36.0.84:8099,10.36.0.85:8099,10.36.0.86:8099    5h34m

root@gce-del-km-staging-anupam:~/anupam/manifests $ kubectl get pods -o wide
NAME                            READY   STATUS    RESTARTS   AGE   IP           NODE
debian-server-8b5467777-2shpz   1/1     Running   0          49m   10.36.0.85   eqx-sjc-kubenode1-staging
debian-server-8b5467777-hbw29   1/1     Running   0          49m   10.36.0.86   eqx-sjc-kubenode1-staging
debian-server-8b5467777-pv9sr   1/1     Running   0          49m   10.36.0.84   eqx-sjc-kubenode1-staging
- IPVS entries are successfully applied by kube-router
root@eqx-sjc-kubenode1-staging:~ $ ipvsadm -L -n
IP Virtual Server version 1.2.1 (size=4096)
Prot LocalAddress:Port Scheduler Flags
-> RemoteAddress:Port Forward Weight ActiveConn InActConn
TCP 192.168.97.188:8099 mh (mh-fallback,mh-port)
-> 10.36.0.84:8099 Masq 1 0 0
-> 10.36.0.85:8099 Masq 1 0 0
-> 10.36.0.86:8099 Masq 1 0 0
FWM 3754 mh (mh-fallback,mh-port)
-> 10.36.0.84:8099 Tunnel 1 0 0
-> 10.36.0.85:8099 Tunnel 1 0 0
-> 10.36.0.86:8099 Tunnel 1 0 0
root@tlx-dal-kubenode1-staging:~ $ ipvsadm -L -n
IP Virtual Server version 1.2.1 (size=4096)
Prot LocalAddress:Port Scheduler Flags
-> RemoteAddress:Port Forward Weight ActiveConn InActConn
TCP 192.168.97.188:8099 mh (mh-fallback,mh-port)
-> 10.36.0.84:8099 Masq 1 0 0
-> 10.36.0.85:8099 Masq 1 0 0
-> 10.36.0.86:8099 Masq 1 0 0
FWM 3754 mh (mh-fallback,mh-port)
-> 10.36.0.84:8099 Tunnel 1 0 0
-> 10.36.0.85:8099 Tunnel 1 0 0
-> 10.36.0.86:8099 Tunnel 1 1 0
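In addition to the virtual server table above, ipvsadm can dump per-connection state, which may help show whether the node that takes over holds any entry for the existing session (a diagnostic sketch; no connection-table output from this cluster is shown here):
# dump the IPVS connection table on a node and filter for the service port
ipvsadm -L -n -c | grep 8099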
- In all 3 pods, start a TCP server on port 8099 using
nc -lv 0.0.0.0 8099
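For example (a sketch, assuming nc is available in the pod image; pod names are taken from the kubectl output above), one listener per pod in separate terminals:
kubectl exec -it debian-server-8b5467777-2shpz -- nc -lv 0.0.0.0 8099
kubectl exec -it debian-server-8b5467777-hbw29 -- nc -lv 0.0.0.0 8099
kubectl exec -it debian-server-8b5467777-pv9sr -- nc -lv 0.0.0.0 8099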
- Create a session from a client that is closer to tlx-dal-kubenode1-staging using
nc <service-ip> 8099
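For example, using the external IP from the kubectl output above (the client is assumed to have a route to that IP via tlx-dal-kubenode1-staging):
# connect to the service's external IP on the service port
nc 199.27.151.10 8099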
- A connection is established, with the NAT translation happening on tlx-dal-kubenode1-staging. Now stop the external IP advertisement by removing --advertise-external-ip from the kube-router arguments.
- Now send a message from the nc client. The traffic gets routed to eqx-sjc-kubenode1-staging (verified by running tcpdump), but the message is not visible on the backend pod server.
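One possible capture filter to confirm where the client's packets land (a sketch; run on each node, and adjust the interface if needed):
# external IP and port taken from the service output above
tcpdump -ni any host 199.27.151.10 and port 8099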
kube-router version: 2.5.0, built on 2025-02-14T20:20:43Z, go1.23.6
kubernetes version: 1.29.14
kernel version: 5.10.0-34-amd64