Intermittent connectivity loss to routing peers requiring manual restart #2452

Open
opened 2025-11-20 07:09:59 -05:00 by saavagebueno · 2 comments
Owner

Originally created by @theophileds on GitHub (Nov 10, 2025).

Describe the problem

Users intermittently lose connectivity to NetBird routing peers (VPC gateway instances), making infrastructure resources unreachable. The issue affects both redundant routing peers simultaneously, despite
each being on separate EC2 instances with independent network paths.

Currently, the only reliable recovery method is manually executing netbird down && netbird up on the affected routing peer(s). We have implemented automated daily restarts via cron (with retry logic and randomized delays), but connectivity issues still occur randomly between scheduled restarts, requiring manual intervention from the operations team.
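
For reference, a minimal sketch of the cron-driven restart we run (the retry count, delays, and the status grep are illustrative; the real script adds logging and alerting):

#!/usr/bin/env bash
# Random delay (0-300 s) so redundant routing peers don't restart at once.
sleep $((RANDOM % 300))

for attempt in $(seq 1 10); do
    netbird down
    sleep 5
    if netbird up; then
        sleep 15
        # Assumes a healthy client prints "Management: Connected" in status output.
        netbird status | grep -q "Management: Connected" && exit 0
    fi
    echo "restart attempt ${attempt} failed, retrying" >&2
    sleep 30
done
exit 1

Scheduled via a plain crontab entry, e.g. 0 4 * * * /usr/local/bin/netbird-restart.sh.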

To Reproduce

Steps to reproduce the behavior:

  1. Set up self-hosted NetBird with routing peers configured as VPC gateways
  2. Configure redundant routing peers (2 per VPC) for high availability
  3. Connect client machines to access resources through routing peers
  4. Normal operation works initially
  5. After random intervals (hours to days), clients lose connectivity to resources behind routing peers
  6. Routing peer shows Status: Idle or Status: Connecting for affected client peers with "Last WireGuard handshake: -" (a quick check for this state is sketched after this list)
  7. Client-side netbird status shows routing peer in "Connecting" state
  8. Only resolution: SSH to routing peer and run netbird down && netbird up
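
A quick way to check for the stale state from the routing peer itself (the grep pattern is an assumption about the exact detailed status output):

# Run on the routing peer; prints client peers with no recent WireGuard handshake.
netbird status --detail | grep -B 3 "Last WireGuard handshake: -"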

Expected behavior

  • Routing peers should maintain stable connectivity to client peers

Are you using NetBird Cloud?

Self-hosted NetBird control plane (on-premise deployment)

NetBird version

Routing peers: 0.59.12 (Amazon Linux 2023)
Clients: Mixed versions 0.59.10 - 0.59.12 (all affected randomly regardless of version)

Is any other VPN software installed?

No other VPN software installed on routing peers or affected clients.

Debug output

ERRO shared/signal/client/grpc.go:417: Stream receive error: rpc error: code = Internal desc = stream terminated by RST_STREAM with error code: PROTOCOL_ERROR
WARN shared/signal/client/grpc.go:177: disconnected from the Signal service but will retry silently. Reason: rpc error: code = Internal desc = stream terminated by RST_STREAM with error code: PROTOCOL_ERROR

Additional Errors Found:

  1. Management service keepalive failures:
WARN shared/management/client/grpc.go:172: disconnected from the Management service but will retry silently. Reason: rpc error: code = Unavailable desc = keepalive ping failed to receive ACK within timeout
  2. Relay connection issues:
WARN client/internal/peer/worker_relay.go:124: failed to close relay connection: use of closed network connection
  3. TURN connection failures:
ERRO client/iface/wgproxy/ebpf/wrapper.go:159: failed to read from turn conn: use of closed network connection
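
To catch these signatures without tailing logs by hand, something like the following works (assumes the client runs as the standard netbird systemd unit):

# Watch the routing peer's journal for the failure patterns above.
journalctl -u netbird -f | grep -E "RST_STREAM|keepalive ping failed|use of closed network connection"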

Environment:

  • Deployment: AWS EC2 routing peers (Amazon Linux 2023.9.20251014, kernel 6.12.46)
  • Architecture: 2 routing peers per VPC (dev + production VPCs)
  • Use case: VPC gateway routing to access RDS, EC2, ...
  • Network: Behind AWS NLB with HTTP2Optional ALPN policy for gRPC support
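
For anyone reproducing this, the NLB listener's ALPN policy mentioned above can be confirmed with the AWS CLI (the load balancer ARN is a placeholder):

# AlpnPolicy on the TLS listener should include HTTP2Optional.
aws elbv2 describe-listeners \
  --load-balancer-arn <netbird-nlb-arn> \
  --query 'Listeners[].{Port:Port,Alpn:AlpnPolicy}'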

Mitigation attempts:

  • Implemented automated daily reconnection via cron with 10 retry attempts
  • Deployed redundant routing peers for failover
  • Neither prevents the issue: both peers can still become unreachable simultaneously for some users.

Pattern observations:

  • Issue affects users randomly regardless of client version (0.59.10-0.59.12)
  • Both routing peers can fail simultaneously despite being on separate instances
  • Manual netbird down && netbird up on routing peer immediately resolves connectivity

Question: Have you seen this pattern before in other deployments? It seems like a connection state management issue where NetBird doesn't detect or recover from stale peer connections automatically.

Have you tried these troubleshooting steps?

  • Reviewed https://docs.netbird.io/how-to/troubleshooting-client
  • Checked for newer NetBird versions (running latest 0.59.x)
  • Searched for similar issues on GitHub
  • Restarted the NetBird client (this is the only working fix)
  • Disabled other VPN software (none installed)
  • Checked firewall settings (NLB ALPN policy configured correctly)
saavagebueno added the triage-needed label 2025-11-20 07:09:59 -05:00
Author
Owner

@pkarc commented on GitHub (Nov 13, 2025):

I have exactly the same problem. The only thing that's different from the self-hosted guide is that I'm using HAProxy instead of nginx, but the behavior is the same: I have to restart the routing peer client to make routing work again. The clients see them as connected, but there is no routing.
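
For HAProxy deployments, the rough equivalent of the nginx timeout fix below is raising the client/server timeouts on the gRPC frontend and backend. A hedged sketch (section names, addresses, and certificate paths are placeholders; adapt to your own config):

# Long-lived gRPC streams (e.g. management Sync) need generous timeouts.
frontend netbird_grpc
    bind :443 ssl crt /etc/haproxy/certs/netbird.pem alpn h2,http/1.1
    mode http
    timeout client 7d
    default_backend netbird_mgmt

backend netbird_mgmt
    mode http
    timeout server 7d
    # tunnel timeout covers upgraded/long-idle connections
    timeout tunnel 7d
    server mgmt 192.168.0.10:80 proto h2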

Author
Owner

@theophileds commented on GitHub (Nov 14, 2025):

I was experiencing similar timeout errors:

2025/11/09 23:00:40 [error] 29#29: *12284319 upstream timed out (110: Operation timed out) while reading upstream, client: x.x.x.x, server: example.domain.com, request: "POST /management.ManagementService/Sync HTTP/2.0", upstream: "grpc://192.168.x.x:80", host: "example.domain.com:443"

Fixed it with these ingress settings:

ingress:
  enabled: true
  className: nginx-nlb
  annotations:
    nginx.ingress.kubernetes.io/backend-protocol: 'GRPC'
    nginx.ingress.kubernetes.io/proxy-read-timeout: '604800'
    nginx.ingress.kubernetes.io/proxy-send-timeout: '604800'
    nginx.ingress.kubernetes.io/server-snippet: |
      grpc_socket_keepalive on;
      client_header_timeout 7d;
      client_body_timeout 7d;
Reference: SVI/netbird#2452