Netbird relay connection stale for some peers (workaround found) #1944

Closed
opened 2025-11-20 06:09:55 -05:00 by saavagebueno · 7 comments

Originally created by @Silex on GitHub (Jun 6, 2025).

Hello

I'm running self-hosted NetBird version 0.45.1 with peers on versions 0.45.3 and 0.36.5. The peers are relayed due to CGNAT issues (one peer is a 5G router, the other is a Windows PC behind a corporate firewall). After a while the relay becomes "stale": the peers can no longer ping each other, yet the status still reports the peer as connected:

$ netbird status -d

pictet-nvr1.netbird.stvs:
  NetBird IP: 100.70.94.175
  Public key: wNWlJ95DqnJMCdXX77gZwVLB4oDDInwp7DpACxy/SV4=
  Status: Connected
  -- detail --
  Connection type: Relayed
  ICE candidate (Local/Remote): -/-
  ICE candidate endpoints (Local/Remote): -/-
  Relay server address: rels://netbird.stvs.com:443
  Last connection update: 7 hours, 9 minutes ago
  Last WireGuard handshake: 7 hours, 10 minutes ago
  Transfer status (received/sent) 711.3 MiB/18.1 GiB
  Quantum resistance: false
  Routes: -
  Networks: -
  Latency: 52.905573ms

$ wg show

peer: wNWlJ95DqnJMCdXX77gZwVLB4oDDInwp7DpACxy/SV4=
  endpoint: 127.0.0.1:38500
  allowed ips: 100.70.94.175/32
  latest handshake: 7 hours, 13 minutes, 32 seconds ago
  transfer: 711.28 MiB received, 18.11 GiB sent
  persistent keepalive: every 25 seconds

As you can see, the latest handshake is far too old. A simple workaround is to stop/start netbird, but that kills all other connections (the PC is connected to many routers). Another workaround is to remove the problematic router from its policy group and add it again to force an update, but having to do that manually is annoying.

I guess one could also use wg set to remove the offending peer, and netbird would then recreate the WireGuard peer? So maybe I could monitor the latest handshakes and "kill" the peers that are stuck?
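A minimal sketch of the "monitor latest handshakes" idea, in Python. It parses the output of `wg show <interface> latest-handshakes` (one line per peer: public key, then the Unix time of the last handshake, with 0 meaning no handshake yet) and lists the peers that look stuck. The interface name `wt0` and the 180-second threshold are assumptions, not something confirmed in this thread; the threshold is picked because WireGuard normally renews the handshake roughly every two minutes while traffic, including keepalives, is flowing. The function names are hypothetical, and the sketch only detects stale peers, it does not reset them.

```python
import subprocess
import time


def parse_latest_handshakes(text):
    """Parse `wg show <interface> latest-handshakes` output into {public_key: unix_time}."""
    handshakes = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) != 2:
            continue  # skip blank or unexpected lines
        public_key, timestamp = parts
        handshakes[public_key] = int(timestamp)
    return handshakes


def stale_peers(handshakes, now, max_age_seconds=180):
    """Return the sorted public keys whose last handshake is older than max_age_seconds.

    A timestamp of 0 means no handshake has completed, so it is treated as stale.
    The 180-second default is an assumed threshold, not a NetBird setting.
    """
    return sorted(
        key
        for key, timestamp in handshakes.items()
        if timestamp == 0 or now - timestamp > max_age_seconds
    )


def read_handshakes(interface="wt0"):
    """Run wg and parse its output (needs wireguard-tools and sufficient privileges).

    The interface name "wt0" is an assumption; pass the real one if it differs.
    """
    result = subprocess.run(
        ["wg", "show", interface, "latest-handshakes"],
        check=True,
        capture_output=True,
        text=True,
    )
    return parse_latest_handshakes(result.stdout)
```

A cron job or small loop could call `stale_peers(read_handshakes(), time.time())` and alert on a non-empty result. What to do with a stale peer is left open here.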

Any ideas welcome.

saavagebueno added the triage-needed label 2025-11-20 06:09:55 -05:00

@Silex commented on GitHub (Jun 6, 2025):

I found this, which is interesting, but it seems netbird already does the right thing:

https://www.reddit.com/r/WireGuard/comments/k3d1hc/latest_handshake_few_hours_ago/


@Silex commented on GitHub (Jun 6, 2025):

Just to clarify the setup:

Netbird runs on multiple 5G routers (Teltonika TRB500) and on multiple servers (Windows). The connections are relayed due to CGNAT/firewall issues.

One of these servers records cameras that are served through the routers.

Almost every night, some of the routers' relayed connections become stale and the cameras are therefore unreachable. Simply restarting netbird fixes the issue.

From the other servers, the connections to the routers are not stale most of the time, but it does happen from time to time.

The problematic server is a VM hosted by a different provider, so maybe the network issues are mainly due to that provider. My guess, though, is that it has more to do with the WireGuard tunnel not being correctly detected as broken (e.g. the 5G router's IP changed, the 5G connection glitched, etc.).


@Silex commented on GitHub (Jun 6, 2025):

Meh, I thought it was the WireGuard tunnel, but it seems deeper than that:

When peer is unreachable:

peer: 6kq3/G775aJK5slDq1OyEyLFK4TvyZiurx+OddRotVw=
  endpoint: 127.1.189.16:51820
  allowed ips: 100.70.189.16/32
  transfer: 0 B received, 148 B sent
  persistent keepalive: every 25 seconds

When peer is reachable:

peer: 6kq3/G775aJK5slDq1OyEyLFK4TvyZiurx+OddRotVw=
  endpoint: 127.1.189.16:51820
  allowed ips: 100.70.189.16/32
  latest handshake: 28 seconds ago
  transfer: 796.04 KiB received, 247.33 KiB sent
  persistent keepalive: every 25 seconds

I removed/recreated the peer using plain wg set commands but it does not reconnect the peer.

The only thing working at this point is netbird down/up or editing the peer policy so netbird "resets" the config.

Should I give 0.46.0 a try?
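The difference between the two `wg show` outputs in this comment (no "latest handshake" line at all when the peer is unreachable) can be checked mechanically. The following Python sketch parses the human-readable `wg show` output and reports, per peer, the handshake age in seconds, or `None` when the line is missing. The function names are hypothetical, and the parser is written against the output format quoted in this thread, so treat it as an illustration rather than a stable interface.

```python
import re

# Seconds per unit as printed by `wg show` (a year is counted as 365 days).
_UNIT_SECONDS = {
    "second": 1,
    "minute": 60,
    "hour": 3600,
    "day": 86400,
    "year": 365 * 86400,
}


def handshake_age_seconds(value):
    """Convert text such as '7 hours, 13 minutes, 32 seconds ago' into seconds."""
    total = 0
    for amount, unit in re.findall(r"(\d+) (second|minute|hour|day|year)s?", value):
        total += int(amount) * _UNIT_SECONDS[unit]
    return total


def parse_wg_show(text):
    """Return {public_key: handshake_age_seconds or None} from `wg show` output.

    None means the peer block has no 'latest handshake' line, as in the
    unreachable case quoted above.
    """
    peers = {}
    current = None
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if line.startswith("interface:"):
            current = None  # interface blocks carry no handshake information
        elif line.startswith("peer:"):
            current = line.split(":", 1)[1].strip()
            peers[current] = None
        elif line.startswith("latest handshake:") and current is not None:
            peers[current] = handshake_age_seconds(line.split(":", 1)[1])
    return peers
```

Feeding it the two blocks above yields `None` for the unreachable case and `28` for the reachable one, which is enough to tell "never completed a handshake" apart from "handshake is old".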


@nazarewk commented on GitHub (Jun 6, 2025):

> I removed/recreated the peer using plain wg set commands but it does not reconnect the peer.

I'm pretty sure netbird uses an elaborate negotiation process to establish connectivity. I wouldn't expect wg set to have any chance of working unless the peer was directly reachable over the internet.

You can always try 0.46.0, but after briefly looking at the release notes, I don't see anything particularly relevant there.


@Silex commented on GitHub (Jun 6, 2025):

@nazarewk thanks.

I'm trying to find a workaround so that I only reset the stale peer instead of the whole netbird connection. Any ideas? Removing and re-adding the WireGuard peer seemed smart, but I guess it's a dead end.


@Silex commented on GitHub (Jun 6, 2025):

Hmm, forwarding UDP 51820 from the WAN to the peer does not seem to help establish a P2P connection. Any idea what else to try?


@Silex commented on GitHub (Jun 10, 2025):

I'll reopen this issue following the template and providing debug logs.


Reference: SVI/netbird#1944