Netbird relay connection stale for some peers (workaround found) #1944

saavagebueno · 2025-11-20T06:09:55-05:00

saavagebueno commented

2025-11-20 06:09:55 -05:00

Originally created by @Silex on GitHub (Jun 6, 2025).

Hello

With self-hosted netbird version 0.45.1 and peers on versions 0.45.3 and 0.36.5 that are relayed due to CGNAT issues (one peer is a 5G router, the other is a Windows PC behind a corporate firewall), the relay becomes "stale" after a while: the peers can no longer ping each other, yet the status still says connected:

$ netbird status -d

pictet-nvr1.netbird.stvs:
  NetBird IP: 100.70.94.175
  Public key: wNWlJ95DqnJMCdXX77gZwVLB4oDDInwp7DpACxy/SV4=
  Status: Connected
  -- detail --
  Connection type: Relayed
  ICE candidate (Local/Remote): -/-
  ICE candidate endpoints (Local/Remote): -/-
  Relay server address: rels://netbird.stvs.com:443
  Last connection update: 7 hours, 9 minutes ago
  Last WireGuard handshake: 7 hours, 10 minutes ago
  Transfer status (received/sent) 711.3 MiB/18.1 GiB
  Quantum resistance: false
  Routes: -
  Networks: -
  Latency: 52.905573ms

$ wg show

peer: wNWlJ95DqnJMCdXX77gZwVLB4oDDInwp7DpACxy/SV4=
  endpoint: 127.0.0.1:38500
  allowed ips: 100.70.94.175/32
  latest handshake: 7 hours, 13 minutes, 32 seconds ago
  transfer: 711.28 MiB received, 18.11 GiB sent
  persistent keepalive: every 25 seconds

As you can see, the latest handshake is far too old. A simple workaround is to stop and start netbird, but that kills all other connections (the PC is connected to many routers). Another workaround is to remove the problematic router from the policy group and add it again to force an update, but having to do that manually is annoying.

I guess one could also use wg set to remove the offending peer, and netbird would recreate the WireGuard peer? So maybe I can monitor the latest handshakes and "kill" the peers that are stuck?

Any ideas welcome.
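
The monitoring idea above could be sketched roughly as follows. This is only a minimal Python sketch, not something posted in the thread. It assumes the `wg` tool is on the PATH, that the netbird WireGuard interface is called `wt0` (an assumption, adjust as needed), and that a handshake older than 180 seconds counts as stale (an illustrative threshold). All function names are hypothetical. It relies on `wg show <interface> latest-handshakes`, which prints one public key and one Unix timestamp per line, with 0 meaning no handshake yet. It only detects stale peers and does not try to reset them.

```python
import subprocess
import time

# Assumption: illustrative threshold, tune it for your own setup.
STALE_AFTER_SECONDS = 180


def parse_latest_handshakes(output):
    """Parse `wg show <iface> latest-handshakes` output into {public_key: unix_time}."""
    handshakes = {}
    for line in output.splitlines():
        parts = line.split()
        if len(parts) != 2:
            continue  # skip blank or unexpected lines
        key, timestamp = parts
        handshakes[key] = int(timestamp)
    return handshakes


def find_stale_peers(handshakes, now, max_age=STALE_AFTER_SECONDS):
    """Return the sorted keys whose last handshake is missing (0) or older than max_age."""
    return sorted(
        key for key, ts in handshakes.items()
        if ts == 0 or now - ts > max_age
    )


def read_handshakes(interface="wt0"):
    """Run wg and parse its output. The interface name "wt0" is an assumption."""
    result = subprocess.run(
        ["wg", "show", interface, "latest-handshakes"],
        capture_output=True, text=True, check=True,
    )
    return parse_latest_handshakes(result.stdout)


def main():
    """Print the public key of every peer considered stale right now."""
    for key in find_stale_peers(read_handshakes(), now=int(time.time())):
        print("stale peer:", key)
```

Calling `main()` from a scheduled job would list the stuck peers, which could then be matched against the public keys shown by `netbird status -d`.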

saavagebueno added the triage-needed label 2025-11-20 06:09:55 -05:00

saavagebueno closed this issue

2025-11-20 06:09:56 -05:00

saavagebueno commented

2025-11-20 06:09:56 -05:00

@Silex commented on GitHub (Jun 6, 2025):

I found this, which is interesting, but it seems netbird already does the right thing:

https://www.reddit.com/r/WireGuard/comments/k3d1hc/latest_handshake_few_hours_ago/

saavagebueno commented

2025-11-20 06:09:56 -05:00

@Silex commented on GitHub (Jun 6, 2025):

Just to clarify the setup:

Netbird runs on multiple 5G routers (Teltonika TRB500) and on multiple servers (Windows). The connections are relayed due to CGNAT/firewall issues.

One of these servers records cameras served through the multiple routers.

Almost every night, some of the routers' relayed connections become stale and thus the cameras are unreachable. Simply restarting netbird fixes the issue.

From the other servers, the connections to the routers are not stale most of the time, but it also happens from time to time.

The problematic server is a VM that runs with a different provider, so maybe the network issues are mainly due to this other provider, but my guess is that it has more to do with the WireGuard tunnel not being correctly detected as broken (e.g. 5G router IP changed, 5G connection glitches, etc.).

saavagebueno commented

2025-11-20 06:09:57 -05:00

@Silex commented on GitHub (Jun 6, 2025):

Meh, I thought it was the WireGuard tunnel, but it seems to go deeper than that:

When peer is unreachable:

peer: 6kq3/G775aJK5slDq1OyEyLFK4TvyZiurx+OddRotVw=
  endpoint: 127.1.189.16:51820
  allowed ips: 100.70.189.16/32
  transfer: 0 B received, 148 B sent
  persistent keepalive: every 25 seconds

When peer is reachable:

peer: 6kq3/G775aJK5slDq1OyEyLFK4TvyZiurx+OddRotVw=
  endpoint: 127.1.189.16:51820
  allowed ips: 100.70.189.16/32
  latest handshake: 28 seconds ago
  transfer: 796.04 KiB received, 247.33 KiB sent
  persistent keepalive: every 25 seconds

I removed/recreated the peer using plain wg set commands but it does not reconnect the peer.

The only thing working at this point is netbird down/up or editing the peer policy so netbird "resets" the config.

Should I give 0.46.0 a try?
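
The difference between the two listings above is that the unreachable peer has no "latest handshake" line at all. A minimal Python sketch for spotting such peers in plain `wg show` output could look like this. It is not from the thread: the function names are hypothetical, and it assumes the human-readable `name: value` layout shown above.

```python
def parse_wg_show(text):
    """Split plain `wg show` output into one dict of fields per peer."""
    peers = []
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line or ":" not in line:
            continue
        # Split on the first colon only, so "endpoint: host:port" stays intact.
        name, value = line.split(":", 1)
        name, value = name.strip(), value.strip()
        if name == "peer":
            current = {"peer": value}
            peers.append(current)
        elif current is not None:
            # Fields seen before the first peer (the interface block) are ignored.
            current[name] = value
    return peers


def peers_without_handshake(peers):
    """Return the public keys of peers that have no "latest handshake" field."""
    return [p["peer"] for p in peers if "latest handshake" not in p]
```

Feeding it the "unreachable" listing returns that peer's public key, while the "reachable" listing returns an empty list.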

saavagebueno commented

2025-11-20 06:09:57 -05:00

@nazarewk commented on GitHub (Jun 6, 2025):

> I removed/recreated the peer using plain wg set commands but it does not reconnect the peer.

I'm pretty sure netbird uses an elaborate negotiation process to establish connectivity. I wouldn't expect wg set to have any chance of working unless the peer was directly reachable over the internet.

You can always try 0.46.0, but after looking briefly at the release notes, I don't see anything particularly relevant there.

saavagebueno commented

2025-11-20 06:09:57 -05:00

@Silex commented on GitHub (Jun 6, 2025):

@nazarewk thanks.

I'm trying to find a workaround so that I only reset the stale peer instead of the whole netbird connection. Any idea? Removing and re-adding the WireGuard peer seemed smart, but I guess it's a dead end.

saavagebueno commented

2025-11-20 06:09:57 -05:00

@Silex commented on GitHub (Jun 6, 2025):

Hmm, forwarding UDP 51820 from the WAN to the peer does not seem to help establish a P2P connection. Any idea what to try?

saavagebueno commented

2025-11-20 06:09:57 -05:00

@Silex commented on GitHub (Jun 10, 2025):

I'll reopen this issue following the template and providing debug logs.

Reference: SVI/netbird#1944