Every few hours a P2P connected peer falls back to "Relayed" for all peers and isn't reachable anymore #1563

Open
opened 2025-11-20 05:32:54 -05:00 by saavagebueno · 1 comment

Originally created by @the-project-group on GitHub (Jan 19, 2025).

Describe the problem

  • Every few hours, a P2P-connected peer falls back to "Relayed" for all peers and is no longer reachable
  • I'm not sure whether connectivity keeps working for a while in the "Relayed" state or stops immediately
  • The relayed (non-working) peer is a Docker container running NetBird as a sidecar, which provides connectivity for a custom DNS service (TechnitiumDNS, https://technitium.com/dns/)

Sidecar docker compose excerpt:

```
  netbird-dns-server:
    image: "netbirdio/netbird:latest"
    command: netbird up --management-url https://netbird.DOMAIN.com:33073 -k SETUPKEY
    hostname: Technitium-DNS
    cap_add:
      - net_admin # https://docs.netbird.io/how-to/examples#net-bird-client-in-docker
      - sys_admin # https://docs.netbird.io/how-to/examples#net-bird-client-in-docker
      - sys_resource # https://docs.netbird.io/how-to/examples#net-bird-client-in-docker
    restart: unless-stopped
    volumes:
      - netbird-data:/etc/netbird
```
  • The Docker host is a public VM that also runs the other NetBird containers:
```
CONTAINER ID   IMAGE                          COMMAND                  CREATED        STATUS        PORTS                                                                      NAMES
ec293b45c8c9   technitium/dns-server:latest   "/usr/bin/dotnet /op…"   31 hours ago   Up 31 hours                                                                              dns-server
f0ef812897cc   netbirdio/netbird:latest       "/usr/local/bin/netb…"   31 hours ago   Up 31 hours                                                                              dnsserver-netbird-dns-server-1
4d52cadccbb5   netbirdio/management:latest    "/go/bin/netbird-mgm…"   31 hours ago   Up 31 hours   0.0.0.0:33073->443/tcp, [::]:33073->443/tcp                                netbird-management-1
b3ea4bf4aec6   netbirdio/relay:latest         "/go/bin/netbird-rel…"   31 hours ago   Up 14 hours   0.0.0.0:33080->33080/tcp, :::33080->33080/tcp                              netbird-relay-1
f621d1e13ba7   netbirdio/signal:latest        "/go/bin/netbird-sig…"   31 hours ago   Up 31 hours   0.0.0.0:10000->80/tcp, [::]:10000->80/tcp                                  netbird-signal-1
5158c9fbf755   netbirdio/dashboard:latest     "/usr/bin/supervisor…"   31 hours ago   Up 31 hours   0.0.0.0:80->80/tcp, :::80->80/tcp, 0.0.0.0:443->443/tcp, :::443->443/tcp   netbird-dashboard-1
71fa9b82c220   coturn/coturn:latest           "docker-entrypoint.s…"   31 hours ago   Up 31 hours                                                                              netbird-coturn-1
```
  • The affected peer is unreachable from every other peer except the Docker host, which also runs the NetBird agent and still has a P2P connection to it. All other peers show it as "Relayed" and cannot reach it (running netbird down / netbird up doesn't help)
  • I noticed that the ICE candidate values (local/remote) are empty for the affected peer while it is in the "Relayed" state. Is that expected?
```
anon-BUzzm.domain:
  NetBird IP: 100.116.177.244
  Public key: KiYicIpIUWtzHV75IxVy6TsW4Lr3LxEgS7LqluYfqGk=
  Status: Connected
  -- detail --
  Connection type: Relayed
  ICE candidate (Local/Remote): -/-
  ICE candidate endpoints (Local/Remote): -/-
  Relay server address: rel://netbird.anon-LexV5.domain:33080
  Last connection update: 6 seconds ago
  Last WireGuard handshake: -
  Transfer status (received/sent) 0 B/296 B
  Quantum resistance: false
  Routes: -
  Networks: -
  Latency: 0s
```

On the Docker host, the affected peer does show local/remote endpoints, but the remote endpoint is the container's IP on the local Docker network:

```
anon-NYn0Z.domain:
  NetBird IP: 100.116.177.244
  Public key: KiYicIpIUWtzHV75IxVy6TsW4Lr3LxEgS7LqluYfqGk=
  Status: Connected
  -- detail --
  Connection type: P2P
  ICE candidate (Local/Remote): host/srflx
  ICE candidate endpoints (Local/Remote): 198.51.100.0:51820/172.19.0.2:51820
  Relay server address: rel://netbird.anon-DfWu5.domain:33080
  Last connection update: 14 hours, 32 minutes ago
  Last WireGuard handshake: 57 seconds ago
  Transfer status (received/sent) 221.6 KiB/297.5 KiB
  Quantum resistance: false
  Routes: -
  Networks: -
  Latency: 1.643664ms
```
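For reading the endpoints above: 172.19.0.2 falls inside the private 172.16.0.0/12 block, the range Docker typically allocates bridge networks from, while 198.51.100.0 lies in the 198.51.100.0/24 documentation range and is presumably an anonymized public address. The Go sketch below classifies a `Local/Remote` endpoint pair in the format printed by the status output. `classifyEndpoints` is a hypothetical helper, not part of NetBird, and it only checks the RFC 1918 / RFC 4193 private ranges via `netip.Addr.IsPrivate`.

```go
package main

import (
	"fmt"
	"net/netip"
	"strings"
)

// classifyEndpoints (hypothetical helper) splits a "Local/Remote" pair such as
// "198.51.100.0:51820/172.19.0.2:51820" and reports, for each side, whether
// the address is in a private range (RFC 1918 for IPv4, RFC 4193 for IPv6).
func classifyEndpoints(pair string) (bool, bool, error) {
	local, remote, ok := strings.Cut(pair, "/")
	if !ok {
		return false, false, fmt.Errorf("expected Local/Remote pair, got %q", pair)
	}
	l, err := netip.ParseAddrPort(local)
	if err != nil {
		return false, false, fmt.Errorf("local endpoint: %w", err)
	}
	r, err := netip.ParseAddrPort(remote)
	if err != nil {
		return false, false, fmt.Errorf("remote endpoint: %w", err)
	}
	return l.Addr().IsPrivate(), r.Addr().IsPrivate(), nil
}

func main() {
	localPriv, remotePriv, err := classifyEndpoints("198.51.100.0:51820/172.19.0.2:51820")
	if err != nil {
		panic(err)
	}
	// Prints: local private: false, remote private: true
	fmt.Printf("local private: %v, remote private: %v\n", localPriv, remotePriv)
}
```

An empty pair such as `-/-` (as in the relayed peer's output) is rejected with an error rather than classified.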
  • `docker compose restart relay` restores connectivity immediately (at least in the `relayed` state)
  • I haven't found a way to reproduce it yet
  • I'm not sure when this issue started, but I think it began with the "new relay"
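Since the switch happens every few hours and hasn't been reproduced on demand, it may help to capture `netbird status -d` periodically and flag peers that look like the affected one: connection type `Relayed` with no WireGuard handshake. The Go sketch below parses the text layout shown in this issue. The parser, the heuristic, and the names `parseStatus` / `stuckRelayed` are assumptions for illustration, not NetBird tooling; extra lines in the full status output are only loosely tolerated.

```go
package main

import (
	"bufio"
	"fmt"
	"strings"
)

// peerStatus keeps only the fields needed to spot the symptom in this issue.
type peerStatus struct {
	Name           string
	NetBirdIP      string
	ConnectionType string
	LastHandshake  string
}

// sampleStatus is an abbreviated copy of the two status blocks in this issue.
const sampleStatus = `anon-BUzzm.domain:
  NetBird IP: 100.116.177.244
  Status: Connected
  -- detail --
  Connection type: Relayed
  Last WireGuard handshake: -
anon-NYn0Z.domain:
  NetBird IP: 100.116.177.244
  Status: Connected
  -- detail --
  Connection type: P2P
  Last WireGuard handshake: 57 seconds ago
`

// parseStatus (hypothetical) reads "Key: Value" lines. A line ending in ":"
// without a ": " separator is remembered as a header; a "NetBird IP" key
// starts a new peer named after the most recent header.
func parseStatus(out string) []peerStatus {
	var peers []peerStatus
	header := ""
	sc := bufio.NewScanner(strings.NewReader(out))
	for sc.Scan() {
		line := strings.TrimSpace(sc.Text())
		key, val, ok := strings.Cut(line, ": ")
		if !ok {
			if strings.HasSuffix(line, ":") {
				header = strings.TrimSuffix(line, ":")
			}
			continue
		}
		if key == "NetBird IP" {
			peers = append(peers, peerStatus{Name: header, NetBirdIP: val})
			continue
		}
		if len(peers) == 0 {
			continue // lines before the first peer block
		}
		p := &peers[len(peers)-1]
		switch key {
		case "Connection type":
			p.ConnectionType = val
		case "Last WireGuard handshake":
			p.LastHandshake = val
		}
	}
	return peers
}

// stuckRelayed applies the heuristic: relayed and no handshake yet ("-").
func stuckRelayed(peers []peerStatus) []string {
	var names []string
	for _, p := range peers {
		if p.ConnectionType == "Relayed" && p.LastHandshake == "-" {
			names = append(names, p.Name)
		}
	}
	return names
}

func main() {
	fmt.Println(stuckRelayed(parseStatus(sampleStatus))) // [anon-BUzzm.domain]
}
```

In practice you would run `netbird status -d` on a schedule (for example from cron), feed the output to `parseStatus`, and log a timestamp whenever `stuckRelayed` returns a non-empty list, which would at least pin down when the switch happens.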

Excerpt of a client log (note the odd-looking IP "127.1.177.244"):

```
2025-01-19T08:04:05+01:00 ERRO client/iface/wgproxy/bind/proxy.go:118: failed to read from remote conn: rel://netbird.domain/.com:33080, use of closed network connection
2025-01-19T08:04:05+01:00 INFO [relay: rel://netbird.domain.com:33080] relay/client/client.go:216: open connection to peer: sha-Q9xyX+WBgKqA+4bVKeeU+EGITTsl+l5qQtSX3J/J+TM=
2025-01-19T08:04:05+01:00 INFO [peer: KiYicIpIUWtzHV75IxVy6TsW4Lr3LxEgS7LqluYfqGk=] client/internal/peer/conn.go:444: created new wgProxy for relay connection: 127.1.177.244:51820
2025-01-19T08:04:05+01:00 INFO [peer: KiYicIpIUWtzHV75IxVy6TsW4Lr3LxEgS7LqluYfqGk=] client/internal/peer/conn.go:473: start to communicate with peer via relay
```

Relay log:

```
relay-1  | 2025-01-19T06:46:00Z INFO relay/server/listener/ws/listener.go:91: WS client connected from: 88.217.39.44:58954
relay-1  | 2025-01-19T06:46:00Z INFO [peer_id: sha-bnw/WMyavOJjnGby785Go9eaVCsh9Wr4BL7lNggzx44=] relay/server/relay.go:129: peer connected from: 88.217.39.44:58954
relay-1  | 2025-01-19T06:58:39Z ERRO [peer_id: sha-/Dz/1ntGJXBfCBci/Ibl6g7imZbWJ7l7G1qPgVHeU6c=] relay/server/peer.go:180: peer healthcheck timeout
relay-1  | 2025-01-19T06:58:39Z INFO [peer_id: sha-/Dz/1ntGJXBfCBci/Ibl6g7imZbWJ7l7G1qPgVHeU6c=] relay/server/peer.go:185: peer connection closed due healthcheck timeout
relay-1  | 2025-01-19T06:58:39Z DEBG [peer_id: sha-/Dz/1ntGJXBfCBci/Ibl6g7imZbWJ7l7G1qPgVHeU6c=] relay/server/relay.go:137: relay connection closed
```

NetBird version
Self-hosted, 0.36.3

Additional context

I'm still not sure what restores P2P connectivity. I also tried restarting the sidecar container, without success.
Restarting the whole Docker host restored P2P connectivity for the DNS sidecar container at least once, but then the connection to the Docker host itself switched to "Relayed".

Could the new rootless feature NB_USE_NETSTACK_MODE help?

saavagebueno added the triage-needed label 2025-11-20 05:32:54 -05:00

@the-project-group commented on GitHub (Jan 19, 2025):

Rootless mode (NB_USE_NETSTACK_MODE) won't work for me, because it apparently can't use routes from routing peers.
I need to push a route to some internal DNS servers (conditional DNS forwarders) through a routing peer to the DNS server container/sidecar, because these domain controllers don't run NetBird.

Reference: SVI/netbird#1563