Nameserver being randomly unavailable #814

Open
opened 2025-11-20 05:17:56 -05:00 by saavagebueno · 7 comments

Originally created by @Enailis on GitHub (Apr 19, 2024).

Describe the problem

We're using the self-hosted version of NetBird and everything is set up according to the documentation. Sometimes the custom nameserver is available, sometimes it isn't, even though nobody touches the configuration in the web interface.

To give more context, here is our network configuration:

  • We host multiple internal services, GitLab for example
  • We can access those services using pfSense's DNS resolver
  • Every user in NetBird has a Network Route to pfSense's IP (10.10.10.1)
  • Every user in NetBird has a Network Route to the IPs of the different internal services, such as the one hosting GitLab
  • We have a Nameserver that matches the domain of those services (like gitlab.mycompany.com) and uses pfSense's IP

When users are connected to the NetBird VPN, they can ping every server and every other user without any problem. For example, users can ping GitLab's NetBird IP:

> ping 100.73.149.194
PING 100.73.149.194 (100.73.149.194): 56 data bytes
64 bytes from 100.73.149.194: icmp_seq=0 ttl=64 time=35.938 ms
64 bytes from 100.73.149.194: icmp_seq=1 ttl=64 time=32.203 ms
64 bytes from 100.73.149.194: icmp_seq=2 ttl=64 time=32.427 ms

But users cannot ping pfSense's DNS Resolver IP:

> ping 10.10.10.1
PING 10.10.10.1 (10.10.10.1): 56 data bytes
Request timeout for icmp_seq 0
Request timeout for icmp_seq 1
Request timeout for icmp_seq 2

The netbird status -d command reports this problem:

[10.10.10.1:53] for [gitlab.mycompany.com] is Unavailable, reason: 1 error occurred:
	* read udp 192.168.1.182:53408->10.10.10.1:53: i/o timeout
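
For anyone who wants to watch for this state automatically, here is a minimal Python sketch that pulls the nameserver lines out of saved netbird status -d text. It assumes the line format quoted above ([address:port] for [domains] is Available/Unavailable); the function name parse_nameservers is made up for this example and is not part of NetBird.

```python
import re

# Matches lines such as:
#   [10.10.10.1:53] for [gitlab.mycompany.com] is Unavailable, reason: ...
# The format is assumed from the output quoted in this issue.
NS_LINE = re.compile(
    r"^\s*\[(?P<server>[^\]]+)\]\s+for\s+\[(?P<domains>[^\]]*)\]\s+is\s+"
    r"(?P<state>Available|Unavailable)"
)


def parse_nameservers(status_text):
    """Return a list of (server, [domains], available) tuples."""
    results = []
    for line in status_text.splitlines():
        match = NS_LINE.match(line)
        if not match:
            continue  # relay lines and everything else are ignored
        domains = [d.strip() for d in match.group("domains").split(",") if d.strip()]
        available = match.group("state") == "Available"
        results.append((match.group("server"), domains, available))
    return results


if __name__ == "__main__":
    sample = (
        "[10.10.10.1:53] for [gitlab.mycompany.com] is Unavailable, "
        "reason: 1 error occurred:\n"
        "\t* read udp 192.168.1.182:53408->10.10.10.1:53: i/o timeout\n"
    )
    for server, domains, available in parse_nameservers(sample):
        print(server, domains, "OK" if available else "UNAVAILABLE")
```

Running this periodically against fresh status output would at least give timestamps for when the nameserver flips state, which might help correlate it with other events.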

Apart from this, we have no server-side logs about the problem, and the only client-side trace is i/o timeout for the pfSense DNS Resolver IP in var/log/netbird/client.log.

But sometimes, without changing anything, either client or server side, everything works just fine.

This issue appears on every OS: Windows 11, macOS 14.4.1 (23E224) and Ubuntu 22.04.

To Reproduce

Since the problem is random, we don't know how to reproduce it.

Expected behavior

The Nameserver should be consistently recognized by NetBird instead of being randomly unavailable.

Are you using NetBird Cloud?

We're using the self-hosted NetBird solution.

NetBird version

Every user is up to date: 0.27.3.

saavagebueno added the triage-needed label 2025-11-20 05:17:56 -05:00

@pascal-fischer commented on GitHub (Apr 22, 2024):

Hi @Enailis,

how exactly is the route to 10.10.10.1 set up? Are you sure the configured routing peer is online and successfully connected to the user's peer that tries to ping? Is that connection direct or relayed? With netbird status -d, can you detect a difference in the connection between when it is working and when it is not?
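
One way to do that comparison without eyeballing it is to save the status output once while it works and once while it doesn't, then diff the two. Below is a rough Python sketch using the standard difflib module. The helper names (stable_lines, diff_status) are hypothetical, and the list of fields to ignore is an assumption based on the per-peer fields that netbird status -d prints in this thread.

```python
import difflib

# Fields that change on every run and would drown out real differences.
# Names are taken from the `netbird status -d` output quoted in this thread.
VOLATILE_PREFIXES = (
    "Last connection update:",
    "Last WireGuard handshake:",
    "Transfer status",
    "Latency:",
)


def stable_lines(status_text):
    """Drop volatile fields and trailing whitespace from a status capture."""
    return [
        line.rstrip()
        for line in status_text.splitlines()
        if not line.strip().startswith(VOLATILE_PREFIXES)
    ]


def diff_status(working, broken):
    """Return unified-diff lines between two saved status captures."""
    return list(
        difflib.unified_diff(
            stable_lines(working),
            stable_lines(broken),
            fromfile="working",
            tofile="broken",
            lineterm="",
        )
    )
```

An empty result from diff_status means the two captures only differ in the ignored fields, which would itself be useful to know.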


@Enailis commented on GitHub (Apr 22, 2024):

Hi @pascal-fischer,

We have a network route to 10.10.10.1/32 using our internal servers as the peer group. All servers in this group have access to 10.10.10.1. This route is distributed to all users. The group contains 3 different peers, and they're all online. The servers can't ping the users on their NetBird IPs. The users can ping the servers using their real IPs but not their NetBird IPs. The connection to the 3 servers in the peer group is relayed.

We actually can't detect any difference in netbird status -d between when it's working and when it's not. The current configuration gives me the following result for netbird status -d. There are other peers, but they all look the same as the one shown here:

 server.mycompany.com:
  NetBird IP: 100.73.252.226
  Public key: RupexIsExt4J2oKsN4avstKkjD03vlSq728BzT/uvB8=
  Status: Connected
  -- detail --
  Connection type: Relayed
  Direct: false
  ICE candidate (Local/Remote): relay/prflx
  ICE candidate endpoints (Local/Remote): 90.90.90.90:63293/80.80.80.80:63293
  Last connection update: 2024-04-22 13:58:48
  Last WireGuard handshake: 2024-04-22 14:11:34
  Transfer status (received/sent) 1.9 KiB/1.5 KiB
  Quantum resistance: false
  Routes: -
  Latency: 55.684815ms

Daemon version: 0.27.3
CLI version: 0.27.3
Management: Connected to https://vpn.mycompany.com:33073
Signal: Connected to http://vpn.mycompany.com:10000
Relays:
  [stun:vpn.mycompany.com:3478] is Available
  [turn:vpn.mycompany.com:3478?transport=udp] is Available
Nameservers:
  [10.10.10.1:53] for [gitlab.mycompany.com] is Available
FQDN: ena.mycompany.com
NetBird IP: 100.73.213.219/16
Interface type: Kernel
Quantum resistance: false
Routes: -
Peers count: 8/13 Connected

Even though everything looks fine, I cannot access gitlab.mycompany.com.

To add to my original issue: it now works for some Windows users. The client takes a long time to connect, and sometimes users have to run netbird up/netbird down multiple times before it actually works. It still doesn't work for Linux, macOS and some Windows users.


@vincent-lg18 commented on GitHub (Apr 24, 2024):

Hello,
I'm working on the same Netbird instance as @Enailis

Some corrections / additional information on the above post:

  • the domain name of the peer 100.73.252.226 is server.mycompany.vpn and not server.mycompany.com
  • gitlab.mycompany.com is hosted on 100.73.252.226

Here is some other additional information:

On Windows clients (our users connected with SSO), our nameserver (10.10.10.1:53) is unstable, and its availability can change from one netbird down & netbird up to another for no apparent reason.
When it's available, we can access gitlab.mycompany.com and server.mycompany.vpn without any problem.
However, when it's unavailable ([10.10.10.1:53] for [gitlab.mycompany.com] is Unavailable, reason: 1 error occurred: * read udp 192.168.1.182:53408->10.10.10.1:53: i/o timeout), we can no longer access gitlab.mycompany.com, but we can still access server.mycompany.vpn.

On our Linux clients (other users connected with SSO), the behavior is different.
Our nameserver (10.10.10.1:53) is always marked as available in netbird status -d. However, it is impossible to access gitlab.mycompany.com or server.mycompany.vpn.

Here is a client's /etc/resolv.conf file:

# Generated by NetworkManager
nameserver 192.168.1.1

If I run dig gitlab.mycompany.com, I don't get an IP address back. However, if I run dig @10.10.10.1 gitlab.mycompany.com, its IP appears. So by adding the line nameserver 10.10.10.1 to the clients' /etc/resolv.conf files, we can access our GitLab, but we still can't access server.mycompany.vpn.

Note that we can still access our GitLab via its IP address (both the IP given by NetBird and its real IP). Our routes are therefore configured correctly, and the problem comes only from DNS resolution.
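
For machines without dig, the direct query against 10.10.10.1 can be approximated with a short Python sketch using only the standard library. It builds a plain A-record query following RFC 1035 and reads just the response header, so it only distinguishes "no reply within the timeout" from "the resolver answered". The function names build_query and probe are invented for this example; nothing here is NetBird's own code.

```python
import socket
import struct


def build_query(name, query_id=0x1234):
    """Build a minimal DNS query for the A record of `name` (RFC 1035)."""
    # Header: id, flags (recursion desired), 1 question, 0 other records.
    header = struct.pack("!HHHHHH", query_id, 0x0100, 1, 0, 0, 0)
    qname = b"".join(
        bytes([len(label)]) + label.encode("ascii")
        for label in name.rstrip(".").split(".")
    ) + b"\x00"
    # QTYPE=1 (A), QCLASS=1 (IN)
    return header + qname + struct.pack("!HH", 1, 1)


def probe(server, name, timeout=2.0, port=53):
    """Return (rcode, answer_count), or None if no usable reply arrives."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout)
        try:
            sock.sendto(build_query(name), (server, port))
            data, _ = sock.recvfrom(512)
        except OSError:
            # Covers timeouts as well as unreachable-network errors.
            return None
    if len(data) < 12:
        return None  # too short to contain a DNS header
    _id, flags, _qd, ancount, _ns, _ar = struct.unpack("!HHHHHH", data[:12])
    return flags & 0x000F, ancount
```

Calling probe("10.10.10.1", "gitlab.mycompany.com") on an affected client should return None in the timeout case and an (rcode, answer_count) tuple when the resolver is reachable through the route.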

Finally, note that this problem never appears for Linux clients installed with a Setup Key (our servers). Here's their /etc/resolv.conf file:

...
nameserver 127.0.0.53
options edns0 trust-ad
search company.vpn company.com

We therefore believe that the problem only comes from Netbird clients, which cannot apply DNS configurations to our workstations (Linux and Windows).
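
To make the difference between the two /etc/resolv.conf files above easier to check across many workstations, here is a small Python sketch that extracts the nameserver and search entries. The function name parse_resolv_conf is hypothetical. Note that 127.0.0.53 is normally the systemd-resolved stub listener, so the working servers resolve through a local stub, while the failing workstation talks straight to the LAN resolver at 192.168.1.1; whether that is the actual cause here is only a guess.

```python
def parse_resolv_conf(text):
    """Return (nameservers, search_domains) from resolv.conf-style text."""
    nameservers, search = [], []
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line[0] in "#;":
            continue  # skip blank lines and comments
        fields = line.split()
        if fields[0] == "nameserver" and len(fields) > 1:
            nameservers.append(fields[1])
        elif fields[0] == "search":
            search = fields[1:]  # the last search line wins
    return nameservers, search


if __name__ == "__main__":
    workstation = "# Generated by NetworkManager\nnameserver 192.168.1.1\n"
    servers, domains = parse_resolv_conf(workstation)
    if "10.10.10.1" not in servers and "127.0.0.53" not in servers:
        print("system resolver bypasses both the stub and 10.10.10.1:", servers)
```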


@vincent-lg18 commented on GitHub (Apr 25, 2024):

Hello, here is some additional information about our Windows client errors.

Here are the lines in the client.log file when the error [10.10.10.1:53] for [gitlab.mycompany.com] is Unavailable, reason: 1 error occurred: * read udp 192.168.1.182:53408->10.10.10.1:53: i/o timeout appears on our Windows clients:

2024-04-25T11:58:42+02:00 ERRO util/net/dialer_generic.go:64: Failed to call dialer hooks: 1 error occurred:
        * executing dial hook: 1 error occurred:
        * adding route reference: failed to add route for prefix 90.90.90.90/32: add route to table: PowerShell add route: exit status 1




2024-04-25T11:58:43+02:00 ERRO util/net/listener_generic.go:128: Error executing listener write hook: adding route reference: failed to add route for prefix 90.90.90.90/32: add route to table: PowerShell add route: exit status 1

@florian-obradovic commented on GitHub (May 28, 2024):

Similar issue here on macOS.

  • 192.168.99.1 is the IP of the DNS server
  • 192.168.99.1 is reachable via ICMP
  • nslookup docker.my-localdomain.local 192.168.99.1 also works
Server:		192.168.99.1
Address:	192.168.99.1#53

Non-authoritative answer:
Name:	docker.my-localdomain.local
Address: 192.168.99.125

OS: darwin/arm64
Daemon version: 0.27.10
CLI version: 0.27.10
Management: Connected to https://netbird.mydomain.com:33073
Signal: Connected to http://netbird.mydomain.com:10000
Relays:
  [stun:netbird.mydomain.com:3478] is Available
  [turn:netbird.mydomain.com:3478?transport=udp] is Available
Nameservers:
  [192.168.99.1:53] for [my-localdomain.local, mydomain.com] is Unavailable, reason: 1 error occurred:
	* read udp 100.102.88.179:65220->192.168.99.1:53: i/o timeout
FQDN: nbfombprom1max.ivo
NetBird IP: 100.102.88.179/16
Interface type: Userspace
Quantum resistance: false
Routes: -
Peers count: 6/12 Connected

@fruworg commented on GitHub (Jan 23, 2025):

Same problem. Any updates?


@the-project-group commented on GitHub (Jan 23, 2025):

Still the same here - it works after restarting the relay container.
Is this related to https://github.com/netbirdio/netbird/issues/3213 ?


Reference: SVI/netbird#814