Regression: Frequent disconnects with version 0.30.2 #1356

Closed
opened 2025-11-20 05:28:55 -05:00 by saavagebueno · 25 comments
Owner

Originally created by @christian-schlichtherle on GitHub (Oct 21, 2024).

Describe the problem

We are running an IoT project where some Linux-based K3s nodes on the edge are located at customer premises and communicate with other Linux-based K3s nodes in the cloud. This project has been running for almost three years now. Previously, we connected and managed all nodes via OpenVPN (so we could SSH into every node, even when it was connected at customer premises) and then installed K3s on each node. A few months ago we replaced OpenVPN with NetBird because of its many advantages, such as performance and a peer-to-peer topology with central management.

Since then, we have applied updates as soon as possible. We started at 0.27.10 and now we are (or were) at 0.30.2. Unfortunately, somewhere between version 0.28.4 and 0.30.2 we started to observe frequent network partitions (disconnects). They happen randomly after some hours, mostly several times a day and at least once per day, following no particular pattern. I checked many potential causes, including the IP address changes that happen to the CPE every night (according to the Internet provider's plan), but none of these was the root cause.

Recently I decided to downgrade the network from version 0.30.2 to version 0.28.4, and since then we have not had a single network partition / disconnect.

To Reproduce

Set up a number of nodes and run them 24/7. If you set up the nodes using Ansible, you can discover network partitions like this:

ansible netbird_client -m shell -a 'netbird status --filter-by-status disconnected | grep netbird.cloud | grep -v FQDN || true'

If there is no network partition, then each node produces an empty output, otherwise it lists the nodes it cannot connect to.
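
To track partitions over time rather than spot-check them, the same filter can be wrapped in a small script and run periodically on each node. This is a minimal sketch: `list_peers` and `log_disconnected` are hypothetical helper names, and the log file path is whatever you pass in. Only the `netbird status --filter-by-status disconnected` call and the two grep filters come from the command above.

```shell
#!/bin/sh
# Filter `netbird status` output (read from stdin) down to peer lines.
# Mirrors the grep pipeline from the Ansible command; prints nothing
# when no peer is listed.
list_peers() {
    grep 'netbird.cloud' | grep -v 'FQDN' || true
}

# Hypothetical helper: append one timestamped line per disconnected peer
# to the log file given as the first argument.
log_disconnected() {
    logfile=$1
    netbird status --filter-by-status disconnected | list_peers |
    while IFS= read -r peer; do
        printf '%s %s\n' "$(date -u +%Y-%m-%dT%H:%M:%SZ)" "$peer" >> "$logfile"
    done
}
```

Calling `log_disconnected /tmp/netbird-partitions.log` from a timer or cron job would then leave a record of when each partition was observed, which helps when correlating with nightly IP address changes.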

Expected behavior

These Linux nodes should stay connected 24/7, real Internet outages aside.

Are you using NetBird Cloud?

Yes

NetBird version

see above

NetBird status -dA output:

n/a

Do you face any (non-mobile) client issues?

n/a

Screenshots

n/a

Additional context

n/a

saavagebueno added the clientcloud labels 2025-11-20 05:28:55 -05:00

@wiiun commented on GitHub (Oct 22, 2024):

I also encountered the same situation


@mgarces commented on GitHub (Oct 22, 2024):

Hello @christian-schlichtherle, thank you for opening this issue and for the commands you provided; we will investigate.
It would help to have debug logs from those clients. Is it possible to turn them on and let them run for a long period of time? You can do this by following these [docs](https://docs.netbird.io/how-to/troubleshooting-client#getting-client-logs).
Thanks


@mgarces commented on GitHub (Oct 22, 2024):

Hi again! We've found a bug in our reconnection logic, and we are working on improvements for it. We currently have a pull request [ongoing](https://github.com/netbirdio/netbird/pull/2758); would you be willing to test it out before we release it?


@christian-schlichtherle commented on GitHub (Oct 23, 2024):

I could install an update on our development cluster for some limited time over the weekend. Since we are using Ansible, how would I go about installing a pre-release?


@DutchCloud4Work commented on GitHub (Oct 23, 2024):

I'm also experiencing problems: whenever a workstation (mostly Windows) goes into sleep mode (over lunch, for example), it never reconnects. The only fix is a reboot of the entire system.
This started after the upgrade from 0.29 to 0.30.


@christian-schlichtherle commented on GitHub (Oct 24, 2024):

@DutchCloud4Work you can try `netbird service restart` to fix this issue. A reboot should not be required. I'm on macOS, however, so your situation may be different.


@DutchCloud4Work commented on GitHub (Oct 24, 2024):

@christian-schlichtherle

That doesn't work. Even after restarting the service in Windows and restarting the GUI, I still have no connectivity (the GUI says it's connected, but no traffic passes).


@ngtrthanh commented on GitHub (Oct 24, 2024):

I have nearly the same problem with disconnects to some peers.
Host: Ubuntu 24.02.
NetBird version: 0.30.2
The `netbird status` report disagrees with the Dashboard:
the Dashboard shows 8 of 13 accessible peers, but the CLI shows 5/13.
A down/up cycle solves the problem, but it came back after 12 h.
I would like to help test the new version, or roll back to the previous one.


@mgarces commented on GitHub (Oct 24, 2024):

Hi, `v0.30.3` is now live! While we have conducted thorough testing, if you encounter any unusual behaviour, it might be related to this change. Your feedback will be invaluable in ensuring the stability and performance of this update. Please report back whether your connectivity issues are resolved or whether you encounter any other hurdle.


@christian-schlichtherle commented on GitHub (Oct 24, 2024):

Will give it a try on our development cluster over the weekend - thank you so much!


@christian-schlichtherle commented on GitHub (Oct 24, 2024):

Actually, I installed it just now. Here are my findings:

Remote installation with `apt install netbird=0.30.3` on the cloud nodes in the dev cluster went well.
Installation on the single edge node in the dev cluster using the same command hung. When I SSH into the node via the LAN port and run `netbird status`, I get:

$ netbird status
Error: failed to connect to daemon error: context deadline exceeded
If the daemon is not running please run: 
netbird service install 
netbird service start

So apparently the service could not be restarted when upgrading from 0.28.4 to 0.30.3. I'm glad this is only a single node in the dev cluster, which I can power-cycle manually. In the prod cluster, this would be a disaster, as I would have to call the customers and ask each of them to reboot the edge nodes manually.

Some more diagnostic output:

# systemctl status netbird
● netbird.service - A WireGuard-based mesh network that connects your devices into a single private network.
     Loaded: loaded (/etc/systemd/system/netbird.service; enabled; preset: enabled)
     Active: activating (auto-restart) (Result: exit-code) since Thu 2024-10-24 19:35:30 UTC; 28s ago
    Process: 266290 ExecStart=/usr/bin/netbird service run --config /etc/netbird/config.json --log-level info --daemon-addr unix:///var/run/netbird.sock --log-file /var/log/netbird/client.log (code=exited, status=2)
   Main PID: 266290 (code=exited, status=2)
        CPU: 281ms

From journalctl:

Oct 24 19:35:25 de-nw-45134-cs-d0 systemd[1]: Started netbird.service - A WireGuard-based mesh network that connects your devices into a single private network..
░░ Subject: A start job for unit netbird.service has finished successfully
░░ Defined-By: systemd
░░ Support: http://www.ubuntu.com/support
░░ 
░░ A start job for unit netbird.service has finished successfully.
░░ 
░░ The job identifier is 28342.
Oct 24 19:35:30 de-nw-45134-cs-d0 systemd[1]: netbird.service: Main process exited, code=exited, status=2/INVALIDARGUMENT
░░ Subject: Unit process exited
░░ Defined-By: systemd
░░ Support: http://www.ubuntu.com/support
░░ 
░░ An ExecStart= process belonging to unit netbird.service has exited.
░░ 
░░ The process' exit code is 'exited' and its exit status is 2.
Oct 24 19:35:30 de-nw-45134-cs-d0 systemd[1]: netbird.service: Failed with result 'exit-code'.
░░ Subject: Unit failed
░░ Defined-By: systemd
░░ Support: http://www.ubuntu.com/support
░░ 
░░ The unit netbird.service has entered the 'failed' state with result 'exit-code'.

Obviously it complains about an invalid argument, but I haven't configured anything different on this node than any other. Maybe this error message is a false positive?
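
A note on the status label: systemd prints `2/INVALIDARGUMENT` for any process that exits with status 2, because it maps low exit codes to their LSB init-script names. The label alone does not show that an argument was rejected. A Go program that dies from an unrecovered panic also exits with status 2, so a crash is worth ruling out. The helper below is a minimal sketch of that reading; the function name `explain_exit_status` and its messages are hypothetical and not part of NetBird or systemd.

```shell
#!/bin/sh
# Hypothetical helper: turn a raw exit status into a triage hint.
explain_exit_status() {
    case "$1" in
        0) echo "clean exit" ;;
        2) echo "status 2: systemd labels this INVALIDARGUMENT; a Go panic also exits with 2, so check for a crash log" ;;
        *) echo "status $1: consult the service log" ;;
    esac
}

# Usage on a systemd host (ExecMainStatus is a standard unit property):
#   explain_exit_status "$(systemctl show netbird --property=ExecMainStatus --value)"
```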

Finally, I rebooted the node and the problem disappeared.

Now I will start to monitor the stability.
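
The failure mode in this comment (a unit stuck in `activating (auto-restart)` after the package upgrade) suggests verifying the service after each remote install before moving on to the next host. This is a minimal sketch, assuming `systemctl is-active` is an acceptable health probe; the function `wait_stable` and its thresholds are hypothetical. It requires several consecutive successes because a crash-looping unit can look active briefly between restarts (the journal shows about five seconds between start and exit).

```shell
#!/bin/sh
# wait_stable REQUIRED ATTEMPTS DELAY PROBE...
# Runs PROBE up to ATTEMPTS times, DELAY seconds apart, and succeeds only
# once PROBE has passed REQUIRED times in a row.
wait_stable() {
    required=$1
    attempts=$2
    delay=$3
    shift 3
    streak=0
    while [ "$attempts" -gt 0 ]; do
        if "$@"; then
            streak=$((streak + 1))
            if [ "$streak" -ge "$required" ]; then
                return 0
            fi
        else
            streak=0    # any failure resets the run of successes
        fi
        attempts=$((attempts - 1))
        sleep "$delay"
    done
    return 1
}
```

For example, `wait_stable 6 30 5 systemctl is-active --quiet netbird` passes only after six successful probes in a row, five seconds apart, within at most 30 probes. In an Ansible rollout, a non-zero return could stop the play before the remaining edge nodes are upgraded.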


@christian-schlichtherle commented on GitHub (Oct 24, 2024):

PS: I noticed that the troubled edge node has changed its DNS name to `my-name-1.netbird.cloud` (note the additional `-1`). I'm not sure when that happened or how it is related, if at all.


@christian-schlichtherle commented on GitHub (Oct 25, 2024):

Two more hosts failed the remote install, so there is definitely a problem.


@lixmal commented on GitHub (Oct 25, 2024):

@christian-schlichtherle could you provide the log file, please?

netbird debug bundle -A

The log from the zip should be sufficient.

Or, if that fails, just `/var/log/netbird/client.log`


@DutchCloud4Work commented on GitHub (Oct 25, 2024):

The update went without issues on my side (0.29 / 0.30.2 > 0.30.3), and it seems to have fixed my issue.
Now, after my laptop wakes up from sleep, it reconnects after +- 60.


@christian-schlichtherle commented on GitHub (Oct 25, 2024):

@lixmal

# netbird debug bundle -A
Error: failed to connect to daemon error: context deadline exceeded
If the daemon is not running please run: 
netbird service install 
netbird service start

After installation, the netbird service is not starting up. I tried several restarts using `netbird service restart` as well as `systemctl restart netbird`, with no luck. This is the tail of `/var/log/netbird/client.log`:

2024-10-25T11:43:47Z INFO client/cmd/service_controller.go:24: starting Netbird service
2024-10-25T11:43:47Z INFO client/cmd/service_controller.go:66: started daemon server: /var/run/netbird.sock
2024-10-25T11:43:47Z INFO client/internal/connect.go:111: starting NetBird client version 0.30.3 on linux/arm64
2024-10-25T11:43:48Z INFO client/internal/connect.go:240: connecting to the Relay service(s): rels://relay.netbird.io:443
2024-10-25T11:43:48Z INFO relay/client/picker.go:66: try to connecting to relay server: rels://relay.netbird.io:443
2024-10-25T11:43:48Z INFO [relay: rels://relay.netbird.io:443] relay/client/client.go:166: create new relay connection: local peerID: VEz5HwhOlpeJFpKlvqsfWl6nCh03GbshHg6HwyIRNGg=, local peer hashedID: sha-+b9ykAd8KDzwzrvKbhowImdFQ/H6yvP4ob0MBHb20LE=
2024-10-25T11:43:48Z INFO [relay: rels://relay.netbird.io:443] relay/client/client.go:172: connecting to relay server
2024-10-25T11:43:48Z INFO [relay: rels://streamline-de-fra1-0.relay.netbird.io:443] relay/client/client.go:189: relay connection established
2024-10-25T11:43:48Z INFO relay/client/picker.go:84: connected to Relay server: rels://relay.netbird.io:443
2024-10-25T11:43:48Z INFO relay/client/picker.go:58: chosen home Relay server: rels://relay.netbird.io:443
2024-10-25T11:43:48Z INFO client/internal/connect.go:400: using 59431 as wireguard port: 51820 is in use
2024-10-25T11:43:48Z INFO client/iface/wgproxy/ebpf/proxy.go:91: local wg proxy listening on: 3128
2024-10-25T11:43:48Z INFO client/iface/wgproxy/factory_kernel.go:29: WireGuard Proxy Factory will produce eBPF proxy
2024-10-25T11:43:48Z INFO client/internal/routemanager/manager.go:144: Routing setup complete
2024-10-25T11:43:48Z INFO client/firewall/create_linux.go:77: creating an nftables firewall manager
2024-10-25T11:43:48Z ERRO client/firewall/create_linux.go:44: failed to init nftables manager: router init: create containers: nftables: unable to initialize table: conn.Receive: netlink receive: no such file or directory
2024-10-25T11:43:48Z INFO client/internal/dns/host_unix.go:54: System DNS manager discovered: systemd
2024-10-25T11:43:48Z INFO client/internal/peer/guard/sr_watcher.go:106: reconnected to Signal or Relay server
2024-10-25T11:43:48Z INFO signal/client/grpc.go:149: connected to the Signal Service stream
2024-10-25T11:43:48Z INFO client/internal/engine.go:1415: Network monitor is disabled, not starting
2024-10-25T11:43:48Z INFO client/internal/connect.go:268: Netbird engine started, the IP is: 100.90.89.63/16
2024-10-25T11:43:49Z ERRO signal/client/grpc.go:413: error while handling message of Peer [key: fBMFPOPCBuXdpOBjwKOmBknXE1eIG7OqJUitHYLOzmQ=] error: [wrongly addressed message fBMFPOPCBuXdpOBjwKOmBknXE1eIG7OqJUitHYLOzmQ=]
2024-10-25T11:43:49Z INFO management/client/grpc.go:155: connected to the Management Service stream
2024-10-25T11:43:49Z WARN client/internal/engine.go:597: running SSH server is not permitted
2024-10-25T11:43:49Z INFO client/internal/acl/manager.go:56: ACL rules processed in: 1.274267ms, total rules count: 0

Only a reboot helped.
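
When reading a tail like this one, the `ERRO` and `WARN` lines are the quickest way in; here they are the nftables initialization failure, the wrongly addressed signal message, and the SSH server warning. A minimal sketch for extracting them, assuming the log layout shown in this comment (timestamp, then level, then source location); the function name `log_problems` is hypothetical.

```shell
#!/bin/sh
# Print only ERRO and WARN entries from a NetBird client log read on stdin.
# Assumes the level is the second whitespace-separated field.
log_problems() {
    awk '$2 == "ERRO" || $2 == "WARN"'
}

# Usage: log_problems < /var/log/netbird/client.log
```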


@lixmal commented on GitHub (Oct 25, 2024):

Ah, I see. I don't think we get logs.

Can you run

 sudo netbird service run --log-level trace --log-file console

and return the output?


@mlsmaycon commented on GitHub (Oct 25, 2024):

Can you check whether there is anything in `/var/log/netbird/netbird.err`?


@christian-schlichtherle commented on GitHub (Oct 25, 2024):

BTW: The affected systems all run Ubuntu 24.04.1 LTS.

From the same node, this is `/var/log/netbird/netbird.err`:

panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x0 pc=0xa7eb24]

goroutine 273 [running]:
github.com/google/nftables.(*Conn).newRule(0x4000c22a00, 0x40006e0300, 0x0)
        /home/runner/go/pkg/mod/github.com/google/nftables@v0.2.0/rule.go:123 +0x1c4
github.com/google/nftables.(*Conn).AddRule(0x400079efb8?, 0x400068c180?)
        /home/runner/go/pkg/mod/github.com/google/nftables@v0.2.0/rule.go:192 +0x3c
github.com/netbirdio/netbird/client/firewall/nftables.(*AclManager).addIOFiltering(0x4000b3c9c0, {0x40004f8180, 0x10, 0x10}, {0xefb543, 0x3}, 0x0, 0x0, 0x1, 0x0, ...)
        /home/runner/work/netbird/netbird/client/firewall/nftables/acl_linux.go:394 +0xbcc
github.com/netbirdio/netbird/client/firewall/nftables.(*AclManager).AddPeerFiltering(0x4000b3c9c0, {0x40004f8180, 0x10, 0x10}, {0xefb543, 0x3}, 0x0, 0x0, 0x1, 0x0, ...)
        /home/runner/work/netbird/netbird/client/firewall/nftables/acl_linux.go:109 +0xd0
github.com/netbirdio/netbird/client/firewall/nftables.(*Manager).AddPeerFiltering(0x0?, {0x40004f8180?, 0x0?, 0x0?}, {0xefb543?, 0x40?}, 0x38?, 0xefcdab8967452301?, 0x1032547698badcfe?, 0x0?, ...)
        /home/runner/work/netbird/netbird/client/firewall/nftables/manager_linux.go:131 +0x150
github.com/netbirdio/netbird/client/internal/acl.(*DefaultManager).addOutRules(0x40008ed4d0, {0x40004f8180, 0x10, 0x10}, {0xefb543, 0x3}, 0x0, 0x0, {0x40004f8170, 0x9}, ...)
        /home/runner/work/netbird/netbird/client/internal/acl/manager.go:327 +0x80
github.com/netbirdio/netbird/client/internal/acl.(*DefaultManager).protoRuleToFirewallRule(0x40008ed4d0, 0x4000903f20, {0x40004f8170, 0x9})
        /home/runner/work/netbird/netbird/client/internal/acl/manager.go:277 +0x348
github.com/netbirdio/netbird/client/internal/acl.(*DefaultManager).applyPeerACLs(0x40008ed4d0, 0x4000135380)
        /home/runner/work/netbird/netbird/client/internal/acl/manager.go:143 +0x45c
github.com/netbirdio/netbird/client/internal/acl.(*DefaultManager).ApplyFiltering(0x40008ed4d0, 0x4000135380)
        /home/runner/work/netbird/netbird/client/internal/acl/manager.go:66 +0x100
github.com/netbirdio/netbird/client/internal.(*Engine).updateNetworkMap(0x4000410fc8, 0x4000135380)
        /home/runner/work/netbird/netbird/client/internal/engine.go:752 +0xa8
github.com/netbirdio/netbird/client/internal.(*Engine).handleSync(0x4000410fc8, 0x4000921600)
        /home/runner/work/netbird/netbird/client/internal/engine.go:561 +0x498
github.com/netbirdio/netbird/management/client.(*GrpcClient).receiveEvents(0x40006793b0, {0x112c310, 0x40008c5c10}, {0xb7, 0xfa, 0x1a, 0xf9, 0x2e, 0xf, 0x1e, ...}, ...)
        /home/runner/work/netbird/netbird/management/client/grpc.go:260 +0x138
github.com/netbirdio/netbird/management/client.(*GrpcClient).handleStream(0x40006793b0, {0x1121758?, 0x4000b2e000?}, {0xb7, 0xfa, 0x1a, 0xf9, 0x2e, 0xf, 0x1e, ...}, ...)
        /home/runner/work/netbird/netbird/management/client/grpc.go:159 +0x1f8
github.com/netbirdio/netbird/management/client.(*GrpcClient).Sync.func1()
        /home/runner/work/netbird/netbird/management/client/grpc.go:130 +0x15c
github.com/cenkalti/backoff/v4.RetryNotifyWithTimer.Operation.withEmptyData.func1()
        /home/runner/go/pkg/mod/github.com/cenkalti/backoff/v4@v4.3.0/retry.go:18 +0x24
github.com/cenkalti/backoff/v4.doRetryNotify[...](0x400079fe88?, {0x7f39b41118, 0x4000a810e0}, 0x0, {0x0, 0x0?})
        /home/runner/go/pkg/mod/github.com/cenkalti/backoff/v4@v4.3.0/retry.go:88 +0xcc
github.com/cenkalti/backoff/v4.RetryNotifyWithTimer(0x1a9ec70?, {0x7f39b41118?, 0x4000a810e0?}, 0x10?, {0x0?, 0x0?})
        /home/runner/go/pkg/mod/github.com/cenkalti/backoff/v4@v4.3.0/retry.go:61 +0x5c
github.com/cenkalti/backoff/v4.RetryNotify(...)
        /home/runner/go/pkg/mod/github.com/cenkalti/backoff/v4@v4.3.0/retry.go:49
github.com/cenkalti/backoff/v4.Retry(...)
        /home/runner/go/pkg/mod/github.com/cenkalti/backoff/v4@v4.3.0/retry.go:38
github.com/netbirdio/netbird/management/client.(*GrpcClient).Sync(0x1121758?, {0x1121758, 0x4000b2e000}, 0x0?, 0x0?)
        /home/runner/work/netbird/netbird/management/client/grpc.go:133 +0x198
github.com/netbirdio/netbird/client/internal.(*Engine).receiveManagementEvents.func1()
        /home/runner/work/netbird/netbird/client/internal/engine.go:683 +0xec
created by github.com/netbirdio/netbird/client/internal.(*Engine).receiveManagementEvents in goroutine 24
        /home/runner/work/netbird/netbird/client/internal/engine.go:675 +0x5c

Unfortunately, it doesn't report the time, so this could be entirely unrelated.
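
Although the file carries no timestamp, the trace itself narrows things down: the panic is raised in `github.com/google/nftables` (`newRule`, reached via `AddRule`), and the first NetBird frame is `addIOFiltering` at `acl_linux.go:394`. A minimal sketch for pulling that summary out of a Go panic dump follows; the function name `summarize_panic` is hypothetical, and it assumes the standard Go traceback layout (function line at column 0, indented file:line on the next line).

```shell
#!/bin/sh
# Read a Go panic dump on stdin and print the panic message plus the first
# stack frame that belongs to the NetBird module.
summarize_panic() {
    awk '
        /^panic: / { print; next }
        # Function lines start at column 0; file:line lines are indented.
        /^github\.com\/netbirdio\/netbird\// && !found { fn = $0; want = 1; next }
        want { sub(/^[ \t]+/, ""); print fn; print "    at " $0; found = 1; want = 0 }
    '
}

# Usage: summarize_panic < /var/log/netbird/netbird.err
```

Whether this crash is connected to the `failed to init nftables manager` error in `client.log` is not established here; one way to check would be to compare the modification time of `netbird.err` with the timestamps in the client log.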

@christian-schlichtherle commented on GitHub (Oct 25, 2024):

BTW: The affected systems all run Ubuntu 24.04.1 LTS. From the same node, this is `/var/log/netbird/netbird.err`:

```
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x0 pc=0xa7eb24]

goroutine 273 [running]:
github.com/google/nftables.(*Conn).newRule(0x4000c22a00, 0x40006e0300, 0x0)
	/home/runner/go/pkg/mod/github.com/google/nftables@v0.2.0/rule.go:123 +0x1c4
github.com/google/nftables.(*Conn).AddRule(0x400079efb8?, 0x400068c180?)
	/home/runner/go/pkg/mod/github.com/google/nftables@v0.2.0/rule.go:192 +0x3c
github.com/netbirdio/netbird/client/firewall/nftables.(*AclManager).addIOFiltering(0x4000b3c9c0, {0x40004f8180, 0x10, 0x10}, {0xefb543, 0x3}, 0x0, 0x0, 0x1, 0x0, ...)
	/home/runner/work/netbird/netbird/client/firewall/nftables/acl_linux.go:394 +0xbcc
github.com/netbirdio/netbird/client/firewall/nftables.(*AclManager).AddPeerFiltering(0x4000b3c9c0, {0x40004f8180, 0x10, 0x10}, {0xefb543, 0x3}, 0x0, 0x0, 0x1, 0x0, ...)
	/home/runner/work/netbird/netbird/client/firewall/nftables/acl_linux.go:109 +0xd0
github.com/netbirdio/netbird/client/firewall/nftables.(*Manager).AddPeerFiltering(0x0?, {0x40004f8180?, 0x0?, 0x0?}, {0xefb543?, 0x40?}, 0x38?, 0xefcdab8967452301?, 0x1032547698badcfe?, 0x0?, ...)
	/home/runner/work/netbird/netbird/client/firewall/nftables/manager_linux.go:131 +0x150
github.com/netbirdio/netbird/client/internal/acl.(*DefaultManager).addOutRules(0x40008ed4d0, {0x40004f8180, 0x10, 0x10}, {0xefb543, 0x3}, 0x0, 0x0, {0x40004f8170, 0x9}, ...)
	/home/runner/work/netbird/netbird/client/internal/acl/manager.go:327 +0x80
github.com/netbirdio/netbird/client/internal/acl.(*DefaultManager).protoRuleToFirewallRule(0x40008ed4d0, 0x4000903f20, {0x40004f8170, 0x9})
	/home/runner/work/netbird/netbird/client/internal/acl/manager.go:277 +0x348
github.com/netbirdio/netbird/client/internal/acl.(*DefaultManager).applyPeerACLs(0x40008ed4d0, 0x4000135380)
	/home/runner/work/netbird/netbird/client/internal/acl/manager.go:143 +0x45c
github.com/netbirdio/netbird/client/internal/acl.(*DefaultManager).ApplyFiltering(0x40008ed4d0, 0x4000135380)
	/home/runner/work/netbird/netbird/client/internal/acl/manager.go:66 +0x100
github.com/netbirdio/netbird/client/internal.(*Engine).updateNetworkMap(0x4000410fc8, 0x4000135380)
	/home/runner/work/netbird/netbird/client/internal/engine.go:752 +0xa8
github.com/netbirdio/netbird/client/internal.(*Engine).handleSync(0x4000410fc8, 0x4000921600)
	/home/runner/work/netbird/netbird/client/internal/engine.go:561 +0x498
github.com/netbirdio/netbird/management/client.(*GrpcClient).receiveEvents(0x40006793b0, {0x112c310, 0x40008c5c10}, {0xb7, 0xfa, 0x1a, 0xf9, 0x2e, 0xf, 0x1e, ...}, ...)
	/home/runner/work/netbird/netbird/management/client/grpc.go:260 +0x138
github.com/netbirdio/netbird/management/client.(*GrpcClient).handleStream(0x40006793b0, {0x1121758?, 0x4000b2e000?}, {0xb7, 0xfa, 0x1a, 0xf9, 0x2e, 0xf, 0x1e, ...}, ...)
	/home/runner/work/netbird/netbird/management/client/grpc.go:159 +0x1f8
github.com/netbirdio/netbird/management/client.(*GrpcClient).Sync.func1()
	/home/runner/work/netbird/netbird/management/client/grpc.go:130 +0x15c
github.com/cenkalti/backoff/v4.RetryNotifyWithTimer.Operation.withEmptyData.func1()
	/home/runner/go/pkg/mod/github.com/cenkalti/backoff/v4@v4.3.0/retry.go:18 +0x24
github.com/cenkalti/backoff/v4.doRetryNotify[...](0x400079fe88?, {0x7f39b41118, 0x4000a810e0}, 0x0, {0x0, 0x0?})
	/home/runner/go/pkg/mod/github.com/cenkalti/backoff/v4@v4.3.0/retry.go:88 +0xcc
github.com/cenkalti/backoff/v4.RetryNotifyWithTimer(0x1a9ec70?, {0x7f39b41118?, 0x4000a810e0?}, 0x10?, {0x0?, 0x0?})
	/home/runner/go/pkg/mod/github.com/cenkalti/backoff/v4@v4.3.0/retry.go:61 +0x5c
github.com/cenkalti/backoff/v4.RetryNotify(...)
	/home/runner/go/pkg/mod/github.com/cenkalti/backoff/v4@v4.3.0/retry.go:49
github.com/cenkalti/backoff/v4.Retry(...)
	/home/runner/go/pkg/mod/github.com/cenkalti/backoff/v4@v4.3.0/retry.go:38
github.com/netbirdio/netbird/management/client.(*GrpcClient).Sync(0x1121758?, {0x1121758, 0x4000b2e000}, 0x0?, 0x0?)
	/home/runner/work/netbird/netbird/management/client/grpc.go:133 +0x198
github.com/netbirdio/netbird/client/internal.(*Engine).receiveManagementEvents.func1()
	/home/runner/work/netbird/netbird/client/internal/engine.go:683 +0xec
created by github.com/netbirdio/netbird/client/internal.(*Engine).receiveManagementEvents in goroutine 24
	/home/runner/work/netbird/netbird/client/internal/engine.go:675 +0x5c
```

Unfortunately, it doesn't report the time, so this could be entirely unrelated.
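Because the panic output carries no timestamp, one rough way to date it is the last modification time of the file, which can then be compared with the times of the observed disconnects. The following is only a sketch: it assumes GNU `stat` (as shipped with Ubuntu) and assumes the panic was the last thing written to the file. The function name `last_write_time` is purely illustrative.

```shell
# Print the last modification time of a log file.
# Defaults to the netbird.err path mentioned above; pass another path as $1.
last_write_time() {
    # %y = human-readable time of last data modification (GNU stat format)
    stat -c '%y' "${1:-/var/log/netbird/netbird.err}"
}
```

Calling `last_write_time` without arguments checks `/var/log/netbird/netbird.err`. If the service has restarted and written to the file again since the panic, the result only gives an upper bound on when the panic happened.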

@lixmal commented on GitHub (Oct 25, 2024):

@christian-schlichtherle

Can you please run

 sudo netbird service run --log-level trace --log-file console

I need to see the preceding log
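When running the service in the foreground like this, it can help to keep a copy of the console output and to prefix every line with the wall-clock time, so that a later panic trace can be matched against the log lines that precede it. This is a minimal sketch, not part of NetBird: the helper name `stamp_lines` and the output file name in the usage note are assumptions.

```shell
# Prefix every line read from stdin with a UTC timestamp.
stamp_lines() {
    # The "|| [ -n "$line" ]" part keeps a final line that lacks a trailing newline.
    while IFS= read -r line || [ -n "$line" ]; do
        printf '%s %s\n' "$(date -u +%Y-%m-%dT%H:%M:%SZ)" "$line"
    done
}
```

It could be combined with the command above as `sudo netbird service run --log-level trace --log-file console 2>&1 | stamp_lines | tee netbird-trace.log`, which redirects stderr into the pipe so that Go runtime panics are captured as well.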


@christian-schlichtherle commented on GitHub (Oct 25, 2024):

@lixmal I'm sorry, but isn't that too late already? I rebooted the failing nodes to fix the service startup. Now everything works as designed.


@lixmal commented on GitHub (Oct 25, 2024):

I see. Something must have gone wrong when creating the nftables rules, and the reboot cleared it up. Thanks


@christian-schlichtherle commented on GitHub (Oct 25, 2024):

@lixmal Maybe there is another approach to isolate the issue: We are using a huge Ansible playbook to set up Netbird, K3s and a ton of other stuff on each node. This ensures that the configuration of the nodes is as identical as possible. That being said, one of the biggest differences is the base image. For the edge devices (which are the ones that have been failing the service restart after installation), we use Ubuntu 24.04.1 LTS. Maybe you can recreate the issue on such nodes exclusively?


@christian-schlichtherle commented on GitHub (Nov 1, 2024):

I'm closing this ticket because the upgrade to 0.31.0 went well for all our nodes in all our environments, including the Ubuntu 24.04.1 LTS nodes. Kudos to the team for their dedication - you rock!


@mlsmaycon commented on GitHub (Nov 1, 2024):

Thanks for the feedback @christian-schlichtherle

Reference: SVI/netbird#1356