Peer link is being dropped #1521

@hadleyrich commented on GitHub (Dec 28, 2024):

Interesting timing. I've been seeing some more link instability in the last few days. Since 0.35 maybe. Requiring a restart of some peers to reconnect. Sometimes they think they are connected but are not passing traffic.

I did have some stability issues back around pre-0.20 or so and required restarting clients. Then things have been quite stable for the last many months.

I know this is very vague and doesn't provide useful information in and of itself, but I just wanted to add my anecdotal experience: instability like the current one hadn't shown up in my environment for quite some time.

@hadleyrich commented on GitHub (Dec 28, 2024):

Logs from a peer at the time it dropped off:

```
2024-12-29T11:34:33+13:00 WARN client/internal/peer/guard/ice_monitor.go:55: Failed to check ICE changes: wait for gathering timed out
2024-12-29T11:39:33+13:00 WARN client/internal/peer/guard/ice_monitor.go:55: Failed to check ICE changes: wait for gathering timed out
2024-12-29T11:44:33+13:00 WARN client/internal/peer/guard/ice_monitor.go:55: Failed to check ICE changes: wait for gathering timed out
2024-12-29T11:44:43+13:00 INFO [peer: Hd69vyaKlgZUMwxWQoF5yHtIDF4krAVApegjkoQc52I=] client/internal/peer/worker_relay.go:61: Relay is not supported by remote peer
2024-12-29T11:44:44+13:00 INFO [peer: Hd69vyaKlgZUMwxWQoF5yHtIDF4krAVApegjkoQc52I=] client/internal/peer/worker_relay.go:61: Relay is not supported by remote peer
2024-12-29T11:44:47+13:00 INFO [peer: Hd69vyaKlgZUMwxWQoF5yHtIDF4krAVApegjkoQc52I=] client/internal/peer/worker_relay.go:61: Relay is not supported by remote peer
2024-12-29T11:44:49+13:00 INFO [peer: Hd69vyaKlgZUMwxWQoF5yHtIDF4krAVApegjkoQc52I=] client/internal/peer/worker_relay.go:61: Relay is not supported by remote peer
2024-12-29T11:44:54+13:00 INFO [peer: Hd69vyaKlgZUMwxWQoF5yHtIDF4krAVApegjkoQc52I=] client/internal/peer/worker_relay.go:61: Relay is not supported by remote peer
2024-12-29T11:45:06+13:00 INFO [peer: Hd69vyaKlgZUMwxWQoF5yHtIDF4krAVApegjkoQc52I=] client/internal/peer/worker_relay.go:61: Relay is not supported by remote peer
2024-12-29T11:49:33+13:00 WARN client/internal/peer/guard/ice_monitor.go:55: Failed to check ICE changes: wait for gathering timed out
2024-12-29T11:54:33+13:00 WARN client/internal/peer/guard/ice_monitor.go:55: Failed to check ICE changes: wait for gathering timed out
```

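The timing pattern in the log excerpt above can be checked mechanically. The following is a minimal Python sketch, not part of the NetBird tooling, that groups log lines by message and reports the gap between repeats. The regular expression is an assumption derived only from the lines quoted in this thread, and `summarize` and `SAMPLE` are illustrative names.

```python
import re
from datetime import datetime

# Assumed layout, inferred from the quoted lines only:
# <ISO timestamp> <LEVEL> [optional "[peer: KEY]"] <file.go:line>: <message>
LINE_RE = re.compile(
    r"^(?P<ts>\S+)\s+(?P<level>[A-Z]+)\s+"
    r"(?:\[peer: (?P<peer>[^\]]+)\]\s+)?"
    r"(?P<src>\S+\.go:\d+):\s+(?P<msg>.*)$"
)


def summarize(log_text):
    """Group lines by message; return count and gaps (seconds) between repeats."""
    times = {}
    for line in log_text.splitlines():
        match = LINE_RE.match(line.strip())
        if not match:
            continue  # skip anything that does not look like a client log line
        stamp = datetime.fromisoformat(match["ts"])
        times.setdefault(match["msg"], []).append(stamp)

    summary = {}
    for msg, stamps in times.items():
        gaps = [int((b - a).total_seconds()) for a, b in zip(stamps, stamps[1:])]
        summary[msg] = {"count": len(stamps), "gaps_s": gaps}
    return summary


# A few lines copied from the excerpt above.
SAMPLE = """\
2024-12-29T11:34:33+13:00 WARN client/internal/peer/guard/ice_monitor.go:55: Failed to check ICE changes: wait for gathering timed out
2024-12-29T11:39:33+13:00 WARN client/internal/peer/guard/ice_monitor.go:55: Failed to check ICE changes: wait for gathering timed out
2024-12-29T11:44:33+13:00 WARN client/internal/peer/guard/ice_monitor.go:55: Failed to check ICE changes: wait for gathering timed out
2024-12-29T11:44:43+13:00 INFO [peer: Hd69vyaKlgZUMwxWQoF5yHtIDF4krAVApegjkoQc52I=] client/internal/peer/worker_relay.go:61: Relay is not supported by remote peer
2024-12-29T11:44:44+13:00 INFO [peer: Hd69vyaKlgZUMwxWQoF5yHtIDF4krAVApegjkoQc52I=] client/internal/peer/worker_relay.go:61: Relay is not supported by remote peer
"""

for message, info in summarize(SAMPLE).items():
    print(info["count"], info["gaps_s"], message)
```

On the quoted lines, the ICE warning repeats at a fixed interval while the relay message clusters within seconds, which may help when comparing logs from several peers.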

@rihards-simanovics commented on GitHub (Dec 29, 2024):

> Interesting timing. I've been seeing some more link instability in the last few days. Since 0.35 maybe. Requiring a restart of some peers to reconnect. Sometimes they think they are connected but are not passing traffic.

Hey @hadleyrich, I agree that's pretty much what I've been battling with for the past couple of weeks. I have a load balancer which uses the VPN to connect to various other VPS peers so that we can have a simple HTTP reverse proxy on port :80. As of 0.34.0, the load balancer drops the connection to the other VPS peers without retrying to connect, needing a manual restart of the Netbird client.

> I did have some stability issues back around pre-0.20 or so and required restarting clients. Then things have been quite stable for the last many months.

That's pretty much my experience. I joined at around version 0.27.0, I think. I fully converted from a traditional VPN by around 0.28.0, and things were relatively stable, so I stayed. That said, I think they need separate nightly and stable releases at this point; I agree with @bmansfie that, running this in production, I need stability before anything else. Yesterday I had a 2-hour downtime because the 0.29.4 client did something when I was applying the access policy and took down all external ports, which absolutely wrecked my DNS server and all DNS records for a good 4 hours; thankfully, nowadays, it only takes around 2 hours to re-propagate. That said, I'd like for that not to happen again...

> I know this is very vague and doesn't provide helpful information in itself, but I just wanted to add in my anecdotal experience that the current instability hasn't shown up in my environment for quite some time.

I wouldn't really call it "anecdotal". I have a monthly maintenance window during which I upgrade all of the packages on the OS, so when I do eventually upgrade, I may jump many minor and patch releases. Because things were more or less stable, I had no issues upgrading to the latest. Right now, all of my servers are sitting on a downgraded version of 0.33.0 as it seems to be the last stable release, at least for the previous 24, before it was 0.29.4. That said, after yesterday, I am fearful of all versions 😅.

@hadleyrich commented on GitHub (Dec 30, 2024):

I just noticed on a peer that had lost communication with another peer that "Last WireGuard handshake" was hours old while "Last connection update" was only minutes old, so it certainly points to something at the WG level getting out of sync.

I think you're probably right; I probably saw stability issues reappearing around 0.34. I had become quite (probably overly) comfortable with the level of stability over the past months and had been happily tracking the latest releases. I don't yet run NetBird in a production setting; this is more of a long-term stability test on my homelab "production" services before deploying to real customer-facing workloads.

@freebs65 commented on GitHub (Dec 30, 2024):

Hmm, it's funny: I have one machine that drops, and it's a Windows Server 2022. I don't see other clients stop. A simple restart fixes it, but I have to do it every day. My Linux clients and an older Windows SBS server all seem to be OK. I also have Windows 11 clients, which again seem fine; even my Arch desktop is fine. Very odd.

@hadleyrich commented on GitHub (Dec 31, 2024):

Another data point: a long-running ping in screen, keeping traffic going over the link, appears to keep the peer connected.

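The keepalive workaround described above could be scripted instead of left running in `screen`. This is a minimal Python sketch, assuming a Linux host with the iputils `ping` (`-c` for count, `-W` for the reply timeout in seconds). The address `100.64.0.2` is a placeholder for the remote peer's NetBird IP, and `ping_once` and `keepalive` are illustrative names. It only mirrors the workaround and does nothing about the underlying bug.

```python
import subprocess
import time


def ping_once(address, timeout_s=2, run=subprocess.run):
    """Send one ICMP echo request; return True if the peer answered."""
    # Linux iputils flags: -c <count>, -W <reply timeout in seconds>.
    cmd = ["ping", "-c", "1", "-W", str(timeout_s), address]
    result = run(cmd, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
    return result.returncode == 0


def keepalive(address, interval_s=20, rounds=None, run=subprocess.run, sleep=time.sleep):
    """Ping `address` every `interval_s` seconds; return the number of failed pings.

    With rounds=None the loop runs until the process is interrupted.
    """
    failures = 0
    done = 0
    while rounds is None or done < rounds:
        if not ping_once(address, run=run):
            failures += 1
        done += 1
        sleep(interval_s)
    return failures


# Example (placeholder address, runs until interrupted):
# keepalive("100.64.0.2")
```

The `run` and `sleep` parameters exist only so the loop can be exercised without sending real traffic; the 20-second interval is an arbitrary choice, not a value taken from the thread.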

@rihards-simanovics commented on GitHub (Dec 31, 2024):

It seems like the issue is with the WireGuard handshake. For instance, my Windows 11 PC seemingly struggles to connect to other Linux Server Peers despite everything running the latest Netbird version, in this case, Netbird 0.35.2. One of my Load Balancer servers running Ubuntu 22.04 just refuses to keep the connection to other Linux servers for longer than 5 minutes before dying and needing to be restarted. I don't know what I'm doing wrong, but I always update the management server first and only then move on to the client nodes, first on the Linux servers and then on devices such as PCs/Laptops/Phones.

@rihards-simanovics commented on GitHub (Jan 1, 2025):

Hi Everyone, happy New Year!

Hey @mlsmaycon, sorry to ping you directly. Would you like me to run the same steps as listed last time? I will email the logs so you have a better picture. I am approaching a maintenance window for all our org servers and will be able to run a full debugging trace like last time. Also, I need to know if the logging persists across client updates or whether I need to run it first on the old version and then after the upgrade.

@hadleyrich commented on GitHub (Jan 1, 2025):

I think (in my case at least) this appears to be something triggered by, or related to, relaying.

Previously I was not running the relay in my setup, only coturn. The peer I was having the most trouble with was connecting over relay.

Adding in the new relay service appears to have made that peer more stable for the last 12 hours or so.

@rihards-simanovics commented on GitHub (Jan 1, 2025):

> Previously I was not running the relay in my set up and only running coturn. The peer I was having most trouble with was connecting over relay.

Hmm, interesting. In my case, I am already running a new relay service. Strangely, some client versions seem to overuse the relay, and some underuse it; since 0.35.0, the client seems to bypass it altogether and go straight for P2P.

Okay, you know what? It's late at night here in the UK, so let me try upgrading and getting at least some logs.

@rihards-simanovics commented on GitHub (Jan 2, 2025):

Ok, without looking at the trace logs generated by the client, my anecdotal research log shows this:

```
4:54 UK7 client upgraded from 0.33.0 to 0.35.2
4:58 UK7 sites show as status 503 down on UK1 - which is still using 0.33.0 client
5:00 UK7 client downgraded back to 0.33.0 - and I am waiting for all sites to recover, which takes around 5 seconds
5:04 UK1 client upgraded to 0.35.2 from 0.33.0
5:08 UK3 websites are shown as down despite only the UK1 client being up to date.
.. some time here I downgraded the UK1 client to 0.33.0
5:19 UK1 client again upgraded to 0.35.2 from 0.33.0
5:23 UK2,3,5 sites went down - they use the 0.33.0 client; UK7, however, is still up.
5:25 UK1 client restarted. Sites are going back up
5:37 UK1 client downgraded back to 0.33.0; things are back to normal.
```

@mlsmaycon I've collected a full trace from UK1 and UK7 using the method listed in https://github.com/netbirdio/netbird/issues/3112#issuecomment-2562361089 and am now parsing it to see if there is anything obvious. I will send it to the support email once I've reviewed everything.

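One thing the research log above makes easy to check is how long after each upgrade the first outage was noted. A small Python sketch follows, using only the timestamps from that log. The pairing of each upgrade with the next "down" entry is an interpretation rather than something the log states, and the times are hand-noted, so the result is coarse.

```python
from datetime import datetime

# (upgraded host, upgrade time, first outage noted afterwards), read off the log above.
EVENTS = [
    ("UK7", "4:54", "4:58"),
    ("UK1", "5:04", "5:08"),
    ("UK1", "5:19", "5:23"),
]


def minutes_between(start, end):
    """Whole minutes between two H:MM clock times on the same day."""
    fmt = "%H:%M"
    delta = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
    return int(delta.total_seconds() // 60)


delays = [(host, minutes_between(up, down)) for host, up, down in EVENTS]
for host, delay in delays:
    print(f"{host}: outage noted {delay} min after upgrade to 0.35.2")
```

All three pairs come out at the same delay, which is at least consistent with a timer-driven failure rather than a random one.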

@fiikra commented on GitHub (Jan 7, 2025):

We're encountering the same issue with our Netbird instance. Version 0.33 was incredibly stable for over 90 days, maintaining continuous communication with our peer. However, after upgrading directly to version 0.35, we observed the same problem that @hadleyrich mentioned starting from version 0.34. Although the peer is online and connected to our server, there is no communication. It seems the bug might have been introduced in that version. We've debugged the issue and found a temporary workaround: disabling and re-enabling the policy in access control, which restores communication. I'm happy to provide more details to help resolve this.

@TomSipacom commented on GitHub (Jan 14, 2025):

We have the same issue with some peers. @fiikra, I have tested your workaround, but it does not work for me.
When I run netbird status --detail and look at the peer I cannot reach, it shows as connected but has no Last WireGuard handshake:

Status: Connected
-- detail --
Connection type: Relayed
ICE candidate (Local/Remote): -/-
ICE candidate endpoints (Local/Remote): -/-
Relay server address: rel://vpn.example.com:33080 <-- here stands my real domain, obviously
Last connection update: 25 seconds ago
Last WireGuard handshake: -
Transfer status (received/sent) 0 B/740 B
Quantum resistance: false
Routes: -
Networks: -
Latency: 0s

I also have other peers that work just fine:

Status: Connected
-- detail --
Connection type: Relayed
ICE candidate (Local/Remote): -/-
ICE candidate endpoints (Local/Remote): -/-
Relay server address: rel://vpn.example.com:33080 <-- here stands my real domain, obviously
Last connection update: 9 minutes, 33 seconds ago
Last WireGuard handshake: 1 minute, 27 seconds ago
Transfer status (received/sent) 22.0 KiB/16.7 KiB
Quantum resistance: false
Routes: -
Networks: -
Latency: 0s

It worked before, and all peers mentioned are using version 0.35.2.
When I reinstall my client with version 0.27.3, it works.
My own peer is installed on Windows; the two peers from the example are Linux.

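The symptom shown above (status "Connected" but no WireGuard handshake) can be detected from the status text itself. This is a minimal Python sketch that parses a single peer block as quoted in this comment. It assumes only the field names visible above; the full `netbird status --detail` output contains more than these blocks, and `parse_block`, `looks_stuck`, `BROKEN` and `HEALTHY` are illustrative names.

```python
def parse_block(text):
    """Turn the 'Key: value' lines of one peer block into a dict."""
    fields = {}
    for line in text.splitlines():
        line = line.strip()
        if line.startswith("Transfer status"):
            # In the output quoted above this line has no colon after the label.
            fields["Transfer status"] = line.split(")", 1)[1].strip()
            continue
        key, sep, value = line.partition(":")
        if sep:
            fields[key.strip()] = value.strip()
    return fields


def looks_stuck(fields):
    """True when the peer reports Connected but shows no WireGuard handshake.

    A missing handshake field is treated the same as '-'.
    """
    return (
        fields.get("Status") == "Connected"
        and fields.get("Last WireGuard handshake", "-") == "-"
    )


# The two blocks quoted above, without the domain annotation.
BROKEN = """\
Status: Connected
-- detail --
Connection type: Relayed
Relay server address: rel://vpn.example.com:33080
Last connection update: 25 seconds ago
Last WireGuard handshake: -
Transfer status (received/sent) 0 B/740 B
"""

HEALTHY = """\
Status: Connected
-- detail --
Connection type: Relayed
Relay server address: rel://vpn.example.com:33080
Last connection update: 9 minutes, 33 seconds ago
Last WireGuard handshake: 1 minute, 27 seconds ago
Transfer status (received/sent) 22.0 KiB/16.7 KiB
"""

print(looks_stuck(parse_block(BROKEN)), looks_stuck(parse_block(HEALTHY)))
```

A check like this could run periodically and restart the client or raise an alert, though whether a restart is an acceptable remedy depends on the environment.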

@rihards-simanovics commented on GitHub (Jan 18, 2025):

Hey @mlsmaycon, I am beginning to get really frustrated by this. We are getting new releases that introduce more features, but none address the issue of the peer link being dropped. I've sent an email to support with the attached debug trace logs for two servers that keep dropping links after upgrading. Has anyone looked at it? I really don't want to be that kind of person, but at this point, I'm getting frustrated enough that I am this 🤏🏻 far away from trying and perhaps even switching to headscale.

@the-project-group commented on GitHub (Jan 20, 2025):

Can you guys check if you have "redundant" ACLs like:

- ACL1: Allow all Peers to ICMP > Peer01
- ACL2: Allow the monitoring Peers to ICMP > Peer01

Toggling one of the ACLs off / on brings connectivity back for me:

![Image](https://github.com/user-attachments/assets/496c2d0d-4998-4bc4-9495-dbb4b99d4f41)

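To look for "redundant" ACLs of the kind described above across a larger rule set, something like the following Python sketch could help. The policy structure is hypothetical and hand-written for illustration; it is not the NetBird API schema, and `ACL3` is an invented extra entry. "Redundant" is interpreted here as two policies with the same destination and protocol whose source groups overlap, with the `All` group overlapping everything.

```python
from itertools import combinations

# Hypothetical in-memory description of policies (not the NetBird API schema).
POLICIES = [
    {"name": "ACL1", "sources": {"All"}, "destination": "Peer01", "protocol": "icmp"},
    {"name": "ACL2", "sources": {"monitoring"}, "destination": "Peer01", "protocol": "icmp"},
    {"name": "ACL3", "sources": {"admins"}, "destination": "Peer02", "protocol": "tcp"},
]


def overlapping(policies, all_group="All"):
    """Return name pairs of policies that target the same thing from overlapping sources."""
    pairs = []
    for a, b in combinations(policies, 2):
        same_target = (a["destination"], a["protocol"]) == (b["destination"], b["protocol"])
        shared_sources = (
            all_group in a["sources"]
            or all_group in b["sources"]
            or bool(a["sources"] & b["sources"])
        )
        if same_target and shared_sources:
            pairs.append((a["name"], b["name"]))
    return pairs


print(overlapping(POLICIES))
```

Any pair it reports would be a candidate for the toggle test described in the comment above.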

@rihards-simanovics commented on GitHub (Jan 20, 2025):

I just spoke with someone from the headscale community who has used Netbird before. They suggested disabling the rosenpass and rosenpass-permissive modes on the affected clients. After doing that and upgrading all clients, the issue appears to have disappeared, though I will keep monitoring it. I assume the problem is somewhere in the rotation of the rosenpass keys, as the peers drop the connection almost exactly 5 minutes after establishing a link.

@mlsmaycon commented on GitHub (Jan 20, 2025):

@rihards-simanovics did you have peers with different versions of NetBird and rosenpass enabled? After the upgrade, did you enable rosenpass again?

@rihards-simanovics commented on GitHub (Jan 20, 2025):

> @rihards-simanovics did you have peers with different versions of NetBird and rosenpass enabled? After the upgrade, did you enable rosenpass again?

Hi @mlsmaycon, thanks for replying. No, the versions were precisely the same across all peers when the issue occurred. To better illustrate the environment, all 9 peers (Ubuntu 22.04/24.04 servers):

- were upgraded to either 0.36.2 or .3,
- were upgraded directly from 0.33.0 within roughly 2 minutes of one another,
- had rosenpass and rosenpass-permissive flags set to true,
- were running Ubuntu 24.04 - except the load balancer, which ran Ubuntu 22.04,
- have static IPv4 and IPv6.

When the peer dropped the connection, which happened roughly every 5 minutes, I restarted all of the servers so that if there was anything strange with the OS, it would have been accounted for. However, after around 10 minutes of things going down, I had to revert back to 0.33.0.

By the way, here is a quick update on the stability after disabling rosenpass and rosenpass-permissive mode: all 9 peers have been running 0.36.3 since I posted [this](https://github.com/netbirdio/netbird/issues/3121#issuecomment-2602788952) comment, and nothing has dropped the connection yet.

@drixtol commented on GitHub (Jan 22, 2025):

Piping in to state that my org is also having this same issue. We do not have rosenpass options enabled.
All Windows clients, all on versions >0.34.1. Only a subset of users are having issues, and only users who have authentication; Expiration disabled clients have not had any issues.
Server version is 0.35.2; we have upgraded several times trying to resolve this issue.
Typically the affected clients are the ones coming back from idle, but it can also be a fresh connection. The WireGuard handshake never completes.
If I change the peer group, it immediately resolves the handshake issue.

@Bonnevie commented on GitHub (Feb 3, 2025):

I am also affected by this issue (as far as I can tell), with netbird status -d sporadically showing no recent WireGuard handshake. The issue is with a specific SSH-enabled server only; the other peers in the network always seem to have recent handshakes listed. The workaround of cycling the access control policy seems to work.

My computer is on the bottom, the problematic peer up top. I have tried various versions though, and a colleague on 35.2 can connect without issue.

![Image](https://github.com/user-attachments/assets/d2823461-0a30-4a22-bb36-a09a3b683cde)

@ugurtam commented on GitHub (Feb 14, 2025):

Hi,
Same issue here, with no rosenpass or rosenpass-permissive activated. My practical fix is to disable and re-enable the policy in the problem group, but we need a real solution; we can't do that every day.

@SuperKali commented on GitHub (May 22, 2025):

I think I have the same issue. Peer 1 is the node from which I access the resources of Peer 2. However, if I reboot Peer 2 for maintenance, Peer 1 can no longer access the subnets on Peer 2 unless I also reboot Peer 1. Only after restarting Peer 1 do the resources on Peer 2 become accessible again.

@pscriptos commented on GitHub (Jun 10, 2025):

> Peer 1 can no longer access the subnets on Peer 2 unless I also reboot Peer 1. Only after restarting Peer 1 do the resources on Peer 2 become accessible again.

I have exactly the same problem. Thanks for the tip with peer number 1, I have just restarted it and now access to the resource is working again for the time being.

@nazarewk commented on GitHub (Jun 10, 2025):

@pscriptos @SuperKali Can you update to 0.46.0, watch out for the issue/further new versions and report back with results after some time?

We have identified some form of race condition that was partially fixed in https://github.com/netbirdio/netbird/pull/3910 and is still being worked on in https://github.com/netbirdio/netbird/pull/3929
