Windows clients (users as well as key peers) lose routes after a couple of days #1675

Open
opened 2025-11-20 06:04:34 -05:00 by saavagebueno · 15 comments

Originally created by @tim-tamm on GitHub (Mar 4, 2025).

Describe the problem

Windows clients connected to netbird can no longer reach another peer via a network route.

To Reproduce

Steps to reproduce the behavior:

  • netbird self-hosted
  • Windows Clients (experienced on WinS 2012, WinS 2019, Win10 LTSC), 0.37.0 and 0.37.1
  • Win10 = connected with user login
  • WinS = connected with setup key
  • no idea if this matters: all clients have two network routes assigned to ONE peer (172.16./16 and 192.168.10./24) ... this is still a relic from migrating networks
  • this ONE peer (lxc-container, logged in with setup key) sits in a remote LAN and has these two network adapters (+ netbird)
  • in netbird, routes are set up for both IP ranges and this ONE peer is selected as the destination
  • of course, this ONE peer is connected to netbird, too
  • however, on the affected clients only the new route (172.16.) is selected (for test purposes, I also temporarily disabled the 192 route, without any effect)
  • run an endless ping to the ONE peer (172.16.100.200)

=> after three to five days the ping starts to fail

  • I am unable to reach the ONE peer anymore: neither with the local IP, nor with its netbird hostname, nor with the netbird wt0 IP (100.80.x.x in our case)
  • however, other clients with the very same settings are still able to reach this ONE peer or any available address in that remote LAN
  • the affected clients can still reach the netbird instance and are still connected
  • whatever I tried (restarting clients, connecting / disconnecting, flushing ARP, flushing (deleting) the route table, flushing the DNS list), nothing worked; once, when I merely disconnected netbird on one of the machines, that machine was no longer reachable locally at all, not even on its physical adapter in the local network (I needed to hard reset it since no monitor is attached)
  • the only way to make it work again was to "delete" those clients (peers) in the netbird dashboard and log the clients in again with setup key or user login => consequently, the clients get a new netbird IP assigned (100.80.x.x)

...but I cannot do this workaround every couple of days ;). It just happens randomly and will not change back.
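
To pin down when exactly the route drops during a multi-day run, the endless ping from the reproduction steps could be replaced by a small logger that only records reachability changes. This is a minimal sketch, not part of the original report: the function names, the log file name, and the 5 second interval are assumptions, and the `ping` flags are the ones used on Windows and Linux.

```python
import platform
import subprocess
import time
from datetime import datetime


def ping_once(host, timeout_s=1):
    """Send one echo request; True if a reply with a TTL field came back."""
    if platform.system() == "Windows":
        cmd = ["ping", "-n", "1", "-w", str(timeout_s * 1000), host]
    else:
        cmd = ["ping", "-c", "1", "-W", str(timeout_s), host]
    result = subprocess.run(cmd, capture_output=True, text=True)
    # Windows can exit with 0 on "Destination host unreachable",
    # so the reply text is checked for a TTL field as well.
    return result.returncode == 0 and "ttl=" in result.stdout.lower()


def state_changes(samples):
    """Reduce (timestamp, ok) samples to the points where reachability flips."""
    changes = []
    previous = None
    for stamp, ok in samples:
        if ok != previous:
            changes.append((stamp, "up" if ok else "down"))
            previous = ok
    return changes


def monitor(host, interval_s=5.0, log_path="ping-monitor.log"):
    """Ping forever and append one line per up/down transition."""
    previous = None
    while True:
        ok = ping_once(host)
        if ok != previous:
            stamp = datetime.now().isoformat(timespec="seconds")
            line = f"{stamp} {host} {'up' if ok else 'down'}"
            print(line)
            with open(log_path, "a", encoding="utf-8") as log:
                log.write(line + "\n")
            previous = ok
        time.sleep(interval_s)
```

Calling `monitor("172.16.100.200")` on an affected client would then leave a timestamped "down" line that can be matched against the netbird logs.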

Expected behavior

Continuous, nearly uninterrupted ping to the destination peer / route.

Are you using NetBird Cloud?

self-host NetBird's control plane

NetBird version

0.36.5

saavagebueno added the triage-needed label 2025-11-20 06:04:34 -05:00

@tim-tamm commented on GitHub (Mar 5, 2025):

Today, the connection to the remote desktop broke on two other clients (Win 11 LTSC and again Win 10 LTSC). I ran `route print` on the Win11 machine before(!) and after re-establishing the connection (deleting the peer in the netbird dashboard and logging in again on that machine); in the following screenshot I highlight the differences.

This client has the same route assigned as mentioned in the initial issue and additionally an exit node. Neither any host on that 172 route nor any internet destination (ping 8.8.8.8 or any web page) was reachable as long as netbird was connected and the respective routes were selected.

Image
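
The before/after comparison of `route print` could also be done as text rather than by highlighting a screenshot. The following is a rough sketch under the assumption that both outputs were captured on the same machine; the helper names are made up for illustration, and whitespace is collapsed so that column alignment does not show up as a change.

```python
import difflib
import subprocess


def capture_route_table():
    """Return the current output of `route print` (Windows)."""
    result = subprocess.run(["route", "print"], capture_output=True, text=True)
    return result.stdout


def normalize(table):
    """Collapse runs of whitespace and drop empty lines."""
    lines = (" ".join(line.split()) for line in table.splitlines())
    return [line for line in lines if line]


def diff_route_tables(before, after):
    """List removed ('- ') and added ('+ ') route table lines."""
    old, new = normalize(before), normalize(after)
    matcher = difflib.SequenceMatcher(a=old, b=new, autojunk=False)
    changes = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag in ("delete", "replace"):
            changes.extend("- " + line for line in old[i1:i2])
        if tag in ("insert", "replace"):
            changes.extend("+ " + line for line in new[j1:j2])
    return changes
```

Saving `capture_route_table()` to a variable or file while the route is broken, and again after the re-login, gives the two inputs for `diff_route_tables`.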


@mlsmaycon commented on GitHub (Mar 5, 2025):

@tim-tamm, we fixed an issue in 0.36.6 that could cause the WireGuard handshake watcher to stop, leaving the peer connection status as connected but with an invalid WireGuard status.

Can you confirm that your nodes are all running 0.36.6 or newer? If so, when the issue happens again, can you generate a debug bundle with the following command:

netbird debug bundle -S

@tim-tamm commented on GitHub (Mar 5, 2025):

thank you.

clients are 0.37.0 and 0.37.1
netbird instance / netbird management: was 0.36.5 (today we upgraded to 0.37.1)
nodes* for routes (clients as well): were 0.36.3 and 0.36.4 (today we upgraded them to 0.37.1 as well)

*this is how I understand "node" - please correct me if my understanding is wrong.

I will observe it and run the command on that client / node once it occurs again.
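
For the "0.36.6 or newer" question, a numeric comparison avoids mistakes with plain string comparison (where "0.36.10" would sort before "0.36.6"). This is only an illustrative sketch; the function names are hypothetical and it assumes plain dotted numeric versions like the ones listed above.

```python
def parse_version(text):
    """Turn '0.36.6' (or 'v0.36.6') into a tuple of integers."""
    return tuple(int(part) for part in text.strip().lstrip("v").split("."))


def is_at_least(version, minimum="0.36.6"):
    """True if `version` is the same as or newer than `minimum`."""
    return parse_version(version) >= parse_version(minimum)


def outdated(versions, minimum="0.36.6"):
    """Return the entries of a {name: version} mapping that are too old."""
    return {
        name: version
        for name, version in versions.items()
        if not is_at_least(version, minimum)
    }
```

Fed with the versions from this comment, the management instance (0.36.5) and the routing peers (0.36.3 and 0.36.4) would have been flagged before the upgrade, while the 0.37.x clients would not.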


@tim-tamm commented on GitHub (Mar 5, 2025):

updating the nodes probably did the job - I hadn't thought of them as the cause of the issue.

it looks promising, because I haven't applied my workaround (re-establishing the connection) to one of the affected servers, and now it reaches the destination peer again without any change on the client side.

👍


@tim-tamm commented on GitHub (Mar 6, 2025):

now, it seems like one of the clients has some hiccups when both routes are activated, even though they are not overlapping.

Image

Image

Any idea why this happens?


@tim-tamm commented on GitHub (Mar 6, 2025):

strange ... after unticking the route to the remote desktop and then selecting it again, it seems to be stable now

Image


@mlsmaycon commented on GitHub (Mar 6, 2025):

@tim-tamm if you can reproduce the issue, could you please generate some debug logs so that we can trace the root cause? To enable them, you can run:

netbird debug log level trace

To collect them after the tests:

netbird debug bundle -S

@tim-tamm commented on GitHub (Mar 6, 2025):

okay, now another client, one that is not using an exit node, is having hiccups.

Image

after disconnecting and re-connecting:

Image

btw: yesterday I had the feeling the hiccups started when using those routes: once when trying to open an RDP session (I was still able to enter credentials, but then....) and once when opening a browser and Google (the page basically loaded, but the Google logo didn't make it during the first hiccup round).

how can I share the log files? (I wouldn't like to share them publicly)

funny (actually strange), I just noticed: the ping went back to normal once I stopped the file transfer. starting the file transfer again led to the hiccup.

Image

one minute later:

Image


@tim-tamm commented on GitHub (Mar 6, 2025):

alright. I can easily reproduce the same issue on another machine (one which had the hiccups this morning, using the exit node)

Image

...and a bit delayed, google (public internet / exit node) is affected, too.

Image

...it gets slower before it breaks :-/


@tim-tamm commented on GitHub (Mar 6, 2025):

okay, in order to help locate the root cause, I tried to copy the same file to another peer which is a normal client and NOT located in that remote LAN where this one route leads to. and again, the actual destination AND the route are affected the same way as before.

Image

I am starting to wonder about the performance of our VPS where netbird is hosted. RDP has been working fine almost all day, but when transferring larger files it breaks. btw: that hadn't been an issue until today (referring to the fact that we updated the installation as well as the route peers yesterday). furthermore, this wouldn't explain the initial hiccups this morning, because no file transfer was performed.


@tim-tamm commented on GitHub (Mar 6, 2025):

trying the same on the other machine which uses the exit node, the same happens to the one route connection, but not to the exit node connection (although this gets temporarily slower too, it goes back to normal again).

Image

Image


@tim-tamm commented on GitHub (Mar 6, 2025):

sent the logs via e-mail!


@tim-tamm commented on GitHub (Mar 7, 2025):

well, we have now analysed a lot ... checked the system resources of the VPS, removed the exit node, checked the peer / gateway in the remote LAN => we were able to reproduce the issue over and over again and observed the weirdest effects, with connection breaks alternating between RDP and file transfer.

the best thing: we captured all this in a 1:05-hour screen recording.

please respond to the e-mail I sent to support referring to this issue and I am more than happy to share that video.


@tim-tamm commented on GitHub (Mar 7, 2025):

btw: those connection breaks have only appeared since we upgraded the netbird host instance and the peers (exit node peer and peer for the route to the remote LAN) from 0.36.5 to 0.37.1 - but with 0.36.5 we had the handshake losses mentioned above, which were only fixed in 0.36.6

  • handshake loss seems to be fixed (we need to wait a couple more days, but it looks promising because the broken connection immediately worked again)
  • now, the connection breaks when transferring bigger files; RDP generally works fine

@Gauss23 commented on GitHub (Mar 8, 2025):

I just saw that you created another issue for 0.37.2, but it looks like your problem has existed for a bit longer.

How are the nodes connected? P2P or relayed?

If relayed: does the VPS doing the relay have a reliable network connection? Maybe you are just overwhelming it.

Run a ping against the relay server's public IP while copying a file.
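
To compare the relay ping at idle with the ping during a file copy, the two captured `ping` outputs could be summarized with something like the sketch below. It is an assumption-laden helper, not a netbird tool: the function name is made up, the number of sent packets has to be supplied by the caller, and the regular expression only covers the `time=12ms` / `time<1ms` / `time=12.3 ms` reply formats.

```python
import re
import statistics

# Matches the round-trip time in a single ping reply line.
RTT_PATTERN = re.compile(r"time[=<]\s*(\d+(?:\.\d+)?)\s*ms", re.IGNORECASE)


def summarize_ping(output, sent):
    """Summarize replies, loss and latency from captured ping output."""
    rtts = [float(match.group(1)) for match in RTT_PATTERN.finditer(output)]
    lost = sent - len(rtts)
    return {
        "replies": len(rtts),
        "loss_pct": round(100 * lost / sent, 1) if sent else 0.0,
        "median_ms": statistics.median(rtts) if rtts else None,
        "max_ms": max(rtts) if rtts else None,
    }
```

If the summary taken during the copy shows clearly higher loss or latency towards the relay's public IP than the idle one, that would point at the VPS or its uplink rather than at the routes themselves.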


Reference: SVI/netbird#1675