Pings disappear from some peers to others after one week #1577

Closed
opened 2025-11-20 05:33:08 -05:00 by saavagebueno · 25 comments
Owner

Originally created by @netandreus on GitHub (Jan 28, 2025).

Originally assigned to: @pappz on GitHub.

Describe the problem

I have 2 peers:

  • "node-3" with netbird address 100.81.65.41
  • "node-2" with netbird address 100.81.94.114
    supporting HA-routes to multiple vlans.

Also I have:

  • "uk-node-1" with netbird address 100.81.73.30 and
  • ELK VM with HeartBeat and NetBird client installed with address 100.81.167.156.

My problem is that after a week these nodes (node-2 and node-3) lose their connection to the ELK node, and the ELK node loses its connection to node-2 and node-3: I can't ping them by their NetBird IP addresses. At the same time, I can still ping other peers from the ELK peer, and I can ping peers other than ELK from node-2 and node-3.

To Reproduce

Steps to reproduce the behavior:

  1. Run two peers in HA mode.
  2. Run a third peer in another location.
  3. Wait for at least one week.
  4. The connection between the peers is lost.

Expected behavior

Stable connection between peers.

Are you using NetBird Cloud?

No, I use self-hosted netbird.

NetBird version

0.36.3

NetBird status -dA output:

When I SSH into node-3, I see this:

 elk-huk-internal.netbird.selfhosted:
  NetBird IP: 100.81.167.156
  Public key: P0Xd+rb5EjqfaFXIL/KuQ0yGKHT4qTa99Mz4ABrshRA=
  Status: Connected
  -- detail --
  Connection type: Relayed
  ICE candidate (Local/Remote): -/-
  ICE candidate endpoints (Local/Remote): -/-
  Relay server address: rels://gateway.xxx.com:443
  Last connection update: 1 day, 7 hours ago <<<<<<<<<<<<<<<<<
  Last WireGuard handshake: 1 day, 7 hours ago <<<<<<<<<<<<<<<<<
  Transfer status (received/sent) 110.7 MiB/838.7 MiB
  Quantum resistance: false
  Routes: -
  Networks: -
  Latency: 129.747422ms

And the same from node-2:

 elk-huk-internal.netbird.selfhosted:
  NetBird IP: 100.81.167.156
  Public key: P0Xd+rb5EjqfaFXIL/KuQ0yGKHT4qTa99Mz4ABrshRA=
  Status: Connected
  -- detail --
  Connection type: Relayed
  ICE candidate (Local/Remote): -/-
  ICE candidate endpoints (Local/Remote): -/-
  Relay server address: rels://gateway.xxx.com:443
  Last connection update: 1 day, 7 hours ago <<<<<<<<<<<<<<<<<
  Last WireGuard handshake: 1 day, 7 hours ago <<<<<<<<<<<<<<<<<
  Transfer status (received/sent) 579.3 KiB/6.3 MiB
  Quantum resistance: false
  Routes: -
  Networks: -
  Latency: 130.665153ms

Workaround

After I run:

netbird down
netbird up

the WireGuard connection is re-established.
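For anyone who wants to spot this state before pings start failing, here is a minimal sketch that scans `netbird status -d` output for peers whose last WireGuard handshake is hours or days old. It assumes the text layout shown in the status output above stays the same; `stale_peers` is a hypothetical helper name, not part of NetBird, and the hour/day threshold is an arbitrary choice for illustration.

```sh
#!/bin/sh
# stale_peers: read `netbird status -d` text on stdin and print the name of
# every peer whose "Last WireGuard handshake" age is given in hours or days.
# Assumption: each peer block starts with a header line like " name:" and
# contains a "Last WireGuard handshake:" line, as in the output above.
stale_peers() {
  awk '
    # A peer block starts with a header line such as " name.netbird.selfhosted:"
    /^ *[^ :]+: *$/ { peer = $1; sub(/:$/, "", peer); next }
    # Report the current peer when its handshake age mentions hours or days
    /Last WireGuard handshake:/ && /hour|day/ { print peer }
  '
}
```

One possible use is `netbird status -d | stale_peers`. If the ELK peer shows up in the result, the `netbird down` / `netbird up` workaround could be applied on that node.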

Related issues

  • Peer link is being dropped (https://github.com/netbirdio/netbird/issues/3121#top)

Can you please fix this?

saavagebueno added the clientrelay labels 2025-11-20 05:33:08 -05:00

@pappz commented on GitHub (Jan 28, 2025):

Hello @netandreus

Could you reproduce the issue with verbose logging enabled and send me the logs?


@pappz commented on GitHub (Jan 28, 2025):

@netandreus Could you send me the public keys of node-2 and node-3? The connection mechanism has some logic that depends on the public keys of the peers. If we know which branch of the algorithm is running on your ELK side and on the node side, we may get closer to the root cause of the issue.


@netandreus commented on GitHub (Jan 28, 2025):

@pappz sure, how can I fetch the public keys of my nodes?


@pappz commented on GitHub (Jan 28, 2025):

> @pappz sure, how can I fetch the public keys of my nodes?

The netbird status -d command prints them. Look for the "Public key" field, just like in the example from your original report:

 elk-huk-internal.netbird.selfhosted:
  NetBird IP: 100.81.167.156
  Public key: P0Xd+rb5EjqfaFXIL/KuQ0yGKHT4qTa99Mz4ABrshRA=

@netandreus commented on GitHub (Jan 28, 2025):

@pappz here they are:

node-2:

 node-2.netbird.selfhosted:
  NetBird IP: 100.81.94.114
  Public key: A0/k9FWRkF+JspDjOVIhk0YaaRDvvTZo3C+kEL0feR0=
  Status: Connected
  -- detail --
  Connection type: P2P
  ICE candidate (Local/Remote): host/prflx
  ICE candidate endpoints (Local/Remote): 10.30.200.47:51820/10.30.200.28:51820
  Relay server address: rels://gateway.xxx.com:443
  Last connection update: 9 hours, 45 minutes ago
  Last WireGuard handshake: 1 minute, 49 seconds ago
  Transfer status (received/sent) 97.1 KiB/40.5 KiB
  Quantum resistance: false
  Routes: -
  Networks: -
  Latency: 493.405µs

node-3:

 node-3.netbird.selfhosted:
  NetBird IP: 100.81.65.41
  Public key: 9i5W38NvBSXk7oK+v0KeaXn0csEn5AOXIG2mg3TwOAo=
  Status: Connected
  -- detail --
  Connection type: P2P
  ICE candidate (Local/Remote): host/host
  ICE candidate endpoints (Local/Remote): 172.30.1.97:51820/10.30.200.47:51820
  Relay server address: rels://gateway.xxx.com:443
  Last connection update: 9 hours, 47 minutes ago
  Last WireGuard handshake: 1 minute, 48 seconds ago
  Transfer status (received/sent) 25.2 KiB/93.4 KiB
  Quantum resistance: false
  Routes: -
  Networks: -
  Latency: 502.284µs

@pappz commented on GitHub (Jan 28, 2025):

Great, thank you!

Do you know which restart solved the issue:

  • Restarting the Netbird agent on the ELK side
  • Restarting the Netbird agent on the node side
  • Or did it not matter?

@netandreus commented on GitHub (Jan 28, 2025):

@pappz It does not matter what I do on the ELK side; only restarting the NetBird agent (netbird stop && netbird start) on the node side affects the situation.


@netandreus commented on GitHub (Jan 30, 2025):

Good morning, @pappz! Is there anything I can do on my side to help you?


@pappz commented on GitHub (Jan 30, 2025):

Hello @netandreus,
Thank you for the information. I found a potential bug and have prepared a fix for it. You can track the changes here: https://github.com/netbirdio/netbird/pull/3250. Nevertheless, I am not sure this is the root cause of your issue, because I haven't been able to reproduce it on my machine.

How easy is it to reproduce the issue? Could you enable verbose logging on your agent and collect the logs?


@netandreus commented on GitHub (Jan 30, 2025):

Thank you for your efforts, @pappz!
I can only disable the cron restart and wait for one week. What exactly should I do if (or when) it occurs?


@pappz commented on GitHub (Jan 30, 2025):

With these commands, you can set the logging level:
netbird debug log level debug
or
netbird debug log level verbose
The verbose level may log too much data to your disk over a week, so choose the debug level based on your preference. When an issue occurs, please collect the log files and send them to me.


@pappz commented on GitHub (Jan 30, 2025):

@netandreus
When you ping the unreachable server, what error message do you see in the ping output? Is it something like this, or just a simple Destination Host Unreachable?

ubuntu@machine1:~$ ping 100.108.186.247
PING 100.108.186.247 (100.108.186.247) 56(84) bytes of data.
From 100.108.200.99 icmp_seq=1 Destination Host Unreachable
ping: sendmsg: Required key not available
From 100.108.200.99 icmp_seq=2 Destination Host Unreachable
ping: sendmsg: Required key not available
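To answer that question quickly on several nodes, a small filter along these lines could tell the two cases apart. It is only a sketch: `classify_ping` and its result labels are hypothetical names, and it does nothing more than look for the two message strings quoted above in captured ping output.

```sh
#!/bin/sh
# classify_ping: read captured ping output on stdin and print which of the
# two cases from the question above it matches. The labels are illustrative.
classify_ping() {
  text=$(cat)
  case $text in
    *'Required key not available'*) echo "required-key-not-available" ;;
    *'Destination Host Unreachable'*) echo "host-unreachable-only" ;;
    *) echo "other" ;;
  esac
}
```

For example, `ping -c 3 100.81.167.156 2>&1 | classify_ping` run on node-2 or node-3 would report which message appears for the ELK peer.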

@pappz commented on GitHub (Jan 30, 2025):

I am working on different logic that can handle the possible anomalies better. Would you be able to run tests with a custom build that includes the patches?


@netandreus commented on GitHub (Jan 31, 2025):

@pappz yes, I can. Should I deploy the custom build on both nodes? How can I roll back if something goes wrong? And how can I collect the logs?


@pappz commented on GitHub (Jan 31, 2025):

@netandreus

I prepared the test version. Here is the package for Linux: https://drive.google.com/file/d/1R1VlSroJi0QxHHIZttUg/view?usp=sharing. If you are using a different OS, I will send you different artifacts.

Default installation path is /usr/bin/netbird. Create a backup for easy rollback:

netbird down
cp -a /usr/bin/netbird /usr/bin/netbird.bkp
cat /path/to/downloaded/netbird > /usr/bin/netbird
netbird up
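The rollback asked about earlier would presumably be the mirror image of these steps. The following is a sketch under that assumption, not an official procedure; `restore_binary` is a hypothetical helper, and the paths are the ones used in the backup commands above.

```sh
#!/bin/sh
# restore_binary BACKUP TARGET: overwrite TARGET with the contents of BACKUP
# in place, mirroring the `cat ... > /usr/bin/netbird` install step above.
# Returns non-zero without touching TARGET if the backup file is missing.
restore_binary() {
  backup=$1
  target=$2
  [ -f "$backup" ] || { echo "backup not found: $backup" >&2; return 1; }
  cat "$backup" > "$target"
}

# Usage (as root), following the same down/up sequence as the install:
#   netbird down
#   restore_binary /usr/bin/netbird.bkp /usr/bin/netbird
#   netbird up
```

Writing through `cat` rather than moving the file keeps the permissions and ownership of the existing `/usr/bin/netbird`, which matches how the test binary was put in place.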

Don't forget to set the proper debug level!

Logs are in /var/log/netbird. Clean them before testing for easier handling.

If you are testing with the same machines as before (node-2, node-3), there is no need to update the ELK peer.

And do not forget that this is just a test version; be careful about using it in a production environment.

I hope this fix will solve your issue, but in the meantime I will dig deeper into this topic.
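For the log collection step, one simple option is to pack the log directory into a single archive before sending it. This is a sketch only: `archive_logs` is a hypothetical helper, `/var/log/netbird` is the directory named above, and the output file name is an arbitrary example.

```sh
#!/bin/sh
# archive_logs LOG_DIR OUT_FILE: pack everything under LOG_DIR into a gzip
# tarball at OUT_FILE. OUT_FILE should live outside LOG_DIR so that the
# archive does not try to include itself.
archive_logs() {
  log_dir=$1
  out_file=$2
  [ -d "$log_dir" ] || { echo "no such directory: $log_dir" >&2; return 1; }
  tar -czf "$out_file" -C "$log_dir" .
}

# Usage:
#   archive_logs /var/log/netbird /tmp/netbird-logs.tar.gz
```

Cleaning the directory before the test run, as suggested above, keeps the resulting archive small and limited to the relevant period.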


@netandreus commented on GitHub (Feb 3, 2025):

Good morning, @pappz!
Can you please update the link? I can't download the test version.

Image

@pappz commented on GitHub (Feb 3, 2025):

Strange. Here is the updated link: https://drive.google.com/file/d/1R1VlSroJi0QxHHIAgG_raJ8OxHIZttUg/view?usp=drive_link


@pappz commented on GitHub (Feb 3, 2025):

@netandreus could you send me a debug bundle? You can generate it with this command: netbird debug bundle -S

I would like to get a better picture of your network-related settings. This package contains all the necessary information.


@netandreus commented on GitHub (Feb 3, 2025):

@pappz Sure, but I can't download it. Maybe there is a permissions issue on your side?

Image

When I click on the file, I can only copy its name. Maybe you need my Google account or something else from my side?


@pappz commented on GitHub (Feb 3, 2025):

The file that I uploaded is a ZIP archive. I think you opened it in the browser. It would be easier to download the full ZIP and unpack it on your machine.


@netandreus commented on GitHub (Feb 3, 2025):

> With these commands, you can set the logging level:
> netbird debug log level debug
> or
> netbird debug log level verbose
> The verbose level may log too much data to your disk over a week, so choose the debug level based on your preference. When an issue occurs, please collect the log files and send them to me.

Done.

> @netandreus could you send to me a debug bundle? You can generate it with this command: netbird debug bundle -S

Done.

You can find all the files (both the logs from when the error occurs on the stable version and the debug bundle for the test version) here:

https://drive.google.com/drive/folders/1sRO8GprHSPS5LYgUqWg5iP2wCTHm7PHa?usp=drive_link

> I hope this fix will solve your issue but meantime I will dig deeper into this topic.

I deployed the test version on both nodes.

node-2
Image

node-3
Image

Then I restarted NetBird on the ELK node:

Image

And I see this for node-2 on the ELK node:

Image

and this for node-3 on the ELK node:

Image

@netandreus commented on GitHub (Feb 3, 2025):

@pappz and now I can ping neither node-3 nor node-2 from the ELK node.


@pappz commented on GitHub (Feb 3, 2025):

@netandreus
Thank you for the logs. Can we schedule a call so we can investigate the situation together? That would resolve this issue faster than going back and forth here.


@netandreus commented on GitHub (Feb 3, 2025):

@pappz sure, we can schedule a call for tomorrow 2025-02-04 from 10:00 GMT+4. I can give you access to Anydesk / ssh to these nodes. Please find me on Telegram - https://t.me/netandreus


@pappz commented on GitHub (Feb 4, 2025):

Thank you. I think you need to accept my messages. https://t.me/pzolinb

Reference: SVI/netbird#1577