Latency base routing does not work #835

Closed
opened 2025-11-20 05:18:21 -05:00 by saavagebueno · 12 comments
Owner

Originally created by @hurricanehrndz on GitHub (Apr 29, 2024).

Originally assigned to: @pascal-fischer on GitHub.

Describe the problem

latency base routing was suppose to be added with the following patch:
https://github.com/netbirdio/netbird/pull/1732

Unfortunately I believe the following if statement is wrong, but I am hesitant to open a PR to fix it since it will make the underlying wireguard connection unstable:

https://github.com/netbirdio/netbird/pull/1732/files#diff-67387c2b097eeb87387ab2575884677a2f9ee1c53b9e8674dcfa4936463c23f9R155

There also seems to be a few race conditions that could impact the route analyzes poorly.

In my testing sometimes the slower connection is picked because it was the first to connect. Other times the slower connection is picked because latency for one of the later connected peers is missing.

I will try and add logs, as I collect them.

To Reproduce

Steps to reproduce the behavior:
Fix if statement, set a more reasonable latency forgiveness number such as 0.015 (15 ms) rather than 100ms and watch what happens. You will eventually end up disconnected from all peers

Expected behavior

I expected the best route to be picked, unless it is a route that is only 100ms different as per the the latency base routing patch intended

Are you using NetBird Cloud?

Nope

NetBird version

modified version of 0.27.3

NetBird status -d output:

Peers detail:
 lon1.peers.n.com:
  NetBird IP: 100.113.20.218
  Public key: DdiC7cXVeE409wdJO1bQjFQ8JYP52X+/HO46FtkTQSI=
  Status: Connecting
  -- detail --
  Connection type:
  Direct: false
  ICE candidate (Local/Remote): -/-
  ICE candidate endpoints (Local/Remote): -/-
  Last connection update: 2024-04-29 10:29:36
  Last WireGuard handshake: -
  Transfer status (received/sent) 0 B/0 B
  Quantum resistance: false
  Routes: -
  Latency: 135.14475ms

 sfo2.peers.n.com:
  NetBird IP: 100.113.35.69
  Public key: S3TburlUFqMPX3PLDlKhQnD/NLxq9B9YCzJ7mAZozh8=
  Status: Connecting
  -- detail --
  Connection type:
  Direct: false
  ICE candidate (Local/Remote): -/-
  ICE candidate endpoints (Local/Remote): -/-
  Last connection update: 2024-04-29 10:29:38
  Last WireGuard handshake: -
  Transfer status (received/sent) 0 B/0 B
  Quantum resistance: false
  Routes: -
  Latency: 59.2695ms

 iad1.peers.n.com:
  NetBird IP: 100.113.74.116
  Public key: MGJ+hGFni8uJSgVsF5AZI9GJCSEaOfJOUJ9joDxbhBI=
  Status: Connecting
  -- detail --
  Connection type:
  Direct: false
  ICE candidate (Local/Remote): -/-
  ICE candidate endpoints (Local/Remote): -/-
  Last connection update: 2024-04-29 10:29:39
  Last WireGuard handshake: -
  Transfer status (received/sent) 0 B/0 B
  Quantum resistance: false
  Routes: -
  Latency: 69.943916ms

Daemon version: 0.27.3.1009
CLI version: 0.27.3.1009
Management: Connected to https://netbird.n.com:443
Signal: Connected to https://netbird.n.com:443
Relays:
  [stun:stunvpn-ec2.n.com:3478] is Available
  [turn:stunvpn-ec2.n.com:3478?transport=udp] is Unavailable, reason: allocate: all retransmissions failed for fGXwf/+OxEZN791L
Nameservers:
  [10.29.2.45:53, 10.29.2.46:53, 10.29.2.47:53] for [n.com, corp.n.com, dev.n.com] is Available
  [10.29.2.45:53, 10.29.2.46:53, 10.29.2.47:53] for [.] is Available
FQDN: hx7yg952h5-89.peers.n.com
NetBird IP: 100.113.101.234/16
Interface type: Userspace
Quantum resistance: false
Routes: -
Peers count: 0/3 Connected

Additional context

I have explicitly disable turn on my coturn instance to force direct connections

Originally created by @hurricanehrndz on GitHub (Apr 29, 2024). Originally assigned to: @pascal-fischer on GitHub. **Describe the problem** latency base routing was suppose to be added with the following patch: https://github.com/netbirdio/netbird/pull/1732 Unfortunately I believe the following if statement is wrong, but I am hesitant to open a PR to fix it since it will make the underlying wireguard connection unstable: https://github.com/netbirdio/netbird/pull/1732/files#diff-67387c2b097eeb87387ab2575884677a2f9ee1c53b9e8674dcfa4936463c23f9R155 There also seems to be a few race conditions that could impact the route analyzes poorly. In my testing sometimes the slower connection is picked because it was the first to connect. Other times the slower connection is picked because latency for one of the later connected peers is missing. I will try and add logs, as I collect them. **To Reproduce** Steps to reproduce the behavior: Fix if statement, set a more reasonable latency forgiveness number such as 0.015 (15 ms) rather than 100ms and watch what happens. You will eventually end up disconnected from all peers **Expected behavior** I expected the best route to be picked, unless it is a route that is only 100ms different as per the the latency base routing patch intended **Are you using NetBird Cloud?** Nope **NetBird version** modified version of 0.27.3 **NetBird status -d output:** ``` Peers detail: lon1.peers.n.com: NetBird IP: 100.113.20.218 Public key: DdiC7cXVeE409wdJO1bQjFQ8JYP52X+/HO46FtkTQSI= Status: Connecting -- detail -- Connection type: Direct: false ICE candidate (Local/Remote): -/- ICE candidate endpoints (Local/Remote): -/- Last connection update: 2024-04-29 10:29:36 Last WireGuard handshake: - Transfer status (received/sent) 0 B/0 B Quantum resistance: false Routes: - Latency: 135.14475ms sfo2.peers.n.com: NetBird IP: 100.113.35.69 Public key: S3TburlUFqMPX3PLDlKhQnD/NLxq9B9YCzJ7mAZozh8= Status: Connecting -- detail -- Connection type: Direct: false ICE candidate (Local/Remote): -/- ICE candidate endpoints (Local/Remote): -/- Last connection update: 2024-04-29 10:29:38 Last WireGuard handshake: - Transfer status (received/sent) 0 B/0 B Quantum resistance: false Routes: - Latency: 59.2695ms iad1.peers.n.com: NetBird IP: 100.113.74.116 Public key: MGJ+hGFni8uJSgVsF5AZI9GJCSEaOfJOUJ9joDxbhBI= Status: Connecting -- detail -- Connection type: Direct: false ICE candidate (Local/Remote): -/- ICE candidate endpoints (Local/Remote): -/- Last connection update: 2024-04-29 10:29:39 Last WireGuard handshake: - Transfer status (received/sent) 0 B/0 B Quantum resistance: false Routes: - Latency: 69.943916ms Daemon version: 0.27.3.1009 CLI version: 0.27.3.1009 Management: Connected to https://netbird.n.com:443 Signal: Connected to https://netbird.n.com:443 Relays: [stun:stunvpn-ec2.n.com:3478] is Available [turn:stunvpn-ec2.n.com:3478?transport=udp] is Unavailable, reason: allocate: all retransmissions failed for fGXwf/+OxEZN791L Nameservers: [10.29.2.45:53, 10.29.2.46:53, 10.29.2.47:53] for [n.com, corp.n.com, dev.n.com] is Available [10.29.2.45:53, 10.29.2.46:53, 10.29.2.47:53] for [.] is Available FQDN: hx7yg952h5-89.peers.n.com NetBird IP: 100.113.101.234/16 Interface type: Userspace Quantum resistance: false Routes: - Peers count: 0/3 Connected ``` **Additional context** I have explicitly disable turn on my coturn instance to force direct connections
saavagebueno added the bugroutestroubleshooting labels 2025-11-20 05:18:21 -05:00
Author
Owner

@pascal-fischer commented on GitHub (Apr 29, 2024):

HI @hurricanehrndz, I think you are right. The condition will keep the worse route. I updated the tests and will create a fix soon. As for the 100ms between the routes, I reduced that to 10ms. The idea was to avoid frequently alternating routes if their latency is similar. 10ms should be enough.

@pascal-fischer commented on GitHub (Apr 29, 2024): HI @hurricanehrndz, I think you are right. The condition will keep the worse route. I updated the tests and will create a fix soon. As for the 100ms between the routes, I reduced that to 10ms. The idea was to avoid frequently alternating routes if their latency is similar. 10ms should be enough.
Author
Owner

@pascal-fischer commented on GitHub (Apr 29, 2024):

There is a race condition if a peer just connected the latency might still be 0 if the route selection happens before the first update. However, this should not be an issue as we will recalculate the routes once we receive a new latency. In the future we might need to reduce the frequency at which the latency is updated as in bigger networks this could cause permanent recalculation and therefore wasted resources.

If you see other race conditions please let me know.

@pascal-fischer commented on GitHub (Apr 29, 2024): There is a race condition if a peer just connected the latency might still be 0 if the route selection happens before the first update. However, this should not be an issue as we will recalculate the routes once we receive a new latency. In the future we might need to reduce the frequency at which the latency is updated as in bigger networks this could cause permanent recalculation and therefore wasted resources. If you see other race conditions please let me know.
Author
Owner

@hurricanehrndz commented on GitHub (Apr 29, 2024):

@pascal-fischer thanks for the quick response, with latency base routing fixed eventually this occurs, I added way more logging to try and track the issue. Are relay connections necessary?

2024-04-29T11:45:25-06:00 DEBG client/internal/routemanager/client.go:120: peer S3TburlUFqMPX3PLDlKhQnD/NLxq9B9YCzJ7mAZozh8= has 63.008083ms latency
2024-04-29T11:45:25-06:00 DEBG client/internal/routemanager/client.go:145: analyzed network: STAGE_IPS - 172.19.36.0/22, on peer: S3TburlUFqMPX3PLDlKhQnD/NLxq9B9YCzJ7mAZozh8=, current score: 0, chosen score: 99002.936991917, chosen peer: co6ojai9l7bha9qbpgdg:co6mgqa9l7bha9qbpf70
2024-04-29T11:45:25-06:00 DEBG client/internal/routemanager/client.go:120: peer MGJ+hGFni8uJSgVsF5AZI9GJCSEaOfJOUJ9joDxbhBI= has 73.212334ms latency
2024-04-29T11:45:25-06:00 DEBG client/internal/routemanager/client.go:145: analyzed network: STAGE_IPS - 172.19.36.0/22, on peer: MGJ+hGFni8uJSgVsF5AZI9GJCSEaOfJOUJ9joDxbhBI=, current score: 99002.926787666, chosen score: 99002.936991917, chosen peer: co6ojai9l7bha9qbpgdg:co6mgqa9l7bha9qbpf70
2024-04-29T11:45:25-06:00 DEBG client/internal/routemanager/client.go:156: Current route is not most optimal, comparing route scores: current: 99002.931787666 vs new: 99002.936991917
2024-04-29T11:45:25-06:00 DEBG client/internal/routemanager/client.go:184: triggered route state update for Peer S3TburlUFqMPX3PLDlKhQnD/NLxq9B9YCzJ7mAZozh8=, state: Connected
2024-04-29T11:45:25-06:00 DEBG client/internal/routemanager/client.go:184: triggered route state update for Peer S3TburlUFqMPX3PLDlKhQnD/NLxq9B9YCzJ7mAZozh8=, state: Connected
2024-04-29T11:45:25-06:00 INFO client/internal/routemanager/client.go:162: migrating route to new routing peer: S3TburlUFqMPX3PLDlKhQnD/NLxq9B9YCzJ7mAZozh8=, for network: 172.19.36.0/22, with score: 99002.936991917

2024-04-29T11:45:25-06:00 WARN client/internal/peer/conn.go:513: got an error while sending the punch packet, err: write udp4 0.0.0.0:43197->pulbic_ip:51820: use of closed network connection
2024-04-29T11:45:26-06:00 WARN client/internal/peer/conn.go:513: got an error while sending the punch packet, err: write udp4 0.0.0.0:43197->public_ip:51820: use of closed network connection
2024-04-29T11:45:30-06:00 DEBG client/internal/routemanager/routemanager.go:90: Decreasing route ref count 4 for prefix coturn_ip/32
2024-04-29T11:45:30-06:00 DEBG client/internal/relay/relay.go:89: turn probe error from turn:stunvpn-ec2.n.com:3478?transport=udp: allocate: all retransmissions failed for y/z+d12+C8ncq0G5
2024-04-29T11:45:30-06:00 DEBG client/internal/engine.go:1298: received relay probe request, healthy: false
2024-04-29T11:45:30-06:00 DEBG client/internal/engine.go:1305: received wg probe request
2024-04-29T11:45:30-06:00 DEBG client/internal/engine.go:1311: failed to get wg stats for peer DdiC7cXVeE409wdJO1bQjFQ8JYP52X+/HO46FtkTQSI=: find peer info: config key not found: last_handshake_time_sec
2024-04-29T11:45:31-06:00 DEBG client/internal/peer/conn.go:662: peer S3TburlUFqMPX3PLDlKhQnD/NLxq9B9YCzJ7mAZozh8= ICE ConnectionState has changed to Disconnected

This may be unrelated to latency base routing, but it latency base routing definitely helps to surface whatever the bug is

@hurricanehrndz commented on GitHub (Apr 29, 2024): @pascal-fischer thanks for the quick response, with latency base routing fixed eventually this occurs, I added way more logging to try and track the issue. Are relay connections necessary? ``` 2024-04-29T11:45:25-06:00 DEBG client/internal/routemanager/client.go:120: peer S3TburlUFqMPX3PLDlKhQnD/NLxq9B9YCzJ7mAZozh8= has 63.008083ms latency 2024-04-29T11:45:25-06:00 DEBG client/internal/routemanager/client.go:145: analyzed network: STAGE_IPS - 172.19.36.0/22, on peer: S3TburlUFqMPX3PLDlKhQnD/NLxq9B9YCzJ7mAZozh8=, current score: 0, chosen score: 99002.936991917, chosen peer: co6ojai9l7bha9qbpgdg:co6mgqa9l7bha9qbpf70 2024-04-29T11:45:25-06:00 DEBG client/internal/routemanager/client.go:120: peer MGJ+hGFni8uJSgVsF5AZI9GJCSEaOfJOUJ9joDxbhBI= has 73.212334ms latency 2024-04-29T11:45:25-06:00 DEBG client/internal/routemanager/client.go:145: analyzed network: STAGE_IPS - 172.19.36.0/22, on peer: MGJ+hGFni8uJSgVsF5AZI9GJCSEaOfJOUJ9joDxbhBI=, current score: 99002.926787666, chosen score: 99002.936991917, chosen peer: co6ojai9l7bha9qbpgdg:co6mgqa9l7bha9qbpf70 2024-04-29T11:45:25-06:00 DEBG client/internal/routemanager/client.go:156: Current route is not most optimal, comparing route scores: current: 99002.931787666 vs new: 99002.936991917 2024-04-29T11:45:25-06:00 DEBG client/internal/routemanager/client.go:184: triggered route state update for Peer S3TburlUFqMPX3PLDlKhQnD/NLxq9B9YCzJ7mAZozh8=, state: Connected 2024-04-29T11:45:25-06:00 DEBG client/internal/routemanager/client.go:184: triggered route state update for Peer S3TburlUFqMPX3PLDlKhQnD/NLxq9B9YCzJ7mAZozh8=, state: Connected 2024-04-29T11:45:25-06:00 INFO client/internal/routemanager/client.go:162: migrating route to new routing peer: S3TburlUFqMPX3PLDlKhQnD/NLxq9B9YCzJ7mAZozh8=, for network: 172.19.36.0/22, with score: 99002.936991917 ``` ``` 2024-04-29T11:45:25-06:00 WARN client/internal/peer/conn.go:513: got an error while sending the punch packet, err: write udp4 0.0.0.0:43197->pulbic_ip:51820: use of closed network connection 2024-04-29T11:45:26-06:00 WARN client/internal/peer/conn.go:513: got an error while sending the punch packet, err: write udp4 0.0.0.0:43197->public_ip:51820: use of closed network connection 2024-04-29T11:45:30-06:00 DEBG client/internal/routemanager/routemanager.go:90: Decreasing route ref count 4 for prefix coturn_ip/32 2024-04-29T11:45:30-06:00 DEBG client/internal/relay/relay.go:89: turn probe error from turn:stunvpn-ec2.n.com:3478?transport=udp: allocate: all retransmissions failed for y/z+d12+C8ncq0G5 2024-04-29T11:45:30-06:00 DEBG client/internal/engine.go:1298: received relay probe request, healthy: false 2024-04-29T11:45:30-06:00 DEBG client/internal/engine.go:1305: received wg probe request 2024-04-29T11:45:30-06:00 DEBG client/internal/engine.go:1311: failed to get wg stats for peer DdiC7cXVeE409wdJO1bQjFQ8JYP52X+/HO46FtkTQSI=: find peer info: config key not found: last_handshake_time_sec 2024-04-29T11:45:31-06:00 DEBG client/internal/peer/conn.go:662: peer S3TburlUFqMPX3PLDlKhQnD/NLxq9B9YCzJ7mAZozh8= ICE ConnectionState has changed to Disconnected ``` This may be unrelated to latency base routing, but it latency base routing definitely helps to surface whatever the bug is
Author
Owner

@mlsmaycon commented on GitHub (Apr 29, 2024):

Hello @hurricanehrndz can you share the full masked logs and the output from netbird status -d when all routing peers are connected?

@mlsmaycon commented on GitHub (Apr 29, 2024): Hello @hurricanehrndz can you share the full masked logs and the output from netbird status -d when all routing peers are connected?
Author
Owner

@hurricanehrndz commented on GitHub (Apr 29, 2024):

I can try to catch it, it does disconnect quickly

@hurricanehrndz commented on GitHub (Apr 29, 2024): I can try to catch it, it does disconnect quickly
Author
Owner

@mlsmaycon commented on GitHub (Apr 29, 2024):

Ok, in that case, can you share logs after 2 reconnections?

@mlsmaycon commented on GitHub (Apr 29, 2024): Ok, in that case, can you share logs after 2 reconnections?
Author
Owner

@hurricanehrndz commented on GitHub (Apr 30, 2024):

@mlsmaycon maybe we should close this issue and I will open a new one specifically scoped ICE candidates going to a disconnect state without recovery

@hurricanehrndz commented on GitHub (Apr 30, 2024): @mlsmaycon maybe we should close this issue and I will open a new one specifically scoped ICE candidates going to a disconnect state without recovery
Author
Owner

@mlsmaycon commented on GitHub (May 2, 2024):

Ok, thanks @hurricanehrndz

@mlsmaycon commented on GitHub (May 2, 2024): Ok, thanks @hurricanehrndz
Author
Owner

@hurricanehrndz commented on GitHub (May 2, 2024):

@mlsmaycon I found the other bugs that stop this from working on userspace wireguard interfaces, I will open PRs that should be merged before this does

@hurricanehrndz commented on GitHub (May 2, 2024): @mlsmaycon I found the other bugs that stop this from working on userspace wireguard interfaces, I will open PRs that should be merged before this does
Author
Owner

@mlsmaycon commented on GitHub (May 2, 2024):

Thanks, @hurricanehrndz;

Please let me know if you have any questions about the bugs you've found before opening the PRs.

BTW, we just merged #1903.

@mlsmaycon commented on GitHub (May 2, 2024): Thanks, @hurricanehrndz; Please let me know if you have any questions about the bugs you've found before opening the PRs. BTW, we just merged #1903.
Author
Owner

@hurricanehrndz commented on GitHub (May 2, 2024):

Awesome not a big deal, new release should be avoided till fixes are merged. I did document one of the bugs in the self-hosted channel.

@hurricanehrndz commented on GitHub (May 2, 2024): Awesome not a big deal, new release should be avoided till fixes are merged. I did document one of the bugs in the self-hosted channel.
Author
Owner

@hurricanehrndz commented on GitHub (May 2, 2024):

Let me open a new issue

@hurricanehrndz commented on GitHub (May 2, 2024): Let me open a new issue
Sign in to join this conversation.
1 Participants
Notifications
Due Date
No due date set.
Dependencies

No dependencies set.

Reference: SVI/netbird#835