Netbird breaks Cilium local endpoint routing (gateway/envoy) #2333

Open
opened 2025-11-20 07:07:54 -05:00 by saavagebueno · 8 comments

Originally created by @Blackclaws on GitHub (Oct 2, 2025).

Describe the problem

When NetBird runs in host network mode on a Kubernetes node that uses the Cilium CNI with gateways and the Envoy proxy enabled, it breaks local delivery of packets to Envoy.

Cilium relies on packet marks, and NetBird globally sets:

sysctl -w net.ipv4.conf.all.src_valid_mark=1

This breaks Cilium's delivery.
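To see what is actually in effect on a node, the relevant values can be listed per interface. The sketch below is only a diagnostic aid; the function name `show_rp_settings` is made up for illustration. According to the kernel's ip-sysctl documentation, the effective value of both `src_valid_mark` and `rp_filter` is the maximum of the `all` entry and the per-interface entry, so a 1 under `all` applies to every interface.

```shell
#!/bin/sh
# Print src_valid_mark and rp_filter for every IPv4 interface entry.
# The optional first argument overrides the sysctl root (default /proc/sys),
# which is only useful when pointing the function at a copy of the tree.
show_rp_settings() {
    root="${1:-/proc/sys}"
    for dir in "$root"/net/ipv4/conf/*/; do
        [ -d "$dir" ] || continue
        iface=$(basename "$dir")
        # A missing or unreadable file is reported as "?".
        mark=$(cat "$dir/src_valid_mark" 2>/dev/null || echo "?")
        rpf=$(cat "$dir/rp_filter" 2>/dev/null || echo "?")
        printf '%s src_valid_mark=%s rp_filter=%s\n' "$iface" "$mark" "$rpf"
    done
}
```

Running `show_rp_settings` once with NetBird stopped and once with it running should show whether `all src_valid_mark` flips from 0 to 1.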

To Reproduce

  1. Set up a Kubernetes cluster using the Cilium CNI with kube-proxy replacement (L2 announcements may also be needed).
  2. Run NetBird as a DaemonSet.
  3. While NetBird is up, a gateway hosted on that node is no longer reachable from outside the cluster.
  4. Once NetBird is down, the gateway is reachable again.

Expected behavior

Netbird doesn't break Cilium

Are you using NetBird Cloud?

No

NetBird version

0.59.0

Is any other VPN software installed?

No

Debug output

Peers detail:
 router-ottawa.anon-hZjlp.domain:
  NetBird IP: 100.85.45.84
  Public key: QDyJ0jkJU+lOvp287yfvmXDEq+RoBdg81uO8LH/XYww=
  Status: Connected
  -- detail --
  Connection type: Relayed
  ICE candidate (Local/Remote): -/-
  ICE candidate endpoints (Local/Remote): -/-
  Relay server address: rel://netbird.anon-7ocu9.domain:33080
  Last connection update: 5 minutes, 42 seconds ago
  Last WireGuard handshake: 2 minutes, 41 seconds ago
  Transfer status (received/sent) 780 B/600 B
  Quantum resistance: false
  Networks: 10.0.1.107/32, 10.0.2.236/32
  Latency: 0s

Events:
  [INFO] SYSTEM (f5b9ce8f-3bfd-48cc-ad73-3b42c9ed6753)
    Message: Network map updated
    Time: 43 minutes, 14 seconds ago
  [INFO] SYSTEM (e66579de-0279-4f00-be3e-d012c885c3be)
    Message: Network map updated
    Time: 42 minutes, 38 seconds ago
  [INFO] SYSTEM (00b44017-b622-4181-9414-b059b42d6448)
    Message: Network map updated
    Time: 35 minutes, 56 seconds ago
  [INFO] SYSTEM (3305e385-7e0f-4ffd-9489-d68b2562c1bf)
    Message: Network map updated
    Time: 35 minutes, 44 seconds ago
  [INFO] SYSTEM (8a5a8366-4e16-4a4c-9662-f0e95c61dfcc)
    Message: Network map updated
    Time: 27 minutes, 22 seconds ago
  [INFO] SYSTEM (a7cc3a87-cfee-4d23-8ebe-2163e14a54ee)
    Message: Network map updated
    Time: 23 minutes, 15 seconds ago
  [INFO] SYSTEM (9c155177-20bd-40a7-8ff3-07e7242b829a)
    Message: Network map updated
    Time: 21 minutes, 4 seconds ago
  [INFO] SYSTEM (8548a0bb-abb4-4705-aa5b-75f6ff81b8c9)
    Message: Network map updated
    Time: 18 minutes, 59 seconds ago
  [INFO] SYSTEM (d616633a-e46b-4bbb-ba1c-c2b73349e595)
    Message: Network map updated
    Time: 5 minutes, 42 seconds ago
  [INFO] SYSTEM (b250821c-0254-4e9d-9c5a-3f392b486962)
    Message: Network map updated
    Time: 4 minutes, 50 seconds ago
OS: linux/amd64
Daemon version: 0.59.0
CLI version: 0.59.0
Profile: default
Management: Connected to https://netbird.anon-7ocu9.domain:443
Signal: Connected to https://netbird.anon-7ocu9.domain:443
Relays: 
  [stun:netbird.anon-7ocu9.domain:3478] is Available
  [turn:netbird.anon-7ocu9.domain:3478?transport=udp] is Available
  [rel://netbird.anon-7ocu9.domain:33080] is Available
Nameservers: 
FQDN: kurisu-180-65.anon-hZjlp.domain
NetBird IP: 100.85.180.65/16
Interface type: Kernel
Quantum resistance: false
Lazy connection: false
Networks: -
Forwarding rules: 0
Peers count: 1/1 Connected

Have you tried these troubleshooting steps?

  • Reviewed client troubleshooting (https://docs.netbird.io/how-to/troubleshooting-client) (if applicable)
  • Checked for newer NetBird versions
  • Searched for similar issues on GitHub (including closed ones)
  • Restarted the NetBird client
  • Disabled other VPN software
  • Checked firewall settings
saavagebueno added the bug, client, compatibility labels 2025-11-20 07:07:54 -05:00

@Blackclaws commented on GitHub (Oct 2, 2025):

A potential solution is to set src_valid_mark=1 only on the interfaces that are relevant to NetBird.
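As a rough illustration of that idea: because the kernel takes the maximum of the `all` and per-interface values, the flag can be set on selected interfaces while `all` stays at 0. The sketch below does this by writing to the sysctl tree directly. The function name `scope_src_valid_mark` is hypothetical, and which interfaces NetBird really needs is not established in this thread, so the list is passed in as arguments. It has not been verified that NetBird keeps working this way, and the daemon may set the global value again when it starts.

```shell
#!/bin/sh
# Enable src_valid_mark only on the named interfaces and clear the global flag.
# Usage: scope_src_valid_mark ROOT IFACE...
#   ROOT is the sysctl root, normally /proc/sys (writing there requires root).
scope_src_valid_mark() {
    root="$1"
    shift
    conf="$root/net/ipv4/conf"
    # Refuse to touch anything if one of the interfaces does not exist.
    for iface in "$@"; do
        if [ ! -d "$conf/$iface" ]; then
            echo "no such interface: $iface" >&2
            return 1
        fi
    done
    # Set the per-interface flags first so they are never left without it.
    for iface in "$@"; do
        echo 1 > "$conf/$iface/src_valid_mark" || return 1
    done
    echo 0 > "$conf/all/src_valid_mark"
}
```

For example, `scope_src_valid_mark /proc/sys wt0` would scope the flag to `wt0`, assuming that is the name of the NetBird interface on the node. The uplink interface that carries the encrypted traffic might need the flag as well; that is a guess, not something confirmed here.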


@Blackclaws commented on GitHub (Oct 2, 2025):

I've opened a companion issue in cilium as I'm not sure which side is responsible for fixing this.


@nazarewk commented on GitHub (Oct 2, 2025):

Thanks for investigating this and reporting on both sides. It saved us a lot of precious time spinning up various clusters.

Do you have any idea whether it's specific to Cilium, or to its combination with specific Kubernetes distributions?


@Blackclaws commented on GitHub (Oct 2, 2025):

In this case we're using TalosOS, which is basically a Linux built only for Kubernetes. I'm not sure if it's specific to Cilium, but the way Cilium does its local transparent proxying is what leads to this. It might also be that this only happens when you use the L2 announcement feature, where nodes dynamically respond to ARP based on acquired leases. In that case the destination IP address isn't actually assigned to an interface, so rp_filter together with strict marks might trigger the issue.

If you need any logs or firewall rulesets from a node exhibiting this behaviour, taken before and after netbird up, let me know.
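One way to collect such a before/after comparison is to dump the routing and filter state to a file at each point and diff the two. This is a minimal sketch with a hypothetical function name, `capture_net_state`. It uses only `sysctl`, `ip` and `nft`, and simply records an error line for any tool that is missing on the node. Cilium's own eBPF state is not captured.

```shell
#!/bin/sh
# Save routing and filter state to OUTDIR/LABEL.txt for later diffing.
# Usage: capture_net_state LABEL [OUTDIR]
capture_net_state() {
    label="$1"
    outdir="${2:-.}"
    out="$outdir/$label.txt"
    mkdir -p "$outdir" || return 1
    {
        # "|| true" keeps the capture going when a tool is missing or fails;
        # the error text still lands in the file because of 2>&1.
        echo "## sysctl"
        sysctl net.ipv4.conf.all.src_valid_mark net.ipv4.conf.all.rp_filter 2>&1 || true
        echo "## ip rule"
        ip rule show 2>&1 || true
        echo "## ip route (all tables)"
        ip route show table all 2>&1 || true
        echo "## nft ruleset"
        nft list ruleset 2>&1 || true
    } > "$out"
    # Print the path of the file that was written.
    echo "$out"
}
```

A typical run would be `capture_net_state before`, then bringing NetBird up, then `capture_net_state after`, followed by `diff -u before.txt after.txt`.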


@mad73923 commented on GitHub (Oct 14, 2025):

Having the same issue. Would love to keep cilium + netbird as a pair!


@mad73923 commented on GitHub (Oct 16, 2025):

Update:
I was able to avoid setting net.ipv4.conf.all.src_valid_mark=1 by setting the following environment variables for netbird (adapting the netbird service via systemctl):

[Service]
Environment="NB_DISABLE_CUSTOM_ROUTING=true"
Environment="NB_SKIP_SOCKET_MARK=true"

It seems to be working for my use case, but I don't know how sustainable it is.
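For anyone wanting to try the same workaround, a systemd drop-in is one way to add the two variables without editing the unit file itself. The sketch below assumes the unit is called `netbird.service` and uses the standard drop-in directory; the function name `write_netbird_dropin` and the file name `override.conf` are arbitrary choices. A later comment in this thread reports that the workaround did not help in another setup, so treat it as an experiment.

```shell
#!/bin/sh
# Write a systemd drop-in carrying the two environment variables.
# Usage: write_netbird_dropin [DIR]
#   DIR defaults to the drop-in directory of a unit named netbird.service.
write_netbird_dropin() {
    dir="${1:-/etc/systemd/system/netbird.service.d}"
    mkdir -p "$dir" || return 1
    cat > "$dir/override.conf" <<'EOF'
[Service]
Environment="NB_DISABLE_CUSTOM_ROUTING=true"
Environment="NB_SKIP_SOCKET_MARK=true"
EOF
}
```

After running `write_netbird_dropin` as root, `systemctl daemon-reload` followed by `systemctl restart netbird` applies it. `sysctl net.ipv4.conf.all.src_valid_mark` then shows the resulting value; it may still read 1 from an earlier run until it is reset or the node reboots.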


@Joao-1 commented on GitHub (Nov 2, 2025):

Hi, guys! Do you have any updates on this? I tried @mad73923's solution, but it didn't work.


@Joao-1 commented on GitHub (Nov 3, 2025):

For more information, we are using a cluster with Cilium 1.18.3 and all nodes with Netbird version 0.59.11. Once all the nodes are connected, all external connections are lost. Similarly, when the VPN interface is deactivated, everything reverts to normal. To troubleshoot the issue, we added other nodes from a different cluster that does not use Cilium, but rather Flannel as CNI, and everything functioned perfectly.


Reference: SVI/netbird#2333