Performance issue large setup with lots of policies #2272

Open
opened 2025-11-20 07:06:51 -05:00 by saavagebueno · 3 comments
Owner

Originally created by @saule1508 on GitHub (Sep 12, 2025).

Describe the problem

On a large set-up with a high number of policies and policy rules the system consumes a lot of CPU and does not scale with the number of peers connected.

After profiling the CPU, I could see that the issue is almost entirely caused by the posture checks for the OS version and the NetBird (NB) version. The checks are executed far too often and burn a lot of CPU (regexp).

Image

I disabled the version posture checks and the issue disappeared.

Sorry, the flame graph is very small. I can paste a better one; it is on my corporate laptop, which I don't have with me right now.

To Reproduce

Steps to reproduce the behavior:

  1. Set up a NetBird network with a large number of networks, policies and policy_rules (we have 480 networks).
  2. As peers connect, observe the CPU increasing. The CPU spikes when UpdateAccountPeers is executed, which seems to happen twice when a peer connects: first for Login and shortly after for a Sync.
  3. We see UpdateAccountPeers being executed too often (sometimes 4 times in a 30-second interval) and with high latency (up to 30 seconds).
  4. People complain about Login timeouts, but more importantly the system consumes a lot of CPU. With only 70 peers connected we need 8 vCPUs.

Expected behavior

CPU usage should stay roughly flat and not grow much with the number of peers.

Are you using NetBird Cloud?

Self-hosted

NetBird version

0.56.1

Is any other VPN software installed?

no

Debug output

To help us resolve the problem, please attach the following anonymized status output

netbird status -dA

Create and upload a debug bundle, and share the returned file key:

netbird debug for 1m -AS -U

Uploaded files are automatically deleted after 30 days.

Alternatively, create the file only and attach it here manually:

netbird debug for 1m -AS

Screenshots

The first day is with the version checks, the two others without. The spikes are flattened a lot by the Prometheus interval; the actual spikes were much bigger.

Image

All the metrics have improved, especially GetPeerNetworkMap.

Users no longer suffer (or suffer less) from the Login latency (sometimes two or three login attempts were needed).

Another massive improvement was removing the sysadmin group, which was in the source of all 480 rules. It had a big negative impact both on the Windows client (adding 480 routes was very slow) and on the server (removing it improved UpdateAccountPeers a lot). But the real issue was probably the version checks.

Additional context


I discussed this on Slack and got interesting feedback from @pascal-fischer, confirming that this high number of networks/groups/policies is certainly causing an issue.

The issue is worked around by disabling the version checks. I am sure it can be optimized and will be glad to propose an MR. Possible solutions would be:

  • Check whether the posture checks are applied too often. I suspect the same peer is validated over and over if it appears in multiple rules. It might be better to first compute the list of peers and exclude the ones that don't pass the checks at the end.
  • Another, more radical approach would be to cache the result of the check per peer (either in a "global" singleton cache at the package level or a cache at the accountManager level). Since the OS version and NetBird version don't change, the invalidation could be done when a peer logs in.
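
To make the first idea concrete, here is a minimal sketch of evaluating each (peer, check) pair at most once while walking all the rules. This is not NetBird's code: `Rule`, `checkFunc`, `memoKey` and `allowedSources` are hypothetical names, and the real posture check API is assumed to be reducible to a pass/fail function.

```go
package main

import "fmt"

// Rule is a hypothetical, simplified policy rule used only for illustration.
type Rule struct {
	SourcePeerIDs []string // peers resolved from the rule's source groups
	CheckIDs      []string // posture checks attached to the rule's policy
}

// checkFunc stands in for an expensive posture check evaluation.
type checkFunc func(peerID, checkID string) bool

// memoKey identifies one (peer, check) evaluation.
type memoKey struct{ peerID, checkID string }

// allowedSources returns, per rule, the source peers that pass all checks.
// Each (peer, check) pair is evaluated at most once for the whole pass.
func allowedSources(rules []Rule, check checkFunc) [][]string {
	memo := make(map[memoKey]bool)
	out := make([][]string, len(rules))
	for i, r := range rules {
		for _, p := range r.SourcePeerIDs {
			ok := true
			for _, c := range r.CheckIDs {
				k := memoKey{p, c}
				res, seen := memo[k]
				if !seen {
					res = check(p, c)
					memo[k] = res
				}
				if !res {
					ok = false
					break
				}
			}
			if ok {
				out[i] = append(out[i], p)
			}
		}
	}
	return out
}

func main() {
	calls := 0
	check := func(peerID, checkID string) bool {
		calls++ // count how often the expensive check really runs
		return peerID != "old-peer"
	}
	rules := []Rule{
		{SourcePeerIDs: []string{"a", "old-peer"}, CheckIDs: []string{"nb-version"}},
		{SourcePeerIDs: []string{"a", "old-peer"}, CheckIDs: []string{"nb-version"}},
		{SourcePeerIDs: []string{"a"}, CheckIDs: []string{"nb-version"}},
	}
	res := allowedSources(rules, check)
	fmt.Println(res, "check calls:", calls)
}
```

With three rules sharing the same two peers, the check runs twice instead of five times. The memo lives only for one pass, so nothing needs to be invalidated.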

Also, I think the UpdateAccountPeers buffering is not working as it should: sometimes I see 4 or 5 UpdateAccountPeers calls in a 30-second interval. The buffering/queuing does not seem to happen at the AccountManager level. I also believe UpdateAccountPeers is triggered without buffering via the API (PUT group, for example), which also makes the API slow. It would be better to send a buffered update.

saavagebueno added the triage-needed label 2025-11-20 07:06:51 -05:00
Author
Owner

@saule1508 commented on GitHub (Sep 12, 2025):

There is another improvement I made on my setup, in the sql_store GetAccount. The issue is that the policy rules are first loaded by the ORM (in one query, with an IN clause), but the result is not used because for some reason the ORM does not attach it to the slice in the account. Then the rules are queried one by one, and with 400 rules that takes between 400 and 600 ms.
I will propose an MR; the fix is working on my side. I serialized the account before and after and the result is the same.
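
For illustration, the idea of using the batch result instead of one query per rule could be sketched like this. The types and `attachRules` are hypothetical and simplified, not the actual sql_store code; the `rules` slice stands in for the rows returned by the single IN query.

```go
package main

import "fmt"

// PolicyRule and Policy are hypothetical, simplified stand-ins.
type PolicyRule struct {
	ID       string
	PolicyID string
}

type Policy struct {
	ID    string
	Rules []*PolicyRule
}

// attachRules distributes rules fetched in one batch onto their policies,
// so no per-policy or per-rule follow-up query is needed.
func attachRules(policies []*Policy, rules []*PolicyRule) {
	byPolicy := make(map[string][]*PolicyRule, len(policies))
	for _, r := range rules {
		byPolicy[r.PolicyID] = append(byPolicy[r.PolicyID], r)
	}
	for _, p := range policies {
		p.Rules = byPolicy[p.ID]
	}
}

func main() {
	policies := []*Policy{{ID: "p1"}, {ID: "p2"}}
	// Pretend these rows came back from a single "WHERE policy_id IN (...)" query.
	rules := []*PolicyRule{
		{ID: "r1", PolicyID: "p1"},
		{ID: "r2", PolicyID: "p2"},
		{ID: "r3", PolicyID: "p1"},
	}
	attachRules(policies, rules)
	for _, p := range policies {
		fmt.Println(p.ID, "has", len(p.Rules), "rules")
	}
}
```

The grouping is a single pass over the rules, so the cost is one query plus a map build rather than hundreds of round trips.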

Author
Owner

@saule1508 commented on GitHub (Oct 18, 2025):

Disabling the OS version and NetBird version checks has improved the CPU usage a lot, but now I have 600 concurrently connected peers (1600 total peers) and the CPU is spiking again.

Looking at the CPU profile, the posture checks are still the culprit, this time the process check (not surprisingly, since the OS version and NetBird version checks are disabled).

There is a design flaw and also a bug that explain the CPU usage. I think both are fixable and I will try next week.

The logic is like this: UpdateAccountPeers (which is triggered way too often, but that is a different issue) loops over all the connected peers (600). Then, for each peer, it calculates the network map. Part of building the network map is calculating which peers are reachable by this peer, and there the logic is:

  • For each policy rule (we have 480), calculate the peers in the source groups of the policy and apply the posture checks each time, even if the peer is expired (that is the bug).

To give an example, I was tracking the Debug log line "an error occurred check ProcessCheck: on peer: xxx :unsupported peer's operating system: android" accountID=xxx context=GRPC peerID="xxx" requestID=xxx source="management/server/types/account.go:1164" for an Android peer that has been offline for a very long time, and I arrive at a staggering 2K per second!

That is: each UpdateAccountPeers iterates over 600 peers, and each iteration iterates over all policies (480 in my case). This Android peer is part of 9 policies (it is in 9 groups, each in a different policy), so for each UpdateAccountPeers the check is performed 600 x 9 times.

Note: UpdateAccountPeers is executed very often (6 times in 30 seconds in my case), because the buffering logic is not completely right and, furthermore, it is not always used (API calls use UpdateAccountPeers directly, not the buffered version).

So the time spent in the posture checks is staggering; basically the same check for a peer is done thousands of times per second.
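
As a back-of-the-envelope check, the figures quoted above can be plugged into a small calculation. `checksPerSecond` is a hypothetical helper, and the model is a simplification: it assumes exactly one evaluation per policy containing the peer, per network map computed.

```go
package main

import "fmt"

// checksPerSecond estimates how often one peer's posture check is evaluated,
// assuming one evaluation per (connected peer, policy containing the peer)
// for every update that runs in the time window.
func checksPerSecond(connectedPeers, policiesWithPeer, updatesPerWindow int, windowSeconds float64) float64 {
	perUpdate := connectedPeers * policiesWithPeer
	return float64(perUpdate*updatesPerWindow) / windowSeconds
}

func main() {
	// 600 connected peers, peer present in 9 policies, 6 updates per 30 seconds.
	fmt.Printf("%.0f checks/s\n", checksPerSecond(600, 9, 6, 30))
}
```

This gives 5,400 evaluations per update and about 1,080 per second for a single offline peer, the same order of magnitude as the roughly 2K per second seen in the log. The remaining gap is not explained by this simple model.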

Two low-risk fixes:

  1. Exclude the expired peers in getAllPeersFromGroups.
  2. Have a cache in the posture package, with the key being the peerID and the checkName. The OS version and NetBird version checks could be cached for a long time (1h), the process check for maybe 2 minutes. In the loginpeer, the cache should be invalidated for the peer logging in. I think this is a very sound design decision because currently the same check can be done thousands of times for the same peer.

Another suggestion: I think it would be best to have central scheduling of UpdateAccountPeers, maybe once per 30 seconds, instead of the current situation where it can be called by plenty of concurrent requests. Same for the peer expiration job.

I'll have a go at that on my self-hosted set-up to keep the CPU under control.
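
The central scheduling idea could be sketched as a "dirty set" flushed on a timer. `UpdateScheduler`, `Request`, `Flush` and `Run` are hypothetical names, and the `update` callback stands in for whatever actually recomputes and sends the network maps for an account.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// UpdateScheduler coalesces update requests so each account is updated
// at most once per flush, no matter how many callers asked.
type UpdateScheduler struct {
	mu     sync.Mutex
	dirty  map[string]struct{}
	update func(accountID string)
}

func NewUpdateScheduler(update func(accountID string)) *UpdateScheduler {
	return &UpdateScheduler{dirty: make(map[string]struct{}), update: update}
}

// Request marks an account as needing an update; it is cheap and never blocks on the update itself.
func (s *UpdateScheduler) Request(accountID string) {
	s.mu.Lock()
	s.dirty[accountID] = struct{}{}
	s.mu.Unlock()
}

// Flush runs one update per dirty account and returns how many ran.
func (s *UpdateScheduler) Flush() int {
	s.mu.Lock()
	pending := s.dirty
	s.dirty = make(map[string]struct{})
	s.mu.Unlock()
	for id := range pending {
		s.update(id)
	}
	return len(pending)
}

// Run flushes on every tick until stop is closed.
func (s *UpdateScheduler) Run(interval time.Duration, stop <-chan struct{}) {
	t := time.NewTicker(interval)
	defer t.Stop()
	for {
		select {
		case <-t.C:
			s.Flush()
		case <-stop:
			return
		}
	}
}

func main() {
	runs := 0
	s := NewUpdateScheduler(func(accountID string) { runs++ })
	for i := 0; i < 6; i++ {
		s.Request("account-1") // e.g. logins, syncs and API calls
	}
	s.Flush() // a server would instead start: go s.Run(30*time.Second, stop)
	fmt.Println("updates executed:", runs)
}
```

Six requests collapse into one update. The trade-off is that a change can take up to one interval to reach the peers, so the interval would need tuning.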

Author
Owner

@saule1508 commented on GitHub (Oct 18, 2025):

Image

Reference: SVI/netbird#2272