NetBird client disconnects with the error: grpc: received message larger than max #2050

Open
opened 2025-11-20 06:11:54 -05:00 by saavagebueno · 4 comments

Originally created by @lolinool on GitHub (Jul 7, 2025).

Bug

NetBird client disconnects from the Management service with the error:
rpc error: code = ResourceExhausted desc = grpc: received message larger than max (4198925 vs. 4194304)

Steps to reproduce

  1. Run NetBird in an environment with a large number of peers/rules
  2. The client crashes with the above error after connecting

Expected behavior

Stable connection without errors

Versions

  • OS: Ubuntu 22.04
  • NetBird: version 0.49
  • Deployment: self-hosted in docker

Logs

Full error output from logs:

disconnected from the Management service but will retry silently. Reason: rpc error: code = ResourceExhausted desc = grpc: received message larger than max (4198925 vs. 4194304)

Additional context

  • Approximate number of peers in the network: ~80
saavagebueno added the bug, management-service, triage-needed labels 2025-11-20 06:11:54 -05:00

@nazarewk commented on GitHub (Jul 7, 2025):

Could you give us an estimate of how many different configs you have in your setup? We have customers an order of magnitude larger than you in terms of peers in the network who are not observing such issues.

Could you upload a debug bundle (netbird debug for 1m -SU)? It might also give us some clues.


@lolinool commented on GitHub (Jul 7, 2025):

When we first encountered this error, we assumed it was related to having a large number of resources in the network (at that time, there were around 80). We cleaned up some resources, and the error disappeared.

Later, the error reappeared, although we hadn’t added any new resources to the network. We once again reduced the list of resources and removed peers that had not been online for more than a month (out of 165 peers, only about 40 were actually in use). These actions resolved the problem again.

To reproduce the error, we ran the following test:

We started adding new resources to the network, but the error did not occur, even when we reached the same count of around 80 resources (which we had previously assumed was the threshold triggering the issue).

![Image](https://github.com/user-attachments/assets/52ad61a1-f12a-4f9c-9c62-2714003e8dd8)

Then we wrote a script that created an additional 100 peers in the network, and as a result, the error occurred again.

#!/bin/bash
# Spin up NUM_CONTAINERS throwaway NetBird clients that all register
# against the same management server using a single setup key.

NB_SETUP_KEY="***********"
NB_MANAGEMENT_URL="*******"
IMAGE="netbirdio/netbird:0.49.0-amd64"

NUM_CONTAINERS=100

for i in $(seq 1 "$NUM_CONTAINERS"); do
    docker run -d \
        --name "netbird_$i" \
        -e NB_SETUP_KEY="$NB_SETUP_KEY" \
        -e NB_MANAGEMENT_URL="$NB_MANAGEMENT_URL" \
        "$IMAGE"
done

We’d also like to point out that a large number of inactive peers appear because users log in from new devices or from different IP addresses, leading to the creation of new peers that remain unused and just “hang around” in the system. We’ve attached a screenshot showing the peers created by our test script.

![Image](https://github.com/user-attachments/assets/86a77f1a-55eb-4022-93ae-87916dbf1440)

At this point, we can only monitor such unused peers manually. Could you please advise:

Are there any mechanisms or best practices for automatically cleaning up or detecting inactive peers?

Are there any limits or recommendations for the number of peers or resources in a network beyond which we might expect such issues?

What logs or metrics should we collect to help diagnose this problem more effectively?

We would appreciate any recommendations that could help us avoid or prevent this error in the future.
@nazarewk
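
On the first question above (detecting and cleaning up inactive peers): until there is an official answer, one stopgap is to script a report against the Management REST API and flag peers that have not been seen for a while. The rough sketch below is in Go; the /api/peers path, the token-style Authorization header, and the field names (hostname, last_seen) are assumptions based on the public API and should be verified against the self-hosted version in use.

```go
package main

import (
	"encoding/json"
	"fmt"
	"log"
	"net/http"
	"os"
	"time"
)

// peer mirrors only the fields we need; the JSON field names are assumptions
// and may differ between NetBird API versions.
type peer struct {
	ID       string    `json:"id"`
	Hostname string    `json:"hostname"`
	LastSeen time.Time `json:"last_seen"`
}

func main() {
	// NB_API_URL: base URL of the management API, NB_PAT: a personal access token.
	req, err := http.NewRequest(http.MethodGet, os.Getenv("NB_API_URL")+"/api/peers", nil)
	if err != nil {
		log.Fatal(err)
	}
	req.Header.Set("Authorization", "Token "+os.Getenv("NB_PAT"))

	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		log.Fatal(err)
	}
	defer resp.Body.Close()

	var peers []peer
	if err := json.NewDecoder(resp.Body).Decode(&peers); err != nil {
		log.Fatal(err)
	}

	// Flag peers not seen for more than a month, matching the manual cleanup above.
	cutoff := time.Now().AddDate(0, -1, 0)
	for _, p := range peers {
		if p.LastSeen.Before(cutoff) {
			fmt.Printf("stale peer: %s (%s), last seen %s\n", p.Hostname, p.ID, p.LastSeen)
		}
	}
}
```

Peers flagged this way can then be reviewed and removed through the dashboard or the API.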


@Blackclaws commented on GitHub (Jul 9, 2025):

To me this looks very much like gRPC's default message-size limit kicking in. I see two potential solutions here. One is to simply increase the limit (which would just delay the problem, albeit potentially for quite a long time); the other is to actually change what is sent, or introduce chunking. I wonder what exactly is causing the blow-up in size; my guess is the network map.

I wonder if you have any network rules in place that would connect those test peers to other test peers, so that a lot of connection info would be sent at once and might clog the connection.
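
For reference on the numbers in the error: grpc-go's default maximum receive message size is 4 MiB (4 * 1024 * 1024 = 4194304 bytes), which is exactly the limit exceeded in the log above (4198925 bytes received). The sketch below is a generic illustration of how that limit is raised on a Go gRPC client via the standard dial option; it is not NetBird's actual management client code, the address is a placeholder, and as noted above, raising the limit only postpones the problem compared to shrinking or chunking the network map.

```go
package main

import (
	"log"

	"google.golang.org/grpc"
	"google.golang.org/grpc/credentials/insecure"
)

func main() {
	// grpc-go's 4 MiB default is what produces
	// "received message larger than max (4198925 vs. 4194304)".
	const maxRecvSize = 16 * 1024 * 1024 // raise the per-message receive limit to 16 MiB

	conn, err := grpc.NewClient(
		"mgmt.example.com:443", // placeholder management address
		grpc.WithTransportCredentials(insecure.NewCredentials()),
		grpc.WithDefaultCallOptions(grpc.MaxCallRecvMsgSize(maxRecvSize)),
	)
	if err != nil {
		log.Fatalf("dial failed: %v", err)
	}
	defer conn.Close()
	// ... create the service client on conn as usual ...
}
```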


@mlsmaycon commented on GitHub (Jul 9, 2025):

Hello @lolinool, we will build a binary to help us debug this issue.

Reference: SVI/netbird#2050