Suspected Memory Leak in Management service #773

Closed
opened 2025-11-20 05:17:17 -05:00 by saavagebueno · 10 comments
Owner

Originally created by @TSJasonH on GitHub (Apr 4, 2024).

Describe the problem

After running for a few days (or less) on a busy host the management service will gobble up all the available RAM. A restart of the docker container will free the RAM only to start the process again.

To Reproduce

Steps to reproduce the behavior:

  1. Run a busy NetBird self-hosted deployment
  2. Wait
  3. Observe that all available RAM is used up, then restart the container
  4. Repeat
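To chart the leak's slope between restarts without a full monitoring stack, something like the following can sample the container periodically (a sketch; the container name `artifacts-management-1` is taken from the `docker stats` output later in this thread and may differ in other deployments):

```shell
#!/bin/sh
# Sample the management container's memory usage and normalize the unit,
# so repeated samples (e.g. from cron) are directly comparable.

# Convert a docker-stats size string ("691.4MiB", "1.262GiB") to MiB.
to_mib() {
  case "$1" in
    *GiB) awk -v v="${1%GiB}" 'BEGIN { printf "%.1f", v * 1024 }' ;;
    *MiB) printf '%s' "${1%MiB}" ;;
    *)    printf 'unknown' ;;
  esac
}

# One sample: UTC timestamp plus memory in MiB.
# The container name is an assumption; adjust to your deployment.
sample() {
  mem=$(docker stats --no-stream --format '{{.MemUsage}}' artifacts-management-1 \
        | cut -d' ' -f1)
  printf '%s %s\n' "$(date -u +%FT%TZ)" "$(to_mib "$mem")"
}
```

Appending `sample` output to a file once a minute gives a plottable series of the growth curve.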

Expected behavior
Memory is freed up as server usage diminishes.

Are you using NetBird Cloud?

self-hosted docker

NetBird version

0.26.6 & 0.26.7

Screenshots

Here's the RAM usage graph for the last 24 hours. I restarted the management service last night when I got worried it would crash again while I was sleeping.

![image](https://github.com/netbirdio/netbird/assets/25849774/e42917f3-d5d8-4ab4-8b66-1cc9b3c03e4c)

saavagebueno added the bug, management-service, self-hosting labels 2025-11-20 05:17:17 -05:00

@pappz commented on GitHub (Apr 8, 2024):

Hello!

Are you using Zitadel with CockroachDB?


@TSJasonH commented on GitHub (Apr 8, 2024):

Negative, using Authentik.


@pappz commented on GitHub (Apr 9, 2024):

Thank you!
What does "busy" mean? How many users/peers do you have?


@TSJasonH commented on GitHub (Apr 9, 2024):

385 peers


@mlsmaycon commented on GitHub (Apr 9, 2024):

@TSJasonH can you confirm which container is generating the memory consumption?

You can share the output of:

```shell
docker stats
```

@TSJasonH commented on GitHub (Apr 9, 2024):

Sure, most memory is consumed by management and signal. The difference is that over time the management container will keep gobbling more and more, whereas signal stays fairly well constrained.

The `docker stats` output is a little misleading right now because it's 7:15 am and my cron job that restarts the management container ran at 2 am. The heavy peer load starts around 8 am. There are only about 65 peers connected at the moment.

```
CONTAINER ID   NAME                     CPU %     MEM USAGE / LIMIT     MEM %     NET I/O          BLOCK I/O        PIDS
36a69c6aecca   artifacts-management-1   0.00%     691.4MiB / 15.61GiB   4.33%     254MB / 13.6GB   0B / 58.3GB      22
190365600caa   artifacts-dashboard-1    0.03%     35.42MiB / 15.61GiB   0.22%     8.14MB / 101MB   254MB / 64.6MB   19
1c6964d0aa55   artifacts-signal-1       17.53%    751MiB / 15.61GiB     4.70%     13.3GB / 6GB     0B / 0B          22
9b323ba0d1ae   artifacts-coturn-1       0.51%     193.4MiB / 15.61GiB   1.21%     0B / 0B          0B / 0B          99
```

It's probably more evident from the RAM graph that shows the automatic 2am restarts of just the management container.

![image](https://github.com/netbirdio/netbird/assets/25849774/a1691db3-747b-47f5-a284-58a69d4d9ec8)
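The nightly workaround described above (restarting only the management container at 2 am) would look roughly like this as a crontab entry; the compose project path and service name are assumptions and will differ per deployment:

```shell
# m h dom mon dow  command
# Restart only the management service nightly at 2 am
# (path /opt/netbird and service name "management" are hypothetical).
0 2 * * * cd /opt/netbird && docker compose restart management
```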


@mlsmaycon commented on GitHub (Apr 9, 2024):

Thanks for sharing the stats and the graphs again.

Can you run it again around 12 PM or 4 PM? We should see a better number there.


@TSJasonH commented on GitHub (Apr 9, 2024):

I ended up having to restart the management container already today, so I grabbed a `docker stats` snapshot before doing so.

```
CONTAINER ID   NAME                     CPU %     MEM USAGE / LIMIT     MEM %     NET I/O           BLOCK I/O        PIDS
36a69c6aecca   artifacts-management-1   189.88%   1.262GiB / 15.61GiB   8.08%     792MB / 41.9GB    0B / 110GB       24
190365600caa   artifacts-dashboard-1    0.04%     35.44MiB / 15.61GiB   0.22%     8.67MB / 111MB    254MB / 64.6MB   19
1c6964d0aa55   artifacts-signal-1       19.21%    751.1MiB / 15.61GiB   4.70%     14.3GB / 6.53GB   0B / 0B          22
9b323ba0d1ae   artifacts-coturn-1       0.30%     207.1MiB / 15.61GiB   1.30%     0B / 0B           0B / 0B          99
```

After the restart:

```
CONTAINER ID   NAME                     CPU %     MEM USAGE / LIMIT     MEM %     NET I/O           BLOCK I/O        PIDS
36a69c6aecca   artifacts-management-1   0.08%     357.8MiB / 15.61GiB   2.24%     2.05MB / 19.8MB   0B / 71.3MB      34
190365600caa   artifacts-dashboard-1    0.03%     35.43MiB / 15.61GiB   0.22%     8.68MB / 111MB    254MB / 64.6MB   19
1c6964d0aa55   artifacts-signal-1       25.11%    751.1MiB / 15.61GiB   4.70%     14.3GB / 6.54GB   0B / 0B          22
9b323ba0d1ae   artifacts-coturn-1       0.74%     216.4MiB / 15.61GiB   1.35%     0B / 0B           0B / 0B          99
```

Users were having trouble connecting, the dashboard couldn't finish loading before it tried to auto-refresh, and the logs were filling with these kinds of messages:

"log":"2024-04-09T13:19:04Z WARN management/server/grpcserver.go:376: failed logging in peer Ej8E5w2a2/LkDrz8VK6doA04AuvIsQzAZbP6v4O05Go=\n","stream":"stderr","time":"2024-04-09T13:19:04.035874189Z"}
{"log":"2024-04-09T13:19:08Z WARN management/server/grpcserver.go:376: failed logging in peer 9mDnkjOBvl4LDogzZCg4El5b8divXv7F88/9q131FjY=\n","stream":"stderr","time":"2024-04-09T13:19:08.932876892Z"}
{"log":"2024-04-09T13:19:52Z WARN management/server/grpcserver.go:376: failed logging in peer Z3K6D5JHntkEsb27SZvJZghyK2eQQnptfrPL0FJRTwM=\n","stream":"stderr","time":"2024-04-09T13:19:52.301454042Z"}
{"log":"2024-04-09T13:19:59Z WARN management/server/grpcserver.go:376: failed logging in peer xuKFibw2/OSVhoufYdpA1kqHmWnG4PfmGRZXIejsOxc=\n","stream":"stderr","time":"2024-04-09T13:19:59.699700349Z"}


@TSJasonH commented on GitHub (Apr 10, 2024):

@mlsmaycon thanks for the troubleshooting chat in Slack.

After your suggestion to revert from SQLite back to the JSON store, things have been working smoothly.
I disabled the nightly management container restart to see what would happen overnight, and indeed the memory usage has been holding very steady.
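For reference, in self-hosted setups of that era the management store engine was selected via an environment variable in the setup configuration. A hedged sketch of the revert; the variable name `NETBIRD_STORE_CONFIG_ENGINE` and its values are assumptions to verify against the docs for your NetBird version:

```shell
# In setup.env (assumption: this variable selects the management
# datastore backend; regenerate/redeploy after changing it):
NETBIRD_STORE_CONFIG_ENGINE="jsonfile"   # revert from "sqlite" to the JSON file store
```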

Latest docker stats:

```
CONTAINER ID   NAME                     CPU %     MEM USAGE / LIMIT     MEM %     NET I/O           BLOCK I/O     PIDS
5ba38625f571   artifacts-management-1   122.27%   859MiB / 15.61GiB     5.37%     677MB / 35.6GB    0B / 19.2MB   23
29ec4549d6de   artifacts-signal-1       31.19%    196.3MiB / 15.61GiB   1.23%     4.78GB / 2.28GB   0B / 0B       21
a1bdd6223fa2   artifacts-coturn-1       2.66%     205MiB / 15.61GiB     1.28%     0B / 0B           0B / 0B       99
3c3b5bce7164   artifacts-dashboard-1    0.04%     34.49MiB / 15.61GiB   0.22%     3MB / 38MB        0B / 64.6MB   19
```

Memory Graph:
![image](https://github.com/netbirdio/netbird/assets/25849774/5d37221c-ec61-47b9-89cc-8d8b49591279)


@TSJasonH commented on GitHub (Jul 12, 2024):

This was resolved with the changes in 0.28.x

Reference: SVI/netbird#773