[Post-mortem] NetBird Cloud Service Interruption 20/Jan/2025 #1573

Open
opened 2025-11-20 05:33:05 -05:00 by saavagebueno · 0 comments
Originally created by @mlsmaycon on GitHub (Jan 26, 2025).

Hello,

On January 20th, 2025, our cloud service suffered an issue that temporarily made the management service unavailable. This post describes the problem and the actions the NetBird team took.

On January 20th, at 18:10 UTC, a new version of the management service was deployed to our production servers. The deployment included several changes aimed at speeding up peer API operations. Soon after the rollout completed, we noticed that the public API had become unresponsive. During the initial investigation, our engineers noticed a lack of new logs for the monitoring service, which indicated a potential lock in either the service or the database layer.

Our immediate action was to roll back to the previous version. This initially restored the service: we could authenticate new peers and see logs flowing again, and all monitoring alerts returned to normal. The backend team then focused on investigating the issue with the new version, but after a few minutes the service experienced another outage, this time on the previous version. On this occasion, the alerts took around 30 minutes to trigger again.

Because the second outage happened on a version that was already a week old, the new release alone could not explain it. We therefore looked at the database layer again and discovered that one of the hosts was experiencing a performance issue caused by memory pressure and swap usage.

Continuing the analysis, we discovered that the main issue was the query used to get the account information. The deploy process was the trigger: both rolling out the new version and reverting to the previous one reconnected all previously connected peers, and each reconnecting peer started a new sync of the network map. Every such sync retrieved a copy of the customer account for the peer requesting the map.
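As an illustration only, and not the management service's actual code, the following Go sketch models that amplification: when every reconnecting peer loads the full account by itself, the number of account queries equals the number of peers. It also shows one generic way to damp such a burst, which is to let concurrent requests for the same account share a single in-flight load. All names here (`Account`, `AccountLoader`, `SimulateReconnects`, the `account-1` ID) are hypothetical, and the slow query is simulated by holding the fetch open until the other peers have joined.

```go
package main

import (
	"sync"
	"sync/atomic"
	"time"
)

// Account is a hypothetical stand-in for the full customer account record.
type Account struct{ ID string }

// call is one in-flight load that later callers can wait on.
type call struct {
	done    chan struct{}
	waiters int
	acc     *Account
	err     error
}

// AccountLoader coalesces concurrent loads of the same account ID
// into a single call to fetch.
type AccountLoader struct {
	mu       sync.Mutex
	inflight map[string]*call
	fetch    func(id string) (*Account, error)
}

func NewAccountLoader(fetch func(id string) (*Account, error)) *AccountLoader {
	return &AccountLoader{inflight: map[string]*call{}, fetch: fetch}
}

// Load returns the account. If a load for the same ID is already running,
// the caller waits for that result instead of issuing another query.
func (l *AccountLoader) Load(id string) (*Account, error) {
	l.mu.Lock()
	if c, ok := l.inflight[id]; ok {
		c.waiters++
		l.mu.Unlock()
		<-c.done
		return c.acc, c.err
	}
	c := &call{done: make(chan struct{})}
	l.inflight[id] = c
	l.mu.Unlock()

	c.acc, c.err = l.fetch(id) // the only query for this burst

	l.mu.Lock()
	delete(l.inflight, id)
	l.mu.Unlock()
	close(c.done) // publish the result to every waiter
	return c.acc, c.err
}

// Waiting reports how many callers are blocked on the in-flight load of id.
func (l *AccountLoader) Waiting(id string) int {
	l.mu.Lock()
	defer l.mu.Unlock()
	if c, ok := l.inflight[id]; ok {
		return c.waiters
	}
	return 0
}

// SimulateReconnects models `peers` peers of one account syncing at the same
// time and returns how many account queries were issued.
func SimulateReconnects(peers int, coalesce bool) int64 {
	var queries int64
	var loader *AccountLoader
	fetch := func(id string) (*Account, error) {
		atomic.AddInt64(&queries, 1)
		if coalesce {
			// Simulated slow query: stay open until all other peers joined.
			for loader.Waiting(id) < peers-1 {
				time.Sleep(time.Millisecond)
			}
		}
		return &Account{ID: id}, nil
	}
	loader = NewAccountLoader(fetch)

	var wg sync.WaitGroup
	for i := 0; i < peers; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			if coalesce {
				loader.Load("account-1")
			} else {
				fetch("account-1") // one full account query per peer
			}
		}()
	}
	wg.Wait()
	return atomic.LoadInt64(&queries)
}
```

In this model, `SimulateReconnects(50, false)` issues 50 account queries, while `SimulateReconnects(50, true)` issues one. Callers that share a result receive the same `*Account` pointer and must treat it as read-only. The `golang.org/x/sync/singleflight` package provides a ready-made version of this pattern.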

To mitigate the issue, our team resized the database VMs, restoring stability to the management service around 19:00 UTC.

In the following days, the team applied a few changes to reduce the number of calls during deployments. Next, we will make further changes to eliminate the need to fetch the whole account when calculating the network map. We will also migrate the database system to a managed service, which will give us more performance and visibility and reduce ownership costs.

During the service interruption, our customers reported being unable to add new machines or authenticate expired sessions. Existing sessions were not disrupted, and peer connections stayed up throughout.

Thank you for your understanding and continued support.
Regards,
The NetBird Team

Reference: SVI/netbird#1573