[Important notification] NetBird Cloud Service Interruption #694

Closed
opened 2025-11-20 05:16:04 -05:00 by saavagebueno · 6 comments

Originally created by @mlsmaycon on GitHub (Mar 10, 2024).

Originally assigned to: @mlsmaycon on GitHub.

Hello,

We are writing to inform you of a service interruption that occurred within our NetBird cloud platform, specifically affecting our Management service. This incident took place between 4:00 am and 5:15 am UTC, during which time the service was temporarily unavailable.

In our efforts to swiftly resolve the issue, an unintended bug was introduced, resulting in an incorrect response from our Management service. This, unfortunately, led to connected clients experiencing a pending login state.

To address this and restore your service to its full functionality, we kindly request that you restart your clients. This action should prompt an immediate reconnection. If your machines were not connected during the time of the incident, there is no action needed.

We sincerely apologize for any inconvenience this incident and the subsequent bug may have caused you. Please rest assured, we are taking comprehensive measures to not only rectify the current situation but also to implement enhanced protocols to prevent such occurrences in the future. Our team is conducting a thorough review of our systems and processes to strengthen our service reliability and your user experience. Your trust in our services is paramount, and we are committed to ensuring the highest standards of service reliability and customer satisfaction. Should you have any questions or require further assistance, please do not hesitate to contact our support team.

Thank you for your understanding and continued support.

Regards,
The NetBird Team

saavagebueno added the cloudstatus labels 2025-11-20 05:16:04 -05:00

@SISheogorath commented on GitHub (Mar 10, 2024):

First of all: no biggie, this can happen to anyone.

I would love to read a proper [post-mortem](https://sre.google/workbook/postmortem-culture/) about it and what actions come out of it. Not so much on the server side, but especially on the client side. I was away from home today, and having a properly tight home network meant I [had to become a bit creative](https://git.shivering-isles.com/shivering-isles/infrastructure-gitops/-/commit/c0b238b64c48ad3bbe5d5e00a8097c00de9918f0) to restore connectivity. But at least I could. I can imagine there are situations where that's not possible.

This kind of "I'm stuck in an unhealthy state" shouldn't appear on clients.

@tarasglek commented on GitHub (Mar 10, 2024):

I love the ambition behind NetBird, but I am sad that signs of immaturity are delaying me from considering NetBird commercially.

It's concerning that the downtime occurred at 4 am UTC, but there was no communication about it until after 10 am UTC. Likewise, I could not find an equivalent of https://status.tailscale.com/ to check whether the problem was on my end or on NetBird's end.


@lalik77 commented on GitHub (Mar 12, 2024):

I have a computer running Ubuntu, which performs certain tasks and is located 100 km away from me. After an interruption that occurred within our NetBird cloud platform, my computer is still disconnected. Should I now go and start NetBird up?
![2024-03-12_13-42-31](https://github.com/netbirdio/netbird/assets/45859153/ab1bfd9c-6d5f-4611-80b1-af14a5984c12)

@mlsmaycon commented on GitHub (Mar 12, 2024):

Hello @lalik77, with the new release there is a chance the client will be upgraded today, but that depends on whether automatic updates are enabled in the OS configuration.


@mlsmaycon commented on GitHub (Mar 12, 2024):

@tarasglek, a status page is in our plans as part of the action items to prevent this from happening and improve our communication regarding incidents.


@mlsmaycon commented on GitHub (Mar 12, 2024):

@SISheogorath, to give you more context, the client has an automatic retry that runs for 3 months when it loses its connection to the management and signal services. This retry runs until the management service returns a response that triggers a needs-login state, which causes the management and signal connection clients to exit.

With the change in #1690, we are introducing a new layer of retry, which won't depend on the response from the management; it will run for up to 14 days in longer intervals (up to 60 minutes), and it will be similar to a ping message that in a similar case would reconnect the client. The 0.26.3 version already contains this change.
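To illustrate the kind of retry layer described above, here is a minimal Python sketch of a capped backoff loop. It is not NetBird's implementation (the client is a separate codebase); only the 60-minute interval cap and the 14-day window come from the comment. The starting interval, the growth factor, and the names `retry_delays`, `reconnect`, and `probe` are hypothetical and chosen for illustration.

```python
import time
from datetime import timedelta
from typing import Callable, Iterator


def retry_delays(
    initial: timedelta = timedelta(minutes=1),        # assumption: starting interval
    multiplier: float = 2.0,                          # assumption: growth factor
    max_interval: timedelta = timedelta(minutes=60),  # cap mentioned in the comment
    max_elapsed: timedelta = timedelta(days=14),      # window mentioned in the comment
) -> Iterator[timedelta]:
    """Yield growing wait times, capped per step and bounded in total."""
    elapsed = timedelta(0)
    delay = initial
    # Stop once the next wait would push the total past the allowed window.
    while elapsed + delay <= max_elapsed:
        yield delay
        elapsed += delay
        delay = min(delay * multiplier, max_interval)


def reconnect(
    probe: Callable[[], bool],
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Call probe() until it succeeds or the retry window is used up.

    probe stands in for a lightweight, ping-like check; the schedule does
    not depend on what the server answered on earlier attempts.
    """
    for delay in retry_delays():
        if probe():
            return True
        sleep(delay.total_seconds())
    return False
```

With these assumed defaults, the waits grow as 1, 2, 4, 8, 16, and 32 minutes and then stay at 60 minutes until the 14-day window is exhausted. Because the loop is driven only by elapsed time, an unexpected answer from the server cannot end it early.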

On the morning of the incident, we updated the Management service to better handle the database event that caused the issue, and introduced new tests to validate similar cases under higher load, as we see in production.

Reference: SVI/netbird#694