Netbird Management HA #636

Open
opened 2025-11-20 05:15:00 -05:00 by saavagebueno · 22 comments
Owner

Originally created by @JaSei on GitHub (Feb 15, 2024).

Is your feature request related to a problem? Please describe.
The problem is that the management component lacks high availability (HA) support. Currently, the management component is central to other components like signal and dashboard, and stores data in a JSON file or, experimentally, in SQLite. However, these storage options cannot be shared across multiple instances. This limitation prevents high availability for NetBird as a whole.

Describe the solution you'd like
Therefore, I propose introducing a new database connector for PostgreSQL as an alternative to SQLite. By adding support for PostgreSQL, the service can become stateless and run seamlessly on container orchestration platforms like Kubernetes or Docker Swarm across multiple instances. This change would enable high availability (HA) by allowing the management component to distribute its load and ensure resilience through redundancy.

Describe alternatives you've considered

  • I tried storing the JSON file and SQLite database on AWS EFS (mounted as NFS) to share storage across multiple instances. However, this approach was unsuccessful as it did not support concurrent access effectively, leading to operational failures in multi-instance setups (only one instance was able to handle requests successfully).
  • Another option is to explore other distributed database solutions or storage mechanisms that support concurrent access and are compatible with the existing architecture. PostgreSQL is the preferred option due to its robustness, scalability, and wide adoption in the industry.
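As a rough illustration of why sharing one SQLite file between instances tends to fail: SQLite allows only one writer per database file at a time, and its documentation warns that file locking on network filesystems such as NFS is often unreliable. The sketch below is not NetBird code. It reproduces the single-writer behaviour locally, with two connections standing in for two management instances; the `peers` table and the function name are made up for the example. Whether this is exactly what went wrong on EFS is an assumption.

```python
import os
import sqlite3
import tempfile


def second_writer_is_blocked(db_path: str) -> bool:
    """Return True if a second connection cannot start a write
    while the first connection holds the write lock."""
    # isolation_level=None means autocommit mode, so transactions
    # are opened and closed explicitly below.
    first = sqlite3.connect(db_path, isolation_level=None)
    # timeout=0 makes the second connection fail immediately
    # instead of waiting for the lock to be released.
    second = sqlite3.connect(db_path, timeout=0, isolation_level=None)
    try:
        first.execute(
            "CREATE TABLE IF NOT EXISTS peers (id INTEGER PRIMARY KEY, name TEXT)"
        )
        first.execute("BEGIN IMMEDIATE")  # "instance 1" takes the write lock
        first.execute("INSERT INTO peers (name) VALUES ('peer-a')")
        try:
            second.execute("BEGIN IMMEDIATE")  # "instance 2" tries to write
        except sqlite3.OperationalError:
            return True  # typically reported as "database is locked"
        second.execute("ROLLBACK")
        return False
    finally:
        if first.in_transaction:
            first.execute("ROLLBACK")
        first.close()
        second.close()


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        print(second_writer_is_blocked(os.path.join(tmp, "store.db")))
```

A client/server database such as PostgreSQL avoids this by coordinating concurrent writers inside the server rather than through locks on a shared file.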

Additional context
Achieving high availability for the management component is crucial for ensuring the reliability and scalability of services that depend on NetBirdIO. By enabling the management component to operate across multiple instances without storage bottlenecks, users can leverage container orchestration platforms to achieve better resilience and load distribution. This enhancement would greatly improve NetBirdIO's operational capabilities, particularly in production environments where uptime and scalability are crucial.

saavagebueno added the feature-request and management-service labels 2025-11-20 05:15:00 -05:00

@surik commented on GitHub (Feb 15, 2024):

Hi @JaSei, thank you for the request and the detailed explanation. I want to let you know that we have plans for PostgreSQL support; see our public roadmap: https://github.com/netbirdio/netbird/projects/2


@JaSei commented on GitHub (Feb 15, 2024):

That's amazing. Thanks for the roadmap. In this case, this ticket is probably useless.


@ez1976 commented on GitHub (Apr 9, 2024):

Hi.
Until this is implemented, is there any other way besides replicating the instance together with its SQL DB?
That is, creating another copy with dedicated relay servers, where any change made on one instance is also applied to the replica.
That way we could use a load balancer, and any peer could connect to any instance and get the same routes and policies (of course, each instance would use its own relay).
I am afraid of relying on a single instance for the whole solution if we plan to replace a traditional VPN with NetBird.


@awapf commented on GitHub (May 3, 2024):

What happens if the management service goes down for 1min, 1h, 24h?
How long do existing connections keep working? On my test installation, established connections are still available. I used the quickstart with Keycloak. Besides the backup section, is there any description of how to add multiple signal and coturn servers on different hosts? Do additional signal and coturn servers improve availability?


@Tivin-i commented on GitHub (May 3, 2024):

I am wondering, if by default netbird is recommending Zitadel and that spins up a cockroachDB instance, wouldn't it make more sense to leverage cockroachDB rather than Postgres?


@ykorzikowski commented on GitHub (Jun 19, 2024):

I noticed that a netbird client configured to route a local network (e.g. 10.0.0.0/24) will lose its configuration if the management API stays down long enough: not during short outages, but after a longer period of time.


@ghost commented on GitHub (Jul 1, 2024):

Has anyone made a distributed sqlite setup? With something like libsql/Turso, rqlite, LiteFS, Cloudflare D1?


@dkrhodes commented on GitHub (Jul 28, 2024):

I succeeded in deploying the dashboard, management, and signal services in a k8s cluster, using Keycloak as the IdP. Keycloak and management use the same PostgreSQL instance.

Everything works perfectly.

Here is my question:

Can I scale the management or signal service replicas from 1 to 2 or more to help with HA?


@ednxzu commented on GitHub (Aug 20, 2024):

There is this piece of doc now (https://docs.netbird.io/selfhosted/postgres-store) about the Postgres datastore.

I guess this means (even if still in beta/early access) that HA setups are a thing now?

@klinux commented on GitHub (Aug 21, 2024):

I configured NetBird with PostgreSQL on k8s, but if I scale management or signal from 1 to 2, the client can't connect or get correct domain routing. Is there some configuration I need to do, or is this not possible in the open source version?

Thank you


@Oriann commented on GitHub (Sep 17, 2024):

> There is this piece of doc now (https://docs.netbird.io/selfhosted/postgres-store) about postgres datastore.
>
> I guess this means (even if still in beta/early access) that HA setups are a thing now ?

Same question... Is Postgres still experimental? Is HA management currently being worked on? It would prove very useful to have a management cluster. I know I can do this with VM HA and replication, but it's not resource-efficient and it may not work as expected.

@netandreus commented on GitHub (Oct 23, 2024):

Any news, mates? This is needed.


@creeram commented on GitHub (Dec 27, 2024):

> I had configured Netbird with postgresql on k8s, but if I scale management or signal from 1 to 2, the client can't connect or get correct domain routing. There some configurations that I need to do, or it's not possible in open source version?
>
> Thank you

@klinux Do you have any workaround for this? Did you manage to make it work by scaling to multiple replicas?


@ayprof commented on GitHub (Feb 7, 2025):

Any updates?


@adam-skalicky commented on GitHub (Feb 21, 2025):

I am also very interested in this. Adding node and/or geo redundancy to management would be a huge win!


@pwd-rh commented on GitHub (Mar 21, 2025):

I would also like to see geo-redundancy. Nebula has this now but does not have a management console.


@nazarewk commented on GitHub (Mar 21, 2025):

To give you some form of an update: I noticed the question came up in the last Kubernetes Operator webinar (https://youtu.be/jil3-DtWgjY) and was answered.

It is currently possible to use an external database (Postgres) managed independently of the management components. That should in turn make it possible to spin up a second (failover) set of components and configure the reverse proxies to divert traffic to it while the primary instance is unavailable.

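The active/passive arrangement described in the comment above can be sketched roughly as follows. This is a minimal illustration of the routing decision only, not NetBird or reverse-proxy code: the hostnames and the `is_healthy` callback are hypothetical, and in practice the decision would usually be delegated to the proxy itself (for example, nginx and HAProxy both support marking a server as `backup`).

```python
from typing import Callable, Optional, Sequence


def pick_upstream(
    upstreams: Sequence[str],
    is_healthy: Callable[[str], bool],
) -> Optional[str]:
    """Return the first healthy upstream in priority order.

    The primary comes first in `upstreams`; the failover set is used
    only while everything ahead of it is reported unhealthy.
    Returns None when no upstream is healthy.
    """
    for url in upstreams:
        if is_healthy(url):
            return url
    return None


if __name__ == "__main__":
    # Hypothetical endpoints; both management sets are assumed to
    # point at the same external Postgres database.
    upstreams = [
        "https://mgmt-primary.example.internal",
        "https://mgmt-failover.example.internal",
    ]
    # Simulated health state: the primary is down.
    health = {upstreams[0]: False, upstreams[1]: True}
    print(pick_upstream(upstreams, lambda url: health[url]))
```

Only one set of components serves traffic at any moment, so this remains active/passive: it shortens outages but does not spread load across instances.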

@adam-skalicky commented on GitHub (Mar 26, 2025):

Hi @nazarewk, that is a great step in the right direction, but it sounds like active/passive rather than a true active/active multi-node environment. Do you know if this is something that is on the roadmap?


@nazarewk commented on GitHub (Mar 26, 2025):

I have an official update:

There is no plan to support any additional form of HA for unlicensed self-hosting.
On the other hand, a licensed self-hosting offering is geared towards large enterprises and is expected to eventually implement all of the cloud features, including HA.


You can get in touch through our enterprise contact form (https://www.netbird.io/demo?form=enterprise) for more information.

@RHDHV-simon-sutcliffe commented on GitHub (May 6, 2025):

> I have an official update:
>
> There is no plan to support any additional form of HA for unlicensed self-hosting. On the other hand, a licensed self-hosting offering is geared towards large enterprises and is expected to eventually implement all of the cloud features, including HA.

Is there any information or timeline on this suggested licensed self-hosted version?


@nazarewk commented on GitHub (May 8, 2025):

> Is there any information timeline on this suggested licenses self-hosted version?

Please get in touch through our enterprise contact form (https://www.netbird.io/demo?form=enterprise) for more information.

@awapf commented on GitHub (May 28, 2025):

> I would also like to see geo-redundancy. Nebula has this now but does not have a management console.

That is exactly why we built meshadmin: https://gitlab.com/meshadmin/meshadmin. Still, I like netbird.

Reference: SVI/netbird#636