Management service won't start after upgrade to v0.54.1+ due to code to disable foreign key checks #2216

Open
opened 2025-11-20 07:06:00 -05:00 by saavagebueno · 8 comments
Owner

Originally created by @ressys1978 on GitHub (Aug 21, 2025).

Describe the problem

Management service won't start due to managemnt/server/migration.go trying to create duplicate index.

I'm using AWS Aurora MySQL (MySQL 8.0 compatible) as the storage engine.

I upgraded to v0.54.0, which went fine.

I then upgraded to 0.54.1, and the Management service, and it errors out with:

2025-08-21T15:08:08Z FATL management/internals/server/boot.go:62: failed to create store: Error 1227 (42000): Access denied; you need (at least one of) the SUPER or SYSTEM_VARIABLES_ADMIN privilege(s) for this operation

Then it goes into a restart loop.

To Reproduce

Steps to reproduce the behavior:

  1. Upgrade to v0.54.1 or higher using AWS Aurora MySQL (MySQL 8.0 compatible) as the storage engine

Expected behavior

I'd expect the management service to run, and not error out and enter a restart loop.

Are you using NetBird Cloud?

Self-hosted

NetBird version

v0.54.1 and can re-create right up to v 0.55.1

Is any other VPN software installed?

No

Debug output

From the management service logs:

2025-08-21T15:08:08Z FATL management/internals/server/boot.go:62: failed to create store: Error 1227 (42000): Access denied; you need (at least one of) the SUPER or SYSTEM_VARIABLES_ADMIN privilege(s) for this operation

Screenshots

N/A

Additional context

Seems the changes in https://github.com/netbirdio/netbird/pull/4308 have caused this, specifically the lines 82-84 in management/server/store/sql_store.go

If I custom compile the management service and comment out lines 82 to 84, I can get the management service to run successfully.

With AWS Aurora MySQL (MySQL 8.0 compatible), it's not possible to grant SUPER or SYSTEM_VARIABLES_ADMIN, so it's not possible to "SET GLOBAL FOREIGN_KEY_CHECKS = 0".

It is possible for the session to run "SET FOREIGN_KEY_CHECKS = 0;", which is safer as disabling foreign key checks permanently is bad.

Here's why you don't want to disable foreign key checks permanently:

  • Orphaned rows (child rows with no parent)
  • Inconsistent data
  • Hard-to-debug application bugs
  • Corrupted relationships

But, reviewing the code I don't see anything that would need those foreign key checks to be disabled in the first place.

I would recommend removing the code that disables foreign key checks completely as eventually having them disabled will cause issues with the database.

Have you tried these troubleshooting steps?

  • Reviewed client troubleshooting (if applicable)
  • Checked for newer NetBird versions
  • Searched for similar issues on GitHub (including closed ones)
  • Restarted the NetBird client
  • Disabled other VPN software
  • Checked firewall settings
Originally created by @ressys1978 on GitHub (Aug 21, 2025). **Describe the problem** Management service won't start due to managemnt/server/migration.go trying to create duplicate index. I'm using AWS Aurora MySQL (MySQL 8.0 compatible) as the storage engine. I upgraded to v0.54.0, which went fine. I then upgraded to 0.54.1, and the Management service, and it errors out with: 2025-08-21T15:08:08Z FATL management/internals/server/boot.go:62: failed to create store: Error 1227 (42000): Access denied; you need (at least one of) the SUPER or SYSTEM_VARIABLES_ADMIN privilege(s) for this operation Then it goes into a restart loop. **To Reproduce** Steps to reproduce the behavior: 1. Upgrade to v0.54.1 or higher using AWS Aurora MySQL (MySQL 8.0 compatible) as the storage engine **Expected behavior** I'd expect the management service to run, and not error out and enter a restart loop. **Are you using NetBird Cloud?** Self-hosted **NetBird version** v0.54.1 and can re-create right up to v 0.55.1 **Is any other VPN software installed?** No **Debug output** From the management service logs: 2025-08-21T15:08:08Z FATL management/internals/server/boot.go:62: failed to create store: Error 1227 (42000): Access denied; you need (at least one of) the SUPER or SYSTEM_VARIABLES_ADMIN privilege(s) for this operation **Screenshots** N/A **Additional context** Seems the changes in https://github.com/netbirdio/netbird/pull/4308 have caused this, specifically the lines 82-84 in management/server/store/sql_store.go If I custom compile the management service and comment out lines 82 to 84, I can get the management service to run successfully. With AWS Aurora MySQL (MySQL 8.0 compatible), it's not possible to grant SUPER or SYSTEM_VARIABLES_ADMIN, so it's not possible to "SET GLOBAL FOREIGN_KEY_CHECKS = 0". It is possible for the session to run "SET FOREIGN_KEY_CHECKS = 0;", which is safer as disabling foreign key checks permanently is bad. Here's why you don't want to disable foreign key checks permanently: - Orphaned rows (child rows with no parent) - Inconsistent data - Hard-to-debug application bugs - Corrupted relationships But, reviewing the code I don't see anything that would need those foreign key checks to be disabled in the first place. I would recommend removing the code that disables foreign key checks completely as eventually having them disabled will cause issues with the database. **Have you tried these troubleshooting steps?** - [X] Reviewed [client troubleshooting](https://docs.netbird.io/how-to/troubleshooting-client) (if applicable) - [X] Checked for newer NetBird versions - [X] Searched for similar issues on GitHub (including closed ones) - [X] Restarted the NetBird client - [X] Disabled other VPN software - [X] Checked firewall settings
saavagebueno added the triage-needed label 2025-11-20 07:06:00 -05:00
Author
Owner

@Westie commented on GitHub (Sep 13, 2025):

I agree, this behaviour is bad, and should not under any circumstances be considered acceptable (outside of running migrations)

@Westie commented on GitHub (Sep 13, 2025): I agree, this behaviour is bad, and should not under any circumstances be considered acceptable (outside of running migrations)
Author
Owner

@Westie commented on GitHub (Sep 13, 2025):

@pascal-fischer can you give us some insight as to why there was a need to set a global variable in this manner?

@Westie commented on GitHub (Sep 13, 2025): @pascal-fischer can you give us some insight as to why there was a need to set a global variable in this manner?
Author
Owner

@Westie commented on GitHub (Sep 27, 2025):

FYI @ressys1978 because of the fact that this requires superadmin access to a database to run with MySQL, I e-mailed security@netbird.io - unfortunately they didn't get back to me.

It's obviously clear that netbird do not care about security.

@Westie commented on GitHub (Sep 27, 2025): FYI @ressys1978 because of the fact that this requires superadmin access to a database to run with MySQL, I e-mailed security@netbird.io - unfortunately they didn't get back to me. It's obviously clear that netbird do not care about security.
Author
Owner

@Westie commented on GitHub (Oct 4, 2025):

@pascal-fischer @crn4 any updates to this?

@Westie commented on GitHub (Oct 4, 2025): @pascal-fischer @crn4 any updates to this?
Author
Owner

@pascal-fischer commented on GitHub (Oct 6, 2025):

Sorry, I missed this one.

The issue was on how MySQL does locking on parent rows when accessing a child with foreign keys configured. While Postgres requires a shared lock on the parent row, MySQL requires an exclusive lock. For Postgres this is working fine, MySQL is running the risk of a deadlock. By disabling the FOREIGN_KEY_CHECKS we could circumvent this for the time being. As we do the cleanup of child rows manually in the code and do not depend on cascading deletes there should be no risk of orphaned rows.

I agree that this was not the ideal solution but it seemed to be an acceptable one without changing how we accessed the data as this is a much bigger refactor.

I will think if there is another way for us to avoid the deadlocks without disabling the foreign key checks.

@pascal-fischer commented on GitHub (Oct 6, 2025): Sorry, I missed this one. The issue was on how MySQL does locking on parent rows when accessing a child with foreign keys configured. While Postgres requires a shared lock on the parent row, MySQL requires an exclusive lock. For Postgres this is working fine, MySQL is running the risk of a deadlock. By disabling the `FOREIGN_KEY_CHECKS` we could circumvent this for the time being. As we do the cleanup of child rows manually in the code and do not depend on cascading deletes there should be no risk of orphaned rows. I agree that this was not the ideal solution but it seemed to be an acceptable one without changing how we accessed the data as this is a much bigger refactor. I will think if there is another way for us to avoid the deadlocks without disabling the foreign key checks.
Author
Owner

@Westie commented on GitHub (Oct 6, 2025):

@pascal-fischer thank you for the explanation, that does make sense. Refactoring DB queries are never easy, especially as to get the behaviour you want, you need to modify MySQL queries in a very specific way.

Was there any particular reason as to why disabling foreign keys across the entire server and/or cluster was warranted?

The main reason why I was quite alarmed (and upset) by the use of GLOBAL was because superuser/root privileges were needed in order to set up and use netbird - I do not feel comfortable with something as sensitive as Netbird being ran with that level of permissions

If you were to use:

if err := db.Exec("SET FOREIGN_KEY_CHECKS = 0").Error; err != nil {

...instead, where foreign key checks are disabled for the current session, would the potential for deadlocks still remain?

@Westie commented on GitHub (Oct 6, 2025): @pascal-fischer thank you for the explanation, that does make sense. Refactoring DB queries are never easy, especially as to get the behaviour you want, you need to modify MySQL queries in a very specific way. Was there any particular reason as to why disabling foreign keys across the entire server and/or cluster was warranted? The main reason why I was quite alarmed (and upset) by the use of `GLOBAL` was because superuser/root privileges were needed in order to set up and use netbird - I do not feel comfortable with something as sensitive as Netbird being ran with that level of permissions If you were to use: ``` if err := db.Exec("SET FOREIGN_KEY_CHECKS = 0").Error; err != nil { ``` ...instead, where foreign key checks are disabled for the current session, would the potential for deadlocks still remain?
Author
Owner

@ressys1978 commented on GitHub (Oct 7, 2025):

Sorry, I missed this one.

The issue was on how MySQL does locking on parent rows when accessing a child with foreign keys configured. While Postgres requires a shared lock on the parent row, MySQL requires an exclusive lock. For Postgres this is working fine, MySQL is running the risk of a deadlock. By disabling the FOREIGN_KEY_CHECKS we could circumvent this for the time being. As we do the cleanup of child rows manually in the code and do not depend on cascading deletes there should be no risk of orphaned rows.

I agree that this was not the ideal solution but it seemed to be an acceptable one without changing how we accessed the data as this is a much bigger refactor.

I will think if there is another way for us to avoid the deadlocks without disabling the foreign key checks.

@pascal-fischer I use Amazon RDS Aurora (MySQL) which has no ability to disable them globally.

A better solution would be when the child row clean up is going to run, disable them for the session, then re-enable them once that clean up is complete.

@ressys1978 commented on GitHub (Oct 7, 2025): > Sorry, I missed this one. > > The issue was on how MySQL does locking on parent rows when accessing a child with foreign keys configured. While Postgres requires a shared lock on the parent row, MySQL requires an exclusive lock. For Postgres this is working fine, MySQL is running the risk of a deadlock. By disabling the `FOREIGN_KEY_CHECKS` we could circumvent this for the time being. As we do the cleanup of child rows manually in the code and do not depend on cascading deletes there should be no risk of orphaned rows. > > I agree that this was not the ideal solution but it seemed to be an acceptable one without changing how we accessed the data as this is a much bigger refactor. > > I will think if there is another way for us to avoid the deadlocks without disabling the foreign key checks. @pascal-fischer I use Amazon RDS Aurora (MySQL) which has no ability to disable them globally. A better solution would be when the child row clean up is going to run, disable them for the session, then re-enable them once that clean up is complete.
Author
Owner

@pascal-fischer commented on GitHub (Oct 10, 2025):

So I created a PR moving to session based checks disabled and the tests did nnot show any issues, no deadlocks.
https://github.com/netbirdio/netbird/pull/4615
We will review this and merge. Please notify me if you see any deadlock errors in your logs at any point.
As MySQL does have automatic deadlock recovery the damage would be on a per request basis, only single requests fail but never the whole management.

@pascal-fischer commented on GitHub (Oct 10, 2025): So I created a PR moving to session based checks disabled and the tests did nnot show any issues, no deadlocks. https://github.com/netbirdio/netbird/pull/4615 We will review this and merge. Please notify me if you see any deadlock errors in your logs at any point. As MySQL does have automatic deadlock recovery the damage would be on a per request basis, only single requests fail but never the whole management.
Sign in to join this conversation.
1 Participants
Notifications
Due Date
No due date set.
Dependencies

No dependencies set.

Reference: SVI/netbird#2216