Mirror of https://github.com/netbirdio/netbird.git (synced 2026-05-15 04:32:40 -04:00)
When leveraging highly available routing peers in the same location, we experience route flapping periodically. #989
Closed · opened 2025-11-20 05:21:09 -05:00 by saavagebueno · 12 comments
Originally created by @briemann on GitHub (Jun 18, 2024).
Describe the problem
We're periodically seeing an issue crop up in our ecosystem. We've set up our routing peers in a high-availability pair for each environment, and during these windows random clients experience route flapping: assets behind the routing peer (RP) nodes remain accessible but become extremely slow due to the route flaps.
We've got a variety of 0.27 Windows clients in the fleet that have all reproduced this behavior, and I experienced it myself this morning.
Generally the remedy is to shut the Windows service down for a few minutes and start it back up. However, this is less than ideal long term, as we want non-power users to be leveraging this solution, and telling them to perform technical steps will fall on deaf ears.
As I suspect we're more of an edge case because of this HA setup, we're going to stop the NetBird client on one of the two nodes in each location to try to isolate whether the issue is related to the HA pair or is client-side. I just wanted to open this issue to see what we could supply in the meantime to get clarity from other avenues.
To Reproduce
By its nature, it's not reproducible on demand. Generally it happens on the first connection of the day: 9 out of 10 days logging in will be fine, but that one day it won't be.
Expected behavior
For the client not to flap when attempting to create a route.
Are you using NetBird Cloud?
No. Self Hosted.
NetBird version
0.27.10
NetBird status -d output:
See attached.
Screenshots
See attached output.log
Additional context
It looks almost like the route manager has an issue with one of the RPs: it sees the RP come online and flips the routes over to the other node because that peer has a slightly better score. I guess the fix for this would be to make peers aware that they are right next to each other, and to accept that peers in the same environment can have slightly different scores, so the routes don't flap? Not sure, maybe I am off-base.
netbird_output.log
netbird_status_output.txt
@pascal-fischer commented on GitHub (Jun 19, 2024):
Hi @briemann,
I've checked the logs, and the scores are noticeably different: 0.93... vs. 2.92..., which means the flapping is not due to similar latency to the routing peers. To me it looks like one of the routing peers might be reconnecting (or attempting to), and the connection type is switching between e.g. P2P and relay, causing the difference of 2 (the rest is due to latency).
Once you have the issue could you gather some debug logs (follow Debug for a specific time) to figure out what is causing the score to change so frequently.
It might also be possible to catch whatever is happening by running the netbird status command multiple times during one of the flaps.
@briemann commented on GitHub (Jun 21, 2024):
@pascal-fischer sure thing, we turned off one node on all of the high availability pairs this week to do some other testing and will bring them back online next week. At that point i'll try to replicate or see how long it takes and will report back when we're able to.
@LeszekBlazewski commented on GitHub (Jul 5, 2024):
Hey @pascal-fischer,
I just wanted to bump this issue, as I am observing the same problems. So far I have been loving the product, but the high-availability setup results in the issue described above by @briemann.
Context
Self-hosted NetBird server v0.28.4 deployed on a public EC2 instance, and NetBird routing peers with multiple replicas deployed in three different EKS clusters (each set of routing peers in a given cluster is responsible for routing the traffic within the VPC in which it runs). Around 30 users use the whole stack. We are using NetBird as a secure way to access our resources running in private subnets. All of the NetBird routing peers inside Kubernetes also run in private subnets, and we spread them across 3 different availability zones in N. Virginia (so for each set of peers there are 3 different EC2 nodes in 3 different availability zones). The architecture looks like this:
TBH I am not 100% sure whether the connections between peers actually look like this, because they all show as relay/relay from my macOS client (I assume this is because the routing k8s peers are in private subnets). I would actually appreciate some guidance on the setup: is using private subnets for the NetBird routing peers suboptimal compared to public peers in each VPC? If I understand correctly, the TCP/UDP hole-punching mechanism sends all of the traffic via the public internet regardless of where the peers are spawned. Still, using private subnets (not publicly routable) means that all connections must be relayed, and if the routing peers were publicly routable we could achieve full P2P connections, right?
Netbird setup
Issue
From time to time, when connecting to NetBird from the macOS client, we see constantly switching routes. Traffic gets routed across those switching peers, and requests either time out or everything loads really slowly. In the client logs it looks more or less like this:
Macos client logs
And the above logs just keep going. Only disconnecting and then reconnecting fixes the issue, and that sometimes needs to happen a few times since it can reoccur. Since discovering this, I have decreased the number of NetBird peers running inside each Kubernetes cluster from 2 to 1 and couldn't reproduce the issue anymore (once routes are chosen, they are simply used, because there is just 1 routing peer). This is of course not ideal, because whenever we need to do maintenance on the node that runs the NetBird client, we are locked out of that VPC (until the peer starts up again). So I would really like to restore the HA setup if it proves to work OK.
I can provide further logs and other troubleshooting details as needed.
@hurricanehrndz commented on GitHub (Jul 12, 2024):
So I have studied this. One possible workaround is to compare the route update with the current routes; if the routes aren't different, recalculating the routes should be skipped. This works for us because, leveraging the management ACLs, we let clients have access to the routing peers but block clients from seeing each other.
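The comparison described above could be sketched like this. The helper name and route representation are hypothetical, chosen only to illustrate the idea of skipping recalculation on an unchanged update; this is not the code from the actual PR.

```go
package main

import (
	"fmt"
	"sort"
)

// routesEqual illustrates the workaround: before triggering a route
// recalculation, compare the incoming route update against the currently
// known routes and skip the recalculation when the set is unchanged,
// regardless of ordering in the update message.
func routesEqual(current, update []string) bool {
	if len(current) != len(update) {
		return false
	}
	a := append([]string(nil), current...)
	b := append([]string(nil), update...)
	sort.Strings(a)
	sort.Strings(b)
	for i := range a {
		if a[i] != b[i] {
			return false
		}
	}
	return true
}

func main() {
	current := []string{"10.0.0.0/16 via peer-a", "10.0.0.0/16 via peer-b"}
	update := []string{"10.0.0.0/16 via peer-b", "10.0.0.0/16 via peer-a"}
	if routesEqual(current, update) {
		fmt.Println("route update unchanged; skipping recalculation")
	} else {
		fmt.Println("routes changed; recalculating best route")
	}
}
```

The point of the check is that a reordered but otherwise identical update no longer forces a fresh best-route selection, which is one way repeated flapping between equivalent HA peers could be avoided.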
@hurricanehrndz commented on GitHub (Jul 12, 2024):
I will submit a PR
@hurricanehrndz commented on GitHub (Jul 14, 2024):
@LeszekBlazewski I just looked over your log again, and something else is going on with your setup. It seems to me like one of the routing peers is disconnecting and reconnecting. Did you ever capture the routing peer's logs to see why it keeps disconnecting and reconnecting? I would turn on debug on the routing peer and capture the log to find out.
My PR would do very little in this situation, because getBestRoute would still be called. I also don't think latency is the issue at play here; the logs show that latency is pretty stable with little to no deviation. Additionally, I believe that one of the routing peers is making a direct connection to the client via the other routing peer's tunnel. I say this because the score of one routing peer is 2 points better than the other's.
So perhaps the solution for your situation would be to ensure that port 51820 is not accessible on the private address space or disable the relay and make the routing peers publicly accessible.
To answer your question, I trust the WireGuard tech more than I do STUN, so I'd rather avoid using relays when possible and just have a public subnet.
@LeszekBlazewski commented on GitHub (Jul 16, 2024):
Thank you for the detailed answer @hurricanehrndz. I appreciate all of your feedback. It makes sense; I will move my routing peers to a public subnet and check the setup again.
@LeszekBlazewski commented on GitHub (Jul 22, 2024):
Hi @hurricanehrndz,
I have a quick question regarding your suggested changes:
I have executed some tests by moving the NetBird routing peers to public subnets, but observed that the routing peers would randomly join either as P2P with Direct: true, or, same as before, with a relay/relay connection type (observed from my macOS peer, which uses those peers for routing traffic for an HA network route specified for 10.X.0.0/16). As you mentioned, those 2 routing peers are running in 2 different public AWS subnets (same VPC), but I was wondering why relay connections are chosen instead of direct even after the routing peers are made public. The AWS subnets in which those 2 peers run only have two routes in their routing table:
0.0.0.0/0 pointing to the internet gateway
10.X.0.0/16 pointing to local (so resources running in private subnets are accessible from the public subnet)
I started to think that, as you said, the following is probably true:
I am just not 100% sure whether I understand this correctly. By private space you mean the whole 10.X.0.0/16 range (the VPC CIDR block), right? In that case, to avoid those connections going through the other peer's tunnel, I would have to add a firewall rule on the public routing peers that blocks the 10.X.0.0/16 range on the WG port 51820? Something like this?:
@LeszekBlazewski commented on GitHub (Jul 22, 2024):
Quick update to the above: I have run further tests, and it turned out that the constantly reconnecting peer (observable in the logs above) was caused by the use of an exit node. One of the sets of routing peers was assigned as an exit node, and that caused the connections of a few other peers to jump constantly between P2P and relay, which made the connection very unreliable. In the debug logs I observed that, as @hurricanehrndz pointed out, one of the peers was trying to use the other peer's tunnel.
Moreover, after disabling the exit node and running some tests, I noticed that in order to have a proper direct P2P connection to a public EC2 NetBird client, one has to open up port 51820 for all UDP traffic. Without that, I was able to connect to the peer, but the connection from my macOS peer showed as relay most of the time and only sometimes went P2P. As soon as I opened the port, it's always P2P. This has been mentioned here: https://github.com/netbirdio/netbird/issues/254#issuecomment-2229534046.
Since I have been planning to move to a full split-DNS and routing setup, the above-mentioned issue is solved for me. Thank you all for providing such an enormous amount of information. NetBird is a great product, keep up the good work!
@hurricanehrndz commented on GitHub (Jul 23, 2024):
Ah yeah, opening the port is key. Glad you figured it out!
@nazarewk commented on GitHub (Apr 23, 2025):
@briemann Did you resolve the issue? Any idea whether this is still happening with the latest NetBird version?
@briemann commented on GitHub (Apr 23, 2025):
@nazarewk We upgraded 0.27.5 -> 0.36.5 of the management plane a month back and haven't seen issues since. We can close out this issue, thanks!