[PR #4791] [relay-server] Fix race condition in relay peer reconnection handling #4405

Open
opened 2025-11-20 08:08:19 -05:00 by saavagebueno · 0 comments
Owner

Original Pull Request: https://github.com/netbirdio/netbird/pull/4791

State: open
Merged: No


When a peer reconnects with the same ID, other peers were not reliably notified that the old connection went offline. This caused "connection already exists" errors when attempting to establish new connections to the reconnected peer.

The issue occurred because the old peer's cleanup notification raced with the new peer's online notification. If reconnection happened before cleanup, the offline notification was silently dropped.

The fix sends an offline notification synchronously during reconnection (when AddPeer returns true), ensuring all subscribed peers receive events in the correct order (offline → online).

Added TestBindReconnectRace to validate the fix with 1000 reconnection iterations.

Describe your changes

Stack

Checklist

  • Is it a bug fix
  • Is a typo/documentation fix
  • Is a feature enhancement
  • It is a refactor
  • Created tests that fail without the change (if possible)

By submitting this pull request, you confirm that you have read and agree to the terms of the Contributor License Agreement.

Documentation

Select exactly one:

  • I added/updated documentation for this change
  • Documentation is not needed for this change (explain why)

Docs PR URL (required if "docs added" is checked)

Paste the PR link from https://github.com/netbirdio/docs here:

https://github.com/netbirdio/docs/pull/__

Summary by CodeRabbit

  • Bug Fixes

    • Reconnection flow now emits offline/online notifications to keep peer state consistent.
  • Tests

    • Added a repeatable race-detection test for reconnect scenarios and made critical test failures fail-fast; improved test cleanup and logging.
  • Chores

    • CI test workflow updated to enable additional test tags and more verbose race-mode output.
  • Refactor

    • Manager construction changed to use a per-instance options struct, exposing configurable MTU, cleanup interval, and idle timeout.
**Original Pull Request:** https://github.com/netbirdio/netbird/pull/4791 **State:** open **Merged:** No --- When a peer reconnects with the same ID, other peers were not reliably notified that the old connection went offline. This caused "connection already exists" errors when attempting to establish new connections to the reconnected peer. The issue occurred because the old peer's cleanup notification raced with the new peer's online notification. If reconnection happened before cleanup, the offline notification was silently dropped. The fix sends an offline notification synchronously during reconnection (when AddPeer returns true), ensuring all subscribed peers receive events in the correct order (offline → online). Added TestBindReconnectRace to validate the fix with 1000 reconnection iterations. ## Describe your changes ## Issue ticket number and link ## Stack <!-- branch-stack --> ### Checklist - [x] Is it a bug fix - [ ] Is a typo/documentation fix - [ ] Is a feature enhancement - [ ] It is a refactor - [ ] Created tests that fail without the change (if possible) > By submitting this pull request, you confirm that you have read and agree to the terms of the [Contributor License Agreement](https://github.com/netbirdio/netbird/blob/main/CONTRIBUTOR_LICENSE_AGREEMENT.md). ## Documentation Select exactly one: - [ ] I added/updated documentation for this change - [x] Documentation is **not needed** for this change (explain why) ### Docs PR URL (required if "docs added" is checked) Paste the PR link from https://github.com/netbirdio/docs here: https://github.com/netbirdio/docs/pull/__ <!-- This is an auto-generated comment: release notes by coderabbit.ai --> ## Summary by CodeRabbit * **Bug Fixes** * Reconnection flow now emits offline/online notifications to keep peer state consistent. * **Tests** * Added a repeatable race-detection test for reconnect scenarios and made critical test failures fail-fast; improved test cleanup and logging. * **Chores** * CI test workflow updated to enable additional test tags and more verbose race-mode output. * **Refactor** * Manager construction changed to use a per-instance options struct, exposing configurable MTU, cleanup interval, and idle timeout. <!-- end of auto-generated comment: release notes by coderabbit.ai -->
saavagebueno added the pull-request label 2025-11-20 08:08:19 -05:00
Sign in to join this conversation.
1 Participants
Notifications
Due Date
No due date set.
Dependencies

No dependencies set.

Reference: SVI/netbird#4405