mirror of
https://github.com/netbirdio/netbird.git
synced 2026-07-03 12:41:56 -04:00
* [client] always clean up on Engine.Start failure via defer The rosenpass init paths (NewManager/Run) returned without calling e.close(), leaking the WireGuard interface and other partially initialized state on failure. Per-branch cleanup was easy to miss when adding new early returns. Convert Start to a named error return and tear down via a single defer that calls e.close() whenever err != nil, removing the scattered per-branch close() calls (including the redundant one in initFirewall). * [client] make Engine single-use and guard against double Start Create the run context once in NewEngine instead of in Start. This keeps e.cancel valid for the engine's whole lifetime, so Stop can cancel a Start that is blocked waiting on the network while holding syncMsgMux: Stop now cancels before taking the lock, unblocking that Start so it can release the mutex. Reject re-entry into Start: a non-nil wgInterface means a prior Start already ran (ErrEngineAlreadyStarted), and a cancelled run context means the engine was stopped (ErrEngineAlreadyStopped). Both checks run before the cleanup defer so a duplicate call cannot tear down the running engine's state. * [client] let engine context unblock WaitStreamConnected WaitStreamConnected only watched the signal client's own context, which derives from the parent engineCtx rather than the engine's run context. A Start blocked here (signal stream not yet up) could therefore not be released by Engine.Stop, since Stop only cancels the engine's run context. Pass a context into WaitStreamConnected and select on it too, and have the engine pass e.ctx, so Stop cancelling e.ctx unblocks a parked Start. Update the Client interface, the mock, and callers accordingly. * [client] fix Start/Stop race by making the run loop own engine shutdown ConnectClient.Stop stopped the engine directly while the run loop's backoff cycle could still be starting an engine, so Engine.close raced Engine.Start (e.g. firewall setup reading wgInterface while close nils it). embed.Client.Start's rollback only avoided a deadlock by cancelling before Stop; the race itself remained and was caught by -race. Make the run loop the sole owner of engine shutdown: derive the run context in NewConnectClient, and have Stop cancel it and wait for the loop to exit (skipping the wait when the loop never ran) instead of calling engine.Stop. The loop now always stops the engine on its way out, dropping the unsynchronised wgInterface check it used to guard that call. Self-calls from within the loop use runCancel to avoid waiting on themselves. embed keeps a defensive pre-Stop cancel(); the daemon's cleanupConnection gets a TODO to adopt Stop() rather than stopping the engine in parallel. * [client] init context state in engine tests Engine tests built the engine context with context.WithCancel( context.Background()), omitting CtxInitState. Now that the run context is created in the constructor, the wgIfaceMonitor goroutine can reach triggerClientRestart during teardown, which calls CtxGetState and panics on the missing state. Real entry points (up, embed, service) always CtxInitState; only the tests skipped it. * [client] interrupt connect backoff on context cancel The run loop retried with a raw ExponentialBackOff, so a backoff sleep ignored context cancellation. Now that ConnectClient.Stop waits for the run loop to exit, a cancel landing during a sleep would block Stop for the full interval (up to MaxInterval). Wrap the backoff with the run context so Retry returns promptly on cancel; the retry budget itself (MaxElapsedTime) is unchanged. * [client] bound WaitStreamConnected in signal client tests The tests waited on WaitStreamConnected with context.Background() and the client's own context was also Background, so a stream that never connects would hang until the suite timeout. Pass a 5s timeout context and assert StreamConnected afterwards so the tests fail fast with a clear reason. * [client] fix WaitStreamConnected stale-channel race The StreamConnected check and the wait-channel creation took the mutex separately, so notifyStreamConnected could set the status and close/clear connectedCh in between: the waiter then created a fresh channel nobody would ever close and blocked forever. Also, the status read was unlocked while notify wrote it under the mutex (a data race). Do the check and the channel fetch in one locked section; drop the now-unused getStreamStatusChan helper. Pre-existing bug, not introduced by this branch. * [client] abort Start if context cancelled while waiting for signal stream receiveSignalEvents blocks in WaitStreamConnected until the signal stream connects or the context is cancelled. If Stop cancelled e.ctx while Start was parked there, Start kept going: it started the remaining subsystems on a cancelled context and marked a shutting-down engine as started. Return the context error from receiveSignalEvents and propagate it from Start, so the deferred cleanup runs and the cancellation reaches the caller. * [client] clean up all started components on Start failure Start's failure defer only called close(), which covers the wg interface, firewall, rosenpass and port forwarding but leaves connMgr, srWatcher, route/DNS/flow/state managers and the monitor goroutines running. A late failure (e.g. the context-cancelled check after the signal stream) thus leaked them. Extract Stop's locked teardown into stopLocked (caller holds syncMsgMux, does not wait on shutdownWg) and call it from both Stop and Start's defer. The defer also cancels the run context first so goroutines started before the failure unwind. Teardown order is unchanged.