Relay server bug - Connections not closed after EOF exception #1410

Open
opened 2025-11-20 05:29:50 -05:00 by saavagebueno · 0 comments
Owner

Originally created by @mathiash98 on GitHub (Nov 12, 2024).

Originally assigned to: @pappz on GitHub.

Solved by #2879

Potential relay server bug: the number of open connections keeps growing after EOF websocket errors until it saturates. Hopefully I am just misunderstanding something.

Scenario:

Many peers with high ping and low internet speed

Background:

Hosting relay server behind nginx reverse proxy

Reproduction:

  1. lsof count inside the Docker container before the error: `sudo docker exec netbird-relay-0-1 lsof | wc` -> 800
  2. Whenever the relay server hits the error `relay/server/peer.go:61: failed to read message: failed to get reader: failed to read frame header: EOF`, it returns from the Work code in https://github.com/netbirdio/netbird/blob/b4d7605147e6ddbd22c214b42ef43267bc78ce80/relay/server/peer.go#L48-L64. This in turn removes the peer from the store in `relay.go` (https://github.com/netbirdio/netbird/blob/b4d7605147e6ddbd22c214b42ef43267bc78ce80/relay/server/relay.go#L134-L139) and logs `relay/server/relay.go:137: relay connection closed`.
  3. The client also gets an error when sending data, and the same peer reconnects right afterwards (I have not found the code for this yet), which produces the log message `relay/server/relay.go:129: peer connected from: 172.23.0.1:39226`.
  4. The lsof count has now increased by one: `sudo docker exec netbird-relay-0-1 lsof | wc` -> 801
  5. When the lsof count of the Docker container reaches 1000, the relay stops accepting new connections and the container must be restarted.
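The suspected mechanism in the steps above can be sketched as a toy model. This is not NetBird code: `fakeConn`, `relay`, `dropOnEOF`, and `simulate` are hypothetical names invented for illustration, and the model simply assumes the hypothesis of this report, namely that the peer is removed from the store without its socket being closed.

```go
package main

import "fmt"

// fakeConn is a hypothetical stand-in for a network connection.
type fakeConn struct {
	closed bool
}

// relay is a toy model of a relay server, not the NetBird type.
type relay struct {
	store map[string]*fakeConn // peer ID -> current connection
	all   []*fakeConn          // every connection ever accepted
}

// connect registers a new connection for the peer, as in step 3.
func (r *relay) connect(id string) {
	c := &fakeConn{}
	r.store[id] = c
	r.all = append(r.all, c)
}

// dropOnEOF removes the peer from the store, as in step 2.
// closeConn selects whether the socket is also closed.
func (r *relay) dropOnEOF(id string, closeConn bool) {
	c, ok := r.store[id]
	if !ok {
		return
	}
	if closeConn {
		c.closed = true
	}
	delete(r.store, id)
}

// openSockets counts connections that were never closed.
func (r *relay) openSockets() int {
	n := 0
	for _, c := range r.all {
		if !c.closed {
			n++
		}
	}
	return n
}

// simulate runs one peer through the given number of EOF/reconnect cycles
// and returns how many sockets remain open afterwards.
func simulate(reconnects int, closeConn bool) int {
	r := &relay{store: map[string]*fakeConn{}}
	r.connect("peer-a")
	for i := 0; i < reconnects; i++ {
		r.dropOnEOF("peer-a", closeConn)
		r.connect("peer-a")
	}
	return r.openSockets()
}

func main() {
	fmt.Println("open sockets without close:", simulate(200, false))
	fmt.Println("open sockets with close:", simulate(200, true))
}
```

Under this assumption, each EOF/reconnect cycle leaves one extra open socket behind, which would match the lsof count growing from 800 to 801 in step 4.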

Potential fix? -> Run `p.conn.Close()` inside https://github.com/netbirdio/netbird/blob/b4d7605147e6ddbd22c214b42ef43267bc78ce80/relay/server/peer.go#L48-L64 whenever an error occurs?
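A minimal sketch of that idea, assuming the read loop owns the connection: a deferred `Close` releases the socket on every exit path, including EOF. The `peer` type, `work` method, and `simulateEOF` helper below are hypothetical and use an in-memory `net.Pipe` instead of a websocket; this is not the change made in #2879.

```go
package main

import (
	"errors"
	"fmt"
	"io"
	"net"
)

// peer is a hypothetical stand-in for the relay's peer type.
type peer struct {
	conn net.Conn
}

// work reads until the first error and returns it wrapped.
// The deferred Close runs on every return, so the socket is always released.
func (p *peer) work() error {
	defer p.conn.Close()

	buf := make([]byte, 1500)
	for {
		if _, err := p.conn.Read(buf); err != nil {
			return fmt.Errorf("failed to read message: %w", err)
		}
		// Message handling would go here.
	}
}

// simulateEOF closes the client end of an in-memory pipe, runs work on the
// server end, and reports whether the server end was closed afterwards.
func simulateEOF() (serverClosed bool, workErr error) {
	server, client := net.Pipe()
	client.Close() // the remote side goes away; the next read sees EOF

	workErr = (&peer{conn: server}).work()

	// Reading from a locally closed pipe end returns io.ErrClosedPipe.
	_, err := server.Read(make([]byte, 1))
	return errors.Is(err, io.ErrClosedPipe), workErr
}

func main() {
	closed, err := simulateEOF()
	fmt.Println("work returned:", err)
	fmt.Println("is EOF:", errors.Is(err, io.EOF))
	fmt.Println("server side closed:", closed)
}
```

Putting the close in a `defer` rather than in each error branch keeps the cleanup in one place, so a new return path added later cannot skip it.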

Full log example showcasing that the peer_id is same:

netbird-relay-0-1  | 2024-11-11T12:54:32Z ERRO [peer_id: sha-Pfq9J4xZuOoCYsq6OMQCUZ7j0GOemtq/28tDkCvWGJc=] relay/server/peer.go:61: failed to read message: failed to get reader: failed to read frame header: EOF
netbird-relay-0-1  | 2024-11-11T12:54:32Z DEBG [peer_id: sha-Pfq9J4xZuOoCYsq6OMQCUZ7j0GOemtq/28tDkCvWGJc=] relay/server/relay.go:137: relay connection closed
netbird-relay-0-1  | 2024-11-11T12:54:32Z DEBG [peer_id: sha-0Uu8KZZs8rTiypKGr9PaxkKtTFWqnttiYaCZRwMFuHE=] relay/server/peer.go:196: peer not found: sha-Pfq9J4xZuOoCYsq6OMQCUZ7j0GOemtq/28tDkCvWGJc=
netbird-relay-0-1  | 2024-11-11T12:54:34Z INFO [peer_id: sha-Pfq9J4xZuOoCYsq6OMQCUZ7j0GOemtq/28tDkCvWGJc=] relay/server/relay.go:129: peer connected from: 172.23.0.1:43988
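To check a longer log for the same pattern, the lines above can be scanned for peers that log `relay connection closed` and later log `peer connected from:`. The helper below is a hypothetical sketch: `reconnectedPeers` and `sampleLog` are illustrative names, and the matching relies only on the message texts shown in this excerpt.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// sampleLog holds the four log lines from this report.
const sampleLog = `netbird-relay-0-1  | 2024-11-11T12:54:32Z ERRO [peer_id: sha-Pfq9J4xZuOoCYsq6OMQCUZ7j0GOemtq/28tDkCvWGJc=] relay/server/peer.go:61: failed to read message: failed to get reader: failed to read frame header: EOF
netbird-relay-0-1  | 2024-11-11T12:54:32Z DEBG [peer_id: sha-Pfq9J4xZuOoCYsq6OMQCUZ7j0GOemtq/28tDkCvWGJc=] relay/server/relay.go:137: relay connection closed
netbird-relay-0-1  | 2024-11-11T12:54:32Z DEBG [peer_id: sha-0Uu8KZZs8rTiypKGr9PaxkKtTFWqnttiYaCZRwMFuHE=] relay/server/peer.go:196: peer not found: sha-Pfq9J4xZuOoCYsq6OMQCUZ7j0GOemtq/28tDkCvWGJc=
netbird-relay-0-1  | 2024-11-11T12:54:34Z INFO [peer_id: sha-Pfq9J4xZuOoCYsq6OMQCUZ7j0GOemtq/28tDkCvWGJc=] relay/server/relay.go:129: peer connected from: 172.23.0.1:43988`

// peerIDRe captures the ID inside the first "[peer_id: ...]" tag of a line.
var peerIDRe = regexp.MustCompile(`\[peer_id: ([^\]]+)\]`)

// reconnectedPeers returns the IDs of peers whose connection was logged as
// closed and that later logged a new "peer connected from:" line.
func reconnectedPeers(logText string) []string {
	closed := map[string]bool{}
	var out []string
	for _, line := range strings.Split(logText, "\n") {
		m := peerIDRe.FindStringSubmatch(line)
		if m == nil {
			continue
		}
		id := m[1]
		switch {
		case strings.Contains(line, "relay connection closed"):
			closed[id] = true
		case strings.Contains(line, "peer connected from:") && closed[id]:
			out = append(out, id)
			delete(closed, id)
		}
	}
	return out
}

func main() {
	for _, id := range reconnectedPeers(sampleLog) {
		fmt.Println("closed and reconnected:", id)
	}
}
```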

I can also mention that the number of connected sockets does go down whenever a client is gracefully shut down or a healthcheck timeout occurs.

saavagebueno added the server label 2025-11-20 05:29:50 -05:00
Reference: SVI/netbird#1410