kworker/0:0-wg-c pegs CPU to 100% utilisation on exit node, each time a resource makes a web request through a client. #1517

Closed
opened 2025-11-20 05:32:04 -05:00 by saavagebueno · 11 comments

Originally created by @rihards-simanovics on GitHub (Dec 26, 2024).

Hi Netbird team, I'm sorry, but I can't provide any logs this time as it takes too long to censor them and prep for debugging. Instead, I'm happy to email the full system dump from the exit node so you can review the information.

The long and short of the issue is that since version 0.30.0, all exit nodes appear to have the same problem: a kworker/0:0-wg-c thread (where 0:0 could be any core or thread of the 2-core processor) generates spikes of, or sometimes prolonged, 99% CPU utilisation.

This doesn't seem to be an issue on version 0.29.4 and below. Below are a bunch of btop++ screenshots of the issue. The problem appears to manifest regardless of what client version the resource uses. On the exit node, you can also see occasional spikes in memory utilisation that look like memory leaks; they seem to get detected and the offending process gets killed, which is why the memory is released immediately. One such leak killed my btop++ session, which generated a crash dump.
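For anyone who wants text output rather than screenshots, here is a minimal sketch for picking the affected kernel worker threads out of a process listing. It assumes the truncated name `kworker/0:0-wg-c` belongs to a WireGuard kernel workqueue whose name starts with `wg-`; the helper name `filter_wg_kworkers` is made up for this example and is not part of NetBird.

```shell
#!/bin/sh
# Hypothetical helper: keep only process-listing lines that name a kworker
# thread serving a workqueue whose name contains "wg-".
# Reads a listing on stdin and prints the matching lines.
filter_wg_kworkers() {
    grep -E 'kworker/[^ ]*wg-'
}
```

One possible use on the exit node is `top -b -d 5 -n 2 | filter_wg_kworkers`, which takes two batch samples five seconds apart; the second sample reflects CPU use over that interval.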

Screenshots below fold

Exit Node btop++ screenshots v0.30.0 and up

Screenshot 2024-12-18 100003
Screenshot 2024-12-18 100011
Screenshot 2024-12-18 100155
Screenshot 2024-12-18 101422
Screenshot 2024-12-18 101440
Screenshot 2024-12-18 101503
Screenshot 2024-12-18 101534
Screenshot 2024-12-18 101600
Screenshot 2024-12-18 101611
Screenshot 2024-12-26 042904
Screenshot 2024-12-26 042921
Screenshot 2024-12-26 042932
Screenshot 2024-12-26 042957
Screenshot 2024-12-26 043210
Screenshot 2024-12-26 043219

Exit Node btop++ screenshots v0.29.4 and below

Screenshot 2024-12-18 103647
Screenshot 2024-12-18 101943
Screenshot 2024-12-18 102917
Screenshot 2024-12-18 103022
Screenshot 2024-12-18 103207
Screenshot 2024-12-26 043641
Screenshot 2024-12-26 043552
Screenshot 2024-12-26 043627

saavagebueno added the triage-needed label 2025-11-20 05:32:04 -05:00

@rihards-simanovics commented on GitHub (Dec 26, 2024):

Hey, @mlsmaycon, please let me know the best email to send detailed crash dumps or logs; unfortunately, I cannot post the logs here anymore due to security concerns.


@rihards-simanovics commented on GitHub (Dec 26, 2024):

The issue is so bad that even a simple web request to a website fails, let alone an SSH connection or video stream.

Clarification: In this case, by resource, I refer to an Android/Windows/MacOS or other user-facing device used to access other company resources.


@mlsmaycon commented on GitHub (Dec 26, 2024):

Hello @rihards-simanovics, it would be great if you could run the commands below, let things run for some time, and then send the results to support@netbird.io.

Enable trace logs on both nodes, router and client

netbird debug log level trace
netbird debug persistence on

Then, after 5-10 minutes

netbird debug bundle -S
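
The steps above can be sketched as one small shell function, in case that is easier to run on each node. This is only an illustration: the three `netbird` commands are the ones listed above, while the function name `collect_bundle` and the `NETBIRD` and `WAIT_SECONDS` variables are assumptions made for this example.

```shell
#!/bin/sh
# Sketch only: enable trace logging, wait for the problem to show up, then
# create the debug bundle. NETBIRD and WAIT_SECONDS are illustrative
# variables, not NetBird settings.
NETBIRD="${NETBIRD:-netbird}"
WAIT_SECONDS="${WAIT_SECONDS:-600}"   # 10 minutes, the upper end of "5-10 minutes"

collect_bundle() {
    # Stop early if either logging command fails.
    "$NETBIRD" debug log level trace || return 1
    "$NETBIRD" debug persistence on || return 1
    # Leave time for the issue to reproduce while trace logging is on.
    sleep "$WAIT_SECONDS"
    "$NETBIRD" debug bundle -S
}
```

Calling `collect_bundle` on the router and on the client runs the same sequence on each; a shorter wait can be set with, for example, `WAIT_SECONDS=300`.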

@rihards-simanovics commented on GitHub (Dec 26, 2024):

> Hello @rihards-simanovics , it will be great if you can run commands below, run for some time and then you can send them to support@netbird.io.
>
> Enable trace logs on both nodes, router and client
>
> netbird debug log level trace
> netbird debug persistence on
>
> Then, after 5-10 minutes
>
> netbird debug bundle -S

Hey @mlsmaycon, thanks; just before I do, would you like me to run netbird debug bundle -S on both affected Netbird clients simultaneously or would it be fine individually?


@mlsmaycon commented on GitHub (Dec 26, 2024):

They can be run individually too.


@rihards-simanovics commented on GitHub (Dec 26, 2024):

Hey @mlsmaycon, I just sent the email. It will be from rih******@gr*****-w**.st****.

Btop just dumped the core again while testing and running it. Just waiting another 8min to get you a better picture of what is happening:

image

By around the 5th-minute mark, the MacOS client had given up the ghost and was not routing any traffic at all.

I'm not 100% sure, but it might be an Intel MacOS-specific client issue after all, as things quietened down after the MacOS client failed. I asked some staff members to update their Windows PCs to the latest client, and the problem doesn't appear when they use it.


@User-26 commented on GitHub (Dec 26, 2024):

Hello, everyone. I have the same issue. Exit node on Linux + clients on Windows. Netbird v0.35.0 on both.


@rihards-simanovics commented on GitHub (Dec 26, 2024):

Hi @mlsmaycon, I did some more testing on Intel MacOS. The issue started to rear its head on the client with versions 0.30.2 and 0.33.0. However, the interesting bit is that it would only do it once after installation. I haven't restarted the iMac after installation, so I can't confirm whether it reappears after a restart.

On 0.30.2, it did so straight after the upgrade from 0.30.1, on 0.33.0 after the upgrade from 0.32.0, and if 0.33.0 was re-installed. I haven't tested the other two versions as it started having this issue consistently since 0.34.0.


@User-26 commented on GitHub (Dec 26, 2024):

I've updated Netbird on exit node to version v0.35.1. Looks like the issue is fixed. Thanks.


@rihards-simanovics commented on GitHub (Dec 26, 2024):

> I've updated Netbird on exit node to version v0.35.1. Looks like the issue is fixed. Thanks.

Give it some time. The issue usually appears after the computer, its client, or the router node restarts. That said, if it works for you, you're lucky 😅; this issue has been bugging me for ages now.


@rihards-simanovics commented on GitHub (Dec 28, 2024):

OK, I literally have no clue what I've changed, but it resolved itself. It might have had something to do with a route routing the connection to the management server, which was pointed out in the email exchange. Closing this for now.

Reference: SVI/netbird#1517