[Management/Client] Trigger debug bundle runs from API/Dashboard #2189

saavagebueno · 2025-11-20T07:05:31-05:00

saavagebueno commented

2025-11-20 07:05:31 -05:00

Originally created by @mlsmaycon on GitHub (Aug 15, 2025).

Originally assigned to: @aliamerj on GitHub.

As an administrator, I would like to trigger a debug bundle from within the dashboard to collect debug information for users who have connectivity issues. This trigger should have similar options to CLI commands, and the result of it should be an upload key that can be used to retrieve the uploaded bundle.

To implement this feature, we need to update the following components:

Dashboard

On the peers menu, users with the Admin or Owner role should be able to open a peer's detailed view and see a debug tab. This tab shows a history of debug bundle jobs with their upload keys, and lets the user trigger a new debug bundle or a timed debug run of X minutes (max of 5 minutes).

The debug jobs history should have the following columns:

Time | Status | Upload Key

When there are pending jobs, the user should not be able to create new requests until the job status changes.

If the peer is not connected, don't allow job creation.
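The three dashboard rules above (role, peer connectivity, no pending job) can be sketched as a single check. This is only an illustration: the role strings, the `JobRow` type, and the function name are hypothetical and not taken from the NetBird codebase.

```python
from dataclasses import dataclass

# Hypothetical role names; the issue only says "Admin or Owner".
ALLOWED_ROLES = {"admin", "owner"}


@dataclass
class JobRow:
    status: str  # "pending", "successed", or "failed"


def can_create_job(role: str, peer_connected: bool, jobs: list[JobRow]) -> bool:
    """Return True only when all three dashboard rules allow a new job."""
    if role not in ALLOWED_ROLES:
        return False
    if not peer_connected:
        return False
    # Block new requests while any earlier job is still pending.
    return all(job.status != "pending" for job in jobs)
```

The dashboard could use the same result to disable the trigger button, while the API enforces the rules again on the server side.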

REST API

For the API, we should have the following endpoints and methods:

GET /api/peers/<peer_id>/jobs: returns the list of jobs with the following response:

[
  {
    "id": "lsknmclks",
    "created_at": timestamp(UTC),
    "completed_at": timestamp(UTC),
    "triggered_by": "user_id",
    "type": "bundle",
    "status": "<successed|pending|failed>",
    "failed_reason": "description of failure (e.g. timeout or failed to run)",
    "result": "<upload_key>",
    "parameters": {
      "bundle_for": bool,
      "bundle_for_time": int,
      "log-file-count": int,
      "anonymize": bool
    }
  }
]

GET /api/peers/<peer_id>/jobs/<job_id>: returns the job with the given ID, with the following response:

{
  "id": "lsknmclks",
  "created_at": timestamp(UTC),
  "completed_at": timestamp(UTC),
  "triggered_by": "user_id",
  "type": "bundle",
  "status": "<successed|pending|failed>",
  "failed_reason": "description of failure (e.g. timeout or failed to run)",
  "result": "<upload_key>",
  "parameters": {
    "bundle_for": bool,
    "bundle_for_time": int,
    "log-file-count": int,
    "anonymize": bool
  }
}

POST /api/peers/<peer_id>/jobs: Triggers a job with the following payload:

{
  "type": "bundle",
  "parameters": {
    "bundle_for": bool,
    "bundle_for_time": int,
    "log-file-count": int,
    "anonymize": bool
  }
}

Response:

{
  "id": "lsknmclks",
  "created_at": timestamp(UTC),
  "completed_at": "",
  "triggered_by": "user_id",
  "type": "bundle",
  "status": "<successed|pending|failed>",
  "failed_reason": "description of failure (e.g. timeout or failed to run)",
  "result": "",
  "parameters": {
    "bundle_for": bool,
    "bundle_for_time": int,
    "log-file-count": int,
    "anonymize": bool
  }
}
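A rough sketch of how the POST payload could be validated before a job is created. It assumes that `parameters` arrives as a nested object, that all four fields are required, and that `bundle_for_time` is expressed in minutes with the 5-minute cap from the Dashboard section. The function name and error strings are made up for illustration.

```python
MAX_BUNDLE_FOR_MINUTES = 5  # from the dashboard section: "max of 5 minutes"


def validate_job_request(payload: dict) -> list[str]:
    """Return a list of problems; an empty list means the payload is accepted."""
    errors = []
    if payload.get("type") != "bundle":
        errors.append("type must be 'bundle'")
    params = payload.get("parameters")
    if not isinstance(params, dict):
        return errors + ["parameters must be an object"]
    for key in ("bundle_for", "anonymize"):
        if not isinstance(params.get(key), bool):
            errors.append(f"{key} must be a bool")
    for key in ("bundle_for_time", "log-file-count"):
        value = params.get(key)
        # bool is a subclass of int in Python, so exclude it explicitly.
        if not isinstance(value, int) or isinstance(value, bool):
            errors.append(f"{key} must be an int")
    minutes = params.get("bundle_for_time")
    timed = params.get("bundle_for") is True
    if timed and isinstance(minutes, int) and not isinstance(minutes, bool):
        # Assumption: the unit is minutes and a timed run needs at least one.
        if not 1 <= minutes <= MAX_BUNDLE_FOR_MINUTES:
            errors.append("bundle_for_time must be between 1 and 5 minutes")
    return errors
```

Returning every problem at once, rather than stopping at the first, lets the dashboard show all invalid fields in one round trip.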

Management Server

Management should have generic job methods that route the calls to the bundle methods.

The bundle methods will validate the user's permissions and then the input parameters. After that, the job should be sent to the client, and an audit event should be recorded for the triggered job, including the peer ID, initiator, job ID, job parameters, and job type. The job itself should also be stored.

When getting job statuses, the methods should return a maximum of 10 jobs ordered by created_at timestamp.
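As a small illustration of the listing rule, the sketch below caps the result at 10 jobs sorted by `created_at`. The issue does not state the sort direction, so newest-first is an assumption here, and the function name is hypothetical.

```python
from datetime import datetime, timezone

MAX_JOBS_RETURNED = 10


def latest_jobs(jobs: list[dict], limit: int = MAX_JOBS_RETURNED) -> list[dict]:
    """Return at most `limit` jobs, newest first (sort direction is assumed)."""
    ordered = sorted(jobs, key=lambda job: job["created_at"], reverse=True)
    return ordered[:limit]
```

In a real store the same effect would more likely come from an ORDER BY and LIMIT in the query than from sorting in memory.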

If a job is created for a client that is not connected, we should fail the job immediately and return with this reason and status.
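The creation flow described above could look roughly like this. The `Job` type, the `send_to_peer` callback, and the in-memory `store` and `audit_log` are stand-ins invented for the sketch. It also assumes that a job failed for a disconnected peer is still stored and that the audit event is only written once the job has actually been sent, neither of which the issue spells out.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class Job:
    id: str
    peer_id: str
    triggered_by: str
    type: str
    parameters: dict
    status: str = "pending"
    failed_reason: str = ""
    created_at: datetime = field(default_factory=lambda: datetime.now(timezone.utc))


def create_job(job: Job, peer_connected: bool, send_to_peer, store: dict, audit_log: list) -> Job:
    """Store the job, failing it immediately when the peer is offline."""
    if not peer_connected:
        job.status = "failed"
        job.failed_reason = "peer is not connected"
        store[job.id] = job  # assumption: failed jobs are kept for the history view
        return job
    send_to_peer(job)
    store[job.id] = job
    # Audit fields listed in the issue: peer ID, initiator, job ID, parameters, type.
    audit_log.append({
        "peer_id": job.peer_id,
        "initiator": job.triggered_by,
        "job_id": job.id,
        "parameters": job.parameters,
        "type": job.type,
    })
    return job
```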

When retrieving a job that has been in a pending state for more than 5 minutes (hardcoded for now), we should mark it as timed out.
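One way to apply the timeout lazily on read is sketched below. Since the REST status only lists `successed`, `pending`, and `failed`, the sketch assumes a timed-out job becomes `failed` with a `timeout` reason. That mapping and the function name are assumptions.

```python
from datetime import datetime, timedelta, timezone

PENDING_TIMEOUT = timedelta(minutes=5)  # hardcoded for now, per the issue


def expire_if_stale(job: dict, now: datetime) -> dict:
    """Return the job, marked failed with a timeout reason if it is pending too long."""
    if job["status"] == "pending" and now - job["created_at"] > PENDING_TIMEOUT:
        # Build a new dict so the caller decides when to persist the change.
        job = {**job, "status": "failed", "failed_reason": "timeout", "completed_at": now}
    return job
```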

Store

We should have a peer_jobs table that records all the required job information, plus the account and peer IDs.
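For illustration only, here is what such a table might look like, using an in-memory SQLite database. The column names simply mirror the REST fields plus `account_id` and `peer_id`. The types, the index, and the choice of SQLite are assumptions rather than the actual store layout.

```python
import sqlite3

SCHEMA = """
CREATE TABLE peer_jobs (
    id            TEXT PRIMARY KEY,
    account_id    TEXT NOT NULL,
    peer_id       TEXT NOT NULL,
    triggered_by  TEXT NOT NULL,
    type          TEXT NOT NULL,
    status        TEXT NOT NULL,
    failed_reason TEXT NOT NULL DEFAULT '',
    result        TEXT NOT NULL DEFAULT '',
    parameters    TEXT NOT NULL,   -- JSON-encoded job parameters
    created_at    TEXT NOT NULL,   -- UTC timestamp, ISO 8601
    completed_at  TEXT
);
CREATE INDEX idx_peer_jobs_peer ON peer_jobs (account_id, peer_id, created_at);
"""


def open_store() -> sqlite3.Connection:
    """Create an in-memory database with the sketched peer_jobs table."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    return conn
```

The composite index is there because the listing endpoint filters by peer and orders by `created_at`.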

gRPC API

We need a new bi-directional stream for jobs that will allow us to expand it in the future and is backwards compatible.

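service ManagementService {
...
    // Both directions are streams, so the server can push JobRequest messages
    // and the client can send JobResponse messages back on the same call.
    rpc Job(stream EncryptedMessage) returns (stream EncryptedMessage) {}
...
}

message JobRequest {
    JobType type = 1;
    bytes ID = 2;
    bytes Payload = 3;
}

// proto3 enum value names are scoped to the enclosing package or message,
// not to the enum itself, so the two zero values need distinct names.
enum JobType {
    unknown_type = 0; // placeholder
    debug = 1;
}

enum JobStatus {
    unknown_status = 0; // placeholder
    successed = 1;
    failed = 2;
}

message JobResponse {
    JobType type = 1;
    bytes ID = 2;
    JobStatus status = 3;
    bytes Reason = 4;
    bytes Result = 5;
}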

When receiving a job response, the status of the job should be updated.
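A minimal sketch of that update step, with the job store reduced to a dict. The enum mirrors the `JobStatus` values from the proto definition. Ignoring responses for jobs that are unknown or no longer pending (for example, already timed out) is an assumption, as are all the Python names.

```python
from datetime import datetime, timezone
from enum import IntEnum


class JobStatus(IntEnum):
    # Mirrors the JobStatus enum in the proto definition.
    UNKNOWN = 0
    SUCCESSED = 1
    FAILED = 2


def apply_job_response(jobs: dict, job_id: str, status: JobStatus,
                       reason: str, result: str, now: datetime) -> bool:
    """Update the stored job from a JobResponse; return False if it is ignored."""
    job = jobs.get(job_id)
    if job is None or job["status"] != "pending":
        return False  # unknown job, or already completed or timed out
    if status == JobStatus.SUCCESSED:
        job.update(status="successed", result=result)
    else:
        # Treat FAILED and the UNKNOWN placeholder alike as a failure.
        job.update(status="failed", failed_reason=reason or "unknown status")
    job["completed_at"] = now
    return True
```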

Client

1. Client should handle management servers that don't support the new jobs gRPC service, so we can connect new clients to old management servers without failures.
2. We should introduce a flag allowing remote debug jobs. This means that this feature will be disabled by default in the first versions (we will update based on user feedback). This setting will be stored in the config file for the current profile. We should also add GUI support.
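The opt-in flag could gate job handling on the client roughly as follows. The `allow_remote_debug_jobs` field name, the `run_bundle` callback, and the response shape are hypothetical. The only parts taken from the issue are that the setting lives in the profile config and is off by default.

```python
from dataclasses import dataclass


@dataclass
class ProfileConfig:
    # Hypothetical field name; disabled by default as the issue requires.
    allow_remote_debug_jobs: bool = False


def handle_job_request(config: ProfileConfig, job_id: str, run_bundle) -> dict:
    """Run the bundle only if the profile allows it; always answer the server."""
    if not config.allow_remote_debug_jobs:
        return {"id": job_id, "status": "failed",
                "reason": "remote debug jobs are disabled on this client", "result": ""}
    try:
        upload_key = run_bundle()  # expected to return the upload key
    except Exception as exc:
        return {"id": job_id, "status": "failed", "reason": str(exc), "result": ""}
    return {"id": job_id, "status": "successed", "reason": "", "result": upload_key}
```

Answering with an explicit failure reason when the flag is off means the job would not have to sit pending until the five-minute timeout.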

saavagebueno commented

2025-11-20 07:05:32 -05:00

@nazarewk commented on GitHub (Aug 15, 2025):

1. I'm not sure at all this should be a long-running request:
   - might put strain on the number of open connections (running requests) on the management
   - might be interrupted/dropped by firewalls and/or reverse proxies
   - I would rather see it as separate endpoints for 1) initiate the job 2) is the job ready yet? 3) get job details.
     The "is the job ready yet?" check could be optimized to be very cheap to call
2. We should make sure this is logged in Audit Events (for the job being initiated)
3. Theoretically, Audit Events could be sufficient for the whole thing to be fully functional when combined with just a new "initiate job" endpoint
4. There must be a way to download it through the Dashboard. The Upload Key will not suffice for the bundle to be accessed by anyone except the NetBird staff
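To make the "initiate, then ask whether it is ready" idea concrete, a caller could poll a cheap status check along these lines. Everything here is hypothetical: `get_status` stands in for whatever lightweight endpoint would exist, and the default timeout and interval are arbitrary.

```python
import time


def wait_for_job(get_status, job_id: str, timeout_s: float = 300.0,
                 interval_s: float = 2.0, sleep=time.sleep, clock=time.monotonic) -> str:
    """Poll a cheap status check until the job leaves "pending" or the deadline passes."""
    deadline = clock() + timeout_s
    while True:
        status = get_status(job_id)
        if status != "pending":
            return status
        if clock() >= deadline:
            return "timeout"
        sleep(interval_s)
```

Each poll is a short request, so nothing has to stay open through firewalls or reverse proxies while the bundle is being collected.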

saavagebueno commented

2025-11-20 07:05:32 -05:00