[Management/Client] Trigger debug bundle runs from API/Dashboard #2189

Open
opened 2025-11-20 07:05:31 -05:00 by saavagebueno · 7 comments

Originally created by @mlsmaycon on GitHub (Aug 15, 2025).

Originally assigned to: @aliamerj on GitHub.

As an administrator, I would like to trigger a debug bundle from within the dashboard to collect debug information for users who have connectivity issues. This trigger should have similar options to CLI commands, and the result of it should be an upload key that can be used to retrieve the uploaded bundle.

To implement this feature, we need to update the following components:

Dashboard

On the peers menu, users with the Admin or Owner role should be able to access a peer's detailed view and see a debug tab. In this debug tab, the user is shown a history of debug bundle jobs with their upload keys, and can trigger a new debug bundle or a debug run for X minutes (max of 5 minutes).

The debug jobs history should have the following columns:

Time | Status | Upload Key

When there are pending jobs, the user should not be able to create new requests until the job status changes.

If the peer is not connected, don't allow job creation.
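
The two dashboard rules above (no new request while a job is pending, no job creation for a disconnected peer) could be checked along these lines. This is only a sketch: the `Job` class and `can_create_job` are hypothetical names, and only the status strings come from the proposal.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Job:
    # One of "pending", "successed" or "failed", as spelled in the proposal.
    status: str


def can_create_job(peer_connected: bool, jobs: List[Job]) -> bool:
    """Return True only if the peer is connected and no job is still pending."""
    if not peer_connected:
        return False
    return not any(job.status == "pending" for job in jobs)
```

The dashboard could use the result to enable or disable the trigger button, while the server would still need to enforce the same rule on its side.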

REST API

For the API, we should have the following endpoints and methods:

GET /api/peers/<peer_id>/jobs: returns the list of jobs with the following response:

[
  {
    "id": "lsknmclks",
    "created_at": timestamp(UTC),
    "completed_at": timestamp(UTC),
    "triggered_by": "user_id",
    "type": "bundle",
    "status": "<successed|pending|failed>",
    "failed_reason": "description of failure (e.g. timeout or failed to run)",
    "result": "<upload_key>",
    "parameters": "{
      "bundle_for": bool,
      "bundle_for_time": int,
      "log-file-count": int,
      "anonymize": bool
    }"
  }
]

GET /api/peers/<peer_id>/jobs/<job_id>: returns the job with the given ID, with the following response:

{
  "id": "lsknmclks",
  "created_at": timestamp(UTC),
  "completed_at": timestamp(UTC),
  "triggered_by": "user_id",
  "type": "bundle",
  "status": "<successed|pending|failed>",
  "failed_reason": "description of failure (e.g. timeout or failed to run)",
  "result": "<upload_key>",
  "parameters": "{
    "bundle_for": bool,
    "bundle_for_time": int,
    "log-file-count": int,
    "anonymize": bool
  }"
}

POST /api/peers/<peer_id>/jobs: Triggers a job with the following payload:

{
  "type": "bundle",
  "parameters": "{
    "bundle_for": bool,
    "bundle_for_time": int,
    "log-file-count": int,
    "anonymize": bool
  }"
}

Response:

{
  "id": "lsknmclks",
  "created_at": timestamp(UTC),
  "completed_at": "",
  "triggered_by": "user_id",
  "type": "bundle",
  "status": "<successed|pending|failed>",
  "failed_reason": "description of failure (e.g. timeout or failed to run)",
  "result": "",
  "parameters": "{
    "bundle_for": bool,
    "bundle_for_time": int,
    "log-file-count": int,
    "anonymize": bool
  }"
}
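
As a rough illustration of the POST payload above, the body could be built and validated like this before being sent. The function name and the validation rules are assumptions (in particular that `bundle_for_time` is in minutes and that `log-file-count` must be at least 1); only the field names and the 5-minute cap come from the proposal.

```python
import json
from typing import Any, Dict

MAX_BUNDLE_FOR_MINUTES = 5  # from the proposal: "max of 5 minutes"


def build_bundle_job_payload(bundle_for: bool, bundle_for_time: int,
                             log_file_count: int, anonymize: bool) -> Dict[str, Any]:
    """Validate the bundle options and return the POST body as a dict."""
    # Assumption: bundle_for_time is in minutes and only matters when
    # bundle_for is true.
    if bundle_for and not 1 <= bundle_for_time <= MAX_BUNDLE_FOR_MINUTES:
        raise ValueError("bundle_for_time must be between 1 and 5 minutes")
    # Assumption: at least one log file has to be requested.
    if log_file_count < 1:
        raise ValueError("log-file-count must be at least 1")
    parameters = {
        "bundle_for": bundle_for,
        "bundle_for_time": bundle_for_time,
        "log-file-count": log_file_count,
        "anonymize": anonymize,
    }
    # The proposal shows "parameters" as a quoted string, so it is
    # serialized to a JSON string here rather than nested as an object.
    return {"type": "bundle", "parameters": json.dumps(parameters)}
```

Calling `build_bundle_job_payload(True, 3, 10, True)` returns a dict that can be passed to any HTTP client as the JSON body of `POST /api/peers/<peer_id>/jobs`.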

Management Server

Management should have generic job methods that route the calls to the bundle methods.

The bundle methods will validate user permissions, and then the input parameters. After that, the job should be stored and sent to the client, and an audit event should be recorded for the triggered job, including the peer ID, initiator, job ID, job parameters, and job type.

When getting job statuses, the methods should return a maximum of 10 jobs ordered by created_at timestamp.

If a job is created for a client that is not connected, we should fail the job immediately and return with this reason and status.

When retrieving a job that has been in a pending state for more than 5 minutes (hardcoded for now), we should mark it as timed out.
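
The listing and timeout rules above could look roughly like the sketch below. The names are hypothetical, newest-first ordering is an assumption (the proposal only says "ordered by created_at"), and a timed-out job is recorded as `failed` with the reason `timeout`.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import List

PENDING_TIMEOUT = timedelta(minutes=5)  # hardcoded for now, per the proposal
MAX_JOBS_RETURNED = 10


@dataclass
class Job:
    id: str
    created_at: datetime
    status: str = "pending"
    failed_reason: str = ""


def expire_if_stale(job: Job, now: datetime) -> Job:
    """Mark a job as failed if it has been pending longer than the timeout."""
    if job.status == "pending" and now - job.created_at > PENDING_TIMEOUT:
        job.status = "failed"
        job.failed_reason = "timeout"
    return job


def list_jobs(jobs: List[Job], now: datetime) -> List[Job]:
    """Return at most 10 jobs, newest first (assumed order), expiring stale ones."""
    newest_first = sorted(jobs, key=lambda j: j.created_at, reverse=True)
    return [expire_if_stale(job, now) for job in newest_first[:MAX_JOBS_RETURNED]]
```

Expiring jobs lazily on read keeps the sketch free of background workers; a real server might prefer a periodic sweep instead.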

Store

We should have a peer_jobs table with all the job information that needs to be recorded, plus the account and peer IDs.
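
One possible shape for the peer_jobs table, sketched with an in-memory SQLite database. The column names mirror the REST fields plus the account and peer IDs; the column types, the index, and the helper functions are assumptions and do not reflect the actual store layer.

```python
import sqlite3

# Hypothetical schema: REST job fields plus account_id and peer_id.
SCHEMA = """
CREATE TABLE peer_jobs (
    id            TEXT PRIMARY KEY,
    account_id    TEXT NOT NULL,
    peer_id       TEXT NOT NULL,
    created_at    TEXT NOT NULL,
    completed_at  TEXT,
    triggered_by  TEXT NOT NULL,
    type          TEXT NOT NULL,
    status        TEXT NOT NULL,
    failed_reason TEXT NOT NULL DEFAULT '',
    result        TEXT NOT NULL DEFAULT '',
    parameters    TEXT NOT NULL DEFAULT '{}'
);
CREATE INDEX idx_peer_jobs_lookup
    ON peer_jobs (account_id, peer_id, created_at);
"""


def open_store() -> sqlite3.Connection:
    """Create an in-memory database with the peer_jobs table."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    return conn


def has_pending_job(conn: sqlite3.Connection, account_id: str, peer_id: str) -> bool:
    """Return True if the peer already has a job in the pending state."""
    row = conn.execute(
        "SELECT COUNT(*) FROM peer_jobs "
        "WHERE account_id = ? AND peer_id = ? AND status = 'pending'",
        (account_id, peer_id),
    ).fetchone()
    return row[0] > 0
```

The composite index is there to serve the "last 10 jobs for a peer" query; whether it is needed depends on the real table size.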

gRPC API

We need a new bi-directional stream for jobs that will allow us to expand it in the future and is backwards compatible.

service ManagementService {
...
    rpc Job(stream EncryptedMessage) returns (stream EncryptedMessage) {}
...
}

message JobRequest {
  JobType type = 1;
  bytes ID = 2;
  bytes Payload = 3;
}

enum JobType {
  unknown = 0; // placeholder
  debug = 1;
}

enum JobStatus {
  unknown = 0; // placeholder
  successed = 1;
  failed = 2;
}

message JobResponse {
  JobType type = 1;
  bytes ID = 2;
  JobStatus status = 3;
  bytes Reason = 4;
  bytes Result = 5;
}

When receiving a job response, the status of the job should be updated.
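
A minimal sketch of how the management side might apply an incoming job response to the stored job. All names here are hypothetical, and two behaviours are assumptions rather than part of the proposal: responses for unknown or already finished jobs are ignored, and any status other than `successed` is treated as a failure.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Dict, Optional


@dataclass
class StoredJob:
    id: str
    status: str = "pending"
    result: str = ""
    failed_reason: str = ""
    completed_at: Optional[datetime] = None


def apply_job_response(jobs: Dict[str, StoredJob], job_id: str, status: str,
                       reason: str = "", result: str = "",
                       now: Optional[datetime] = None) -> bool:
    """Update the stored job from a client response; return True if applied."""
    job = jobs.get(job_id)
    # Assumption: ignore responses for unknown or already finished jobs.
    if job is None or job.status != "pending":
        return False
    if status == "successed":
        job.status = "successed"
        job.result = result
    else:
        # Assumption: "failed" and "unknown" both end up as a failed job.
        job.status = "failed"
        job.failed_reason = reason or "unknown failure"
    job.completed_at = now or datetime.now(timezone.utc)
    return True
```

Ignoring late responses means a job that already timed out keeps its `timeout` reason instead of flipping state afterwards.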

Client

  1. Client should handle management servers that don't support the new jobs gRPC service, so we can connect new clients to old management servers without failures.
  2. We should introduce a flag allowing remote debug jobs. This means that this feature will be disabled by default in the first versions (we will update based on user feedback). This setting will be stored in the config file for the current profile. We should also add GUI support.
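
The opt-in flag from point 2 could be read from the profile config roughly as follows. This is a sketch only: the key name `AllowRemoteDebugJobs` is made up for illustration, and the config is assumed to be a JSON document.

```python
import json

# Hypothetical key name; the real config field may be called something else.
REMOTE_DEBUG_KEY = "AllowRemoteDebugJobs"


def remote_debug_jobs_allowed(profile_config_text: str) -> bool:
    """Return True only if the profile config explicitly enables remote jobs."""
    config = json.loads(profile_config_text) if profile_config_text.strip() else {}
    # Disabled by default: a missing key or any non-boolean value counts as off.
    return config.get(REMOTE_DEBUG_KEY) is True
```

Requiring an explicit boolean `true` keeps the feature off for existing profiles that have never seen the setting.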

@nazarewk commented on GitHub (Aug 15, 2025):

  1. I'm not sure at all this should be a long-running request:
    • might put strain on the number of open connections (running requests) on the management
    • might be interrupted/dropped by firewalls and/or reverse proxies
    • I would rather see it as separate endpoints for 1) initiate the job 2) is the job ready yet? 3) get job details.
      the "is the job ready yet?" could be optimized to be very cheap to call
  2. We should make sure this is logged in Audit Events (for the job being initiated)
  3. Theoretically, Audit Events could be sufficient for the whole thing to be fully functional when combined with just a new "initiate job" endpoint
  4. There must be a way to download it through the Dashboard. The Upload Key will not suffice for the bundle to be accessed by anyone except the NetBird staff

@pappz commented on GitHub (Aug 16, 2025):

The result should be an object instead of a string. That will be useful when new job types are added.

{
  "id": "lsknmclks",
  "created_at": timestamp(UTC),
  "completed_at": timestamp(UTC),
  "triggered_by": "user_id",
  "type": "bundle",
  "status": "<successed|pending|failed>",
  "failed_reason": "description of failure (e.g. timeout or failed to run)",
  "result": {
    "uploadkey": "something-hash"
  },
  "parameters": "{
    "bundle_for": bool,
    "bundle_for_time": int,
    "log-file-count": int,
    "anonymize": bool
  }"
}

@pappz commented on GitHub (Aug 16, 2025):

The reason is not clear to me. Is it an error reason?

message JobResponse {
  JobType type = 1;
  bytes ID = 2;
  JobStatus status = 3;
  bytes Reason = 4;
  bytes Result = 5;
}

@pappz commented on GitHub (Aug 16, 2025):

The type definition should not be part of the Protobuf. We will extend the list of types in the future, and it makes no sense to upgrade our network communication language just because of that. Bytes are enough.


@pappz commented on GitHub (Aug 16, 2025):

I think "JobStatus" would be better replaced with "JobResult", like on the command line: when I execute a command in the CLI, I wait for a "result".

enum JobStatus {
  unknown = 0; // placeholder
  successed = 1;
  failed = 2;
}

@pappz commented on GitHub (Aug 16, 2025):

What should we do with a stuck "pending" task, for example if a peer received the job but somehow never sent a response?


@pascal-fischer commented on GitHub (Aug 25, 2025):

> The reason is not clear to me. Is it an error reason?

Yes, this is the reason why it failed: basically, the error message when the status is failed.

> I think "JobStatus" would be better replaced with "JobResult". Like in the command line. When I execute a command in CLI I wait for "result".

This was to have a more explicit response. If the status is successed, you would expect result to be set. If the status is failed, you would expect reason to be set.

> What should we do with a stuck "pending" task? If a peer received the job but somehow never sent a response.

We will have a timeout on management after which we mark the job as failed (due to timeout). We do not retry; the job needs to be triggered again by the user.

> The result should be object instead of string. In case of new type of jobs will be usefull.

> The type definition should not be part of the Protobuf. We will extend the list of types in the future, and it makes no sense to upgrade our network communication language just because of that. Bytes are enough.

As per today's discussion, we decided not to follow the fully generic approach but to have pre-defined objects per workload type. This means we need to change the proto and the OpenAPI specs to include pre-defined workloads with the allowed parameters and the expected result type.

For the API this means the object will change as follows:

{
  "id": "lsknmclks",
  "created_at": timestamp(UTC),
  "completed_at": timestamp(UTC),
  "triggered_by": "user_id",
  "status": "<successed|pending|failed>",
  "failed_reason": "description of failure (e.g. timeout or failed to run)",
  "workload": {
    "type": "bundle",
    "parameters": {
      "bundle_for": bool,
      "bundle_for_time": int,
      "log-file-count": int,
      "anonymize": bool
    },
    "result": "<upload_key>"
  }
}

and for POST

{
  "workload": {
    "type": "bundle",
    "parameters": {
      "bundle_for": bool,
      "bundle_for_time": int,
      "log-file-count": int,
      "anonymize": bool
    }
  }
}

The workload here is interchangeable, depending on the type we want to request.

For protobuf we would end up with something like:

message JobCreateRequest {
  bytes ID = 1;

  oneof workload_parameters {
    BundleParameters bundle = 10;
    //OtherParameters other = 11;
  }
}

enum JobStatus {
  unknown = 0; // placeholder
  successed = 1;
  failed = 2;
}

message JobResponse {
  bytes ID = 1;
  JobStatus status = 2;
  bytes Reason = 3;
  oneof workload_results {
    BundleResult bundle = 10;
    //OtherResult other = 11;
  }
}

message BundleParameters {
  bool   bundle_for       = 1;
  int64  bundle_for_time  = 2;
  int32  log_file_count   = 3;
  bool   anonymize        = 4;
}

message BundleResult {
  string upload_key = 1;
}
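
To illustrate the "pre-defined objects per workload type" idea on the REST side, here is a small sketch that maps a typed bundle workload to and from the JSON shape shown above. The class and function names are hypothetical; only the field names and the `bundle` type come from the proposal.

```python
from dataclasses import dataclass
from typing import Any, Dict, Optional


@dataclass
class BundleParameters:
    bundle_for: bool
    bundle_for_time: int
    log_file_count: int
    anonymize: bool


def workload_to_api(params: BundleParameters,
                    upload_key: Optional[str] = None) -> Dict[str, Any]:
    """Render a bundle workload in the REST shape; result is set once known."""
    workload: Dict[str, Any] = {
        "type": "bundle",
        "parameters": {
            "bundle_for": params.bundle_for,
            "bundle_for_time": params.bundle_for_time,
            "log-file-count": params.log_file_count,
            "anonymize": params.anonymize,
        },
    }
    if upload_key is not None:
        workload["result"] = upload_key
    return {"workload": workload}


def parse_workload(body: Dict[str, Any]) -> BundleParameters:
    """Parse a POST body, rejecting workload types that are not pre-defined."""
    workload = body["workload"]
    if workload["type"] != "bundle":
        raise ValueError(f"unsupported workload type: {workload['type']}")
    p = workload["parameters"]
    return BundleParameters(p["bundle_for"], p["bundle_for_time"],
                            p["log-file-count"], p["anonymize"])
```

Adding another workload later would mean a new parameters class and a new branch in `parse_workload`, which mirrors adding a new `oneof` member in the proto.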

Reference: SVI/netbird#2189