* [management,proxy] Agent network: per-account LLM gateway (policy, metering, multi-provider) (#6555)
* [agent-network] Shared proto, OpenAPI schema, and generated types
* [agent-network] Management: store, manager, synthesizer, policy engine, provider catalog, HTTP/gRPC API
Adds the account-scoped agent-network module: provider/policy/budget CRUD and
store, the reverse-proxy service synthesizer, policy selection + limit
enforcement, the provider catalog (incl. Vertex AI and AWS Bedrock entries),
and the management HTTP + proxy gRPC surfaces.
* [management] Fix agent-network proxy-peer fan-out on affected-peer recompute
The affected-peers resolver loaded only persisted reverse-proxy services, but
agent-network services are synthesized on demand and never persisted. As a
result the embedded proxy peer was never folded into the affected set when a
client's group changed, so the proxy received no network-map update for a newly
authorised client and rejected its handshake until a full resync (restart).
loadProxyServices now merges the synthesized agent-network services (injected
via a registration hook to avoid an import cycle), so proxy peers learn newly
authorised clients immediately.
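The registration-hook pattern mentioned above can be sketched as follows. This is a minimal illustration, not the actual code: the Service type, RegisterSynthesizer, and the loadProxyServices signature shown here are hypothetical stand-ins.

```go
package main

import "fmt"

// Service is a hypothetical stand-in for a reverse-proxy service record.
type Service struct {
	ID          string
	Synthesized bool
}

// synthesizer is the hook slot. The low-level package owns only this
// function variable, so it never has to import the module that fills it in.
var synthesizer func(accountID string) []Service

// RegisterSynthesizer (hypothetical name) is called once at start-up by the
// higher-level module to inject its on-demand service builder.
func RegisterSynthesizer(fn func(accountID string) []Service) {
	synthesizer = fn
}

// loadProxyServices merges persisted services with synthesized ones so that
// callers see both kinds when computing the affected peers.
func loadProxyServices(accountID string, persisted []Service) []Service {
	out := append([]Service(nil), persisted...)
	if synthesizer != nil {
		out = append(out, synthesizer(accountID)...)
	}
	return out
}

func main() {
	RegisterSynthesizer(func(accountID string) []Service {
		return []Service{{ID: "agent-network-" + accountID, Synthesized: true}}
	})
	all := loadProxyServices("acc1", []Service{{ID: "persisted-1"}})
	fmt.Println(len(all), all[1].ID)
}
```

Because the hook is a plain function value, the resolver keeps working (with persisted services only) when nothing has been registered.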
* [proxy] Reverse-proxy middleware framework, chain, and request plumbing
The per-target middleware chain (slots, dispatcher, mutation gate, metadata
merger), body capture, access-log terminal sink, and the proxy wiring that
builds + runs chains for synthesized agent-network services.
* [proxy] LLM parsers, pricing, and builtin middlewares (OpenAI, Anthropic, Vertex AI, AWS Bedrock)
Request/response parsers and SSE/event-stream metering, the embedded pricing
table, and the builtin middleware set: request parser, router, policy
limit-check/record, cost meter, guardrail, identity inject, response parser.
Includes the path-routed providers — Google Vertex AI (keyfile:: service-account
OAuth minting) and AWS Bedrock (bearer auth, invoke/converse/streaming, optional
/bedrock prefix) — plus the Models allowlist and unmeterable-publisher deny.
* [proxy] IPv6 in-place apply and TCP accept-loop hardening on netstack listeners
* [agent-network] End-to-end test suite, module docs, and deployment preset
* [agent-network] Fix codespell typos and exclude false positives
- labelgen word pool: vermillion -> vermilion, racoon -> raccoon.
- codespell ignore list: add flate (Go compress/flate package), recordin
(a test-local identifier), and unparseable (a valid alternative spelling used
consistently across identifiers + a metadata-value constant).
* [management] Set LastSeen on injected proxy peer in realstack test (MySQL strict-mode)
The injected embedded proxy peer had a PeerStatus with a zero LastSeen, which
serializes to '0000-00-00' and is rejected by MySQL in strict mode (SQLite
tolerates it). Set LastSeen to a valid time so SaveAccount succeeds on both
engines.
* [agent-network] Remove e2e shell-script suite from this branch
The end-to-end shell scripts under scripts/e2e/ are maintained in a separate
testing suite and are not part of this change set.
* [agent-network] Polish module docs: remove internal review scaffolding, fix links, verify diagrams
Strip PR-review framing, commit references, absolute paths, and stale internal
references from the agent-network module docs; fix broken relative links; verify
all diagrams against the current architecture. Remove the internal AI-reviewer
prompt file.
* [management] Refine session expiration handling to support 3-state encoding for SSO deadlines
* [agent-network] Relocate agentnetwork package to internals/modules
Move management/server/agentnetwork (and its catalog/, labelgen/, types/
subpackages) to management/internals/modules/agentnetwork, alongside the
reverse-proxy module, and rewrite all importers. Pure relocation: package names,
the synthesizer + affectedpeers registration hook, and store access (shared
store.Store) are unchanged, so no import cycle is introduced (affectedpeers
still depends only on the agentnetwork/types leaf).
* [agent-network] Co-locate HTTP handlers in the module (RegisterEndpoints)
Move the agent-network HTTP handlers from server/http/handlers/agentnetwork into
the module at internals/modules/agentnetwork/handlers (package handlers) and
rename the entrypoint AddEndpoints -> RegisterEndpoints, matching the
reverse-proxy module convention. Wiring in http/handler.go updated accordingly.
* Update getting started to point to rc when agent network enabled
* Add a reference to a commercial license
* Fix docs localhost link
* Add private services domain note
* [management] Add agent-network telemetry metrics (#6561)
Surface agent-network adoption and usage in the self-hosted metrics
worker: distinct accounts, providers, policies, budget rules, accounts
with log collection enabled, and aggregated input/output tokens plus
cost.
Tokens and cost are summed from agent_network_request_usage (the
always-written per-request ledger) so the figures are accurate
regardless of the log-collection toggle and carry no double-counting.
All values come from a handful of indexed aggregate queries run only on
the worker's periodic tick.
Adds store.AgentNetworkMetrics with GetAgentNetworkMetrics on the Store
interface, the SqlStore implementation, and a zero-valued FileStore stub.
* Update NetBird server and proxy image versions to 0.74.0-rc.2
* [management,proxy] Reduce agent-network cognitive complexity (#6566)
Address the SonarCloud quality-gate findings in new agent-network code
by extracting focused helpers. No behavior change.
- synthesizer.go: split buildIdentityInjectConfigJSON into per-shape
rule builders; extract mergeGuardrail from mergeGuardrails to cut
nesting depth.
- llm_identity_inject: extract injectionEmitsAnything validation
predicate from New.
- llm_response_parser/streaming.go: extract applyOpenAIStreamUsage and
applyAnthropicStreamUsage (via a named anthropicStreamUsage type) and
simplify the OpenAI scanner loop.
- reverseproxy.go: decompose ServeHTTP into serveRouteError,
buildTargetContext, serveDirect, serveWithChain, captureRequestForChain,
serveDeny, newResponseWriter, observeResponse, and forwardUpstream,
preserving the defer ordering so response observation still reads the
captured writer before it is released.
* [management] Move agent-network access-log ingest into the agentnetwork module (#6568)
The agent-network access-log ingest path (metaKey wire contract, flatten,
usage derivation, and the dual-write of the usage ledger + settings-gated
full row) lived in the reverseproxy accesslogs manager, even though the
agentnetwork module already owns the rest of that domain — types, read
(ListAccessLogs / GetUsageOverview), the budget-counter writes, and
retention cleanup.
Move it next to the rest: a stateless agentnetwork.IngestAccessLog(ctx,
store, entry) that the reverseproxy SaveAccessLog delegates to when the
entry is agent-network. Removes the agentNetworkTypes import from the
reverseproxy manager. No behavior change; the write/read table separation
is unchanged.
Adds real-store coverage for the disable->enable log-collection toggle
(usage ledger always written, full row gated) plus the metadata parse and
group-dedup helpers, which previously had no dedicated tests.
* Add session view support in the access log
* [management,proxy] Container-based agent-network e2e harness (#6577)
* [e2e] Add container-based agent-network e2e harness (Pillar 1)
Introduce a self-contained, OIDC-free e2e harness that stands up NetBird
in containers, so suites no longer depend on the hand-maintained Tilt
stack or a real IdP.
- harness brings up the combined server (management + signal + relay +
STUN + embedded IdP) in a single container built from
combined/Dockerfile.multistage, and mints an admin PAT through the
unauthenticated /api/setup bootstrap (NB_SETUP_PAT_ENABLED). API access
goes through the existing shared/management/client/rest typed client.
- the image is built via the docker CLI (BuildKit) so the Dockerfile's
cache mounts are honored; testcontainers then runs the tagged image.
- everything is behind the `e2e` build tag so normal builds and unit
tests never pull in testcontainers.
Adds BuildKit cache mounts to combined/Dockerfile.multistage so source
changes recompile incrementally rather than from scratch.
Pillar 1 proven by TestCombinedBootstrap: server builds, boots, mints a
PAT, and the PAT authenticates a real management API call.
* [e2e] Add management-side agent-network scenarios (Pillar 2)
Port the API-driven agent-network scenarios from the bash suites to Go,
sharing one combined server per package run (TestMain) with each test
owning its resource cleanup. Drives the /api/agent-network/* endpoints
through the shared REST client's NewRequest primitive with the generated
api types.
Scenarios:
- provider lifecycle (create/get/list/delete + 404 after delete)
- provider validation (missing api_key, unknown catalog id → 4xx)
- settings collection-toggle round-trip with cluster/subdomain immutability
- policy window floor (reject <60s enabled limit, accept at 60s)
- consumption read endpoint returns an array
All deterministic and dependency-free (dummy provider keys; no upstream
calls), so they run headless in CI.
* [e2e] Add live chat-through-proxy scenario (Pillar 3)
Stand up the full agent-network data path in containers and drive a real
chat-completion through the gateway:
- harness: a shared docker network (combined server reachable by alias),
a proxy container built from the published reverse-proxy image
(NB_PROXY_PRIVATE, NB_PROXY_ALLOW_INSECURE, NB_RELAY_TRANSPORT=ws to match
the combined server's WS-multiplexed relay) with a generated self-signed
wildcard cert, and a netbird client container that joins via a setup key.
- the combined image, proxy image, and client image default to the
published rc.2 releases (overridable via NB_E2E_*_IMAGE; a bare local tag
is built from source instead). Geolocation download is disabled so the
server starts without external fetches.
- one shared domain is used for the management exposed address, the proxy
domain, and the agent-network cluster; the proxy token is minted via the
server CLI (global) to match the manual install.
TestChatCompletionThroughProxy provisions provider+policy+group+setup key,
runs proxy+client, drives an OpenAI chat-completion through the tunnel, and
asserts a 200 plus the ingested access-log row. Requires OPENAI_TOKEN
(skips otherwise). The provider must be created with enabled=true explicitly
— the create default is false despite the API doc.
* [e2e] Run the live chat scenario across a provider matrix
Replace the single-provider chat test with a data-driven matrix that runs
the same scenario through every provider whose credentials are present in
the environment (keys/URLs sourced from ~/.llm-keys locally, Actions
secrets in CI):
- OpenAI (chat), Anthropic (messages), Vercel, OpenRouter, Cloudflare
(OpenAI-compatible gateways), and Bedrock (path-routed, bearer, via the
messages shape) — covering both wire shapes and the gateway routing.
- all providers are created enabled with a unique model string so the
proxy's connect-time snapshot carries them all and model->provider
routing is unambiguous (provider toggles after connect don't reconcile
to a connected proxy).
- the client supports both wire shapes (/v1/chat/completions and
/v1/messages); Cloudflare gets the openai provider segment appended to
its gateway URL.
Each provider must return 200 through the tunnel and produce an ingested
access-log row. Vertex is intentionally excluded from the uniform matrix:
it needs a bespoke rawPredict request shape rather than the shared
chat/messages path, so it warrants a dedicated scenario.
* [ci] Add manual workflow for the agent-network e2e suite
The e2e suite (build tag `e2e`) stands up the combined server + proxy +
client in Docker and drives live chat-completions, so it is slow and needs
provider credentials. Gate it out of normal CI (it already is, via the
build tag) and run it on demand via workflow_dispatch. Provider scenarios
skip when their secret is unset, so it degrades gracefully.
* [e2e] Add Vertex to the provider matrix; run e2e on ubuntu-latest
Vertex (Anthropic-on-Vertex) doesn't share the chat/messages wire shapes:
the model travels in a rawPredict path and the proxy mints the service
account's OAuth token. Add a Vertex client method that posts
/v1/projects/<project>/locations/<region>/publishers/anthropic/models/<model>:rawPredict
with the Vertex anthropic_version body, and wire it into the matrix as a
path-routed provider (created without a models array). It is keyed off
GOOGLE_VERTEX_SA_BASE64 + GOOGLE_VERTEX_PROJECT (region defaults to
"global", model to a pinned claude snapshot, both overridable).
Also bump the e2e workflow runner to ubuntu-latest and add the Vertex
secrets.
* Add docker/docker and docker/go-connections as direct dependencies in go.mod
* [ci] Trigger agent-network e2e workflow on push to main and pull requests
* [e2e] Fix proxy cert permission denied on Linux CI runners
The proxy bind-mounts a temp dir of self-signed certs. MkdirTemp creates
it 0700 and the key was 0600, which Docker Desktop on macOS ignores but a
non-root proxy container on Linux runners cannot traverse/read, so the
cert watcher failed with "open /certs/tls.crt: permission denied" and the
container exited. Widen the cert dir to 0755 and write the throwaway key
0644 so the proxy uid can read the bind-mounted material.
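A minimal sketch of the permission widening described above, assuming a hypothetical helper named writeCertDir and a key file named tls.key (only tls.crt appears in the original error message). The explicit Chmod calls matter because the mode passed to os.WriteFile is filtered by the process umask.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

// writeCertDir (hypothetical helper) writes throwaway TLS material into a
// temp dir that a non-root container user can traverse and read.
func writeCertDir(cert, key []byte) (string, error) {
	dir, err := os.MkdirTemp("", "e2e-certs-")
	if err != nil {
		return "", err
	}
	// MkdirTemp creates the directory 0700; widen it so other uids can enter.
	if err := os.Chmod(dir, 0o755); err != nil {
		return "", err
	}
	files := map[string][]byte{"tls.crt": cert, "tls.key": key}
	for name, data := range files {
		path := filepath.Join(dir, name)
		if err := os.WriteFile(path, data, 0o644); err != nil {
			return "", err
		}
		// Set the mode explicitly so a restrictive umask cannot narrow it.
		if err := os.Chmod(path, 0o644); err != nil {
			return "", err
		}
	}
	return dir, nil
}

// permOf returns the permission bits of path, or 0 if it cannot be stat'ed.
func permOf(path string) uint32 {
	info, err := os.Stat(path)
	if err != nil {
		return 0
	}
	return uint32(info.Mode().Perm())
}

func main() {
	dir, err := writeCertDir([]byte("cert"), []byte("key"))
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	defer os.RemoveAll(dir)
	fmt.Printf("%o %o\n", permOf(dir), permOf(filepath.Join(dir, "tls.key")))
}
```

World-readable key material is acceptable here only because the certificate is a throwaway generated per test run.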
* [e2e] Build images from source by default instead of pulling rc.2
The agent-network code under test lives in this branch, so the e2e should
exercise it rather than a frozen published release. Flip the harness
default: combined/proxy/client are now built from their in-repo
Dockerfiles (combined/Dockerfile.multistage, proxy/Dockerfile.multistage,
e2e/harness/Dockerfile.client) under local tags. Pulling a published image
stays available by setting NB_E2E_*_IMAGE to a registry reference.
Builds now go through buildx --load so the Dockerfile cache mounts are
honored and the result is loaded for testcontainers. The CI workflow adds
a container-driver builder and a local layer cache (NB_E2E_BUILDX_CACHE)
persisted via actions/cache, which caches the base/apt/dep-download layers
across runs. The Go compile still re-runs each time, as BuildKit mount
caches cannot be exported to the GitHub cache.
* [e2e] Cover real providers in lifecycle + assert real consumption metering
- TestProviderLifecycle now runs per available real provider (create → get →
list → delete → 404) instead of a single dummy provider, exercising each
catalog's create and field round-trip. Create is offline, so it stays fast
and burns no provider quota; falls back to a synthetic OpenAI provider when
no keys are set.
- TestProvidersMatrix attaches a token limit (high caps, 60s window) to its
policy, which switches on usage metering, and asserts consumption rows are
recorded with positive token counts after the live traffic. Consumption is
account-scoped (keyed by source group / user and window, not per provider),
so the assertion is aggregate.
- TestProviderValidation gains invalid-upstream and blank-name cases. Create
validation is uniform across catalogs (no per-provider required-field rules),
so per-provider rejection cases would be redundant.
* [e2e] Assert session id propagates per provider
Each matrix request now sends a unique session id as the universal
x-session-id header and asserts it round-trips into that provider's
access-log row. This guards the session-grouping contract end to end for
every provider (header extraction runs in llm_request_parser ahead of the
parser-specific body extraction, so it is provider-agnostic).
* [e2e] Drop accidentally committed sync-phases dashboard
netbird-sync-phases.json was swept into the Pillar 1 commit by a broad
git add; it belongs to the unrelated sync-phases metrics work, not this
e2e harness. Remove it from the branch so the PR diff is scoped to the
e2e changes.
* [e2e] Revert accidentally committed sync-phase ingest spec
The netbird_sync_phase measurement spec in metrics ingest was swept into
the Pillar 1 commit; it belongs to the unrelated sync-phases metrics work,
not this e2e harness. Its emission side never landed here, so the spec was
orphaned anyway. Restore ingest/main.go to its origin/main state.
* Fix golint issues
* Fix sonar
* Add access log session test
* Fix access log tests
---------
Co-authored-by: braginini <bangvalo@gmail.com>
Co-authored-by: Zoltan Papp <zoltan.pmail@gmail.com>
* [management] Fetch complete user data in ValidateTunnelPeer
Previously the `ValidateTunnelPeer` method used by the ProxyService
would fetch user information from the database if the connected peer
was associated with a user ID, but it would not consult the IdP data
for cached info from JWT claims like email. This caused the value of
the injected `X-Netbird-User` header to always display the peer ID and
never the user email associated with the peer as expected.
This change adds an optional IdP manager to the ProxyService and
fetches the complete user data from it if present.
* [management] Refactor ValidateTunnelPeer principal info gathering
This refactors the gathering of info on proxy tunnel peer principals
into its own method to keep the complexity down and make Sonar happy.
* fix(proxy): gate tunnel-peer fast-path on inbound listener marker
forwardWithTunnelPeer previously accepted any RFC1918 / ULA / CGNAT
source IP, so a public client whose address happened to fall in those
ranges could bypass the configured operator auth scheme by colliding
with a known tunnel IP. The fast-path is now gated on
TunnelLookupFromContext(r.Context()) being present — that context value
is attached only by the per-account inbound (overlay) listener, so the
host-facing listener never enters this branch.
Tests updated to reflect the new requirement: requests that don't
carry the inbound marker now fall through to the regular auth flow.
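The gating idea can be sketched as below. TunnelLookupFromContext is named in the commit, but its signature, the TunnelLookup type, WithTunnelLookup, and useFastPath are assumptions made for illustration.

```go
package main

import (
	"context"
	"fmt"
)

// tunnelLookupKey is an unexported key type, so only this package can attach
// or read the marker.
type tunnelLookupKey struct{}

// TunnelLookup (assumed shape) resolves a tunnel IP to a peer ID.
type TunnelLookup func(ip string) (peerID string, ok bool)

// WithTunnelLookup is what the overlay listener would call per connection.
func WithTunnelLookup(ctx context.Context, l TunnelLookup) context.Context {
	return context.WithValue(ctx, tunnelLookupKey{}, l)
}

// TunnelLookupFromContext returns nil when the request did not arrive on the
// overlay listener.
func TunnelLookupFromContext(ctx context.Context) TunnelLookup {
	l, _ := ctx.Value(tunnelLookupKey{}).(TunnelLookup)
	return l
}

// useFastPath reports whether the tunnel-peer fast path may be taken. The
// source IP alone is never enough: the marker must be present first.
func useFastPath(ctx context.Context, srcIP string) bool {
	lookup := TunnelLookupFromContext(ctx)
	if lookup == nil {
		return false
	}
	_, ok := lookup(srcIP)
	return ok
}

// fastPathFor builds a context with or without the marker for demonstration.
func fastPathFor(viaOverlay bool, srcIP string) bool {
	ctx := context.Background()
	if viaOverlay {
		ctx = WithTunnelLookup(ctx, func(ip string) (string, bool) {
			if ip == "100.64.0.7" {
				return "peer-7", true
			}
			return "", false
		})
	}
	return useFastPath(ctx, srcIP)
}

func main() {
	fmt.Println(fastPathFor(true, "100.64.0.7"))  // overlay listener, known peer
	fmt.Println(fastPathFor(false, "100.64.0.7")) // host-facing listener, same IP
}
```

A request on the host-facing listener carries no marker, so it falls through to the regular auth flow regardless of its source address.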
* fix(proxy): harden inbound listener resource + startup-ctx handling
Three correctness fixes on the per-account inbound path, with tests:
- Close the logrus ErrorLog PipeWriter on tearDown. WriterLevel hands
back an *io.PipeWriter backed by a pipe + scanner goroutine that the
caller owns; the two writers per account (https + plain) were never
closed, leaking the pipe and goroutine on every teardown.
- Run the post-Start hooks on context.Background(). runClientStartup
is launched in a goroutine from AddPeer and was inheriting the
caller's request-scoped ctx, so a cancelled request could abort the
inbound bring-up or fail the management status notification. The
tail is split into notifyClientReady so the contract is testable.
Tests cover the PipeWriter close behaviour and assert the readyHandler
+ NotifyStatus calls receive a non-cancelled background context.
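A simplified sketch of the context handling described above. The function names mirror the commit, but their signatures and bodies here are assumptions; the real hooks do more than report ctx.Err().

```go
package main

import (
	"context"
	"fmt"
)

// notifyClientReady stands in for the post-Start tail; here it only reports
// whether the context it was given is still live.
func notifyClientReady(ctx context.Context) error {
	return ctx.Err()
}

// runClientStartup runs in its own goroutine. It deliberately ignores the
// caller's request-scoped context for the post-Start hooks, so a cancelled
// request cannot abort the bring-up.
func runClientStartup(_ context.Context, done chan<- error) {
	done <- notifyClientReady(context.Background())
}

// startupSurvivesCancel launches the startup with an already-cancelled
// request context and reports whether the hooks still saw a live context.
func startupSurvivesCancel() bool {
	reqCtx, cancel := context.WithCancel(context.Background())
	cancel() // the request finished or was aborted before startup completed

	done := make(chan error, 1)
	go runClientStartup(reqCtx, done)
	return <-done == nil
}

func main() {
	fmt.Println(startupSurvivesCancel())
}
```

Splitting the tail into its own function is what makes the "receives a non-cancelled context" contract directly assertable in a test.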
* feat(proxy): short-circuit peer-own-target loops with 421
When a peer that hosts the target of a private service dials its own
service URL the request was being looped through the proxy and back
over WireGuard to the same peer — twice the WG round-trip for no
benefit, with no signal to the caller that something was wrong.
Add isSelfTargetLoop to ReverseProxy.ServeHTTP: when the request
arrived on the per-account overlay listener (IsOverlayOrigin) and the
source tunnel IP matches the target host, refuse the request with 421
Misdirected Request and a body pointing the operator at the backend
directly.
The gate is scoped to overlay origin so requests on the public
listener that happen to share a source IP with the target host are
forwarded normally.
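The check can be sketched roughly as follows. isSelfTargetLoop is named in the commit, but the parameters, the serve wrapper, and the response body text are assumptions; this sketch also compares only literal IP targets.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
	"net/netip"
)

// isSelfTargetLoop reports whether an overlay-origin request comes from the
// same tunnel IP it would be forwarded to. Hostname targets are not resolved
// in this sketch and never match.
func isSelfTargetLoop(overlayOrigin bool, src netip.Addr, targetHost string) bool {
	if !overlayOrigin {
		return false
	}
	target, err := netip.ParseAddr(targetHost)
	if err != nil {
		return false
	}
	return src.Unmap() == target.Unmap()
}

// serve refuses self-target loops with 421 and otherwise pretends to forward.
func serve(w http.ResponseWriter, overlayOrigin bool, src netip.Addr, targetHost string) {
	if isSelfTargetLoop(overlayOrigin, src, targetHost) {
		http.Error(w, "request targets the peer it came from; call the backend directly",
			http.StatusMisdirectedRequest)
		return
	}
	w.WriteHeader(http.StatusOK)
}

// statusFor runs serve against a recorder and returns the status code.
func statusFor(overlayOrigin bool, src, target string) int {
	rec := httptest.NewRecorder()
	serve(rec, overlayOrigin, netip.MustParseAddr(src), target)
	return rec.Code
}

func main() {
	fmt.Println(statusFor(true, "100.64.0.5", "100.64.0.5"))  // overlay, same host
	fmt.Println(statusFor(false, "100.64.0.5", "100.64.0.5")) // public listener
}
```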
* fix(management): private-service validation + tunnel-IP lookup semantics
- Require an explicit port for L4 cluster targets. validateL4Target
exempted TargetTypeCluster from the port check, but buildPathMappings
serializes every L4 target via net.JoinHostPort(host, port) — port=0
shipped a ":0" upstream. Cluster targets use the same Host/Port
fields, so the same requirement applies.
- GetPeerByIP returns NotFound on a tunnel-IP miss instead of mapping
every error to Internal. The proxy's ValidateTunnelPeer probes IPs
that legitimately aren't in the roster; the miss is expected and now
distinguishable from a real store failure.
- Thread ctx into getClusterCapability's gorm query so a cancelled
request doesn't keep the store busy.
Tests updated for the L4-cluster port requirement and the GetPeerByIP
NotFound path.
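To illustrate the L4 port requirement, here is a minimal sketch. The validateL4Target signature and the upstreamAddr helper are hypothetical; only net.JoinHostPort is taken from the description above.

```go
package main

import (
	"errors"
	"fmt"
	"net"
	"strconv"
)

// validateL4Target (assumed signature) applies the same host/port rules to
// every L4 target type, cluster targets included.
func validateL4Target(host string, port int) error {
	if host == "" {
		return errors.New("l4 target: host is required")
	}
	if port < 1 || port > 65535 {
		return fmt.Errorf("l4 target: port %d is out of range 1-65535", port)
	}
	return nil
}

// upstreamAddr validates first, then serializes. Without the check, a zero
// port would serialize to a "host:0" upstream.
func upstreamAddr(host string, port int) (string, error) {
	if err := validateL4Target(host, port); err != nil {
		return "", err
	}
	return net.JoinHostPort(host, strconv.Itoa(port)), nil
}

func main() {
	fmt.Println(upstreamAddr("10.0.0.1", 8443))
	fmt.Println(upstreamAddr("fd00::1", 443))
	fmt.Println(upstreamAddr("10.0.0.1", 0))
}
```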
* fix(client): include offlinePeers in PeerStateByIP lookup
ReplaceOfflinePeers moves peers into d.offlinePeers but PeerStateByIP
only scanned d.peers. Callers (the local DNS filter via
localPeerConnectivity, embed.Client.IdentityForIP used by the
proxy's tunnel-peer validator) were treating known-but-offline peers
as unknown, which:
- causes the DNS filter to keep returning records pointing at peers
that have no live tunnel, AND
- makes the proxy's local-roster check deny a request from such a
peer rather than letting the cached management RPC carry the
authorisation decision.
Search both slices in PeerStateByIP. Adds a unit test for the IPv4
and IPv6 offline-match paths.
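A reduced sketch of the lookup fix, assuming simplified State and Status types with two plain slices; the real status recorder has more fields and locking.

```go
package main

import (
	"fmt"
	"net/netip"
)

// State is a simplified peer state (assumed shape).
type State struct {
	IP        netip.Addr
	Connected bool
}

// Status holds connected and known-but-offline peers separately.
type Status struct {
	peers        []State
	offlinePeers []State
}

// PeerStateByIP searches both collections, so an offline peer is reported as
// known (with Connected == false) rather than as unknown.
func (d *Status) PeerStateByIP(ip netip.Addr) (State, bool) {
	for _, p := range d.peers {
		if p.IP == ip {
			return p, true
		}
	}
	for _, p := range d.offlinePeers {
		if p.IP == ip {
			return p, true
		}
	}
	return State{}, false
}

// newStatus builds a Status from string IPs for demonstration.
func newStatus(online, offline []string) *Status {
	d := &Status{}
	for _, s := range online {
		d.peers = append(d.peers, State{IP: netip.MustParseAddr(s), Connected: true})
	}
	for _, s := range offline {
		d.offlinePeers = append(d.offlinePeers, State{IP: netip.MustParseAddr(s)})
	}
	return d
}

// known reports whether ip belongs to any peer in the roster.
func known(d *Status, ip string) bool {
	_, ok := d.PeerStateByIP(netip.MustParseAddr(ip))
	return ok
}

func main() {
	d := newStatus([]string{"100.64.0.1"}, []string{"100.64.0.2", "fd00::2"})
	fmt.Println(known(d, "100.64.0.2"), known(d, "fd00::2"), known(d, "100.64.0.9"))
}
```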
* fix(rest): reject empty Delete path params in reverse-proxy clients
ReverseProxyClustersAPI.Delete and ReverseProxyTokensAPI.Delete passed
the path parameter into url.PathEscape without an empty check.
PathEscape("") returns "", which collapses the request onto the
collection endpoint ("/api/reverse-proxies/clusters/" /
"/api/reverse-proxies/proxy-tokens/"), so a caller bug (a delete with no
id) reached a routable URL with surprising semantics (typically 405).
Short-circuit with a typed error before the request is built. Tests
mount a handler on the collection path that fails the test if hit, so
the regression is impossible to reintroduce silently.
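The guard can be sketched like this. ErrEmptyID and deletePath are hypothetical names; the real clients return their own typed error.

```go
package main

import (
	"errors"
	"fmt"
	"net/url"
)

// ErrEmptyID is a hypothetical typed error for a missing path parameter.
var ErrEmptyID = errors.New("id must not be empty")

// deletePath builds the item URL for a Delete call. It rejects an empty id
// up front, because url.PathEscape("") is "" and the result would otherwise
// be the collection endpoint with a trailing slash.
func deletePath(base, id string) (string, error) {
	if id == "" {
		return "", ErrEmptyID
	}
	return base + "/" + url.PathEscape(id), nil
}

func main() {
	fmt.Println(deletePath("/api/reverse-proxies/clusters", "abc"))
	fmt.Println(deletePath("/api/reverse-proxies/clusters", ""))
}
```

Returning the error before the request is built means no network call is made at all for the invalid input.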
* chore(api,ci,docs,test): private-service schema, proto-check, fixups
Non-functional cleanups and contract/CI hardening around the
private-service work:
API schema (openapi.yml):
- Require a non-empty access_groups and mode=http when private=true,
on both Service and ServiceRequest, mirroring
validatePrivateRequirements. mode stays optional-but-constrained
(empty defaults to http server-side), matching runtime.
CI (proto-version-check.yml):
- Cover renamed .pb.go files (read base via previous_filename).
- Match protoc-gen-go-grpc version headers (optional "- " prefix and
-gen-go-grpc suffix) so grpc-generated files are in scope.
Docs / comments:
- Reword Config field docs to say defaults are applied at Server.Start
(initDefaults), not New.
- Rename the obsolete --private-inbound flag to --private across
comments and the proto doc.
Pre-existing test fixups surfaced by review:
- Repair the integration-tagged validate_session_test.go (SignToken
signature growth + new Manager interface methods).
- Fix the CI-skip boolean precedence so Windows isn't skipped
unconditionally.
- Guard the router.HTTPListener type assertion with comma-ok.
* fix(proxy): background ctx for already-started AddPeer notification
The earlier ctx fix covered the async runClientStartup path but missed
the synchronous branch: when a service is added to an already-started
client, AddPeer called NotifyStatus with the caller's request-scoped
ctx. A cancelled request/stream could drop the connected notification
to management. Use context.Background() here too, matching
notifyClientReady.
Extends TestNetBird_AddPeer_ExistingStartedClient_NotifiesStatus to
pass a pre-cancelled caller ctx and assert the notification still ran
on a non-cancelled context.
* use the cmd context for roundtripper
Adds a new "private" service mode for the reverse proxy: services reachable exclusively over the embedded WireGuard tunnel, gated by per-peer group membership instead of operator auth schemes.
Wire contract
- ProxyMapping.private (field 13): the proxy MUST call ValidateTunnelPeer and fail closed; operator schemes are bypassed.
- ProxyCapabilities.private (4) + supports_private_service (5): capability gate. Management never streams private mappings to proxies that don't claim the capability; the broadcast path applies the same filter via filterMappingsForProxy.
- ValidateTunnelPeer RPC: resolves an inbound tunnel IP to a peer, checks the peer's groups against service.AccessGroups, and mints a session JWT on success. checkPeerGroupAccess fails closed when a private service has empty AccessGroups.
- ValidateSession/ValidateTunnelPeer responses now carry peer_group_ids + peer_group_names so the proxy can authorise policy-aware middlewares without an extra management round-trip.
- ProxyInboundListener + SendStatusUpdate.inbound_listener: per-account inbound listener state surfaced to dashboards.
- PathTargetOptions.direct_upstream (11): bypass the embedded NetBird client and dial the target via the proxy host's network stack for upstreams reachable without WireGuard.
Data model
- Service.Private (bool) + Service.AccessGroups ([]string, JSON-serialised). Validate() rejects bearer auth on private services. Copy() deep-copies AccessGroups. pgx getServices loads the columns.
- DomainConfig.Private threaded into the proxy auth middleware. Request handler routes private services through forwardWithTunnelPeer and returns 403 on validation failure.
- Account-level SynthesizePrivateServiceZones (synthetic DNS) and injectPrivateServicePolicies (synthetic ACL) gate on len(svc.AccessGroups) > 0.
Proxy
- /netbird proxy --private (embedded mode) flag; Config.Private in proxy/lifecycle.go.
- Per-account inbound listener (proxy/inbound.go) binding HTTP/HTTPS on the embedded NetBird client's WireGuard tunnel netstack.
- proxy/internal/auth/tunnel_cache: ValidateTunnelPeer response cache with single-flight de-duplication and per-account eviction.
- Local peerstore short-circuit: when the inbound IP isn't in the account roster, deny fast without an RPC.
- proxy/server.go reports SupportsPrivateService=true and redacts the full ProxyMapping JSON from info logs (auth_token + header-auth hashed values now only at debug level).
Identity forwarding
- ValidateSessionJWT returns user_id, email, method, groups, group_names. sessionkey.Claims carries Email + Groups + GroupNames so the proxy can stamp identity onto upstream requests without an extra management round-trip on every cookie-bearing request.
- CapturedData carries userEmail / userGroups / userGroupNames; the proxy stamps X-NetBird-User and X-NetBird-Groups on r.Out from the authenticated identity (strips client-supplied values first to prevent spoofing).
- AccessLog.UserGroups: access-log enrichment captures the user's group memberships at write time so the dashboard can render group context without reverse-resolving stale memberships.
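The strip-then-stamp step for the identity headers can be sketched as below. The header names come from the description above; the stampIdentity helper and the comma separator for groups are assumptions.

```go
package main

import (
	"fmt"
	"net/http"
	"strings"
)

const (
	headerUser   = "X-NetBird-User"
	headerGroups = "X-NetBird-Groups"
)

// stampIdentity removes any client-supplied identity headers and then sets
// them from the authenticated identity only. The comma separator for groups
// is an assumption of this sketch.
func stampIdentity(out http.Header, email string, groups []string) {
	out.Del(headerUser)
	out.Del(headerGroups)
	if email != "" {
		out.Set(headerUser, email)
	}
	if len(groups) > 0 {
		out.Set(headerGroups, strings.Join(groups, ","))
	}
}

// stamped simulates a request that arrives with a spoofed user header and
// returns the values the upstream would see.
func stamped(spoofedUser, email string, groups []string) (string, string) {
	h := http.Header{}
	h.Set(headerUser, spoofedUser)
	stampIdentity(h, email, groups)
	return h.Get(headerUser), h.Get(headerGroups)
}

func main() {
	fmt.Println(stamped("admin@spoofed.example", "alice@example.com", []string{"dev", "ops"}))
	fmt.Println(stamped("admin@spoofed.example", "", nil))
}
```

Deleting before setting matters for the unauthenticated case: with no identity to stamp, the spoofed value is still removed rather than passed through.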
OpenAPI/dashboard surface
- ReverseProxyService gains private + access_groups; ReverseProxyCluster gains private + supports_private. ReverseProxyTarget target_type enum gains "cluster". ServiceTargetOptions gains direct_upstream. ProxyAccessLog gains user_groups.
The cluster listing now answers three questions in one round-trip
instead of forcing the dashboard to cross-reference the domains API:
which clusters can this account see, are they currently up, and what
do they support. The ProxyCluster wire type drops the boolean
self_hosted in favour of a `type` enum (`account` / `shared`) plus
explicit `online`, `supports_custom_ports`, `require_subdomain`, and
`supports_crowdsec` fields.
Store query reworked so offline clusters still appear (no last_seen
WHERE), with online and connected_proxies both derived from the
existing 2-min active window via portable CASE expressions; the
1-hour heartbeat reaper still removes long-stale rows. Service
manager enriches each cluster with the capability flags via the
existing per-cluster lookups (CapabilityProvider now also exposes
ClusterSupportsCrowdSec).
GetActiveClusterAddresses* keep their tight 2-min filter so service
routing and domain enumeration aren't pulled into the wider window.
The hard cut removes self_hosted from the response — the dashboard is
the only consumer and is updated in the matching PR; no transitional
field is shipped.
Adds a cross-engine regression test asserting offline clusters
surface, connected_proxies counts only fresh proxies, and
account-scoped BYOP clusters never leak across accounts.
* Add support for legacy IDP cache environment variable
* Centralize cache store creation to reuse a single Redis connection pool
Each cache consumer (IDP cache, token store, PKCE store, secrets manager,
EDR validator) was independently calling NewStore, creating separate Redis
clients with their own connection pools — up to 1400 potential connections
from a single management server process.
Introduce a shared CacheStore() singleton on BaseServer that creates one
store at boot and injects it into all consumers. Consumer constructors now
receive a store.StoreInterface instead of creating their own.
For Redis mode, all consumers share one connection pool (1000 max conns).
For in-memory mode, all consumers share one GoCache instance.
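A minimal sketch of the shared-store idea using sync.Once. The Store type, NewStore, and the consumer types here are simplified stand-ins, not the real store.StoreInterface or constructors.

```go
package main

import (
	"fmt"
	"sync"
)

// Store stands in for a cache store that owns a connection pool.
type Store struct{ id int }

// created counts how many stores (and thus pools) were built.
var created int

// NewStore is a stand-in constructor; in Redis mode each call would open a
// new connection pool.
func NewStore() *Store {
	created++
	return &Store{id: created}
}

// BaseServer lazily creates one store and hands the same instance out.
type BaseServer struct {
	once  sync.Once
	cache *Store
}

// CacheStore returns the process-wide store, creating it on first use.
func (s *BaseServer) CacheStore() *Store {
	s.once.Do(func() { s.cache = NewStore() })
	return s.cache
}

// Consumers receive the store instead of constructing their own.
type IDPCache struct{ store *Store }
type TokenStore struct{ store *Store }

func NewIDPCache(st *Store) *IDPCache     { return &IDPCache{store: st} }
func NewTokenStore(st *Store) *TokenStore { return &TokenStore{store: st} }

func main() {
	srv := &BaseServer{}
	idp := NewIDPCache(srv.CacheStore())
	tok := NewTokenStore(srv.CacheStore())
	fmt.Println(idp.store == tok.store, created)
}
```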
* Update management-integrations module to latest version
* sync go.sum
* Export `GetAddrFromEnv` to allow reuse across packages
* Update management-integrations module version in go.mod and go.sum
The internal Target model uses a plain bool for ProxyProtocol,
which was always serialized to the API response as false even
when not configured. Only set the API field when true so it
gets omitted via omitempty when unset.
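The omitempty behaviour can be illustrated as follows, assuming a simplified apiTarget type and a proxy_protocol JSON field name; the generated API type may differ.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// apiTarget is a simplified API response type. A *bool with omitempty is
// dropped from the JSON when the pointer is nil.
type apiTarget struct {
	Host          string `json:"host"`
	ProxyProtocol *bool  `json:"proxy_protocol,omitempty"`
}

// toAPI converts the internal plain-bool value. The pointer is set only when
// the flag is true, so an unconfigured flag is omitted instead of "false".
func toAPI(host string, proxyProtocol bool) apiTarget {
	t := apiTarget{Host: host}
	if proxyProtocol {
		v := true
		t.ProxyProtocol = &v
	}
	return t
}

// encode marshals the target to a JSON string.
func encode(t apiTarget) string {
	b, err := json.Marshal(t)
	if err != nil {
		return ""
	}
	return string(b)
}

func main() {
	fmt.Println(encode(toAPI("10.0.0.1", false))) // {"host":"10.0.0.1"}
	fmt.Println(encode(toAPI("10.0.0.1", true)))  // {"host":"10.0.0.1","proxy_protocol":true}
}
```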
* **New Features**
* Access logs now include bytes_upload and bytes_download (API and schemas updated, fields required).
* Certificate issuance duration is now recorded as a metric.
* **Refactor**
* Metrics switched from Prometheus client to OpenTelemetry-backed meters; health endpoint now exposes OpenMetrics via OTLP exporter.
* **Tests**
* Metric tests updated to use OpenTelemetry Prometheus exporter and MeterProvider.
The expose tracker used sync.Map for in-memory TTL tracking of active expose sessions, so all sessions were lost on restart.
Replace with SQL-backed operations that reuse the existing meta_last_renewed_at column:
- Add store methods: RenewEphemeralService, GetExpiredEphemeralServices, CountEphemeralServicesByPeer, EphemeralServiceExists
- Move duplicate/limit checks inside a transaction with row-level locking (SELECT ... FOR UPDATE) to prevent concurrent bypass
- Reaper re-checks expiry under row lock to avoid deleting a just-renewed service and prevent duplicate event emission
- Add composite index on (source, source_peer) for efficient queries
- Batch-limit and column-select the reaper query to avoid DB/GC spikes
- Filter out malformed rows with empty source_peer
Consolidate all expose business logic (validation, permission checks, TTL tracking, reaping) into the manager layer, making the gRPC layer a pure transport adapter that only handles proto conversion and authentication.
- Add ExposeServiceRequest/ExposeServiceResponse domain types with validation in the reverseproxy package
- Move expose tracker (TTL tracking, reaping, per-peer limits) from gRPC server into manager/expose_tracker.go
- Internalize tracking in CreateServiceFromPeer, RenewServiceFromPeer, and new StopServiceFromPeer so callers don't manage tracker state
- Untrack ephemeral services in DeleteService/DeleteAllServices to keep tracker in sync when services are deleted via API
- Simplify gRPC expose handlers to parse, auth, convert, delegate
- Remove tracker methods from Manager interface (internal detail)
CLI: new expose command to publish a local port with flags for PIN, password, user groups, custom domain, name prefix and protocol (HTTP default).
Management/API: create/renew/stop expose sessions (streamed status), automatic naming/domain, TTL renewals, background expiration, new management RPCs and client methods.
UI/API: account settings now include peer_expose_enabled and peer_expose_groups; new activity codes for peer expose events.
Bug Fixes
Network and DNS updates now defer service and reverse-proxy reloads until after account updates complete, preventing inconsistent proxy state and race conditions.
Chores
Removed automatic peer/broadcast updates immediately following bulk service reloads.
Tests
Added a test ensuring network-range changes complete without deadlock.