Prometheus Integration

The Prometheus integration lets an external monitoring stack scrape platform-health metrics from the appliance. It exposes appliance liveness and component state only — no traffic data ever leaves through this endpoint.

The integration is admin-controlled and revocable. Administrators mint long-lived bearer tokens, hand them to the operator of the monitoring stack, and revoke them whenever a token is no longer needed. Each token authenticates one job in Prometheus’s scrape configuration.

Prometheus tokens are managed from Integrations > Prometheus in the sidebar, which lands on the page documented below. The parent landing page is documented in the Integrations Panel chapter.

The Hard Scope: Platform Health Only

The scrape endpoint exposes appliance health — never DHCP traffic.

This is a deliberate scoping decision and it is enforced by the appliance, not by operator discipline. The metrics surfaced here let an external monitoring stack answer questions like:

Is the appliance up?
Is its connection to ClickHouse healthy?
Is the alarm engine ticking on schedule?
Is the notification dispatcher backing up?
Is the vector sink’s circuit breaker open?
How many active alarms are there, by severity?

The endpoint does not expose:

Any client MAC address, DUID, IP address, or vendor class.
Any per-device counter, per-rule counter, or per-event timeseries.
Any DHCP packet count broken down by anything other than aggregate alarm state.
Any LLM verdict, automation rule firing, or action history.

If you need rich traffic counters, use the on-appliance Statistics & Reports (chapter 19) page. If you need historical event-level data, use the Report Builder (chapter 19). Prometheus is for “is the appliance healthy?” — nothing more.

Why the hard scope? Exposing per-MAC or per-rule counters on a scrape endpoint creates a side channel that an operator with monitoring access could use to reconstruct subscriber traffic. The appliance protects subscriber privacy by keeping per-device data inside the GUI’s authenticated views. The Prometheus surface is platform-health only by design and any change to that scope is a release-note-level event.

Token Management Page

Integrations > Prometheus Scrape Tokens lists every minted token and lets administrators create or revoke them.

The page is reachable from the sidebar via Integrations, then Manage tokens on the Prometheus card. See the Integrations Panel chapter for the landing page that hosts that card. The route shows a header, an error banner when something fails, the Create Token button on the right, and a table of existing tokens below.

Token List

The table shows every minted token, including revoked ones, so administrators have a full audit trail of who has held a token for what window. Tokens are never deleted — revocation flips the status but leaves the row in place.

Column	Description
Name	Operator-facing label set at creation time. Free text, up to 200 characters. Shown here and in the audit log.
Created	When the token was minted.
Last Used	When the token last authenticated a scrape request, or an em-dash if it has never been used.
Last IP	The source IP of the most recent successful scrape. Useful for confirming the right Prometheus job is using the token.
Expires	The expiry timestamp, or `Never` if no expiry was set.
Status	`Active`, `Expired`, or `Revoked`. Coloured green / amber / red respectively.
Actions	A trash-can icon to revoke. The icon is hidden for already-revoked tokens.

The cleartext token value is not on this page and cannot be retrieved from this page. The appliance stores only a hash; the value itself is shown exactly once at creation time.

Creating a Token

Click Create Token in the top-right to open the creation modal.

The modal collects two fields:

Field	Required	Description
Name	Yes	A human-readable label. Use something the operator of the monitoring stack will recognise — e.g. “Production Prometheus” or “Grafana Cloud Staging”. This is the label that appears in the table above, in the audit log, and on bell notifications when the token is created or revoked.
Expires	No	An optional expiry date and time, picked from a date-time control. Leave empty for a non-expiring token. Once this moment passes, the token is rejected on the next scrape attempt and its status flips to Expired in the table.

Click Create Token to submit. On success, the appliance closes the creation modal and opens a second modal — described below — that shows the cleartext token value exactly once.

The cancel button discards the entries and closes the modal without minting a token. There is no “create as draft” state — until you confirm, nothing is created.

This is the only time the cleartext value is ever visible. Save it now.

After a successful creation, a second modal appears with three elements:

A loud amber warning: “This is the only time the token value will be shown. Copy it now — it cannot be retrieved later. The database stores only a hash of the token, not the value itself.”
A read-only text box holding the cleartext token, with a Copy button to the right. The button changes to Copied! for two seconds on success.
A summary of the Name and Expires values you set, so you can confirm them before closing.

The single button at the bottom — “I have saved my token” — closes the modal. Once closed, the cleartext value is unrecoverable. If the token is lost between creation and storage, the only remedy is to revoke this token and mint a new one.

What happens if I close the modal too early? Revoke the token immediately, then mint a fresh one. There is no risk in revoking — the token has no real consumer yet — and there is no way to recover the cleartext value once the modal is closed.

Revoking a Token

Click the trash-can icon on a token’s row to revoke it.

A confirmation dialog appears: “Are you sure you want to revoke X? The token will stop authenticating against the Prometheus endpoint immediately. This cannot be undone.”

On confirm:

The token’s status flips to Revoked in the table.
The token is rejected on its next scrape attempt — usually within seconds. Prometheus will record a scrape failure on its next cycle.
The revocation event is written to the audit log.

Revocation is permanent. There is no “un-revoke” — to restore service to a Prometheus job whose token was revoked in error, mint a new token and update Prometheus’s scrape configuration.
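On the Prometheus side, a rejected token shows up as a failed scrape, which sets the built-in up series for that target to 0. The rule below is a minimal alerting sketch, assuming the job is named dhcp-dpi as in the example scrape job later in this chapter. The group name, alert name, severity label, and two-minute window are illustrative choices, not shipped defaults.

```yaml
groups:
  - name: dhcp-dpi-scrape            # hypothetical group name
    rules:
      # up is set by Prometheus itself: 1 after a successful scrape, 0 after a failed one.
      - alert: DhcpDpiScrapeFailing  # hypothetical alert name
        expr: up{job="dhcp-dpi"} == 0
        for: 2m                      # assumed window; tune to your paging policy
        labels:
          severity: warning
        annotations:
          summary: Scrape of the appliance is failing (check token status and expiry)
```

This rule cannot tell a revoked or expired token apart from an unreachable appliance. The last scrape error shown on the Prometheus targets page is the quickest way to distinguish the two.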

Operational habit: revoke a token the moment its job is decommissioned, and rotate tokens on a schedule. The page’s Last Used column makes stale tokens easy to spot.

The Scrape Endpoint

One bearer-token-authenticated route returns the Prometheus exposition.

Property	Value
URL	`https://<appliance-host>/api/metrics/prometheus`
Method	`GET`
Auth	`Authorization: Bearer <scrape-token>`
Content-Type	`text/plain; version=0.0.4` (standard Prometheus exposition format)
Mounted when	`prometheus.enabled: true` in `config.yaml`. If the flag is unset or false, the route is not mounted at all and the URL returns 404.

Token Type Enforcement

Scrape tokens are a distinct token type from the JWTs used for the GUI and the rest of the API. The appliance enforces this both ways:

The scrape route accepts only scrape tokens. A valid JWT sent to /api/metrics/prometheus is rejected with 401.
Every other route accepts only JWTs. A valid scrape token sent to any other route is rejected with 401.

This separation means a stolen JWT cannot scrape, and a stolen scrape token cannot do anything other than read platform-health metrics. It is structural: operators do not have to remember to use the right token, because the wrong token always fails.

Example Prometheus Scrape Job

The exact configuration syntax for Prometheus is documented at prometheus.io. A typical job that scrapes this appliance looks like this:

```yaml
scrape_configs:
  - job_name: dhcp-dpi
    metrics_path: /api/metrics/prometheus
    scheme: https
    scrape_interval: 30s
    scrape_timeout: 10s
    authorization:
      type: Bearer
      credentials: <paste-the-cleartext-token-here>
    static_configs:
      - targets:
          - appliance.example.com
```

If the appliance’s TLS certificate is not signed by a CA Prometheus trusts, configure tls_config.ca_file (or, for testing only, tls_config.insecure_skip_verify: true).
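As a sketch of the TLS options just mentioned, the job above could be extended as follows. The CA bundle path is a hypothetical location; substitute wherever your internal CA certificate is stored on the Prometheus host.

```yaml
scrape_configs:
  - job_name: dhcp-dpi
    metrics_path: /api/metrics/prometheus
    scheme: https
    authorization:
      type: Bearer
      credentials: <paste-the-cleartext-token-here>
    tls_config:
      # Hypothetical path to the CA bundle that signed the appliance certificate.
      ca_file: /etc/prometheus/certs/internal-ca.pem
      # Testing only: uncomment to disable certificate verification entirely.
      # insecure_skip_verify: true
    static_configs:
      - targets:
          - appliance.example.com
```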

Scrape interval recommendation: 30 seconds is plenty. The metrics described below change slowly (alarm counts, queue depths, gauges), and there is no value in scraping more often. The appliance does not store the scrapes itself; halving the interval just doubles Prometheus’s ingest load.

What the Endpoint Exposes

Five families of metrics, all named with the dhcp_dpi_ prefix so they are easy to filter in your monitoring stack.

Platform Information

Metric	Type	Description
`dhcp_dpi_build_info{version, commit, go_version}`	Gauge	Always 1. Labels carry the running release. Useful as a join key on every other metric and as a simple presence check.
`dhcp_dpi_process_start_time_seconds`	Gauge	Unix epoch seconds at which the appliance started. Subtract from `time()` for an uptime metric.

The default Go runtime collector and process collector are also exposed under their standard names (go_*, process_*) so you can chart CPU, memory, and goroutine counts alongside the appliance-specific gauges.
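The two platform gauges map naturally onto Prometheus rules. The following is a minimal sketch of a rule file, assuming the scrape job is named dhcp-dpi. The group name, recorded series names, alert name, and five-minute window are illustrative, not something the appliance ships.

```yaml
groups:
  - name: dhcp-dpi-platform          # hypothetical group name
    rules:
      # Uptime in seconds: current time minus the process start time.
      - record: dhcp_dpi:uptime_seconds
        expr: time() - dhcp_dpi_process_start_time_seconds

      # Join key: copy the version label from build_info onto the alarm gauge.
      # build_info is always 1, so the multiplication leaves the value unchanged.
      - record: dhcp_dpi:alarms_active:with_version
        expr: dhcp_dpi_alarms_active * on (job, instance) group_left (version) dhcp_dpi_build_info

      # Presence check: fires when no build_info series exists for the job.
      - alert: DhcpDpiMetricsAbsent  # hypothetical alert name
        expr: absent(dhcp_dpi_build_info{job="dhcp-dpi"})
        for: 5m                      # assumed window
        labels:
          severity: warning
        annotations:
          summary: No dhcp_dpi_build_info series is being scraped from the appliance
```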

Alarm Engine

Metric	Type	Description
`dhcp_dpi_alarms_active{severity}`	Gauge	Current count of firing or acknowledged alarms by severity. `severity` values: `critical`, `warning`, `info`.
`dhcp_dpi_alarms_state_transitions_total{from, to}`	Counter	Total alarm state transitions since process start. `from` and `to` take the states of the Alarms (chapter 11) lifecycle: `firing`, `acknowledged`, `resolved`.
`dhcp_dpi_alarm_engine_tick_age_seconds`	Gauge	Seconds since the last alarm engine evaluation. Spikes here mean the engine has stalled — alert on values above a small multiple of the configured tick interval.

A reasonable Grafana panel: stack the three dhcp_dpi_alarms_active series and overlay the tick-age gauge. A growing critical count or a tick-age that keeps climbing both warrant a page.
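The same two conditions can be expressed as alerting rules. This is a sketch under stated assumptions: the tick interval is taken to be 60 seconds (this chapter does not state the configured value), so 180 seconds stands in for "a small multiple". The group name, alert names, windows, and severity labels are likewise illustrative.

```yaml
groups:
  - name: dhcp-dpi-alarm-engine      # hypothetical group name
    rules:
      # Assumes a 60s tick interval; 180s means roughly three missed evaluations.
      - alert: DhcpDpiAlarmEngineStalled
        expr: dhcp_dpi_alarm_engine_tick_age_seconds > 180
        for: 1m
        labels:
          severity: page
        annotations:
          summary: Alarm engine has not evaluated recently

      # delta() over a gauge: positive when the critical count rose during the window.
      - alert: DhcpDpiCriticalAlarmsGrowing
        expr: delta(dhcp_dpi_alarms_active{severity="critical"}[15m]) > 0
        for: 10m                     # assumed window
        labels:
          severity: page
        annotations:
          summary: Count of critical alarms keeps growing
```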

Notification Dispatcher

Metric	Type	Description
`dhcp_dpi_notifications_delivered_total{channel, result}`	Counter	Total notification delivery attempts. Labels: `channel` is `vector` or `bell`; `result` is `ok` or `error`.
`dhcp_dpi_notification_queue_depth`	Gauge	Current depth of the dispatcher’s internal queue. Sustained non-zero values indicate the appliance is producing notifications faster than the configured channel can deliver them.
`dhcp_dpi_vector_circuit_breaker_open`	Gauge	1 if the vector sink’s circuit breaker is currently open, else 0. The breaker opens when the vector destination is unreachable for too long.

The shipped guidance is to alert on either of:

rate(dhcp_dpi_notifications_delivered_total{result="error"}[5m]) > 0 sustained for a few minutes.
dhcp_dpi_vector_circuit_breaker_open == 1 for any sustained window.
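Written as a Prometheus rule file, the two conditions above might look like the sketch below. The expressions come straight from the guidance; the group name, alert names, severity labels, and the five-minute windows standing in for "a few minutes" and "any sustained window" are assumptions.

```yaml
groups:
  - name: dhcp-dpi-notifications     # hypothetical group name
    rules:
      # Any non-zero error rate that persists for the whole window.
      - alert: DhcpDpiNotificationErrors
        expr: rate(dhcp_dpi_notifications_delivered_total{result="error"}[5m]) > 0
        for: 5m                      # assumed window
        labels:
          severity: warning
        annotations:
          summary: Notification deliveries are failing on channel {{ $labels.channel }}

      # The breaker gauge is 1 while the vector destination is considered unreachable.
      - alert: DhcpDpiVectorBreakerOpen
        expr: dhcp_dpi_vector_circuit_breaker_open == 1
        for: 5m                      # assumed window
        labels:
          severity: warning
        annotations:
          summary: Vector sink circuit breaker is open
```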

ClickHouse

Metric	Type	Description
`dhcp_dpi_clickhouse_up{instance}`	Gauge	1 if ClickHouse responded within the staleness window (120 seconds by default), else 0. `instance` is the configured ClickHouse host.
`dhcp_dpi_clickhouse_query_errors_total{operation}`	Counter	Total errors returned from ClickHouse calls, labelled by the operation that failed (e.g. `insert_dhcp_events`, `insert_counter_stats`, `insert_set_stats`).
`dhcp_dpi_clickhouse_batch_insert_seconds`	Histogram	Time spent inserting a batch into ClickHouse. Twelve buckets from 1ms to 10s. Use `histogram_quantile()` for percentile dashboards.

Suggested alerting: page on dhcp_dpi_clickhouse_up == 0 for more than two minutes, and warn when the p99 of dhcp_dpi_clickhouse_batch_insert_seconds stays well above its usual baseline.
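A sketch of those two alerts as Prometheus rules follows. The two-minute window comes from the guidance above. The one-second p99 threshold, the ten-minute window, and all names are assumptions to be tuned against your own baseline. One caveat is worth checking in your setup: with Prometheus's default honor_labels: false, the appliance-supplied instance label on dhcp_dpi_clickhouse_up collides with the target's own instance label and is stored as exported_instance.

```yaml
groups:
  - name: dhcp-dpi-clickhouse        # hypothetical group name
    rules:
      - alert: DhcpDpiClickHouseDown
        expr: dhcp_dpi_clickhouse_up == 0
        for: 2m
        labels:
          severity: page
        annotations:
          summary: Appliance reports ClickHouse as unreachable

      # p99 insert latency over 5m; the 1s threshold is an assumed placeholder.
      - alert: DhcpDpiClickHouseInsertSlow
        expr: histogram_quantile(0.99, sum by (le) (rate(dhcp_dpi_clickhouse_batch_insert_seconds_bucket[5m]))) > 1
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: ClickHouse batch insert p99 is above 1s
```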

Go Runtime and Process

The Go collector exposes the standard runtime metrics (memory statistics, GC pause quantiles, goroutine count), and the process collector exposes CPU seconds, resident and virtual memory, open file descriptors, and the OS-level start time. These are present so a single Grafana dashboard can chart appliance health without joining against another exporter.

Grafana Dashboard Hints

A minimal four-panel dashboard covers the core of what this endpoint exposes.

Panel	Query (suggested)	What it tells you
Appliance uptime	`time() - dhcp_dpi_process_start_time_seconds`	Time since the appliance booted.
Active alarms by severity	`dhcp_dpi_alarms_active` stacked by `severity`	Current alarm load.
ClickHouse up	`dhcp_dpi_clickhouse_up`	Database connectivity.
Batch insert p99	`histogram_quantile(0.99, rate(dhcp_dpi_clickhouse_batch_insert_seconds_bucket[5m]))`	Insert latency tail.

Add a row for notification health:

Panel	Query (suggested)	What it tells you
Notification delivery rate	`sum by (channel, result) (rate(dhcp_dpi_notifications_delivered_total[5m]))`	Healthy delivery vs error rate per channel.
Notification queue depth	`dhcp_dpi_notification_queue_depth`	Dispatcher backpressure.
Vector breaker	`dhcp_dpi_vector_circuit_breaker_open`	0 or 1 indicator.

Use dhcp_dpi_build_info as a table panel showing the running version, commit, and Go version, which is handy during a release.

Dashboard import: the appliance does not ship a packaged Grafana dashboard JSON. The four-panel layout above takes a few minutes to assemble in the Grafana UI and tracks what the endpoint actually exposes; a packaged dashboard that drifted from the endpoint would be worse than no dashboard at all.

Enabling and Disabling the Integration

Enable the surface in config.yaml; the page in the GUI is always visible to administrators.

The Prometheus endpoint is gated by a single core-config flag in config.yaml. When the flag is false (the shipped default) the route is not mounted and no metrics are collected — the integration is truly zero-cost when off.

State	Behaviour
`prometheus.enabled: false`	`/api/metrics/prometheus` returns 404. The Token Management page still loads (so administrators can review existing tokens), but the Integrations landing page reports the integration as Disabled and links to this section.
`prometheus.enabled: true`	The route is mounted, scrape tokens authenticate against it, and all five metric families are populated and exposed.

After flipping the flag, restart the appliance. The change does not apply at runtime — Prometheus is a core, not operational, setting because it affects route registration. See Key Concepts > Core vs Operational Config (chapter 03) for the distinction.
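As a sketch, the relevant fragment of config.yaml might look like this. It assumes the dotted name prometheus.enabled maps onto a nested YAML key, which this chapter does not spell out; check the shipped config.yaml for the exact layout.

```yaml
# config.yaml (fragment; nesting is an assumption based on the dotted flag name)
prometheus:
  enabled: true   # false or unset: the route is not mounted and the URL returns 404
```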

Audit and Notifications

Every token lifecycle event is recorded.

Each of the following triggers an entry in the audit log visible in Sessions & Audit (chapter 30):

Token creation, with the administrator’s username and the chosen Name and Expires values.
Token revocation, with the administrator’s username.
Scrape authentication failures (wrong token type, expired token, revoked token, malformed Authorization header).

Notification rules in Alarms (chapter 11) can be configured to surface these audit events on the bell or on an external sink — for example, a critical-severity rule on creation events for the security team.

The Last Used and Last IP columns on the Token Management page provide quick visibility without leaving the page; the audit log is the authoritative record for compliance review.

Security Notes

A short list of habits that keep the integration boring.

Mint one token per consumer. Sharing a token between two Prometheus instances makes the Last Used and Last IP columns less useful and means a single revocation cuts off both consumers.
Set an expiry. Even a one-year expiry forces an annual rotation. Long-lived bearer tokens accumulate risk; expiry caps that risk.
Rotate proactively. When the administrator who minted a token leaves the team, revoke and re-mint. The current Last Used value lets you find tokens that are still in use.
Confine TLS to a known CA. Configure Prometheus’s tls_config.ca_file against your internal CA so a misconfigured DNS entry cannot redirect scrapes to an attacker’s endpoint.
Treat tokens like passwords. Store them in your monitoring stack’s secret store, not in plaintext config files committed to version control.
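One way to keep the token out of version-controlled configuration is Prometheus's credentials_file option, which reads the bearer token from a file instead of the scrape config. A minimal sketch, assuming a hypothetical file path that your secret store populates on the Prometheus host:

```yaml
scrape_configs:
  - job_name: dhcp-dpi
    metrics_path: /api/metrics/prometheus
    scheme: https
    authorization:
      type: Bearer
      # Hypothetical path; the file contains only the cleartext token.
      # credentials_file replaces the inline credentials field (the two are mutually exclusive).
      credentials_file: /etc/prometheus/secrets/dhcp-dpi-scrape-token
    static_configs:
      - targets:
          - appliance.example.com
```

Restrict the file's permissions to the user that runs Prometheus so the token is not readable by other accounts on the host.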

The token is the only secret. There is no signing, no client certificate, no IP allow-list — just the bearer token. That is intentional because it lines up with how every Prometheus job is wired in practice. Revocation is fast and unambiguous; rely on it rather than on layered controls.