Prometheus Integration
The Prometheus integration lets an external monitoring stack scrape platform-health metrics from the appliance. It exposes appliance liveness and component state only — no traffic data ever leaves through this endpoint.
The integration is admin-controlled and revocable. Administrators mint long-lived bearer tokens, hand them to the operator of the monitoring stack, and revoke them whenever a token is no longer needed. Each token authenticates one job in Prometheus’s scrape configuration.
Prometheus tokens are managed from Integrations > Prometheus in the sidebar, which lands on the page documented below. The parent landing page is documented in the Integrations Panel chapter.
The Hard Scope: Platform Health Only
The scrape endpoint exposes appliance health, never DHCP traffic.
This is a deliberate scoping decision and it is enforced by the appliance, not by operator discipline. The metrics surfaced here let an external monitoring stack answer questions like:
- Is the appliance up?
- Is its connection to ClickHouse healthy?
- Is the alarm engine ticking on schedule?
- Is the notification dispatcher backing up?
- Is the vector sink’s circuit breaker open?
- How many active alarms are there, by severity?
The endpoint does not expose:
- Any client MAC address, DUID, IP address, or vendor class.
- Any per-device counter, per-rule counter, or per-event timeseries.
- Any DHCP packet count broken down by anything other than aggregate alarm state.
- Any LLM verdict, automation rule firing, or action history.
If you need rich traffic counters, use the on-appliance Statistics & Reports (chapter 19) page. If you need historical event-level data, use the Report Builder (chapter 19). Prometheus is for “is the appliance healthy?” — nothing more.
Why the hard scope? Exposing per-MAC or per-rule counters on a scrape endpoint creates a side channel that an operator with monitoring access could use to reconstruct subscriber traffic. The appliance protects subscriber privacy by keeping per-device data inside the GUI’s authenticated views. The Prometheus surface is platform-health only by design and any change to that scope is a release-note-level event.
Token Management Page
Integrations > Prometheus Scrape Tokens lists every minted token and lets administrators create or revoke them.
The page is reachable from the sidebar via Integrations, then Manage tokens on the Prometheus card. See the Integrations Panel chapter for the landing page that hosts that card. The route shows a header, an error banner when something fails, the Create Token button on the right, and a table of existing tokens below.
Token List
The table shows every minted token, including revoked ones, so administrators have a full audit trail of who has held a token for what window. Tokens are never deleted: revocation flips the status but leaves the row in place.
| Column | Description |
|---|---|
| Name | Operator-facing label set at creation time. Free text, up to 200 characters. Shown here and in the audit log. |
| Created | When the token was minted. |
| Last Used | When the token last authenticated a scrape request, or an em-dash if it has never been used. |
| Last IP | The source IP of the most recent successful scrape. Useful for confirming the right Prometheus job is using the token. |
| Expires | The expiry timestamp, or Never if no expiry was set. |
| Status | Active, Expired, or Revoked. Coloured green / amber / red respectively. |
| Actions | A trash-can icon to revoke. The icon is hidden for already-revoked tokens. |
The cleartext token value is not on this page and cannot be retrieved from this page. The appliance stores only a hash; the value itself is shown exactly once at creation time.
Creating a Token
Click Create Token in the top-right to open the creation modal.
The modal collects two fields:
| Field | Required | Description |
|---|---|---|
| Name | Yes | A human-readable label. Use something the operator of the monitoring stack will recognise — e.g. “Production Prometheus” or “Grafana Cloud Staging”. This is the label that appears in the table above, in the audit log, and on bell notifications when the token is created or revoked. |
| Expires | No | An optional expiry date and time, picked from a date-time control. Leave empty for a non-expiring token. After this moment passes, the token is rejected immediately on the next scrape attempt and its status flips to Expired in the table. |
Click Create Token to submit. On success, the appliance closes the creation modal and opens a second modal — described below — that shows the cleartext token value exactly once.
The cancel button discards the entries and closes the modal without minting a token. There is no “create as draft” state — until you confirm, nothing is created.
The Once-Only Reveal Modal
The cleartext value is shown for the only time it will ever be visible. Save it now.
After a successful creation, a second modal appears with three elements:
- A loud amber warning: “This is the only time the token value will be shown. Copy it now — it cannot be retrieved later. The database stores only a hash of the token, not the value itself.”
- A read-only text box holding the cleartext token, with a Copy button to the right. The button changes to Copied! for two seconds on success.
- A summary of the Name and Expires values you set, so you can confirm them before closing.
The single button at the bottom — “I have saved my token” — closes the modal. Once closed, the cleartext value is unrecoverable. If the token is lost between creation and storage, the only remedy is to revoke this token and mint a new one.
What happens if I close the modal too early? Revoke the token immediately, then mint a fresh one. There is no risk in revoking — the token has no real consumer yet — and there is no way to recover the cleartext value once the modal is closed.
Revoking a Token
Click the trash-can icon on a token’s row to revoke it.
A confirmation dialog appears: “Are you sure you want to revoke X? The token will stop authenticating against the Prometheus endpoint immediately. This cannot be undone.”
On confirm:
- The token’s status flips to Revoked in the table.
- The token is rejected on its next scrape attempt — usually within seconds. Prometheus will record a scrape failure on its next cycle.
- The revocation event is written to the audit log.
Revocation is permanent. There is no “un-revoke” — to restore service to a Prometheus job whose token was revoked in error, mint a new token and update Prometheus’s scrape configuration.
Operational habit: revoke a token the moment its job is decommissioned, and rotate tokens on a schedule. The page’s `Last Used` column makes stale tokens easy to spot.
The Scrape Endpoint
One bearer-token-authenticated route returns the Prometheus exposition.
| Property | Value |
|---|---|
| URL | https://<appliance-host>/api/metrics/prometheus |
| Method | GET |
| Auth | Authorization: Bearer <scrape-token> |
| Content-Type | text/plain; version=0.0.4 (standard Prometheus exposition format) |
| Mounted when | prometheus.enabled: true in config.yaml. If the flag is unset or false, the route is not mounted at all and the URL returns 404. |
Token Type Enforcement
Scrape tokens are a distinct token type from the JWTs used for the GUI and the rest of the API. The appliance enforces this both ways:
- The scrape route accepts only scrape tokens. A valid JWT sent to `/api/metrics/prometheus` is rejected with 401.
- Every other route accepts only JWTs. A valid scrape token sent to any other route is rejected with 401.
This separation means a stolen JWT cannot scrape and a stolen scrape token cannot do anything other than read platform-health metrics. The separation is structural: operators do not have to remember to use the right token, because the wrong token always fails.
Example Prometheus Scrape Job
The exact configuration syntax for Prometheus is documented at prometheus.io. A typical job that scrapes this appliance looks like this:

```yaml
scrape_configs:
  - job_name: dhcp-dpi
    metrics_path: /api/metrics/prometheus
    scheme: https
    scrape_interval: 30s
    scrape_timeout: 10s
    authorization:
      type: Bearer
      credentials: <paste-the-cleartext-token-here>
    static_configs:
      - targets:
          - appliance.example.com
```

If the appliance’s TLS certificate is not signed by a CA Prometheus trusts, configure `tls_config.ca_file` (or, for testing only, `tls_config.insecure_skip_verify: true`).
Scrape interval recommendation: 30 seconds is plenty. The metrics described below change slowly (alarm counts, queue depths, gauges) and there is no value in scraping more often. The appliance does not store the scrapes itself; halving the interval just doubles Prometheus’s ingest load.
What the Endpoint Exposes
Five families of metrics. The appliance-specific families are all named with the `dhcp_dpi_` prefix so they are easy to filter in your monitoring stack.
Platform Information
| Metric | Type | Description |
|---|---|---|
| `dhcp_dpi_build_info{version, commit, go_version}` | Gauge | Always 1. Labels carry the running release. Useful as a join key on every other metric and as a simple presence check. |
| `dhcp_dpi_process_start_time_seconds` | Gauge | Unix epoch seconds at which the appliance started. Subtract from `time()` for an uptime metric. |
The default Go runtime collector and process collector are also exposed under their standard names (go_*, process_*) so you can chart CPU, memory, and goroutine counts alongside the appliance-specific gauges.
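As a rough illustration of the two uses mentioned in the table, the rule file below records an uptime series and alerts when the build-info series disappears. It is a sketch rather than shipped configuration: the file name, group name, rule names, `for` window, and severity label are assumptions, and only the metric names come from this page.

```yaml
# Hypothetical Prometheus rule file (for example dhcp_dpi_platform.rules.yml).
# Only the dhcp_dpi_* metric names come from the appliance; every other
# name and threshold here is an illustrative assumption.
groups:
  - name: dhcp-dpi-platform
    rules:
      # Uptime in seconds, derived from the process start timestamp.
      - record: dhcp_dpi:uptime_seconds
        expr: time() - dhcp_dpi_process_start_time_seconds
      # Presence check: fires when no build_info series is being scraped.
      - alert: DhcpDpiMetricsAbsent
        expr: absent(dhcp_dpi_build_info)
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "No dhcp_dpi_build_info series found; the scrape may be failing"
```

Prometheus would load a file like this through the `rule_files` list in its main configuration.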
Alarm Engine
| Metric | Type | Description |
|---|---|---|
| `dhcp_dpi_alarms_active{severity}` | Gauge | Current count of firing or acknowledged alarms by severity. Labels: critical, warning, info. |
| `dhcp_dpi_alarms_state_transitions_total{from, to}` | Counter | Total alarm state transitions since process start. Labels match the Alarms (chapter 11) lifecycle: firing, acknowledged, resolved. |
| `dhcp_dpi_alarm_engine_tick_age_seconds` | Gauge | Seconds since the last alarm engine evaluation. Spikes here mean the engine has stalled; alert on values above a small multiple of the configured tick interval. |
A reasonable Grafana panel: stack the three `dhcp_dpi_alarms_active` series and overlay the tick-age gauge. A growing critical count and a tick age that keeps climbing each warrant a page.
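The two paging conditions could be expressed as Prometheus alerting rules along the following lines. This is a sketch under stated assumptions: the tick interval is assumed to be 60 seconds (so 180 seconds is three missed ticks), and the group name, alert names, windows, and severity label are illustrative rather than anything the appliance ships.

```yaml
# Hypothetical alerting rules for the alarm engine metrics.
# Assumes a 60 s tick interval; replace 180 with a small multiple of
# the tick interval actually configured on your appliance.
groups:
  - name: dhcp-dpi-alarm-engine
    rules:
      - alert: DhcpDpiAlarmEngineStalled
        expr: dhcp_dpi_alarm_engine_tick_age_seconds > 180
        for: 2m
        labels:
          severity: page
        annotations:
          summary: "Alarm engine has not evaluated for several tick intervals"
      - alert: DhcpDpiCriticalAlarmsGrowing
        # delta() over a gauge: positive when the critical count is
        # higher now than at the start of the 15-minute window.
        expr: delta(dhcp_dpi_alarms_active{severity="critical"}[15m]) > 0
        for: 10m
        labels:
          severity: page
        annotations:
          summary: "Count of active critical alarms keeps growing"
```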
Notification Dispatcher
| Metric | Type | Description |
|---|---|---|
| `dhcp_dpi_notifications_delivered_total{channel, result}` | Counter | Total notifications delivered. Labels: channel is vector or bell; result is ok or error. |
| `dhcp_dpi_notification_queue_depth` | Gauge | Current depth of the dispatcher’s internal queue. Sustained non-zero values indicate the appliance is producing notifications faster than the configured channel can deliver them. |
| `dhcp_dpi_vector_circuit_breaker_open` | Gauge | 1 if the vector sink’s circuit breaker is currently open, else 0. The breaker opens when the vector destination is unreachable for too long. |
The shipped guidance is to alert on either of:
- `rate(dhcp_dpi_notifications_delivered_total{result="error"}[5m]) > 0` sustained for a few minutes.
- `dhcp_dpi_vector_circuit_breaker_open == 1` for any sustained window.
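Written as Prometheus alerting rules, the two conditions above might look like the sketch below. The expressions are taken from this page; the group name, alert names, severity label, and the five-minute `for` windows are assumptions standing in for "a few minutes" and "any sustained window".

```yaml
# Hypothetical alerting rules for notification health.
# The `for` durations are illustrative; tune them to your paging policy.
groups:
  - name: dhcp-dpi-notifications
    rules:
      - alert: DhcpDpiNotificationErrors
        expr: rate(dhcp_dpi_notifications_delivered_total{result="error"}[5m]) > 0
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Notification deliveries are failing on {{ $labels.channel }}"
      - alert: DhcpDpiVectorBreakerOpen
        expr: dhcp_dpi_vector_circuit_breaker_open == 1
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Vector sink circuit breaker has stayed open"
```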
ClickHouse
| Metric | Type | Description |
|---|---|---|
| `dhcp_dpi_clickhouse_up{instance}` | Gauge | 1 if ClickHouse responded within the staleness window (120 seconds by default), else 0. instance is the configured ClickHouse host. |
| `dhcp_dpi_clickhouse_query_errors_total{operation}` | Counter | Total errors returned from ClickHouse calls, labelled by the operation that failed (e.g. insert_dhcp_events, insert_counter_stats, insert_set_stats). |
| `dhcp_dpi_clickhouse_batch_insert_seconds` | Histogram | Time spent inserting a batch into ClickHouse. Twelve buckets from 1ms to 10s. Use `histogram_quantile()` for percentile dashboards. |
Suggested alerting: page on dhcp_dpi_clickhouse_up == 0 for more than two minutes, and warn on a non-trivial p99 of dhcp_dpi_clickhouse_batch_insert_seconds.
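One possible encoding of that suggestion as alerting rules is sketched below. The two-minute window comes from the text above; the one-second p99 threshold is purely an assumption, since this page does not define "non-trivial", and the group name, alert names, and severity labels are likewise illustrative.

```yaml
# Hypothetical alerting rules for ClickHouse health.
# The 1 s p99 threshold is an assumed value, not a documented limit.
groups:
  - name: dhcp-dpi-clickhouse
    rules:
      - alert: DhcpDpiClickHouseDown
        expr: dhcp_dpi_clickhouse_up == 0
        for: 2m
        labels:
          severity: page
        annotations:
          summary: "ClickHouse has not responded for more than two minutes"
      - alert: DhcpDpiClickHouseSlowInserts
        # p99 batch insert latency over the last five minutes.
        expr: >
          histogram_quantile(0.99,
            sum by (le) (rate(dhcp_dpi_clickhouse_batch_insert_seconds_bucket[5m]))
          ) > 1
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "ClickHouse batch insert p99 is above the assumed 1 s threshold"
```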
Go Runtime and Process
The shipped collectors expose the standard Go runtime metrics (CPU seconds, memory residency, GC pause percentiles, goroutine count) and the process collector exposes file descriptors, virtual memory, and the OS-level start time. These are present so a single Grafana dashboard can chart appliance health without joining against another exporter.
Grafana Dashboard Hints
A minimal four-panel dashboard covers the core of what this endpoint exposes.
| Panel | Query (suggested) | What it tells you |
|---|---|---|
| Appliance uptime | time() - dhcp_dpi_process_start_time_seconds | Time since the appliance booted. |
| Active alarms by severity | dhcp_dpi_alarms_active stacked by severity | Current alarm load. |
| ClickHouse up | dhcp_dpi_clickhouse_up | Database connectivity. |
| Batch insert p99 | histogram_quantile(0.99, rate(dhcp_dpi_clickhouse_batch_insert_seconds_bucket[5m])) | Insert latency tail. |
Add a row for notification health:
| Panel | Query (suggested) | What it tells you |
|---|---|---|
| Notification delivery rate | sum by (channel, result) (rate(dhcp_dpi_notifications_delivered_total[5m])) | Healthy delivery vs error rate per channel. |
| Notification queue depth | dhcp_dpi_notification_queue_depth | Dispatcher backpressure. |
| Vector breaker | dhcp_dpi_vector_circuit_breaker_open | 0 or 1 indicator. |
Use dhcp_dpi_build_info as a table panel showing the running version, commit, and Go runtime — handy during a release.
Dashboard import: the appliance does not ship a packaged Grafana dashboard JSON. The four-panel layout above takes a few minutes to assemble in the Grafana UI and tracks what the endpoint actually exposes; a packaged dashboard that drifted from the endpoint would be worse than no dashboard at all.
Enabling and Disabling the Integration
Enable the surface in config.yaml; the page in the GUI is always visible to administrators.
The Prometheus endpoint is gated by a single core-config flag in config.yaml. When the flag is false (the shipped default) the route is not mounted and no metrics are collected — the integration is truly zero-cost when off.
| State | Behaviour |
|---|---|
| `prometheus.enabled: false` | `/api/metrics/prometheus` returns 404. The Token Management page still loads (so administrators can review existing tokens), but the Integrations landing page reports the integration as Disabled and links to this section. |
| `prometheus.enabled: true` | The route is mounted, scrape tokens authenticate against it, and all five metric families are populated and exposed. |
After flipping the flag, restart the appliance. The change does not apply at runtime — Prometheus is a core, not operational, setting because it affects route registration. See Key Concepts > Core vs Operational Config (chapter 03) for the distinction.
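For orientation, the flag might appear in config.yaml roughly as follows. The nesting is an assumption inferred from the dotted name `prometheus.enabled`; this page does not show the file layout, so check the fragment against the config.yaml shipped with your release.

```yaml
# config.yaml (fragment). The key nesting is inferred from the dotted
# name prometheus.enabled and may differ in the shipped file.
prometheus:
  enabled: true   # core setting: restart the appliance after changing it
```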
Audit and Notifications
Every token lifecycle event is recorded.
Each of the following triggers an entry in the audit log visible in Sessions & Audit (chapter 30):
- Token creation, with the administrator’s username and the chosen Name and Expires values.
- Token revocation, with the administrator’s username.
- Scrape authentication failures (wrong token type, expired token, revoked token, malformed Authorization header).
Notification rules in Alarms (chapter 11) can be configured to surface these audit events on the bell or on an external sink — for example, a critical-severity rule on creation events for the security team.
The Last Used and Last IP columns on the Token Management page provide quick visibility without leaving the page; the audit log is the authoritative record for compliance review.
Security Notes
A short list of habits that keep the integration boring.
- Mint one token per consumer. Sharing a token between two Prometheus instances makes the `Last Used` and `Last IP` columns less useful and means a single revocation cuts off both consumers.
- Set an expiry. Even a one-year expiry forces an annual rotation. Long-lived bearer tokens accumulate risk; expiry caps that risk.
- Rotate proactively. When the administrator who minted a token leaves the team, revoke and re-mint. The current `Last Used` value lets you find tokens that are still in use.
- Confine TLS to a known CA. Configure Prometheus’s `tls_config.ca_file` against your internal CA so a misconfigured DNS entry cannot redirect scrapes to an attacker’s endpoint.
- Treat tokens like passwords. Store them in your monitoring stack’s secret store, not in plaintext config files committed to version control.
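The TLS and secret-handling habits in this list could be combined in a scrape job like the sketch below. It uses Prometheus’s `credentials_file` and `tls_config.ca_file` settings so the token never appears in the main configuration; both file paths are hypothetical and depend on how your secret store mounts files.

```yaml
# Sketch of a scrape job that reads the token from a file and pins the CA.
# Both file paths are hypothetical placeholders.
scrape_configs:
  - job_name: dhcp-dpi
    metrics_path: /api/metrics/prometheus
    scheme: https
    scrape_interval: 30s
    authorization:
      type: Bearer
      # File containing only the cleartext token, readable by Prometheus.
      credentials_file: /etc/prometheus/secrets/dhcp-dpi-token
    tls_config:
      # Internal CA bundle that signed the appliance certificate.
      ca_file: /etc/prometheus/ca/internal-ca.pem
    static_configs:
      - targets:
          - appliance.example.com
```

Keeping the token in its own file also means a rotation only touches that file, not the version-controlled configuration.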
The token is the only secret. There is no signing, no client certificate, no IP allow-list — just the bearer token. That is intentional because it lines up with how every Prometheus job is wired in practice. Revocation is fast and unambiguous; rely on it rather than on layered controls.
See Also
- Integrations Panel: landing page that hosts the Prometheus card and any future integrations
- Statistics & Reports (chapter 19): on-appliance counters and the Report Builder for traffic-level data that does not belong on this endpoint
- Alarms (chapter 11): alarm lifecycle and the source keys reflected in `dhcp_dpi_alarms_active`
- Sessions & Audit (chapter 30): where token lifecycle events are recorded
- Authentication (chapter 29): how JWTs and scrape tokens differ
- Troubleshooting > Prometheus Scrape Endpoint (chapter 41): symptoms specific to this integration