Troubleshooting

A symptom-driven reference for the most common things that go wrong in operation. Every entry answers “I see X, what do I do?” rather than describing how the system works in the abstract.

Use this chapter when something is misbehaving and you want a fast pointer to the right control or chapter. Each section is organised as a table: symptom on the left, likely cause and remedy on the right. Cross-links lead to the chapter where the underlying feature is documented in full.

If your symptom is not listed here, the Glossary and the Architecture chapter (chapter 01) together cover almost every concept the GUI exposes.

Before you make any destructive change: if your appliance is in a recoverable bad state, check Firewall Guidance (chapter 22) first. It documents the supported escape hatches for clearing rules and restoring service without rebooting.

Ingest Path — Packets Not Being Processed

Symptoms that the processor is running but DHCP traffic is not flowing through it.

Symptom	Likely cause	Remedy
No new rows in the DHCP Stream (chapter 09) at all	nftables ruleset not loaded, or the queue chain is not jumping into the processor	Re-deploy the ruleset from the shipped `nft-v2.sh` and confirm the appliance is on the inline path described in Deployment Modes (chapter 04). See NFTables Deployment (chapter 06) for the deploy checklist.
Dashboard event-rate panel reads zero but the DHCP server is clearly answering on the wire	Packets are bypassing the processor because the upstream chain is missing the queue jump	Reload the ruleset; verify the queue number in `config.yaml` matches the one used in the active ruleset. The two must agree.
DHCP Stream shows DISCOVER and REQUEST but never OFFER / ACK / REPLY	Expected behaviour — the inspector hooks prerouting, so it sees only client-to-server messages. Server responses are recorded by other chains for Firewall Decisions (chapter 10), not the Stream	No action. The Stream is a one-direction view by design.
Devices appear “blocked” but new events still arrive from them	Block applies at the verdict stage; events are recorded before that. Deny additionally suppresses event emission	If you want to silence events as well as drop packets, use Deny instead of Block. See Key Concepts (chapter 03).
A specific MAC is missing from the Stream	The device’s mark may be in `llm_denied_marks` (Deny suppresses event recording) or the device may be using DHCPv6 with no extractable MAC	Check Device Details (chapter 14) for the device. DUID-only DHCPv6 clients are tracked by DUID, not MAC — see the DHCPv6 note in chapter 14.
Stream traffic appears, then suddenly stops	The processor service crashed and the queue is now in fall-through mode (packets passing without inspection)	Check the dhcp-processor service status from a host terminal session on the appliance, then restart the service.
Stream is full of duplicate events	An L2 broadcast environment is delivering the same DHCP packet to the inspector twice, or the relay path is also unicasting	Confirm the deployment matches the topology described in chapter 04. The shipped reference deployments hook a single point of capture.
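
The queue-number check from the table above can be sketched in a few lines. This is a minimal illustration, not the appliance's own tooling: it assumes you have saved the output of `nft list ruleset` as text and that you read the configured queue number from `config.yaml` yourself (the key name is not given here). The regular expression covers the two spellings of the queue statement that nft is known to print, `queue num N` and `queue ... to N`.

```python
import re

# Matches both spellings of the nftables queue statement:
#   queue num 3 bypass         (older nft output)
#   queue flags bypass to 3    (newer nft output)
QUEUE_RE = re.compile(r"\bqueue\s+(?:flags\s+[\w,]+\s+)?(?:num|to)\s+(\d+)")


def queue_numbers(ruleset_text: str) -> set[int]:
    """Return every queue number referenced in the ruleset text."""
    return {int(m.group(1)) for m in QUEUE_RE.finditer(ruleset_text)}


def queue_matches(ruleset_text: str, configured_queue: int) -> bool:
    """True when the ruleset queues packets to the configured number."""
    return configured_queue in queue_numbers(ruleset_text)
```

If `queue_numbers` returns an empty set, the ruleset has no queue jump at all; if it returns a number other than the configured one, the two settings disagree.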

nftables Sets — Enforcement Not Behaving as Expected

Symptoms that the kernel-level sets, marks, or per-message-type chains are not doing what you asked.

Symptom	Likely cause	Remedy
Operator blocked a MAC; device is still talking	Mark collision (two MACs share their last 3 bytes) or the action has not yet reached the kernel	Check Device Details > Enforcement Status (chapter 14). If “Mark Confidence” reads `low (collision)`, the mark is shared with another device. Run Cleanup and re-issue the action; the Firewall Manager (chapter 20) explains the collision risk in full.
The same device cycles in and out of `blocked_macs` every couple of minutes	Per-client rate limit keeps tripping. The set timeout is 2 minutes on the shipped profile	Either tune the rate-limit threshold in the Firewall Manager (chapter 20) or issue a long-lived enforcement action (Block / Throttle / Deny) so the device is not re-evaluated against the per-client cap.
A device that should be trusted keeps getting throttled	Behavioural set ladder evaluates `llm_allowed_marks` first, but the device is not in that set	Open Device Details, click Allow, and confirm. Allow takes precedence over every other set in the ladder. See Key Concepts (chapter 03) for the ladder order.
Sets fill up to the 1M cap and entries get evicted	Network is far larger than the shipped defaults assume, or the timeouts are too long for the rate of new entries	Raise the per-set size cap in the Firewall Manager (chapter 20), or shorten timeouts so old entries drop out faster. The Manager exposes both.
A rule fired but no mark made it to the set	Action Manager could not reach nftables, or the action was rejected because of a validation error	Check the Actions (chapter 15) history page for the failed action; its row will show the failure reason.
Cannot remove a device from `llm_allowed_marks` by clicking Cleanup	Cleanup removes the device from every behavioural set; the entry should clear within a second	Refresh Device Details. If the entry persists, the appliance may have lost its nftables connection — restart the dhcp-processor service.
Enforcement looks correct in the GUI but the device is still reaching the DHCP server	The appliance is in mirror mode, not inline. Mirror mode is observation-only — it does not enforce	Switch to inline. See Deployment Modes (chapter 04).

When in doubt about the live ruleset: the Flow Visualizer (chapter 21) renders the chains and sets exactly as they are running. Use it to confirm that the ladder the Manager shows matches the kernel.
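
To see why two devices can share a mark, the sketch below derives an illustrative mark from the last three bytes of each MAC and groups the addresses that collide. The exact encoding the appliance uses is not documented in this chapter, so `mark_from_mac` is a hypothetical stand-in; only the "last 3 bytes" property comes from the table above.

```python
from collections import defaultdict


def mark_from_mac(mac: str) -> int:
    """Illustrative mark: the last three bytes of the MAC as an integer."""
    octets = mac.replace("-", ":").split(":")
    if len(octets) != 6:
        raise ValueError(f"not a 6-byte MAC: {mac!r}")
    b = [int(o, 16) for o in octets]
    return (b[3] << 16) | (b[4] << 8) | b[5]


def find_collisions(macs: list[str]) -> dict[int, list[str]]:
    """Group MACs by mark and keep only marks shared by more than one MAC."""
    by_mark: dict[int, list[str]] = defaultdict(list)
    for mac in macs:
        by_mark[mark_from_mac(mac)].append(mac.lower())
    return {mark: group for mark, group in by_mark.items() if len(group) > 1}
```

Two addresses from different vendors, such as `aa:bb:cc:01:02:03` and `11:22:33:01:02:03`, land on the same mark under this scheme, which is the situation Enforcement Status reports as `low (collision)`.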

Automation Rules — Runaway Detection Sets

Symptoms that a single automation rule is so broad it has filled a kernel set with hundreds of thousands of entries, and the system is now choking on the size.

Symptom	Likely cause	Remedy
Dashboard widgets that show nftables set sizes — LLM Action Sets, Rate-limited MACs — are frozen at a timestamp many minutes in the past, while other widgets keep refreshing	Periodic counter collection is timing out on a very large set dump	See remedies below. The freeze is a downstream effect, not the root cause.
`nft list ruleset` from a host terminal session hangs for tens of seconds, or appears to stall completely	Same — `nft list` walks every element of every set, and one set is now in the high six figures	Check set sizes (below). The kernel is fine; the dump is what is slow.
The Automation (chapter 16) Recent Executions table shows the last run for a rule many minutes ago, even though the rule’s interval is much shorter	The rule is mid-flight on a single execution that detected hundreds of thousands of devices; the execution row is only written when the run finishes	If the rule is misconfigured, disable it from the Automation page. The in-flight run is killed at the next service restart.
An automation rule’s Test Results preview reports a detection count one or two orders of magnitude higher than you expected	The rule’s thresholds are too loose for the network’s MAC diversity. A common pattern: threshold logic set to OR with a low `min_unique_ips`, on a Deny-action rule	Switch the rule to AND logic, or raise `min_unique_ips` and `min_request_count` until the preview matches what you actually want to act on. See Automation (chapter 16) for the threshold semantics.
New denies or blocks keep landing every cycle for devices that already have active enforcement	Expected. The rule re-detects them on every tick; the action manager deduplicates, but the detection cost is still paid	Tighten the rule as above, or shorten the rule’s lookback window so older detections roll out of view sooner.
You suspect a runaway rule but cannot tell which one	The Automation (chapter 16) page lists every enabled rule with its last preview count	Open each rule’s Test Results in turn. Any single rule reporting a six-figure preview is your candidate.

Confirming the diagnosis. The `counters.interval_secs` value in `config.yaml` (default 30) is also the upper bound on how long the collector waits before declaring its full set dump a failure. When a set crosses roughly half a million elements, that 30-second budget is no longer sufficient and every poll silently times out, freezing every widget that reads from it.
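
One way to check set sizes from a host terminal session is to count elements in nft's JSON output. The sketch below is an assumption-laden helper, not a shipped tool: it expects the text produced by `nft -j list ruleset` (or by `nft -j list set` for a single set) and relies on the libnftables JSON layout, in which each set object carries its members under `elem`. The 500,000 default mirrors the rough threshold named above.

```python
import json


def set_sizes(ruleset_json: str) -> dict[str, int]:
    """Map "family table set" to its element count from `nft -j` output."""
    sizes: dict[str, int] = {}
    for item in json.loads(ruleset_json).get("nftables", []):
        s = item.get("set")
        if s is None:
            continue  # tables, chains, rules and metainfo are skipped
        key = f'{s["family"]} {s["table"]} {s["name"]}'
        # An empty set has no "elem" key at all.
        sizes[key] = len(s.get("elem", []))
    return sizes


def oversized(sizes: dict[str, int], limit: int = 500_000) -> list[str]:
    """Names of sets at or above the limit, largest first."""
    big = [name for name, n in sizes.items() if n >= limit]
    return sorted(big, key=lambda name: sizes[name], reverse=True)
```

Bear in mind that producing the dump is itself the slow step on a bloated set, so allow it to run well past the 30-second collector budget.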

Remedies, in order of escalation:

1. Tighten the rule. Disable the offending rule on the Automation page, or edit it to switch OR → AND or raise its thresholds, then re-enable. The enforcement entries the rule already created stay in the kernel until they reach their action timeout (typically 24 hours for a Deny). For most situations this is the right answer: set growth halts immediately and observability returns within one collector cycle.
2. Adjust the collector for scale, if you genuinely need a rule that detects this many devices. Raise `counters.interval_secs` in `config.yaml` so the collector has enough headroom; for sets in the high six figures, five minutes (300) is comfortable. The trade-off is that the set-size widgets refresh once every five minutes instead of every thirty seconds; for slowly-changing kernel-state graphs this is an acceptable cadence. Restart the dhcp-processor service for the change to take effect.
3. Last resort: flush the DHCP DPI table and reapply. When (1) and (2) are not enough (for example, a set is so bloated with orphan entries from an earlier misconfiguration that even a five-minute interval still times out), the Firewall Manager (chapter 20) Apply: flush table control (yellow play arrow) clears every set in the DHCP DPI table in one transaction. This wipes every active enforcement immediately. Follow it straight away with Database Actions Sync → Reapply Actions in the same chapter to rebuild the kernel sets from the `mac_actions` table; that restores every Block, Deny, Throttle, Allow and Monitor the database still has on record. Anything that was in nftables but not in the database (typically orphans from a runaway automation cascade) does not come back, which is usually the point. For the few seconds between flush and reapply, no enforcement is active, so schedule this during a quiet window if you can. Avoid the Apply: flush ruleset mode (red play arrow) unless you also want to wipe every other table the host is running, including ones managed by other software.

Preventing recurrence. Use the Preview button on every new or edited automation rule before enabling it. Any preview reporting tens of thousands or more devices is the moment to stop and reconsider the thresholds — not after the kernel set has filled.
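
The effect of OR versus AND threshold logic on a preview count can be illustrated with a small model. Everything here is a sketch under assumptions: the `DeviceStats` and `Rule` classes are hypothetical, and treating each threshold as a "greater than or equal" comparison is a guess at the semantics documented in Automation (chapter 16). Only the names `min_unique_ips` and `min_request_count` come from this chapter.

```python
from dataclasses import dataclass


@dataclass
class DeviceStats:
    mac: str
    unique_ips: int
    request_count: int


@dataclass
class Rule:
    min_unique_ips: int
    min_request_count: int
    logic: str  # "AND" or "OR"


def detects(rule: Rule, d: DeviceStats) -> bool:
    """Apply both thresholds and combine them with the rule's logic."""
    ip_hit = d.unique_ips >= rule.min_unique_ips
    req_hit = d.request_count >= rule.min_request_count
    if rule.logic == "AND":
        return ip_hit and req_hit
    if rule.logic == "OR":
        return ip_hit or req_hit
    raise ValueError(f"unknown logic: {rule.logic!r}")


def preview_count(rule: Rule, devices: list[DeviceStats]) -> int:
    """Number of devices the rule would act on."""
    return sum(detects(rule, d) for d in devices)


SAMPLE = [
    DeviceStats("aa:aa:aa:00:00:01", unique_ips=1, request_count=3),
    DeviceStats("aa:aa:aa:00:00:02", unique_ips=2, request_count=4),
    DeviceStats("aa:aa:aa:00:00:03", unique_ips=1, request_count=500),
    DeviceStats("aa:aa:aa:00:00:04", unique_ips=40, request_count=900),
]
```

With `min_unique_ips=2` and `min_request_count=200`, the OR rule matches three of the four sample devices while the AND rule matches one. On a network with many ordinary clients, the same gap is what turns a targeted Deny rule into a six-figure preview.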

LLM — Analysis Not Running or Returning Garbage

Symptoms involving the LLM backend, anomaly detection cycle, or Device Analysis page.

Symptom	Likely cause	Remedy
Run Analysis button is greyed out on a device	Analysis cooldown has not yet expired for this device	The remaining cooldown is shown next to the button in Device Details > LLM Analysis (chapter 14). Wait it out or shorten the cooldown in operational config.
Run Analysis button is available but every run fails	LLM backend is unreachable or returning malformed JSON	Verify the LLM endpoint URL in `config.yaml`, then check LLM Setup (chapter 23) for the connection checklist. The endpoint must respond on the configured host:port.
Analyses come back with empty indicators / evidence	The model is too small or too quantised to follow the structured-output instructions	Try a larger model. The LLM Setup (chapter 23) chapter lists supported backends and recommended model sizes.
Anomaly detection cycle is silent — no scheduled analyses ever run	The cycle is disabled in operational config	Open Settings and verify both “LLM enabled” (core, in `config.yaml`) and “LLM active” (operational, in the GUI). Both must be true. See Key Concepts > Enable vs Active (chapter 03).
Risk scores are wildly inconsistent across re-runs of the same device	Model temperature is non-zero, or the analysis configuration was changed mid-run	Reduce temperature to 0 in operational config, and review the analysis configuration.
Auto-actions never execute even though analyses produce high risk scores	Auto-execute thresholds are not configured, or the action category is disabled	Configure the thresholds in Automated Actions (chapter 25).
LLM response includes obvious hallucinated MAC addresses or rule names	The analysis context is leaking from one analysis into another, or it is too large	Review and tighten the analysis configuration.
Device Analysis page is permanently empty after running an analysis	The analysis ran but the result row was rejected on insert	Check the Statistics (chapter 19) page for LLM error counters, then try again. Persistent failures usually indicate a model output that does not match the expected schema.
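
When testing a candidate model outside the appliance, a quick structural check on its raw output helps separate "malformed JSON" from "valid JSON with empty indicators". The sketch below is illustrative only: the field names `risk_score`, `indicators` and `evidence` are hypothetical stand-ins for the expected schema, which this chapter does not spell out.

```python
import json

# Hypothetical field names; the real schema is defined by the appliance.
REQUIRED_FIELDS = ("risk_score", "indicators", "evidence")


def check_analysis(raw: str) -> list[str]:
    """Return a list of problems; an empty list means the output looks usable."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        return [f"malformed JSON: {exc.msg}"]
    if not isinstance(data, dict):
        return ["top-level value is not a JSON object"]
    problems = [f"missing field: {f}" for f in REQUIRED_FIELDS if f not in data]
    for f in ("indicators", "evidence"):
        if f in data and not data[f]:
            problems.append(f"empty field: {f}")
    return problems
```

A model that consistently produces "empty field" problems is a candidate for the "too small or too quantised" row above, whereas "malformed JSON" points at the backend or the structured-output setup.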

API — Endpoints Returning the Wrong Thing

Symptoms for operators querying the REST API directly or building scripts on top of it.

Symptom	Likely cause	Remedy
401 Unauthorized on every endpoint	Token expired, or the session was revoked	Re-authenticate. See Authentication (chapter 29). Long-running scripts should refresh the token before expiry.
401 on `/api/metrics/prometheus` specifically	Wrong token type — JWTs are rejected on that route by design; only scrape tokens authenticate it	Mint a scrape token in Prometheus Integration (chapter 40).
403 Forbidden on an admin endpoint	The user is not in the admin role	Have an administrator assign the role in User Management (chapter 27).
Response payload is missing fields the documentation shows	The appliance is on an older release than the documentation	Check the About modal for the running version. Upgrade if necessary; the docs always describe the latest release.
Slow responses on history endpoints	The time range queried is very wide, hitting raw event tables instead of aggregated views	Narrow the time range or use the dashboard summary endpoints. The Statistics (chapter 19) chapter covers the aggregated views.
Endpoint returns 5xx after every reload of the appliance	Backend service has not yet finished startup	Wait one minute and retry. Startup includes database migrations, prompt-template load, and automation-rule load.
404 on a documented endpoint	Feature flag in `config.yaml` is `false`, so the route is not mounted	Set the flag to `true` and restart the appliance. Routes are mounted only when their parent feature is enabled.
WebSocket disconnects every few seconds	WebSocket rate limit is set too low, or a proxy in front of the appliance is closing idle connections	Raise the operational rate limit; configure your reverse proxy to allow long-lived upgrades.
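
For long-running scripts, refreshing the token before expiry can be driven by the standard JWT `exp` claim. The following is a minimal sketch that assumes the appliance's session token is a conventional three-part JWT carrying `exp` in Unix seconds; it reads the claim without verifying the signature, which is fine for scheduling a refresh but not for trusting the token. How the refresh itself is requested is covered in Authentication (chapter 29) and is not shown.

```python
import base64
import json
import time
from typing import Optional


def jwt_expiry(token: str) -> int:
    """Return the `exp` claim (Unix seconds) without verifying the signature."""
    parts = token.split(".")
    if len(parts) != 3:
        raise ValueError("not a three-part JWT")
    payload = parts[1]
    payload += "=" * (-len(payload) % 4)  # restore stripped base64 padding
    claims = json.loads(base64.urlsafe_b64decode(payload))
    return int(claims["exp"])


def should_refresh(token: str, margin_secs: int = 60,
                   now: Optional[float] = None) -> bool:
    """True when the token expires within `margin_secs` of `now`."""
    now = time.time() if now is None else now
    return jwt_expiry(token) - now <= margin_secs
```

Calling `should_refresh(token)` before each batch of requests lets a script re-authenticate proactively instead of reacting to a 401.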

GUI — Page Looks Wrong or Will Not Load

Symptoms in the browser. The GUI is a single-page app served from the appliance.

Symptom	Likely cause	Remedy
White screen after login	Browser is caching a previous build of the GUI	Hard-refresh the browser tab (Shift+Reload). If the symptom persists, clear site data for the appliance hostname.
Sidebar items missing for a user	The user is not in a role that grants those pages	Adjust the role in User Management (chapter 27).
About modal shows no version string	Build manifest was not embedded — usually a development build	Rebuild the GUI on the production build host. See the appliance release notes.
A page loads but every panel reads “Loading…” forever	The API is unreachable from the browser; the GUI cannot fetch its data	Check that the API listener is up. A host terminal session on the appliance is a convenient way to verify locally.
Dashboard charts are blank during business hours	No data in the queried time window, or aggregated views have not caught up after a restart	Wait one aggregation cycle. The Dashboard (chapter 07) chapter explains the data sources for each panel.
Time pickers offer the wrong default range	The user’s profile is set to a different default	Adjust in the user’s profile page.
Buttons appear but do nothing on click	Browser console will show an error; the most common cause is a stale browser tab held open across an upgrade	Reload the tab.
Numbers in two panels disagree by a small amount	Panels query different aggregation levels (1-minute, 5-minute, hourly)	Expected. The dashboard mixes time-granularity to balance accuracy and load. See chapter 07.

Packet Capture — Captures Not Starting or Files Missing

Symptoms in the Packet Capture (chapter 32) tool.

Symptom	Likely cause	Remedy
Start Capture button is disabled	The user is not in a role that allows captures	Admin role is required. See User Management (chapter 27).
Capture starts but no packets are recorded	Filter expression is too narrow, or the chosen interface sees no traffic	Loosen the filter. The Packet Capture chapter (chapter 32) shows valid filter syntax.
Capture file does not appear in the file list	The shipped capture binary lost permission to write its output directory	The directory path is set in `config.yaml`. The appliance service account must be able to write to the directory itself and to traverse (execute permission) every parent directory.
Download returns a 404	The file was already auto-rotated and removed	Adjust the rotation policy in operational config so files live long enough for the operator workflow.
Live capture stream is choppy	Network is busier than the live decode pipeline can keep up with	Filter more aggressively, or capture to file and download for offline analysis.
Capture binary is reported as missing	The appliance is running without the optional capture binary in its `PATH`	Install the capture binary on the appliance (the release notes name the package), then restart the dhcp-processor service.
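
The permission requirements for the capture output directory can be checked with a short script. This is a sketch rather than a shipped diagnostic: it checks that every parent directory can be traversed and that the directory itself accepts writes, and it tests access for whichever user runs it, so it only gives a meaningful answer when run as the appliance service account.

```python
import os
from pathlib import Path


def capture_dir_problems(directory: str) -> list[str]:
    """List permission problems for the current user on a capture directory."""
    target = Path(directory).resolve()
    problems = []
    # Every ancestor needs search (execute) permission to be traversed.
    for parent in list(target.parents)[::-1]:
        if not os.access(parent, os.X_OK):
            problems.append(f"cannot traverse {parent}")
    if not target.is_dir():
        problems.append(f"{target} is not a directory")
    elif not os.access(target, os.W_OK | os.X_OK):
        problems.append(f"cannot write into {target}")
    return problems
```

An empty list means the path is usable by the current user; otherwise each entry names the directory to fix, starting from the filesystem root.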

License — Activation, Renewal, and Binding

Symptoms involving the license file, activation flow, or feature gating.

Symptom	Likely cause	Remedy
Some sidebar items have a padlock icon and are not clickable	Feature is gated by the current license tier	Review your license in License Management (chapter 37). Upgrading the tier unlocks the gated pages.
Banner reads “License expired” and enforcement actions are read-only	The license file’s expiry date is in the past	Renew the license, then upload the new file via License Management (chapter 37).
Banner reads “License binding mismatch”	The appliance hardware fingerprint differs from the one the license was issued for	This usually means the appliance was migrated to new hardware. See Installation Binding (chapter 38) for the re-binding procedure.
License page shows “Unknown tier”	License file is malformed or signed by an unrecognised key	Re-download the license from the licensor portal and re-upload.
IPv6 controls are missing even though the appliance is processing DHCPv6	IPv6 is gated in the GUI but not in the backend; the license tier does not include IPv6 GUI	Upgrade the tier, or use the API directly for v6 work. The backend never blocks v6 packets — only the GUI hides v6 controls.
Save is greyed out on License page	The user is not an administrator	License changes are admin-only. See User Management (chapter 27).

OAuth and Authentication

Symptoms involving login, SSO, sessions, and audit.

Symptom	Likely cause	Remedy
OAuth login redirects to a generic error page	OAuth provider rejected the redirect URI, or the appliance’s configured client secret is wrong	Verify the redirect URI matches what is registered with the provider. See Authentication > OAuth (chapter 29).
Password login works but OAuth users cannot reach admin pages	Role mapping is missing — the appliance does not know how to translate OAuth group claims to local roles	Configure the mapping in User Management (chapter 27).
User locked out after a few wrong passwords	The shipped policy locks the account temporarily	An administrator can unlock the user in User Management (chapter 27).
Session expired warning appears mid-session	Token rotation failed because the browser was offline or the appliance was restarted	Re-authenticate. The system never destroys session state silently.
Operator’s actions are not appearing in the audit log	Their user record was deleted and recreated, so the audit rows attach to the old user ID	Audit rows are immutable. The new user ID will be on subsequent actions. See Sessions & Audit (chapter 30).
OAuth login lands the user as a viewer despite group membership	Group claim is named differently in the provider’s token than the mapping expects	Use the audit log to inspect the raw claims, then update the mapping.
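
The "lands as a viewer" symptom is easiest to understand as a lookup that silently falls back to a default. The sketch below models that behaviour under stated assumptions: the claim names, group names, the `viewer` default and the first-match rule are all hypothetical, chosen only to show how a mismatch between the claim name in the token and the one the mapping expects produces the fallback.

```python
DEFAULT_ROLE = "viewer"  # assumed fallback role


def resolve_role(claims: dict, claim_name: str,
                 group_to_role: dict[str, str]) -> str:
    """Return the role for the first mapped group, or the default role."""
    groups = claims.get(claim_name, [])
    if isinstance(groups, str):
        groups = [groups]  # some providers send a single group as a string
    for group in groups:
        if group in group_to_role:
            return group_to_role[group]
    return DEFAULT_ROLE
```

If the provider sends `roles` while the mapping reads `groups`, `claims.get` finds nothing and every user resolves to the default, regardless of actual membership.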

ClickHouse Connectivity

Symptoms involving the data store.

Symptom	Likely cause	Remedy
Banner reads “ClickHouse unreachable”	The appliance cannot connect to ClickHouse on the configured host:port	Verify the host is up and the credentials in `config.yaml` are valid. See Alarms (chapter 11) for the `system:clickhouse:unreachable` source key.
Banner reads “ClickHouse storage low”	Data directory is below the warning threshold	Free disk space or expand the volume. The alarm engine raises `system:clickhouse:storage_warning` and escalates to `_critical` past a second threshold.
Queries on history pages are very slow	ClickHouse server is under load or its background merges have fallen behind	The Statistics (chapter 19) chapter documents which aggregated views to query for fast results. Avoid querying raw event tables for long time windows.
Some configuration changes silently revert	A configuration table uses the merge-on-read engine and the operator is reading without forcing a merge	This is a backend issue, not an operator-fixable symptom — file a support case. The shipped code already accounts for this; persistent reverts indicate a bug.
Operator made a change in the GUI but the appliance still uses the old value	Operational settings are read from the database on demand; some subsystems cache their settings briefly	Wait a minute, or cycle the affected feature toggle (active → inactive → active).
ClickHouse alarms keep firing and resolving in a tight loop	The connectivity check is racing a flapping network path	Stabilise the network, or widen the alarm’s evaluation interval. See Alarms (chapter 11).
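
A first reachability test for the configured ClickHouse host:port can be run from the appliance host with nothing but the standard library. This is a minimal sketch: it only proves that a TCP connection can be opened within the timeout, not that the credentials in `config.yaml` are valid or that ClickHouse is answering queries.

```python
import socket


def tcp_reachable(host: str, port: int, timeout: float = 3.0) -> bool:
    """True when a TCP connection to host:port opens within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name-resolution failures.
        return False
```

Running the check several times in a row is a cheap way to spot the flapping network path described in the last row of the table.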

Support Tunnel

Symptoms in the vendor support backchannel.

Symptom	Likely cause	Remedy
Open Support Tunnel button is greyed out	Support backchannel is disabled in `config.yaml`	Administrator must enable it and restart the appliance. See Support Backchannel (chapter 34).
Tunnel opens but the support engineer reports they cannot reach the GUI	Reverse-tunnel ports are blocked by an upstream firewall	Outbound SSH on the configured port must be allowed. Chapter 31 lists every port the tunnel uses.
Feedback submission returns “session expired”	The tunnel times out after the per-session idle limit	Re-open the tunnel and retry. Idle limits are operational settings.
Screenshot attached to a feedback report is blank	Browser blocked the screen-capture permission for this site	Re-grant the permission in the browser, then retry. The system cannot capture screens without it.
Diagnostics bundle is missing files	The bundle is allowlisted — only specific log files and configuration excerpts are collected	This is by design. See chapter 34 for the bundle’s exact contents.
Operator wants to close a tunnel they did not open	Sessions are owned by the user who opened them, but admins can revoke any session	Use Revoke Session in Sessions & Audit (chapter 30).

dpictl and the Web Console

Symptoms when using the shipped dpictl command-line tool or its web console.

Symptom	Likely cause	Remedy
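`dpictl` reports “permission denied”	The tool is not setuid, or the user is not in the appliance admin group	Run `dpictl` as a user in the appliance admin group, or restore the tool’s setuid bit. The tool reads the same `config.yaml` as the daemon, so it must also be able to read that file.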
Console hangs after entering a long command	The command produces an output stream longer than the console buffer	Use `dpictl` from a host terminal session; the console is best for short commands.
A console command modifies state but the GUI does not reflect the change	The GUI caches some lookups for a few seconds	Refresh the page.
`dpictl` cannot reach ClickHouse from the appliance host	The tool uses the credentials in `config.yaml`; ClickHouse may be on a separate machine	Verify reachability and credentials. The same connectivity issues that affect the daemon also affect dpictl.

Prometheus Scrape Endpoint

Symptoms specific to the platform-health Prometheus endpoint. The endpoint exposes appliance health only — never traffic.

Symptom	Likely cause	Remedy
`/api/metrics/prometheus` returns 404	`prometheus.enabled: true` is not set in `config.yaml`, so the route is not mounted	Set the flag and restart the appliance. The integration is zero-cost when disabled, so the route only appears when enabled. See Prometheus Integration (chapter 40).
Endpoint returns 401 even with a valid JWT	JWTs are not accepted on this endpoint by design; only scrape tokens authenticate it	Mint a scrape token. See chapter 40.
Scrape works in `curl` but Prometheus reports “context deadline exceeded”	Prometheus scrape timeout is shorter than the appliance’s response time during ClickHouse pressure	Increase the scrape timeout in Prometheus’s job configuration.
Metrics list is shorter than expected	This endpoint exposes only platform-health metrics — alarm counts, dispatcher depth, build info, ClickHouse up, vector breaker	This is by design. Per-MAC, per-rule, and per-event data is never exposed here. See chapter 40 for the complete list.
`dhcp_dpi_clickhouse_up` reads 0 even though ClickHouse looks healthy	The gauge reflects whether ClickHouse responded within the staleness window. Any error or timeout flips it	Inspect ClickHouse directly; the gauge is a faithful reflection of the appliance’s connectivity, not of ClickHouse in isolation.
Token suddenly stops working	Token was revoked, expired, or its expiry was set in the past	Mint a new token in Prometheus Integration (chapter 40). Revoked tokens cannot be reinstated.
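
When a scrape succeeds but a gauge looks wrong, it can help to read the value straight from the response body. The sketch below parses the Prometheus text exposition format and returns the first sample for a named metric. It assumes you have already fetched the body with a valid scrape token; `dhcp_dpi_clickhouse_up` is the only metric name taken from this chapter.

```python
from typing import Optional


def read_sample(exposition: str, metric: str) -> Optional[float]:
    """Return the first sample value for `metric`, or None if absent."""
    for line in exposition.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        if "{" in line:
            # name{labels} value [timestamp]
            name = line[: line.index("{")]
            rest = line[line.rindex("}") + 1:]
        else:
            # name value [timestamp]
            name, _, rest = line.partition(" ")
        fields = rest.split()
        if name.strip() == metric and fields:
            return float(fields[0])
    return None
```

A return value of `0.0` for `dhcp_dpi_clickhouse_up` confirms the appliance itself considers ClickHouse unreachable, while `None` means the metric is not being exported at all.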

When Nothing Else Works

Emergency procedures for recovering service. They preserve data unless the entry says otherwise.

Situation	Recovery path
Appliance is unreachable from the network	Use the local console or out-of-band management interface. The shipped install includes a recovery shell. See Installation (chapter 05).
Enforcement is blocking everything (including the operator’s own workstation)	Follow the emergency clear procedure in Firewall Guidance (chapter 22). It documents how to drop the active ruleset to permissive without rebooting.
Configuration database is corrupted	Restore from the most recent backup. The appliance keeps `config.yaml` outside the database, so core config survives. Operational config and history are in the database.
Operator has forgotten the admin password	Reset using the shipped recovery tool from the appliance shell. See Authentication (chapter 29).
nftables ruleset is broken on disk	Re-deploy from `nft-v2.sh` — it is idempotent. See NFTables Deployment (chapter 06).
Operator wants to start clean	The shipped `install.sh` supports re-running with the `--reset` flag. This destroys all history and configuration. Confirm a backup exists first.

Open a support case if a symptom persists after working through this chapter. The Support Backchannel (chapter 34) is the supported way to attach diagnostics, screenshots, and context to a case in one step.