Troubleshooting
A symptom-driven reference for the most common things that go wrong in operation. Every entry answers “I see X, what do I do?” rather than describing how the system works in the abstract.
Use this chapter when something is misbehaving and you want a fast pointer to the right control or chapter. Each section is organised as a table: symptom on the left, likely cause and remedy on the right. Cross-links lead to the chapter where the underlying feature is documented in full.
If your symptom is not listed here, the Glossary and the Architecture chapter (chapter 01) together cover almost every concept the GUI exposes.
Before you change anything destructive: if your appliance is in a recoverable bad state, check Firewall Guidance (chapter 22) first. It documents the supported escape hatches for clearing rules and restoring service without rebooting.
Ingest Path — Packets Not Being Processed
Symptoms that the processor is running but DHCP traffic is not flowing through it.
| Symptom | Likely cause | Remedy |
|---|---|---|
| No new rows in the DHCP Stream (chapter 09) at all | nftables ruleset not loaded, or the queue chain is not jumping into the processor | Re-deploy the ruleset from the shipped nft-v2.sh and confirm the appliance is on the inline path described in Deployment Modes (chapter 04). See NFTables Deployment (chapter 06) for the deploy checklist. |
| Dashboard event-rate panel reads zero but the DHCP server is clearly answering on the wire | Processor is bypassing inspection because the upstream chain is missing the queue jump | Reload the ruleset; verify the queue number in config.yaml matches the one used in the active ruleset. The two must agree. |
| DHCP Stream shows DISCOVER and REQUEST but never OFFER / ACK / REPLY | Expected behaviour — the inspector hooks prerouting, so it sees only client-to-server messages. Server responses are recorded by other chains for Firewall Decisions (chapter 10), not the Stream | No action. The Stream is a one-direction view by design. |
| Devices appear “blocked” but new events still arrive from them | Block applies at the verdict stage; events are recorded before that. Deny additionally suppresses event emission | If you want to silence events as well as drop packets, use Deny instead of Block. See Key Concepts (chapter 03). |
| A specific MAC is missing from the Stream | The device’s mark may be in llm_denied_marks (Deny suppresses event recording) or the device may be using DHCPv6 with no extractable MAC | Check Device Details (chapter 14) for the device. DUID-only DHCPv6 clients are tracked by DUID, not MAC — see the DHCPv6 note in chapter 14. |
| Stream traffic appears, then suddenly stops | The processor service crashed and the queue is now in fall-through mode (packets passing without inspection) | Check the appliance’s service status from a host terminal session on the appliance. Restart the dhcp-processor service. |
| Stream is full of duplicate events | An L2 broadcast environment is delivering the same DHCP packet to the inspector twice, or the relay path is also unicasting | Confirm the deployment matches the topology described in chapter 04. The shipped reference deployments hook a single point of capture. |
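The queue-number agreement mentioned in the table can be checked with a short script. The following is a minimal Python sketch, not part of the appliance. It assumes you have already captured the active ruleset as text (for example from nft list ruleset) and that you read the configured queue number out of config.yaml by hand, since the name of that key is not given in this chapter. The function names are hypothetical.

```python
import re

# Matches both nftables spellings of a queue statement:
#   "queue num 0" / "queue num 0-3"                 (older syntax)
#   "queue to 0"  / "queue flags bypass to 0-3"     (newer syntax)
_QUEUE_RE = re.compile(r"\bqueue\b[^\n;]*?\b(?:num|to)\s+(\d+)(?:-(\d+))?")


def queue_numbers(ruleset_text: str) -> set[int]:
    """Return every queue number referenced by queue statements in the ruleset."""
    found: set[int] = set()
    for match in _QUEUE_RE.finditer(ruleset_text):
        first = int(match.group(1))
        last = int(match.group(2)) if match.group(2) else first
        found.update(range(first, last + 1))
    return found


def queue_matches(ruleset_text: str, configured_queue: int) -> bool:
    """True when the configured queue number is one the ruleset sends packets to."""
    return configured_queue in queue_numbers(ruleset_text)
```

If queue_matches returns False, the processor is listening on a queue the kernel never feeds, which is consistent with the zero event-rate symptom above.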
nftables Sets — Enforcement Not Behaving as Expected
Symptoms that the kernel-level sets, marks, or per-message-type chains are not doing what you asked.
| Symptom | Likely cause | Remedy |
|---|---|---|
| Operator blocked a MAC; device is still talking | Mark collision (two MACs share their last 3 bytes) or the action has not yet reached the kernel | Check Device Details > Enforcement Status (chapter 14). If “Mark Confidence” reads low (collision), the mark is shared with another device. Run Cleanup, then re-issue the action; the Firewall Manager (chapter 20) explains the collision risk in full. |
| The same device cycles in and out of blocked_macs every couple of minutes | Per-client rate limit keeps tripping. The set timeout is 2 minutes on the shipped profile | Either tune the rate-limit threshold in the Firewall Manager (chapter 20) or issue a long-lived enforcement action (Block / Throttle / Deny) so the device is not re-evaluated against the per-client cap. |
| A device that should be trusted keeps getting throttled | Behavioural set ladder evaluates llm_allowed_marks first, but the device is not in that set | Open Device Details, click Allow, and confirm. Allow takes precedence over every other set in the ladder. See Key Concepts (chapter 03) for the ladder order. |
| Sets fill up to the 1M cap and entries get evicted | Network is far larger than the shipped defaults assume, or the timeouts are too long for the rate of new entries | Raise the per-set size cap in the Firewall Manager (chapter 20), or shorten timeouts so old entries drop out faster. The Manager exposes both. |
| A rule fired but no mark made it to the set | Action Manager could not reach nftables, or the action was rejected because of a validation error | Check the Actions (chapter 15) history page for the failed action; its row will show the failure reason. |
| Cannot remove a device from llm_allowed_marks by clicking Cleanup | Cleanup removes the device from every behavioural set; the entry should clear within a second | Refresh Device Details. If the entry persists, the appliance may have lost its nftables connection; restart the dhcp-processor service. |
| Enforcement looks correct in the GUI but the device is still reaching the DHCP server | The appliance is in mirror mode, not inline. Mirror mode is observation-only — it does not enforce | Switch to inline. See Deployment Modes (chapter 04). |
When in doubt about the live ruleset: the Flow Visualizer (chapter 21) renders the chains and sets exactly as they are running. Use it to confirm that the ladder the Manager shows matches the kernel.
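The mark-collision case from the table can be illustrated with a small Python sketch. It assumes the mark is exactly the last three bytes of the MAC address, which is what the collision description implies; the appliance's real derivation may differ, and both function names are hypothetical.

```python
from collections import defaultdict


def mark_for(mac: str) -> int:
    """Derive a 24-bit mark from the last three bytes of a MAC address.

    Assumption: the mark is exactly the low three bytes of the MAC.
    """
    octets = mac.lower().replace("-", ":").split(":")
    if len(octets) != 6:
        raise ValueError(f"not a 6-byte MAC address: {mac!r}")
    return int("".join(octets[3:]), 16)


def find_collisions(macs: list[str]) -> dict[int, list[str]]:
    """Group MACs by mark and keep only marks shared by more than one MAC."""
    by_mark: dict[int, list[str]] = defaultdict(list)
    for mac in macs:
        by_mark[mark_for(mac)].append(mac)
    return {mark: group for mark, group in by_mark.items() if len(group) > 1}
```

Running find_collisions over an inventory export shows which devices would share a mark, and therefore which enforcement actions could spill over onto a second device.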
Automation Rules — Runaway Detection Sets
Symptoms that a single automation rule is so broad it has filled a kernel set with hundreds of thousands of entries, and the system is now choking on the size.
| Symptom | Likely cause | Remedy |
|---|---|---|
| Dashboard widgets that show nftables set sizes — LLM Action Sets, Rate-limited MACs — are frozen at a timestamp many minutes in the past, while other widgets keep refreshing | Periodic counter collection is timing out on a very large set dump | See remedies below. The freeze is a downstream effect, not the root cause. |
| nft list ruleset from a host terminal session hangs for tens of seconds, or appears to stall completely | Same cause: nft list walks every element of every set, and one set is now in the high six figures | Check set sizes (below). The kernel is fine; the dump is what is slow. |
| The Automation (chapter 16) Recent Executions table shows the last run for a rule many minutes ago, even though the rule’s interval is much shorter | The rule is mid-flight on a single execution that detected hundreds of thousands of devices; the execution row is only written when the run finishes | If the rule is misconfigured, disable it from the Automation page. The in-flight run is killed at the next service restart. |
| An automation rule’s Test Results preview reports a detection count one or two orders of magnitude higher than you expected | The rule’s thresholds are too loose for the network’s MAC diversity. A common pattern: threshold logic set to OR with a low min_unique_ips, on a Deny-action rule | Switch the rule to AND logic, or raise min_unique_ips and min_request_count until the preview matches what you actually want to act on. See Automation (chapter 16) for the threshold semantics. |
| New denies or blocks keep landing every cycle for devices that already have active enforcement | Expected. The rule re-detects them on every tick; the action manager deduplicates, but the detection cost is still paid | Tighten the rule as above, or shorten the rule’s lookback window so older detections roll out of view sooner. |
| You suspect a runaway rule but cannot tell which one | The Automation (chapter 16) page lists every enabled rule with its last preview count | Open each rule’s Test Results in turn. Any single rule reporting a six-figure preview is your candidate. |
Confirming the diagnosis. The counters.interval_secs value in config.yaml (default 30) is also the upper bound on how long the collector waits before declaring its full set dump a failure. When a set crosses roughly half a million elements, that 30-second budget is no longer sufficient and every poll silently times out — freezing every widget that reads from it.
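One way to check set sizes without walking the whole ruleset is to dump one set at a time in JSON and count its elements. The sketch below is a hedged illustration in Python, not a shipped tool. It relies on the JSON output of nft (the -j option); the family and table names you pass in are placeholders, because this chapter does not name the DHCP DPI table, and the helper names are hypothetical.

```python
import json
import subprocess


def set_sizes(ruleset: dict) -> dict[str, int]:
    """Map "family table name" to element count for every set in nft JSON output."""
    sizes: dict[str, int] = {}
    for item in ruleset.get("nftables", []):
        nft_set = item.get("set")
        if nft_set is None:
            continue  # skip metainfo, tables, chains and rules
        key = f'{nft_set["family"]} {nft_set["table"]} {nft_set["name"]}'
        # An empty set has no "elem" key at all.
        sizes[key] = len(nft_set.get("elem", []))
    return sizes


def dump_one_set(family: str, table: str, name: str, timeout_secs: float = 60.0) -> dict:
    """Dump a single set as JSON; raises subprocess.TimeoutExpired if it stalls."""
    result = subprocess.run(
        ["nft", "-j", "list", "set", family, table, name],
        capture_output=True, text=True, check=True, timeout=timeout_secs,
    )
    return json.loads(result.stdout)
```

For example, set_sizes(dump_one_set(family, table, "blocked_macs")) returns the element count for that one set. A count approaching the half-million mark described above points at the set that is stalling the collector.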
Remedies, in order of escalation:
1. Tighten the rule. Disable the offending rule on the Automation page, or edit it to switch OR → AND or raise its thresholds, then re-enable. The enforcement entries the rule already created stay in the kernel until they reach their action timeout (typically 24 hours for a Deny). For most situations this is the right answer: set growth halts immediately and observability returns within one collector cycle.
2. Adjust the collector for scale, if you genuinely need a rule that detects this many devices. Raise counters.interval_secs in config.yaml so the collector has enough headroom; for sets in the high six figures, five minutes (300) is comfortable. The trade-off is that the set-size widgets refresh once every five minutes instead of every thirty seconds; for slowly-changing kernel-state graphs this is an acceptable cadence. Restart the dhcp-processor service for the change to take effect.
3. Last resort: flush the DHCP DPI table and reapply. When (1) and (2) are not enough (for example, a set is so bloated with orphan entries from an earlier misconfiguration that even a five-minute interval still times out), the Firewall Manager (chapter 20) Apply: flush table control (yellow play arrow) clears every set in the DHCP DPI table in one transaction. This wipes every active enforcement immediately. Follow it straight away with Database Actions Sync → Reapply Actions in the same chapter to rebuild the kernel sets from the mac_actions table; that restores every Block, Deny, Throttle, Allow and Monitor the database still has on record. Anything that was in nftables but not in the database (typically orphans from a runaway automation cascade) does not come back, which is usually the point. For the few seconds between flush and reapply, no enforcement is active, so schedule this during a quiet window if you can, and avoid the Apply: flush ruleset mode (red play arrow) unless you also want to wipe every other table the host is running, including ones managed by other software.
Preventing recurrence. Use the Preview button on every new or edited automation rule before enabling it. Any preview reporting tens of thousands or more devices is the moment to stop and reconsider the thresholds — not after the kernel set has filled.
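To see why OR logic with a low min_unique_ips inflates the preview count, consider the following Python sketch. It is an illustration only: the parameter names min_unique_ips and min_request_count come from the table above, but the greater-than-or-equal comparison, the DeviceStats type, and the function names are assumptions rather than the appliance's actual rule engine.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DeviceStats:
    mac: str
    unique_ips: int
    request_count: int


def is_detected(stats: DeviceStats, min_unique_ips: int,
                min_request_count: int, logic: str) -> bool:
    """Apply both thresholds and combine them with AND or OR."""
    ip_hit = stats.unique_ips >= min_unique_ips
    request_hit = stats.request_count >= min_request_count
    if logic == "AND":
        return ip_hit and request_hit
    if logic == "OR":
        return ip_hit or request_hit
    raise ValueError(f"unknown logic: {logic!r}")


def preview_count(devices: list[DeviceStats], min_unique_ips: int,
                  min_request_count: int, logic: str) -> int:
    """Count how many devices the rule would act on."""
    return sum(
        is_detected(d, min_unique_ips, min_request_count, logic) for d in devices
    )
```

With min_unique_ips set to 1, every device that has ever held a single address satisfies the first condition, so under OR the request threshold is never consulted and the rule matches the whole network. Under AND, the same thresholds match only the devices that are also noisy.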
LLM — Analysis Not Running or Returning Garbage
Symptoms involving the LLM backend, anomaly detection cycle, or Device Analysis page.
| Symptom | Likely cause | Remedy |
|---|---|---|
| Run Analysis button is greyed out on a device | Analysis cooldown has not yet expired for this device | The remaining cooldown is shown next to the button in Device Details > LLM Analysis (chapter 14). Wait it out or shorten the cooldown in operational config. |
| Analyze button is available but every run fails | LLM backend is unreachable or returning malformed JSON | Verify the LLM endpoint URL in config.yaml, then check LLM Setup (chapter 23) for the connection checklist. The endpoint must respond on the configured host:port. |
| Analyses come back with empty indicators / evidence | The model is too small or too quantised to follow the structured-output instructions | Try a larger model. The LLM Setup (chapter 23) chapter lists supported backends and recommended model sizes. |
| Anomaly detection cycle is silent — no scheduled analyses ever run | The cycle is disabled in operational config | Open Settings and verify both “LLM enabled” (core, in config.yaml) and “LLM active” (operational, in the GUI). Both must be true. See Key Concepts > Enable vs Active (chapter 03). |
| Risk scores are wildly inconsistent across re-runs of the same device | Model temperature is non-zero, or the analysis configuration was changed mid-run | Reduce temperature to 0 in operational config, and review the analysis configuration. |
| Auto-actions never execute even though analyses produce high risk scores | Auto-execute thresholds are not configured, or the action category is disabled | Configure the thresholds in Automated Actions (chapter 25). |
| LLM response includes obvious hallucinated MAC addresses or rule names | The analysis context is leaking from one analysis into another, or it is too large | Review and tighten the analysis configuration. |
| Device Analysis page is permanently empty after running an analysis | The analysis ran but the result row was rejected on insert | Check the Statistics (chapter 19) page for LLM error counters, then try again. Persistent failures usually indicate a model output that does not match the expected schema. |
API — Endpoints Returning the Wrong Thing
Symptoms for operators querying the REST API directly or building scripts on top of it.
| Symptom | Likely cause | Remedy |
|---|---|---|
| 401 Unauthorized on every endpoint | Token expired, or the session was revoked | Re-authenticate. See Authentication (chapter 29). Long-running scripts should refresh the token before expiry. |
| 401 on /api/metrics/prometheus specifically | Wrong token type: JWTs are rejected on that route by design; only scrape tokens authenticate it | Mint a scrape token in Prometheus Integration (chapter 40). |
| 403 Forbidden on an admin endpoint | The user is not in the admin role | Have an administrator assign the role in User Management (chapter 27). |
| Response payload is missing fields the documentation shows | The appliance is on an older release than the documentation | Check the About modal for the running version. Upgrade if necessary; the docs always describe the latest release. |
| Slow responses on history endpoints | The time range queried is very wide, hitting raw event tables instead of aggregated views | Narrow the time range or use the dashboard summary endpoints. The Statistics (chapter 19) chapter covers the aggregated views. |
| Endpoint returns 5xx after every reload of the appliance | Backend service has not yet finished startup | Wait one minute and retry. Startup includes database migrations, prompt-template load, and automation-rule load. |
| 404 on a documented endpoint | Feature flag in config.yaml is false, so the route is not mounted | Set the flag to true, then restart the appliance. Routes are mounted only when their parent feature is enabled. |
| WebSocket disconnects every few seconds | WebSocket rate limit is set too low, or a proxy in front of the appliance is closing idle connections | Raise the operational rate limit; configure your reverse proxy to allow long-lived upgrades. |
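For long-running scripts, the advice to refresh the token before expiry can be sketched as follows. This is a minimal Python example under stated assumptions: it assumes the API token is a JWT carrying the standard exp claim, and it only reads that claim without verifying the signature. The refresh call itself is left out because the endpoint is documented in Authentication (chapter 29), not here. The function names are hypothetical.

```python
import base64
import json
import time
from typing import Optional


def jwt_expiry(token: str) -> int:
    """Return the exp claim (seconds since the epoch) from a JWT, unverified."""
    payload_b64 = token.split(".")[1]
    # JWTs strip base64 padding; restore it before decoding.
    padded = payload_b64 + "=" * (-len(payload_b64) % 4)
    claims = json.loads(base64.urlsafe_b64decode(padded))
    return int(claims["exp"])


def needs_refresh(token: str, margin_secs: int = 60,
                  now: Optional[float] = None) -> bool:
    """True once the token is within margin_secs of its expiry (or past it)."""
    current = time.time() if now is None else now
    return current >= jwt_expiry(token) - margin_secs
```

A script would call needs_refresh before each batch of requests and re-authenticate when it returns True, rather than waiting for the first 401.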
GUI — Page Looks Wrong or Will Not Load
Symptoms in the browser. The GUI is a single-page app served from the appliance.
| Symptom | Likely cause | Remedy |
|---|---|---|
| White screen after login | Browser is caching a previous build of the GUI | Hard-refresh the browser tab (Shift+Reload). If the symptom persists, clear site data for the appliance hostname. |
| Sidebar items missing for a user | The user is not in a role that grants those pages | Adjust the role in User Management (chapter 27). |
| About modal shows no version string | Build manifest was not embedded — usually a development build | Rebuild the GUI on the production build host. See the appliance release notes. |
| A page loads but every panel reads “Loading…” forever | The API is unreachable from the browser; the GUI cannot fetch its data | Check that the API listener is up. A host terminal session on the appliance is a convenient way to verify locally. |
| Dashboard charts are blank during business hours | No data in the queried time window, or aggregated views have not caught up after a restart | Wait one aggregation cycle. The Dashboard (chapter 07) chapter explains the data sources for each panel. |
| Time pickers offer the wrong default range | The user’s profile is set to a different default | Adjust in the user’s profile page. |
| Buttons appear but do nothing on click | Browser console will show an error; the most common cause is a stale browser tab held open across an upgrade | Reload the tab. |
| Numbers in two panels disagree by a small amount | Panels query different aggregation levels (1-minute, 5-minute, hourly) | Expected. The dashboard mixes time-granularity to balance accuracy and load. See chapter 07. |
Packet Capture — Captures Not Starting or Files Missing
Symptoms in the Packet Capture (chapter 32) tool.
| Symptom | Likely cause | Remedy |
|---|---|---|
| Start Capture button is disabled | The user is not in a role that allows captures | Admin role is required. See User Management (chapter 27). |
| Capture starts but no packets are recorded | Filter expression is too narrow, or the chosen interface sees no traffic | Loosen the filter. The Packet Capture chapter (chapter 32) shows valid filter syntax. |
| Capture file does not appear in the file list | The shipped capture binary lost permission to write its output directory | The directory path is set in config.yaml. The directory and every parent must be readable and writable by the appliance service account, including for traversal. |
| Download returns a 404 | The file was already auto-rotated and removed | Adjust the rotation policy in operational config so files live long enough for the operator workflow. |
| Live capture stream is choppy | Network is busier than the live decode pipeline can keep up with | Filter more aggressively, or capture to file and download for offline analysis. |
| Capture binary is reported as missing | The appliance is running without the optional capture binary in its PATH | Install the capture binary on the appliance (the release notes name the package), then restart the dhcp-processor service. |
License — Activation, Renewal, and Binding
Symptoms involving the license file, activation flow, or feature gating.
| Symptom | Likely cause | Remedy |
|---|---|---|
| Some sidebar items have a padlock icon and are not clickable | Feature is gated by the current license tier | Review your license in License Management (chapter 37). Upgrading the tier unlocks the gated pages. |
| Banner reads “License expired” and enforcement actions are read-only | The license file’s expiry date is in the past | Renew the license, then upload the new file via License Management (chapter 37). |
| Banner reads “License binding mismatch” | The appliance hardware fingerprint differs from the one the license was issued for | This usually means the appliance was migrated to new hardware. See Installation Binding (chapter 38) for the re-binding procedure. |
| License page shows “Unknown tier” | License file is malformed or signed by an unrecognised key | Re-download the license from the licensor portal and re-upload. |
| IPv6 controls are missing even though the appliance is processing DHCPv6 | IPv6 is gated in the GUI but not in the backend; the license tier does not include IPv6 GUI | Upgrade the tier, or use the API directly for v6 work. The backend never blocks v6 packets — only the GUI hides v6 controls. |
| Save is greyed out on License page | The user is not an administrator | License changes are admin-only. See User Management (chapter 27). |
OAuth and Authentication
Symptoms involving login, SSO, sessions, and audit.
| Symptom | Likely cause | Remedy |
|---|---|---|
| OAuth login redirects to a generic error page | OAuth provider rejected the redirect URI, or the appliance’s configured client secret is wrong | Verify the redirect URI matches what is registered with the provider. See Authentication > OAuth (chapter 29). |
| Password login works but OAuth users cannot reach admin pages | Role mapping is missing — the appliance does not know how to translate OAuth group claims to local roles | Configure the mapping in User Management (chapter 27). |
| User locked out after a few wrong passwords | The shipped policy locks the account temporarily | An administrator can unlock the user in User Management (chapter 27). |
| Session expired warning appears mid-session | Token rotation failed because the browser was offline or the appliance was restarted | Re-authenticate. The system never destroys session state silently. |
| Operator’s actions are not appearing in the audit log | Their user record was deleted and recreated, so the audit rows attach to the old user ID | Audit rows are immutable. The new user ID will be on subsequent actions. See Sessions & Audit (chapter 30). |
| OAuth login lands the user as a viewer despite group membership | Group claim is named differently in the provider’s token than the mapping expects | Use the audit log to inspect the raw claims, then update the mapping. |
ClickHouse Connectivity
Symptoms involving the data store.
| Symptom | Likely cause | Remedy |
|---|---|---|
| Banner reads “ClickHouse unreachable” | The appliance cannot connect to ClickHouse on the configured host:port | Verify the host is up and the credentials in config.yaml are valid. See Alarms (chapter 11) for the system:clickhouse:unreachable source key. |
| Banner reads “ClickHouse storage low” | Data directory is below the warning threshold | Free disk space or expand the volume. The alarm engine raises system:clickhouse:storage_warning and escalates to _critical past a second threshold. |
| Queries on history pages are very slow | ClickHouse server is under load or its background merges have fallen behind | The Statistics (chapter 19) chapter documents which aggregated views to query for fast results. Avoid querying raw event tables for long time windows. |
| Some configuration changes silently revert | A configuration table uses the merge-on-read engine and the operator is reading without forcing a merge | This is a backend issue, not an operator-fixable symptom — file a support case. The shipped code already accounts for this; persistent reverts indicate a bug. |
| Operator made a change in the GUI but the appliance still uses the old value | Operational settings are read from the database on demand; some subsystems cache their settings briefly | Wait a minute, or cycle the affected feature toggle (active → inactive → active). |
| ClickHouse alarms keep firing and resolving in a tight loop | The connectivity check is racing a flapping network path | Stabilise the network, or widen the alarm’s evaluation interval. See Alarms (chapter 11). |
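A quick way to separate a network problem from a credentials problem is to probe the configured host:port at the TCP level first. The sketch below is a generic Python illustration, not appliance code; the host and port are whatever your config.yaml specifies, and the function name is hypothetical.

```python
import socket


def can_connect(host: str, port: int, timeout_secs: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout_secs):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name-resolution failures.
        return False
```

A True result only proves that something is listening on that port. It says nothing about whether the credentials are valid, so a persistent “ClickHouse unreachable” banner after a successful probe points at authentication rather than connectivity.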
Support Tunnel
Symptoms in the vendor support backchannel.
| Symptom | Likely cause | Remedy |
|---|---|---|
| Open Support Tunnel button is greyed out | Support backchannel is disabled in config.yaml | Administrator must enable it and restart the appliance. See Support Backchannel (chapter 34). |
| Tunnel opens but the support engineer reports they cannot reach the GUI | Reverse-tunnel ports are blocked by an upstream firewall | Outbound SSH on the configured port must be allowed. Chapter 31 lists every port the tunnel uses. |
| Feedback submission returns “session expired” | The tunnel times out after the per-session idle limit | Re-open the tunnel and retry. Idle limits are operational settings. |
| Screenshot attached to a feedback report is blank | Browser blocked the screen-capture permission for this site | Re-grant the permission in the browser, then retry. The system cannot capture screens without it. |
| Diagnostics bundle is missing files | The bundle is allowlisted — only specific log files and configuration excerpts are collected | This is by design. See chapter 34 for the bundle’s exact contents. |
| Operator wants to close a tunnel they did not open | Sessions are owned by the user who opened them, but admins can revoke any session | Use Revoke Session in Sessions & Audit (chapter 30). |
dpictl and the Web Console
Symptoms when using the shipped dpictl command-line tool or its web console.
| Symptom | Likely cause | Remedy |
|---|---|---|
| dpictl reports “permission denied” | The tool is not setuid, or the user is not in the appliance admin group | Run dpictl as a user in the appliance admin group. The tool reads the same config.yaml as the daemon, so it needs read access to that file. |
| Console hangs after entering a long command | The command produces an output stream longer than the console buffer | Use dpictl from a host terminal session; the console is best for short commands. |
| A console command modifies state but the GUI does not reflect the change | The GUI caches some lookups for a few seconds | Refresh the page. |
| dpictl cannot reach ClickHouse from the appliance host | The tool uses the credentials in config.yaml; ClickHouse may be on a separate machine | Verify reachability and credentials. The same connectivity issues that affect the daemon also affect dpictl. |
Prometheus Scrape Endpoint
Symptoms specific to the platform-health Prometheus endpoint. The endpoint exposes appliance health only, never traffic.
| Symptom | Likely cause | Remedy |
|---|---|---|
| /api/metrics/prometheus returns 404 | prometheus.enabled: true is not set in config.yaml, so the route is not mounted | Set the flag and restart the appliance. The integration is zero-cost when disabled, so the route only appears when enabled. See Prometheus Integration (chapter 40). |
| Endpoint returns 401 even with a valid JWT | JWTs are not accepted on this endpoint by design; only scrape tokens authenticate it | Mint a scrape token. See chapter 40. |
| Scrape works in curl but Prometheus reports “context deadline exceeded” | Prometheus scrape timeout is shorter than the appliance’s response time during ClickHouse pressure | Increase the scrape timeout in Prometheus’s job configuration. |
| Metrics list is shorter than expected | This endpoint exposes only platform-health metrics — alarm counts, dispatcher depth, build info, ClickHouse up, vector breaker | This is by design. Per-MAC, per-rule, and per-event data is never exposed here. See chapter 40 for the complete list. |
| dhcp_dpi_clickhouse_up reads 0 even though ClickHouse looks healthy | The gauge reflects whether ClickHouse responded within the staleness window. Any error or timeout flips it | Inspect ClickHouse directly; the gauge is a faithful reflection of the appliance’s connectivity, not of ClickHouse in isolation. |
| Token suddenly stops working | Token was revoked, expired, or its expiry was set in the past | Mint a new token in Prometheus Integration (chapter 40). Revoked tokens cannot be reinstated. |
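When checking a gauge such as dhcp_dpi_clickhouse_up by hand, it can help to pull the value out of a saved scrape response. The following Python sketch parses the Prometheus text exposition format for a single metric. It is a simplified reader written for illustration (it returns the first sample it finds and ignores label matching), and the function name is hypothetical.

```python
from typing import Optional


def read_gauge(exposition_text: str, metric: str) -> Optional[float]:
    """Return the first sample value for metric in Prometheus text format, or None."""
    for line in exposition_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        if "{" in line:
            # Labelled sample: the name ends at "{", the value follows the last "}".
            name = line[: line.index("{")]
            rest = line[line.rindex("}") + 1:]
        else:
            name, _, rest = line.partition(" ")
        if name != metric:
            continue
        fields = rest.split()
        if fields:
            return float(fields[0])  # an optional timestamp may follow the value
    return None
```

If read_gauge returns 0.0 for dhcp_dpi_clickhouse_up while ClickHouse answers queries from another host, the difference is in the path between the appliance and ClickHouse, as the table above describes.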
When Nothing Else Works
Emergency procedures that recover service without losing data.
| Situation | Recovery path |
|---|---|
| Appliance is unreachable from the network | Use the local console or out-of-band management interface. The shipped install includes a recovery shell. See Installation (chapter 05). |
| Enforcement is blocking everything (including the operator’s own workstation) | Use the Firewall Guidance (chapter 22) chapter’s emergency clear procedure. It documents how to drop the active ruleset to permissive without rebooting. |
| Configuration database is corrupted | Restore from the most recent backup. The appliance keeps config.yaml outside the database, so core config survives. Operational config and history are in the database. |
| Operator has forgotten the admin password | Reset using the shipped recovery tool from the appliance shell. See Authentication (chapter 29). |
| nftables ruleset is broken on disk | Re-deploy from nft-v2.sh — it is idempotent. See NFTables Deployment (chapter 06). |
| Operator wants to start clean | The shipped install.sh supports re-running with the --reset flag. This destroys all history and configuration. Confirm a backup exists first. |
Open a support case if a symptom persists after working through this chapter. The Support Backchannel (chapter 34) is the supported way to attach diagnostics, screenshots, and context to a case in one step.
See Also
- Architecture (chapter 01) — how the pieces fit together
- Key Concepts (chapter 03) — marks, sets, actions
- Alarms (chapter 11) — system health monitoring and notification rules
- Statistics & Reports (chapter 19) — counters and the Report Builder for forensic queries
- Firewall Guidance (chapter 22) — emergency recovery procedures
- Prometheus Integration (chapter 40) — external platform-health monitoring
- Glossary — definitions of every named concept