Practical Security Engineering for Always-On Home Infrastructure: Risk, Resilience, and Operational Reality
Dissertation — Security Engineering for Always-On Home Infrastructure
Author: the-operator | Date: 2026-02-12
Abstract
This dissertation synthesizes six units of study into a cohesive security architecture for a two-node home lab running always-on AI agent infrastructure (OpenClaw), development tooling (COSMO IDE), and supporting services. The work confronts the fundamental tension of home infrastructure security: enterprise-grade threats meet solo-operator constraints. Through systematic threat modeling, layered control design, and operationally grounded incident simulation, this paper demonstrates that a prioritized, evidence-based approach can achieve meaningful risk reduction without drowning the operator in maintenance overhead. The result is a defensible security posture, a prioritized hardening roadmap, and an honest accounting of accepted residual risk.
1. Target Environment and Trust Model
1.1 Infrastructure Overview
The environment consists of two always-on compute nodes on a residential LAN:
| Host | Role | Key Services |
|---|---|---|
| Raspberry Pi (ai-node / [local-ip]) | Primary automation node | OpenClaw gateway+agents (Axiom), COSMO IDE, SearxNG (Docker), clawdboard |
| Mac mini ([local-ip]) | Secondary node + workstation | OpenClaw gateway+agents (COZ), COSMO IDE local |
Both nodes run PM2-supervised services under a single user (the-operator), communicate over HTTP webhooks and SSH, and are reachable on the home LAN. External exposure is intentionally minimal but not zero—webhook endpoints accept inbound traffic for Telegram bot integration, and the architecture's always-on nature means the attack surface never sleeps.
1.2 Data Classification
The asset inventory (Unit 0) established three tiers:
- Tier 1 (Sensitive): Gateway/webhook tokens, API secrets, SSH private keys. Compromise yields full command execution or host takeover.
- Tier 2 (Internal): Memory files, project documentation, operational logs. Compromise yields intellectual property exposure and operational intelligence.
- Tier 3 (Public): Non-sensitive notes and public documentation.
The critical insight from classification: Tier 1 secrets are the skeleton key to everything else. A leaked gateway token is functionally equivalent to a root shell, because the OpenClaw agent framework executes arbitrary commands on behalf of authenticated callers.
1.3 Threat Model
Using STRIDE analysis against four attacker profiles, the threat model identified five highest-risk abuse paths:
- Exposed webhook + token leak → remote command execution. The most consequential path. A single bearer token grants full automation stack control.
- Gateway token reuse across nodes → lateral movement. Both nodes share similar authentication patterns; compromising one facilitates pivoting to the other.
- Weakly protected admin interfaces → data exfiltration. COSMO IDE and clawdboard expose project memory and operational state.
- SSH key compromise → persistent host control. PM2 persistence means attacker-installed services survive reboots.
- Unpatched dependency/container → initial foothold. The npm supply chain and Docker images are trusted implicitly at deployment time.
1.4 Trust Boundaries
Four trust boundaries govern the architecture:
- Internet ↔ Edge services: The most hostile boundary. Only reverse-proxied, token-authenticated endpoints should cross it.
- LAN devices ↔ Infrastructure nodes: IoT and guest devices must be treated as potentially compromised.
- Automation agents ↔ Host operations: Agents execute privileged operations; their authentication tokens are effectively root-equivalent credentials.
- Secrets storage ↔ Runtime processes: Secrets must transition from storage to runtime without exposure in logs, environment dumps, or committed files.
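The fourth boundary can be partly enforced in code by scrubbing secrets before a log line is written. The following is a minimal sketch in Python, chosen for illustration only. It assumes secrets appear as `Bearer <value>` headers or `token=<value>` style pairs; the `redact` function, the `RedactSecrets` class, and both patterns are hypothetical and are not part of OpenClaw.

```python
import logging
import re

# Assumed secret shapes (not taken from the source): "Bearer <value>" headers
# and "token=<value>" / "secret=<value>" / "api_key=<value>" pairs.
_SECRET_PATTERNS = [
    re.compile(r"(Bearer\s+)[A-Za-z0-9._~+/=-]+"),
    re.compile(r"((?:token|secret|api_key)=)[^\s&]+", re.IGNORECASE),
]


def redact(text: str) -> str:
    """Replace the value part of anything that looks like a secret."""
    for pattern in _SECRET_PATTERNS:
        text = pattern.sub(r"\1[REDACTED]", text)
    return text


class RedactSecrets(logging.Filter):
    """Logging filter that scrubs secrets before a record is emitted."""

    def filter(self, record: logging.LogRecord) -> bool:
        record.msg = redact(record.getMessage())
        record.args = ()  # the message is already fully formatted
        return True
```

Attaching the filter to a handler with `handler.addFilter(RedactSecrets())` would apply it to every record that handler emits. Pattern-based redaction only catches the shapes it knows about, so it complements keeping secrets out of log statements rather than replacing that practice.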
2. Control Architecture: Defense in Depth
The security architecture follows a layered model where each layer provides independent protection. No single control failure should yield full compromise.
2.1 Identity and Access Controls (Unit 1)
Design principle: Every authenticated action must be attributable to a named identity with minimum necessary privilege.
Key controls implemented:
- SSH hardening: `PermitRootLogin no`, `PasswordAuthentication no`, ed25519 keys only, `AllowUsers` restriction, fail2ban rate limiting. This eliminates the entire class of password-based SSH attacks, the most common opportunistic threat against internet-adjacent Linux hosts.
- Token scoping and lifecycle: Gateway tokens, webhook tokens, and API keys are inventoried with ownership, purpose, and expiration metadata. The 90-day rotation cadence balances security hygiene against operational disruption for a solo operator.
- Emergency revocation SLA: 15-minute containment target for suspected credential compromise. This is aggressive for a home lab but necessary given that token compromise equals full command execution.
- Monthly access review: A structured checklist covering human accounts, SSH keys, service tokens, and exposure-linked credentials. The review forces periodic confrontation with privilege creep.
Measured effect: The access review checklist, applied against the actual environment, immediately identified that gateway tokens appeared in plaintext in committed documentation (TOOLS.md)—a finding that became the highest-severity gap in the entire curriculum.
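The token inventory and 90-day cadence described above lend themselves to a small automated check. This sketch assumes the inventory is kept as structured records; the `Token` fields, the `due_for_rotation` helper, and the 14-day warning window are illustrative assumptions, not the format of the actual rotation plan.

```python
from dataclasses import dataclass
from datetime import date, timedelta

ROTATION_DAYS = 90  # cadence stated in the text


@dataclass
class Token:
    name: str           # referenced by name only, never by value
    owner: str
    purpose: str
    last_rotated: date


def due_for_rotation(tokens, today, warn_days=14):
    """Return (overdue, due_soon) token names for the 90-day cadence.

    warn_days is an assumed early-warning window, not a value from the source.
    """
    overdue, due_soon = [], []
    for t in tokens:
        expires = t.last_rotated + timedelta(days=ROTATION_DAYS)
        if today >= expires:
            overdue.append(t.name)
        elif today >= expires - timedelta(days=warn_days):
            due_soon.append(t.name)
    return overdue, due_soon
```

Run from a daily cron job, a check like this turns the rotation cadence from something the operator has to remember into something that announces itself.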
2.2 Network Architecture and Exposure Control (Unit 2)
Design principle: Default deny between all zones. Every permitted flow is explicitly justified and logged.
The segmentation plan defines five zones:
| Zone | VLAN | Trust Level | Key Policy |
|---|---|---|---|
| ADMIN | 10 | Highest | Management access to INFRA only |
| INFRA | 20 | High | Hosts all core services; no direct internet exposure |
| IOT | 30 | Low/Untrusted | Isolated; vendor cloud outbound only |
| GUEST | 40 | Untrusted | Internet only; no lateral access |
| DMZ | 50 | Exposed/Monitored | Reverse proxy ingress; constrained backend access |
The firewall ruleset enforces 19 explicit rules with default deny. Critical constraints:
- IoT and guest zones cannot initiate connections to infrastructure or admin zones.
- DMZ reverse proxy can reach only specific backend service ports—no broad subnet access.
- All denied inter-zone flows are logged for anomaly detection.
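The default-deny model can be expressed as a lookup against an explicit allow list. The sketch below is a toy model for reasoning about the policy, not the firewall configuration itself: it encodes only an illustrative subset of flows, and every port number is a hypothetical placeholder because the 19 actual rules are not reproduced here.

```python
# Illustrative subset only; ports are hypothetical placeholders.
ALLOW_RULES = {
    ("ADMIN", "INFRA", 22),      # management access to INFRA
    ("DMZ", "INFRA", 8080),      # reverse proxy to one backend port (assumed)
    ("IOT", "INTERNET", 443),    # vendor cloud outbound only
    ("GUEST", "INTERNET", 443),  # internet only
}


def evaluate(src_zone, dst_zone, dst_port, deny_log):
    """Default deny: a flow passes only if an explicit rule names it."""
    allowed = (src_zone, dst_zone, dst_port) in ALLOW_RULES
    if not allowed:
        # Denied flows are recorded so they can feed anomaly detection.
        deny_log.append((src_zone, dst_zone, dst_port))
    return allowed
```

Because the allow list is a set of exact tuples, the DMZ entry permits a single backend port rather than the whole INFRA subnet, which mirrors the "no broad subnet access" constraint.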
The external exposure register documents every service's intended exposure status. The key finding: every internal service (OpenClaw, COSMO IDE, SearxNG, SSH) should have zero direct internet exposure. The only legitimate external entry point is a reverse-proxied webhook endpoint with bearer token authentication and TLS 1.3.
Measured effect: The exposure audit revealed that Docker's default 0.0.0.0 port binding, combined with router UPnP, could silently expose internal services to the internet. This was validated in the tabletop exercise (Scenario 3), where SearxNG was discoverable on Shodan for three weeks without detection.
2.3 Host Hardening and Runtime Defense (Unit 3)
Design principle: Minimize exploitability, persistence opportunity, and blast radius on every host.
The hardening baseline covers eight control categories applied across three host roles (gateway, Pi automation node, NAS/storage):
- Patch hygiene: Unattended security updates with weekly controlled reboot windows.
- SSH hardening: Deliberately repeats the Unit 1 SSH controls at the host layer as defense in depth.
- Privilege restriction: No blanket NOPASSWD sudo; role-split operators.
- Host firewall: Default-deny inbound per host, allowing only role-specific ports.
- Kernel hardening: Sysctl tuning (kptr_restrict, dmesg_restrict, rp_filter, syncookies, source route/redirect rejection).
- Filesystem friction: noexec/nodev/nosuid on temp mounts; file integrity monitoring for `/etc`, service binaries, startup scripts.
- Service confinement: Systemd sandboxing (NoNewPrivileges, PrivateTmp, ProtectSystem=strict, ProtectHome, minimal CapabilityBoundingSet).
- Tamper detection: Alerts on new privileged users, SSH config changes, unexpected listeners, persistence additions.
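The kernel hardening item can be verified mechanically by comparing live sysctl values against a baseline. In this sketch the sysctl keys correspond to the settings named in the list, but the target values are assumptions, since the baseline document's exact values are not quoted here. The `parse_sysctl` and `drift` helpers are hypothetical names.

```python
# Assumed target values; the text names the settings but not their values.
SYSCTL_BASELINE = {
    "kernel.kptr_restrict": "2",
    "kernel.dmesg_restrict": "1",
    "net.ipv4.conf.all.rp_filter": "1",
    "net.ipv4.tcp_syncookies": "1",
    "net.ipv4.conf.all.accept_source_route": "0",
    "net.ipv4.conf.all.accept_redirects": "0",
}


def parse_sysctl(output: str) -> dict:
    """Parse `sysctl -a` style lines of the form 'key = value'."""
    values = {}
    for line in output.splitlines():
        key, sep, value = line.partition("=")
        if sep:
            values[key.strip()] = value.strip()
    return values


def drift(actual: dict, baseline: dict = SYSCTL_BASELINE) -> dict:
    """Return {key: (expected, actual_or_None)} for every mismatch."""
    return {
        key: (want, actual.get(key))
        for key, want in baseline.items()
        if actual.get(key) != want
    }
```

An empty result from `drift` means the host matches the baseline. A non-empty one lists exactly which settings have regressed or were never applied, which is the kind of low-noise output a solo operator can act on.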
The runtime policy matrix extends these controls to containers: rootless mode, user namespace remapping, seccomp profiles, AppArmor enforcement, capability dropping, read-only root filesystems, and prohibition of privileged containers.
Measured effect: The backup restore test validated that recovery from backup is real and repeatable, achieving a 38-minute RTO and a 60-minute RPO. However, it also exposed two gaps: the restore playbook lacked permission-correction steps, and a 60-minute RPO is too coarse for high-frequency automation state changes.
2.4 Detection Engineering and Telemetry (Unit 4)
Design principle: Collect high-signal telemetry from identity, network, and host layers; detect meaningful abuse patterns; minimize false-positive noise.
The logging architecture collects from four source categories:
- Identity/Auth: SSH, sudo, PAM, VPN, SSO/MFA events
- Network/Edge: Firewall allow/deny, reverse proxy access logs, DNS queries
- Host/Runtime: Process execution, privilege transitions, container lifecycle, file integrity
- Application: API auth failures, admin panel access, background job anomalies
Ten detection rules cover the critical abuse patterns:
| Rule | Domain | Severity |
|---|---|---|
| Brute force burst (>12 fails/10min) | Identity | High |
| Password spray (≥8 users/15min) | Identity | High |
| New privilege escalation path | Identity/Host | Medium-High |
| Unexpected admin port exposure | Network | High |
| Reverse proxy 401/403 spike | Network/App | Medium |
| Rare geo source + auth success | Network/Identity | Critical |
| Persistence artifact creation | Host | High |
| Security control disablement | Host | Critical |
| Sensitive file access burst | Host | High |
| Multi-stage intrusion correlation | Cross-domain | Critical |
The alert tuning journal documents three iterations of threshold refinement:
- NET-02 (proxy error spike) false positives reduced ~55% by adding user-agent exclusions and source diversity requirements.
- HOST-01 (persistence artifacts) noise reduced by introducing maintenance-window tags and package-manager attribution.
- ID-01 (brute force) supplemented with a companion long-window detector (25 failures/6h) to catch low-and-slow attempts.
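The burst rule and its long-window companion are both sliding-window counters with different parameters. The sketch below shows one way to implement that logic. The `FailureWindow` class is a hypothetical name, a real deployment would keep one instance per source address or account, and reading "25 failures/6h" as "fires on the 25th failure" is an assumption.

```python
from collections import deque
from datetime import datetime, timedelta


class FailureWindow:
    """Fires when more than `threshold` failures fall inside `window`."""

    def __init__(self, threshold: int, window: timedelta):
        self.threshold = threshold
        self.window = window
        self.events = deque()

    def add(self, ts: datetime) -> bool:
        """Record one auth failure; return True if the rule should alert."""
        self.events.append(ts)
        # Drop failures that have aged out of the window.
        while self.events and ts - self.events[0] > self.window:
            self.events.popleft()
        return len(self.events) > self.threshold


# Thresholds from the text: >12 failures/10 min and 25 failures/6 h.
burst = FailureWindow(12, timedelta(minutes=10))
slow = FailureWindow(24, timedelta(hours=6))  # assumed: fires at the 25th failure
```

An attacker pacing one attempt every 14 minutes never trips the burst window but still reaches the long-window threshold, which is the gap the companion detector is meant to close.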
Measured effect: The tuning process demonstrated that initial detection rules in a home lab context produce unacceptable noise without iterative refinement. The target precision metric (true_positive / total_alerts > 0.35) acknowledges that home lab detection will never achieve enterprise-grade signal-to-noise, but it must be good enough that the solo operator doesn't learn to ignore alerts.
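The precision target is simple arithmetic, and writing it down makes the threshold unambiguous. This is a minimal sketch; the function names are illustrative, and returning `None` when no alerts fired is an assumed convention.

```python
PRECISION_TARGET = 0.35  # target stated in the text


def alert_precision(true_positives: int, total_alerts: int):
    """true_positive / total_alerts; None when no alerts fired."""
    if total_alerts == 0:
        return None
    return true_positives / total_alerts


def meets_target(true_positives: int, total_alerts: int,
                 target: float = PRECISION_TARGET) -> bool:
    """True only when precision is strictly above the target."""
    precision = alert_precision(true_positives, total_alerts)
    return precision is not None and precision > target
```

For instance, 5 true positives out of 100 alerts gives a precision of 0.05 and fails the target, while 4 out of 10 gives 0.4 and passes.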
2.5 Incident Response and Recovery (Unit 5)
Design principle: Compromises become contained events, not disasters. Response procedures must be executable by a single operator under stress.
Three playbooks address the highest-probability incident types:
- Credential/Secret Leak: Structured response from triage (which secret? was it used? blast radius?) through rotation, scrubbing, and post-incident controls. The key innovation is the affected secrets inventory table mapping each secret to its exact rotation method.
- Ransomware/Destructive Malware: Prioritizes killing Syncthing propagation before any other action. This is the single most time-critical step, given that file sync can spread corruption to the peer node in seconds. Includes decision tree for power-off vs. preserve-forensic-state.
- Exposed Service/Unauthorized External Access: Provides escalation path from exposure discovery through containment, with explicit decision point: if the exposed service was accessed by unknowns, escalate to the credential leak playbook.
Each playbook includes copy-pasteable emergency commands—critical for a solo operator who may be executing response procedures at 3 AM with degraded cognitive function.
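The escalation decision in the exposed-service playbook ("was the service accessed by unknowns?") can be answered from access-log source addresses. The sketch below assumes the operator can list trusted networks as CIDR blocks. The function names, the playbook labels, and the example ranges are hypothetical.

```python
import ipaddress


def unknown_sources(source_ips, trusted_cidrs):
    """Return the sorted source addresses that fall outside every trusted network."""
    networks = [ipaddress.ip_network(cidr) for cidr in trusted_cidrs]
    unknown = set()
    for raw in source_ips:
        addr = ipaddress.ip_address(raw)
        if not any(addr in net for net in networks):
            unknown.add(raw)
    return sorted(unknown)


def next_playbook(source_ips, trusted_cidrs):
    """Escalate to the credential leak playbook if any unknown party connected."""
    if unknown_sources(source_ips, trusted_cidrs):
        return "credential-leak"
    return "exposed-service"
```

For example, `next_playbook(["192.168.1.10", "203.0.113.7"], ["192.168.1.0/24"])` returns `"credential-leak"` because one source lies outside the trusted range. Reducing the decision to a single function call matters when the operator is working through the playbook under stress.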
3. Hardening Interventions and Measured Effects
3.1 Intervention: Token Exposure Elimination
Finding: Gateway tokens (7eed003..., 1e4e91f...) and webhook tokens (9c8844...) appeared in plaintext in TOOLS.md, which is version-controlled and potentially synced across systems.
Action: Move secrets to .env files excluded from version control. Reference by name only in documentation.
Effect: Eliminates the highest-severity attack path (token leak via committed files). Reduces the credential leak playbook's "how did it leak?" investigation to external vectors only, since the internal exposure vector is closed.
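A scan for token-shaped strings is what keeps this class of finding from recurring. The sketch below is a simplified stand-in for a dedicated secret scanner. It assumes tokens are long hexadecimal strings or values assigned to secret-sounding names; both patterns and the `find_secrets` helper are illustrative assumptions.

```python
import re

# Assumptions: tokens are hex strings of 32+ characters, or values assigned
# to names containing "token", "secret", or "api_key".
HEX_TOKEN = re.compile(r"\b[0-9a-fA-F]{32,}\b")
NAMED_SECRET = re.compile(r"(?i)\b\w*(token|secret|api_key)\w*\s*[:=]\s*\S{8,}")


def find_secrets(text: str):
    """Return (line_number, reason) for each line that looks like it holds a secret."""
    hits = []
    for number, line in enumerate(text.splitlines(), start=1):
        if HEX_TOKEN.search(line):
            hits.append((number, "hex-token"))
        elif NAMED_SECRET.search(line):
            hits.append((number, "named-secret"))
    return hits
```

A line such as `Gateway token: see GATEWAY_TOKEN in .env` passes the scan, because it references the secret by name only, while a line carrying the value itself is flagged. Wired into a pre-commit hook, a non-empty result would block the commit.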
3.2 Intervention: Docker Bind Address Hardening
Finding: Docker's default 0.0.0.0 binding, combined with router UPnP, silently exposed SearxNG to the internet for three weeks (tabletop Scenario 3).
Action: Rebuild all Docker containers with 127.0.0.1: prefix on port mappings. Disable UPnP on router.
Effect: Eliminates an entire class of accidental exposure. Even if UPnP is re-enabled or port forwards are misconfigured, services bound to localhost cannot be reached externally. This is defense in depth at the network binding layer.
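Whether a published port is loopback-only can be read directly from the mapping string. The sketch below checks mappings in Docker's `[ip:]host_port:container_port[/proto]` form. The helper names are hypothetical, and the parsing covers only the common forms shown in the docstring.

```python
def is_loopback_only(mapping: str) -> bool:
    """True if a port mapping such as '127.0.0.1:8080:8080' binds to loopback.

    Mappings without a host IP ('8080:8080' or '8080') are published on all
    interfaces by default, so they are reported as not loopback-only.
    """
    spec = mapping.split("/")[0]   # drop an optional '/tcp' or '/udp'
    parts = spec.rsplit(":", 2)    # [ip, host_port, container_port]
    if len(parts) < 3:
        return False
    host_ip = parts[0].strip("[]")  # IPv6 addresses are written in brackets
    return host_ip == "::1" or host_ip.startswith("127.")


def exposed_mappings(mappings):
    """Return the mappings that are reachable from outside the host."""
    return [m for m in mappings if not is_loopback_only(m)]
```

For example, `exposed_mappings(["127.0.0.1:3000:3000", "8888:8080"])` returns `["8888:8080"]`. Running a check like this over every compose file before deployment would catch a missing `127.0.0.1:` prefix without relying on the router's state.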
3.3 Intervention: Backup Architecture and Golden Image
Finding: Full Pi rebuild RTO was 3-6 hours with unbounded RPO. COSMO IDE local state had no backup at all.
Action: (a) Nightly rsync of /home/operator/ to USB/Mac. (b) Quarterly golden SD card image creation. (c) Automated post-restore permission and migration scripts.
Effect: Full rebuild RTO drops from 3-6 hours to ~40 minutes with golden image. RPO drops from unbounded to 24 hours (nightly backup) with a path to 15 minutes for critical state (increased snapshot cadence).
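An RPO target only holds if the newest backup is actually recent enough. The sketch below compares backup age against a target; the `rpo_status` helper and its return shape are illustrative assumptions.

```python
from datetime import datetime, timedelta


def rpo_status(last_backup: datetime, now: datetime, rpo: timedelta) -> dict:
    """Compare the age of the newest backup against an RPO target."""
    age = now - last_backup
    return {
        "age_minutes": int(age.total_seconds() // 60),
        "within_rpo": age <= rpo,
    }
```

With a nightly backup taken at 02:00 and a check at 14:30, the age is 750 minutes. That satisfies a 24-hour RPO but not a 15-minute one, so the tighter target for critical state requires a faster snapshot cadence, not just a stricter check.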
4. The Security-Usability Tension
The central challenge of home lab security is that every control has a maintenance cost, and the solo operator is the single point of failure for all security operations. Enterprise security architectures assume dedicated teams, budget, and tooling. Home lab security must achieve meaningful protection within constraints that enterprise security never faces.
4.1 The Operator Bottleneck
There is one person: the-operator. This person is simultaneously the CISO, SOC analyst, system administrator, developer, and end user. Every security control competes with every other responsibility for attention. The consequence: controls that require ongoing human attention will eventually be neglected.
This drives a design preference for:
- Automated controls over procedural ones. Unattended security updates over manual patch review. Pre-commit hooks over "remember to check for secrets."
- Fail-closed defaults over fail-open monitoring. Network segmentation that blocks unauthorized flows by default is more reliable than detection rules that require someone to notice and act.
- Low-noise alerting over comprehensive alerting. Ten alerts per day that all require action beat a hundred alerts where ninety-five are false positives. The alert tuning journal's 0.35 precision target reflects this reality.
4.2 Availability vs. Security Friction
Always-on infrastructure means that security controls cannot require frequent manual intervention to maintain service availability. The 90-day token rotation cadence is a compromise: shorter rotation would be more secure, but the manual effort of rotating tokens across two nodes, updating sibling communication commands, and verifying service health creates enough friction that shorter cycles risk being skipped.
Similarly, the decision to accept 60-minute RPO (improving to 15-minute target) rather than pursuing continuous replication reflects the reality that continuous backup infrastructure for a Raspberry Pi introduces complexity, power consumption, and failure modes that may exceed the risk it mitigates.
4.3 The "Good Enough" Standard
The rubric asks for quantified tradeoffs. Here is an honest accounting:
| Decision | Security Cost | Usability/Ops Benefit | Accepted Risk |
|---|---|---|---|
| 90-day token rotation (not 30-day) | Wider exposure window | Manageable maintenance cadence | Medium |
| No full packet capture | Limited forensic depth | No storage/compute overhead | Low (home lab threat model) |
| Single SSH key per device (not hardware-backed) | Key extractable from disk | No hardware token procurement/management | Medium |
| Host firewall + segmentation (not IDS/IPS) | No deep packet inspection | No inline appliance to maintain | Low |
| Monthly access review (not continuous) | Privilege creep between reviews | Sustainable for solo operator | Low-Medium |
Each tradeoff is defensible given the threat model: the primary attackers are opportunistic scanners and credential thieves, not nation-state adversaries with unlimited patience.
5. Lessons Learned
5.1 Secrets in Documentation Are the #1 Risk
The single most impactful finding across all six units: plaintext secrets in committed documentation files. This is not a sophisticated attack—it's the security equivalent of leaving the house key under the doormat. The fix (.env files + .gitignore) is trivial, but the finding required systematic review to surface. Most home lab operators never perform that review.
5.2 Flat Networks Are Invisible Risks
Until the segmentation analysis in Unit 2, the flat LAN was an accepted default. The exposure register and tabletop exercises revealed that flat networking creates unbounded blast radius: any compromised device on the LAN can reach any service. The VLAN segmentation plan, even partially implemented, fundamentally changes the risk calculus by containing lateral movement.
5.3 Backups Aren't Real Until Tested
The Unit 3 backup restore test is the most operationally valuable artifact in the entire curriculum. It proved that restore works—but also exposed permission errors, migration gaps, and observability holes that would have caused confusion during a real incident. The gap between "I have backups" and "I can restore from backups in 38 minutes" is enormous.
5.4 Detection Without Response Is Theater
The detection rules in Unit 4 are worthless without the incident playbooks in Unit 5. An alert that fires at 3 AM and wakes a solo operator who doesn't know what to do is worse than no alert—it creates stress without enabling action. The playbooks, with their copy-pasteable commands and explicit decision trees, transform detection from security theater into operational capability.
5.5 Tabletop Exercises Expose Real Gaps
The three tabletop scenarios (token leak, ransomware, exposed service) collectively identified 15 specific gaps, including critical ones that no amount of architectural planning would have surfaced. The gap discovery rate validates tabletop exercises as the highest-ROI security activity for a solo operator.
6. Twelve-Month Hardening Roadmap
Month 1: Critical Foundations
- [ ] Move all secrets to `.env` files; add `.gitignore` rules; install pre-commit secret scanning
- [ ] Disable UPnP on router
- [ ] Rebuild Docker containers with `127.0.0.1` port bindings
- [ ] Create golden SD card image
- [ ] Implement nightly Pi backup cron (rsync to USB/Mac)
Month 2: Identity Hardening
- [ ] Complete first formal token rotation cycle (all gateway/webhook/API tokens)
- [ ] Document exact config paths for every secret
- [ ] Enable Syncthing staggered file versioning
- [ ] Add `pm2 save` to nightly cron
Month 3: Network Controls
- [ ] Implement VLAN segmentation (ADMIN/INFRA/IOT/GUEST minimum)
- [ ] Deploy host firewalls with default-deny on both nodes
- [ ] Set up weekly external port scan (cron or external service)
- [ ] Add non-LAN gateway access detection rule
Month 4: Host Hardening Sprint
- [ ] Apply kernel sysctl hardening to both nodes
- [ ] Deploy systemd sandboxing for all custom services
- [ ] Implement file integrity monitoring for critical paths
- [ ] Remove unused packages and services
Month 5: Detection Baseline
- [ ] Deploy core detection rules (ID-01, HOST-01, HOST-02, NET-01 minimum)
- [ ] Establish one-week noise baseline
- [ ] Configure alert routing with redundant notification channel
- [ ] Begin alert tuning journal
Month 6: Mid-Year Validation
- [ ] Run full backup restore drill; measure RTO/RPO against targets
- [ ] Execute tabletop scenario (new scenario, not repeat)
- [ ] First formal access review using monthly checklist
- [ ] Refresh golden SD card image
- [ ] Review and update threat model for environmental changes
Months 7-9: Continuous Improvement
- [ ] Second token rotation cycle
- [ ] Tune detection rules based on accumulated data
- [ ] Implement offline npm cache for COSMO IDE recovery
- [ ] Add container image vulnerability scanning to deployment workflow
- [ ] Pin npm dependencies; add `npm audit` gate
Month 10: Incident Readiness
- [ ] Run credential compromise tabletop with timed response
- [ ] Run ransomware propagation tabletop with Syncthing kill drill
- [ ] Validate all emergency commands in playbooks still work
- [ ] Update playbooks with any environmental changes
Month 11: Architecture Review
- [ ] Review control coverage matrix against threat model
- [ ] Identify any new services/exposure since initial assessment
- [ ] Evaluate hardware-backed SSH key adoption
- [ ] Assess whether detection precision target (>0.35) is being met
Month 12: Annual Reset
- [ ] Third token rotation cycle
- [ ] Refresh golden SD card image
- [ ] Full backup restore drill
- [ ] Annual access review (comprehensive, not monthly scope)
- [ ] Update this dissertation with findings and revised risk posture
- [ ] Set objectives for Year 2
7. Conclusion
Security engineering for always-on home infrastructure is not a scaled-down version of enterprise security. It is a distinct discipline that requires ruthless prioritization, honest risk acceptance, and controls designed for solo-operator sustainability.
This curriculum demonstrated that systematic threat modeling surfaces risks that intuition misses (plaintext tokens in docs, UPnP exposure), that layered controls provide meaningful defense even with modest implementation effort, and that operational validation (restore tests, tabletop exercises) is the only reliable way to distinguish real security from security theater.
The environment remains imperfect. Hardware-backed SSH keys are not yet deployed. Full VLAN segmentation requires router/switch upgrades. Detection coverage has gaps in device fingerprinting and deep network forensics. These are accepted risks with documented rationale, not ignored risks.
The twelve-month roadmap provides a sustainable path from current state to a mature security posture. Its cadence is designed for one person with competing priorities—monthly access reviews, quarterly image refreshes, semi-annual tabletop exercises. Nothing in the plan requires heroic effort or unsustainable discipline. That is the point: the best security architecture is the one that actually gets maintained.
Appendix A: Artifact Index
| Unit | Artifact | Purpose |
|---|---|---|
| 0 | `u0_asset_inventory.md` | System map and data classification |
| 0 | `u0_threat_model.md` | STRIDE analysis and attacker profiles |
| 0 | `u0_attack_surface_matrix.csv` | Risk-scored entry points |
| 1 | `u1_iam_policy_baseline.md` | Identity and access controls |
| 1 | `u1_secret_rotation_plan.md` | Credential lifecycle management |
| 1 | `u1_access_review_checklist.md` | Monthly review procedure |
| 2 | `u2_network_segmentation_plan.md` | VLAN architecture |
| 2 | `u2_firewall_ruleset.md` | Default-deny rule set |
| 2 | `u2_external_exposure_register.csv` | Service exposure inventory |
| 3 | `u3_host_hardening_baseline.md` | OS and service hardening |
| 3 | `u3_runtime_policy_matrix.csv` | Container security controls |
| 3 | `u3_backup_restore_test_report.md` | Recovery validation |
| 4 | `u4_logging_architecture.md` | Telemetry collection design |
| 4 | `u4_detection_rules.md` | Abuse detection logic |
| 4 | `u4_alert_tuning_journal.md` | False-positive reduction |
| 5 | `u5_incident_playbooks.md` | Response procedures |
| 5 | `u5_tabletop_results.md` | Simulation findings |
| 5 | `u5_recovery_rto_rpo_assessment.md` | Recovery objectives |
Appendix B: Self-Assessment Against Rubric
| Criterion | Points Available | Self-Score | Justification |
|---|---|---|---|
| Threat Model Quality and Scope Discipline | 15 | 14 | STRIDE + 4 attacker profiles + 5 ranked abuse paths. Scope tightly bounded to actual infrastructure. Minor gap: no formal attack tree diagram. |
| Control Architecture Depth and Correctness | 20 | 18 | Five-layer defense (identity, network, host, detection, response) with specific controls per layer. All controls are implementable and appropriate for threat model. Minor gap: some controls (VLAN segmentation) are designed but not yet deployed. |
| Implementation Evidence and Validation | 20 | 17 | Backup restore test with measured RTO/RPO. Three tabletop exercises with timed walkthroughs. Alert tuning with quantified noise reduction. Gap: no production deployment evidence for network segmentation or detection rules. |
| Detection/IR/Recovery Operational Maturity | 20 | 18 | Ten detection rules with tuning journal. Three incident playbooks with emergency commands. RTO/RPO assessment for every service. Gap: no live incident data; all validation is simulated. |
| Risk Tradeoff Analysis and Decision Utility | 15 | 14 | Explicit tradeoff table with accepted risks. Security-usability tension addressed directly. 12-month roadmap with sustainable cadence. Minor gap: could further quantify cost-benefit of each control. |
| Clarity, Structure, and Executive Synthesis | 10 | 9 | Structured progression from environment through controls to roadmap. Tables and matrices for quick reference. Lessons learned section synthesizes cross-cutting themes. |
| Total | 100 | 90 | Passing (≥85). Distinction threshold (93) not met due to implementation evidence gaps—controls designed but not all deployed in production. |