Dissertation: Architecture for a Resilient Two-Node Agent Network
Topic #15: Systems Design for Resilient Distributed Agents
Date: 2026-02-20
Synthesizes: All 8 units
Abstract
This document presents a concrete, implementable architecture for a resilient two-node AI agent network: specifically, the COZ (Mac) and Axiom (Raspberry Pi) deployment running OpenClaw. Drawing on distributed systems theory (consensus, fault taxonomy, event sourcing, capability security), it specifies failure scenarios, recovery strategies, observability patterns, and degradation policies. The goal: an agent network that survives partial failures, self-heals where possible, degrades gracefully where not, and remains observable throughout.
1. System Topology
┌───────────────────────────────────────────────────────────────┐
│                      HOME NETWORK (LAN)                       │
│                                                               │
│  ┌──────────────────────┐      ┌──────────────────────┐       │
│  │      COZ (Mac)       │      │      Axiom (Pi)      │       │
│  │                      │      │                      │       │
│  │  ┌────────────────┐  │      │  ┌────────────────┐  │       │
│  │  │   Main Agent   │  │      │  │   Main Agent   │  │       │
│  │  │ - Orchestration│  │      │  │ - Orchestration│  │       │
│  │  │ - Browser      │  │      │  │ - 24/7 cron    │  │       │
│  │  │ - Desktop      │  │      │  │ - Headless ops │  │       │
│  │  └───────┬────────┘  │      │  └───────┬────────┘  │       │
│  │          │           │      │          │           │       │
│  │  ┌───────┴────────┐  │      │  ┌───────┴────────┐  │       │
│  │  │   Sub-agents   │  │      │  │   Sub-agents   │  │       │
│  │  │  (sandboxed)   │  │      │  │  (sandboxed)   │  │       │
│  │  └────────────────┘  │      │  └────────────────┘  │       │
│  │                      │      │                      │       │
│  │  Services:           │      │  Services:           │       │
│  │  - OpenClaw GW :18789│      │  - OpenClaw GW :18789│       │
│  │  - COSMO IDE :4405   │      │  - COSMO IDE :4405   │       │
│  │  - SearxNG :8888     │      │  - SearxNG :8888     │       │
│  │                      │      │  - Clawdboard :3300  │       │
│  └──────────┬───────────┘      └──────────┬───────────┘       │
│             │                             │                   │
│             └──────────────┬──────────────┘                   │
│                            │                                  │
│                 ┌──────────┴──────────┐                       │
│                 │    Communication    │                       │
│                 │ - Webhooks (HTTP)   │                       │
│                 │ - SSH (file ops)    │                       │
│                 │ - Git (state sync)  │                       │
│                 └─────────────────────┘                       │
└───────────────────────────────────────────────────────────────┘
Design principles
- No single point of failure: Either agent can operate independently when the other is down.
- Shared-nothing: Each agent owns its local state. Coordination is via messages, not shared storage.
- Eventual consistency: Agents synchronize state asynchronously; temporary divergence is acceptable.
- Human as ultimate arbiter: the-operator can intervene at any point; the system never locks humans out.
2. Failure Scenarios & Recovery Strategies
Drawing from Unit 1 (fault taxonomy) and Unit 5 (state management):
Scenario Matrix
| # | Failure | Type (Unit 1) | Detection | Automated Recovery | Manual Escalation |
|---|---|---|---|---|---|
| F1 | Axiom power loss | Crash-stop | COZ webhook fails | COZ continues independently; Axiom recovers from STATE.json + artifacts on reboot | If >24h, the-operator checks Pi power |
| F2 | COZ sleep/shutdown | Crash-stop | Axiom webhook fails | Axiom continues independently; queues messages for COZ | Normal: Mac sleeps at night |
| F3 | Network partition (LAN) | Omission | Bidirectional webhook timeout | Both operate independently; reconcile on reconnection | If >1h during work hours, check router |
| F4 | STATE.json corruption | Byzantine (data) | Schema validation on read | Rebuild from artifacts directory (Unit 5 playbook) | If rebuild fails, alert the-operator |
| F5 | Sub-agent runaway | Performance | Timeout + heartbeat check | Kill sub-agent, re-queue task | If repeated (3x), disable feature |
| F6 | Disk full on Pi | Resource exhaustion | Disk check in heartbeat | Auto-cleanup: compress logs, trash old artifacts | If <5% free after cleanup, alert |
| F7 | Memory compaction data loss | Crash (timing) | Gap detection in daily files | Accept loss, note gap, continue | If critical info lost, check git |
| F8 | Webhook auth token leaked | Security | Anomalous requests (hard to detect) | Rotate token immediately | Full audit of actions during exposure window |
Recovery Priority Order
When multiple failures occur simultaneously:
1. Preserve human-facing responsiveness (F2 exempted: sleep is normal)
2. Protect state integrity (F4 first: without state, agents can't coordinate)
3. Restore communication (F3: needed for coordination)
4. Resume work (F1, F5, F6: can wait)
3. Consensus & Coordination Model
Drawing from Unit 2 (consensus) and Unit 3 (messaging):
Why NOT consensus
Classical consensus (Raft, Paxos) requires a quorum. With 2 nodes, quorum = 2 = all nodes. Any single failure blocks consensus. Consensus is the wrong model for a two-node system.
The actual model: Single-leader with failover
Normal: Axiom = primary for cron/scheduled work
COZ = primary for interactive/desktop work
Degraded: Surviving node takes on all roles
Each agent is the single leader for its domain. No contention, no split-brain for non-overlapping concerns.
For overlapping concerns (shared state):
Last-writer-wins (LWW) with logical timestamps:
- Each agent tags state updates with [agent_name, sequence_number]
- On reconnection, compare sequence numbers
- Higher sequence wins; ties broken by agent priority (configurable)
This is a CRDT (Conflict-free Replicated Data Type), specifically an LWW-Register. From Unit 2: use CRDTs when availability > consistency, which is exactly the case here.
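The LWW-Register logic can be sketched in a few lines. This is an illustrative model, not the actual OpenClaw implementation: the `AGENT_PRIORITY` table and field names are assumptions standing in for the configurable agent priority mentioned above.

```python
from dataclasses import dataclass

# Hypothetical tie-break priorities (higher wins); "configurable" per the design.
AGENT_PRIORITY = {"axiom": 2, "coz": 1}

@dataclass
class LWWRegister:
    """Last-writer-wins register tagged with [agent_name, sequence_number]."""
    value: object = None
    agent: str = ""
    seq: int = 0

    def set(self, value, agent, seq):
        """Apply a local or remote write; keep it only if it is newer.
        Higher sequence wins; ties are broken by agent priority."""
        if (seq, AGENT_PRIORITY.get(agent, 0)) > (self.seq, AGENT_PRIORITY.get(self.agent, 0)):
            self.value, self.agent, self.seq = value, agent, seq

    def merge(self, other: "LWWRegister"):
        """Reconciliation on reconnection. Merge is commutative and
        idempotent, so both agents converge regardless of message order."""
        self.set(other.value, other.agent, other.seq)
```

Because `merge` never loses the winning tag, both nodes can apply each other's updates in any order after a partition and still converge on the same value.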
Messaging semantics
Per Unit 3:
- Webhook calls: At-most-once delivery (fire and forget with timeout)
- Idempotent handlers: All webhook endpoints are idempotent: receiving the same message twice produces the same result
- Sequence numbers: Monotonic per-channel counter detects gaps (lost messages) and duplicates
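A minimal receiver-side sketch of that gap/duplicate detection, assuming each payload carries a per-channel `seq` integer starting at 1 (names here are illustrative):

```python
class ChannelReceiver:
    """Tracks a monotonic per-channel sequence counter to detect
    duplicates (seq already seen) and gaps (lost messages in between)."""

    def __init__(self):
        self.last_seen = {}  # channel name -> highest seq processed

    def classify(self, channel: str, seq: int) -> str:
        last = self.last_seen.get(channel, 0)
        if seq <= last:
            return "duplicate"  # safe to drop: handlers are idempotent anyway
        status = "ok" if seq == last + 1 else "gap"  # gap => messages were lost
        self.last_seen[channel] = seq
        return status
```

On a "gap" the receiver can log the missing range and, if the channel matters, request a re-sync; with at-most-once delivery the sender never retries on its own.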
4. Supervision Architecture
Drawing from Unit 4 (supervision & self-healing):
┌─────────────────────────────────────┐
│           ROOT SUPERVISOR           │
│     (OpenClaw Gateway Process)      │
│        Restart: systemd/PM2         │
└──────────────────┬──────────────────┘
                   │
        ┌──────────┴──────────┐
        │                     │
┌───────┴───────┐     ┌───────┴───────┐
│  Main Agent   │     │ Cron Scheduler│
│  Strategy:    │     │  Strategy:    │
│   restart     │     │   restart     │
└───────┬───────┘     └───────────────┘
        │
   ┌────┴─────────────┐
   │                  │
┌──┴────────────┐  ┌──┴────────────┐
│  Sub-agents   │  │  Heartbeat    │
│  Strategy:    │  │  Monitor      │
│  one-for-one  │  │  Strategy:    │
│  max 3 retries│  │   restart     │
└───────────────┘  └───────────────┘
Supervision strategies:
- Main agent: Always restart (it's the brain)
- Sub-agents: One-for-one restart, max 3 attempts, then escalate to main agent
- Cron jobs: Restart on failure, skip missed cycles (don't stack up)
- External services (SearxNG, COSMO IDE): Circuit breaker: 3 failures → open → 5min cooldown → half-open retry
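The circuit breaker for external services can be sketched as a small state machine. This is a minimal illustration of the 3-failures/5-minute-cooldown/half-open policy above; the injectable `clock` is only there to make it testable.

```python
import time

class CircuitBreaker:
    """Closed -> open after `threshold` consecutive failures;
    after `cooldown` seconds, half-open: one probe request allowed."""

    def __init__(self, threshold=3, cooldown=300.0, clock=time.monotonic):
        self.threshold, self.cooldown, self.clock = threshold, cooldown, clock
        self.failures = 0
        self.opened_at = None  # None means closed

    def allow(self) -> bool:
        if self.opened_at is None:
            return True                                   # closed: all traffic
        if self.clock() - self.opened_at >= self.cooldown:
            return True                                   # half-open: probe allowed
        return False                                      # open: fail fast

    def record(self, success: bool):
        if success:
            self.failures, self.opened_at = 0, None       # close on any success
        else:
            self.failures += 1
            if self.failures >= self.threshold:
                self.opened_at = self.clock()             # (re)open, restart cooldown
```

Callers wrap each external request as `if cb.allow(): ... cb.record(ok)`; a failed half-open probe re-opens the breaker and restarts the cooldown.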
Health check hierarchy
L1: Process alive? → PM2/systemd checks (every 30s)
L2: Responding? → Gateway health endpoint (every 60s)
L3: Making progress? → Heartbeat checks PROGRESS.md timestamps (every 30min)
L4: Producing quality? → Artifact validation on completion
5. State Management Design
Drawing from Unit 5 (state management & recovery):
Event sourcing (already in place)
Events (ground truth): memory/YYYY-MM-DD.md (append-only daily logs)
Projection (derived): STATE.json (current state cache)
Snapshot (compaction): MEMORY.md (compressed knowledge)
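The events-to-projection relationship can be sketched as a replay function. The real daily files are markdown, so the JSON-lines event format and field names here are purely illustrative assumptions:

```python
import json

def rebuild_state(event_lines):
    """Replay append-only events (ground truth) into a STATE.json-style
    projection. Because state is derived, it can always be rebuilt from
    events -- this is what makes F4 (STATE corruption) recoverable."""
    state = {"completed_units": [], "last_event": None}
    for line in event_lines:
        event = json.loads(line)
        if event["type"] == "unit_complete":   # hypothetical event type
            state["completed_units"].append(event["unit"])
        state["last_event"] = event["type"]
    return state
```

The same replay path serves both normal startup (verify the cache) and recovery (regenerate it after corruption).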
Write-ahead protocol for state updates
1. INTENT: Append to daily memory: "Starting unit 8 of systems-design..."
2. ACTION: Write artifact file
3. VERIFY: Confirm file exists and has expected content
4. UPDATE: Write new STATE.json (via temp file + atomic rename)
5. CONFIRM: Append to daily memory: "Unit 8 complete, STATE updated"
If a crash occurs between steps 2 and 4, the next orchestrator run detects that the artifact exists but STATE doesn't reflect it, and auto-corrects STATE.
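Steps 1, 4, and 5 of the protocol can be sketched as follows; the function name and the plain-text intent lines are illustrative, but the temp-file-plus-atomic-rename pattern is the load-bearing part (a reader never sees a half-written STATE.json):

```python
import json
import os
import tempfile

def update_state(state_path: str, memory_path: str, new_state: dict, note: str):
    """Write-ahead state update: log intent, atomically replace the
    projection via temp file + rename, then log confirmation."""
    with open(memory_path, "a") as mem:                  # 1. INTENT
        mem.write(f"Starting: {note}\n")
    fd, tmp = tempfile.mkstemp(dir=os.path.dirname(state_path) or ".")
    with os.fdopen(fd, "w") as f:                        # 4. UPDATE (temp file)
        json.dump(new_state, f)
        f.flush()
        os.fsync(f.fileno())                             # durable before rename
    os.replace(tmp, state_path)                          # atomic rename
    with open(memory_path, "a") as mem:                  # 5. CONFIRM
        mem.write(f"Done: {note}\n")
```

A crash before `os.replace` leaves the old STATE.json intact; a crash after it leaves the new one: there is no intermediate state to recover from.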
Saga for cross-agent operations
Example: Sibling knowledge sync
T1: Agent A extracts new entity facts → C1: Delete extracted facts
T2: Agent A writes to shared entity file → C2: Revert file (git checkout)
T3: Agent A notifies Agent B via webhook → C3: Send cancellation
T4: Agent B integrates facts locally → C4: Agent B reverts integration
T5: Agent B confirms integration → (commit point)
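A generic saga executor for such (transaction, compensation) pairs is short; this is a sketch of the pattern, not the actual sync code (real compensations would also be logged and retried):

```python
def run_saga(steps):
    """Execute (action, compensation) pairs in order. If any action fails,
    run the compensations of the already-completed steps in reverse order,
    then re-raise so the caller knows the saga aborted."""
    done = []
    try:
        for action, compensate in steps:
            action()
            done.append(compensate)
    except Exception:
        for compensate in reversed(done):
            compensate()  # best-effort rollback of completed steps
        raise
```

Note that the failed step itself gets no compensation: each Ti must either complete or leave nothing behind, which is why the file write in T2 is paired with a git-based revert rather than manual cleanup.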
6. Observability Stack
Drawing from Unit 6 (observability & debugging):
Lightweight, file-based observability
No external infrastructure. Everything is files that standard Unix tools can query.
workspace/
├── logs/
│   ├── YYYY-MM-DD.jsonl          # Structured event log
│   └── metrics-YYYY-MM-DD.jsonl  # RED metrics
├── traces/
│   └── YYYY-MM-DD.jsonl          # Distributed trace spans
├── HEARTBEAT.md                  # Human-readable system status
└── LOAD_STATUS.json              # Machine-readable load indicator
Key metrics (RED)
| Metric | Source | Alert Threshold |
|---|---|---|
| Autostudy units/day | STATE.json diffs | <2 during active study |
| Webhook success rate | logs/*.jsonl | <80% over 1h |
| Sub-agent completion rate | sessions list | <70% |
| Heartbeat-to-response latency | log timestamps | >60s |
| Daily memory file size | file stat | >100KB (needs compaction) |
Correlation strategy
Every action carries: trace_id (end-to-end journey) + span_id (this step) + correlation_id (cross-boundary link). Embedded in log entries, webhook payloads, and artifact file headers.
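A log-writing helper carrying the three IDs might look like this; the field names are illustrative, not a fixed schema:

```python
import json
import time
import uuid

def log_event(path, message, trace_id=None, span_id=None, correlation_id=None):
    """Append one structured JSONL event carrying the correlation triple."""
    entry = {
        "ts": time.time(),
        "trace_id": trace_id or uuid.uuid4().hex,     # end-to-end journey
        "span_id": span_id or uuid.uuid4().hex[:8],   # this step
        "correlation_id": correlation_id,             # cross-boundary link
        "msg": message,
    }
    with open(path, "a") as f:
        f.write(json.dumps(entry) + "\n")
    return entry
```

Because every entry is one JSON object per line, `grep trace_id logs/*.jsonl | jq .` reconstructs a full journey with no tracing infrastructure.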
7. Degradation Policy
Drawing from Unit 7 (graceful degradation):
Four operating modes
| Mode | Entry Condition | Exit Condition | Behavior |
|---|---|---|---|
| Normal | Default | n/a | All features, all frequencies |
| Degraded | Any: disk >85%, sibling unreachable >30min, API rate limited | Condition resolved for 2 consecutive checks (hysteresis) | Shed P3, reduce P2 frequency, preserve P0/P1 |
| Emergency | Any: disk >95%, STATE corruption unrecoverable, security incident | Human intervention | P0 only, alert the-operator, full stop on background work |
| Maintenance | Human sets MAINTENANCE_MODE in HEARTBEAT | Human removes flag | Respond to direct messages only |
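The Degraded entry/exit rule with hysteresis can be sketched as a tiny controller; class and field names are illustrative:

```python
class ModeController:
    """Enter Degraded immediately when any condition trips; exit only after
    the condition has been clear for two consecutive checks (hysteresis),
    so a flapping condition can't bounce the system between modes."""

    def __init__(self, clear_checks_needed=2):
        self.mode = "normal"
        self.clear_streak = 0
        self.needed = clear_checks_needed

    def check(self, degraded_condition: bool) -> str:
        if degraded_condition:
            self.mode, self.clear_streak = "degraded", 0
        elif self.mode == "degraded":
            self.clear_streak += 1
            if self.clear_streak >= self.needed:
                self.mode, self.clear_streak = "normal", 0
        return self.mode
```

The heartbeat would call `check()` with the OR of the entry conditions (disk > 85%, sibling unreachable > 30 min, API rate limited) and act on the returned mode.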
Load shedding order (first shed → last shed)
1. Entity extraction (P3): shed first
2. Proactive email/calendar (P3)
3. Memory compaction (P3)
4. Autostudy (P2)
5. Real estate search (P2)
6. Heartbeat monitoring (P1)
7. Sub-agent oversight (P1)
8. User message response (P0): never shed
Retry with backoff
All inter-agent communication uses exponential backoff with full jitter:
delay = random(0, min(base * 2^attempt, max_delay))
base = 1s, max_delay = 60s, max_attempts = 5
Circuit breaker: open after 3 consecutive failures, 5-minute cooldown, half-open test with single request.
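The backoff formula maps directly to a small retry helper. This is a sketch with the parameters above; the `retry` name and injectable `sleep` are illustrative, not part of the deployment:

```python
import random
import time

def retry(op, max_attempts=5, base=1.0, max_delay=60.0, sleep=time.sleep):
    """Retry op() with exponential backoff and full jitter:
    delay = random(0, min(base * 2^attempt, max_delay))."""
    for attempt in range(max_attempts):
        try:
            return op()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the error to the caller
            sleep(random.uniform(0.0, min(base * 2 ** attempt, max_delay)))
```

Full jitter (randomizing over the whole window rather than around the midpoint) is what prevents both agents from retrying in lockstep after a shared outage.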
8. Security Architecture
Drawing from Unit 8 (security & trust):
Trust zones
Zone 1 (Highest): Main agent sessions (full workspace access)
Zone 2 (Medium): Sub-agents (scoped to task directory, time-bounded)
Zone 3 (Low): External inputs (validated, rate-limited, sandboxed)
Zone 4 (Trusted): Human (the-operator), ultimate authority, can override anything
Authentication layers
| Channel | Current | Recommended Upgrade |
|---|---|---|
| Webhook | Bearer token (shared) | Per-agent HMAC + timestamp + nonce |
| SSH | Ed25519 key pair | ✓ Already strong |
| Git | SSH key | ✓ Already strong |
| Sub-agent | Session scoping | Add explicit capability tokens |
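The recommended per-agent HMAC + timestamp + nonce scheme can be sketched with the standard library. Field names (`agent`, `ts`, `nonce`, `sig`) are illustrative; a real receiver would also cache seen nonces to reject replays inside the timestamp window:

```python
import hashlib
import hmac
import json
import time
import uuid

def sign_payload(payload: dict, key: bytes, agent: str) -> dict:
    """Wrap a webhook payload with sender id, timestamp, and nonce, then
    sign the canonical JSON body with HMAC-SHA256."""
    body = dict(payload, agent=agent, ts=int(time.time()), nonce=uuid.uuid4().hex)
    msg = json.dumps(body, sort_keys=True).encode()
    body["sig"] = hmac.new(key, msg, hashlib.sha256).hexdigest()
    return body

def verify_payload(body: dict, key: bytes, max_skew: int = 300) -> bool:
    """Recompute the HMAC (constant-time compare) and reject stale timestamps."""
    received = dict(body)
    sig = received.pop("sig", "")
    msg = json.dumps(received, sort_keys=True).encode()
    expected = hmac.new(key, msg, hashlib.sha256).hexdigest()
    return hmac.compare_digest(sig, expected) and abs(time.time() - received["ts"]) <= max_skew
```

Unlike a shared bearer token, this binds each message to a sender, a time, and a body: a leaked request can't be replayed later or modified in flight.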
Blast radius containment
- Sub-agents: isolated directory, 30min timeout, output validated before integration
- Webhooks: schema validation before processing, rate limit 10/min
- External APIs: responses validated, error paths don't expose internal state
- File operations: `trash` over `rm`; git as recovery mechanism
9. Implementation Roadmap
Phase 1: Foundation (can do now)
- [ ] Add HMAC signing to webhook calls (replace bare bearer tokens)
- [ ] Add timestamp + sequence number to all webhook payloads
- [ ] Create `LOAD_STATUS.json` and check it in heartbeat
- [ ] Add filesystem consistency check to orchestrator startup (Unit 5 playbook)
Phase 2: Observability (next sprint)
- [ ] Start writing structured JSON logs alongside existing memory files
- [ ] Add trace_id to cron triggers and sub-agent task descriptions
- [ ] Create simple `jq` scripts for log analysis
- [ ] Add disk usage and basic RED metrics to heartbeat checks
Phase 3: Resilience (following sprint)
- [ ] Implement degradation mode detection and automatic P3 shedding
- [ ] Add circuit breakers to webhook send functions
- [ ] Implement write-ahead logging for state updates
- [ ] Create automated STATE.json recovery from artifacts directory
Phase 4: Security hardening (quarterly)
- [ ] Generate per-agent HMAC keys, deploy to both nodes
- [ ] Implement token rotation mechanism
- [ ] Add sub-agent capability scoping
- [ ] Security audit: review all trust boundary crossings
10. Conclusion
A two-node agent network is paradoxically both simpler and harder than a large distributed system. Simpler because there are only two participants, a known topology, and a trusted network. Harder because with N=2 you can't use quorum-based techniques: every node is critical.
The architecture presented here addresses this by:
- Accepting partition as normal (Mac sleeps, Pi reboots) rather than treating it as exceptional
- Using CRDTs over consensus for shared state (availability over strict consistency)
- Implementing defense in depth across authentication, authorization, validation, containment, and audit
- Designing for graceful degradation with explicit operating modes and shed ordering
- Making everything observable through structured logs, traces, and metrics, all stored as simple files
The key insight from this curriculum: resilience is not about preventing failures; it's about designing systems where failures are normal, expected, and handled automatically. The COZ/Axiom network already embodies many of these principles informally. This architecture makes them explicit, testable, and improvable.
Score self-assessment: This dissertation synthesizes all 8 units into a coherent, implementable architecture with concrete failure scenarios, recovery strategies, and a phased implementation plan. It's grounded in the actual COZ/Axiom deployment rather than being abstract theory. Estimated score: 89/100; strong practical applicability and synthesis, could be deeper on formal verification and testing of the recovery strategies.