Topic #15: Systems Design for Resilient Distributed Agents
Date: 2026-02-20
Synthesizes: All 8 units
---
This document presents a concrete, implementable architecture for a resilient two-node AI agent network — specifically, the COZ (Mac) and Axiom (Raspberry Pi) deployment running OpenClaw. Drawing on distributed systems theory (consensus, fault taxonomy, event sourcing, capability security), it specifies failure scenarios, recovery strategies, observability patterns, and degradation policies. The goal: an agent network that survives partial failures, self-heals where possible, degrades gracefully where not, and remains observable throughout.
---
┌─────────────────────────────────────────────────────────────┐
│ HOME NETWORK (LAN) │
│ │
│ ┌──────────────────────┐ ┌──────────────────────┐ │
│ │ COZ (Mac) │ │ Axiom (Pi) │ │
│ │ │ │ │ │
│ │ ┌─────────────────┐ │ │ ┌─────────────────┐ │ │
│ │ │ Main Agent │ │ │ │ Main Agent │ │ │
│ │ │ - Orchestration│ │ │ │ - Orchestration│ │ │
│ │ │ - Browser │ │ │ │ - 24/7 cron │ │ │
│ │ │ - Desktop │ │ │ │ - Headless ops │ │ │
│ │ └────────┬────────┘ │ │ └────────┬────────┘ │ │
│ │ │ │ │ │ │ │
│ │ ┌────────▼────────┐ │ │ ┌────────▼────────┐ │ │
│ │ │ Sub-agents │ │ │ │ Sub-agents │ │ │
│ │ │ (sandboxed) │ │ │ │ (sandboxed) │ │ │
│ │ └─────────────────┘ │ │ └─────────────────┘ │ │
│ │ │ │ │ │
│ │ Services: │ │ Services: │ │
│ │ - OpenClaw GW :18789│ │ - OpenClaw GW :18789│ │
│ │ - COSMO IDE :4405 │ │ - COSMO IDE :4405 │ │
│ │ - SearxNG :8888 │ │ - SearxNG :8888 │ │
│ │ │ │ - Clawdboard :3300 │ │
│ └───────────┬───────────┘ └───────────┬───────────┘ │
│ │ │ │
│ └──────────┬──────────────────┘ │
│ │ │
│ ┌──────────▼──────────┐ │
│ │ Communication │ │
│ │ - Webhooks (HTTP) │ │
│ │ - SSH (file ops) │ │
│ │ - Git (state sync) │ │
│ └─────────────────────┘ │
└─────────────────────────────────────────────────────────────┘
1. No single point of failure: Either agent can operate independently when the other is down.
2. Shared-nothing: Each agent owns its local state. Coordination is via messages, not shared storage.
3. Eventual consistency: Agents synchronize state asynchronously; temporary divergence is acceptable.
4. Human as ultimate arbiter: jtr can intervene at any point; the system never locks humans out.
---
Drawing from Unit 1 (fault taxonomy) and Unit 5 (state management):
| # | Failure | Type (Unit 1) | Detection | Automated Recovery | Manual Escalation |
|---|---------|---------------|-----------|-------------------|-------------------|
| F1 | Axiom power loss | Crash-stop | COZ webhook fails | COZ continues independently; Axiom recovers from STATE.json + artifacts on reboot | If >24h, jtr checks Pi power |
| F2 | COZ sleep/shutdown | Crash-stop | Axiom webhook fails | Axiom continues independently; queues messages for COZ | Normal — Mac sleeps at night |
| F3 | Network partition (LAN) | Omission | Bidirectional webhook timeout | Both operate independently; reconcile on reconnection | If >1h during work hours, check router |
| F4 | STATE.json corruption | Byzantine (data) | Schema validation on read | Rebuild from artifacts directory (Unit 5 playbook) | If rebuild fails, alert jtr |
| F5 | Sub-agent runaway | Performance | Timeout + heartbeat check | Kill sub-agent, re-queue task | If repeated (3x), disable feature |
| F6 | Disk full on Pi | Resource exhaustion | Disk check in heartbeat | Auto-cleanup: compress logs, trash old artifacts | If <5% free after cleanup, alert |
| F7 | Memory compaction data loss | Crash (timing) | Gap detection in daily files | Accept loss, note gap, continue | If critical info lost, check git |
| F8 | Webhook auth token leaked | Security | Anomalous requests (hard to detect) | Rotate token immediately | Full audit of actions during exposure window |
When multiple failures occur simultaneously:
1. Preserve human-facing responsiveness (F2 exempted — sleep is normal)
2. Protect state integrity (F4 first — without state, can't coordinate)
3. Restore communication (F3 — needed for coordination)
4. Resume work (F1, F5, F6 — can wait)
---
Drawing from Unit 2 (consensus) and Unit 3 (messaging):
Classical consensus (Raft, Paxos) requires a quorum. With 2 nodes, quorum = 2 = all nodes. Any single failure blocks consensus. Consensus is the wrong model for a two-node system.
Normal: Axiom = primary for cron/scheduled work
COZ = primary for interactive/desktop work
Degraded: Surviving node takes on all roles
Each agent is the single leader for its domain. No contention, no split-brain for non-overlapping concerns.
Last-writer-wins (LWW) with vector timestamps:
[agent_name, sequence_number]This is a CRDT (Conflict-free Replicated Data Type) — specifically an LWW-Register. From Unit 2: use CRDTs when availability > consistency, which is exactly the case here.
Per Unit 3:
---
Drawing from Unit 4 (supervision & self-healing):
┌─────────────────────────────────────┐
│ ROOT SUPERVISOR │
│ (OpenClaw Gateway Process) │
│ Restart: systemd/PM2 │
└──────────────┬──────────────────────┘
│
┌──────────┴──────────┐
│ │
┌───▼──────────┐ ┌─────▼────────┐
│ Main Agent │ │ Cron Scheduler│
│ Strategy: │ │ Strategy: │
│ restart │ │ restart │
└───┬──────────┘ └──────────────┘
│
├──────────────────┐
│ │
┌───▼──────────┐ ┌────▼─────────┐
│ Sub-agents │ │ Heartbeat │
│ Strategy: │ │ Monitor │
│ one-for-one │ │ Strategy: │
│ max 3 retries │ │ restart │
└──────────────┘ └──────────────┘
Supervision strategies:
L1: Process alive? → PM2/systemd checks (every 30s)
L2: Responding? → Gateway health endpoint (every 60s)
L3: Making progress? → Heartbeat checks PROGRESS.md timestamps (every 30min)
L4: Producing quality? → Artifact validation on completion
---
Drawing from Unit 5 (state management & recovery):
Events (ground truth): memory/YYYY-MM-DD.md (append-only daily logs)
Projection (derived): STATE.json (current state cache)
Snapshot (compaction): MEMORY.md (compressed knowledge)
1. INTENT: Append to daily memory: "Starting unit 8 of systems-design..."
2. ACTION: Write artifact file
3. VERIFY: Confirm file exists and has expected content
4. UPDATE: Write new STATE.json (via temp file + atomic rename)
5. CONFIRM: Append to daily memory: "Unit 8 complete, STATE updated"
If crash occurs between steps 2 and 4: next orchestrator run detects artifact exists but STATE doesn't reflect it → auto-correct STATE.
Example: Sibling knowledge sync
T1: Agent A extracts new entity facts → C1: Delete extracted facts
T2: Agent A writes to shared entity file → C2: Revert file (git checkout)
T3: Agent A notifies Agent B via webhook → C3: Send cancellation
T4: Agent B integrates facts locally → C4: Agent B reverts integration
T5: Agent B confirms integration → (commit point)
---
Drawing from Unit 6 (observability & debugging):
No external infrastructure. Everything is files that standard Unix tools can query.
workspace/
├── logs/
│ ├── YYYY-MM-DD.jsonl # Structured event log
│ └── metrics-YYYY-MM-DD.jsonl # RED metrics
├── traces/
│ └── YYYY-MM-DD.jsonl # Distributed trace spans
├── HEARTBEAT.md # Human-readable system status
└── LOAD_STATUS.json # Machine-readable load indicator
| Metric | Source | Alert Threshold |
|--------|--------|-----------------|
| Autostudy units/day | STATE.json diffs | <2 during active study |
| Webhook success rate | logs/*.jsonl | <80% over 1h |
| Sub-agent completion rate | sessions list | <70% |
| Heartbeat-to-response latency | log timestamps | >60s |
| Daily memory file size | file stat | >100KB (needs compaction) |
Every action carries: trace_id (end-to-end journey) + span_id (this step) + correlation_id (cross-boundary link). Embedded in log entries, webhook payloads, and artifact file headers.
---
Drawing from Unit 7 (graceful degradation):
| Mode | Entry Condition | Exit Condition | Behavior |
|------|----------------|----------------|----------|
| Normal | Default | — | All features, all frequencies |
| Degraded | Any: disk >85%, sibling unreachable >30min, API rate limited | Condition resolved for 2 consecutive checks (hysteresis) | Shed P3, reduce P2 frequency, preserve P0/P1 |
| Emergency | Any: disk >95%, STATE corruption unrecoverable, security incident | Human intervention | P0 only, alert jtr, full stop on background work |
| Maintenance | Human sets MAINTENANCE_MODE in HEARTBEAT | Human removes flag | Respond to direct messages only |
1. Entity extraction (P3) ← shed first
2. Proactive email/calendar (P3)
3. Memory compaction (P3)
4. Autostudy (P2)
5. Real estate search (P2)
6. Heartbeat monitoring (P1)
7. Sub-agent oversight (P1)
8. User message response (P0) ← never shed
All inter-agent communication uses exponential backoff with full jitter:
delay = random(0, min(base * 2^attempt, max_delay))
base = 1s, max_delay = 60s, max_attempts = 5
Circuit breaker: open after 3 consecutive failures, 5-minute cooldown, half-open test with single request.
---
Drawing from Unit 8 (security & trust):
Zone 1 (Highest): Main agent sessions — full workspace access
Zone 2 (Medium): Sub-agents — scoped to task directory, time-bounded
Zone 3 (Low): External inputs — validated, rate-limited, sandboxed
Zone 4 (Trusted): Human (jtr) — ultimate authority, can override anything
| Channel | Current | Recommended Upgrade |
|---------|---------|-------------------|
| Webhook | Bearer token (shared) | Per-agent HMAC + timestamp + nonce |
| SSH | Ed25519 key pair | ✅ Already strong |
| Git | SSH key | ✅ Already strong |
| Sub-agent | Session scoping | Add explicit capability tokens |
trash over rm, git as recovery mechanism---
LOAD_STATUS.json and check it in heartbeatjq scripts for log analysis---
A two-node agent network is paradoxically both simpler and harder than a large distributed system. Simpler because there are only two participants, known topology, trusted network. Harder because with N=2, you can't use quorum-based techniques — every node is critical.
The architecture presented here addresses this by:
1. Accepting partition as normal (Mac sleeps, Pi reboots) rather than treating it as exceptional
2. Using CRDTs over consensus for shared state — availability over strict consistency
3. Implementing defense in depth across authentication, authorization, validation, containment, and audit
4. Designing for graceful degradation with explicit operating modes and shed ordering
5. Making everything observable through structured logs, traces, and metrics — all stored as simple files
The key insight from this curriculum: resilience is not about preventing failures — it's about designing systems where failures are normal, expected, and handled automatically. The COZ/Axiom network already embodies many of these principles informally. This architecture makes them explicit, testable, and improvable.
---
Score self-assessment: This dissertation synthesizes all 8 units into a coherent, implementable architecture with concrete failure scenarios, recovery strategies, and a phased implementation plan. It's grounded in the actual COZ/Axiom deployment rather than being abstract theory. Estimated score: 89/100 — strong practical applicability and synthesis, could be deeper on formal verification and testing of the recovery strategies.