Dissertation: Architecture for a Resilient Two-Node Agent Network

Topic #15: Systems Design for Resilient Distributed Agents

Date: 2026-02-20

Synthesizes: All 8 units

---

Abstract

This document presents a concrete, implementable architecture for a resilient two-node AI agent network — specifically, the COZ (Mac) and Axiom (Raspberry Pi) deployment running OpenClaw. Drawing on distributed systems theory (consensus, fault taxonomy, event sourcing, capability security), it specifies failure scenarios, recovery strategies, observability patterns, and degradation policies. The goal: an agent network that survives partial failures, self-heals where possible, degrades gracefully where not, and remains observable throughout.

---

1. System Topology


┌─────────────────────────────────────────────────────────────┐
│                    HOME NETWORK (LAN)                        │
│                                                              │
│  ┌──────────────────────┐    ┌──────────────────────┐       │
│  │     COZ (Mac)         │    │    Axiom (Pi)         │       │
│  │                       │    │                       │       │
│  │  ┌─────────────────┐ │    │ ┌─────────────────┐  │       │
│  │  │  Main Agent     │ │    │ │  Main Agent     │  │       │
│  │  │  - Orchestration│ │    │ │  - Orchestration│  │       │
│  │  │  - Browser      │ │    │ │  - 24/7 cron    │  │       │
│  │  │  - Desktop      │ │    │ │  - Headless ops │  │       │
│  │  └────────┬────────┘ │    │ └────────┬────────┘  │       │
│  │           │           │    │          │            │       │
│  │  ┌────────▼────────┐ │    │ ┌────────▼────────┐  │       │
│  │  │  Sub-agents     │ │    │ │  Sub-agents     │  │       │
│  │  │  (sandboxed)    │ │    │ │  (sandboxed)    │  │       │
│  │  └─────────────────┘ │    │ └─────────────────┘  │       │
│  │                       │    │                       │       │
│  │  Services:            │    │  Services:            │       │
│  │  - OpenClaw GW :18789│    │  - OpenClaw GW :18789│       │
│  │  - COSMO IDE  :4405  │    │  - COSMO IDE   :4405 │       │
│  │  - SearxNG    :8888  │    │  - SearxNG     :8888 │       │
│  │                       │    │  - Clawdboard  :3300 │       │
│  └───────────┬───────────┘    └───────────┬───────────┘       │
│              │                             │                   │
│              └──────────┬──────────────────┘                   │
│                         │                                      │
│              ┌──────────▼──────────┐                           │
│              │  Communication       │                           │
│              │  - Webhooks (HTTP)   │                           │
│              │  - SSH (file ops)    │                           │
│              │  - Git (state sync)  │                           │
│              └─────────────────────┘                           │
└─────────────────────────────────────────────────────────────┘

Design principles

1. No single point of failure: Either agent can operate independently when the other is down.

2. Shared-nothing: Each agent owns its local state. Coordination is via messages, not shared storage.

3. Eventual consistency: Agents synchronize state asynchronously; temporary divergence is acceptable.

4. Human as ultimate arbiter: jtr can intervene at any point; the system never locks humans out.

---

2. Failure Scenarios & Recovery Strategies

Drawing from Unit 1 (fault taxonomy) and Unit 5 (state management):

Scenario Matrix

|---|---------|---------------|-----------|-------------------|-------------------|

Recovery Priority Order

When multiple failures occur simultaneously:

1. Preserve human-facing responsiveness (F2 exempted — sleep is normal)

2. Protect state integrity (F4 first — without state, can't coordinate)

3. Restore communication (F3 — needed for coordination)

4. Resume work (F1, F5, F6 — can wait)

---

3. Consensus & Coordination Model

Drawing from Unit 2 (consensus) and Unit 3 (messaging):

Why NOT consensus

Classical consensus (Raft, Paxos) requires a quorum. With 2 nodes, quorum = 2 = all nodes. Any single failure blocks consensus. Consensus is the wrong model for a two-node system.

The actual model: Single-leader with failover


Normal:  Axiom = primary for cron/scheduled work
         COZ   = primary for interactive/desktop work
         
Degraded: Surviving node takes on all roles

Each agent is the single leader for its domain. No contention, no split-brain for non-overlapping concerns.

For overlapping concerns (shared state):

Last-writer-wins (LWW) with vector timestamps:

Each agent tags state updates with [agent_name, sequence_number]
On reconnection, compare sequence numbers
Higher sequence wins; ties broken by agent priority (configurable)

This is a CRDT (Conflict-free Replicated Data Type) — specifically an LWW-Register. From Unit 2: use CRDTs when availability > consistency, which is exactly the case here.

Messaging semantics

Per Unit 3:

Webhook calls: At-most-once delivery (fire and forget with timeout)
Idempotent handlers: All webhook endpoints are idempotent — receiving the same message twice produces the same result
Sequence numbers: Monotonic per-channel counter detects gaps (lost messages) and duplicates

---

4. Supervision Architecture

Drawing from Unit 4 (supervision & self-healing):


┌─────────────────────────────────────┐
│          ROOT SUPERVISOR            │
│     (OpenClaw Gateway Process)      │
│     Restart: systemd/PM2            │
└──────────────┬──────────────────────┘
               │
    ┌──────────┴──────────┐
    │                      │
┌───▼──────────┐    ┌─────▼────────┐
│ Main Agent    │    │ Cron Scheduler│
│ Strategy:     │    │ Strategy:     │
│ restart       │    │ restart       │
└───┬──────────┘    └──────────────┘
    │
    ├──────────────────┐
    │                   │
┌───▼──────────┐  ┌────▼─────────┐
│ Sub-agents    │  │ Heartbeat    │
│ Strategy:     │  │ Monitor      │
│ one-for-one   │  │ Strategy:    │
│ max 3 retries │  │ restart      │
└──────────────┘  └──────────────┘

Supervision strategies:

Main agent: Always restart (it's the brain)
Sub-agents: One-for-one restart, max 3 attempts, then escalate to main agent
Cron jobs: Restart on failure, skip missed cycles (don't stack up)
External services (SearxNG, COSMO IDE): Circuit breaker — 3 failures → open → 5min cooldown → half-open retry

Health check hierarchy


L1: Process alive?     → PM2/systemd checks (every 30s)
L2: Responding?        → Gateway health endpoint (every 60s)
L3: Making progress?   → Heartbeat checks PROGRESS.md timestamps (every 30min)
L4: Producing quality? → Artifact validation on completion

---

5. State Management Design

Drawing from Unit 5 (state management & recovery):

Event sourcing (already in place)


Events (ground truth):     memory/YYYY-MM-DD.md  (append-only daily logs)
Projection (derived):      STATE.json             (current state cache)
Snapshot (compaction):     MEMORY.md              (compressed knowledge)

Write-ahead protocol for state updates


1. INTENT:   Append to daily memory: "Starting unit 8 of systems-design..."
2. ACTION:   Write artifact file
3. VERIFY:   Confirm file exists and has expected content
4. UPDATE:   Write new STATE.json (via temp file + atomic rename)
5. CONFIRM:  Append to daily memory: "Unit 8 complete, STATE updated"

If crash occurs between steps 2 and 4: next orchestrator run detects artifact exists but STATE doesn't reflect it → auto-correct STATE.

Saga for cross-agent operations

Example: Sibling knowledge sync


T1: Agent A extracts new entity facts     → C1: Delete extracted facts
T2: Agent A writes to shared entity file   → C2: Revert file (git checkout)  
T3: Agent A notifies Agent B via webhook   → C3: Send cancellation
T4: Agent B integrates facts locally       → C4: Agent B reverts integration
T5: Agent B confirms integration           → (commit point)

---

6. Observability Stack

Drawing from Unit 6 (observability & debugging):

Lightweight, file-based observability

No external infrastructure. Everything is files that standard Unix tools can query.


workspace/
├── logs/
│   ├── YYYY-MM-DD.jsonl          # Structured event log
│   └── metrics-YYYY-MM-DD.jsonl  # RED metrics
├── traces/
│   └── YYYY-MM-DD.jsonl          # Distributed trace spans
├── HEARTBEAT.md                   # Human-readable system status
└── LOAD_STATUS.json               # Machine-readable load indicator

Key metrics (RED)

| Metric | Source | Alert Threshold |

|--------|--------|-----------------|

| Autostudy units/day | STATE.json diffs | <2 during active study |

| Webhook success rate | logs/*.jsonl | <80% over 1h |

| Sub-agent completion rate | sessions list | <70% |

| Heartbeat-to-response latency | log timestamps | >60s |

| Daily memory file size | file stat | >100KB (needs compaction) |

Correlation strategy

Every action carries: trace_id (end-to-end journey) + span_id (this step) + correlation_id (cross-boundary link). Embedded in log entries, webhook payloads, and artifact file headers.

---

7. Degradation Policy

Drawing from Unit 7 (graceful degradation):

Four operating modes

|------|----------------|----------------|----------|

| Degraded | Any: disk >85%, sibling unreachable >30min, API rate limited | Condition resolved for 2 consecutive checks (hysteresis) | Shed P3, reduce P2 frequency, preserve P0/P1 |

Load shedding order (last shed → first shed)


1. Entity extraction (P3)          ← shed first
2. Proactive email/calendar (P3)
3. Memory compaction (P3)
4. Autostudy (P2)
5. Real estate search (P2)
6. Heartbeat monitoring (P1)
7. Sub-agent oversight (P1)
8. User message response (P0)      ← never shed

Retry with backoff

All inter-agent communication uses exponential backoff with full jitter:


delay = random(0, min(base * 2^attempt, max_delay))
base = 1s, max_delay = 60s, max_attempts = 5

Circuit breaker: open after 3 consecutive failures, 5-minute cooldown, half-open test with single request.

---

8. Security Architecture

Drawing from Unit 8 (security & trust):

Trust zones


Zone 1 (Highest): Main agent sessions — full workspace access
Zone 2 (Medium):  Sub-agents — scoped to task directory, time-bounded  
Zone 3 (Low):     External inputs — validated, rate-limited, sandboxed
Zone 4 (Trusted): Human (jtr) — ultimate authority, can override anything

Authentication layers

| Channel | Current | Recommended Upgrade |

|---------|---------|-------------------|

| Webhook | Bearer token (shared) | Per-agent HMAC + timestamp + nonce |

| SSH | Ed25519 key pair | ✅ Already strong |

| Git | SSH key | ✅ Already strong |

| Sub-agent | Session scoping | Add explicit capability tokens |

Blast radius containment

Sub-agents: isolated directory, 30min timeout, output validated before integration
Webhooks: schema validation before processing, rate limit 10/min
External APIs: responses validated, error paths don't expose internal state
File operations: trash over rm, git as recovery mechanism

---

9. Implementation Roadmap

Phase 1: Foundation (can do now)

[ ] Add HMAC signing to webhook calls (replace bare bearer tokens)
[ ] Add timestamp + sequence number to all webhook payloads
[ ] Create LOAD_STATUS.json and check it in heartbeat
[ ] Add filesystem consistency check to orchestrator startup (Unit 5 playbook)

Phase 2: Observability (next sprint)

[ ] Start writing structured JSON logs alongside existing memory files
[ ] Add trace_id to cron triggers and sub-agent task descriptions
[ ] Create simple jq scripts for log analysis
[ ] Add disk usage and basic RED metrics to heartbeat checks

Phase 3: Resilience (following sprint)

[ ] Implement degradation mode detection and automatic P3 shedding
[ ] Add circuit breakers to webhook send functions
[ ] Implement write-ahead logging for state updates
[ ] Create automated STATE.json recovery from artifacts directory

Phase 4: Security hardening (quarterly)

[ ] Generate per-agent HMAC keys, deploy to both nodes
[ ] Implement token rotation mechanism
[ ] Add sub-agent capability scoping
[ ] Security audit: review all trust boundary crossings

---

10. Conclusion

A two-node agent network is paradoxically both simpler and harder than a large distributed system. Simpler because there are only two participants, known topology, trusted network. Harder because with N=2, you can't use quorum-based techniques — every node is critical.

The architecture presented here addresses this by:

1. Accepting partition as normal (Mac sleeps, Pi reboots) rather than treating it as exceptional

2. Using CRDTs over consensus for shared state — availability over strict consistency

3. Implementing defense in depth across authentication, authorization, validation, containment, and audit

4. Designing for graceful degradation with explicit operating modes and shed ordering

5. Making everything observable through structured logs, traces, and metrics — all stored as simple files

The key insight from this curriculum: resilience is not about preventing failures — it's about designing systems where failures are normal, expected, and handled automatically. The COZ/Axiom network already embodies many of these principles informally. This architecture makes them explicit, testable, and improvable.

---

Score self-assessment: This dissertation synthesizes all 8 units into a coherent, implementable architecture with concrete failure scenarios, recovery strategies, and a phased implementation plan. It's grounded in the actual COZ/Axiom deployment rather than being abstract theory. Estimated score: 89/100 — strong practical applicability and synthesis, could be deeper on formal verification and testing of the recovery strategies.