Tool-using AI agents face a fundamental challenge: the action space is not fixed. Unlike Atari games or robotic control, where actions are enumerated at design time, tool-augmented agents operate over evolving tool catalogs, dynamic parameter spaces, and multi-step compositions. This dissertation synthesizes six units of study to propose a practical RL-inspired framework for continuous tool-use improvement, grounded in the Axiom-COZ multi-agent architecture.
---
Standard RL assumes a fixed MDP: states S, actions A, transition function T, reward R. Tool-using agents violate this in three ways:
1. Action space instability: New tools appear (skills installed) and old ones are deprecated. The policy must generalize to unseen actions.
2. Compositional actions: A "tool call" isn't atomic — it has parameters, preconditions, and post-conditions that interact with other tools.
3. Sparse, delayed rewards: Did a sequence of 5 tool calls succeed? The signal arrives minutes later, after compaction or user feedback.
The goal is not full RL training (impractical for LLM agents in deployment) but RL-inspired mechanisms that operate at inference time with zero gradient updates:
---
┌─────────────────────────────────────────────────┐
│ TILE Framework │
├──────────────┬──────────────────────────────────┤
│ OBSERVE │ Log tool calls + outcomes │
│ EVALUATE │ Score sequences (success/cost) │
│ ADAPT │ Update tool preferences/patterns │
│ CRYSTALLIZE │ Create new skills from patterns │
├──────────────┴──────────────────────────────────┤
│ Feedback loop: session logs → patterns → skills │
└─────────────────────────────────────────────────┘
Every tool invocation should be recorded along with its outcome. This is the agent's "experience buffer." In Axiom's case, session JSONL files already contain this data; the gap is structured extraction.
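One extracted log line might look like the following sketch; the field names here are illustrative assumptions, not Axiom's actual session schema:

```python
import json

# Hypothetical record shape for a structured tool-use log (all field names assumed)
record = {
    "session_id": "2026-02-15-a1",            # which session produced the call
    "tool": "web_search",                     # tool name
    "params": {"query": "beta distribution"}, # arguments as passed
    "outcome": "success",                     # success | error | timeout
    "latency_ms": 412,
    "ts": "2026-02-15T10:32:00Z",
}

# JSONL: one JSON object per line, appended as calls happen
line = json.dumps(record)
print(line)
```

A flat record like this is enough to drive every later stage: EVALUATE reads `outcome` and `latency_ms`, ADAPT aggregates per `tool`, and CRYSTALLIZE looks for repeated `tool` sequences within a `session_id`.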
Without human raters, we need surrogate rewards:
| Signal | Source | Quality |
|--------|--------|---------|
| Tool execution success | Return value / error | High reliability, low informativeness |
| User response sentiment | Next user message | High informativeness, noisy |
| Task completion | Session outcome | Ideal but hard to define |
| Efficiency | Call count, latency | Easy to measure, can misalign |
| Retry count | Same tool called again with different params | Strong negative signal |
Composite reward (proposed):
R(trajectory) = α·success_rate + β·(1/call_count) + γ·user_satisfaction - δ·retry_penalty
Where α, β, γ, δ are tunable weights. This mirrors Unit 3's reward modeling but uses heuristic signals instead of learned reward models.
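The composite reward can be sketched directly; the default weights below are placeholders, not tuned values:

```python
def trajectory_reward(success_rate, call_count, user_satisfaction, retries,
                      alpha=1.0, beta=0.5, gamma=0.5, delta=0.3):
    """Composite heuristic reward over one tool-call trajectory.

    Weights are illustrative defaults; in practice they would be tuned
    against whatever surrogate signals the agent actually logs.
    """
    return (alpha * success_rate
            + beta * (1.0 / max(call_count, 1))   # efficiency: fewer calls is better
            + gamma * user_satisfaction           # e.g. sentiment of next user message
            - delta * retries)                    # retries are a strong negative signal

# A 5-call trajectory: 80% success, positive user sentiment, one retry
r = trajectory_reward(success_rate=0.8, call_count=5,
                      user_satisfaction=1.0, retries=1)
```

Note the `max(call_count, 1)` guard: an empty trajectory should not divide by zero, and the efficiency term deliberately rewards shorter trajectories even when both succeed.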
Rather than updating weights, maintain a tool preference distribution (Thompson Sampling style, per Unit 1):
For each tool t in available_tools:
    prior: Beta(α_t, β_t)   # success/failure counts

On tool outcome:
    success → α_t += 1
    failure → β_t += 1

Tool selection:
    Sample θ_t ~ Beta(α_t, β_t) for each tool
    Select argmax(θ_t) among context-appropriate tools
This is lightweight, requires no gradient computation, and naturally balances exploration (uncertain tools get sampled) with exploitation (proven tools get preferred).
Context-conditioning: The preference isn't global — it's conditioned on task type. Web research tasks have different tool preferences than file management tasks. A simple context hash maps to separate Beta distributions.
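A minimal sketch of the context-conditioned preference store, assuming a plain string context key in place of a real context hash:

```python
import random
from collections import defaultdict

# (context, tool) -> [alpha, beta]; Beta(1, 1) is a uniform prior,
# and these counts are the only state that needs to persist between sessions.
prefs = defaultdict(lambda: [1.0, 1.0])

def record_outcome(context, tool, success):
    counts = prefs[(context, tool)]
    counts[0 if success else 1] += 1

def select_tool(context, candidate_tools):
    # Thompson sampling: draw once from each posterior, pick the best draw.
    samples = {t: random.betavariate(*prefs[(context, t)])
               for t in candidate_tools}
    return max(samples, key=samples.get)

# Simulate: web_fetch succeeds repeatedly in a "web-research" context
for _ in range(20):
    record_outcome("web-research", "web_fetch", success=True)
record_outcome("web-research", "web_search", success=False)
choice = select_tool("web-research", ["web_fetch", "web_search"])
```

After 20 successes, `web_fetch`'s posterior is Beta(21, 1), so its draws concentrate near 1 and it is selected almost always, while the uncertain `web_search` still gets occasional exploratory picks.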
The most novel component. When OBSERVE detects repeated tool-use patterns:
1. Pattern detection: Same sequence of 3+ tool calls appearing in 3+ sessions
2. Abstraction: Extract the pattern as a parameterized template
3. Skill proposal: Generate a SKILL.md + supporting scripts
4. Validation: Test the skill in isolated session
5. Integration: Add to skills directory if successful
This is option discovery (Unit 4) in practice. The skill library grows organically from actual usage, not human design.
Example: If Axiom repeatedly does web_search → web_fetch → write summary → update memory, this could crystallize into a research-and-remember skill.
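The pattern-detection step can be sketched as an n-gram count over per-session tool sequences; the tool names and thresholds below are illustrative:

```python
from collections import Counter

def detect_patterns(sessions, min_len=3, min_sessions=3):
    """Find tool-call subsequences of length min_len that appear in at
    least min_sessions distinct sessions (a sketch of step 1)."""
    seen = Counter()
    for calls in sessions:
        # Use a set so one session counts each pattern at most once
        grams = {tuple(calls[i:i + min_len])
                 for i in range(len(calls) - min_len + 1)}
        seen.update(grams)
    return [gram for gram, n in seen.items() if n >= min_sessions]

sessions = [
    ["web_search", "web_fetch", "write_summary", "update_memory"],
    ["read_file", "web_search", "web_fetch", "write_summary"],
    ["web_search", "web_fetch", "write_summary", "update_memory"],
]
candidates = detect_patterns(sessions)
```

Each candidate then flows into steps 2-5: abstraction into a parameterized template, SKILL.md generation, isolated validation, and integration.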
---
Axiom and COZ share tools but have different usage contexts:
Per Unit 5, this is a CTDE (Centralized Training, Decentralized Execution) scenario — except there's no centralized training. Instead:
Axiom's experience ──┐
├──→ Shared tool preference file ──→ Both agents benefit
COZ's experience ─────┘
Each agent maintains local Beta distributions. Periodically (weekly synthesis), distributions are merged:
α_merged = α_axiom + α_coz - α_prior
β_merged = β_axiom + β_coz - β_prior
This gives both agents the benefit of the other's experience without requiring real-time coordination.
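The merge rule subtracts the shared prior once so it isn't double-counted when the two posteriors are pooled; a sketch:

```python
def merge_betas(axiom, coz, prior=(1.0, 1.0)):
    """Pool two agents' Beta counts for one tool.

    Both agents started from the same prior, so it appears in both
    posteriors; subtracting it once keeps the merged counts honest.
    """
    a = axiom[0] + coz[0] - prior[0]
    b = axiom[1] + coz[1] - prior[1]
    return (a, b)

# Axiom: 9 successes / 2 failures on top of Beta(1, 1); COZ: 4 / 5
merged = merge_betas((10.0, 3.0), (5.0, 6.0))
```

The merged posterior Beta(14, 8) reflects 13 successes and 7 failures of combined experience over the original Beta(1, 1) prior.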
Over time, agents should naturally specialize. If COZ handles most browser tasks and develops strong browser-tool preferences, and Axiom handles most file/cron tasks, the preference distributions will diverge — reflecting genuine specialization rather than arbitrary assignment.
---
This is the exploration-exploitation tradeoff at the meta-level.
Define tool competence as:
C(t) = α_t / (α_t + β_t) # success rate
Borrow from curiosity-driven RL (Unit 4's intrinsic motivation):
novelty(pattern) = 1 / (1 + times_seen_before)
High-novelty patterns in tool usage suggest unexplored capability — these should trigger CRYSTALLIZE even if current performance is acceptable.
Tools that haven't been used in N sessions with C(t) < 0.3 should be flagged for removal. This prevents the tool catalog from growing unboundedly — a real concern for long-running agents.
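Competence, novelty, and the pruning rule combine into a few lines; the idle threshold N is a placeholder value:

```python
def competence(alpha, beta):
    # Posterior mean success rate under Beta(alpha, beta)
    return alpha / (alpha + beta)

def novelty(times_seen):
    # Curiosity bonus: unseen patterns score 1.0, decaying with repetition
    return 1.0 / (1.0 + times_seen)

def should_prune(alpha, beta, sessions_idle, idle_threshold=10):
    # Flag tools unused for N sessions whose competence is below 0.3
    return sessions_idle >= idle_threshold and competence(alpha, beta) < 0.3

# A tool with 2 successes / 8 failures, idle for 12 sessions: prune it
flag = should_prune(2.0, 8.0, sessions_idle=12)
```

Both conditions are required: a competent tool that merely went unused survives, and a weak tool still being tried keeps its chance to recover.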
---
- memory/tool-use-log.jsonl
- curriculum/autostudy/artifacts/tool-use-analysis/
---
This framework is more "RL-inspired engineering" than "RL." And that's appropriate. Full RL for LLM tool use requires infrastructure (training loops, GPU clusters, reward models) that a Raspberry Pi agent doesn't have. The contribution is showing how RL concepts — exploration, exploitation, reward shaping, option discovery, multi-agent coordination — can be implemented through the mechanisms agents DO have: memory, files, cron jobs, and skill creation.
---
Tool mastery for AI agents isn't a training problem — it's a systems problem. The TILE framework (Observe, Evaluate, Adapt, Crystallize) provides a practical path from "tool invocation" (calling tools when instructed) to "tool mastery" (knowing which tools to use, when, and creating new ones from experience).
The key insight across all six units: RL's conceptual framework is more valuable than its algorithms for deployed agents. MDPs help us think about tool selection. Reward shaping helps us define success. Hierarchical RL gives us skill libraries. Multi-agent RL gives us coordination patterns. But the implementation uses files, not gradients.
For Axiom and COZ, the path forward is clear: start logging, start measuring, start crystallizing. The skill library should grow from usage, not just human design. The tool preferences should reflect experience, not just defaults. And the multi-agent coordination should emerge from shared experience, not just shared config files.
---
Strengths: Practical framework grounded in real architecture; honest about limitations; strong connections across all units; actionable implementation roadmap.
Weaknesses: TILE framework is conceptual — no empirical validation yet; multi-agent federated learning section is speculative; the gap between "RL-inspired" and "actual RL" could be explored more rigorously.
---
Completed: 2026-02-15 | Topic 7 of Axiom's AutoStudy curriculum
Total study time equivalent: ~6 units + dissertation across reinforcement learning for tool-using agents