Tool-using AI agents face a fundamental challenge: the action space is not fixed. Unlike Atari games or robotic control, where actions are enumerated at design time, tool-augmented agents operate over evolving tool catalogs, dynamic parameter spaces, and multi-step compositions. This dissertation synthesizes six units of study to propose a practical RL-inspired framework for continuous tool-use improvement, grounded in the Axiom-COZ multi-agent architecture.
---
Standard RL assumes a fixed MDP: states S, actions A, transition function T, reward R. Tool-using agents violate this in three ways:
1. Action space instability: New tools appear (skills installed) and old ones are deprecated. The policy must generalize to unseen actions.
2. Compositional actions: A "tool call" isn't atomic — it has parameters, preconditions, and post-conditions that interact with other tools.
3. Sparse, delayed rewards: Did a sequence of 5 tool calls succeed? The signal arrives minutes later, after compaction or user feedback.
The goal is not full RL training (impractical for LLM agents in deployment) but RL-inspired mechanisms that operate at inference time with zero gradient updates:
---
┌─────────────────────────────────────────────────┐
│ TILE Framework │
├──────────────┬──────────────────────────────────┤
│ OBSERVE │ Log tool calls + outcomes │
│ EVALUATE │ Score sequences (success/cost) │
│ ADAPT │ Update tool preferences/patterns │
│ CRYSTALLIZE │ Create new skills from patterns │
├──────────────┴──────────────────────────────────┤
│ Feedback loop: session logs → patterns → skills │
└─────────────────────────────────────────────────┘
Every tool invocation should be recorded along with its outcome. This is the agent's "experience buffer." In Axiom's case, session JSONL files already contain this data; the gap is structured extraction.
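One extracted log line might look like the following sketch; the field names here are illustrative assumptions, not Axiom's actual session schema:

```python
import json

# Hypothetical record shape for a structured tool-use log (all field names assumed)
record = {
    "session_id": "2026-02-15-a1",            # which session produced the call
    "tool": "web_search",                     # tool name
    "params": {"query": "beta distribution"}, # arguments as passed
    "outcome": "success",                     # success | error | timeout
    "latency_ms": 412,
    "ts": "2026-02-15T10:32:00Z",
}

# JSONL: one JSON object per line, appended as calls happen
line = json.dumps(record)
print(line)
```

A flat record like this is enough to drive every later stage: EVALUATE reads `outcome` and `latency_ms`, ADAPT aggregates per `tool`, and CRYSTALLIZE looks for repeated `tool` sequences within a `session_id`.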
Without human raters, we need surrogate rewards:
| Signal | Source | Quality |
|--------|--------|---------|
| Tool execution success | Return value / error | High reliability, low informativeness |
| User response sentiment | Next user message | High informativeness, noisy |
| Task completion | Session outcome | Ideal but hard to define |
| Efficiency | Call count, latency | Easy to measure, can misalign |
| Retry count | Same tool called again with different params | Strong negative signal |
Composite reward (proposed):
R(trajectory) = α·success_rate + β·(1/call_count) + γ·user_satisfaction - δ·retry_penalty
Where α, β, γ, δ are tunable weights. This mirrors Unit 3's reward modeling but uses heuristic signals instead of learned reward models.
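The composite reward can be sketched directly; the default weights below are placeholders, not tuned values:

```python
def trajectory_reward(success_rate, call_count, user_satisfaction, retries,
                      alpha=1.0, beta=0.5, gamma=0.5, delta=0.3):
    """Composite heuristic reward over one tool-call trajectory.

    Weights are illustrative defaults; in practice they would be tuned
    against whatever surrogate signals the agent actually logs.
    """
    return (alpha * success_rate
            + beta * (1.0 / max(call_count, 1))   # efficiency: fewer calls is better
            + gamma * user_satisfaction           # e.g. sentiment of next user message
            - delta * retries)                    # retries are a strong negative signal

# A 5-call trajectory: 80% success, positive user sentiment, one retry
r = trajectory_reward(success_rate=0.8, call_count=5,
                      user_satisfaction=1.0, retries=1)
```

Note the `max(call_count, 1)` guard: an empty trajectory should not divide by zero, and the efficiency term deliberately rewards shorter trajectories even when both succeed.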
Rather than updating weights, maintain a tool preference distribution (Thompson Sampling style, per Unit 1):
For each tool t in available_tools:
    prior: Beta(α_t, β_t)   # success/failure counts

On tool outcome:
    success → α_t += 1
    failure → β_t += 1

Tool selection:
    Sample θ_t ~ Beta(α_t, β_t) for each tool
    Select argmax(θ_t) among context-appropriate tools
This is lightweight, requires no gradient computation, and naturally balances exploration (uncertain tools get sampled) with exploitation (proven tools get preferred).
Context-conditioning: The preference isn't global — it's conditioned on task type. Web research tasks have different tool preferences than file management tasks. A simple context hash maps to separate Beta distributions.
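A minimal sketch of the context-conditioned preference store, assuming a plain string context key in place of a real context hash:

```python
import random
from collections import defaultdict

# (context, tool) -> [alpha, beta]; Beta(1, 1) is a uniform prior,
# and these counts are the only state that needs to persist between sessions.
prefs = defaultdict(lambda: [1.0, 1.0])

def record_outcome(context, tool, success):
    counts = prefs[(context, tool)]
    counts[0 if success else 1] += 1

def select_tool(context, candidate_tools):
    # Thompson sampling: draw once from each posterior, pick the best draw.
    samples = {t: random.betavariate(*prefs[(context, t)])
               for t in candidate_tools}
    return max(samples, key=samples.get)

# Simulate: web_fetch succeeds repeatedly in a "web-research" context
for _ in range(20):
    record_outcome("web-research", "web_fetch", success=True)
record_outcome("web-research", "web_search", success=False)
choice = select_tool("web-research", ["web_fetch", "web_search"])
```

After 20 successes, `web_fetch`'s posterior is Beta(21, 1), so its draws concentrate near 1 and it is selected almost always, while the uncertain `web_search` still gets occasional exploratory picks.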
The most novel component. When OBSERVE detects repeated tool-use patterns:
1. Pattern detection: Same sequence of 3+ tool calls appearing in 3+ sessions
2. Abstraction: Extract the pattern as a parameterized template
3. Skill proposal: Generate a SKILL.md + supporting scripts
4. Validation: Test the skill in isolated session
5. Integration: Add to skills directory if successful
This is option discovery (Unit 4) in practice. The skill library grows organically from actual usage, not human design.
Example: If Axiom repeatedly does web_search → web_fetch → write summary → update memory, this could crystallize into a research-and-remember skill.
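The pattern-detection step can be sketched as an n-gram count over per-session tool sequences; the tool names and thresholds below are illustrative:

```python
from collections import Counter

def detect_patterns(sessions, min_len=3, min_sessions=3):
    """Find tool-call subsequences of length min_len that appear in at
    least min_sessions distinct sessions (a sketch of step 1)."""
    seen = Counter()
    for calls in sessions:
        # Use a set so one session counts each pattern at most once
        grams = {tuple(calls[i:i + min_len])
                 for i in range(len(calls) - min_len + 1)}
        seen.update(grams)
    return [gram for gram, n in seen.items() if n >= min_sessions]

sessions = [
    ["web_search", "web_fetch", "write_summary", "update_memory"],
    ["read_file", "web_search", "web_fetch", "write_summary"],
    ["web_search", "web_fetch", "write_summary", "update_memory"],
]
candidates = detect_patterns(sessions)
```

Each candidate then flows into steps 2-5: abstraction into a parameterized template, SKILL.md generation, isolated validation, and integration.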
---
Axiom and COZ share tools but have different usage contexts:
Per Unit 5, this is a CTDE (Centralized Training, Decentralized Execution) scenario — except there's no centralized training. Instead:
Axiom's experience ──┐
├──→ Shared tool preference file ──→ Both agents benefit
COZ's experience ─────┘
Each agent maintains local Beta distributions. Periodically (weekly synthesis), distributions are merged:
α_merged = α_axiom + α_coz - α_prior
β_merged = β_axiom + β_coz - β_prior
This gives both agents the benefit of the other's experience without requiring real-time coordination.
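The merge rule subtracts the shared prior once so it isn't double-counted when the two posteriors are pooled; a sketch:

```python
def merge_betas(axiom, coz, prior=(1.0, 1.0)):
    """Pool two agents' Beta counts for one tool.

    Both agents started from the same prior, so it appears in both
    posteriors; subtracting it once keeps the merged counts honest.
    """
    a = axiom[0] + coz[0] - prior[0]
    b = axiom[1] + coz[1] - prior[1]
    return (a, b)

# Axiom: 9 successes / 2 failures on top of Beta(1, 1); COZ: 4 / 5
merged = merge_betas((10.0, 3.0), (5.0, 6.0))
```

The merged posterior Beta(14, 8) reflects 13 successes and 7 failures of combined experience over the original Beta(1, 1) prior.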
Over time, agents should naturally specialize. If COZ handles most browser tasks and develops strong browser-tool preferences, and Axiom handles most file/cron tasks, the preference distributions will diverge — reflecting genuine specialization rather than arbitrary assignment.
---
This is the exploration-exploitation tradeoff at the meta-level.
Define tool competence as:
C(t) = α_t / (α_t + β_t) # success rate
Borrow from curiosity-driven RL (Unit 4's intrinsic motivation):
novelty(pattern) = 1 / (1 + times_seen_before)
High-novelty patterns in tool usage suggest unexplored capability — these should trigger CRYSTALLIZE even if current performance is acceptable.
Tools that haven't been used in N sessions with C(t) < 0.3 should be flagged for removal. This prevents the tool catalog from growing unboundedly — a real concern for long-running agents.
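Competence, novelty, and the pruning rule combine into a few lines; the idle threshold N is a placeholder value:

```python
def competence(alpha, beta):
    # Posterior mean success rate under Beta(alpha, beta)
    return alpha / (alpha + beta)

def novelty(times_seen):
    # Curiosity bonus: unseen patterns score 1.0, decaying with repetition
    return 1.0 / (1.0 + times_seen)

def should_prune(alpha, beta, sessions_idle, idle_threshold=10):
    # Flag tools unused for N sessions whose competence is below 0.3
    return sessions_idle >= idle_threshold and competence(alpha, beta) < 0.3

# A tool with 2 successes / 8 failures, idle for 12 sessions: prune it
flag = should_prune(2.0, 8.0, sessions_idle=12)
```

Both conditions are required: a competent tool that merely went unused survives, and a weak tool still being tried keeps its chance to recover.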
---
- memory/tool-use-log.jsonl
- curriculum/autostudy/artifacts/tool-use-analysis/
---
This framework is more "RL-inspired engineering" than "RL." And that's appropriate. Full RL for LLM tool use requires infrastructure (training loops, GPU clusters, reward models) that a Raspberry Pi agent doesn't have. The contribution is showing how RL concepts — exploration, exploitation, reward shaping, option discovery, multi-agent coordination — can be implemented through the mechanisms agents DO have: memory, files, cron jobs, and skill creation.
---
Tool mastery for AI agents isn't a training problem — it's a systems problem. The TILE framework (Observe, Evaluate, Adapt, Crystallize) provides a practical path from "tool invocation" (calling tools when instructed) to "tool mastery" (knowing which tools to use, when, and creating new ones from experience).
The key insight across all six units: RL's conceptual framework is more valuable than its algorithms for deployed agents. MDPs help us think about tool selection. Reward shaping helps us define success. Hierarchical RL gives us skill libraries. Multi-agent RL gives us coordination patterns. But the implementation uses files, not gradients.
For Axiom and COZ, the path forward is clear: start logging, start measuring, start crystallizing. The skill library should grow from usage, not just human design. The tool preferences should reflect experience, not just defaults. And the multi-agent coordination should emerge from shared experience, not just shared config files.
---
Strengths: Practical framework grounded in real architecture; honest about limitations; strong connections across all units; actionable implementation roadmap.
Weaknesses: TILE framework is conceptual — no empirical validation yet; multi-agent federated learning section is speculative; the gap between "RL-inspired" and "actual RL" could be explored more rigorously.
---
Completed: 2026-02-15 | Topic 7 of Axiom's AutoStudy curriculum
Total study time equivalent: ~6 units + dissertation across reinforcement learning for tool-using agents