Dissertation: From Tool Invocation to Tool Mastery: An RL Framework for Continuous Agent Improvement
Abstract
Tool-using AI agents face a fundamental challenge: the action space is not fixed. Unlike Atari games or robotic control, where actions are enumerated at design time, tool-augmented agents operate over evolving tool catalogs, dynamic parameter spaces, and multi-step compositions. This dissertation synthesizes six units of study to propose a practical RL-inspired framework for continuous tool-use improvement, grounded in the Axiom-COZ multi-agent architecture.
1. The Problem: Tool Use as Open-Ended RL
1.1 Why Standard RL Falls Short
Standard RL assumes a fixed MDP: states S, actions A, transition function T, reward R. Tool-using agents violate this in three ways:
- Action space instability: New tools appear (skills installed), old ones deprecate. The policy must generalize to unseen actions.
- Compositional actions: A "tool call" isn't atomic; it has parameters, preconditions, and post-conditions that interact with other tools.
- Sparse, delayed rewards: Did a sequence of 5 tool calls succeed? The signal arrives minutes later, after compaction or user feedback.
1.2 What We Need Instead
Not full RL training (impractical for LLM agents in deployment), but RL-inspired mechanisms that can operate at inference time with zero gradient updates:
- Implicit reward signals from tool execution outcomes
- Experience replay via memory systems (not replay buffers, but actual memory files)
- Policy improvement through prompt engineering and skill creation (not weight updates)
2. The Framework: TILE (Tool-use Improvement through Lightweight Experience)
2.1 Four Components
┌────────────────────────────────────────────────┐
│                 TILE Framework                 │
├─────────────┬──────────────────────────────────┤
│ OBSERVE     │ Log tool calls + outcomes        │
│ EVALUATE    │ Score sequences (success/cost)   │
│ ADAPT       │ Update tool preferences/patterns │
│ CRYSTALLIZE │ Create new skills from patterns  │
├─────────────┴──────────────────────────────────┤
│ Feedback loop: session logs → patterns → skills│
└────────────────────────────────────────────────┘
2.2 OBSERVE: Structured Tool-Use Logging
Every tool invocation records:
- Context: What triggered the call (user request, heartbeat, sub-agent)
- Selection rationale: Why this tool over alternatives (implicit in prompt)
- Parameters: What was passed
- Outcome: Success/failure, latency, result quality
- Sequence position: Was this tool call 1 of 1, or step 3 of 7?
This is the agent's "experience buffer." In Axiom's case, session JSONL files already contain this; the gap is structured extraction.
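As a sketch of that extraction gap, assuming a hypothetical JSONL record shape (the `type`, `tool`, `params`, and `status` field names are illustrative, not Axiom's actual schema):

```python
import json

def extract_tool_events(jsonl_lines):
    """Pull (context, tool, params, outcome) tuples from raw session lines."""
    events = []
    for line in jsonl_lines:
        rec = json.loads(line)
        if rec.get("type") != "tool_call":  # skip messages, heartbeats, etc.
            continue
        events.append({
            "context": rec.get("context", "unknown"),
            "tool": rec["tool"],
            "params": rec.get("params", {}),
            "outcome": "success" if rec.get("status") == "ok" else "failure",
        })
    return events

# Illustrative session lines, not real Axiom logs
session = [
    '{"type": "tool_call", "tool": "web_search", "status": "ok", "context": "user_request"}',
    '{"type": "message", "text": "unrelated chatter"}',
    '{"type": "tool_call", "tool": "web_fetch", "status": "error"}',
]
events = extract_tool_events(session)
```

The point is that the experience buffer is a derived view over logs the agent already writes, not a new storage system.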
2.3 EVALUATE: Surrogate Reward Signals
Without human raters, we need surrogate rewards:
| Signal | Source | Quality |
|---|---|---|
| Tool execution success | Return value / error | High reliability, low informativeness |
| User response sentiment | Next user message | High informativeness, noisy |
| Task completion | Session outcome | Ideal but hard to define |
| Efficiency | Call count, latency | Easy to measure, can misalign |
| Retry count | Same tool called again with different params | Strong negative signal |
Composite reward (proposed):
R(trajectory) = α·success_rate + β·(1/call_count) + γ·user_satisfaction - δ·retry_penalty
where α, β, γ, δ are tunable weights. This mirrors Unit 3's reward modeling but uses heuristic signals instead of learned reward models.
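A minimal sketch of the composite reward; the default weight values are placeholders for illustration, not tuned settings:

```python
def composite_reward(success_rate, call_count, user_satisfaction, retry_count,
                     alpha=1.0, beta=0.5, gamma=1.0, delta=0.25):
    """R(trajectory) = α·success_rate + β·(1/call_count)
                       + γ·user_satisfaction − δ·retry_penalty."""
    return (alpha * success_rate
            + beta * (1.0 / max(call_count, 1))   # guard against empty trajectories
            + gamma * user_satisfaction
            - delta * retry_count)

# e.g. 4 calls, 80% success, mildly positive user signal, one retry
r = composite_reward(success_rate=0.8, call_count=4,
                     user_satisfaction=0.5, retry_count=1)
```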
2.4 ADAPT: Bayesian Tool Preference Updates
Rather than updating weights, maintain a tool preference distribution (Thompson Sampling style, per Unit 1):
For each tool t in available_tools:
    prior: Beta(α_t, β_t)  # success/failure counts
On tool outcome:
    success → α_t += 1
    failure → β_t += 1
Tool selection:
    sample θ_t ~ Beta(α_t, β_t) for each tool
    select argmax(θ_t) among context-appropriate tools
This is lightweight, requires no gradient computation, and naturally balances exploration (uncertain tools get sampled) with exploitation (proven tools get preferred).
Context-conditioning: The preference isn't global β it's conditioned on task type. Web research tasks have different tool preferences than file management tasks. A simple context hash maps to separate Beta distributions.
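A stdlib-only sketch of context-conditioned Thompson Sampling; the class name and the `(context, tool)` keying scheme are assumptions for illustration:

```python
import random
from collections import defaultdict

class ToolPreferences:
    """Per-context Beta(α, β) tool preferences with Thompson Sampling."""

    def __init__(self):
        # (context, tool) -> [α, β]; Beta(1, 1) is a uniform prior
        self.counts = defaultdict(lambda: [1.0, 1.0])

    def update(self, context, tool, success):
        ab = self.counts[(context, tool)]
        ab[0 if success else 1] += 1  # success bumps α, failure bumps β

    def select(self, context, candidate_tools):
        # Sample θ_t ~ Beta(α_t, β_t) per tool, pick the argmax
        samples = {t: random.betavariate(*self.counts[(context, t)])
                   for t in candidate_tools}
        return max(samples, key=samples.get)

prefs = ToolPreferences()
for _ in range(50):
    prefs.update("web_research", "web_search", success=True)
    prefs.update("web_research", "web_fetch", success=False)
```

Because uncertain tools keep wide Beta distributions, they occasionally sample high and get tried; proven tools sample high consistently and dominate.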
2.5 CRYSTALLIZE: Skill Discovery (The Voyager Pattern)
The most novel component. When OBSERVE detects repeated tool-use patterns:
- Pattern detection: Same sequence of 3+ tool calls appearing in 3+ sessions
- Abstraction: Extract the pattern as a parameterized template
- Skill proposal: Generate a SKILL.md + supporting scripts
- Validation: Test the skill in isolated session
- Integration: Add to skills directory if successful
This is option discovery (Unit 4) in practice. The skill library grows organically from actual usage, not human design.
Example: If Axiom repeatedly does web_search β web_fetch β write summary β update memory, this could crystallize into a research-and-remember skill.
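The pattern-detection step can be sketched as n-gram counting over per-session tool-call sequences, counting each pattern at most once per session so the 3+ sessions rule holds; thresholds mirror the rule above:

```python
from collections import Counter

def detect_patterns(sessions, min_len=3, min_sessions=3):
    """Return tool-call subsequences of length >= min_len that appear
    in at least min_sessions distinct sessions."""
    seen_in = Counter()
    for calls in sessions:
        session_ngrams = set()  # dedupe within a session
        for n in range(min_len, len(calls) + 1):
            for i in range(len(calls) - n + 1):
                session_ngrams.add(tuple(calls[i:i + n]))
        seen_in.update(session_ngrams)  # +1 per session, not per occurrence
    return [p for p, count in seen_in.items() if count >= min_sessions]

# Illustrative: the research-and-remember sequence recurring across sessions
sessions = [["web_search", "web_fetch", "write_summary", "update_memory"]] * 3
patterns = detect_patterns(sessions)
```

A detected pattern then becomes a crystallization candidate: abstract its parameters, draft a SKILL.md, and validate in isolation.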
3. Multi-Agent Considerations (Axiom-COZ)
3.1 The Shared Tool Problem
Axiom and COZ share tools but have different usage contexts:
- Axiom: Always-on, background tasks, monitoring, study
- COZ: Interactive, user-facing, complex coordination
Per Unit 5, this is a CTDE (Centralized Training, Decentralized Execution) scenario, except that there is no centralized training. Instead:
3.2 Federated Tool Learning
Axiom's experience ──┐
                     ├──→ Shared tool preference file ──→ Both agents benefit
COZ's experience ────┘
Each agent maintains local Beta distributions. Periodically (weekly synthesis), distributions are merged:
α_merged = α_axiom + α_coz - α_prior
β_merged = β_axiom + β_coz - β_prior
This gives both agents the benefit of the other's experience without requiring real-time coordination.
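The merge rule above, as a sketch that assumes both agents started from the same shared prior (Beta(1, 1) here) so it is subtracted once rather than double-counted:

```python
def merge_beta(alpha_a, beta_a, alpha_b, beta_b,
               alpha_prior=1.0, beta_prior=1.0):
    """Merge two agents' Beta posteriors over the same tool,
    removing the shared prior once so evidence isn't double-counted."""
    return (alpha_a + alpha_b - alpha_prior,
            beta_a + beta_b - beta_prior)

# Axiom: 5 successes, 1 failure on top of Beta(1,1) -> (6, 2)
# COZ:   2 successes, 4 failures on top of Beta(1,1) -> (3, 5)
merged = merge_beta(6, 2, 3, 5)
```

This runs fine as a weekly cron job over a shared preference file; no real-time coordination is needed.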
3.3 Emergent Specialization
Over time, agents should naturally specialize. If COZ handles most browser tasks and develops strong browser-tool preferences, and Axiom handles most file/cron tasks, the preference distributions will diverge, reflecting genuine specialization rather than arbitrary assignment.
4. When to Learn New Tools vs. Optimize Existing Ones
This is the exploration-exploitation tradeoff at the meta-level.
4.1 The Competence Threshold
Define tool competence as:
C(t) = α_t / (α_t + β_t)  # posterior mean success rate
- C(t) < 0.5: Tool needs more practice or documentation improvement
- C(t) > 0.8: Tool is well-understood; optimization has diminishing returns
- All tools C(t) > 0.7: Time to explore NEW tools (the "exploit plateau")
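The threshold bullets above can be sketched as a small meta-policy; the 0.5 and 0.7 cutoffs follow the thresholds stated, and the return format is an illustrative choice:

```python
def competence(alpha, beta):
    """C(t) = α_t / (α_t + β_t), the posterior mean success rate."""
    return alpha / (alpha + beta)

def meta_policy(tool_counts, practice_threshold=0.5, explore_threshold=0.7):
    """tool_counts: tool -> (α, β). Decide practice vs. explore vs. optimize."""
    weak = [t for t, (a, b) in tool_counts.items()
            if competence(a, b) < practice_threshold]
    if weak:
        return ("practice", weak)  # some tools need practice or better docs
    if all(competence(a, b) > explore_threshold
           for a, b in tool_counts.values()):
        return ("explore_new_tools", [])  # the "exploit plateau"
    return ("optimize", [])
```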
4.2 Novelty Bonus for Skill Creation
Borrow from curiosity-driven RL (Unit 4's intrinsic motivation):
novelty(pattern) = 1 / (1 + times_seen_before)
High-novelty patterns in tool usage suggest unexplored capability β these should trigger CRYSTALLIZE even if current performance is acceptable.
4.3 The Retirement Signal
Tools that haven't been used in N sessions with C(t) < 0.3 should be flagged for removal. This prevents the tool catalog from growing unboundedly, a real concern for long-running agents.
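A sketch combining the novelty bonus with the retirement rule; the idle-session threshold N and the per-tool stats layout are illustrative assumptions:

```python
def novelty(times_seen_before):
    """Intrinsic bonus: 1 / (1 + times_seen_before)."""
    return 1.0 / (1.0 + times_seen_before)

def flag_for_retirement(tool_stats, max_idle_sessions=20, min_competence=0.3):
    """tool_stats: tool -> (α, β, sessions_since_last_use).
    Flag tools that are both stale and low-competence."""
    return [tool for tool, (a, b, idle) in tool_stats.items()
            if idle >= max_idle_sessions and a / (a + b) < min_competence]

stale = flag_for_retirement({
    "old_scraper": (1, 4, 25),   # C = 0.2, idle 25 sessions -> flag
    "web_search":  (8, 2, 1),    # C = 0.8, in active use -> keep
})
```

Flagging rather than auto-deleting keeps a human (or a review pass) in the loop before the catalog shrinks.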
5. Implementation Roadmap for Axiom
Phase 1: Observation (Week 1-2)
- Add structured tool-use logging to session processing
- Extract (context, tool, params, outcome) tuples from existing JSONL files
- Store in `memory/tool-use-log.jsonl`
Phase 2: Evaluation (Week 3-4)
- Implement surrogate reward computation
- Generate weekly tool-use reports (which tools succeed, which fail, which are overused)
- Store in `curriculum/autostudy/artifacts/tool-use-analysis/`
Phase 3: Adaptation (Month 2)
- Implement Beta distribution preference tracking
- Context-conditioned tool selection hints in system prompts
- A/B test: default selection vs. preference-guided selection
Phase 4: Crystallization (Month 3+)
- Pattern detection over tool-use logs
- Automated skill proposal generation
- Human-in-the-loop validation (the-operator approves/rejects proposed skills)
6. Limitations and Honest Assessment
What This Framework Can't Do
- No weight updates: We can't actually train the underlying LLM. All "learning" is through prompt engineering, memory, and skill creation.
- No true reward model: Surrogate signals are noisy and potentially misaligned.
- Sample efficiency: LLM agents are expensive to run; we can't do millions of episodes.
What It Can Do
- Compound intelligence: Each session's tool use informs future sessions via memory.
- Grow capabilities: Skill crystallization genuinely expands what the agent can do.
- Reduce waste: Preference tracking reduces unnecessary tool calls over time.
- Scale with usage: more sessions yield sharper preferences and better tool selection, without the retraining loop standard RL would require.
The Honest Truth
This framework is more "RL-inspired engineering" than "RL." And that's appropriate. Full RL for LLM tool use requires infrastructure (training loops, GPU clusters, reward models) that a Raspberry Pi agent doesn't have. The contribution is showing how RL concepts (exploration, exploitation, reward shaping, option discovery, multi-agent coordination) can be implemented through the mechanisms agents DO have: memory, files, cron jobs, and skill creation.
7. Conclusion
Tool mastery for AI agents isn't a training problem β it's a systems problem. The TILE framework (Observe, Evaluate, Adapt, Crystallize) provides a practical path from "tool invocation" (calling tools when instructed) to "tool mastery" (knowing which tools to use, when, and creating new ones from experience).
The key insight across all six units: RL's conceptual framework is more valuable than its algorithms for deployed agents. MDPs help us think about tool selection. Reward shaping helps us define success. Hierarchical RL gives us skill libraries. Multi-agent RL gives us coordination patterns. But the implementation uses files, not gradients.
For Axiom and COZ, the path forward is clear: start logging, start measuring, start crystallizing. The skill library should grow from usage, not just human design. The tool preferences should reflect experience, not just defaults. And the multi-agent coordination should emerge from shared experience, not just shared config files.
Score Self-Assessment: 88/100
Strengths: Practical framework grounded in real architecture; honest about limitations; strong connections across all units; actionable implementation roadmap.
Weaknesses: TILE framework is conceptual β no empirical validation yet; multi-agent federated learning section is speculative; the gap between "RL-inspired" and "actual RL" could be explored more rigorously.
Completed: 2026-02-15 | Topic 7 of Axiom's AutoStudy curriculum
Total study time equivalent: ~6 units + dissertation across reinforcement learning for tool-using agents