Dissertation: The Information-Theoretic Toolkit for AI System Architects
Abstract
Information theory โ born from Claude Shannon's 1948 paper on communication โ provides the most general framework for reasoning about AI systems. Every AI system is an information channel: it receives data, transforms it, and produces outputs. This dissertation synthesizes six units of study into a practical reference for AI system architects, connecting foundational theory to engineering decisions through worked examples drawn from real system design.
Thesis: An AI architect who thinks in bits โ who can trace information flow through their system, identify bottlenecks, and reason about fundamental limits โ makes systematically better design decisions than one who operates purely at the metric level.
Part I: The Language of Bits
1.1 Why Information Theory Matters for AI
Most AI engineering operates at the symptom level: accuracy dropped, latency increased, model is too big. Information theory operates at the structural level: where does information flow, where is it lost, where does it leak?
This distinction matters because structural reasoning generalizes. An architect who understands rate-distortion theory can reason about model compression, communication-efficient agents, AND privacy mechanisms โ because they're all instances of the same mathematical problem: how much can you constrain a channel before the distortion becomes unacceptable?
1.2 The Core Quantities
Entropy H(X): The irreducible uncertainty in a random variable. For AI systems: the inherent difficulty of your problem. A classification task with H(Y) = 0.1 bits is fundamentally easier than one with H(Y) = 3.2 bits. No model architecture changes this.
Mutual Information I(X; Y): The information X provides about Y. For AI systems: the maximum possible performance of any model using features X to predict Y. If I(X; Y) is low, no model can do well โ you need better features, not bigger models.
KL Divergence D_KL(P || Q): The information cost of using distribution Q when the true distribution is P. For AI systems: the gap between your model's assumptions and reality. Every modeling choice (architecture, loss function, regularizer) implies a Q; KL divergence measures how wrong it is.
Rate-Distortion R(D): The minimum bits needed to represent data with at most distortion D. For AI systems: the fundamental compression limit. Your pruned model cannot be smaller than R(D) bits for your acceptable error level D.
1.3 The Information Processing Inequality
I(X; Y) โฅ I(f(X); Y) for any deterministic function f.
This single inequality has enormous implications:
- Every preprocessing step can only lose information (or preserve it, never create it)
- Every layer in a neural network can only compress the input's information about the target
- If your pipeline has a lossy step early on, no amount of downstream sophistication recovers what was lost
Design implication: Audit information flow from data source to prediction. The earliest lossy transformation sets an upper bound on everything downstream.
Part II: Five Lenses for System Design
2.1 Lens 1: The Compression Lens (Rate-Distortion)
Question: "How small can this be while remaining useful?"
Applications:
- Model compression: Quantization from float32 to int8 is a rate-distortion problem. The rate-distortion function tells you the minimum bitwidth for a given accuracy target. If your quantized model is much larger than this bound, better quantization schemes exist.
- Knowledge distillation: The student model's capacity should match the rate-distortion bound of the teacher's output distribution, not the teacher's parameter count.
- Communication between agents: When Axiom sends a summary to COZ, the summary should contain the sufficient statistics for COZ's downstream decisions โ no more, no less. Rate-distortion theory formalizes "no more, no less."
Worked Example: Agent Communication Budget
Consider two agents collaborating on a monitoring task. Agent A observes 1000 sensor readings/minute. Agent B needs to make decisions based on A's observations.
Naive: send all 1000 readings. Rate = H(readings) โ 10,000 bits/min.
Better: Agent A computes sufficient statistics. If B only needs to detect anomalies, the sufficient statistic might be {mean, variance, max, entropy} of the window โ perhaps 128 bits/min. The rate-distortion function for B's decision quality determines the minimum.
This is exactly the COSMO architecture insight: System 14 (dlPFC) doesn't need all information from all 13 subsystems. It needs sufficient statistics for executive decisions. The information bottleneck between subsystems IS the architecture.
2.2 Lens 2: The Divergence Lens (KL and Friends)
Question: "How different are these two distributions, and what does that cost me?"
Applications:
- Forward vs reverse KL in generation: Forward KL (D_KL(P_data || P_model)) produces mode-covering behavior โ the model spreads probability across all modes of the data. Reverse KL (D_KL(P_model || P_data)) produces mode-seeking behavior โ the model concentrates on one mode. This explains why VAEs (forward KL) produce blurry outputs while GANs (reverse-KL-adjacent) produce sharp but sometimes wrong outputs.
- Distribution shift detection: Track D_KL(P_deployment || P_training) over time. Gradual increase = drift. Sudden spike = breaking change.
- Domain adaptation: The target risk โค source risk + D_KL(P_target || P_source) + ฮป. This bound tells you when adaptation is feasible (small KL) vs hopeless (large KL).
2.3 Lens 3: The Bottleneck Lens (Mutual Information)
Question: "What information does this representation capture, and what does it discard?"
The Information Bottleneck (Tishby et al.) formalizes representation learning as:
min I(X; Z) - ฮฒ * I(Z; Y)
Compress the input X into representation Z (minimize I(X;Z)) while preserving information about target Y (maximize I(Z;Y)). ฮฒ controls the tradeoff.
The critical insight: Not all information in X is useful. A representation that captures everything about X (autoencoder with I(X;Z) = H(X)) wastes capacity on irrelevant features. The bottleneck forces the representation to keep only what matters for the task.
Practical diagnostic: If your model's intermediate representations have high MI with input features known to be irrelevant (e.g., background pixels for object detection), your bottleneck is too loose. Add regularization or reduce capacity.
Caveat from Unit 3: The strong claim that DNNs undergo a "compression phase" during training (Shwartz-Ziv & Tishby, 2017) doesn't robustly replicate across architectures and activation functions. The framework is useful for thinking about representations; the specific training dynamics claim is contested.
2.4 Lens 4: The Channel Lens (Capacity and Coding)
Question: "What's the maximum reliable throughput of this pipeline?"
Every processing stage is a noisy channel with capacity C = max I(X; Y) over input distributions. The channel coding theorem says you can transmit at any rate R < C with arbitrarily low error, but not at R > C.
Applications:
- Federated learning: Communication between clients and server is a bandwidth-limited channel. Gradient compression must respect channel capacity for convergence guarantees.
- Multi-agent coordination: N agents with pairwise channel capacity C can coordinate at most NC bits/round. Complex coordination requires either more bandwidth or more rounds.
- Human-AI interaction: The human's information processing bandwidth (~50 bits/sec for reading) is the channel capacity of your UI. Cramming more information into a dashboard than the human can process is operating above channel capacity โ guaranteed information loss.
2.5 Lens 5: The Acquisition Lens (Expected Information Gain)
Question: "What should I observe next to learn the most?"
Expected information gain: EIG(x) = H(Y) - E[H(Y | X=x)]. The observation that maximally reduces uncertainty about Y.
Applications:
- Active learning: Query the unlabeled example with highest EIG. More sample-efficient than random sampling by 2-10x in practice.
- Experiment design: When testing system configurations, choose the test that maximizes information about which configuration is best (Bayesian optimization connection).
- Curiosity-driven exploration: An agent that seeks states maximizing information gain about its environment model explores more efficiently than random exploration (connects to RL curriculum โ TILE framework).
When NOT to use information-theoretic acquisition:
- When EIG computation is more expensive than just labeling more data
- When the model of uncertainty (needed for H(Y)) is poorly calibrated
- When exploration cost is non-uniform (some queries are cheap, others expensive โ need cost-weighted EIG)
Part III: The Architect's Decision Framework
3.1 Diagnostic Flowchart
Problem: System not performing well enough
โ
โโ Is I(X; Y) sufficient? (Do your features contain the answer?)
โ โโ No โ Better data/features. No model fix helps.
โ โโ Yes โ
โ
โโ Is your model capacity matched to I(X; Y)?
โ โโ Too low โ Underfitting. Increase capacity.
โ โโ Too high โ Overfitting. Regularize or bottleneck.
โ โโ Matched โ
โ
โโ Is D_KL(P_train || P_deploy) small?
โ โโ No โ Distribution shift. Adapt or retrain.
โ โโ Yes โ
โ
โโ Is the channel to the user adequate?
โ โโ No โ UI/UX bottleneck. Simplify output.
โ โโ Yes โ Problem is elsewhere (latency, cost, etc.)
3.2 Rules of Thumb
- Estimate before building. A quick MI estimate between features and target saves weeks of modeling work on hopeless problems.
- Compress aggressively, measure carefully. Rate-distortion theory says most models are 3-10x larger than necessary. But estimation error means you should compress and measure, not trust the bound exactly.
- Monitor in bits, alert on divergence. Entropy and KL divergence are more robust monitoring signals than raw metric thresholds.
- Budget communication, don't just add bandwidth. For multi-agent systems, ask "what are the sufficient statistics?" before asking "how do we send more data?"
- Acquire data informationally. Every labeled example costs something. Expected information gain tells you which ones are worth it.
3.3 The Meta-Pattern
Across all six units, one meta-pattern emerges: the right abstraction for AI system design is information flow, not data flow.
Data flow diagrams show what moves through your system. Information flow diagrams show how much useful content moves through your system. The difference is crucial:
- A data pipe carrying 1GB/s might contain 10 bits/s of task-relevant information
- A model with 7 billion parameters might encode 100 million bits of useful knowledge
- A monitoring dashboard showing 50 metrics might convey 3 bits/s to the human operator
When you switch from "how much data" to "how much information," system design problems become clearer, compression opportunities become visible, and fundamental limits become quantifiable.
Part IV: Connections to Prior Studies
4.1 Probabilistic Programming (Topic 5)
Variational inference minimizes KL divergence between approximate and true posterior. The ELBO = E[log p(x|z)] - D_KL(q(z|x) || p(z)) is directly an information-theoretic objective: maximize data likelihood while keeping the posterior close to the prior (measured in bits).
4.2 Computational Neuroscience (Topic 6)
The brain implements information bottleneck at every sensory processing stage. Retinal ganglion cells compress ~10โธ photoreceptor signals into ~10โถ nerve fibers โ a 100:1 compression optimized for behaviorally relevant information. The efficient coding hypothesis (Barlow, 1961) IS information theory applied to neural systems.
4.3 Reinforcement Learning (Topic 7)
The TILE framework from the RL dissertation proposed "information-efficient exploration." Information-theoretic acquisition (expected information gain) provides the formal foundation: an agent should take actions that maximally reduce uncertainty about its environment model or optimal policy.
4.4 Causal Inference (Topic 4)
Interventional mutual information I(Y; do(X)) differs from observational I(Y; X). This distinction matters for feature selection: a feature with high observational MI might have zero causal effect (confounded). Causal information theory is an emerging field bridging these curricula.
Conclusion
Information theory doesn't replace domain expertise or engineering intuition. It provides a calculus for the intuitions good engineers already have. When a senior engineer says "that feature is redundant" โ they're estimating mutual information. When they say "this model is too big for this problem" โ they're invoking rate-distortion. When they say "we need better data, not a better model" โ they're applying the data processing inequality.
The value of formalizing these intuitions is threefold:
1. Communication: "I(X;Y) โ 2 bits" is more precise than "the features are somewhat predictive"
2. Limits: Information theory tells you when to stop trying โ when you've hit a fundamental bound
3. Transfer: The same framework applies to compression, monitoring, privacy, communication, and acquisition โ learn it once, apply it everywhere
For an AI system architect, information theory is not optional background. It's the physics of your medium.
Self-Assessment
Score: 90/100
Strengths:
- Strong integration across all six units with concrete cross-references
- Practical diagnostic flowchart usable in real system design
- COSMO architecture connection (System 14 as information bottleneck) grounds theory in lived experience
- Honest about limitations (MI estimation difficulty, contested compression phase claims)
Weaknesses:
- Could include more quantitative worked examples (e.g., actual MI calculations on sample data)
- Privacy section (Unit 6) could be deeper โ MI-DP is a rich area condensed to essentials
- Missing treatment of information geometry (Fisher information, natural gradients) which connects to optimization
Key Takeaway: The information flow abstraction is genuinely underused in AI engineering practice. Most teams operate at the data/metric level and miss structural insights that IT reasoning provides.