Dissertation — Topic #28: Operating System Internals for Application Developers
---
Every performance bug, every mysterious crash, every "it works on my machine" stems from a misunderstanding of what happens between your write() call and the actual disk platter spinning. Application developers don't need to be kernel developers, but they need a map — a mental model of the terrain their code traverses on every syscall, every allocation, every packet sent. This dissertation provides that map.
---
Your application lives in user space. The kernel lives in kernel space. The syscall interface is the only legal border crossing. Every interaction — opening a file, allocating memory, sending a packet, creating a thread — funnels through ~450 numbered system calls.
This boundary is not free. A syscall on modern x86-64 costs ~100-200ns (via syscall instruction + kernel entry/exit). On ARM64, similar. This means:
- writev() beats ten write() calls
- io_uring exists precisely to amortize this cost (submit batches, reap completions)
- vDSO accelerates read-only calls (gettimeofday, clock_gettime) by mapping kernel data into user space

Decision tree: If your hot path makes >10K syscalls/sec for small operations, you're paying a tax. Consider batching, io_uring, or mmap.
The kernel doesn't see "threads" vs "processes" — it sees task_struct. clone() with different flags creates either. What changes is what's shared: address space, file descriptors, signal handlers.
CFS (Completely Fair Scheduler) maintains a red-black tree of runnable tasks, ordered by vruntime. The task with the smallest vruntime runs next. Nice values scale vruntime accumulation — nice -20 accumulates slowly (gets more CPU), nice 19 accumulates quickly.
Why this matters for apps:
- taskset / sched_setaffinity() pins threads to CPUs — eliminates cache migration
- SCHED_FIFO gives you real-time priority but can starve the system

---
Every process believes it has 128TB of contiguous address space. It doesn't. The kernel maintains page tables mapping virtual → physical, with the TLB caching recent translations. A TLB miss costs ~10-100ns; a page fault (page not in RAM) costs ~1-10ms if it hits disk.
malloc(4096)
→ glibc allocator checks thread-local arena
→ arena has free chunk? Return it (no syscall)
→ arena exhausted? sbrk() or mmap() to get pages from kernel
→ kernel updates page tables (lazy: no physical page yet)
→ first write triggers page fault → kernel allocates physical page
This means malloc often doesn't touch the kernel. And the kernel often doesn't allocate physical memory until you write. Everything is lazy.
| Pattern | Kernel Behavior | App Impact |
|---------|----------------|------------|
| Small allocs (<128KB) | glibc arena (sbrk) | Fast, but fragmentation risk |
| Large allocs (>128KB) | mmap anonymous | Returned to OS on free, but TLB pressure |
| fork() | COW pages shared | Cheap until child writes (then page fault storm) |
| mmap file | Page cache backed | Great for read-heavy, watch out for random write |
| OOM | Kernel kills biggest RSS process | Your daemon is probably the biggest target |
Decision tree: Memory-related slowness? Check /proc/ for RSS vs VSZ. High page faults in perf stat? Your working set exceeds physical RAM or you have poor locality.
---
write(fd, buf, len)
→ VFS: dispatch to filesystem
→ Filesystem (ext4/xfs): journal + metadata
→ Page cache: mark page dirty (return to user — "done")
→ Writeback (flusher) threads: async flush to block device
→ Block layer: merge, schedule (mq-deadline/BFQ/kyber)
→ Device driver → actual disk write
When write() returns, your data is in the page cache, not on disk. It's durable against process crashes (kernel will flush it) but NOT against power loss.
| Call | Guarantees | Cost |
|------|-----------|------|
| write() | In page cache | ~μs |
| fdatasync() | Data on disk, metadata maybe not | ~ms |
| fsync() | Data + metadata on disk | ~ms |
| O_DIRECT + fsync() | Bypasses page cache entirely | Variable |
| O_SYNC | Every write is fsync'd | Expensive |
Decision tree: Database? Use fdatasync() after WAL writes. Log file? Buffered write() is fine. Config file update? Write to temp file + fsync() + rename() (the atomic replace pattern).
io_uring eliminates per-operation syscall overhead:
- io_uring_enter() is only needed to kick the kernel (and can be avoided with SQPOLL)
- Operations go well beyond read/write: fsync, openat, even splice

For high-IOPS workloads (databases, storage engines), io_uring can deliver 2-5x throughput over traditional read/write loops.
---
send(sockfd, buf, len, 0)
→ Socket layer: copy to kernel buffer (sk_buff)
→ TCP: segmentation, sequence numbers, congestion window
→ IP: routing, fragmentation
→ Device driver: DMA to NIC
→ Wire
Key insight: send() returning doesn't mean the peer received it. It means the kernel accepted it into the socket buffer. TCP handles the rest asynchronously.
The fundamental pattern for high-connection-count servers:
int epoll_fd = epoll_create1(0);
struct epoll_event event = { .events = EPOLLIN, .data.fd = listen_fd };
struct epoll_event events[MAX_EVENTS];
epoll_ctl(epoll_fd, EPOLL_CTL_ADD, listen_fd, &event);
while (1) {
    int n = epoll_wait(epoll_fd, events, MAX_EVENTS, -1);
    for (int i = 0; i < n; i++) handle(events[i]);
}
epoll is O(active_connections), not O(total_connections). This is why nginx handles 10K connections on one thread while a thread-per-connection model collapses at 1K.
Every send() copies data from user buffer to kernel sk_buff. For large transfers:
- sendfile() — File → socket without user-space copy
- splice() — Pipe-based zero-copy between fds
- MSG_ZEROCOPY — Avoid copy on send (kernel pins user pages)

Decision tree: Serving static files? sendfile(). Proxying data? splice(). Sending large buffers repeatedly? MSG_ZEROCOPY (but watch for completion notification overhead).
---
Containers are not VMs. They're processes with restricted views:
| Namespace | What It Isolates |
|-----------|-----------------|
| PID | Process ID space (container sees its own PID 1) |
| NET | Network stack (own interfaces, routing, iptables) |
| MNT | Filesystem mounts (own root via pivot_root) |
| UTS | Hostname |
| IPC | System V IPC, POSIX message queues |
| USER | UID/GID mapping (root in container ≠ root on host) |
| CGROUP | cgroup visibility |
cgroups v2 enforces limits:
- cpu.max — CPU bandwidth (e.g., 100000 100000 = 1 CPU)
- memory.max — Hard memory limit (OOM kill if exceeded)
- io.max — Block I/O bandwidth limits

seccomp-bpf filters syscalls — a container typically blocks dangerous calls like mount, reboot, kexec_load.
The practical lesson: Container overhead is near-zero for CPU/memory (it's just cgroups + namespaces). The overhead is in networking (veth + bridge + iptables NAT) and storage (OverlayFS copy-up on write).
---
| User-Space Primitive | Kernel Mechanism | Typical Cost |
|---------------------|-----------------|--------------|
| pthread_mutex | futex (fast user-space, kernel fallback) | ~25ns uncontended, ~μs contended |
| pthread_rwlock | futex-based | ~30ns uncontended |
| sem_wait | futex | ~25ns uncontended |
| atomic operations | CPU instructions only | ~5-20ns |
| pthread_spinlock | CPU spin loop | Near-zero uncontended, catastrophic contended |
The futex (fast userspace mutex) is the key optimization: in the uncontended case, locking is a single atomic compare-and-swap in user space — no syscall. Only when contention occurs does the thread call futex(FUTEX_WAIT) to sleep in the kernel.
This means: lock profiling in user space often misses the cost. The real cost is in the contended path — context switches, cache line bouncing, and scheduling delays. Use perf to trace futex syscalls or bpftrace to measure contention time.
x86 has a strong memory model (Total Store Order) — stores are seen in order by other CPUs. ARM64 has a weak model — you need explicit barriers (dmb, dsb) or acquire/release semantics.
Practical impact: Lock-free code that works on x86 may break on ARM64 (like the Raspberry Pi). Always use atomic operations with proper memory ordering, never raw loads/stores for shared data.
---
Performance problem
├── Which resource is constrained?
│ ├── Check PSI: cat /proc/pressure/{cpu,memory,io}
│ ├── CPU → perf top → flame graph
│ ├── Memory → /proc/<pid>/smaps + perf stat cache-misses
│ ├── I/O → iostat -x + bpftrace VFS latency
│ └── Network → ss -tnp + bpftrace TCP retransmits
├── Is it on-CPU or off-CPU?
│ ├── On-CPU → perf record + flame graph
│ └── Off-CPU → offcputime + off-CPU flame graph
└── Intermittent?
└── eBPF continuous tracing with conditional recording
1. PSI first — Is the system actually contended? avg10 > 10 = yes
2. perf stat — Hardware counter overview in 10 seconds
3. Targeted bpftrace — Ask specific questions ("what's the read latency distribution?")
4. Continuous eBPF — Deploy in production for ongoing visibility
No tools installed? /proc is always there:
# What's this process doing right now?
cat /proc/<pid>/stack # kernel stack
cat /proc/<pid>/wchan # what it's waiting on
ls -la /proc/<pid>/fd/ | wc -l # open file count
cat /proc/<pid>/io # total I/O counters
cat /proc/<pid>/status | grep Vm # memory usage
---
┌─────────────────────────────────────────────────┐
│ YOUR APPLICATION │
├─────────────────────────────────────────────────┤
│ malloc/free │ read/write │ send/recv │
│ ↕ │ ↕ │ ↕ │
│ glibc arena │ VFS layer │ socket layer │
│ ↕ │ ↕ │ ↕ │
│ brk/mmap │ page cache │ TCP/IP stack │
│ ↕ │ ↕ │ ↕ │
│ page tables │ block layer │ netfilter │
│ ↕ │ ↕ │ ↕ │
│ physical RAM │ disk/SSD │ NIC │
├─────────────────────────────────────────────────┤
│ CFS scheduler │ cgroups │ namespaces │
│ (who runs) │ (how much) │ (what's seen) │
├─────────────────────────────────────────────────┤
│ eBPF / perf / ftrace / /proc │
│ (observability layer) │
└─────────────────────────────────────────────────┘
Every box in this diagram is a place where performance can be lost, bugs can hide, and understanding pays dividends. The application developer doesn't need to modify any of these kernel subsystems — but knowing they exist, knowing their contracts, and knowing how to observe them transforms debugging from guesswork into engineering.
---
1. Slow app? → PSI → identify resource → targeted tool
2. Crash? → Core dump + bt, or eBPF signal tracing for intermittent
3. I/O durability? → Choose from the durability ladder based on failure tolerance
4. High connections? → epoll + non-blocking, never thread-per-connection
5. Container overhead? → It's almost always the network layer, not CPU/memory
6. Lock contention? → futex tracing, off-CPU analysis, consider lock-free
7. Works on x86, breaks on ARM? → Memory ordering. Use proper atomics.
---