The OpenBSD virtual memory layer is known as UVM. Long, long ago it was the original BSD VM (with parts from Mach at CMU), but it was mostly replaced with UVM by Chuck Cranor. More of its history and a detailed description can be found in Cranor's USENIX paper, The UVM Virtual Memory System.
Since then, we’ve evolved it in various ways. Unfortunately, some of the changes I made long ago have a tendency to devolve because the features introduced are subtle. This is the history of one feature in particular, its various regressions, and their fixes.
UVM manages two resources: virtual address space and physical memory pages. Usually you need both, but sometimes you only need one or the other. If a needed resource isn't available, the typical solution is to sleep until it is, but only in process context. In an interrupt, that's not possible. Access to a vmspace is serialized with locks, which can also put a process to sleep (and which therefore must also not be acquired in interrupt context).
On top of these constraints, UVM is very recursive in strange ways. Many of the UVM data structures are allocated via pool, which in turn requests memory from UVM. The per-architecture pmap layer, which manages virtual to physical mappings, also allocates page table entries from pool. In not even the worst case, faulting on a copy-on-write page may require UVM to allocate a new map entry, which requires pool to allocate a new pool page, which requires UVM to first find another page and then to enter it into the pmap, which starts all over again back in pool to get a PTE.
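The shape of that recursion can be sketched in miniature. This is not the real kernel code; the functions below are hypothetical stand-ins for pool_get, uvm_km_getpage, and pmap_enter, with a depth counter to make the re-entry visible.

```c
#include <stddef.h>

static int depth, maxdepth;

static void uvm_getpage(void);
static void pmap_enter(void);

/* pool hands out fixed-size items; when empty it needs a
   fresh page from UVM */
static void pool_get(void)
{
	if (++depth > maxdepth)
		maxdepth = depth;
	uvm_getpage();
	depth--;
}

/* UVM finds a free physical page, then must map it */
static void uvm_getpage(void)
{
	pmap_enter();
}

/* entering a mapping may need a PTE... allocated from a pool,
   so we're back where we started */
static void pmap_enter(void)
{
	static int have_pte;	/* pretend the PTE pool is empty once */
	if (!have_pte) {
		have_pte = 1;
		pool_get();
	}
}
```

A single call to pool_get here re-enters pool_get once before unwinding. In the real kernel the cycle is broken by bootstrap reserves and careful ordering, not by luck.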
Because such operations have a tendency to go sideways exactly when it’s most important for them to succeed, various kernel systems evolved coping mechanisms such as preallocating all the necessary resources. This is both wasteful and limiting. Memory reserved for network buffers can’t be used to play fun 3D games, even if there’s no network traffic. On routers, the situation is reversed. The fixed amount of memory reserved for network buffers leads to dropped packets until one reboots with an increased reserve.
The solution I cooked up to not quite solve but alleviate many of these problems was the km_thread. This is a background kernel thread that allocates pages of memory and keeps them on a freelist, ready to use. This makes the uvm_km_getpage function very simple and very fast: it takes a page off the list and it's done. Once the list runs low, the thread wakes up and does whatever work is necessary to refill it. The thread can sleep, acquire locks, etc., in its mission, even as it provides memory for consumers that can't, such as network interrupts.
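The mechanism can be sketched in userland with pthreads. This is a simplified model, not the uvm_km.c code: the names getpage and km_thread, and the watermark constants, are illustrative, and the kernel's slow path does real page allocation and pmap work rather than malloc.

```c
#include <pthread.h>
#include <stdlib.h>
#include <unistd.h>

#define PAGESZ	4096
#define LOWAT	4	/* wake the refill thread below this */
#define HIWAT	16	/* refill the list up to this many pages */

struct page { struct page *next; char mem[PAGESZ]; };

static struct page *freelist;
static int nfree;
static pthread_mutex_t mtx = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t cv = PTHREAD_COND_INITIALIZER;

/* the fast path: never sleeps, just pops a ready page */
void *getpage(void)
{
	struct page *pg;

	pthread_mutex_lock(&mtx);
	pg = freelist;
	if (pg != NULL) {
		freelist = pg->next;
		nfree--;
	}
	if (nfree < LOWAT)
		pthread_cond_signal(&cv);	/* poke the refill thread */
	pthread_mutex_unlock(&mtx);
	return pg;
}

/* the background thread: free to sleep, take locks, do slow work */
void *km_thread(void *arg)
{
	struct page *pg;

	(void)arg;
	for (;;) {
		pthread_mutex_lock(&mtx);
		while (nfree >= LOWAT)
			pthread_cond_wait(&cv, &mtx);
		while (nfree < HIWAT) {
			pthread_mutex_unlock(&mtx);
			/* slow path stand-in; in the kernel this is
			   where page allocation and pmap entry happen */
			pg = malloc(sizeof(*pg));
			pthread_mutex_lock(&mtx);
			if (pg == NULL)
				break;	/* nothing more to give */
			pg->next = freelist;
			freelist = pg;
			nfree++;
		}
		pthread_mutex_unlock(&mtx);
	}
	return NULL;
}
```

The point of the split is that all the sleeping and locking happens in km_thread, so getpage remains safe to call from contexts that can do neither.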
The km_thread was introduced in uvm_km.c rev 1.36 and fixed and revised in a few subsequent commits, notably 1.39 and 1.41.
Things stood for a while, but there were a few shortcomings. One was that if a userland process was in a tight loop allocating structures, it could completely drain the freelist. Suddenly there’d be nothing left for interrupts. Since the motivation for the km_thread was to make sure interrupts always had access to memory, we weren’t done.
Art tried to fix this in rev 1.67 by having the requester sleep (if possible) when memory was running low but before it completely ran out. This helped in some circumstances, but not others, because the sleeping process was holding on to the memory while it slept. The real fix came in rev 1.68, which added a slowdown back channel that tells the caller to yield. The important bits are actually in subr_pool.c rev 1.63, which only calls yield after adding the freshly allocated page to the pool freelist. This prevents the sleeping process from hoarding memory.
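The ordering is the whole trick, and it's easy to model. The code below is a hypothetical sketch, not the subr_pool.c code: getpage_slowdown, pool_grow, and the constants are made up for illustration, and sched_yield stands in for the kernel's yield.

```c
#include <sched.h>
#include <stddef.h>

#define NPAGES	8	/* pages in our toy "system" */
#define LOWAT	2	/* below this, memory is running low */

static int navail = NPAGES;

/* hand out a page; *slowdown is the back channel reporting
   memory pressure to the caller */
static int getpage_slowdown(int *slowdown)
{
	int pg = -1;

	if (navail > 0)
		pg = --navail;		/* fake page "address" */
	*slowdown = (navail < LOWAT);
	return pg;
}

/* a pool's private list of pages */
static int pool_pages[NPAGES];
static int pool_npages;

/* the rev 1.63 ordering: stock the pool first, yield second */
static void pool_grow(void)
{
	int pg, slowdown = 0;

	pg = getpage_slowdown(&slowdown);
	if (pg < 0)
		return;
	pool_pages[pool_npages++] = pg;	/* page is usable by others now */
	if (slowdown)
		sched_yield();	/* only now give up the CPU */
}
```

If the yield came before the page was added to the pool, the preempted process would be sitting on a page nobody else could use, which is exactly the hoarding the back channel was meant to prevent.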
Nothing stands still. There was a lot more work going on to address other problems, but that’s another story for another day. Some of those changes included uvm_km.c rev 1.77 which pretty substantially rewrote the km_thread code. Unfortunately, not for the better. One recursion related bug was fixed in 1.81 but it wasn’t until rev 1.84 that everything was made right. Lesson learned: the km_thread exists because recursively calling into tangly uvm guts is bad news, so don’t try to bypass the thread and call into tangly uvm guts.
Let’s return now to the pool code, which was reworked quite a bit in subr_pool.c rev 1.158. The diff is pretty gnarly, but reading the code after and comparing it to the code before, there’s a better separation of concerns. Unfortunately, the slowdown feature got misplaced in the shuffle, perhaps because its true function wasn’t clear. Rev 1.166 put the yield back in the right spot.
Fixing up subtle features after refactoring work is a great way to increase one’s commit count.