
a few realloc fixes

I recently spent a little time fixing and improving realloc in OpenBSD. In addition to the short commit messages, here’s a longer explanation of the changes that gives more background and a better understanding of both malloc and OpenBSD.

The current libc malloc implementation, internally called omalloc, is loosely coupled with the kernel’s VM system, uvm. The two are developed in sync, each with the known and needed behavior of the other in mind. The mmap abstraction is deliberately leaky without being too purpose-specific. The general idea is that malloc uses mmap to allocate memory, then breaks it down for the application to consume. Meanwhile, the kernel’s allocation policy ensures that the allocated memory is spread out and randomized, making it harder to exploit deterministically.

fixing the cache

malloc keeps a cache of free pages to avoid bouncing free memory to the kernel and then making a new request only a few cycles later. But we don’t want too much free memory hanging around. The less free memory is kept in the cache, the less it will be reused, making it easier to catch use after free bugs. Once memory is sent back to the kernel via munmap, any attempt to dereference a pointer into it will crash. Recycling that memory means the program will scribble over it instead and probably crash later, but in some less identifiable way.
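
To make that concrete, here’s a hypothetical toy program, not part of malloc itself, that frees a large region and then uses the stale pointer. If free sent the pages back with munmap, the printf faults immediately; if the pages were cached or recycled, it quietly prints stale or garbage data and the bug hides.

	/* hypothetical demo, not part of omalloc: what happens to a stale
	   pointer depends on whether free() actually unmapped the pages */
	#include <stdio.h>
	#include <stdlib.h>
	#include <string.h>

	int
	main(void)
	{
		char *stale = malloc(1UL << 16);	/* big enough to be its own region */

		strcpy(stale, "hello");
		free(stale);			/* pages cached or unmapped */
		printf("%s\n", stale);		/* use after free: crash, or silent garbage */
		return 0;
	}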

One trick malloc uses is to randomly evict memory from the cache. malloc calls each segment of memory a region, so I’m going to start using that term, too. When we free a region, we start at a random location in the cache, and unmap previously freed and cached regions to make space. Then the newly free region goes in the cache. When we allocate a new region, we again pick a random offset and look for a contender. The random offsets ensure that we don’t recycle memory in a deterministic pattern regardless of the program’s allocation pattern.

Here’s what that code used to look like with a few extra comments added:

	offset = getrnibble();
	/* this loop makes room for the newly freed region by unmapping */
	/* a very similar loop was used to find a region in the cache
	   in the allocation code path */
	for (i = 0; tounmap > 0 && i < mopts.malloc_cache; i++) {
		/* jump ahead by offset, handling wraparound by masking */
		r = &d->free_regions[(i + offset) & (mopts.malloc_cache - 1)];
		if (r->p != NULL) {
...
	/* this loop inserts the freed region in the cache */
	for (i = 0; i < mopts.malloc_cache; i++) {
		r = &d->free_regions[i];
		if (r->p == NULL) {

A quick diversion here to explain the getrnibble function. malloc uses the arc4random generator to create random numbers, but there’s one performance issue: every call to arc4random calls getpid to check that we’re still the same process and haven’t forked. It’s very important that every forked process get a newly reseeded number generator. Unfortunately, calling getpid, possibly several times, for every malloc call is rather slow. In many cases, malloc only needs a few random bits. Instead, malloc maintains its own pool of random bits and has a function that returns half a byte of random data, 4 bits, at a time. That’s getrnibble. The malloc pool is reseeded when it hits empty, but not after fork. This is a compromise between performance and security.
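
Something along these lines, simplified and with made-up names rather than the actual omalloc internals, captures the idea: fill a small buffer from arc4random_buf and dole it out a nibble at a time.

	/* simplified sketch of the nibble pool, not the real omalloc code:
	   refill from arc4random_buf only when the buffer runs dry */
	#include <sys/types.h>
	#include <stdlib.h>

	static u_char rbytes[32];
	static u_int nibbles_used = sizeof(rbytes) * 2;	/* force a refill on first use */

	static u_char
	getrnibble_sketch(void)
	{
		u_char x;

		if (nibbles_used >= sizeof(rbytes) * 2) {
			arc4random_buf(rbytes, sizeof(rbytes));
			nibbles_used = 0;
		}
		x = rbytes[nibbles_used / 2];
		if (nibbles_used++ & 1)
			x >>= 4;		/* odd calls take the high nibble */
		return x & 0xf;
	}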

Returning to the above code, we see that we’re using a nibble as the offset into the cache. What the code failed to account for is that 4 bits can only count to 15, but the cache can have as many as 256 entries, accommodating indices up to 255. We need 8 random bits, or two nibbles, to get full coverage; otherwise we’re mostly churning around in the front of the cache and then linearly filling the back.

One other small fix is that the second loop above always starts from the beginning. This isn’t a problem, per se, since we’ll use a random offset when allocating, but it’s inefficient. The previous loop most likely made a nice hole for us to use and we know where that hole is. Let’s reuse the same indexing technique. Here’s the patch for both issues. cvsweb edition.

--- malloc.c	29 Feb 2012 08:44:14 -0000	1.141
+++ malloc.c	19 Jun 2012 02:14:58 -0000
@@ -317,7 +317,7 @@ unmap(struct dir_info *d, void *p, size_
 	rsz = mopts.malloc_cache - d->free_regions_size;
 	if (psz > rsz)
 		tounmap = psz - rsz;
-	offset = getrnibble();
+	offset = getrnibble() + getrnibble() << 4;
 	for (i = 0; tounmap > 0 && i < mopts.malloc_cache; i++) {
 		r = &d->free_regions[(i + offset) & (mopts.malloc_cache - 1)];
 		if (r->p != NULL) {
@@ -337,7 +337,7 @@ unmap(struct dir_info *d, void *p, size_
 	if (tounmap > 0)
 		wrterror("malloc cache underflow", NULL);
 	for (i = 0; i < mopts.malloc_cache; i++) {
-		r = &d->free_regions[i];
+		r = &d->free_regions[(i + offset) & (mopts.malloc_cache - 1)];
 		if (r->p == NULL) {
 			if (mopts.malloc_hint)
 				madvise(p, sz, MADV_FREE);
@@ -398,7 +398,7 @@ map(struct dir_info *d, size_t sz, int z
 		/* zero fill not needed */
 		return p;
 	}
-	offset = getrnibble();
+	offset = getrnibble() + getrnibble() << 4;
 	for (i = 0; i < mopts.malloc_cache; i++) {
 		r = &d->free_regions[(i + offset) & (mopts.malloc_cache - 1)];
 		if (r->p != NULL) {

Did you spot the bug? Operator precedence bytes again! It takes getrnibble() + (getrnibble() << 4) (note parens) to get a full random byte.
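
Since + binds tighter than <<, the unparenthesized line above shifts the sum of the two nibbles, so the offset is always a multiple of 16. With the parens in place, the line reads:

	/* low nibble plus high nibble gives a full random byte */
	offset = getrnibble() + (getrnibble() << 4);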

fixing realloc

The previous fixes affected regular malloc/free more than realloc. Now we get to the real realloc code. When realloc wants to grow a region, it provides a hint to mmap requesting that more memory be returned adjacent to the existing allocation. This way, we keep the current region and don’t have to copy anything. The downside is that if mmap returns a different address, it’s of no use to us. We have to unmap the new useless region, then start over with a new full sized allocation, copy over, and free the existing region. For reasons of efficiency, we’d like to avoid copying when possible, but we’d really like to avoid receiving memory we can’t use and will only have to send back.
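
Roughly, the cheap grow attempt looks like the following sketch. The function and variable names are made up for illustration; the real code is in the diff further down.

	/* hypothetical sketch of growing a region in place; p is the start
	   of an roldsz-byte region that should become rnewsz bytes */
	#include <sys/types.h>
	#include <sys/mman.h>

	static void *
	grow_in_place(void *p, size_t roldsz, size_t rnewsz)
	{
		void *hint = (char *)p + roldsz;
		size_t needed = rnewsz - roldsz;
		void *q;

		q = mmap(hint, needed, PROT_READ | PROT_WRITE,
		    MAP_ANON | MAP_PRIVATE, -1, 0);
		if (q == hint)
			return p;		/* grew in place, nothing to copy */
		if (q != MAP_FAILED)
			munmap(q, needed);	/* useless pages; send them back */
		return NULL;			/* caller falls back to allocate, copy, free */
	}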

The first thing we can do to help ensure success is to not get in our own way. mmap will only give us memory at our hint address if the requested length starting at that address is free. If some of that memory is in use, there’s nothing we can do, but we can make sure that at least the cache is not in the way. There was a zapcacheregion function for this purpose.

But zapcacheregion has a bug. It would only consider zapping regions that immediately followed the existing allocation, not regions that fell anywhere inside the region about to be requested. Compare the old code

	if (r->p == p) {

with the fixed version

	if (r->p >= p && r->p <= (void *)((char *)p + len)) {

This makes a considerable difference because the kernel’s allocation policy means that exact matches almost never occur. We were deliberately trying to avoid having adjacent regions, after all. So now our cache zapping function works much better, and we should have much more success getting the allocation we want.

One more improvement is to avoid actually mapping the new region if we know it will be unacceptable. OpenBSD added a new syscall mquery some time ago that’s basically a dummy mmap. It tells you what you’d get, but doesn’t deliver anything. Instead of calling mmap right away, we test for success with mquery, then call mmap. It’s still possible for the mmap call to return an address we’d rather not have, but it’s less likely. Making another syscall slows things down in the best case, but speeds things up in the worst case. This is a literal implementation of measure twice, cut once.
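
Here’s the same sketch with the measure twice step added. Again the helper name is made up; it mirrors the MQUERY and MMAPA macros in the diff below, and mquery takes the same arguments as mmap but only reports where such a mapping would land.

	/* sketch of probing with mquery before committing to mmap */
	#include <sys/types.h>
	#include <sys/mman.h>

	static void *
	try_map_at(void *hint, size_t needed)
	{
		void *q;

		/* ask where an anonymous mapping of this size would go */
		q = mquery(hint, needed, PROT_READ | PROT_WRITE,
		    MAP_ANON | MAP_PRIVATE, -1, 0);
		if (q != hint)
			return MAP_FAILED;	/* it wouldn't land where we need it */
		/* measurement says yes; now actually establish the mapping */
		return mmap(hint, needed, PROT_READ | PROT_WRITE,
		    MAP_ANON | MAP_PRIVATE, -1, 0);
	}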

This is the complete second diff. cvsweb edition.

--- malloc.c	18 Jun 2012 17:03:51 -0000	1.142
+++ malloc.c	19 Jun 2012 15:03:47 -0000
@@ -94,6 +94,9 @@
 #define MMAPA(a,sz)	mmap((a), (size_t)(sz), PROT_READ | PROT_WRITE, \
     MAP_ANON | MAP_PRIVATE, -1, (off_t) 0)
 
+#define MQUERY(a, sz)	mquery((a), (size_t)(sz), PROT_READ | PROT_WRITE, \
+    MAP_ANON | MAP_PRIVATE, -1, (off_t)0)
+
 struct region_info {
 	void *p;		/* page; low bits used to mark chunks */
 	uintptr_t size;		/* size for pages, or chunk_info pointer */
@@ -356,15 +359,17 @@ unmap(struct dir_info *d, void *p, size_
 }
 
 static void
-zapcacheregion(struct dir_info *d, void *p)
+zapcacheregion(struct dir_info *d, void *p, size_t len)
 {
 	u_int i;
 	struct region_info *r;
 	size_t rsz;
 
 	for (i = 0; i < mopts.malloc_cache; i++) {
 		r = &d->free_regions[i];
-		if (r->p == p) {
+		if (r->p >= p && r->p <= (void *)((char *)p + len)) {
 			rsz = r->size << MALLOC_PAGESHIFT;
 			if (munmap(r->p, rsz))
 				wrterror("munmap", r->p);
@@ -1283,20 +1288,26 @@ orealloc(void *p, size_t newsz, void *f)
 
 		if (rnewsz > roldsz) {
 			if (!mopts.malloc_guard) {
+				void *hint = (char *)p + roldsz;
+				size_t needed = rnewsz - roldsz;
+
 				STATS_INC(g_pool->cheap_realloc_tries);
-				zapcacheregion(g_pool, (char *)p + roldsz);
-				q = MMAPA((char *)p + roldsz, rnewsz - roldsz);
-				if (q == (char *)p + roldsz) {
-					malloc_used += rnewsz - roldsz;
+				zapcacheregion(g_pool, hint, needed);
+				q = MQUERY(hint, needed);
+				if (q == hint)
+					q = MMAPA(hint, needed);
+				else
+					q = MAP_FAILED;
+				if (q == hint) {
+					malloc_used += needed;
 					if (mopts.malloc_junk)
-						memset(q, SOME_JUNK,
-						    rnewsz - roldsz);
+						memset(q, SOME_JUNK, needed);
 					r->size = newsz;
 					STATS_SETF(r, f);
 					STATS_INC(g_pool->cheap_reallocs);
 					return p;
 				} else if (q != MAP_FAILED) {
-					if (munmap(q, rnewsz - roldsz))
+					if (munmap(q, needed))
 						wrterror("munmap", q);
 				}
 			}

Later, I came up with a test case to verify these changes.
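
That test isn’t reproduced here, but something in this spirit exercises the cheap realloc path (this is a purely hypothetical sketch, not the committed test): grow a large allocation and see whether it stayed in place and the data survived.

	/* hypothetical test sketch, not the actual regression test */
	#include <err.h>
	#include <stdio.h>
	#include <stdlib.h>
	#include <string.h>

	int
	main(void)
	{
		size_t oldsz = 1UL << 20, newsz = 1UL << 21;
		char *p, *q;

		if ((p = malloc(oldsz)) == NULL)
			err(1, "malloc");
		memset(p, 'A', oldsz);
		if ((q = realloc(p, newsz)) == NULL)
			err(1, "realloc");
		printf("%s realloc, data %s\n", q == p ? "cheap" : "copied",
		    q[oldsz - 1] == 'A' ? "intact" : "corrupt");
		free(q);
		return 0;
	}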

Posted 22 Jun 2012 20:25 by tedu Updated: 04 Dec 2014 01:13
Tagged: c openbsd programming