firefox vs rthreads

Mistakes were made, but not by me.

Firefox is too slow. OpenBSD is too slow. The combination is too too slow. This situation was known for some time, but resolution was also slow, for quite a few reasons.

Many Firefox on OpenBSD users, particularly developers, only use OpenBSD, so the extent of the performance gap between platforms went unnoticed. Web browsing would grow ever slower, but the only page that matters would continue to load as quickly as ever, once the slumbering lizard had awoken. Clearly the reason it took me thirty seconds to view a single tweet was idiot kids and their infernal javascript frameworks.

A few changes were made which improved some of the worst cases, but made much less of an overall impact. A tweak to realloc to avoid a peculiar case in the X server where it would repeatedly resize a buffer. A tweak to malloc to reduce contention on the lock while one thread was making a system call. Another tweak to X’s socket buffers to reduce the number of system calls required to move images back and forth.

We struggled on for a while, muttering to our insular selves, until the releases of Firefox 40 and 41. Suddenly performance plummeted from unpleasant to unbearable. It was increasingly obvious that there was a real problem and that this was not, in fact, just the way things were. Firefox was the component that had changed, but it wasn’t possible to pretend they would release something this bad. They were triggering a preexisting condition.

Isolating the problem proved difficult. Firefox is not what one would call a minimal test case. Just downloading the source can take longer than rebuilding all of OpenBSD. Consequently, there was a lot of hypothesizing. My running joke was that you could ask a half dozen developers why Firefox was slow and receive a dozen theories in reply. It now seems possible that all of them were correct.

Most of the theories were similar. Firefox was calling some function or system call excessively, but this inefficiency was masked on other platforms because their implementation was better optimized. But the question remained, which function? mpi measured and saw lots of pthread_mutex_trylock calls, but when I measured, all I saw were gettimeofday calls.

Some other operating systems use a feature called vDSO to short circuit certain system calls, notably gettimeofday. Support for vDSO has never been popular in OpenBSD because of concerns about negative security implications. And there was the somewhat ideological stand that if a system call is too expensive, maybe you should fix your program to behave better. I’m willing to agree that some programs may benefit from cheaper gettimeofday calls, but it seems unreasonable for a program like a browser to require them. However, when software is developed on platforms that do have vDSO, it’s possible to accidentally add implicit dependencies on vDSO-like speed. Trying to identify the fault after the fact in a program the size of Firefox isn’t easy.

Fortunately, gettimeofday is merely one possible slow function. There are many more and maybe some of them can be fixed. kettenis identified suboptimal locking in malloc as another suspect. Despite earlier efforts to avoid holding the spinlock while making a system call, threads could still pile up spinning on the lock. Changing the lock to a mutex puts the waiters to sleep, reducing overall load.
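To make the difference concrete, here is a toy comparison. It is not the actual malloc code, and every name in it is invented for illustration. With the spinlock, each waiter burns CPU in a loop until the flag clears. With the pthread mutex, a contended waiter can be put to sleep until the holder lets go.

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>

#define NTHREADS 4
#define NITER 50000

/* Toy spinlock: waiters loop on the flag, eating CPU the whole time. */
static atomic_flag spin = ATOMIC_FLAG_INIT;
/* Mutex: contended waiters sleep until the lock is released. */
static pthread_mutex_t mtx = PTHREAD_MUTEX_INITIALIZER;
/* Shared state protected by whichever lock the workers use. */
static long counter;

static void *
worker_spin(void *arg)
{
	int i;

	(void)arg;
	for (i = 0; i < NITER; i++) {
		while (atomic_flag_test_and_set_explicit(&spin,
		    memory_order_acquire))
			;	/* busy-wait */
		counter++;
		atomic_flag_clear_explicit(&spin, memory_order_release);
	}
	return NULL;
}

static void *
worker_mutex(void *arg)
{
	int i;

	(void)arg;
	for (i = 0; i < NITER; i++) {
		pthread_mutex_lock(&mtx);
		counter++;
		pthread_mutex_unlock(&mtx);
	}
	return NULL;
}

/* Run NTHREADS copies of fn and return the final counter value. */
static long
run_workers(void *(*fn)(void *))
{
	pthread_t t[NTHREADS];
	int i, rc;

	counter = 0;
	for (i = 0; i < NTHREADS; i++) {
		rc = pthread_create(&t[i], NULL, fn, NULL);
		assert(rc == 0);
		(void)rc;
	}
	for (i = 0; i < NTHREADS; i++)
		pthread_join(t[i], NULL);
	return counter;
}
```

Both versions produce the same count. The difference shows up in CPU time, where the spinning waiters contribute load and the sleeping ones do not.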

OG malloc developer otto reentered the fray with a multi pool malloc diff, which opens the way to really reducing lock contention.
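The general idea behind multiple pools, as opposed to the details of otto's diff (which this does not attempt to reproduce), can be sketched like so. Each pool has its own lock, and each thread sticks to one pool, so two threads only contend when they land on the same pool. The names, the pool count, and the round robin assignment are all assumptions made for the example.

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>
#include <stdlib.h>

#define NPOOLS 4	/* arbitrary pool count for this sketch */

struct pool {
	pthread_mutex_t lock;	/* one lock per pool, not one global lock */
	size_t nallocs;		/* stand-in for the pool's real bookkeeping */
};

static struct pool pools[NPOOLS] = {
	{ PTHREAD_MUTEX_INITIALIZER, 0 },
	{ PTHREAD_MUTEX_INITIALIZER, 0 },
	{ PTHREAD_MUTEX_INITIALIZER, 0 },
	{ PTHREAD_MUTEX_INITIALIZER, 0 },
};

static atomic_uint next_pool;		/* hands out pools round robin */
static _Thread_local int my_pool = -1;	/* this thread's pool, once chosen */

/* Pick a pool the first time a thread allocates, then keep using it. */
static struct pool *
pool_for_thread(void)
{
	if (my_pool == -1)
		my_pool = atomic_fetch_add(&next_pool, 1) % NPOOLS;
	assert(my_pool >= 0 && my_pool < NPOOLS);
	return &pools[my_pool];
}

static void *
pool_malloc(size_t size)
{
	struct pool *p = pool_for_thread();
	void *mem;

	pthread_mutex_lock(&p->lock);
	/* A real allocator would carve memory from the pool's own pages. */
	mem = malloc(size);
	if (mem != NULL)
		p->nallocs++;
	pthread_mutex_unlock(&p->lock);
	return mem;
}
```

The sketch skips the hard part. A real allocator also has to find the owning pool when memory is freed, possibly by a different thread than the one that allocated it.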

Perhaps also the scheduler could be improved. The recent rewrite by Michal Mazurek shows some promise. This inspired mpi to take another look at yield. As you’ll recall from page two of your Building a Multithreaded Kernel textbook, when a high priority thread waits on a lock, it’s supposed to gift its priority to the lock holder to ensure progress is made. We (I) never quite got around to implementing that, and for several years it seemed we just might get away with it. The history of rthreads is pretty much maybe tomorrow, maybe not.
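For anyone who skipped page two, the priority gift looks roughly like this. It is a toy model with made up names, not anything from the OpenBSD kernel or rthreads. There is no real scheduler here. The lock simply records its holder, and a blocked waiter with higher priority raises the holder's effective priority until the lock is released.

```c
#include <assert.h>
#include <stddef.h>

struct thread {
	const char *name;
	int base_prio;	/* assigned priority; higher runs first */
	int eff_prio;	/* priority after any donation */
};

struct lock {
	struct thread *holder;	/* NULL when the lock is free */
};

/*
 * Try to take the lock. Returns 1 on success. Returns 0 if the caller
 * would have to wait, after donating its priority to the holder.
 */
static int
lock_try(struct lock *l, struct thread *t)
{
	if (l->holder == NULL) {
		l->holder = t;
		return 1;
	}
	/* Would block: gift our priority so the holder gets to run. */
	if (t->eff_prio > l->holder->eff_prio)
		l->holder->eff_prio = t->eff_prio;
	return 0;
}

/* Release the lock and drop any donated priority. */
static void
lock_release(struct lock *l)
{
	assert(l->holder != NULL);
	/* Simplification: assumes the holder holds no other contended lock. */
	l->holder->eff_prio = l->holder->base_prio;
	l->holder = NULL;
}
```

Without the donation step, a low priority holder can sit unscheduled while a high priority waiter keeps yielding to everything except the one thread it needs to run.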

Although it’s not directly related to rthreads or Firefox, stefan’s work on amap efficiency is another notable improvement. Memory randomization results in the average process having many tiny private mappings. But UVM was designed to use amaps to manage complex shared mappings, not piles of tiny private ones.

Someday soon we can all hope to have a browser that is merely slow, and not too slow. A lot of thanks to kettenis and mpi, among others, for refusing to give up in the face of an overwhelming challenge. And landry for keeping Firefox up to date, such that whiners like me can even complain about its performance.

Posted 2016-04-11 04:39:10 by tedu Updated: 2016-04-11 20:03:00
Tagged: openbsd