Another little adventure in web page rewriting. I wanted to use a few more go features, and make something that would work on at least a few different sites via the Host header.
Say you want to read the Considerations On Cost Disease post that was making all the rounds recently. Like a lot of Slate Star Codex posts, it’s pretty long. In fact, you might read it for several minutes and glance at the scrollbar to discover you haven’t made any progress. You could be reading this post for weeks. But the situation isn’t really that bad, because like every Slate Star Codex post, it has a shitton of comments. More than 1000 in fact. This is not to say that the comments are bad (or good), but there certainly are a lot of them.
Other sites also suffer from the same affliction. Maybe you just want to read the 24 line ditty about Erdoğan that’s banned in Germany but without wading into arguments about whether “His gob smells of bad döner” is a great insult or the greatest insult.
I think it would be nice if there weren’t so many comments. Maybe I want to read them, maybe not, but that decision is definitely orthogonal to my decision to read the article. It’s easy to simply not read them, but their mere presence makes certain tasks difficult. Come back in two days and scroll to the midway point to revisit a section? Not a chance. I could set up an adblock rule, but I only know how to do that with some browsers (not with mobile safari). And, critically, if I’m on a mobile connection, hiding an element with a css filter doesn’t result in any bandwidth savings. Time for the big go flavored guns.
Deleting a chunk of an html tree is really simple. We could use CSS, but the manual traversal is easy and fast, too. I’m only going to say this once, but fortunately all these sites use wordpress and stick comments under the same element.
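As a sketch of that traversal: the code below uses a minimal stand-in for golang.org/x/net/html’s node type (the real code operates on html.Node, and the unlinking here is the same relinking that (*html.Node).RemoveChild performs). The id “comments” is the element wordpress themes conventionally wrap the comment section in.

```go
package main

import "fmt"

// Minimal stand-in for golang.org/x/net/html's Node: the same
// parent/child/sibling links, just enough to show the surgery.
type Node struct {
	Data                          string
	Attr                          map[string]string
	Parent, FirstChild, LastChild *Node
	PrevSibling, NextSibling      *Node
}

func (p *Node) append(c *Node) *Node {
	c.Parent = p
	if p.LastChild != nil {
		p.LastChild.NextSibling = c
		c.PrevSibling = p.LastChild
	} else {
		p.FirstChild = c
	}
	p.LastChild = c
	return c
}

// remove unlinks n from its parent, the same relinking that
// (*html.Node).RemoveChild does.
func remove(n *Node) {
	p := n.Parent
	if p.FirstChild == n {
		p.FirstChild = n.NextSibling
	}
	if p.LastChild == n {
		p.LastChild = n.PrevSibling
	}
	if n.PrevSibling != nil {
		n.PrevSibling.NextSibling = n.NextSibling
	}
	if n.NextSibling != nil {
		n.NextSibling.PrevSibling = n.PrevSibling
	}
	n.Parent, n.PrevSibling, n.NextSibling = nil, nil, nil
}

// stripComments deletes any subtree whose id is "comments".
func stripComments(n *Node) {
	for c := n.FirstChild; c != nil; {
		next := c.NextSibling // save: c's links are cleared on removal
		if c.Attr["id"] == "comments" {
			remove(c)
		} else {
			stripComments(c)
		}
		c = next
	}
}

func main() {
	body := &Node{Data: "body"}
	body.append(&Node{Data: "article"})
	body.append(&Node{Data: "div", Attr: map[string]string{"id": "comments"}})
	stripComments(body)
	for c := body.FirstChild; c != nil; c = c.NextSibling {
		fmt.Println(c.Data) // prints only: article
	}
}
```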
The cost disease article is now less than 4% of its former size, and the Popehat page has also gone from tweet sized to poem sized, all without rewriting the entirety of the page HTML. I don’t want every web page to look the same, so everybody gets to keep their individual styling and layout. I’m not a monster!
This is considerably easier than writing a custom render function, giving us more time to add another feature. Image caching of a sort. Some posts have lots of images and graphs which are certainly helpful when reading the post, but if you return later or only want to scan the article, you don’t need every image every time. So let’s lazy load them.
This is just a little helper to download an image and parse it. One thing to note is that a bytes.Buffer can only be used as a reader once, and that resets the bytes slice, so we need to use a new reader for the image parser.
The monitor is the heart of the cache. Clients make requests, and if the image is already cached, they get the saved values back. No downloading twice, no parsing twice. Images are saved (and will later be served) using a hash of the src attribute. This does intentionally serialize all image requests, but we get nicely sized rectangles, and assuming that the link from here to upstream is faster than the (mobile) link from the browser to here, it’s not too noticeable.
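The shape of that monitor looks roughly like this (the names and the cached struct’s contents are hypothetical, since the real code isn’t shown here; the hash-of-src key and the serialization are from the text):

```go
package main

import (
	"crypto/sha256"
	"fmt"
)

// cached holds whatever we keep per image; the real fields (bytes,
// decoded dimensions) aren't shown, so a string stands in.
type cached struct{ data string }

// request asks the monitor for one src; the answer comes back on reply.
type request struct {
	src   string
	reply chan cached
}

// key hashes the src attribute; the image is saved and later served
// under this name.
func key(src string) string {
	return fmt.Sprintf("%x", sha256.Sum256([]byte(src)))
}

// monitor owns the cache map outright. Every lookup and download
// happens in this one goroutine, so there's no locking and nothing
// is fetched or parsed twice -- at the cost of serializing all
// image requests.
func monitor(reqs chan request, fetch func(src string) cached) {
	cache := make(map[string]cached)
	for r := range reqs {
		k := key(r.src)
		img, ok := cache[k]
		if !ok {
			img = fetch(r.src) // slow path: download and parse
			cache[k] = img
		}
		r.reply <- img
	}
}

func main() {
	fetches := 0
	fetch := func(src string) cached {
		fetches++
		return cached{data: "bytes of " + src}
	}
	reqs := make(chan request)
	go monitor(reqs, fetch)

	for i := 0; i < 3; i++ {
		reply := make(chan cached)
		reqs <- request{src: "http://example.com/graph.png", reply: reply}
		<-reply
	}
	close(reqs)
	fmt.Println(fetches) // 1: the same src is only downloaded once
}
```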
And now for the hard part. Replacing parts of the tree. It’s a big function, but lots of lines are boilerplate.
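The central move is the splice itself; the rest of the big function is deciding what the replacement node should be. A sketch, again on a minimal stand-in for x/net/html’s node links (the real function in cfc.go works on html.Node, and the “/cache/” path and hash function here are assumptions):

```go
package main

import "fmt"

// Minimal stand-in for x/net/html's node links.
type Node struct {
	Data                          string
	Attr                          map[string]string
	Parent, FirstChild, LastChild *Node
	PrevSibling, NextSibling      *Node
}

// replace splices repl into the spot old occupied: same parent,
// same neighbors.
func replace(old, repl *Node) {
	repl.Parent = old.Parent
	repl.PrevSibling, repl.NextSibling = old.PrevSibling, old.NextSibling
	if old.PrevSibling != nil {
		old.PrevSibling.NextSibling = repl
	} else if old.Parent != nil {
		old.Parent.FirstChild = repl
	}
	if old.NextSibling != nil {
		old.NextSibling.PrevSibling = repl
	} else if old.Parent != nil {
		old.Parent.LastChild = repl
	}
	old.Parent, old.PrevSibling, old.NextSibling = nil, nil, nil
}

// lazify swaps an <img> for a link to the cached copy.
func lazify(img *Node, hash func(string) string) *Node {
	a := &Node{Data: "a", Attr: map[string]string{
		"href": "/cache/" + hash(img.Attr["src"]),
	}}
	replace(img, a)
	return a
}

func main() {
	p := &Node{Data: "p"}
	img := &Node{Data: "img", Attr: map[string]string{"src": "graph.png"}, Parent: p}
	p.FirstChild, p.LastChild = img, img

	lazify(img, func(s string) string { return "hash-of-" + s })
	fmt.Println(p.FirstChild.Data, p.FirstChild.Attr["href"])
	// a /cache/hash-of-graph.png
}
```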
The complete code is in cfc.go.
One tricky aspect is the network setup. We’re not quite at transparent proxy levels of convenience here. Because this code works for multiple hosts, it uses the Host header instead of hardcoding an upstream. We need some split view DNS to direct browsers to our server, but still allow the server to contact the original upstream. One way to do this is with unbound views if you’re using the right version. Or you need a python plugin. Or something. It’s unfortunately too involved for an easy copy and paste config.
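For the unbound route, a sketch of what the split view might look like (the addresses and zone are examples; views need unbound 1.6 or newer):

```
server:
    # clients on the LAN resolve the blog to the rewriting proxy
    access-control-view: 192.168.1.0/24 rewriter

view:
    name: "rewriter"
    local-zone: "slatestarcodex.com." redirect
    local-data: "slatestarcodex.com. A 192.168.1.10"
```

The proxy host itself resolves through a different path (or just isn’t in that netblock), so it can still reach the real upstream.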