I’ve been using the python lxml library for HTML parsing for ages. Seems to work pretty well. There’s actually a bunch of little one off scripts that share a similar skeleton, which is modified as needed. After all, the best code isn’t reusable, it’s reeditable. A little while ago that turned into a script to download Medium posts after I read them and save the important parts, so that sometime later when I want to read about the Riemann Hypothesis, it’s all still there in a place I can find it.
Back to the task at hand. After reading comments about the irony of bloat complaints, I thought I might take my Medium download script, tune it up a bit and translate it to go, then run it as a service, thus sparing all my computers. Not especially interesting or challenging, but a quick experiment. Would it be easy (fast) to write? And would it run fast?
Super basic main function. We’re going to listen to anything and then refetch it.
Even that’s pretty simple. Whatever URL we are asked for, we ask Medium for the same thing. Curiously, Medium does user agent sniffing, but impersonating a browser at this point actually results in HTML that’s less fun to work with. We’re only interested in rewriting HTML pages. (Some other requests, like favicon, will also pass this way.)
Medium posts keep all the good stuff under section-content divs. If we wanted to be generic, maybe this algorithm or any of a dozen others. But I was making an apples to apples comparison with a bespoke python script, so it’s done simply.
We provide just enough styling to keep the page readable. Customize to taste. The meat of the matter, which we’re finally getting to, is in the clean function.
Recursively descend through the tree, printing out the permitted tags. These are generally very simple, only requiring special handling for a and img tags.
Unlike lxml, go’s html package represents elements and text as separate nodes. The text isn’t attached as a field to the element, so walking the tree is a little different. Also, there’s no easy way to get an attribute?
There’s really no better way to find an attribute than to loop through an array? A little weird, but maybe the space overhead of using a map is too much. Assuming only a few attributes, maybe even faster anyway.
The fact that sort.Search returns an index even when nothing is found is one of those less-is-more, worse-is-better go quirks. It’s not hard for me to test the index, but since the implementation already knew whether it was a match and threw away that information, it feels wasteful.
Satisfied it works on localhost, you can tune things up with a few lines of nginx.conf and some DNS trickery to always point browsers to the proxy, but that’s getting into “you’ll shoot your eye out, kid” territory.
As for results, it’s pretty good. Just the HTML for a post shrinks by a factor of six, to say nothing of the js and css left out. Where do these savings come from? The code above doesn’t copy attributes for most tags, so elements won’t have class information as in <em class="markup--em markup--p-em">. I haven’t spelunked the stylesheet, though I do find it a little curious that we’ve taken to replicating selectors in class names. You will also miss out on such gems as <div style="padding-bottom: 48.199999999999996%;">, resulting in some layout differences. A lot of page weight remains in the form of images, although for bonus points one can make them click to load.
I mentioned performance. Some benchmark code.
The go version runs about six times faster. I’ve never had major problems with lxml’s performance, but that’s primarily as a batch user. A fire and forget script that takes a second to run is fine. But for an inline rewriter, getting results fast is more important.
Python and lxml are still very handy for poking through a document interactively, but now that I’ve got my HTML rewriting skeleton translated to go, I’ll probably use that for all future final versions.