comment free codex

Another little adventure in web page rewriting. I wanted to use a few more go features, and make something that would work on at least a few different sites via the Host header.

Say you want to read Considerations On Cost Disease, which was making the rounds recently. Like a lot of Slate Star Codex posts, it’s pretty long. In fact, you might read it for several minutes and glance at the scrollbar to discover you haven’t made any progress. You could be reading this post for weeks. But the situation isn’t really that bad, because like every Slate Star Codex post, it has a shitton of comments. More than 1000 in fact. This is not to say that the comments are bad (or good), but there certainly are a lot of them.

Other sites also suffer from the same affliction. Maybe you just want to read the 24 line ditty about Erdoğan that’s banned in Germany but without wading into arguments about whether “His gob smells of bad döner” is a great insult or the greatest insult.

I think it would be nice if there weren’t so many comments. Maybe I want to read them, maybe not, but that decision is definitely orthogonal to my decision to read the article. It’s easy to simply not read them, but their mere presence makes certain tasks difficult. Come back in two days and scroll to the midway point to revisit a section? Not a chance. I could set up an adblock rule, but I only know how to do that with some browsers (not with mobile safari). And, critically, if I’m on a mobile connection, hiding an element with a css filter doesn’t result in any bandwidth savings. Time for the big go flavored guns.

Deleting a chunk of an html tree is really simple. We could use CSS selectors to find the element, but a manual traversal is easy and fast, too. I’m only going to say this once, but fortunately all these sites use wordpress and stick comments under the same element.

func removecomments(node *html.Node) {
        if node.Type == html.ElementNode && node.Data == "div" {
                if getattr(node, "id") == "comments" {
                        node.FirstChild = nil
                        node.LastChild = nil
                }
        }
        for c := node.FirstChild; c != nil; c = c.NextSibling {
                removecomments(c)
        }
}
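
The getattr helper (and the setattr used further down) isn't shown in this post. A rough guess at what they might look like, written against small stand-in types so it runs without golang.org/x/net/html. The attr and node types here are assumptions for the sketch; the real helpers would take a *html.Node and walk node.Attr the same way.

```go
package main

import "fmt"

// attr and node are stand-ins for html.Attribute and html.Node,
// kept local so the sketch runs on its own.
type attr struct {
	Key, Val string
}

type node struct {
	Attr []attr
}

// getattr returns the value of the first attribute named key,
// or "" if the node has no such attribute.
func getattr(n *node, key string) string {
	for _, a := range n.Attr {
		if a.Key == key {
			return a.Val
		}
	}
	return ""
}

// setattr overwrites an existing attribute or appends a new one.
func setattr(n *node, key, val string) {
	for i := range n.Attr {
		if n.Attr[i].Key == key {
			n.Attr[i].Val = val
			return
		}
	}
	n.Attr = append(n.Attr, attr{key, val})
}

func main() {
	n := &node{Attr: []attr{{"id", "comments"}}}
	fmt.Println(getattr(n, "id")) // comments
	setattr(n, "class", "hidden")
	fmt.Println(getattr(n, "class")) // hidden
}
```

Returning "" for a missing attribute keeps the call sites short, at the cost of not telling a missing attribute apart from an empty one. For matching id="comments" that doesn't matter.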

Results:

-rw-r--r--  1 tedu  wheel  2110482 Feb 24 09:28 cost.html
-rw-r--r--  1 tedu  wheel    81776 Feb 24 09:34 cost2.html
-rw-r--r--  1 tedu  wheel   458689 Feb 24 09:57 poem.html
-rw-r--r--  1 tedu  wheel    60752 Feb 24 09:58 poem2.html

The new cost article is now less than 4% of its former size, and the Popehat page has also gone from tweet sized to poem sized. But without rewriting the entirety of the page HTML. I don’t want every web page to look the same, so everybody gets to keep their individual styling and layout. I’m not a monster!

func rerender(w io.Writer, root *html.Node) {
        removecomments(root)
        cacheimages(root)
        html.Render(w, root)
}

This is considerably easier than writing a custom render function, giving us more time to add another feature. Image caching of a sort. Some posts have lots of images and graphs which are certainly helpful when reading the post, but if you return later or only want to scan the article, you don’t need every image every time. So let’s lazy load them.

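func fetchimg(src string) (image.Image, []byte, error) {
        if src == "" {
                return nil, nil, errors.New("missing src")
        }
        log.Println("fetch img", src)
        resp, err := http.Get(src)
        if err != nil {
                return nil, nil, err
        }
        defer resp.Body.Close()
        // Keep the raw bytes: we serve them later, and decoding consumes a reader.
        var buf bytes.Buffer
        if _, err := io.Copy(&buf, resp.Body); err != nil {
                return nil, nil, err
        }
        data := buf.Bytes()
        // image.Decode only knows formats whose decoders have been registered,
        // e.g. via a blank import of image/png or image/jpeg.
        img, _, err := image.Decode(bytes.NewReader(data))
        if err != nil {
                return nil, nil, err
        }
        return img, data, nil
}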

This is just a little helper to download an image and parse it. One thing to note is that reading from a bytes.Buffer consumes it, so it can only be used as a reader once. We want to keep the raw bytes around for serving later, so the image parser gets a new reader over the slice instead.
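
A small standalone demonstration of the difference, using only the standard library. The names drain, bufferTwice and sliceTwice are made up for this sketch.

```go
package main

import (
	"bytes"
	"fmt"
	"io"
)

// drain reads everything from r and reports how many bytes it got.
func drain(r io.Reader) int {
	n, _ := io.Copy(io.Discard, r)
	return int(n)
}

// bufferTwice drains the same bytes.Buffer twice and returns both counts.
func bufferTwice(s string) (int, int) {
	buf := bytes.NewBufferString(s)
	return drain(buf), drain(buf)
}

// sliceTwice wraps the same byte slice in a fresh bytes.Reader each time.
func sliceTwice(s string) (int, int) {
	data := []byte(s)
	return drain(bytes.NewReader(data)), drain(bytes.NewReader(data))
}

func main() {
	fmt.Println(bufferTwice("hello")) // 5 0
	fmt.Println(sliceTwice("hello"))  // 5 5
}
```

The second read from the buffer comes back empty, while each new bytes.Reader starts from the top of the slice.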

type imagereq struct {
        src string
        srchash string
        data []byte
        x, y int
        reply chan<- *imagereq
}

var imagereqchan = make(chan *imagereq)
func imagemonitor() {
        var cache = make(map[string]*imagereq)

        for req := range imagereqchan {
                reply := cache[req.srchash]
                if reply == nil {
                        img, data, err := fetchimg(req.src)
                        if err != nil {
                                req.reply <- nil
                                continue
                        }
                        reply = &imagereq{
                                req.src,
                                req.srchash,
                                data,
                                img.Bounds().Dx(),
                                img.Bounds().Dy(),
                                nil,
                        }
                        log.Println("img dimensions", reply.x, reply.y)
                        cache[req.srchash] = reply
                }
                req.reply <- reply
        }
}

The monitor is the heart of the cache. We allow clients to make requests and if found, return the saved values. No downloading twice, no parsing twice. Images are saved (and will later be served) using a hash of the src attribute. This does intentionally serialize all image requests, but in exchange we get nicely sized rectangles, and assuming that the link from here to upstream is faster than the (mobile) link from the browser to here, the delay is not too noticeable.
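
Stripped of the image handling, the monitor pattern is one goroutine that owns a map and answers requests over channels. A minimal sketch of just that shape follows. The request type, lookup, demo, and the fetch callback are all invented for illustration and stand in for imagereq and fetchimg.

```go
package main

import "fmt"

// request asks the monitor for the value stored under key.
type request struct {
	key   string
	reply chan<- string
}

// monitor owns the cache map outright, so no mutex is needed.
// fetch stands in for the slow download-and-decode step.
func monitor(reqs <-chan request, fetch func(string) string) {
	cache := make(map[string]string)
	for req := range reqs {
		val, ok := cache[req.key]
		if !ok {
			val = fetch(req.key)
			cache[req.key] = val
		}
		req.reply <- val
	}
}

// lookup sends one request and waits for the answer.
func lookup(reqs chan<- request, key string) string {
	ch := make(chan string)
	reqs <- request{key, ch}
	return <-ch
}

// demo runs the same lookup twice and reports how often fetch ran.
func demo() (string, string, int) {
	fetches := 0
	reqs := make(chan request)
	go monitor(reqs, func(k string) string {
		fetches++
		return "data for " + k
	})
	a := lookup(reqs, "a.png")
	b := lookup(reqs, "a.png")
	close(reqs)
	return a, b, fetches
}

func main() {
	a, b, n := demo()
	fmt.Println(a, "|", b, "|", n)
}
```

Because only the monitor goroutine touches the map, the second lookup is answered from the cache and fetch runs once. The price is the serialization mentioned above: a slow fetch blocks every other request queued behind it.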

And now for the hard part. Replacing parts of the tree. It’s a big function, but lots of lines are boilerplate.

func cacheimages(node *html.Node) {
        if node.Type == html.ElementNode && node.Data == "img" {
                ch := make(chan *imagereq)
                req := &imagereq{
                        getattr(node, "src"),
                        hashsrc(getattr(node, "src")),
                        nil, 0, 0, ch,
                }
                imagereqchan <- req
                reply := <-ch
                if reply != nil {
                        log.Println("replacing img")
                        // Turn the img node itself into an empty outer div.
                        node.Data = "div"
                        node.DataAtom = atom.Div
                        node.Attr = nil
                        node.FirstChild = nil
                        node.LastChild = nil
                        innerDiv := &html.Node{
                                Type:     html.ElementNode,
                                DataAtom: atom.Div,
                                Data:     "div",
                        }
                        innerText := &html.Node{
                                Type: html.TextNode,
                                Data: fmt.Sprintf("image: %d x %d %d bytes",
                                        reply.x, reply.y, len(reply.data)),
                        }
                        innerDiv.AppendChild(innerText)
                        // The placeholder box has the same dimensions as the image.
                        setattr(innerDiv, "style",
                                fmt.Sprintf("border: 4px solid black; width: %dpx; height: %dpx",
                                        reply.x, reply.y))
                        // Stash the replacement markup; the click handler swaps it in.
                        setattr(innerDiv, "data-cfcimg", fmt.Sprintf(`<img src="%s">`, reply.srchash))
                        node.AppendChild(innerDiv)
                        setattr(node, "onclick", "cfcimgclick(this)")
                }
        }
        if node.Type == html.ElementNode && node.Data == "body" {
                script := &html.Node{
                        Type:     html.ElementNode,
                        DataAtom: atom.Script,
                        Data:     "script",
                }
                js := &html.Node{
                        Type: html.TextNode,
                        Data: `function cfcimgclick(elem) {
        elem.innerHTML = elem.children[0].dataset.cfcimg
        elem.onclick = undefined
}`,
                }
                script.AppendChild(js)
                node.AppendChild(script)
        }
        for c := node.FirstChild; c != nil; c = c.NextSibling {
                cacheimages(c)
        }
}

Whenever we see an img tag, we download the image (via the monitor cache). Then we strip the current node down to basics, and rebuild it as a div. We insert an inner div and some text which tells us how big the replaced image was, and we set a tiny javascript function so that clicking on the empty box will load the image. The effect is something like this.

image: 629 x 264 9970 bytes

The complete code is in cfc.go.
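
The hashsrc helper isn't shown above. One plausible version, purely as an assumption: hash the src string and hex encode it, so the result is stable and safe to use as a relative URL. The choice of SHA-256 and the 16 byte truncation are mine, not necessarily what cfc.go does.

```go
package main

import (
	"crypto/sha256"
	"fmt"
)

// hashsrc maps an image URL to a short, URL-safe name.
// Assumption: SHA-256 truncated to 16 bytes; any stable hash would do.
func hashsrc(src string) string {
	sum := sha256.Sum256([]byte(src))
	return fmt.Sprintf("%x", sum[:16])
}

func main() {
	fmt.Println(hashsrc("http://example.com/graph.png"))
}
```

The same src always maps to the same name, which is what lets the monitor use it as a cache key and the server use it to find the saved bytes when the browser asks for the image.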

One tricky aspect is the network setup. We’re not quite at transparent proxy levels of convenience here. Because this code works for multiple hosts, it uses the Host header instead of hardcoding an upstream. We need some split view DNS to direct browsers to our server, but still allow the server to contact the original upstream. One way to do this is with unbound views if you’re using the right version. Or you need a python plugin. Or something. It’s unfortunately too involved for an easy copy and paste config.
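
On the Go side, picking the upstream from the Host header might look roughly like this. upstreamURL and demoURL are names made up for the sketch, and it assumes plain http upstreams; https would need more thought.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
)

// upstreamURL rebuilds the URL the browser thought it was asking for,
// using the Host header rather than a hardcoded upstream.
// Assumption: upstreams are reachable over plain http.
func upstreamURL(r *http.Request) string {
	return "http://" + r.Host + r.URL.RequestURI()
}

// demoURL fakes an incoming request for the given host and target.
func demoURL(host, target string) string {
	r := httptest.NewRequest("GET", target, nil)
	r.Host = host
	return upstreamURL(r)
}

func main() {
	fmt.Println(demoURL("blog.example.com", "/post/?page=2"))
	// http://blog.example.com/post/?page=2
}
```

This is exactly why the split view DNS matters: when the server fetches that URL, the hostname has to resolve to the real upstream and not back to the server itself.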

Posted 24 Feb 2017 21:59 by tedu Updated: 24 Feb 2017 22:03
Tagged: go programming web