going paperless the hard way

A while back, I read about filling dumpsters and going paperless with some interest, but it didn’t move me to action because I don’t really have that much paperwork. Then came another DIY approach, something worthy of turning into a project.

My paperless needs are rather modest. All I really want is easy access to a couple of documents (contracts, stock options) to look up or verify the occasional number. Utility bills, sales receipts, and so forth are not a priority since I’m not running a business. But having a copy of my lease agreement on my laptop, so I don’t have to find it in the important files box under my bed, would be convenient. Word of caution: this is probably not the step-by-step guide you’re looking for.

picking a scanner

My constraints are that I’d like this system to work on OpenBSD (almost guaranteeing no proprietary software) and maybe other platforms, at least be accessible from other platforms, and of course, cheap. I’m not too ardent about only using open source for this project, but I didn’t want to get tied down and forced to keep a separate scanning computer because of platform restrictions.

With that in mind, I could probably get by with jpgs from a digital camera, but I’d like a little more refinement. I skipped over both the ScanSnap and the Doxie scanners, settling on a very modest Canon CanoScan LiDE 110, available for about $55-60. It’s a flatbed scanner, not a document scanner, and the reviews even say it’s not intended for home office use. It apparently doesn’t include much in the way of software, but works great with sane-backends. It’s also very cheap and it’s powered by a single USB cable, no power adapter required or provided. It’s been a few years since I last owned a scanner, but it seems pretty fast. A full scan takes less than 30 seconds.

the scanning process

I first played with xsane to make sure everything worked, then switched to scanimage. I like putting the paper in upside down, so I rotate it, then save to JPEG. The full command line for a letter-sized document is scanimage -x 213 -y 280 | pamflip -r180 | pnmtojpeg > page1.jpg. (Wikipedia says letter paper is 216mm wide, but the scanner’s max width is only 213.) The scans come out fairly gray, which I fix in post-processing with convert -white-threshold 70% -quality 60. That also cuts the file size down substantially. I’m scanning at the default 300dpi, which is a little overkill, but gives better OCR accuracy than 150dpi. A typical page of a contract weighs in at 500K. Maybe this could be optimized more (black-threshold, lower quality, rescale after OCR), but I don’t anticipate having enough documents to worry about storage.
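The scan and cleanup steps above can be wrapped in a small shell function. This is only a sketch, assuming scanimage, netpbm, and ImageMagick are on the PATH. The function names page_name and scan_page and the raw- temp file prefix are made up for illustration; the geometry, threshold, and quality values are the ones quoted above.

```shell
#!/bin/sh
# Hypothetical helper names; the commands and flags come from the text above.

# page_name N: print the output file name for page N, e.g. "page3.jpg".
page_name() {
    printf 'page%d.jpg\n' "$1"
}

# scan_page N: scan one letter-sized page, rotate it 180 degrees,
# save it as a JPEG, then whiten the gray background and recompress.
scan_page() {
    out=$(page_name "$1") || return 1
    raw="raw-$out"    # assumed temp name for the unprocessed scan
    scanimage -x 213 -y 280 | pamflip -r180 | pnmtojpeg > "$raw" || return 1
    convert "$raw" -white-threshold 70% -quality 60 "$out" || return 1
    rm -f "$raw"
}
```

With the scanner attached, scan_page 1 should leave page1.jpg in the current directory. File names are still picked by hand, just by number.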

For now, I’m entering file names by hand. The scanner does have four buttons on the front (PDF, Auto, Copy, E-mail) for computer-free use. Currently, there’s no polling software included in sane, but I verified it can detect button state. It should be possible, then, to wire up Auto to scan a page and PDF to collate them all together.

text mode

I’m currently sticking with JPEGs as my archive format, but did do some investigation into OCR. Between tesseract and gocr, tesseract wins easily. It requires two commands to run. The first converts to tiff: jpegtopnm page1.jpg | pnmtotiff > page1.tiff. The second extracts the text: tesseract page1.tiff page1 (note that tesseract will add a .txt extension). I don’t currently have plans to index my documents, so I’m not generating text except to verify it works. sqlite fulltext search works really well, so maybe I’ll use that.
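The two OCR commands can be chained in one function. Again just a sketch, assuming netpbm and tesseract are installed; ocr_base and ocr_page are hypothetical names, not part of either tool.

```shell
#!/bin/sh
# Hypothetical helper names; the two commands are the ones from the text above.

# ocr_base FILE.jpg: print the name with the .jpg suffix stripped.
ocr_base() {
    printf '%s\n' "${1%.jpg}"
}

# ocr_page FILE.jpg: convert the JPEG to tiff, then run tesseract on it.
# tesseract appends .txt itself, so page1.jpg ends up as page1.txt.
ocr_page() {
    base=$(ocr_base "$1")
    jpegtopnm "$1" | pnmtotiff > "$base.tiff" || return 1
    tesseract "$base.tiff" "$base"
}
```

Running ocr_page page1.jpg leaves page1.tiff next to the text file. The intermediate tiff can be deleted once the text looks right.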

mobile

I learned about JotNot from Steve’s blog post. It is/was on sale for only $1, so I did pick that up. It does a nice job of correcting for perspective and white balance. When I get serious about scanning everything, I suspect it will work great.

networking

For now, I’m keeping documents on one laptop and backing up with tarsnap. I’ll probably rsync to locally shared storage once the collection grows a little. Pie in the sky future, I’ll add a webdav server with a web app for the private cloud approach. But honestly, I have no need for 24/7 round the world access to these files.
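For the record, the backup side is about two commands. A sketch only, assuming tarsnap is already set up with a key and rsync is installed; the paperless- archive prefix, the function names, and the directory arguments are all placeholders I made up.

```shell
#!/bin/sh
# Hypothetical names throughout; only tarsnap -c -f and rsync -a are real.

# archive_name: print a dated archive name, e.g. "paperless-2011-06-20".
archive_name() {
    printf 'paperless-%s\n' "$(date +%Y-%m-%d)"
}

# backup_docs DIR: create a new tarsnap archive of DIR.
backup_docs() {
    tarsnap -c -f "$(archive_name)" "$1"
}

# sync_docs DIR DEST: mirror the contents of DIR into DEST on shared storage.
sync_docs() {
    rsync -a "$1"/ "$2"/
}
```

tarsnap archive names have to be unique, so with a date-only name a second backup on the same day will be refused. Adding the time to the name would get around that.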

comparison

Both blogs rely on OS X spotlight to perform indexing and searching. I wasn’t too keen on relying on spotlight, but I have few enough docs that manual sorting into directories is reasonable.

I did first consider the ScanSnap, but had serious concerns about the fact that it comes in separate Windows and Mac models and apparently does all the heavy lifting in software. As it turns out, there is sane support (not personally verified), but I’d still be paying a premium. The Canon doesn’t include, and hence doesn’t charge for, anything I’m not using.

As far as OCR goes, tesseract is quite accurate on the main body of text, mostly screwing up only on stylized text. No comparison to commercial software, but good enough for me.

I’m not shredding anything; all the originals are going to stay nice and cozy in their box.

My way is the hard way, and I’m still in the weeds a bit, but total cost so far: $61.

future

The most obvious thing to get working is the scanner buttons for an easier workflow. After that, I can think about OCR, indexing, and search more seriously. I do like Steve’s mobile workflow, even though I can’t see a use for it.

conclusion

If I wanted things to just work, I’d have gone with a product that just does what I want. By the time I recreate an automated workflow that others could use, I’ll have spent more time than I’ve saved money, but doing the project is half the fun.

Posted 20 Jun 2011 03:43 by tedu Updated: 29 Nov 2014 02:51
Tagged: paperless project