Getting a single document into Paperless while accomplishing the goals listed in part 2 requires several steps:
- Scan documents to PDF files
- Process PDFs through OCR engine
- Import PDFs into Paperless.app
The ScanSnap comes with ScanSnap Manager (SSM). SSM allows you to create scan profiles, which specify scan quality, destination (file, OCR app, Paperless.app, etc.), and simplex versus duplex (one or both sides of the paper). There are 4 quality choices: Normal (150dpi, 18ppm), Better (200dpi, 12ppm), Best (300dpi, 6ppm), and Excellent (600dpi, 0.6ppm). Excellent is very slow. I use it only for scanning photos.
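To put those speeds in perspective, here is a rough Python sketch that estimates feed time for a stack at each quality setting. The pages-per-minute figures are the ones quoted above; the 50-page stack is just an illustrative number, and the estimate ignores reloading the feeder, jams, and any difference between simplex and duplex.

```python
# Pages-per-minute figures as quoted above for each ScanSnap Manager quality setting.
QUALITY_PPM = {
    "Normal": 18.0,     # 150dpi
    "Better": 12.0,     # 200dpi
    "Best": 6.0,        # 300dpi
    "Excellent": 0.6,   # 600dpi
}


def scan_minutes(pages: int, quality: str) -> float:
    """Estimate feed time in minutes for `pages` pages at the given quality."""
    return pages / QUALITY_PPM[quality]


if __name__ == "__main__":
    pages = 50  # hypothetical stack size, chosen only for illustration
    for quality in QUALITY_PPM:
        print(f"{quality:<10} {scan_minutes(pages, quality):6.1f} min")
```

At these rates the same stack takes three times as long at Best as at Normal, and ten times longer again at Excellent, which is why I reserve Excellent for photos.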
I ran a few OCR tests to determine which settings would produce the most accurate OCR. Adobe suggests scanning in Black & White at 300 or 600 dpi. The tips I found on Abbyy’s site suggested that the higher the quality of the scan, the better the results. After experimenting, I reached the following conclusions:
- Acrobat Pro does a good job on high-quality text documents
- Acrobat does a poor job on halftone text (e.g., receipts printed on thermal printers, dot-matrix output, faded documents)
- On documents that Acrobat does poorly at, it does no better if the document is scanned in B&W versus color.
- When scanning in B&W, faded receipts are often illegible. Scan in color instead.
- Abbyy FineReader does a great job on almost any legible scan
- The difference between 600dpi and 300dpi is not significant
Factor in that disk is cheap, my time is not, and I’ll never be able to scan these documents again, and the settings I use for all documents are: Best, Duplex, and Color. Color is easy to desaturate to greyscale, the ScanSnap will omit the back side of the page if it is blank, and I can downsample images later if I need to. With Best quality, it takes about the same amount of time to scan a stack of documents as it takes me to sort, organize, and jog a fresh stack for the document feeder.
ScanSnap Manager will feed new scans directly to an application, so I tested configuring it to send each scan straight to FineReader for OCR processing. While that may work well for day-to-day SOHO needs, it is too slow when staring down a mountain of papers. For that, raw speed is required, and nothing will get you through step 1 faster than saving to files.
There is one more setting to deal with. This dialog box describes the choice:
If I have a stack of 50 sheets in the document feeder, do I want them to end up as 50 PDF files, or one PDF with 50 pages? If the 50 pages are each individual receipts, then I want each one as its own file. When I have more than one page in a document, such as a phone bill or an American Express cardholder statement, a multipage PDF is perfect. The ScanSnap can’t really know my intent, so I must tell it. I do so by creating two profiles. The first, called ‘Standard’, produces a multipage PDF: it takes everything I drop in the hopper and outputs a single PDF. The second is named ‘One file per page,’ and it does just that.
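The difference between the two profiles can be modeled in a few lines of Python. This is only an illustration of the grouping, not anything ScanSnap Manager exposes: the function name `plan_output_files` is made up for the example, and the profile names are simply the ones I chose above.

```python
def plan_output_files(page_count: int, profile: str) -> list[list[int]]:
    """Group 1-based page numbers into output PDFs, one inner list per file.

    Hypothetical helper for illustration; the profile names match the two
    profiles described above.
    """
    pages = list(range(1, page_count + 1))
    if profile == "Standard":
        # Everything in the hopper ends up in a single multipage PDF.
        return [pages] if pages else []
    if profile == "One file per page":
        # Every page becomes its own single-page PDF.
        return [[page] for page in pages]
    raise ValueError(f"unknown profile: {profile!r}")


if __name__ == "__main__":
    for profile in ("Standard", "One file per page"):
        files = plan_output_files(50, profile)
        print(f"{profile}: {len(files)} file(s)")
```

So a 50-page phone bill goes through ‘Standard’ and comes out as one file, while 50 receipts go through ‘One file per page’ and come out as 50.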
Next time, “the shortest path between 8 file drawers of paper and 10,000 PDFs”