Scanning and OCR on Linux with gscan2pdf

Posted by JD 04/04/2011 at 22:00

When you run a business, you will probably need to scan documents and store them in a document management system (DMS). Often, those scans are completely unsearchable because the text is not included for the DMS to index. Entering metadata for each document then becomes critical, especially the keywords that someone else is likely to use to find the document later.

Xsane for Scanning

I’ve been scanning with xsane on Ubuntu/Lubuntu for a few years. The Brother all-in-one MFC-240C in my home office is used for faxing and scanning. xsane found the scanner and it worked just as expected. xsane runs as a normal user; no root or sudo is needed. The MFC-240C is a great sheet-fed scanner for home use.

Improved Scanning + OCR With gscan2pdf

Installation was uneventful. The standard install method for Ubuntu/APT worked and brought in necessary dependencies.

sudo apt-get install gscan2pdf

Next I tried to run the application as a normal user. I wasn’t hopeful, since whenever you connect to hardware there are usually group permissions to work out. Still, since I’d already been scanning as the same user with xsane, I was cautiously optimistic. It didn’t work: gscan2pdf got stuck searching for the scanner hardware and its properties. OK, so perhaps the first run needs root to set up the hardware:

sudo gscan2pdf

It found the scanner and set some properties (I guess), so I dropped a 6-page document into the sheet feeder, set the resolution to 600 dpi greyscale, and told the program to scan all the pages. I heard the sheet feeder pull the first page and heard the scanner go. As the 2nd page was pulled in, some of the applications installed as dependencies were spawned, and notifications that they were running were displayed. The scanning continued, uninterrupted.

As each page was scanned, a thumbnail was displayed in the left border and the main page area showed the scan for page 1.

OCR – Optical Character Recognition

I don’t recall whether the OCR was a checkbox or automatically included in the job. I do recall there were choices for where the text would be placed inside the resulting PDF. I chose to place the text under the image for the page; the other options were before or after. With that choice, text searches locate the correct page.

At the bottom of each scanned page is an area with the text results of the OCR process. For the first page, the word accuracy was about 90%, with many consistent mistakes. 90% accuracy sounds better than it turns out to be: correcting 10% of the words on a page takes longer than I would have liked. There is no spell checker built into this tool, so I copied each page of text into LibreOffice and used its spell checker to correct the problems. Some of the OCR errors produced words that are in the dictionary but didn’t make any sense in context. This is a common issue for OCR.

The good news is that the PDF file contains the fairly high resolution scan, which definitely shows the words just as you’d expect.
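Because many of the OCR mistakes were consistent, one option is to fix the repeat offenders in bulk before spell checking what remains. Below is a minimal sketch, assuming GNU sed (the default on Ubuntu) and that the OCR text for a page has been saved to a plain text file. The function name fix_ocr and the substitution pairs are hypothetical examples, not errors taken from the actual scan.

```shell
#!/bin/sh
# fix_ocr: read OCR text on stdin, write corrected text on stdout.
# Each -e expression replaces one known, recurring misread word.
# \b is a GNU sed word boundary, so "witbout" is left alone.
# The word pairs are hypothetical; replace them with the mistakes
# that actually recur in your own scans.
fix_ocr() {
    sed -e 's/\btbe\b/the/g' \
        -e 's/\bTbe\b/The/g' \
        -e 's/\bwitb\b/with/g'
}
```

Usage would look like: fix_ocr < page1.txt > page1.fixed.txt. This only handles whole-word mistakes that are always wrong; words that are valid but wrong in context still need a human read-through.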

The Results

I’ve found a new scanning tool. It works and creates image-based PDF files. At this point, the only drawback is that running this tool without elevated privileges doesn’t work, at least not so far. For most home users, this is a minor issue.
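A possible starting point for tracking down the privilege problem is to check whether the normal user belongs to the groups that commonly own scanner devices. The sketch below is only that: the group names scanner and lp are assumptions (they are typical on Debian and Ubuntu, but your device may be owned by something else), the function names are made up for illustration, and nothing here is confirmed as the cause of the problem described above.

```shell
#!/bin/sh
# in_group USER GROUP: succeed if USER is a member of GROUP.
in_group() {
    id -nG "$1" 2>/dev/null | tr ' ' '\n' | grep -qx -- "$2"
}

# report_scanner_groups [USER]: print membership for groups that
# often own scanner devices. "scanner" and "lp" are assumptions;
# check the owner of your own device node to be sure.
report_scanner_groups() {
    user=${1:-$(id -un)}
    for grp in scanner lp; do
        if in_group "$user" "$grp"; then
            echo "$user is in $grp"
        else
            echo "$user is NOT in $grp"
        fi
    done
}

report_scanner_groups
```

If the sane-utils package is installed, running scanimage -L as the same user lists the devices SANE can see, which helps separate a permissions problem from a detection problem in the application itself.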

I forgot to mention that I ran this program over an ssh -X connection. No issues.
