Late to the OCR Party

I’m embarrassed to admit that I don’t use OCR for converting documents into plain text as much as I probably should. It is a very handy utility, and it is one that computers have been doing for a long time. Indeed, I remember using OCR in college, at the computer lab where I worked, scanning a single page of print and watching the software read it and turn it into plain text with astonishing accuracy. It seemed like magic.

And what did I do with that magical text? I took that text, put it in a Word document, and printed it out.

Today, there are many more useful things to do with OCR, particularly for scholars and academics. One example is to share the text of historical primary documents instead of image files of the documents.¹ For years, I have been sharing readings with my students as PDF files, but in the mobile-first era of the web, it makes much more sense to share a webpage that someone can easily read on a mobile device, instead of a PDF that they have to pinch-and-zoom, or even print out, to read.

Earlier this week, I began sharing with my students plain text files, instead of PDF scans, of readings not available in their textbooks. Doing this yields some benefits:

They can read the text on mobile devices.
Visually impaired students can use a screen reading device to “read” the document.
They can search the text.
They can resize the text, either bigger or smaller.
They can parse the text to read with a browser utility like Apple’s Safari Reader or a read-later application like Instapaper.
They will appreciate the much smaller file size, roughly 100 times smaller, especially if they are using a mobile device.

If sharing readings as plain text instead of PDF files makes so much sense, what took me so long?

Honestly, I didn’t know what tool I should use. I can’t remember the software I first used in 1997, but it’s safe to assume it doesn’t exist anymore. Acrobat offers OCR, but I haven’t had a Creative Cloud license since the days of Creative Suite 3. Although I have a lot of apps that can scan and convert to text, such as the one for a Doxie scanner or PDFPen+Scan for iOS, most of these readings are in PDF already. I don’t want to print and scan them just to do OCR.

Lo and behold, Google Drive converts PDF to text. I just learned about this yesterday, and I like the results. To use Google Drive for OCR, follow these three steps:

Upload your PDF file to Google Drive, if it’s not there already.
Right-click on the file.
Select Open With > Google Docs.

After a few minutes, depending on the size of your document, you can see the converted text. The results are pretty good. Obviously, the clearer and better your text, the more accurate the OCR will be. One cool feature is that it “respects” the pagination and hyphenation of your original document. If your document has page headers or page footers, those will appear. Since I’m interested in capturing only the text—not the pagination or hyphenation—of the document, I have to remove those from my final text document.
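The cleanup described above can be partly automated. What follows is only a rough sketch in Python, not part of my actual workflow. It assumes that page headers and footers show up as identical lines repeated throughout the text, that page numbers sit on lines by themselves, and that a hyphen at the end of a line marks a word split across lines. The function name `clean_ocr_text` and the `min_repeats` threshold are hypothetical choices for illustration.

```python
import re
from collections import Counter


def clean_ocr_text(raw: str, min_repeats: int = 3) -> str:
    """Drop likely headers, footers, and page numbers, then rejoin split words."""
    lines = [line.strip() for line in raw.splitlines()]

    # Assumption: a non-blank line that recurs min_repeats times or more
    # is a running page header or footer, not part of the reading.
    counts = Counter(line for line in lines if line)
    kept = [
        line
        for line in lines
        if counts[line] < min_repeats and not re.fullmatch(r"\d+", line)
    ]
    text = "\n".join(kept)

    # Rejoin words hyphenated across a line break: "docu-\nment" -> "document".
    # Limitation: a real compound split at its hyphen ("well-\nknown")
    # is also joined, so the result still needs a quick proofread.
    return re.sub(r"(\w)-\n+(\w)", r"\1\2", text)


sample = (
    "Course Reader\nThe docu-\nment was long.\n1\n"
    "Course Reader\nIt was well scanned.\n2\n"
    "Course Reader\nThe end."
)
print(clean_ocr_text(sample))
```

A short reading with only one or two pages will not trip the repeat threshold, so lower `min_repeats` or delete the header by hand in that case.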

The nice thing about having a plain text document is that you can lightly format it as needed. Since I use Markdown, I recommend using a Markdown-capable text editor to parse the text. You’ll have a relatively unadulterated text file and can export it to any format you want from there. You can export to PDF, unstyled HTML, or RTF. And as I did with my first try at OCR in 1997, you can even print it.

¹ One of my big complaints about #kidstoday is that they are keen to share screenshots of a website (or worse, a photo of a computer display with the browser window) instead of sharing the URL of the site.

Juan Monroy
