Google OCR.

aimeeandbeatles

watermelon
Joined
Apr 5, 2007
Messages
20,112
Question: I've been going through PDFs. I noticed some don't have any OCR'd (selectable) text when you load them in a PDF viewer, yet they appear in the search results. So I assume Google is doing the OCR on their end.

What engine do they use? Having issues trying to Google it :mischief:

Thanks.
 
Answer: Not sure since Google seems to have sponsored different OCR projects (Tesseract, OCRopus). I'd bet it is proprietary since OCR has been sort of a holy grail in computing for a long time. The US postal service paid lots for theirs to automate the processing of mail.

You might try the programs reviewed here (most are free or free trial):
http://www.makeuseof.com/tag/top-5-free-ocr-software-tools-to-convert-your-images-into-text-nb/

MS OneNote used to be free to college students.


EDIT: You might also try this advice:
This morning, I opened the original image in Picasa for some simple tweaks. First, I cropped out all irrelevant, surrounding text, and then brightened the image and heightened the contrast. The result is a more white background and darker, clearer text.
http://diyivorytower.wordpress.com/2011/01/14/ocr-in-google-docs-makes-transcription-simple/
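
If you'd rather script that crop/brighten/contrast step than do it by hand in Picasa, here is a rough sketch of the idea in plain Python. It assumes the page is already loaded as a grid of grayscale values (0 = black, 255 = white); the function names and the 5%/95% cutoffs are my own picks for illustration, not anything Picasa or Google Docs is known to use.

```python
def crop(pixels, top, left, bottom, right):
    """Keep only rows top..bottom-1 and columns left..right-1."""
    return [row[left:right] for row in pixels[top:bottom]]


def stretch_contrast(pixels, low_pct=0.05, high_pct=0.95):
    """Linear contrast stretch on a grid of 0-255 grayscale values.

    Values at or below the low percentile become 0 (black text),
    values at or above the high percentile become 255 (white paper).
    """
    flat = sorted(v for row in pixels for v in row)
    if not flat:
        return []
    lo = flat[int(low_pct * (len(flat) - 1))]
    hi = flat[int(high_pct * (len(flat) - 1))]
    if hi <= lo:
        # Flat image: nothing to stretch, return a copy unchanged.
        return [row[:] for row in pixels]
    scale = 255.0 / (hi - lo)
    return [
        [max(0, min(255, round((v - lo) * scale))) for v in row]
        for row in pixels
    ]


# A dull gray scan: "ink" around 100, "paper" around 200.
page = [
    [100, 110, 200],
    [105, 190, 195],
]
cleaned = stretch_contrast(crop(page, 0, 0, 2, 3))
```

After this the darkest pixels sit at 0 and the brightest at 255, which is roughly what "more white background and darker, clearer text" means in numbers.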


I also wonder if there's a way to identify the font in the PDF and instruct the OCR software what font to scan for. That might help the OCR process.
 
Well, it's not really for converting text; it's for searching big batches of PDFs that are just images...
 
That's odd. Looking for pictures of your favorite rockstars? That's not really OCR, since the C stands for "character", which implies text.


If you're trying to train a computer to search graphics for specific images, then I think that type of software only exists in TV dramas and in federal agencies.
 
No, I mean sometimes I find a lot of newspaper scan PDFs that Google Search hasn't indexed, so I download them and go through them manually. If there's text I use dnGREP, but sometimes there isn't any selectable text. I'm looking for articles.
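
For sorting a big batch into "already searchable" and "image only, needs OCR", something like this might save some clicking. It is only a sketch: it assumes the `pdftotext` command-line tool from Poppler is installed and on the PATH, and the function names, the `scans` folder, and the 20-character cutoff are made up for the example.

```python
import subprocess
from pathlib import Path


def extract_text(pdf_path):
    """Return the text layer of a PDF using Poppler's pdftotext.

    Assumes pdftotext is installed; "-" sends the text to stdout.
    """
    result = subprocess.run(
        ["pdftotext", str(pdf_path), "-"],
        capture_output=True,
        text=True,
        encoding="utf-8",
        errors="replace",
        check=True,
    )
    return result.stdout


def has_text_layer(text, min_chars=20):
    """Treat a PDF as searchable if it has at least min_chars letters/digits."""
    return sum(ch.isalnum() for ch in text) >= min_chars


def triage(pdf_paths, extractor=extract_text):
    """Split paths into (searchable, image_only) lists."""
    searchable, image_only = [], []
    for path in pdf_paths:
        if has_text_layer(extractor(path)):
            searchable.append(path)
        else:
            image_only.append(path)
    return searchable, image_only


# Example call (hypothetical folder name):
# searchable, image_only = triage(sorted(Path("scans").glob("*.pdf")))
```

The searchable pile can go straight to dnGREP; the image-only pile is what would need to go through an OCR tool first.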
 
OCR = ?
 
Optical Character Recognition.

Posting just for the sake of posting = :(
 
You don't get to over 30,000 posts if you don't post for the sake of posting.

But honestly I didn't know what OCR stood for.
 