4.03.2008:
Anyone know anything about OCR (and how to make it not suck)
9:05 AM
Ok, so my printer sent me the PDF of the image file from which they print my book. Great.
What's not so great is that the entire thing is just one gigantic image. And that sucks, because what I really need is the actual text (so I can avoid literally redoing the entire layout process from scratch, edits and whatnot included). So I tried doing OCR with Adobe Acrobat Pro 8.1, and... well... it sucks.
It recognizes like every 7th word, which is pretty damn useless. So, I'm wondering if anyone out there knows anything about how to make this not suck, and if so, what they recommend.
Anyone? Anyone? Bueller?
***UPDATE 4.15.08***
They sent me the version I sent them. They rule, I drool. Things is fixed now. Thanks everyone who emailed and IM'ed about this.
* * *
6 Comments:
We use this at work because it has done the best job of converting PDF to DOC out of all the ones we tried, free and paid for. There's a trial you can get as well:
http://www.investintech.com/prod_a2e_pro.htm
I *think* Foxit PDF Reader has a built-in text-grabber if you want to try it manually. I've never done anything with it myself, however. Also, try looking around the Distributed Proofreaders site (for example: http://www.pgdp.net/c/faq/scanning.php). If you are *really* having trouble, try signing up and asking around in their forums. Some of these people OCR for fun. Wish I could be of more help.
Not sure what you mean by "PDF of the image file". By "image file" I suspect you mean the original layout file (XPress, InDesign or similar)?
In most PDFs you can select the text and copy and paste it. If you have access to Acrobat Pro I thought there was a way to dump the text from the PDF as indicated here (see last bullet under "Adobe Reader 7"). Seems like you can do this from Reader as well.
http://www.adobe.com/products/acrobat/access_overview.html
Not sure how the text output is produced if there are things like sidebars and whatnot.
I wrote a primitive OCR tool for screen-scraping a while back. I dug it up just for fun.
Unfortunately, even after I successfully broke this page image down into lines and located the baseline for the text, I had to tell it practically every letter on the page.
Sadly, it doesn't work well with low-res, anti-aliased text. The results were pretty unimpressive:
Forsevenofthosetenyears,hecticplanet.comheldafew
coloredframeswiththewords''HecticPlanet,Baby!''Itwasnothi_ng.It
wascompletelyinert.Uselessin_everyway.
Then,ImetMichalWallace.
MichalandIworkedtogetheratoneofthemany''Greatidea!
Let_'spourmillion_s_nventurecapitalintoitandwatchitfest_eran_d
fail!"com__paniesdur_ingthelate90'san__deatly2000's.Wewer_ework
acquaintan_cesattthetimm_e,thesortofpeoplewh_oarepleasanttoon_e
anotherandshareoccasionaldiscourseaboutpsychologyorpolitics
duringserveroutages(butnow,he'soneofmybestfriendsand
biggestsu__pporters).On_eday,hement_ion_edtomethath_ehadstarted
ah_ostingbusinnesscahledCor_nerhost.To??elphi?nsta?_ttoh?ui?dhis
business.Ibo??gt?hanaccoun?_tan_dtn_ansfer_r_edthehecticplan_et.com
dan_ai?nnametohisservenrs,whereIpromptlyforgotitexisted.
Afewweeks?ater,afr_iendofmineaskedmetodoafavorand
checkot?tson_r?econter?t?_nanagenmen_?tsoftwa?_e?hewasinte?ested??h
tn?yjng,soIt?_h?re?Mitontt?e?hecticpla?het,com_siteandnoodledan-ou_ncl
withit,Ipostedsom__eit?mktoit;crapwhichreserrhblesiustabaut
everythingyau'veeverseenonanybloganywhere.,,Afewrecipes
forclhicke?n,somerantsabouttheNewYorkRangers'H?osingseason
(amazingwhatHASN'Tchangedintt?_h_reeyears..,Seriol_sly,MSG,you
n_eedtogetth__ethelt?outoftheb?_usinnessofrunningab??ackeytean?nM,
asyouappa?enth?yh?_ave?nocluel_h?owtodoit),an?__dsomepictu?_resand
crap.In?neverinten?ndedanyon?netoevenbotherlooki?ngatthesite,Ijust
stuckitallupthereasaproofofconcept,
T?__en,onenight,Itoldt?hissilh?ysl??oryatapartyIwasat
(sonmethingIan_m!knownt?__rougho??tt!heMetraAtlantaareafom_doing)
abouttheti?merMypare?_tsfaundapornotapethatInevereverngot
towatch.EveryoneenioyedItandmuchmerrimentwashad,andthen
weallwenthome,Afewdayslater,oneofthefolkswhoheardtln?e
storyemailedmeandaskedifIcoullwriteitoutforacoworkerofhis,
asthewast??yi?ngtoretehlitandcou?dn?tq?uite"qetitright.'"SoIdid.
Iguessthatcoworkerlikedit,becauseheforwardeditto
severalofhisfrien?nds,whotlhenforwardeditootheirfriends.Igot
afewemailsbackfr?_on?m?mran?domstrangerswlh?osaidtlhattheyreally
enjoyedthestory.AH?lttlewhilelater,_Igotanne?mailfromsomeor?newho
wantedtosh?howittosomeofh?h?erfrien?nds,but:h?h_addeletedtlheee?mail.
SIheaskedifIcouldresendittoher.SoIdidonebetter-Ipostedit
tomysillyhecticplanet.comtestsiteandpolntedherthere.
An?ndso.wi?hawhisper?_andg=uuiteunintentionally,itbegan?_?.
Asanaside,IhavetgsaythatIunderstandwhythatstory
nevermadeitthroughtheyoting.ButIfeeiit'dbeareaishameto
publishthisbooka?ndnotinchudetheveryfirststoryIeverwrote,
SooooooI'mgoingtoshhareittwithyounow.Consider_itatr_eat
I might be able to do something if your PDF is high res. No promises though.
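For anyone curious what "breaking the page image down into lines and locating the baseline" can look like, here is a minimal sketch in Python. It is not the tool described in this comment. It assumes the page has already been thresholded into a grid of 0 (paper) and 1 (ink), and it uses a simple row-by-row ink count (a horizontal projection profile). The function names and the tiny sample page are made up for illustration.

```python
from typing import List, Tuple


def find_text_lines(bitmap: List[List[int]], min_ink: int = 1) -> List[Tuple[int, int]]:
    """Return (top, bottom) row indices, inclusive, for each run of rows with ink."""
    lines = []
    start = None
    for y, row in enumerate(bitmap):
        has_ink = sum(row) >= min_ink
        if has_ink and start is None:
            start = y                      # a new band of text begins here
        elif not has_ink and start is not None:
            lines.append((start, y - 1))   # the band ended on the previous row
            start = None
    if start is not None:                  # band runs to the bottom of the page
        lines.append((start, len(bitmap) - 1))
    return lines


def estimate_baseline(bitmap: List[List[int]], top: int, bottom: int) -> int:
    """Guess the baseline as the row after which the ink count drops the most.

    Heuristic: letter bodies put a lot of ink on the rows above the baseline,
    while descenders (g, p, y) put only a little below it.
    """
    ink = [sum(bitmap[y]) for y in range(top, bottom + 1)]
    ink.append(0)  # sentinel: treat the row below the band as empty
    drops = [ink[i] - ink[i + 1] for i in range(len(ink) - 1)]
    return top + drops.index(max(drops))


# Hypothetical 9-row "page": two text lines, the first with one descender row.
PAGE = [
    [0, 0, 0, 0, 0, 0],
    [1, 0, 1, 0, 1, 0],
    [1, 1, 1, 1, 1, 1],
    [1, 1, 1, 1, 1, 1],
    [0, 1, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 0],
    [1, 1, 0, 1, 1, 0],
    [1, 1, 1, 1, 1, 1],
    [0, 0, 0, 0, 0, 0],
]

if __name__ == "__main__":
    for top, bottom in find_text_lines(PAGE):
        print(top, bottom, estimate_baseline(PAGE, top, bottom))
```

Low-res, anti-aliased scans are hard for this kind of approach because gray edge pixels do not threshold cleanly into ink or paper, so both the line bands and the letter shapes come out ragged.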
Also, Google has a much more advanced open source OCR engine called Tesseract:
http://google-code-updates.blogspot.com/2006/08/announcing-tesseract-ocr.html
You might try getting that working.
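If you do get it installed, a rough sketch of driving it from Python might look like the following. This assumes the `tesseract` command-line program is on your PATH and that each page has already been exported from the PDF as an image, since Tesseract does not read PDFs directly. The file name `page48.png` and the helper names are hypothetical.

```python
import shutil
import subprocess
from pathlib import Path
from typing import List


def build_tesseract_command(image_path: str, output_base: str, lang: str = "eng") -> List[str]:
    """Build the command line: tesseract <image> <outputbase> -l <lang>."""
    return ["tesseract", image_path, output_base, "-l", lang]


def ocr_image(image_path: str, output_base: str, lang: str = "eng") -> str:
    """Run Tesseract on one page image and return the recognized text."""
    if shutil.which("tesseract") is None:
        raise RuntimeError("tesseract executable not found on PATH")
    subprocess.run(build_tesseract_command(image_path, output_base, lang), check=True)
    # Tesseract writes its plain-text result to <outputbase>.txt
    return Path(output_base + ".txt").read_text(encoding="utf-8")


if __name__ == "__main__":
    # Hypothetical input file; replace with a real exported page image.
    print(ocr_image("page48.png", "page48"))
```

Keeping the command construction in its own function makes it easy to check what will be run before pointing it at a few hundred pages.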
Here's page 48, taken from Google Book Search and run through Tesseract:
by going to eat what they affectionately and repeatedly referred to as
"bait".
^This plate looks like it landed here after being kicked out of
Japan by Gadzilla."
'Shhhhhl Joe, be nlte please"'
"But sweetie, LOOK at ltl This is?"
"TH!5 is an effort by my grandmother and aunt to do something
that we rnlght nnd fun." Her voice was hushed but her tone was stern.
"BE APPRECIATIVE."
^Fine, line. Ijust hope everyone isn't kung-fu fighting-"
'lFine "
We entered the building and were greeted by a tall white girl
dressed in a kmmono who looked exactly as a traditional Japanese
woman would lfAbertromble and Fvtth existed in Feudal Japan.
"KoYNeeeeechy Wahl"
I immediately looked over at Mike, noticing the battle he was
waging against the urge to break into an uncontrollable fit of laughter.
"Welcome to The Sushi Palate! How many are in your party?"
"Six," I replied through clinched teeth, fighting the corners of
my mouth to keep them from upturning in a smlrk.
"Ahh-so! Verrrry goad, sin Right this way!"
The only thing that could have possibly been more degrading to
Japanese people everywhere would have been if they had Charlle Chan
vldeos playing frorn monitors in the lobby and a scale model of a World
War II interhrnent Camp made of Lego's.
I let the ladies go in front of us, dropping back to swap a few
snlde Comments with rny best friend.
`~]esus, you think the JACL knows about this place?"
^I'rn sure if they did, they'd send a track team of ninjas to take
the place out "
We cackled as we made our way to our table which rested on
the lloor. We notnted everyone pausing for a moment as they walked
in, bending down to pick up the shoes they just kicked off and placing
them into a cubby hale Just on the other side of the wall.
^Uhh. . NO. You are NOT doing that," I told Mlke.
`~Yeah, I don't want to. I doubt anyone in here does."
You see, Mike suffers from Hyperhydrosis, a Condition that
Causes a person to sweat excessively in localized areas of the body.
Not great, but a whole lot better than what my engine came up with after training manually on the first two pages of the story.
Send me the PDF. I'd like to try this with the high res version.
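To put a number on "not great, but a whole lot better", one option is a simple word-accuracy score: line up the OCR output against the known-correct text and count how many reference words survive intact. The sketch below uses Python's standard `difflib`. The function name is made up, and the "correct" reference line is my guess at what the page actually says.

```python
from difflib import SequenceMatcher


def word_accuracy(reference: str, ocr_output: str) -> float:
    """Fraction of reference words that appear, in order, in the OCR output."""
    ref_words = reference.split()
    ocr_words = ocr_output.split()
    if not ref_words:
        return 1.0 if not ocr_words else 0.0
    # autojunk=False keeps difflib from ignoring very common words on long texts
    matcher = SequenceMatcher(None, ref_words, ocr_words, autojunk=False)
    matched = sum(block.size for block in matcher.get_matching_blocks())
    return matched / len(ref_words)


if __name__ == "__main__":
    # Assumed ground truth for one line of the Tesseract sample above.
    reference = "my mouth to keep them from upturning in a smirk."
    tesseract_line = "my mouth to keep them from upturning in a smlrk."
    print(word_accuracy(reference, tesseract_line))  # 0.9
```

Output with no spaces at all, like the first sample in this thread, scores close to zero on a word-level measure, so a character-level comparison would be fairer to an engine that gets the letters mostly right but loses the word breaks.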
Way back when I was a church secretary I used OCR a lot.. a lot lot. Anyway, I found the HP Scanning to text thing that came with our HP OfficeJet to be really good. Especially if you jack up the resolution and are working with an original that was printed in good quality. Downside is you have to have an HP OfficeJet. Good side is it's not expensive.. =\