
The Journal of Joe The Peacock. Yay.

Oh, yay... The journal of an internet author and professional dork. Hope it's what you wanted when you clicked that link you clicked.

 

4.03.2008:

Anyone know anything about OCR (and how to make it not suck)?

9:05 AM

Ok, so my printer sent me the PDF of the image file from which they print my book. Great.

What's not so great is that the entire thing is just one gigantic image. And that sucks, because what I really need is the actual text (so I can avoid literally redoing the entire layout process from scratch, including edits and whatnot). So, I tried doing OCR with Adobe Acrobat Pro 8.1, and... well... it sucks.

It recognizes like every 7th word, which is pretty damn useless. So, I'm wondering if anyone out there knows anything about how to make this not suck, and if so, what they recommend.

Anyone? Anyone? Bueller?


***UPDATE 4.15.08***

They sent me the version I sent them. They rule, I drool. Things is fixed now. Thanks everyone who emailed and IM'ed about this.


* * *








6 Comments:

Blogger phunkysai said...

We use this at work because it has done the best job of converting PDF to DOC out of all the ones we tried, free and paid for. There's a trial you can get as well:
http://www.investintech.com/prod_a2e_pro.htm

4/03/2008 10:28 AM  

Anonymous Zarshir said...

I *think* Foxit PDF Reader has a built-in text-grabber if you want to try it manually. I've never done anything with it myself, however. Also, try looking around the Distributed Proofreaders site (for example: http://www.pgdp.net/c/faq/scanning.php). If you are *really* having trouble, try signing up and asking around in their forums. Some of these people OCR for fun. Wish I could be of more help.

4/03/2008 7:02 PM  

Anonymous Anonymous said...

Not sure what you mean by "PDF of the image file". By "image file" I suspect you mean the original layout file (XPress, InDesign or similar)?

In most PDFs you can select the text and copy and paste it. If you have access to Acrobat Pro I thought there was a way to dump the text from the PDF as indicated here (see last bullet under "Adobe Reader 7"). Seems like you can do this from Reader as well.

http://www.adobe.com/products/acrobat/access_overview.html

Not sure how the text output is produced if there are things like sidebars and whatnot.

4/04/2008 6:18 PM  

Blogger sabren said...

I wrote a primitive OCR tool for screen-scraping a while back. I dug it up just for fun.

Unfortunately, even after I successfully broke this page image down into lines and located the baseline for the text, I had to tell it practically every letter on the page.
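
For anyone curious what "breaking the page down into lines and locating the baseline" can look like, here is a minimal sketch of one common approach (a horizontal projection profile). It is not the tool described in this comment. The function names and the 0/1 bitmap format are assumptions made up for the illustration, and the baseline guess is a crude heuristic.

```python
def find_text_lines(bitmap):
    """Return (top, bottom) row indexes for each run of rows that contain ink.

    bitmap is a list of rows; each row is a list of 0 (blank) or 1 (ink).
    This format is an assumption for the sketch, not a real image API.
    """
    lines = []
    start = None
    for y, row in enumerate(bitmap):
        has_ink = any(row)
        if has_ink and start is None:
            start = y                      # first inked row of a new text line
        elif not has_ink and start is not None:
            lines.append((start, y - 1))   # blank row closes the current line
            start = None
    if start is not None:                  # page ended while still inside a line
        lines.append((start, len(bitmap) - 1))
    return lines


def estimate_baseline(bitmap, top, bottom):
    """Guess the baseline as the row followed by the sharpest drop in ink.

    Heuristic: rows below the baseline hold only descenders, so the ink
    count usually falls off steeply right after the baseline row.
    """
    counts = [sum(bitmap[y]) for y in range(top, bottom + 1)]
    counts.append(0)                       # treat everything below the line as blank
    drops = [counts[i] - counts[i + 1] for i in range(len(counts) - 1)]
    return top + drops.index(max(drops))


if __name__ == "__main__":
    page = [
        [0, 0, 0, 0],
        [0, 1, 1, 0],
        [1, 1, 1, 1],
        [0, 1, 0, 0],   # a lone descender pixel
        [0, 0, 0, 0],
        [1, 1, 1, 1],
        [0, 0, 0, 0],
    ]
    for top, bottom in find_text_lines(page):
        print(top, bottom, estimate_baseline(page, top, bottom))
```

Real scans need thresholding and deskewing before anything like this works, which is part of why low-res, anti-aliased text is so painful.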

Sadly, it doesn't work well with low-res, anti-aliased text. The results were pretty unimpressive:

Forsevenofthosetenyears,hecticplanet.comheldafew
coloredframeswiththewords''HecticPlanet,Baby!''Itwasnothi_ng.It
wascompletelyinert.Uselessin_everyway.
Then,ImetMichalWallace.
MichalandIworkedtogetheratoneofthemany''Greatidea!
Let_'spourmillion_s_nventurecapitalintoitandwatchitfest_eran_d
fail!"com__paniesdur_ingthelate90'san__deatly2000's.Wewer_ework
acquaintan_cesattthetimm_e,thesortofpeoplewh_oarepleasanttoon_e
anotherandshareoccasionaldiscourseaboutpsychologyorpolitics
duringserveroutages(butnow,he'soneofmybestfriendsand
biggestsu__pporters).On_eday,hement_ion_edtomethath_ehadstarted
ah_ostingbusinnesscahledCor_nerhost.To??elphi?nsta?_ttoh?ui?dhis
business.Ibo??gt?hanaccoun?_tan_dtn_ansfer_r_edthehecticplan_et.com
dan_ai?nnametohisservenrs,whereIpromptlyforgotitexisted.
Afewweeks?ater,afr_iendofmineaskedmetodoafavorand
checkot?tson_r?econter?t?_nanagenmen_?tsoftwa?_e?hewasinte?ested??h
tn?yjng,soIt?_h?re?Mitontt?e?hecticpla?het,com_siteandnoodledan-ou_ncl
withit,Ipostedsom__eit?mktoit;crapwhichreserrhblesiustabaut
everythingyau'veeverseenonanybloganywhere.,,Afewrecipes
forclhicke?n,somerantsabouttheNewYorkRangers'H?osingseason
(amazingwhatHASN'Tchangedintt?_h_reeyears..,Seriol_sly,MSG,you
n_eedtogetth__ethelt?outoftheb?_usinnessofrunningab??ackeytean?nM,
asyouappa?enth?yh?_ave?nocluel_h?owtodoit),an?__dsomepictu?_resand
crap.In?neverinten?ndedanyon?netoevenbotherlooki?ngatthesite,Ijust
stuckitallupthereasaproofofconcept,
T?__en,onenight,Itoldt?hissilh?ysl??oryatapartyIwasat
(sonmethingIan_m!knownt?__rougho??tt!heMetraAtlantaareafom_doing)
abouttheti?merMypare?_tsfaundapornotapethatInevereverngot
towatch.EveryoneenioyedItandmuchmerrimentwashad,andthen
weallwenthome,Afewdayslater,oneofthefolkswhoheardtln?e
storyemailedmeandaskedifIcoullwriteitoutforacoworkerofhis,
asthewast??yi?ngtoretehlitandcou?dn?tq?uite"qetitright.'"SoIdid.
Iguessthatcoworkerlikedit,becauseheforwardeditto
severalofhisfrien?nds,whotlhenforwardeditootheirfriends.Igot
afewemailsbackfr?_on?m?mran?domstrangerswlh?osaidtlhattheyreally
enjoyedthestory.AH?lttlewhilelater,_Igotanne?mailfromsomeor?newho
wantedtosh?howittosomeofh?h?erfrien?nds,but:h?h_addeletedtlheee?mail.
SIheaskedifIcouldresendittoher.SoIdidonebetter-Ipostedit
tomysillyhecticplanet.comtestsiteandpolntedherthere.
An?ndso.wi?hawhisper?_andg1531436480442060027439801504687020051072=uuiteunintentionally,itbegan?_?.
Asanaside,IhavetgsaythatIunderstandwhythatstory
nevermadeitthroughtheyoting.ButIfeeiit'dbeareaishameto
publishthisbooka?ndnotinchudetheveryfirststoryIeverwrote,
SooooooI'mgoingtoshhareittwithyounow.Consider_itatr_eat


I might be able to do something if your PDF is high res. No promises though.

Also, Google has a much more advanced open source OCR engine called Tesseract:

http://google-code-updates.blogspot.com/2006/08/announcing-tesseract-ocr.html

You might try getting that working.
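
If you do get it installed, a rough sketch of driving the `tesseract` command line from Python might look like the following. It assumes the basic `tesseract <image> <output base> -l <language>` usage, which writes the text to `<output base>.txt`. The file names and helper function names are made up for illustration. Tesseract reads images rather than PDFs, so each page has to be exported to an image first, and which image formats work depends on your build.

```python
import shutil
import subprocess


def build_tesseract_command(image_path, output_base, lang="eng"):
    """Build the argument list for a basic tesseract run.

    tesseract appends ".txt" to output_base on its own.
    """
    return ["tesseract", image_path, output_base, "-l", lang]


def run_tesseract(image_path, output_base, lang="eng"):
    """Run tesseract on one page image and return the path of the text file."""
    if shutil.which("tesseract") is None:
        raise RuntimeError("tesseract executable not found on PATH")
    subprocess.run(build_tesseract_command(image_path, output_base, lang), check=True)
    return output_base + ".txt"


if __name__ == "__main__":
    # "page48.png" is a placeholder name, not a file from this post.
    print(" ".join(build_tesseract_command("page48.png", "page48")))
```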

4/05/2008 11:07 PM  

Blogger sabren said...

Here's page 48, taken from Google Book Search and run through Tesseract:


by going to eat what they affectionately and repeatedly referred to as
"bait".
^This plate looks like it landed here after being kicked out of
Japan by Gadzilla."
'Shhhhhl Joe, be nlte please"'
"But sweetie, LOOK at ltl This is?"
"TH!5 is an effort by my grandmother and aunt to do something
that we rnlght nnd fun." Her voice was hushed but her tone was stern.
"BE APPRECIATIVE."
^Fine, line. Ijust hope everyone isn't kung-fu fighting-"
'lFine "
We entered the building and were greeted by a tall white girl
dressed in a kmmono who looked exactly as a traditional Japanese
woman would lfAbertromble and Fvtth existed in Feudal Japan.
"KoYNeeeeechy Wahl"
I immediately looked over at Mike, noticing the battle he was
waging against the urge to break into an uncontrollable fit of laughter.
"Welcome to The Sushi Palate! How many are in your party?"
"Six," I replied through clinched teeth, fighting the corners of
my mouth to keep them from upturning in a smlrk.
"Ahh-so! Verrrry goad, sin Right this way!"
The only thing that could have possibly been more degrading to
Japanese people everywhere would have been if they had Charlle Chan
vldeos playing frorn monitors in the lobby and a scale model of a World
War II interhrnent Camp made of Lego's.
I let the ladies go in front of us, dropping back to swap a few
snlde Comments with rny best friend.
`~]esus, you think the JACL knows about this place?"
^I'rn sure if they did, they'd send a track team of ninjas to take
the place out "
We cackled as we made our way to our table which rested on
the lloor. We notnted everyone pausing for a moment as they walked
in, bending down to pick up the shoes they just kicked off and placing
them into a cubby hale Just on the other side of the wall.
^Uhh. . NO. You are NOT doing that," I told Mlke.
`~Yeah, I don't want to. I doubt anyone in here does."
You see, Mike suffers from Hyperhydrosis, a Condition that
Causes a person to sweat excessively in localized areas of the body.


Not great, but a whole lot better than what my engine came up with after training manually on the first two pages of the story.

Send me the PDF. I'd like to try this with the high res version.
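
To put a number on "a whole lot better", one option is a character error rate: the edit distance between the OCR output and a known-good copy of the same text, divided by the length of the good copy. This is a minimal sketch of that idea, not something either engine above reports. The function names are invented for the example, and the "correct" strings are my guesses at what the page actually says.

```python
def levenshtein(a, b):
    """Number of single-character inserts, deletes, or substitutions turning a into b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete from a
                            curr[j - 1] + 1,      # insert into a
                            prev[j - 1] + cost))  # substitute (or match)
        prev = curr
    return prev[-1]


def char_error_rate(reference, ocr_output):
    """Edit distance divided by the reference length (0.0 means a perfect match)."""
    if not reference:
        raise ValueError("reference text must not be empty")
    return levenshtein(reference, ocr_output) / len(reference)


if __name__ == "__main__":
    # Presumed originals paired with what the OCR produced above.
    for good, ocr in [("smirk", "smlrk"), ("kimono", "kmmono"), ("floor", "lloor")]:
        print(good, ocr, char_error_rate(good, ocr))
```

Lower is better, and you only need a page or two typed out by hand to compare engines or scan resolutions.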

4/06/2008 2:36 AM  

Anonymous Gary said...

Way back when I was a church secretary I used OCR a lot... a lot, a lot. Anyway, I found the HP scanning-to-text thing that came with our HP OfficeJet to be really good, especially if you jack up the resolution and are working with an original that was printed in good quality. Downside is you have to have an HP OfficeJet. Good side is it's not expensive.. =\

4/07/2008 3:07 PM  

