If this really works...

Story: Journalist creates open source solution to extract data from PDFs
Total Replies: 7
Author Content
caitlyn

Feb 07, 2013
11:57 AM EDT
If this really works, it's a huge big deal. PDFedit can do it in theory, but in practice it doesn't do a good job at all. I know this is one tool I could really use.
Steven_Rosenber

Feb 07, 2013
12:19 PM EDT
As a journalist myself, I find even a PDF based on electronic records a pain to deal with. It's better to try to persuade the agency to submit a spreadsheet, because that's usually the source of the material in the first place.
dinotrac

Feb 07, 2013
12:27 PM EDT
Ditto on the REAL BIG DEAL.

PDFs have become a sort of lingua franca -- in spite of being a proprietary format -- and a free facility for processing them would make a number of things possible.
caitlyn

Feb 07, 2013
12:40 PM EDT
I don't know how many times it would have been great to extract some piece of documentation or some information from a PDF, and I really just couldn't do it with the tools available to me. dino, you're right, PDFs are ubiquitous nowadays.
Bob_Robertson

Feb 07, 2013
1:39 PM EDT
Gee, I thought my being happy when I found a PDF from which I could copy text was kind of silly.

Glad I'm not the only one.

PDF itself fills a need, for printing and presentation. But as a transmitter of data, no.
jdixon

Feb 07, 2013
1:50 PM EDT
Most of the PDFs I get are scanned images, so there's no text to recover. :( You have to OCR them. From the article, that's what they're doing here too. Fortunately, the FreeOCR program for Windows (it uses the Tesseract engine) seems to work halfway well, though it's very poor at maintaining formatting. I expect Tesseract would work equally well under Linux, but I don't know if there's a good GUI front end for it or not.
mrbobeau

Feb 07, 2013
8:18 PM EDT
There are some GUIs for Tesseract. The one I like is gImageReader.
jdixon

Feb 08, 2013
10:00 AM EDT
Thanks for the feedback, mrbobeau.
