There's a big difference between searchable and machine readable.

For example, one set of PDFs I've worked with pretty extensively is the House of Representatives' statement of disbursements <http://disbursements.house.gov/> (how the House spends its money). The House releases these PDFs in a fully searchable form - they're not images, they contain all the text displayed in the PDF. But what they're releasing is really a database - it's expenses! - and if you want to do any sort of basic analysis <http://sunlightfoundation.com/blog/2012/02/06/turnover-in-the-house/> (like summing numbers together), you need more than a searchable PDF.

A couple of coworkers and I have figured out a Python script <https://github.com/sunlightlabs/disbursements/blob/master/process_new_release/1_parse_disbursements/parse-disbursements.py> that does a pretty good job at generating a CSV (spreadsheet) from the PDF, and so my organization, the Sunlight Foundation, has published these CSVs <http://sunlightfoundation.com/projects/expenditures/> as a public service for a few years. That Python script may look small, but it's quite specific and brittle <https://github.com/sunlightlabs/disbursements/pull/1>, it's the result of many hours of collective work, and I cross my fingers every quarter that the House doesn't change a single thing.

We're very lucky that the original PDF is neatly tabular, with one entry per row. The Senate, on the other hand, started publishing <http://www.senate.gov/legislative/common/generic/report_secsen.htm> similarly searchable PDFs at the end of 2011 -- but simply because individual expenditures span more than one row, writing a parser is much harder <http://sunlightfoundation.com/blog/2011/11/30/senate-finally-publishes-its-spending-online-but-could-do-much-better/>, and that has so far dissuaded us from trying.

PDFs are often quite suitable for documents, and most these days are searchable, but they are not machine readable.
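To make the distinction concrete, here is a minimal Python sketch of the step a searchable PDF still leaves you to do yourself: turning extracted text lines into rows you can sum. This is not the Sunlight script linked above. The column layout (date, payee, purpose, amount, separated by two or more spaces), the sample lines, and the names ROW, parse_rows and to_csv are all made up for illustration.

```python
import csv
import io
import re
from decimal import Decimal

# Hypothetical layout, NOT the real House format: date, payee, purpose and
# amount on one line, with columns separated by two or more spaces.
ROW = re.compile(r"^(\d{2}-\d{2})\s{2,}(.+?)\s{2,}(.+?)\s{2,}([\d,]+\.\d{2})$")

FIELDS = ["date", "payee", "purpose", "amount"]


def parse_rows(lines):
    """Keep only lines that match the assumed one-entry-per-row layout."""
    rows = []
    for line in lines:
        match = ROW.match(line.strip())
        if match is None:
            # Headers, page numbers, and wrapped continuation lines land here.
            continue
        date, payee, purpose, amount = match.groups()
        rows.append({
            "date": date,
            "payee": payee,
            "purpose": purpose,
            "amount": Decimal(amount.replace(",", "")),
        })
    return rows


def to_csv(rows):
    """Serialize parsed rows as CSV text."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=FIELDS)
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()


# Made-up sample of what text pulled out of a searchable PDF might look like.
SAMPLE = [
    "STATEMENT OF DISBURSEMENTS   (made-up sample)",
    "01-03  ACME OFFICE SUPPLY  SUPPLIES AND MATERIALS  1,204.50",
    "01-09  J DOE  TRAVEL  312.00",
    "Page 12",
]

rows = parse_rows(SAMPLE)
total = sum(row["amount"] for row in rows)
print(to_csv(rows))
print("total:", total)
```

Note how much depends on the one-entry-per-row assumption: if an expenditure wrapped onto a second line, this sketch would silently drop the continuation, which is the kind of problem that makes a multi-row layout so much harder to parse.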
On Sat, Jan 19, 2013 at 9:25 AM, Marianne van den Boomen <M.V.T.vandenBoomen@uu.nl> wrote:
On 15/01/2013 16:28, Burcu Bakioglu wrote:
Also as danah was saying, PDF is not a machine readable format, so search engines can only index by the title, not the content of the article. So those who are searching it won't readily find it.
While this was true in the 90s, most PDFs are now not just images but parsed by OCR and therefore as full text indexable by search engines. Try this on Google with any obscure pdf you have uploaded - it will pop up.
kind regards
Marianne van den Boomen
Media and Culture Studies | University Utrecht
Office: Kromme Nieuwegracht 20 (room T2.13A)
Mail: Muntstraat 2a | 3512 EV UTRECHT
Phone: +31 (0)30 253 9607
M.V.T.vandenBoomen@uu.nl | www.hum.uu.nl
www.newmediastudies.nl | www.vandenboomen.org
_______________________________________________
The Air-L@listserv.aoir.org mailing list is provided by the Association of Internet Researchers http://aoir.org
Subscribe, change options or unsubscribe at: http://listserv.aoir.org/listinfo.cgi/air-l-aoir.org
Join the Association of Internet Researchers: http://www.aoir.org/