Dear Colleagues,
In what way is PDF proprietary? It was officially released as an open standard on July 1, 2008. I applaud those who are leading the #pdftribute movement.
Cordially,
Jason G. Karlin, Ph.D. Associate Professor University of Tokyo Interfaculty Initiative in Information Studies 7-3-1 Hongo, Bunkyo-ku Tokyo 113-0033 JAPAN
URL: http://individuals.iii.u-tokyo.ac.jp/~karlin/ Email: ukarlin@mail.ecc.u-tokyo.ac.jp
Dear Jason,

I don't think anyone is dismissing the #pdftribute initiative. My concern was that one can't retrieve relevant information unless the site is indexed properly and the files are categorized better. Dumping information does not necessarily make it usable or open access. Unless you know the Twitter handles, you have no way of recognizing which PDF could be of use to you. Also, as danah was saying, PDF is not a machine-readable format, so search engines can only index by the title, not the content of the article, and those who are searching for it won't readily find it. And the format was originally proprietary, but was later released as an open standard.

#pdftribute is a nice idea for a short-term protest. I'd say it made noise and brought to the foreground issues that the academic community has been discussing for a while. But if we are hoping to initiate long-term change, we need something more lasting; hence the discussion on publishing in open access journals.

By no means was my initial e-mail meant to dismiss the initiative. It was meant to open up a discussion about what this may mean and about its shortcomings, so that we can figure out what more we could do to make the best of this unfortunate incident.

All the best.
BsB

On Tue, Jan 15, 2013 at 9:04 AM, Jason G. Karlin <ukarlin@mail.ecc.u-tokyo.ac.jp> wrote:
Dear Colleagues,
In what way is PDF proprietary? It was officially released as an open standard on July 1, 2008. I applaud those who are leading the #pdftribute movement.
_______________________________________________ The Air-L@listserv.aoir.org mailing list is provided by the Association of Internet Researchers http://aoir.org Subscribe, change options or unsubscribe at: http://listserv.aoir.org/listinfo.cgi/air-l-aoir.org
Join the Association of Internet Researchers: http://www.aoir.org/
-- Thanks, Burcu S. Bakioglu, Ph.D. Postdoctoral Fellow in New Media Lawrence University http://www.palefirer.com http://palefirer.com/blog/
On 15/01/2013 16:28, Burcu Bakioglu wrote:
Also as danah was saying, PDF is not a machine readable format, so search engines can only index by the title, not the content of the article. So those who are searching it won't readily find it.
While this was true in the 90s, most PDFs are now not just images but are parsed by OCR and are therefore indexable as full text by search engines. Try this on Google with any obscure PDF you have uploaded: it will pop up.

kind regards

Marianne van den Boomen
Media and Culture Studies | University Utrecht Office: Kromme Nieuwegracht 20 (room T2.13A) Mail: Muntstraat 2a | 3512 EV UTRECHT Phone: +31 (0)30 253 9607 M.V.T.vandenBoomen@uu.nl | www.hum.uu.nl www.newmediastudies.nl | www.vandenboomen.org
There's a big difference between searchable and machine readable.

For example, one set of PDFs I've worked with pretty extensively is the House of Representatives' statement of disbursements <http://disbursements.house.gov/> (how the House spends its money). The House releases these PDFs in a fully searchable form - they're not images, they contain all the text displayed in the PDF. But what they're releasing is really a database - it's expenses! - and if you want to do any sort of basic analysis <http://sunlightfoundation.com/blog/2012/02/06/turnover-in-the-house/> (like summing numbers together), you need more than a searchable PDF.

A couple of coworkers and I have figured out a Python script <https://github.com/sunlightlabs/disbursements/blob/master/process_new_release/1_parse_disbursements/parse-disbursements.py> that does a pretty good job at generating a CSV (spreadsheet) from the PDF, and so my organization, the Sunlight Foundation, has published these CSVs <http://sunlightfoundation.com/projects/expenditures/> as a public service for a few years. That Python script may look small, but it's quite specific and brittle <https://github.com/sunlightlabs/disbursements/pull/1>, is the result of many hours of collective work, and I cross my fingers every quarter that the House does not change a single thing.

We're very lucky that the original PDF is neatly tabular, with one entry per row. The Senate, on the other hand, started publishing <http://www.senate.gov/legislative/common/generic/report_secsen.htm> similarly searchable PDFs at the end of 2011 -- but simply because individual expenditures span more than one row, writing a parser is much harder <http://sunlightfoundation.com/blog/2011/11/30/senate-finally-publishes-its-spending-online-but-could-do-much-better/>, and that has so far dissuaded us from trying.

PDFs are often quite suitable for documents, and most these days are searchable, but they are not machine readable.
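To make the searchable versus machine-readable distinction concrete, here is a minimal Python sketch. It is not the Sunlight Foundation script linked above. It assumes the text has already been pulled out of a PDF, and the sample lines, the row layout (date, description, amount), and the names EXTRACTED_TEXT, WRAPPED, parse_rows and to_csv are all invented for illustration.

```python
import csv
import io
import re
from decimal import Decimal

# Hypothetical text as it might come out of a searchable PDF.
# The layout (date, description, amount on one line) is an assumption.
EXTRACTED_TEXT = """\
OFFICE OF THE HON. JANE DOE
01-03 OFFICE SUPPLIES ACME PAPER CO 1,234.56
01-17 TRAVEL CITY CAB 45.00
Page 12
02-02 PRINTING QUICK PRINT LLC 310.40
"""

# Hypothetical entry that wraps onto a second line.
WRAPPED = """\
02-10 CONSULTING SERVICES FOR
DISTRICT OFFICE 2,000.00
"""

# One entry per line: MM-DD, free-text description, amount with two decimals.
ROW = re.compile(r"^(\d{2}-\d{2})\s+(.+?)\s+(-?[\d,]+\.\d{2})$")


def parse_rows(text):
    """Yield (date, description, amount) for every line matching ROW."""
    for line in text.splitlines():
        match = ROW.match(line.strip())
        if match:  # headers, page numbers and wrapped lines are skipped
            date, description, amount = match.groups()
            yield date, description, Decimal(amount.replace(",", ""))


def to_csv(rows):
    """Serialize parsed rows as CSV text with a header line."""
    buffer = io.StringIO()
    writer = csv.writer(buffer)
    writer.writerow(["date", "description", "amount"])
    for date, description, amount in rows:
        writer.writerow([date, description, str(amount)])
    return buffer.getvalue()


rows = list(parse_rows(EXTRACTED_TEXT))
total = sum((amount for _, _, amount in rows), Decimal("0"))
print(to_csv(rows))
print("total:", total)
print("rows recovered from wrapped entry:", len(list(parse_rows(WRAPPED))))
```

Every word in the sample text is searchable, yet the sum only becomes possible once a layout-specific pattern has turned lines into rows. The wrapped entry matches nothing and is silently dropped, which is the kind of failure that makes multi-row layouts so much harder to parse.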
On Sat, Jan 19, 2013 at 9:25 AM, Marianne van den Boomen < M.V.T.vandenBoomen@uu.nl> wrote:
While this was true in the 90s, most PDFs are now not just images but parsed by OCR and therefore as full text indexable by search engines. Try this on Google with any obscure pdf you have uploaded - it will pop up.
Dear all, On 01/15/2013 04:04 PM, Jason G. Karlin wrote:
In what way is PDF proprietary? It was officially released as an open standard on July 1, 2008.
With no presumption to be exhaustive, to criticize the #pdftribute initiative, or to speak on Swartz's behalf, the 'PDF issue' is subtler and more complex than it seems.

Yes, the PDF v1.7 specifications (along with prior versions) were released by Adobe and approved by ISO as a standard in 2008 [1]. However, ISO formally certifies "standards" only, not "open standards". Indeed, as of today, no widely and legally recognized definition of an open standard exists [2]. Usually (and simply speaking) an open standard is a standard whose specifications can be implemented royalty-free. This requires either that no patents cover the specifications or that such patents are licensed on a royalty-free basis.

The PDF specifications are covered by several patents. Most of these (not all) are held by Adobe. PDF v1.7 is considered an open standard because, alongside the ISO standardization process, Adobe released a royalty-free license covering all the patents it held on the PDF v1.7 specifications [3].

Since 2008 Adobe has continued developing its own specifications for later PDF versions, but only a few of these PDF 'updates' were either submitted to ISO or accepted by ISO as standards. This resulted in two different sets of PDF specifications that are out of sync with each other (i.e. the ISO PDF specs and the Adobe PDF specs). Furthermore, Adobe never renewed or extended the royalty-free license for its patents to cover the newer PDF specifications as well (at least not until recently; I am not sure whether things have changed).

Therefore, yes, *PDF v1.7* can be considered an open standard, but strictly speaking later versions are either "not open" (ISO ones) or "not standards" (Adobe ones). Basically, this is why only Adobe can afford to implement (efficiently) a PDF reader that provides all the shiny and advanced features such as editing, highlighting, revision comparison, etc. (and why you have to pay for its Pro version).
This may look like nitpicking to most of us, but I suspect (my guesswork here) these points were not marginal to an activist hacker. Sorry for the lengthy, technical, and slightly off-topic reply; I just wanted to provide some clarifications.

Best regards,
Giacomo

[1] http://pdfreaders.org/os.en.html (Self-disclosure: yes, I was loosely involved with this initiative, which is why I'm acquainted with the issue and why I referenced this resource website.)
[2] http://fsfe.org/activities/os/def.html
[3] Since this royalty-free license only applies to Adobe's patents on PDF, technically there is a sword of Damocles hanging over PDF v1.7: if the other patents' owners decide to enforce their patents on PDF v1.7, it would become just 'another standard', no longer an open one.

--
PhD Candidate
Doctoral School in Sociology and Social Research - Information Systems and Organizations
University of Trento, Italy
participants (5)
- Burcu Bakioglu
- Eric Mill
- Giacomo Poderi
- Jason G. Karlin
- Marianne van den Boomen