Seeking solutions to small text search mystery
Colleagues, a graduate student and I could use your help solving a mystery related to computerized text searching/coding of online documents. We are examining documents (all saved as .pdf files) using the advanced search tool in Adobe Reader. While that tool generally works fine, it does not seem to recognize certain fairly standard statistical/mathematical symbols (such as the p used in statistical significance testing and symbols such as <, >, or =) in numerous documents. This is true even when we directly cut and paste the symbol in question into the search tool (surprisingly, it still does not recognize that symbol in the document). The problem occurs only with certain sources (such as all articles from certain journals), even when the rest of the article is fully searchable. This is happening with very recent documents published after 2000 (we are not searching older ones). We suspect these symbols might be part of some equation editor or specially formatted text, but we don't know.
Has anyone else encountered and solved a similar problem? Do you have any other suggestions on a search tool for .pdf documents that might be superior? We would also welcome any suggestions on other ways to save these documents and search them that would address this (I think we could do optical character recognition, but fear that may create other accuracy problems). Thanks for any suggestions/thoughts you have related to helping us solve this frustrating little mystery.
Craig
Craig R. Scott, Ph.D.,
Associate Professor, Department of Communication &
Director, Ph.D. Program
School of Communication & Information
Rutgers University
4 Huntington Street, New Brunswick, NJ 08901
Voice: 732-932-7500 x8142; Fax: 732-932-3756
Office in 201 DeWitt (185 College Avenue)
Web: http://comminfo.rutgers.edu/directory/crscott/index.html <https://www.scils.rutgers.edu/directory/crscott/index.html>
LinkedIn: http://www.linkedin.com/pub/11/b83/241
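One way to check the suspicion about specially formatted symbols is to copy a passage from an affected PDF and look at the Unicode code points that actually arrive on the clipboard. The Python sketch below is only an illustration and not something proposed in this thread: the function name describe_characters and the sample string are made up. If the copied "p" or "<" shows up as anything other than LATIN SMALL LETTER P or LESS-THAN SIGN (for example a mathematical italic letter or an unnamed private-use character), a search for the plain keyboard character would probably not match it.

```python
import unicodedata


def describe_characters(snippet):
    """Return (character, code point, Unicode name) for each character.

    describe_characters is a hypothetical helper written for this example.
    """
    rows = []
    for ch in snippet:
        # Characters without an official name (private-use, control) get a
        # placeholder instead of raising ValueError.
        name = unicodedata.name(ch, "UNNAMED (possibly private-use or control)")
        rows.append((ch, f"U+{ord(ch):04X}", name))
    return rows


# Replace this sample with text pasted from one of the affected PDFs.
sample = "p < .05"
for ch, code_point, name in describe_characters(sample):
    print(code_point, name)
```

Running the same check on text typed at the keyboard gives a baseline to compare against.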
The free PDF-XChange viewer might be your answer. I use this application for grading as it has extensive markup tools. It also has superior find and search capabilities including the ability to create concordances.
Charlie
Charles V. Balch PhD
Business Faculty
Northern Arizona University - Yuma
-----Original Message-----
From: air-l-bounces@listserv.aoir.org [mailto:air-l-bounces@listserv.aoir.org] On Behalf Of Craig Scott
Sent: Monday, January 10, 2011 1:25 PM
To: air-l@listserv.aoir.org
Subject: [Air-L] Seeking solutions to small text search mystery
_______________________________________________
The Air-L@listserv.aoir.org mailing list is provided by the Association of Internet Researchers http://aoir.org
Subscribe, change options or unsubscribe at: http://listserv.aoir.org/listinfo.cgi/air-l-aoir.org
Join the Association of Internet Researchers: http://www.aoir.org/
Another possible solution (depending on what operating system you are on) is pdftotext (http://linux.die.net/man/1/pdftotext) in combination with grep (on Linux or OSX). This makes more sense if you're planning to script the process -- when working manually the mentioned tools are probably more comfortable to use.
Best,
Cornelius
--
Dr. Cornelius Puschmann, M.A.
Department for English Language and Linguistics
Heinrich-Heine-Universität Düsseldorf
Building 23.11, Level 1, Room 21
Universitätsstrasse 1
40225 Düsseldorf
Germany
+49 211 81 15927 (office)
Nachwuchsforschergruppe "Wissenschaft und Internet" / Junior Researchers Group "Science and the Internet"
http://nfgwin.uni-duesseldorf.de
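The pdftotext-plus-grep approach could be scripted roughly as follows. This is a sketch in Python rather than a shell one-liner, and it rests on assumptions: pdftotext is installed and on the PATH, the PDFs sit in one folder, and the regular expression P_VALUE is a hypothetical pattern for strings such as "p < .05". The helper names are invented for illustration, and the regular-expression step plays the role of grep.

```python
import re
import subprocess
from pathlib import Path

# Hypothetical pattern: "p", a comparison sign, then a decimal like ".05" or "0.001".
P_VALUE = re.compile(r"\bp\s*[<>=]\s*0?\.\d+", re.IGNORECASE)


def pdf_to_text(pdf_path):
    """Extract text with the pdftotext command-line tool.

    The trailing "-" asks pdftotext to write to standard output
    instead of creating a .txt file next to the PDF.
    """
    result = subprocess.run(
        ["pdftotext", "-layout", "-enc", "UTF-8", str(pdf_path), "-"],
        capture_output=True,
        check=True,
    )
    return result.stdout.decode("utf-8", errors="replace")


def find_matches(text, pattern=P_VALUE):
    """Return (line_number, line) pairs for lines that match the pattern."""
    return [
        (number, line.strip())
        for number, line in enumerate(text.splitlines(), start=1)
        if pattern.search(line)
    ]


def search_folder(folder):
    """Print grep-style hits (file:line: text) for every PDF in a folder."""
    for pdf in sorted(Path(folder).glob("*.pdf")):
        for number, line in find_matches(pdf_to_text(pdf)):
            print(f"{pdf.name}:{number}: {line}")
```

Calling search_folder("articles"), where "articles" is a placeholder folder name, would print one line per hit. If a journal's PDFs encode the symbols unusually, the extracted text may still lack a plain "<" or "=", so it is worth spot-checking a few files by hand before relying on the counts.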
participants (3)
- Charlie Balch
- Cornelius Puschmann
- Craig Scott