Re: [Air-l] Software to capture content
Eulàlia,

*I was wondering if any of you know about software to
*capture website content, specifically to capture online
*news outlets (CNN, The Washington Post, The New York
*Times) as well as blog-types news.
*We are about to engage in a research involving content
*coding these sites and were wondering if anybody has
*information on costs (any free out there?), ease of use,
*effectiveness in capturing content, time needed to capture
*content at a point in time, time needed to capture 24-hour
*content, and any other pertinent information that you may
*want to share.

Have a look at this book, available online, for a recent discussion of website archiving:
http://cfi.imv.au.dk/pub/boeger/bruegger_archiving.pdf

For concrete applications, I could point you (if you can read Catalan, as I imagine) to a recent final report for a project based on a similar methodology (but with different goals, of course) that makes some use of Atlas TI for the content analysis:
http://www.uoc.edu/in3/psinet/docs/publicaciones/tecnico01.pdf

Obviously, it's necessary to clarify the specific goals before considering which software solution to use. Collecting some information, collecting entire websites, and just tracking changes in a few specific web pages are not the same task.

In any case, have a look at software catalogs such as Snapfiles (http://www.snapfiles.com/) or Tucows (http://www.tucows.com/). They let you narrow the listings by software theme and restrict a search to the freeware titles only.

I hope it helps!

*Thanks in advance to ya all!
*Eulàlia Puig Abril
*_______________________________________________
*The air-l@listserv.aoir.org mailing list
*is provided by the Association of Internet Researchers
*http://aoir.org
*Subscribe, change options or unsubscribe at:
*http://listserv.aoir.org/listinfo.cgi/air-l-aoir.org
*
*Join the Association of Internet Researchers:
*http://www.aoir.org/

_________________________________________________________
Julio Meneses (blog: http://www.zanadoria.com)
Dept. Psychology and Educational Sciences. The Open University of Catalonia
http://www.uoc.edu/
Internet Interdisciplinary Institute (IN3 - UOC) Research Staff.
Project Internet Catalonia (PIC) - Schools on the Network Society
http://www.uoc.edu/in3/pic/
On Mar 1, 2006, at 1:27 PM, Julio Meneses Naranjo wrote:
Eulàlia,
*I was wondering if any of you know about software to
*capture website content, specifically to capture online
*news outlets (CNN, The Washington Post, The New York
*Times) as well as blog-types news.
*We are about to engage in a research involving content
*coding these sites and were wondering if anybody has
*information on costs (any free out there?),
I've been spidering websites for data collection (on open source software development) and then content analysis for a while. We use Perl with WWW::Mechanize to gather pages into a MySQL database, then parse the content we actually need into other database tables and output it in a text format suitable for coding in Atlas TI, HyperRESEARCH, etc. It's far from point-and-click, but it has worked quite well for our research needs and is of course free (just add labour!).

It's also open source: http://ossmole.sourceforge.net/ and my colleagues there have Java code that does similar work. Check the CVS tree; it would require a lot of customization but demonstrates the process.

If the sites have an RSS or similar feed, that would be a lot easier to collect and parse for the text content you need than spidering and saving raw HTML pages.
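As a rough illustration of the store-raw, parse, then export steps described above, here is a minimal sketch in Python rather than the Perl/WWW::Mechanize/MySQL setup we actually use. It swaps in the standard-library sqlite3 for MySQL and reads an RSS feed instead of spidered HTML. The table names (raw_pages, items), the function names, and the sample feed are all hypothetical, made up for this example.

```python
import sqlite3
import xml.etree.ElementTree as ET

# Hypothetical sample feed; a real run would download this text first.
SAMPLE_FEED = """<rss version="2.0"><channel><title>Example News</title>
<item><title>First story</title><link>http://example.com/1</link>
<description>Text of the first story.</description></item>
<item><title>Second story</title><link>http://example.com/2</link>
<description>Text of the second story.</description></item>
</channel></rss>"""


def store_raw(conn, url, body):
    """Step 1: keep the raw download so it can be re-parsed later."""
    conn.execute("CREATE TABLE IF NOT EXISTS raw_pages (url TEXT, body TEXT)")
    conn.execute("INSERT INTO raw_pages VALUES (?, ?)", (url, body))


def parse_items(conn):
    """Step 2: pull only the fields needed for coding into their own table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS items (feed TEXT, title TEXT, link TEXT, body TEXT)"
    )
    for url, body in conn.execute("SELECT url, body FROM raw_pages").fetchall():
        for item in ET.fromstring(body).iter("item"):
            conn.execute(
                "INSERT INTO items VALUES (?, ?, ?, ?)",
                (url, item.findtext("title", ""), item.findtext("link", ""),
                 item.findtext("description", "")),
            )


def export_text(conn):
    """Step 3: write one plain-text block per item for a coding tool."""
    rows = conn.execute("SELECT title, link, body FROM items ORDER BY rowid")
    return "\n".join(f"TITLE: {t}\nLINK: {l}\n\n{b}\n" for t, l, b in rows)


conn = sqlite3.connect(":memory:")
store_raw(conn, "http://example.com/feed.xml", SAMPLE_FEED)
parse_items(conn)
print(export_text(conn))
```

Keeping the raw text in its own table means the parsing step can be rerun when the coding scheme changes, without fetching anything again.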
ease of use,
WWW::Mechanize is a programming module, but it has a remarkably easy-to-use API capable of simulating a browser: clicking links, filling in forms, storing cookies, etc.
http://search.cpan.org/~petdance/WWW-Mechanize-1.18/lib/WWW/Mechanize.pm

This desktop GUI software also looks good, although I haven't tried it (and it costs $99):
http://www.metafy.com/index.html
"Visually construct Spiders and Scrapers without scripting"
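To give a feel for the "clicking links" part, here is a small Python sketch of link discovery using only the standard library. It is not WWW::Mechanize and does not handle forms or cookies; the class and function names (LinkCollector, extract_links) and the sample page are hypothetical.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collect the absolute URL of every <a href="..."> on a page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                # Resolve relative links against the page's own URL.
                self.links.append(urljoin(self.base_url, href))


def extract_links(base_url, html):
    collector = LinkCollector(base_url)
    collector.feed(html)
    return collector.links


# Hypothetical page content standing in for a downloaded front page.
page = '<a href="story1.html">One</a> <a href="/about">About</a> <a name="top">x</a>'
for link in extract_links("http://example.com/news/", page):
    print(link)
```

A spider would fetch each returned URL in turn and repeat the extraction, usually with a list of already-visited URLs to avoid loops.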
*effectiveness in capturing content, time needed to capture
*content at a point in time, time needed to capture 24-hour
*content, and any other pertinent information that you may
*want to share.
We wrote a paper about the perils and pitfalls of such web-mining that might be of use (http://floss.syr.edu/publications/howison04msr.pdf), and we have some materials from a workshop I gave on how to do it (available on request).

Spiders can capture content very quickly, but beware of two things. Check the robots.txt file for areas the site owners don't want spidered and, especially if it is a small server, build a sleep cycle into your grabs to spare their servers (WWW::Mechanize::Sleepy does this automatically).

Also, it never hurts to write first asking for access to the backend database before going to the substantial effort of spidering, right?

Cheers,
James
http://james.howison.name
http://floss.syr.edu
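The two cautions above (respect robots.txt, pause between requests) can be sketched in Python with the standard-library urllib.robotparser. This is an illustration under assumptions, not what WWW::Mechanize::Sleepy does internally: the function names, the "research-spider" user agent, the 5-second default delay, and the sample robots.txt are all hypothetical.

```python
import time
import urllib.robotparser

# Hypothetical robots.txt content; a real spider would download the site's own.
SAMPLE_ROBOTS = """User-agent: *
Disallow: /private/
"""


def make_robots(robots_txt):
    """Build a parser from robots.txt text that has already been fetched."""
    parser = urllib.robotparser.RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def polite_crawl(urls, robots, fetch, delay_seconds=5.0,
                 user_agent="research-spider", sleep=time.sleep):
    """Fetch only the allowed URLs, pausing between consecutive requests."""
    results = {}
    first = True
    for url in urls:
        if not robots.can_fetch(user_agent, url):
            continue  # the site asked spiders to stay out of this area
        if not first:
            sleep(delay_seconds)  # spare the server between grabs
        first = False
        results[url] = fetch(url)
    return results


# Demo with a stand-in fetch function so nothing touches the network.
robots = make_robots(SAMPLE_ROBOTS)
pages = polite_crawl(
    ["http://example.com/news/1.html", "http://example.com/private/x.html"],
    robots,
    fetch=lambda url: "<html>stub</html>",
    delay_seconds=0.1,
)
print(sorted(pages))
```

Passing fetch and sleep in as arguments keeps the politeness logic separate from the download code, so the same loop can wrap whatever fetching library is in use.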
participants (2)
- James Howison
- Julio Meneses Naranjo