I admit I haven't used much of the software mentioned above, although I have substantial experience with spidering and scripting. 200 protest group websites is a significant amount to archive, even with wget. If you are serious about archiving such a massive amount of content, I would recommend getting acquainted with some heavy-duty tools, such as Perl or Python. These are generally easier to access on a Mac or GNU/Linux box than on a Windows one, but of course you can code these things under Windows. It's also the case that such tools can work with wget to automate scripting on a higher level. If you are using a Mac, Automator (which is free if you have the latest OS*) can actually get a lot of this done for you if you set it up correctly.

My best recommendation right off is to purchase the book 'Spidering Hacks', published by O'Reilly. Most of the scripts are written in Perl, but some are in Python (which is generally understood to be more readable and beginner-friendly).

Be careful when you scrape. Check the robots.txt file at the domain level, for example http://www.google.com/robots.txt. If you aren't allowed to spider a site, then perhaps you need some sort of ethics approval to capture it for academic purposes [if not, I feel you should require this approval, and to open a can of worms - I think the AoIR guidelines should reflect this]. If a site doesn't want you to scrape it (as indicated in the robots.txt), you might consider actually contacting these people and maybe even asking to host a mirror (which would be ideal, and respectful). In return for mirroring the site, of course, you get your data.

Take Care,
BERNiE

*P.S. An addendum about the Mac's latest OS - Tiger does NOT run SPSS, so if you depend on it (as I currently do :( ), be ready to switch to STATA or R for quant work, as SPSS seems to be slack on their Mac development cycles.
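To make the robots.txt check and the wget automation above a little more concrete, here is a minimal Python sketch, not a finished tool. It assumes you have already saved each site's robots.txt as a list of lines; the user agent name `research-archiver`, the `example.org` hosts and the `archive` output directory are all made up for illustration. It only builds the wget commands and lists the sites you would need to contact first; it does not download anything.

```python
from urllib.parse import urlsplit
from urllib.robotparser import RobotFileParser

USER_AGENT = "research-archiver"  # hypothetical name for your crawler


def parser_from_lines(robots_lines):
    """Build a robots.txt parser from lines that were already downloaded."""
    parser = RobotFileParser()
    parser.parse(robots_lines)
    return parser


def wget_command(site_url, out_dir="archive"):
    """Return a wget command (as an argument list) that mirrors one site."""
    host = urlsplit(site_url).netloc
    return [
        "wget",
        "--mirror",            # recursive download with timestamping
        "--page-requisites",   # also fetch images and stylesheets
        "--convert-links",     # rewrite links so the copy works offline
        "--wait=2",            # pause between requests to be polite
        f"--user-agent={USER_AGENT}",
        f"--directory-prefix={out_dir}/{host}",
        site_url,
    ]


def plan(sites):
    """Split sites into wget commands to run and sites to ask permission from.

    `sites` maps each site URL to the lines of its robots.txt file.
    """
    commands, needs_permission = [], []
    for url, robots_lines in sites.items():
        if parser_from_lines(robots_lines).can_fetch(USER_AGENT, url):
            commands.append(wget_command(url))
        else:
            needs_permission.append(url)
    return commands, needs_permission


# Two invented sites: one allows spidering, one forbids it entirely.
SAMPLE_SITES = {
    "http://open.example.org/": ["User-agent: *", "Disallow: /admin/"],
    "http://closed.example.org/": ["User-agent: *", "Disallow: /"],
}

if __name__ == "__main__":
    commands, needs_permission = plan(SAMPLE_SITES)
    for cmd in commands:
        print(" ".join(cmd))
    print("Ask permission first:", needs_permission)
```

For a live site you could instead call `set_url()` with the site's robots.txt address and then `read()` on the `RobotFileParser`, and hand each command list to `subprocess.run()` once you are happy with the plan.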
Bernie Hogan
PhD Student
Department of Sociology
NetLab, Knowledge Media Design Institute
University of Toronto

I received a message from s.vicari@reading.ac.uk at approximately 6/5/05 8:59 AM. Above is my reply.
Hi,
I am a PhD student at the University of Reading, UK. I am running a study on 200 protest group websites. Would you suggest any good software to store whole websites offline?
Thanks a lot, at the moment I am a bit lost in links and buttons...
ste
Stefania Vicari
PhD student in Sociology
University of Reading
PO Box 218, Reading, RG6 6AA, United Kingdom.
_______________________________________________
The Air-l-aoir.org@listserv.aoir.org mailing list is provided by the Association of Internet Researchers http://aoir.org
Subscribe, change options or unsubscribe at: http://listserv.aoir.org/listinfo.cgi/air-l-aoir.org