Bernie Hogan wrote:
I admit I haven't used much of the software mentioned above, although I have substantial experience with spidering and scripting. 200 protest group websites is a significant amount to archive even with wget.
It really depends on the web presence of these sites. Back in 1998, I captured 500 (!) New Age websites (with, err, wget ;-) ) and, zipped, they all fit on 50 floppy disks (that's 70MB!!). Granted, things have changed since then, but it might still be possible to do a reasonable job, given how poor many movement sites are (I assume NOW, attac or other professional SMOs are not part of the sample ;-) ). [ack. snip]
Be careful when you scrape. Check the robots.txt file at the domain level, for example http://www.google.com/robots.txt. If you aren't allowed to spider a site, then perhaps you need some sort of ethics approval to capture it for academic purposes [if not, I feel you should require this approval, and to open a can of worms - I think the AoIR guidelines should reflect this].
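The robots.txt check described above can be scripted rather than done by hand. What follows is only a minimal sketch in Python using the standard library's urllib.robotparser; the helper name allowed_by_robots, the sample rules and the example.org URLs are made up for illustration and do not come from any real site. It parses the rules from a string so it runs offline.

```python
from urllib.robotparser import RobotFileParser


def allowed_by_robots(robots_txt: str, url: str, user_agent: str = "*") -> bool:
    """Return True if the given robots.txt text permits user_agent to fetch url.

    Hypothetical helper for illustration: it parses rules supplied as a
    string instead of downloading them.
    """
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Made-up rules: every crawler is asked to stay out of /private/.
SAMPLE_ROBOTS = """\
User-agent: *
Disallow: /private/
"""

print(allowed_by_robots(SAMPLE_ROBOTS, "http://example.org/index.html"))
print(allowed_by_robots(SAMPLE_ROBOTS, "http://example.org/private/members.html"))
```

To check a live site instead, RobotFileParser also offers set_url() and read(), which download the file before can_fetch() is called. Note that robots.txt only states the site owner's wishes; whether a disallowed capture needs ethics approval remains the question raised above.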
wget actually has an option to ignore "robots.txt" and even an option to pose as IE or any other browser, for that matter (as do HTTrack and WebCopier ;-) ). I personally wouldn't have any problem activating that option. But that's a political decision that I feel is better made by national or supranational polities than by a voluntary association such as AoIR.
--
thomas koenig, ph.d.
http://www.lboro.ac.uk/research/mmethods/staff/thomas/