Re: [Air-l] SW to store pages - scraping and robots

6 Jun 2005

Bernie Hogan wrote:
...
I admit I haven't used much of the software mentioned above, although I have
substantial experience with spidering and scripting. 200 protest group
websites is a significant amount to archive even with wget.
It really depends on the web presence of these sites. Back in 1998, I 
captured 500 (!) New Age websites (with, err, wget ;-) ) and, zipped, they 
all fit on 50 floppy disks (that's about 70 MB!!). Granted, things have 
changed since then, but it might still be possible to do a reasonable 
job, given how poor many movement sites are (I assume NOW, attac or some 
other professional SMOs are not part of the sample ;-) ).

[ack. snip]
...
Be careful when you scrape. Check the robots.txt file at the domain level,
for example http://www.google.com/robots.txt. If you aren't allowed to
spider it, then perhaps you need some sort of ethics approval to capture it
for academic purposes [if not, I feel you should require this approval, and
to open a can of worms - I think the AoIR guidelines should reflect this].
wget actually has an option to ignore "robots.txt" and even an option to 
pose as IE or any other browser for that matter (as do HTTrack and 
WebCopier ;-) ). I personally wouldn't have any problem activating 
that option. But that's a political decision I feel is better made by 
national or supranational polities than by a voluntary association such as 
AoIR.
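
For anyone who would rather check what a site's robots.txt says before 
deciding either way, here is a minimal sketch using Python's standard 
urllib.robotparser module. The rules, the "research-archiver" user agent 
and the example.org URLs are made up for illustration; they are not taken 
from any site in the sample.

```python
from urllib.robotparser import RobotFileParser


def allowed_to_fetch(robots_lines, user_agent, url):
    """Return True if the given robots.txt lines permit user_agent to fetch url."""
    parser = RobotFileParser()
    # parse() takes the robots.txt content as a list of lines,
    # so no network access is needed for this check.
    parser.parse(robots_lines)
    return parser.can_fetch(user_agent, url)


# Hypothetical robots.txt content, for illustration only.
sample_rules = [
    "User-agent: *",
    "Disallow: /private/",
]

# Hypothetical user agent name and URLs.
print(allowed_to_fetch(sample_rules, "research-archiver",
                       "http://example.org/index.html"))
print(allowed_to_fetch(sample_rules, "research-archiver",
                       "http://example.org/private/members.html"))
```

Against a live site, one would instead call set_url() with the address of 
the site's robots.txt and then read() on the parser before asking 
can_fetch(). Whether to respect the answer is, of course, the question 
discussed above.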

-- 
thomas koenig, ph.d.
http://www.lboro.ac.uk/research/mmethods/staff/thomas/
