Re: [Air-l] SW to store pages - scraping and robots - Air-L

7 Jun 2005

      On 5 Jun 2005, Thomas Koenig wrote:
...
Bernie Hogan wrote:
...
Be careful when you scrape. Check the robots.txt file at the domain
level, for example http://www.google.com/robots.txt. If you aren't
allowed to spider it, then perhaps you need some sort of ethics approval
to capture it for academic purposes [if not, I feel you should require
this approval, and to open a can of worms - I think the AoIR guidelines
should reflect this].
wget actually has an option to ignore "robots.txt" and even an option to
pose as IE or any other browser for that matter (as do HTTrack and
WebCopier ;-) ). I personally wouldn't have any problem activating
that option. But that's a political decision I feel is better made by
national or supranational polities than by a voluntary association such
as AoIR.
Those who attended Gove Allen's presentation on academic use of automated
data retrieval 'bots at the Toronto conference will recall that this is
both a legal and an ethical problem.  Gove, Charles Ess, Gordon Davis, and
I have papers on both aspects forthcoming.  Given Charles' involvement in
the project, I know that there is a very keen interest in having the AoIR
guidelines at some point, probably not too far in the future, address at
least the ethical side of the problem.

Oh, and don't just check the robots.txt file -- it is probably better to
check any written TOS pages as well.  The machine-readable and the
human-readable prohibitions aren't always congruent.
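
For anyone who wants to automate the machine-readable half of that check,
here is a minimal sketch using Python's standard urllib.robotparser
module. The helper name may_fetch, the user-agent string "research-bot",
the example.org URLs, and the sample rules are all assumptions made up
for illustration; none of them come from this thread. It only tells you
what robots.txt says, not what the TOS or the law allows.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, used so the example runs offline.
SAMPLE_ROBOTS = """\
User-agent: *
Disallow: /private/
"""


def may_fetch(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt text permits user_agent to fetch url."""
    parser = RobotFileParser()
    # parse() takes an iterable of lines rather than one string.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Check one path under the disallowed prefix and one path outside it.
blocked = may_fetch(SAMPLE_ROBOTS, "research-bot",
                    "http://example.org/private/page.html")
allowed = may_fetch(SAMPLE_ROBOTS, "research-bot",
                    "http://example.org/public/page.html")
print(blocked, allowed)
```

Against a live site you would instead call set_url() with the site's
robots.txt address and then read() before can_fetch(). Any written terms
of service still have to be read by a human.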

Dan L. Burk
Visiting Professor
Cornell Law School
Myron Taylor Hall
Ithaca, NY 14853 USA

Oppenheimer, Wolff & Donnelly Professor
University of Minnesota Law School
229 19th Avenue South
Minneapolis, MN 55455 USA
***************************************
Voice: 612-626-8726
Fax: 612-625-2011
