Re: [Air-l] SW to store pages - scraping and robots
On 5 Jun 2005, Thomas Koenig wrote:
Bernie Hogan wrote:
Be careful when you scrape. Check the robots.txt file at the domain level, for example http://www.google.com/robots.txt. If you aren't allowed to spider it, then perhaps you need some sort of ethics approval to capture it for academic purposes [if not, I feel you should require this approval and, to open a can of worms, I think the AoIR guidelines should reflect this].
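For readers who want to automate the robots.txt check described above, here is a minimal sketch using Python's standard `urllib.robotparser` module. The sample rules, the `is_allowed` helper, the `MyResearchBot` user-agent string, and the example.com URLs are all hypothetical and chosen only for illustration. The sketch parses rules from a string so that it runs offline; against a live site one would normally point the parser at the real file with `set_url()` and `read()` instead.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, for illustration only.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /search
Allow: /about
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given rules permit user_agent to fetch url."""
    parser = RobotFileParser()
    # parse() accepts the file's lines, so no network access is needed here.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    # Hypothetical crawler name and URLs.
    for url in (
        "http://example.com/search?q=x",
        "http://example.com/about",
        "http://example.com/index.html",
    ):
        verdict = "allowed" if is_allowed(SAMPLE_ROBOTS_TXT, "MyResearchBot", url) else "disallowed"
        print(url, verdict)
```

A check like this only covers the machine-readable rules; it says nothing about what a site's written terms permit.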
wget actually has an option to ignore "robots.txt" and even an option to pose as IE or any other browser for that matter (as do HTTrack and WebCopier ;-) ). I personally wouldn't have any problem activating that option. But that's a political decision that I feel is better made by national or supranational polities than by a voluntary association such as AoIR.
Those who attended Gove Allen's presentation on academic use of automated data retrieval 'bots at the Toronto conference will recall that this is both a legal and an ethical problem. Gove, Charles Ess, Gordon Davis, and I have papers on both aspects forthcoming. Given Charles' involvement in the project, I know that there is a very keen interest in having the AoIR guidelines at some point, probably not too far in the future, address at least the ethical side of the problem.

Oh, and don't just check the robots.txt file -- probably better check any written TOS pages as well. The machine-readable and the human-readable prohibitions aren't always congruent.

Dan L. Burk
Visiting Professor
Cornell Law School
Myron Taylor Hall
Ithaca, NY 14853 USA

Oppenheimer, Wolff & Donnelly Professor
University of Minnesota Law School
229 19th Avenue South
Minneapolis, MN 55455 USA
***************************************
Voice: 612-626-8726
Fax: 612-625-2011