On 8 Jun 2005, Alex Halavais wrote:
I guess I will weigh in on the liberal collecting side here:
COPYRIGHT
Agreeing with Dan's post, suggesting that fair use is about as far from clear cut as possible, it strikes me that there is potential harm in being over-cautious on this. I guess I am living (ever so slightly) dangerously when I make private copies of websites as part of my research, but I also think it is necessary, fair, and just. I would hate to see us as a community give up our rights to make use of copyrighted material. Our research does (or at least ought to) serve the public welfare, and we should assert our rights to use copyrighted works appropriately. By shying away from this, I think we adversely affect the future enforcement of copyright.
This is a sentiment that I heartily endorse, and it is largely the rationale behind Laura Gurak's reliance on fair use in her book "Cyberliteracy," which Kirsten Foot points out. However, I think we have to recognize that Laura was graced with a publisher who was willing to back her principled stance -- many publishers, not to mention research institutions, will tend to be more risk-averse (or, if you prefer, cowardly) and do not necessarily have the interests of the research community, at least not the long-term interests, in mind. Having said that, I think we also need to recognize that there is something of a chicken-and-egg problem involved; publishers and universities are more likely to back the fair use of authors and researchers if we demand that they do so. Consequently, someone has to be willing to take the initial risk of asserting fair use. I would hope that the research community would be willing to step forward and take that initial risk, but it's necessary to understand the potential downside before doing so.
ROBOTS.TXT / TOS
There needs to be some balance here in terms of coverage. While I may be in the minority, I don't think that it is vital that robots.txt *always* be followed. (For practical reasons, especially with dynamic sites, it may be a good idea, but I don't think it is an absolute.) If my robot behaves in such a way that it is indistinguishable from a gaggle of humans loading stuff on their browsers and saving, then I see no reason I shouldn't be able to use a robot.
Most robots.txt prohibitions exist because web authors are looking for a way to shape the search engines' report of their site. They have not predicted the use of the content by researchers.
On the other hand, some robots (built-in IE engine mentioned above, iirc, as well as Acrobat) seem to behave not at all like a human, delivering a huge number of simultaneous or sequential requests, without appropriate delays. Here, the harm (or potential harm) is much clearer.
This is a rather different kettle of fish, as the problem here arises not in the context of copyright, but under a theory of trespass to computers -- in essence, under the theory that contact with the server in some manner not countenanced by the proprietor of the server interferes with her physical property, that is, with the server itself. (It is worth noting that this claim has, so far as I can tell, yet to appear anywhere outside the U.S., although there is the potential for it to do so in the E.U. member states and of course in other common law countries.) In any event, here the fair use argument disappears, as trespass law has no such doctrine. The claim does require some type of harm, but courts have mostly been willing to recognize the unwanted contact *itself* as a harm. The robots.txt and TOS files serve as an indicator of what type of contact is agreeable to the web site owner; unwanted contact will tend to constitute "harm." Nor does the dividing line fall between human contact and equivalent robotic contact, as unwanted human contact may equally well constitute trespass (for example, in a recent case, an airline reservation web site's TOS stated that individual customers were welcome to search the site but travel agents were not).
And there is, again, the question of research ethics, quite apart from the legality. Your reasoning above is based largely on a balancing of harms, but it may be that the question is one of informed consent to the research, regardless of harm. I tend to think that this is the wrong approach, as most web site "scraping" is unlikely to implicate any individual autonomy or personal interests. However, reasonable minds could differ. Charles Ess points out that under a European, particularly Scandinavian, approach, the answer may be that informed consent is necessary.
Dan L. Burk
Visiting Professor
Cornell Law School
Myron Taylor Hall
Ithaca, NY 14853 USA
Oppenheimer, Wolff & Donnelly Professor
University of Minnesota Law School
229 19th Avenue South
Minneapolis, MN 55455 USA
***************************************
Voice: 612-626-8726
Fax: 612-625-2011