Re: [Air-l] SW to store webpages

8 Jun 2005

      On 8 Jun 2005, Alex Halavais wrote:
...
I guess I will weigh in on the liberal collecting side here:
COPYRIGHT
Agreeing with Dan's post, suggesting that fair use is about as far
from clear cut as possible, it strikes me that there is potential harm
in being over-cautious on this. I guess I am living (ever so slightly)
dangerously when I make private copies of websites as part of my
research, but I also think it is necessary, fair, and just. I would
hate to see us as a community give up our rights to make use of
copyrighted material. Our research does (or at least ought to) serve
the public welfare, and we should assert our rights to use copyrighted
works appropriately. By shying away from this, I think we adversely
affect the future enforcement of copyright.

This is a sentiment that I heartily endorse, and is largely the rationale
behind Laura Gurak's reliance on fair use in her book "Cyberliteracy,"
which Kirsten Foot points out.  However, I think we have to recognize that
Laura was graced with a publisher who was willing to back her principled
stance -- many publishers, not to mention research institutions, will tend
to be more risk averse (or, if you prefer, cowardly) and do not necessarily
have the interests of the research community, at least not the long term
interests, in mind.

Having said that, I think we also need to recognize that there is something
of a chicken-and-egg problem involved; publishers and universities are more
likely to back the fair use of authors and researchers if we demand that
they do so.  Consequently, someone has to be willing to take the initial
risk of asserting fair use.  I would hope that the research community would
be willing to step forward and take that initial risk, but it's necessary
to understand the potential downside before doing so.
...
ROBOTS.TXT / TOS
There needs to be some balance here in terms of coverage. While I may
be in the minority, I don't think that it is vital that robots.txt
*always* be followed. (For practical reasons, especially with dynamic
sites, it may be a good idea, but I don't think it is an absolute.) If
my robot behaves in such a way that it is indistinguishable from a
gaggle of humans loading stuff on their browsers and saving, then I
see no reason I shouldn't be able to use a robot.
Most robots.txt prohibitions exist because web authors are looking for
a way to shape the search engines' report of their site. They have not
predicted the use of the content by researchers.
On the other hand, some robots (built-in IE engine mentioned above,
iirc, as well as Acrobat) seem to behave not at all like a human,
delivering a huge number of simultaneous or sequential requests,
without appropriate delays. Here, the harm (or potential harm) is much
clearer.

This is a rather different kettle of fish, as the problem here arises not
in the context of copyright, but under a theory of trespass to computers --
in essence, under the theory that contact with the server in some manner
not countenanced by the proprietor of the server interferes with her
physical property, that is, with the server itself.  (It is worth noting
that this claim has, so far as I can tell, yet to appear anywhere outside
the U.S., although there is the potential for it to do so in the E.U.
member states and of course in other common law countries.)

In any event, here the fair use argument disappears, as trespass law has no
such doctrine.  The claim does require some type of harm, but courts have
mostly been willing to recognize the unwanted contact *itself* as a harm. 
The robots.txt and TOS files serve as an indicator of what type of contact
is agreeable to the web site owner; unwanted contact will tend to
constitute "harm."  Nor is the dividing line one between human contact and
equivalent robotic contact, as unwanted human contact may equally well
constitute trespass (as an example from a recent case: an airline
reservation web site TOS stating that individual customers are welcome to
search the site but travel agents are not).
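
For anyone who wants to see what honoring a site's robots.txt and spacing
out requests looks like in practice, here is a minimal sketch in Python
using the standard library's urllib.robotparser. The agent name
"research-archiver", the helper names, and the 5-second default delay are
all assumptions made up for illustration. The fetch function is passed in
rather than hard-coded so the logic can be tried without touching anyone's
server. Nothing here speaks to TOS restrictions, which a crawler cannot
read for itself.

```python
import time
import urllib.robotparser

# Hypothetical agent name, chosen only for this illustration.
USER_AGENT = "research-archiver"


def load_rules(robots_lines):
    """Build a parser from the lines of an already-retrieved robots.txt."""
    parser = urllib.robotparser.RobotFileParser()
    parser.parse(robots_lines)
    return parser


def crawl(parser, urls, fetch, sleep=time.sleep, default_delay=5.0):
    """Fetch only the URLs robots.txt permits, pausing between requests.

    `fetch` is any callable that takes a URL and returns its content.
    `default_delay` (seconds) is an assumed fallback used when the site
    states no Crawl-delay of its own.
    """
    delay = parser.crawl_delay(USER_AGENT)
    if delay is None:
        delay = default_delay

    pages = {}
    for url in urls:
        if not parser.can_fetch(USER_AGENT, url):
            continue  # the site owner has asked robots to stay out
        if pages:
            sleep(delay)  # sequential requests, one at a time, with a pause
        pages[url] = fetch(url)
    return pages
```

A crawler built this way sends one request at a time with a pause between
them, which is closer to the "gaggle of humans" pattern than to the burst
of simultaneous requests described above. Whether skipping the robots.txt
check is acceptable is, of course, exactly the question under discussion;
the sketch only shows what compliance would involve.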

And there is, again, the question of research ethics, quite apart from the
legality.  Your reasoning above is based largely on a balancing of harms,
but it may be that the question is one of informed consent to the research,
regardless of harm.  I tend to think that this is the wrong approach, as
most web site "scraping" is unlikely to implicate any individual autonomy
or personal interests.  However, reasonable minds could differ.  Charles
Ess points out that under a European, particularly Scandinavian, approach,
the answer may be that informed consent is necessary.

Dan L. Burk
Visiting Professor
Cornell Law School
Myron Taylor Hall
Ithaca, NY 14853 USA

Oppenheimer, Wolff & Donnelly Professor
University of Minnesota Law School
229 19th Avenue South
Minneapolis, MN 55455 USA
***************************************
Voice: 612-626-8726
Fax: 612-625-2011
